Generate X unique similarity values for X fingerprints with respect to all molecules in the dataset. #1
Hi, thanks for your interest in our method! You are correct: if you use the code in its original form, you'll only get a single number for the whole set (i.e., the extended similarity of the set). I'd recommend looking at a related idea that we proposed recently, the complementary similarity (first described here: https://pubs.rsc.org/en/content/articlehtml/2021/cp/d1cp04019g). Basically, this will give you a single number for each fingerprint, which gives you an idea of how each object relates to the rest of the original set. We have shown that this can be used to distinguish between "medoid"-like structures and "outlier"-like structures, so it could provide information about different subsets in the data. Let us know if you need help with this, or if you want to talk a bit more we can set up a Zoom meeting.
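As a rough illustration of the idea above, here is a minimal sketch of a per-fingerprint complementary similarity: for each object, it is the similarity of the set that remains once that object is removed. The function name `complementary_similarity` is hypothetical and is not part of this repository's API. The set similarity used here is a simple ratio-of-sums continuous Tanimoto computed from column sums, chosen as a stand-in; the index implemented in the repository and in the paper may differ. Inputs are assumed to be non-negative, normalised values.

```python
import numpy as np


def complementary_similarity(X):
    """Similarity of the set with each row removed, one value per row.

    Hypothetical helper, not the repository's API. Uses a ratio-of-sums
    continuous Tanimoto over all pairs as a stand-in set similarity.
    X: (n_objects, n_features) array of normalised values, n_objects >= 3.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    col_sum = X.sum(axis=0)        # per-feature sum over all objects
    col_sq = (X ** 2).sum(axis=0)  # per-feature sum of squares

    # Column statistics of the set with row i removed (broadcast over rows).
    rest_sum = col_sum[None, :] - X
    rest_sq = col_sq[None, :] - X ** 2

    # Sum over all pairs (a, b) in the remaining set of <x_a, x_b>.
    inner = 0.5 * np.sum(rest_sum ** 2 - rest_sq, axis=1)
    # Sum over the same pairs of |x_a|^2 + |x_b|^2 - <x_a, x_b>;
    # each remaining object appears in (n - 2) pairs.
    denom = (n - 2) * rest_sq.sum(axis=1) - inner
    return inner / denom


# Toy data: two identical fingerprints and one orthogonal "outlier".
X_demo = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(complementary_similarity(X_demo))  # [0. 0. 1.]
```

In this toy case, removing the outlier leaves a perfectly similar set (value 1), so high values flag outlier-like objects and low values flag medoid-like ones. The cost is linear in the number of objects because only column sums are needed, not pairwise comparisons.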
Thanks for the quick reply! Okay from that paper, I see three ways of obtaining element-wise similarities:
From what I recall, [1] and [2] give the same degree of information, but [2] has better scaling. I think method [1] will suffice for now, and I can use method [2] if I run into computational cost issues. Thanks!
I do need a pointer on the code.
Fingerprints is a 2D array of 4026 molecules, each with a continuous scalar fingerprint of 2500 elements, which I then normalise. With this, I am trying to get the continuous extended similarity metric for all fingerprints in my data set. Does this work as intended for continuous data? I.e., does it automatically use the continuous JT metric, or do I need to code my own function for the cJT explicitly? I'm a bit confused by the notation in Table 1.
No problem, it's a pleasure to help! You are right: [1] and [2] are equivalent, with [2] scaling much better. I'd recommend going with [2]. While the improved performance might not be noticeable for sets of ~10^4 molecules, ideally you'd want to have everything set up in case you need to handle much bigger sets. As for [3], we've never tested it, but my intuition is that it would probably give relatively similar results to [1] and [2]. What we noticed with [2] in the paper I mentioned before is that it can cleanly separate your set into different areas depending on how "well-organized" they are (see Fig. 5 in the paper, the pre-clustering step). This can be extremely helpful, and it is an idea we've been playing with a lot in new chemical space visualization techniques.
If you just normalize and then calculate the extended similarity on top of that, it'll be equivalent to Variant 3 in this paper: https://link.springer.com/article/10.1007/s10822-022-00444-7
Hi,
Say I have a set of 10,000 atoms, each one with a fingerprint of 1000 continuous (normalised) scalar values to describe it. Can I use this software to generate 10,000 scalar values, one for each atom, that represent the similarity of the respective fingerprint against all other fingerprints, or against some arbitrary reference, simultaneously?
I've been playing with the code, but from my understanding it only generates a single scalar value to show the similarity of the dataset as a whole? I've gotten a bit lost!
Basically, I have used N-body Iteratively Contracted Equivariants to build up representations of the local atomic environments for all of the atoms in a set of 4000 organic molecules. A representation for a single atom can consist of many continuous scalar values (let's just say 10,000 atoms with 1000 elements in each atomic 'fingerprint' for the sake of argument). I can treat these like fingerprints, but I don't want a pairwise comparison. I want to apply some similarity metric that compares these representations and returns an array of 'similarity' scores, one for each fingerprint. Then I can plot a heatmap like the one below, where the phi metric on the colourbar scale has been replaced by the 'similarity of atomic environment'.
Obviously, I could just take the sum of all 1000 elements per atom and use that, but surely there is some sort of similarity metric that does a better job.
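For the "arbitrary reference" case mentioned above, a minimal sketch of one possible baseline is the continuous Tanimoto similarity of every fingerprint to a single reference vector, which returns one score per atom without any pairwise loop. This is a generic formula, not this repository's method; `tanimoto_to_reference` is a hypothetical name, and using the mean fingerprint as the reference is only an assumption for illustration.

```python
import numpy as np


def tanimoto_to_reference(X, ref):
    """Continuous Tanimoto of each row of X against one reference vector.

    Hypothetical helper, not part of the repository.
    T(x, r) = <x, r> / (|x|^2 + |r|^2 - <x, r>)
    Undefined (division by zero) only if a row and ref are both all zeros.
    """
    X = np.asarray(X, dtype=float)
    ref = np.asarray(ref, dtype=float)
    dot = X @ ref                                   # <x_i, r> for every row
    denom = (X ** 2).sum(axis=1) + ref @ ref - dot  # |x_i|^2 + |r|^2 - <x_i, r>
    return dot / denom


# Example: random normalised "atomic fingerprints", mean vector as reference.
rng = np.random.default_rng(0)
fingerprints = rng.random((10, 5))  # stand-in for the real (n_atoms, n_features) array
scores = tanimoto_to_reference(fingerprints, fingerprints.mean(axis=0))
print(scores.shape)  # (10,)
```

The resulting array has one value per atom and could be used directly as the colour scale in a heatmap.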