Voronoi-tessellation-based features (i.e., de Jong, Isayev, Ward) #91
Comments
(maybe something for @Qi-max to contribute?)
I'm gradually working on this issue. If anyone else is working on it or wants to, let me know and we can figure out how to combine efforts.
Hi Logan, I am not working on it right now. Please tell me if there is anything you want me to work on.
Thanks, @Qi-max! I'm about halfway done with the features from the PRB article, which will cover most of the Sci Rep features. So, it is probably easiest for me to finish those two papers myself. Would you want to start on the Nat. Comm features? Also, the areas that I haven't figured out yet:
Any thoughts about these issues?
Hi Logan, thanks for the reply! For 1, I think adding a method to the NearNeighbor classes to get the 2nd-NN is a good way to go. I am only a bit concerned about performance, as it may involve some repeated calculation of the 1st-NN in real applications. For example, if we want the 2nd-NN of each atom in the structure and we have already calculated the 1st-NN of all atoms (which I think is the more common case), we can just combine the 1st-NN lists of each atom's 1st-NN atoms to get its 2nd-NN atoms, without further calculation. This requires solving the caching issues. Still, I think adding a standard way to get the 2nd-NN is quite useful. Besides, I have considered adding a new site featurizer that "coarse-grains" any site fingerprint over the neighbors (e.g., by taking the 'mean', 'std_dev', 'minimum' or 'maximum') to serve as site features. This featurizer could also take the 2nd-NN neighboring effect into account. The user gives the featurizer a type of site fingerprint and a neighbor-finding algorithm (including the Voronoi method), and the featurizer returns a coarse-grained version of that site fingerprint. This works similarly to the SiteStatsFingerprint, except that it produces site features rather than structure features. But I am also quite concerned about performance, especially if the neighbor finding is computationally expensive, as you have mentioned. I have also tested some matminer featurizers on disordered systems that contain tens or hundreds of thousands of atoms and found that featurizing indeed takes quite a long time. For 2 and 3 (and 1), I totally agree that it would be great if we could also share results among featurizers, in addition to within a single featurizer. This would solve many issues and save a lot of computation time. Some steps, e.g., getting the atoms within a cutoff or getting the neighbor list, are indeed called frequently by a number of featurizers. I think perhaps methods like
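The idea of building 2nd-NN lists from already-computed 1st-NN lists can be sketched as follows. This is only a minimal illustration, not the matminer or pymatgen implementation: the function name `second_nearest_neighbors` and the plain dict-of-lists input are hypothetical, periodic images are ignored, and excluding the site itself and its first neighbors from the second shell is an assumed convention.

```python
from typing import Dict, List, Set


def second_nearest_neighbors(first_nn: Dict[int, List[int]]) -> Dict[int, Set[int]]:
    """Combine precomputed 1st-NN lists into 2nd-NN sets (hypothetical helper).

    `first_nn` maps each site index to the indices of its first neighbors.
    No neighbor search is repeated here: only the existing lists are merged.
    """
    second_nn: Dict[int, Set[int]] = {}
    for site, neighbors in first_nn.items():
        candidates: Set[int] = set()
        for nbr in neighbors:
            # Every first neighbor of a first neighbor is a candidate.
            candidates.update(first_nn[nbr])
        # Assumed convention: drop the site itself and its own first shell.
        candidates.discard(site)
        candidates.difference_update(neighbors)
        second_nn[site] = candidates
    return second_nn


# Toy example: a linear chain of four sites, 0 - 1 - 2 - 3.
first_nn = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
second_nn = second_nearest_neighbors(first_nn)
```

For the chain above, site 0 gets site 2 as its only second neighbor, and site 1 gets site 3. The cost is a set merge per site, so the expensive part stays in the one-time 1st-NN calculation.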
2nd NN determination: I think you make a good point about avoiding repeatedly calculating the first nearest neighbors. I'll start working on that code soon, and I'll tag you on the PR in case you want to critique it. Neighboring Site Featurizer: Something like SiteStatsFeaturizer could work for that kind of feature. I'll keep that use case in mind when writing the 2nd NN code, so that you can use it effectively. Caching: I like your idea. After thinking some more, here are the options I'm considering.
I lean toward option 1. I don't think using a utility operation is a big deal and, while I recognize implementing
Hi Logan, I like your implementation of Nth nearest neighbors and the additional Voronoi weighting methods! For caching, I also prefer option 1, and I think your points about a new utility function and implementing
Thanks, @Qi-max! I did some performance testing of this implementation and found that it takes about 30s to compute the features of 10 structures. It turns out that about half of the featurization time is spent computing the tessellation, so I implemented the LRU caching mechanism (Option 1). I automatically cache the NN for all atoms in the structure, so we might want to add an option to avoid caching (as you suggested). If curious: https://github.com/WardLT/matminer/tree/faster_prb After adding the caching, we are down to about 22s for the same 10 structures. The average over 100 structures is about 1.8s/structure. We likely have some room to improve performance further, as I can do about 0.1s/structure for these features with Magpie. My current plan is to optimize the Voronoi/near-neighbor-finding routines in pymatgen to improve performance further. If I can get below 1s per structure, I'd be pretty happy.
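For readers unfamiliar with the pattern, here is a minimal sketch of caching the neighbors of every site in a structure behind an LRU cache, using Python's standard `functools.lru_cache`. It is not the code in the linked branch. The names `all_neighbors` and `site_neighbors` are hypothetical, a tuple of 1D positions stands in for a structure, and a simple distance cutoff stands in for the Voronoi tessellation. A real implementation would also need a hashable key for the structure, which is assumed away here.

```python
from functools import lru_cache
from typing import Dict, Tuple

Positions = Tuple[float, ...]  # hashable stand-in for a structure (assumption)

n_tessellations = 0  # counts how often the expensive step actually runs


@lru_cache(maxsize=1)
def all_neighbors(positions: Positions, cutoff: float) -> Dict[int, Tuple[int, ...]]:
    """Stand-in for the expensive step: find the neighbors of every site at once."""
    global n_tessellations
    n_tessellations += 1
    return {
        i: tuple(
            j for j, xj in enumerate(positions)
            if j != i and abs(xj - xi) <= cutoff
        )
        for i, xi in enumerate(positions)
    }


def site_neighbors(positions: Positions, site: int, cutoff: float = 1.1) -> Tuple[int, ...]:
    """Per-site lookup that reuses the cached whole-structure result."""
    return all_neighbors(positions, cutoff)[site]


# Four sites on a line; each per-site query hits the same cached result.
chain = (0.0, 1.0, 2.0, 3.0)
neighbor_lists = [site_neighbors(chain, i) for i in range(len(chain))]
```

After the loop, `n_tessellations` is 1 even though four sites were queried. Calling `all_neighbors.cache_clear()` would be one simple way to give users control over the cached data.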
Well, I got it below 1s per structure (latest is 0.8s). I'll open a PR if/when my changes to pymatgen get approved. We can now featurize the FLLA dataset in ~50 minutes.
All, sorry I haven't had a chance to review this thread in detail (and likely might not be able to for a bit, so feel free to continue w/o me). My suggestion based just on skimming is to look into some of the "memoize" decorators in Python to see if you can avoid recomputing. Not sure how this compares with the current solution
Also an LRI cache might(?) be faster than an LRU one and be similar for our use case. See:
The LRI caches do seem like a good option. Especially for the site featurizers which will call For our current use case - structure featurizers - the LRU cache is sufficient. In fact, the access pattern for structure-based featurizers is such that a cache size of 1 will achieve the minimum number of cache misses. I'll ask around for a site featurizer use case to benchmark different caching strategies.
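The claim about a cache size of 1 can be checked with a small simulation. This is only a sketch under assumptions: `count_misses` is a hypothetical helper, integer IDs stand in for structures, and the "interleaved" pattern is an invented example of an access order that revisits structures, not a measured site-featurizer workload.

```python
from functools import lru_cache
from typing import Iterable


def count_misses(access_pattern: Iterable[int], maxsize: int) -> int:
    """Replay a sequence of structure IDs through an LRU cache and count misses."""

    @lru_cache(maxsize=maxsize)
    def tessellate(structure_id: int) -> int:
        return structure_id  # stand-in for the expensive tessellation

    for structure_id in access_pattern:
        tessellate(structure_id)
    return tessellate.cache_info().misses


# Structure featurizer pattern: all four requests for one structure arrive
# back to back before moving on to the next of five structures.
structure_pattern = [sid for sid in range(5) for _ in range(4)]

# Hypothetical interleaved pattern: cycle through all five structures four times.
interleaved_pattern = [sid for _ in range(4) for sid in range(5)]

misses = {
    ("structure", 1): count_misses(structure_pattern, maxsize=1),
    ("structure", 128): count_misses(structure_pattern, maxsize=128),
    ("interleaved", 1): count_misses(interleaved_pattern, maxsize=1),
    ("interleaved", 128): count_misses(interleaved_pattern, maxsize=128),
}
```

With the back-to-back pattern, both cache sizes miss once per structure (5 misses), which is the minimum possible. With the interleaved pattern, a size-1 cache misses on all 20 requests while the larger cache still misses only 5 times, which is why a different access pattern would be worth benchmarking separately.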
[Sorry for the slow response] I don't think this issue is quite closed. There is still a little work to go for the de Jong features (e.g., adding the variance of neighbor properties, building presets and examples), and a bit for the Isayev features (we might need to extend the bond fragment work).
superseded by #635
talk to @WardLT if interested
https://www.nature.com/articles/srep34256
https://www.nature.com/articles/ncomms15679
https://journals.aps.org/prb/abstract/10.1103/PhysRevB.96.024104