Voronoi-tessellation-based features (i.e., de Jong, Isayev, Ward) #91

Closed
computron opened this issue Dec 15, 2017 · 15 comments
@computron
Contributor

talk to @WardLT if interested

https://www.nature.com/articles/srep34256
https://www.nature.com/articles/ncomms15679
https://journals.aps.org/prb/abstract/10.1103/PhysRevB.96.024104

@computron
Contributor Author

(maybe something for @Qi-max to contribute?)

@WardLT
Contributor

WardLT commented Mar 23, 2018

I'm gradually working on this issue. If anyone else is working on it or wants to, let me know and we can figure out how to combine efforts.

@Qi-max
Contributor

Qi-max commented Mar 23, 2018

Hi Logan, I am not working on it at the moment. Please let me know if there is anything you would like me to work on.

@WardLT
Contributor

WardLT commented Mar 23, 2018

Thanks, @Qi-max! I'm about halfway done with the features from the PRB article, which will cover most of the Sci Rep features. So, it is probably easiest for me to finish those two papers myself.

Would you want to start on the Nat. Comm features?

Also, here are the areas I haven't figured out yet:

  1. Getting 2nd-nearest neighbors from the NearNeighbor pymatgen classes, which are needed for some of the PRB attributes. I was thinking of adding a method to the NearNeighbor classes for this, but am not sure whether that has already been done or whether there is an easier way.
  2. Whether we can/should combine these features with your existing VoronoiFingerprint class. I am in favor of keeping them as separate featurizers to avoid having classes with too many options.
  3. How to cache the Voronoi tessellation if it is used by multiple featurizers. When implementing the PRB features in Magpie, I found that running the tessellation accounted for a significant fraction of the featurization time. While I'm not doing anything about caching now, I'm worried that making many different featurizers (see 2) will exacerbate any performance issues.

Any thoughts about these issues?

@Qi-max
Contributor

Qi-max commented Mar 23, 2018

Hi Logan,

Thanks for the reply!

For 1, I think adding an additional method to the NearNeighbor classes to get the 2nd-NN is a good way to go. I am only a bit concerned about the performance, as it may include some repeated calculation of the 1st-NN in real applications. E.g., if we want to get the 2nd-NN of each atom in the structure and we have already calculated the 1st-NN of all atoms (which I think is the more common case), we can just combine the 1st-NN of each atom's 1st-NN atoms to get its 2nd-NN atoms, without further calculation. This requires solving the caching issues. But still, I think adding a standard way to get the 2nd-NN is quite useful.
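
A rough sketch of the neighbor-of-neighbor idea (illustrative only; it assumes pymatgen's get_nn_info returns a 'site_index' for each neighbor, and it glosses over periodic images, which real code would need to track):

```python
from pymatgen.analysis.local_env import VoronoiNN

def second_shell_indices(structure, n, nn=None):
    """Collect 2nd-shell site indices by reusing the 1st-NN of each 1st-NN."""
    nn = nn or VoronoiNN()
    first = {info['site_index'] for info in nn.get_nn_info(structure, n)}
    second = set()
    for i in first:
        # These per-site calls are where cached 1st-NN results would pay off
        second |= {info['site_index'] for info in nn.get_nn_info(structure, i)}
    # Exclude the central atom and the first shell itself
    return second - first - {n}
```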

Besides, I have considered adding a new site featurizer that allows "coarse-graining" (e.g., taking the 'mean', 'std_dev', 'minimum' or 'maximum') of any site fingerprint over a site's neighbors, to serve as site features. This featurizer could also take the 2nd-NN neighboring effect into account. The user would give the featurizer a type of site fingerprint and a neighbor-finding algorithm (including the Voronoi method), and the featurizer would return a coarse-grained version of that site fingerprint. It would work similarly to SiteStatsFingerprint, except that it produces site features rather than structure features. But I am also quite concerned about the performance, especially if the neighbor finding is computationally expensive, as you mentioned. I have also tested some matminer featurizers on disordered systems containing tens or hundreds of thousands of atoms, and featurization indeed takes quite a long time.
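
Roughly what I am imagining (a sketch only; the class and option names are hypothetical, not an actual matminer API):

```python
import numpy as np
from pymatgen.analysis.local_env import VoronoiNN

class NeighborStatsSiteFingerprint:
    """Coarse-grain a base site fingerprint over each site's neighbors."""

    def __init__(self, base, nn=None, stats=('mean', 'std_dev')):
        self.base = base             # any site featurizer with featurize(struct, n)
        self.nn = nn or VoronoiNN()  # user-chosen neighbor-finding algorithm
        self.stats = stats

    def featurize(self, structure, n):
        # Evaluate the base fingerprint on every neighbor of site n ...
        idx = [info['site_index'] for info in self.nn.get_nn_info(structure, n)]
        feats = np.array([self.base.featurize(structure, i) for i in idx])
        # ... then reduce with the requested statistics
        out = []
        if 'mean' in self.stats:
            out.extend(feats.mean(axis=0))
        if 'std_dev' in self.stats:
            out.extend(feats.std(axis=0))
        return out
```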

For 2 and 3 (and 1), I totally agree that it would be great if we could also share results among featurizers, in addition to within a single featurizer. This would solve many issues and save a lot of computation time. Some steps, e.g., getting the atoms within a cutoff, getting the neighbor list, etc., are indeed frequently called by a number of featurizers. I think methods like joblib.Memory (which caches a function's return values each time it is called with the same input arguments, and is also used in sklearn) might be useful? Or we could even create some tmp db files in a standard way (probably less preferable). How do you feel about this?
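
For example (a sketch; the wrapped function is just a placeholder for any expensive neighbor-finding step):

```python
from joblib import Memory
from pymatgen.analysis.local_env import VoronoiNN

# Identical calls are served from a disk cache instead of being recomputed
memory = Memory(location='/tmp/matminer_cache', verbose=0)

@memory.cache
def all_voronoi_nn(structure):
    nn = VoronoiNN()
    return [nn.get_nn_info(structure, i) for i in range(len(structure))]
```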

@WardLT
Contributor

WardLT commented Mar 25, 2018

2nd NN determination: I think you make a good point about avoiding repeatedly calculating the first nearest neighbors. I'll start working on that code soon, and I'll tag you on the PR in case you want to critique it.

Neighboring Site Featurizer: Something like SiteStatsFingerprint could work for that kind of feature. I'll keep that use case in mind when writing the 2nd NN code, so that you can use it effectively.

Caching: I like your idea. After thinking some more, here are the options I'm considering.

  1. Introduce a utility function in matminer that takes the NearNeighbors class, structure, and site number as input and uses lru_cache/joblib.Memory to serve repeated calls. We would call that rather than the NearNeighbors operations directly.
  • Multiple featurizers would call the same utility function. And, if someone uses MultiFeaturizer, the access pattern will result in plenty of cache hits.
  • Problem: Not sure how burdensome calling the utility function rather than pymatgen directly would be.
  • Problem: lru_cache will not natively recognize two different instantiations of the same NearNeighbor class as equal. To do so, we'd need to implement __eq__ and __hash__ for each NearNeighbor class.
  2. Store the NN information as an attribute of the Structure object, and have our featurizers check for the presence of that data before re-computing it. (I'd also write a utility function for this.)
  • This is how I cache tessellations in Magpie.
  • Problem: Could run into memory problems for larger datasets (I did with Magpie). We would need to write code that clears out the NN attributes once they are no longer useful.

I lean toward option 1. I don't think using a utility operation is a big deal and, while I recognize that implementing __eq__ and __hash__ introduces a vector for difficult-to-debug errors, I do not see it as too burdensome to police. Any thoughts/other ideas?
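
For concreteness, option 1 might look something like this (illustrative names, not a final API). Note that lru_cache hashes its arguments, so both the NearNeighbors object and the structure would have to be hashable, which is exactly why __eq__ and __hash__ come up:

```python
from functools import lru_cache

@lru_cache(maxsize=16)
def get_all_nn_info(nn_method, structure):
    """Return (or fetch from the cache) the NN info for every site."""
    return tuple(nn_method.get_nn_info(structure, i)
                 for i in range(len(structure)))

# Each NearNeighbors class would then need something like:
#     def __eq__(self, other):
#         return type(self) is type(other) and self.__dict__ == other.__dict__
#     def __hash__(self):
#         return hash((type(self).__name__,
#                      tuple(sorted(self.__dict__.items()))))
```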

@Qi-max
Contributor

Qi-max commented Mar 28, 2018

Hi Logan,

I like your implementation of Nth-nearest neighbors and the additional Voronoi weighting methods! For caching, I also prefer option 1, and I think your points about a new utility function and implementing __eq__ and __hash__ in the related classes are quite reasonable. I think it is worth doing, despite the extra complexity, but we should at least leave an option (perhaps defaulting to False) for users to cache some computationally expensive calls, which can be beneficial when they are dealing with quite large datasets.

@WardLT
Contributor

WardLT commented Apr 1, 2018

Thanks, @Qi-max !

I did some performance testing of this implementation and found that it takes about 30s to compute the features of 10 structures. It turns out that about half of the featurization time is spent computing the tessellation, so I implemented the LRU caching mechanism (option 1). I automatically cache the NN for all atoms in the structure, so we might want to add an option to disable caching (as you suggested). If curious: https://github.com/WardLT/matminer/tree/faster_prb

After adding the caching, we are down to about 22s for the same 10 structures. The average over 100 structures is about 1.8s/structure. We likely have some room to improve performance further, as I can get about 0.1s/structure for these features with Magpie.

My current plan is to try to optimize the Voronoi/near-neighbor-finding routines in pymatgen to improve performance further. If I can get faster than 1s per structure, I'd be pretty happy.

@WardLT
Contributor

WardLT commented Apr 2, 2018

Well, I got it below 1s per structure (latest is 0.8s). I'll open a PR if/when my changes to pymatgen get approved. We can now featurize the FLLA dataset in ~50 minutes.

@computron
Contributor Author

All, sorry I haven't had a chance to review this thread in detail (and likely won't be able to for a bit, so feel free to continue w/o me).

My suggestion, based just on skimming, is to look into some of the "memoize" decorators in Python to see if you can avoid recomputing. Not sure how this compares with the current solution.

@computron
Contributor Author

Also, an LRI cache might(?) be faster than an LRU one and behave similarly for our use case. See:

http://boltons.readthedocs.io/en/latest/cacheutils.html
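
For instance, a minimal sketch with boltons (the cached function is a stand-in for any expensive NN computation, and it assumes the structure argument is hashable):

```python
from boltons.cacheutils import LRI, cached
from pymatgen.analysis.local_env import VoronoiNN

# LRI evicts by insertion order, skipping LRU's recency bookkeeping
nn_cache = LRI(max_size=128)

@cached(nn_cache)
def nn_info_for(structure):
    nn = VoronoiNN()
    return [nn.get_nn_info(structure, i) for i in range(len(structure))]
```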

@WardLT
Contributor

WardLT commented Apr 8, 2018

The LRI caches do seem like a good option. Especially for the site featurizers, which will call get_neighbors frequently, we might benefit from a faster caching mechanism.

For our current use case - structure featurizers - the LRU cache is sufficient. In fact, the access pattern for structure-based featurizers is such that a cache size of 1 will achieve the minimum number of cache misses (structures are featurized one at a time, so every per-site call hits the most recently cached structure). I'll ask around for a site featurizer use case to benchmark different caching strategies.

@computron
Contributor Author

@WardLT @Qi-max should this issue be considered closed?

@WardLT
Contributor

WardLT commented Jun 7, 2018

[Sorry for the slow response]

I don't think this issue is quite closed. There is still a little work to go for the de Jong features (e.g., adding the variance of neighbor properties, building presets and examples), and a bit for the Isayev features (we might need to extend the bond fragment work).

@ardunn
Contributor

ardunn commented Jun 7, 2021

superseded by #635
