Scrape some sort of physical properties database #2
Comments
I'm actually not even sure at all what the heck the |
I guess what I want is basically the totality of [insert chemical database here], accessible via |
Water seems to be the weight per volume of water, and it is used to define mmH2O by multiplying it with length. Confusingly, "mercury" and "Hg" actually refer to different things. I would like to source data from a chemical database, but I haven't really been able to find one which either lets me download the data or has an API. However, whether or not I have such a database I should still be able to resolve this problem by introducing an explicit notion of substances with multiple properties. |
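One way to picture "substances with multiple properties" is a small record type holding a table of named properties, with the manometric units derived from it instead of from a bare "water" constant. This is only a minimal sketch in Python, not the project's implementation: the `Substance` class, its `properties` keys, and the method names are all hypothetical. The densities are the conventional values used to define mmH2O and mmHg.

```python
from dataclasses import dataclass, field
from typing import Dict

STANDARD_GRAVITY = 9.80665  # m/s^2, standard acceleration of gravity


@dataclass
class Substance:
    """A named substance carrying several independent properties."""

    name: str
    # Property name -> value in SI units (hypothetical naming scheme).
    properties: Dict[str, float] = field(default_factory=dict)

    def specific_weight(self) -> float:
        """Weight per unit volume in N/m^3, derived from the density."""
        return self.properties["density"] * STANDARD_GRAVITY

    def column_pressure(self, height_m: float) -> float:
        """Pressure in Pa under a column of this substance of the given height."""
        return self.specific_weight() * height_m


# Conventional densities behind the manometric units, in kg/m^3.
water = Substance("water", {"density": 1000.0})
mercury = Substance("mercury", {"density": 13595.1})

# A 1 mm column of each substance, expressed in pascals.
mmH2O = water.column_pressure(0.001)
mmHg = mercury.column_pressure(0.001)
print(f"1 mmH2O = {mmH2O:.5f} Pa")
print(f"1 mmHg  = {mmHg:.6f} Pa")
```

Keeping density as one property among many means the same record could later carry molar mass or a melting point without a separate unit being invented for each.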
Summoning @bofh453 |
I've added substances in a branch. It looks like this:
(test being a made-up substance) Next I have to update the definitions file to use them. |
This turns out to be shockingly hard. In general (especially for "engineering" properties such as the bulk/elastic/shear/Young's moduli), the only solution is scraping papers and patents. This often requires OCRing first, especially for the patents. Okay, less so now, but you still sometimes need to double-check that Google Patents OCR'd the thing correctly.

That being said, there's a bunch of stopgaps. That's the good thing. The bad thing is that most require registration at minimum. The two big ones to start with are http://www.chemnetbase.com/ (ChemNetBase) and http://www.chemspider.com/ (RSC ChemSpider). Both have fairly comprehensive web APIs for bulk-fetching of data. I believe the latter's is still open to anyone without login, but I'm not sure.

Edit: NIST has already nicely scraped the CRC handbook into a DB for you. Available here: https://www.nist.gov/pml/productsservices/physical-reference-data

For simple stuff, such as atoms and simple compounds, just scrape all of the CRC Handbook of Chemistry and Physics into a text file. It's probably been done before to something structured, though you can already get something almost easily parsable just by grabbing the 2014 copy of it from libgen and running pdftotext -raw CRCHandbook.pdf CRCHandbook.txt. This handbook, btw, is the source of most of the periodic table data you've seen anywhere, though it may have hopped through 3-15 reprints to get there.

Turns out both getting and aggregating experimental data is hard. Other useful things:
|
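Once pdftotext has produced a text file, turning it into something structured is mostly line-by-line pattern matching. The sketch below assumes one record per line in the form "name, whitespace, number". That layout is an assumption for illustration, not the handbook's real one, and the sample lines are hand-written rather than extracted from it. The `ROW` pattern and `parse_rows` helper are hypothetical names.

```python
import re
from typing import Dict, Iterable

# One record per line: a name, whitespace, then a decimal number.
# This layout is an assumption; real pdftotext output will need its own pattern.
ROW = re.compile(r"^(?P<name>[A-Za-z][A-Za-z \-]*?)\s+(?P<value>\d+(?:\.\d+)?)\s*$")


def parse_rows(lines: Iterable[str]) -> Dict[str, float]:
    """Collect name -> value pairs, skipping lines that do not match."""
    table: Dict[str, float] = {}
    for line in lines:
        match = ROW.match(line.strip())
        if match:
            table[match.group("name")] = float(match.group("value"))
    return table


# Hand-written sample (densities in g/cm^3), not real handbook output.
sample = [
    "Density of selected elements",  # heading line, has no number, so it is skipped
    "Iron 7.874",
    "Copper  8.96",
    "",
]
densities = parse_rows(sample)
print(densities)
```

With a real file, the same function could be fed an open file handle, for example `parse_rows(open("CRCHandbook.txt"))`, though multi-column tables and wrapped rows would likely need extra handling.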
Oh, one more thing: NIST has a ton of spectral data easily available: http://webbook.nist.gov/chemistry/name-ser.html No official API, but seeing as I can basically programmatically fetch things by hand using extremely trivial curl POST requests, and there are no ratelimits, well, yeah. |
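The curl POST approach described above could be scripted roughly as follows. This is a sketch under assumptions: the `ENDPOINT` URL and the `Name` and `Units` field names are guesses at what the name-search form submits, so check the form's HTML before relying on them. The delay is a courtesy, since the absence of rate limits is the commenter's observation rather than a documented guarantee.

```python
import time
import urllib.parse
import urllib.request

# Assumed form target; verify against the HTML of the name-search page.
ENDPOINT = "https://webbook.nist.gov/cgi/cbook.cgi"


def build_form_body(name: str, units: str = "SI") -> bytes:
    """Form-encode the search fields (the field names are assumptions)."""
    return urllib.parse.urlencode({"Name": name, "Units": units}).encode("ascii")


def fetch_page(name: str, delay_s: float = 1.0) -> str:
    """POST one name query and return the raw HTML response."""
    time.sleep(delay_s)  # stay polite even if no rate limit is enforced
    # Supplying `data` makes urllib send a POST instead of a GET.
    request = urllib.request.Request(ENDPOINT, data=build_form_body(name))
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_page("water")` would return HTML that still has to be parsed, which is the price of there being no official API.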
Not quite as convenient as I was hoping for, but I'll definitely have a go at obtaining the data from these sources. |
I've pushed support for substances to master. The original issue should be resolved now, but I will leave this open for the second part about sourcing data.
|
@bofh453 I'm having some difficulty with these sources. The NIST data only seems to have a small subset of what the CRC handbook offers: it doesn't seem to have any properties other than stuff like molar mass and ionization energy of the elements. I already have molar masses for all the elements, but the data isn't cited. ChemNetBase wants me to log in with a subscribing organization to view data, and ChemSpider seems to only have predicted properties for the queries I've tried so far. Are these predicted properties accurate? Unless I'm missing something, I may have to obtain a PDF of the CRC handbook and get the data out like you said. For reference, here are some of the properties I'm interested in (I don't necessarily need or want all of them at the same time):
As for which substances I want properties for, I'd like to get all of the elements (possibly for more than one isotope, e.g. uranium-238 and uranium-235) as well as a number of common materials like stone, wood, glass, steel, oil, and gasoline. Does the CRC handbook even have this data? You did say engineering properties are difficult to come by, and that's pretty much exactly what I'm looking for... I'm not sure where to start with OCRing patents, but that sounds like quite a lot of manual work to do for 118 elements. Should I give up on getting this data for elements in general and focus on the materials I mentioned? That way the data set is small enough that I can enter it by hand. |
Dwarf Fortress has been slowly building a list of material properties with help from the players on the forums. I'm not sure about the license on that collection, though: https://dwarffortresswiki.org/index.php/DF2014:Material_definition_token |
I tried to do this today, expecting something with densities:
Then I tried to see what various substance names map to, and it's kind of a mess...