INV: Handling multiple MIP projects in activity_id #41
Comments
True story. The great Aristotle once wrote, "I don't like spaces in file names." If there is a good reason to keep "AerChemMIP", I'd agree with hyphens. If we are only downloading "ScenarioMIP" data (and its historical counterparts), I would say we drop the other names.
If we want to keep both, I'd say that for both the folder structure and the catalog, using a hyphen would quickly turn into a nightmare due to all the possible combinations.
As we specifically downloaded that data for ScenarioMIP, I would only keep that. In the catalog, people might search for all ScenarioMIP data and would not find whatever is in ScenarioMIP-AerChemMIP. Also, I think there are other experiments that are part of more than two MIPs. It would get complicated quickly...
I would be OK with the option of dropping (setting a preferred order would work well) if that behaviour could be configured for multiple use cases. Having some kind of option in the restructure_datasets function would be best, but how best to specify this?

When decoding, the validation step demands a string that is a member of the CMIP6 controlled vocabulary. I can change this to allow a list of values, then check that they are all members of the controlled vocabulary. This would better handle cases of files being shared between three or more MIPs (do those exist?).

Another option would be to create two entries for the file, one according to each MIP, and hard-link those files so that they can be found in either filetree (ScenarioMip/this/that/file.nc and AerChemMip/this/that/file.nc). This solves the catalogue issue by creating two entries while not increasing the disk space used. This approach is a bit overkill, but would be surprisingly easy to implement.

I feel like we all have opinions on this.
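To make the hard-link option concrete, here is a rough sketch of what it could look like. It is not the project's actual implementation: `KNOWN_ACTIVITIES`, `split_activities` and `link_into_trees` are hypothetical names, and the vocabulary is only a small illustrative subset of the CMIP6 activity_id values. It also assumes the source file and the target trees live on the same filesystem, since hard links cannot cross filesystems.

```python
import os
from pathlib import Path

# Illustrative subset only; a real check would load the full CMIP6
# controlled vocabulary for activity_id.
KNOWN_ACTIVITIES = {"AerChemMIP", "CMIP", "ScenarioMIP"}


def split_activities(activity_id: str) -> list[str]:
    """Split a space-separated activity_id and validate every member."""
    activities = activity_id.split()
    unknown = [a for a in activities if a not in KNOWN_ACTIVITIES]
    if not activities or unknown:
        raise ValueError(f"Invalid activity_id {activity_id!r}; unknown: {unknown}")
    return activities


def link_into_trees(source: Path, root: Path, activity_id: str, relative: Path) -> list[Path]:
    """Hard-link `source` to root/<activity>/<relative> for each activity."""
    targets = []
    for activity in split_activities(activity_id):
        target = root / activity / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        if not target.exists():
            # A hard link is a second name for the same inode,
            # so no additional disk space is used.
            os.link(source, target)
        targets.append(target)
    return targets
```

Calling `link_into_trees(src, root, "ScenarioMIP AerChemMIP", Path("this/that/file.nc"))` would then leave the same file reachable under both activity directories.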
I like the magical symlink solution, if it is easy to implement!
I thought there might be experiments with more than two MIPs, because I had seen a well-populated column called
If it is easy to have it in both ScenarioMIP and AerChemMIP without taking too much space, that is great!!
My understanding is that the 'real' file would only be at one location, but both filetrees would see it. So it takes the same space as only having it once in ScenarioMIP.
The only major issue with hard links is that certain operations can break them: if you copy hard-linked files to another host without telling the tool to preserve hard links, you end up with two separate files. Conversely, while the links are intact, modifying one file modifies the other as well. It's something that needs to be taken into consideration. I can open a PR to address this in the coming weeks.
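Both caveats can be reproduced locally with nothing but the standard library. The sketch below is only an illustration (the function name `demo_hard_link_behaviour` and the file names are made up): it hard-links a file, makes a plain copy, then appends through the linked name.

```python
import os
import shutil
import tempfile
from pathlib import Path


def demo_hard_link_behaviour() -> dict[str, bool]:
    """Show that hard links share content while a plain copy does not."""
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "file.nc"
        linked = Path(tmp) / "linked.nc"
        copied = Path(tmp) / "copied.nc"

        original.write_text("v1")
        os.link(original, linked)     # second name for the same inode
        shutil.copy2(linked, copied)  # plain copy: a new, independent file

        # An in-place write through one name is visible through the other.
        with open(linked, "a") as handle:
            handle.write("+edit")

        return {
            "link_shares_inode": os.path.samefile(original, linked),
            "edit_visible_via_original": original.read_text() == "v1+edit",
            "copy_shares_inode": os.path.samefile(original, copied),
            "copy_saw_edit": copied.read_text() == "v1+edit",
        }
```

The first two entries come back `True` and the last two `False`. When moving data between hosts, rsync's `-H` (`--hard-links`) option is one way to keep the links intact.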
Just a reminder that we still have ScenarioMIP-AerChemMIP in the path. I think the conclusion here is to have ScenarioMIP and AerChemMIP directories, with everything currently in ScenarioMIP-AerChemMIP hard-linked into both. Not crucial, as my catalog sees everything as ScenarioMIP. But this is a reminder that for the final form of /datasets, this needs to be addressed.
The decoder currently treats the entire string of attrs.activity_id for CMIP6-endorsed MIPs as the activity. However, I ran into this today in our database:

Since this field is used in creating the filetree, while it is technically valid to have spaces in a path, the idea of creating POSIX paths with escaped spaces runs counter to all known ethics and reason.
Proposal - hyphening:
ScenarioMIP AerChemMIP → ScenarioMIP-AerChemMIP
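As a minimal sketch of the hyphening step (the helper name `hyphenate_activity_id` is hypothetical and not part of the decoder), splitting on whitespace and re-joining also absorbs repeated or stray spaces:

```python
def hyphenate_activity_id(activity_id: str) -> str:
    """Join whitespace-separated MIP names with hyphens for use in a path."""
    return "-".join(activity_id.split())
```

A single-MIP value such as "ScenarioMIP" passes through unchanged.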
Thoughts?