
Update soils data used for surface dataset #1303

Closed
wwieder opened this issue Mar 17, 2021 · 154 comments
Labels
enhancement (new capability or improved behavior of existing capability), science (Enhancement to or bug impacting science)
@wwieder
Contributor

wwieder commented Mar 17, 2021

It would be nice to update the soils data we're using to generate the surface dataset to something from this century. This will introduce a number of answer changes to the code, but it seems worth having a discussion about what we need here.

@dlawrenncar suggested using SoilGrids data, which just released version 2.0 of their dataset: https://doi.org/10.5194/soil-2020-65. SoilGrids 2.0 contains information on soil texture, OC content, pH, bulk density, coarse fragments, CEC, and soil N at 250 m resolution for 6 soil layers (0-200 cm). This high-resolution data also includes uncertainty estimates! According to the data providers, v2.0 has changed significantly from previous releases of the dataset, but is currently only available at 250 m resolution.

Laura Poggio and Niels Batjes at ISRIC are interested in and willing to provide a coarser-resolution data product for our purposes and wondered what we wanted. I've basically told them we'd like the whole dataset, but to prioritize texture and soil C information. Is a 5 km data product adequate for NWP applications, but not too unwieldy for climate simulations? Do we need 1 km resolution mapping files?

I also wonder whether we should think about how to generate soil properties for the hillslope model. Does this happen in our own toolchain, or could it be generated in the mapping files from ISRIC? This is likely of secondary concern, but may be worth discussing.

@wwieder wwieder added the enhancement (new capability or improved behavior of existing capability), support (user or developer needs help), and type: -discussion labels Mar 17, 2021
@wwieder wwieder added this to the ctsm5.2.0 milestone Mar 17, 2021
@billsacks billsacks removed the type: -discussion and support (user or developer needs help) labels Mar 18, 2021
@wwieder
Contributor Author

wwieder commented Mar 26, 2021

ISRIC is suggesting they produce 1 km and 5 km SoilGrids products in (web-optimized) GeoTIFF format. Is this something we can use in the toolchain, @slevisconsulting and @negin513? How hard should I push for a 3 arc-minute product instead, and a .nc format?

@billsacks
Member

I'm pretty sure we're going to want NetCDF for our toolchain: even if we could read GeoTIFF directly, I'm not sure it's a good idea to have raw data in a mix of formats; I think that would cause pain long-term. That said, I don't have strong feelings on whether we ask them to produce NetCDF or we convert their GeoTIFF files to NetCDF as a one-time initial step.

Regarding resolution: First, I realized that our existing 1km file may not actually be uniform 1km. Looking at the file name and metadata, I'm remembering that @swensosc merged 10' data from some regions with 1km data for most of the globe; my sense (maybe wrong) is that the resulting dataset is therefore an unstructured mix of resolutions. Regarding 5km vs. 3 arc-minute: maybe we need to discuss as a group how much to push for conformity to a few standard resolutions vs. accepting whatever we get. I suspect that, if we use 5km, it will be the only dataset on this exact grid, somewhat increasing the time it takes to go through the toolchain, though probably not too badly for 5km (as opposed to 1km, which would be worse).

@dlawrenncar
Contributor

dlawrenncar commented Mar 26, 2021 via email

@slevis-lmwg
Contributor

And with the long term in mind, it's probably best to accept the highest resolution that they have to offer. Then, as @billsacks and @dlawrenncar said, we can spend the time once to get the data in the exact form that we can work with.

@wwieder
Contributor Author

wwieder commented Mar 28, 2021 via email

@wwieder
Contributor Author

wwieder commented Feb 8, 2022

New 1km and 5km resolution products are now available from SoilGrids.

You can find the data here: https://files.isric.org/soilgrids/latest/data_aggregated/
The metadata (including the DOI for citations) can be found here: https://data.isric.org/

The data producers have asked for input on these data products, which I am happy to provide. What should be our workflow to start testing these data in new surface datasets?

@wwieder wwieder added the next (this should get some attention in the next week or two; normally each Thursday SE meeting) label Feb 8, 2022
@dlawrenncar
Contributor

dlawrenncar commented Feb 8, 2022 via email

@wwieder
Contributor Author

wwieder commented Feb 8, 2022

Files are in GeoTIFF format.

I'm assuming we'll need to merge the datasets into a single .nc file first.

@ekluzek
Collaborator

ekluzek commented Feb 8, 2022

Yes, we'll need to convert to NetCDF and make sure the fields needed by mksurfdata_map are on them. So @wwieder, are there several data files for different global regions? If so, as you suggest, we'd need to merge them into one global dataset. (It turns out all of the datasets are in one global file.)

@wwieder
Contributor Author

wwieder commented Feb 8, 2022

The variables of interest include data on clay, sand, and soil organic C that we need now, but also data on soil N, pH, CEC, bulk density, etc. that may be useful down the road. I'm somewhat inclined to include more fields than we need when generating our 'raw' dataset.

Each variable is provided as 6 tiff files (one for each soil layer: 0-5, 5-15, ..., 100-200 cm). These should be concatenated along a depth coordinate.

We'll just have to maintain the metadata, or adjust units as appropriate, because my recollection is that the units, especially for soil C, are kind of odd.

Translating the .tif files into .nc seems pretty trivial.
https://nsidc.org/support/faq/how-can-i-convert-geotiff-netcdf
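
For instance, a minimal sketch of that conversion (assuming rioxarray/xarray; the per-layer file names below are placeholders rather than the exact SoilGrids names, and the midpoint depths are just one choice of coordinate):

```python
# Sketch: stack the six per-layer SoilGrids GeoTIFFs for one variable into a
# single NetCDF file with a depth coordinate. File names are hypothetical.
import xarray as xr
import rioxarray  # registers the rasterio-backed reader

# SoilGrids standard depth intervals (cm) and their midpoints
layers = ["0-5cm", "5-15cm", "15-30cm", "30-60cm", "60-100cm", "100-200cm"]
depth_mid = [2.5, 10.0, 22.5, 45.0, 80.0, 150.0]

slices = []
for name in layers:
    da = rioxarray.open_rasterio(f"sand_{name}_mean_5000.tif")  # placeholder path
    da = da.squeeze("band", drop=True).rename({"x": "lon", "y": "lat"})
    slices.append(da)

sand = xr.concat(slices, dim="depth").assign_coords(depth=("depth", depth_mid))
sand.depth.attrs["units"] = "cm"
sand.name = "sand"

sand.to_dataset().to_netcdf("sand_0-200_mean_5000.nc")
```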

@wwieder
Contributor Author

wwieder commented Feb 12, 2022

This isn't a finished product, as I still need to bring in metadata somehow (it's listed elsewhere on the SoilGrids website) and handle a number of other details, but here's my first attempt at converting a GeoTIFF into a .nc file; the sand projection seems reasonable:
/glade/scratch/wwieder/SoilGrids/ncMerged/sand_0-300_mean_5000.nc

This projection is not wall-to-wall (lat does not span -90 to 90). Does this matter for mksurfdata? What other considerations need to be made?

@ekluzek
Collaborator

ekluzek commented Feb 12, 2022

Good to see! In principle I think it's OK for mksurfdata_map that the data don't cover the entire globe; the mapping will be done for the part of the grid that they do cover. I thought it might be a problem that they don't cover Antarctica, but neither does the current file we use, so I guess that's OK.

Another thing that will need to be done is to create a SCRIP grid file that describes the grid and the vertices of each gridcell. The current file just has the gridcell center coordinates. Since it's almost exactly a regular grid, we can calculate the vertices.
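
As a sketch of what that involves (assuming numpy/xarray and a regular, ascending lat/lon grid; field names follow the usual SCRIP convention, but this has not been tested against mksurfdata_map), the corners can be computed from the centers like this:

```python
# Sketch: build a minimal SCRIP-style grid description from 1-D center
# coordinates of a regular lat/lon grid, computing the four cell corners
# by offsetting the centers by half the grid spacing.
import numpy as np
import xarray as xr

def scrip_from_centers(lat, lon):
    nlat, nlon = len(lat), len(lon)
    dlat, dlon = np.median(np.diff(lat)), np.median(np.diff(lon))

    lon2d, lat2d = np.meshgrid(lon, lat)
    center_lat, center_lon = lat2d.ravel(), lon2d.ravel()

    # corner order: SW, SE, NE, NW (counterclockwise for ascending coordinates)
    off_lat = np.array([-0.5, -0.5, 0.5, 0.5]) * dlat
    off_lon = np.array([-0.5, 0.5, 0.5, -0.5]) * dlon
    corner_lat = center_lat[:, None] + off_lat[None, :]
    corner_lon = center_lon[:, None] + off_lon[None, :]

    return xr.Dataset({
        "grid_dims": (("grid_rank",), np.array([nlon, nlat], dtype="i4")),
        "grid_center_lat": (("grid_size",), center_lat),
        "grid_center_lon": (("grid_size",), center_lon),
        "grid_corner_lat": (("grid_size", "grid_corners"), corner_lat),
        "grid_corner_lon": (("grid_size", "grid_corners"), corner_lon),
        "grid_imask": (("grid_size",), np.ones(nlat * nlon, dtype="i4")),
    })
```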

@wwieder
Contributor Author

wwieder commented Feb 14, 2022

OK, here's a full 5000m dataset with soil properties from SoilGrids.
/glade/scratch/wwieder/SoilGrids/ncMerged/SoilGrids_mean_5000_merged.nc

We can add additional metadata and talk about where to put my notebook that generated these plots.
It may be worth discussing implementation of certain fields as we generate surface datasets, but hopefully this is enough to get us started.

@wwieder
Contributor Author

wwieder commented Feb 14, 2022

Notebook with code can be found here https://github.com/wwieder/ctsm_py/blob/master/notebooks/tiff2nc.ipynb

@wwieder
Contributor Author

wwieder commented Feb 15, 2022

Sorry, I'm still struggling to understand what's needed here.

There are a bunch of ways to reproject the original tiff data (see this website), but I can't really find anything that would be better than what's already provided.

Moreover, the longitude spacing seems pretty regular, and the latitude spacings are identical. Below are the longitude spacings:

-0.04551960876054295

-0.04551960876057137

From here, can't we calculate the corners of each gridcell?
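
For example, a quick sanity check of the spacing on the merged file (a sketch only, run against a local copy):

```python
# Sketch: confirm the longitude spacing is close enough to uniform to treat
# the grid as regular; if so, corners are just midpoints between centers.
import numpy as np
import xarray as xr

ds = xr.open_dataset("SoilGrids_mean_5000_merged.nc")  # local copy of the file above
dlon = np.diff(ds["lon"].values)
print(dlon.min(), dlon.max(), dlon.max() - dlon.min())
```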

@wwieder
Contributor Author

wwieder commented Feb 15, 2022

@swensosc can you have a look at the dataset below to see what we can do to calculate the corners of each gridcell in a way that can be read into mksurfdata_map?

/glade/scratch/wwieder/SoilGrids/ncMerged/SoilGrids_mean_5000_merged.nc

@wwieder
Contributor Author

wwieder commented Feb 17, 2022

@uturuncoglu, @mvertens mentioned that you have a tool that generates a mesh file from a raw dataset.
(sorry @kauff, this was supposed to go to Ufuk.)

I'm wondering if the dataset below has the information your script needs:
/glade/scratch/wwieder/SoilGrids/ncMerged/SoilGrids_mean_5000_merged.nc

@mvertens

@uturuncoglu - I was referring to the ncl/python code you have to take a lat/lon grid (or logically rectangular grid) and create a mesh file.

@mvertens

@uturuncoglu - it would be great to make this available to the TSS group - even if it's not totally finished.

@uturuncoglu

Hi All,

The Python tool is in my personal Gist repository. You can find it here:

https://gist.github.com/uturuncoglu/4fdf7d4253b250dcf3cad2335651f162

The NCL one is here:

https://gist.github.com/uturuncoglu/1da852ffe2e0247aa4bb0caf2e79df7a

BTW, just note that those do not work for all cases; let me know if you need anything.

@uturuncoglu

We could try the tools with /glade/scratch/wwieder/SoilGrids/ncMerged/SoilGrids_mean_5000_merged.nc to see what happens. @wwieder, do you want to try it yourself, or should I try it for you?

@olyson
Contributor

olyson commented Apr 14, 2022

Thanks @slevisconsulting, that file worked. There is a 5-year climo comparison (10-year trends) here:

https://webext.cgd.ucar.edu/I2000/ctsm51sp_ctsm51d090_2deg_GSWP3V1_soiltex_hist/lnd/ctsm51sp_ctsm51d090_2deg_GSWP3V1_soiltex_hist.2005_2009-ctsm51sp_ctsm51d090_2deg_GSWP3V1_hist.2005_2009/setsIndex.html

I haven't looked in detail, but I don't see any major differences or problems.

@dlawrenncar
Contributor

dlawrenncar commented Apr 14, 2022 via email

@mvertens

@dlawrenncar @olyson - this is really encouraging news. @slevisconsulting - we should meet to discuss how to integrate this new dataset into the main branch for mksurfdata_esmf.
I want to emphasize that what I have done is create an intermediate-resolution mapunit dataset (soiltex_mapunits_4320x2160_c220329.nc) by mapping the 30-second original mapunit dataset provided by @wwieder to this grid. The original mapping from 30 seconds to 2 degrees took over 2 hours, which is not feasible for most of our resolutions. Any model resolution higher than 4320x2160 should use the original 30-second dataset, and we should put this capability into the namelist generation (a rough sketch follows below).
@slevisconsulting - it would be good to talk about how we want to proceed in the next day or two.
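
A rough sketch of that selection logic (illustrative only, not the actual mksurfdata_esmf namelist code; the 30-second file name is the one quoted later in this thread):

```python
# Sketch: pick the soil-texture mapunit raw dataset based on the size of the
# target model grid, per the rule described above.
def soiltex_mapunit_file(target_nlon, target_nlat):
    if target_nlon * target_nlat > 4320 * 2160:
        return "wise_30sec_v1_grid2.nc"              # original 30-second mapunits
    return "soiltex_mapunits_4320x2160_c220329.nc"   # pre-coarsened mapunits

# e.g. a 2-degree grid (144 x 96) would use the intermediate-resolution file
print(soiltex_mapunit_file(144, 96))
```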

@wwieder
Contributor Author

wwieder commented Apr 20, 2022

Next up: ORGANIC, using the same map unit information being used for texture.

I'm used to thinking about calculating organic matter stocks (kg C/m2), but the model only cares about organic matter density (kg OM/m3).

Technically, this should be calculated on just the fine earth fraction (1 - coarse fragment fraction).

The WISE lookup table has all of this information:

Property   Units              Long name
ORGC       g C kg^-1 soil     organic carbon content
BULK       g soil cm^-3       bulk density
CFRAG      % (volumetric)     coarse fragments

Additionally, we'll assume 1 g OM = 0.58 g C. NOTE: I have no idea where this conversion factor comes from, but I'm assuming it was used for the old calculation of ORGANIC we've been using.

Thus:

ORGANIC = ORGC * BULK * (100 - CFRAG)/100 * 1/0.58

This should provide ORGANIC (kg OM m^-3 soil).
I think all the unit conversions (g to kg and cm^3 to m^3) cancel out.
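
As a unit check, here is a small sketch of that calculation (my reading of the conversion, not the final mksurfdata code):

```python
# Sketch: organic matter density from the WISE lookup-table fields.
def organic_density(orgc_g_per_kg, bulk_g_per_cm3, cfrag_pct):
    """Return ORGANIC in kg OM per m3 of soil.

    orgc: g C per kg soil; bulk: g soil per cm3; cfrag: coarse fragments, volume %
    Assumes 1 g OM = 0.58 g C, as above.
    """
    kgc_per_m3 = orgc_g_per_kg * bulk_g_per_cm3    # g C/kg * g/cm3 = kg C/m3
    kgc_per_m3 *= (100.0 - cfrag_pct) / 100.0      # fine-earth (non-rock) volume only
    return kgc_per_m3 / 0.58                       # carbon mass -> organic matter mass

# e.g. an organic horizon: ORGC=300 g C/kg, BULK=0.25 g/cm3, CFRAG=0 %
print(organic_density(300.0, 0.25, 0.0))  # ~129 kg OM/m3
```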

Here are samples for two different profiles (not sure where they are):
[image: ORGANIC profiles for two sample map units]

@dlawrenncar can you check that this matches your expectations (and the unit conversions)? Also, I think it's more correct to use just the 'fine earth fraction' by removing the coarse fragments (rocks), but I don't know what you think.

@dlawrenncar
Contributor

dlawrenncar commented Apr 20, 2022 via email

@mvertens

@wwieder @slevisconsulting
I have the following two datasets in my own sandbox that I'd like to put into inputdata

  1. /glade/u/home/mvertens/src/ctsm.new_mksurfdata/tools/mksurfdata_esmf/soiltex_mapunits_4320x2160_c220329.nc
  2. /glade/u/home/mvertens/src/ctsm.new_mksurfdata/tools/mksurfdata_esmf/wise_30sec_v1_lookup2.nc
    which points to
    /glade/scratch/wwieder/wise_30sec_v1/WISE30sec/Interchangeable_format/wise_30sec_v1_lookup2.nc

I'd like to move both of these to inputdata and put them in the xml file so that we are not pointing to datasets in scratch.
What should we call these files when I move them?

@slevis-lmwg
Contributor

Probably we should discuss.

I'm happy to set up a follow-up meeting, though @wwieder could you share your calendar with me so that I may pick a time that works for all? Or feel free to schedule this next meeting.

@wwieder
Contributor Author

wwieder commented Apr 21, 2022

I made a meeting for tomorrow. Hopefully 30 minutes is enough.

Before uploading files to inputdata, I also wondered whether the metadata on these files is sufficient or whether additional information is needed (e.g., the script used to generate the .nc files from the raw data we're getting from WISE).

@ekluzek
Collaborator

ekluzek commented Apr 22, 2022

@wwieder we haven't always saved the scripts that create the raw data files. I do think it's always important to save some metadata that describes what was done, though. If the manipulations were straightforward enough that someone could recreate the process later, it's not strictly necessary to save the script. In general we don't need to reconstruct these datasets, but for scientific reproducibility the instructions should be good enough that you could create them again. So if you do lots of complex manipulations that you can't describe easily, it might be best to archive the script. The only other reason to archive the script is if we think we'll use it again fairly soon, which seems unlikely to me.

But you've already archived the script in your GitHub repo, and that's sufficient for me. It doesn't look like you do anything unduly complex, and the link above already connects the script to this issue, so it will be straightforward to find again.

@ekluzek
Collaborator

ekluzek commented Apr 22, 2022

In advance of our meeting and in terms of things I find important to have in filenames:

  • Description of the horizontal resolution
  • Description of the vertical resolution, number of levels and maybe depth
  • Description of the data source
  • Possibly the years that the data represents
  • Possibly some type of version number if applicable
  • The creation date of the file in our standard CESM format _cYYYYMMDD

For mksurfdata we've prepended "mksrf_" to our filenames. Since almost all of the files in that directory have that prefix, I don't think it's very helpful, so moving away from it would be reasonable now.
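
As an illustration only (a hypothetical name, not a decided convention), a file following the points above might look like `wise30sec_v1_soiltex_6layer_4320x2160_c20220422.nc`.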

@wwieder
Contributor Author

wwieder commented Apr 22, 2022

Sorry for the delay on this, but here are three Arctic map units, two from Alaska and one in Norway:
[image: ORGANIC profiles for three Arctic map units]
Was this the concern you had about the data we're providing?

The other suggestion was that we just include some measure of soil organic carbon content (ORGC) in the WISE dataset. Maybe @mvertens can include both values on the surface dataset?

@dlawrenncar
Contributor

Those all look reasonable, and it is comforting to see that they are all within the ballpark of the assumed 100% organic matter density of 130 kg OM/m3. But with the code as it exists now, just using these values won't work, unfortunately, which is why I am recommending that we switch to calculating %ORGANIC content and then using that directly, if it is possible (I'm not sure how to calculate this). Not sure what the other options are. @wwieder this probably requires a chat.

@wwieder
Contributor Author

wwieder commented Apr 22, 2022

Just taking the field ORGC*0.1 will give the units in %C (ORGC is in g C kg^-1 soil, so dividing by 10 gives g C per 100 g soil, i.e. weight-percent C).
Are these values you'd expect for organic soils, @dlawrenncar?
[image: weight-percent C for the sample map units]

@wwieder
Contributor Author

wwieder commented Apr 22, 2022

@mvertens I hope these two files have the right information. Their global attributes are likely worth bringing over into the files you're archiving.

The lookup table, used for all resolutions:
/glade/scratch/wwieder/wise_30sec_v1/WISE30sec/Interchangeable_format/wise_30sec_v1_lookup2.nc

and the 30-second resolution map units file:
/glade/scratch/wwieder/wise_30sec_v1/WISE30sec/Interchangeable_format/wise_30sec_v1_grid2.nc

@slevis-lmwg
Contributor

I merged #1732 so closing this issue.

@dlawrenncar
Contributor

dlawrenncar commented Oct 11, 2022 via email

@slevis-lmwg slevis-lmwg reopened this Oct 11, 2022
@ekluzek
Collaborator

ekluzek commented Oct 20, 2022

@slevisconsulting this is completed on the ctsm5.2 branch now, correct?

@slevis-lmwg
Contributor

Yes, and I had closed the issue, but I reopened it when I saw @dlawrenncar's Oct 11 post above.

@ekluzek
Collaborator

ekluzek commented Mar 18, 2024

I'm just noting that one thing we did here is bring in the soils data in single rather than double precision. This saves space, and the scientific accuracy of the data is well below even single precision anyway.
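
For reference, a minimal sketch of writing the fields at single precision with xarray (illustrative, not the actual script used):

```python
# Sketch: force all data variables to float32 when writing the NetCDF file.
import xarray as xr

ds = xr.open_dataset("SoilGrids_mean_5000_merged.nc")  # placeholder input
encoding = {var: {"dtype": "float32"} for var in ds.data_vars}
ds.to_netcdf("SoilGrids_mean_5000_merged_f32.nc", encoding=encoding)
```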
