GTC-3055 Added new GHG (greenhouse gas) analysis #252

Open · wants to merge 3 commits into master
Conversation

danscales (Collaborator):

This analysis computes the DLUC (direct land use change) GHG emissions factors for each location, based on the tree loss information for the location and the specified yield or commodity (which gives a default yield value). The emissions are calculated by a 20-year discounted formula, as described in GHGRawDataGroup.scala. We compute the emissions factors for each year in 2020-2023.

The new set of GHG analysis files is modeled most closely on the ForestChangeDiagnostic analysis files, but simplified as much as possible. The GHGSummary file contains the logic that computes the crop yield for a particular pixel location, using either the primary crop yield datasets or the backup yield CSV file. It also computes the emissions due to tree loss. The GHGRawDataGroup file does the emissions-factor computation.
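The 20-year discounted formula can be sketched as follows, using the coefficients visible in the GHGRawDataGroup excerpt quoted in the review below (object and parameter names here are illustrative, not the actual ones in the codebase). The linearly declining weights 0.0975, 0.0925, … sum to exactly 1.0 over the 20 years, so the full emissions are amortized across the window:

```scala
// Sketch of the 20-year discounted emissions-factor computation.
// Coefficients are taken from the GHGRawDataGroup excerpt in this PR;
// all names are assumptions for illustration.
object EmissionsFactorSketch {
  val BaseWeight = 0.0975        // weight applied in the loss year itself
  val AnnualDecline = 0.005      // weight decreases linearly each year after the loss
  val AmortizationYears = 20     // weights over 20 years sum to 1.0

  /** Emissions factor per reporting year, in CO2e per tonne of crop. */
  def emissionsFactors(
      lossYear: Int,
      emissionsCo2e: Double, // total CO2e emitted by the tree-loss event
      cropYield: Double,     // tonnes per hectare per year
      totalArea: Double,     // hectares
      minYear: Int,
      maxYear: Int
  ): Map[Int, Double] =
    (minYear to maxYear).flatMap { year =>
      val diff = year - lossYear
      if (diff >= 0 && diff < AmortizationYears)
        Some(year -> ((BaseWeight - diff * AnnualDecline) * emissionsCo2e) / (cropYield * totalArea))
      else None
    }.toMap
}
```

For example, a loss event emitting 100 t CO2e on a 5 ha plot yielding 2 t/ha contributes 0.975 CO2e per tonne of crop in the loss year, declining each year thereafter.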

Here are some of the other needed changes:

  • Added gross emissions datasets (already in use for some carbon flux analyses) to the Pro dataset catalog.
  • Added a new dataset per commodity (currently 6 commodities, will add more later) that gives the average crop yield in each 10km area. Added a generic MapspamYield layer to access these commodity datasets.
  • Added a CSV file with "backup" crop yields per GADM2 area, used when the primary crop yield rasters don't have a value for a particular pixel. This CSV file (which is under 26 MB) is broadcast once to each node (rather than sent with each task), so it is available for lookup without Spark tasks swamping the Data API with requests. The CSV file is specified by a command-line option and will be placed on S3.
  • Added a new Feature and FeatureId "gfwpro_ext" that includes "commodity" and "yield" fields, which are needed to specify the exact crop yield or commodity grown at each location.
  • Changed ErrorSummaryRDD to allow passing in the featureId to the polygonalSummary code as part of kwargs. This is needed so we can pass the yield and commodity information into the GHGSummary code. We need the yield/commodity during the per-pixel analysis.
  • Added a GHG test. The input file for the test case (ghg.tsv) covers a variety of commodities and yields, including the error case where no default yield for a commodity can be found for a location. A partial backup GADM2 yield file is included in the test files (part_yield_spam_gadm2.csv).
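A rough sketch of the backup-yield lookup described above. The three-column row shape (gadmId, commodity, yield) and all names are assumptions for illustration; the actual format is whatever part_yield_spam_gadm2.csv uses, and the real table is wrapped in a Spark `Broadcast` as the PR describes:

```scala
// Toy model of the per-GADM2 backup yield table. In the real code the
// parsed rows are broadcast once per node via SparkContext.broadcast and
// retrieved from kwargs inside the summary; here we just show the lookup.
object BackupYieldSketch {
  // Parse CSV lines of the (assumed) form "gadmId,commodity,yield"
  // into a map keyed by (GADM2 id, commodity). No error handling:
  // this is a sketch, not production parsing.
  def parseBackupYields(lines: Seq[String]): Map[(String, String), Double] =
    lines.map { line =>
      val fields = line.split(",")
      (fields(0), fields(1)) -> fields(2).toDouble
    }.toMap
}
```

With Spark, the parsed map would then be broadcast once (e.g. `spark.sparkContext.broadcast(parsed)`) and passed down through kwargs, so every task can consult it locally instead of calling the Data API.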

danscales requested a review from jterry64 on January 16, 2025.
jterry64 (Member) previously approved these changes on Jan 24, 2025 and left a comment:

Small nitpicks and suggestions, but otherwise looks good enough to merge to me. Let me know if you take any of these suggestions and I can look again.

// if there are any full window intersections, we only need to calculate
// the summary for the window, and then tie it to each feature ID
val fullWindowIds = fullWindowFeatures.map { case feature => feature.data}.toList
//if (fullWindowIds.size >= 2) {
jterry64 (Member):

Is this comment meant to be here still?

danscales (Author):

Removed it. Left over from a while ago.

val environmentVars = System.getenv().forEach {
case (key, value) => println(s"$key = $value")
// Print out environment variables (if needed for debugging)
if (false) {
jterry64 (Member):

is this meant to be here still?

danscales (Author):

Yes, I left the code in there (with 'if (false)') in case sometime later we want to print out environment variables for debugging purposes.

val treeCoverDensity2000: TreeCoverDensityPercent2000 = TreeCoverDensityPercent2000(gridTile, kwargs)
val grossEmissionsCo2eNonCo2: GrossEmissionsNonCo2Co2eBiomassSoil = GrossEmissionsNonCo2Co2eBiomassSoil(gridTile, kwargs = kwargs)
val grossEmissionsCo2eCo2Only: GrossEmissionsCo2OnlyCo2BiomassSoil = GrossEmissionsCo2OnlyCo2BiomassSoil(gridTile, kwargs = kwargs)
val mapspamCOCOYield: MapspamYield = MapspamYield("COCO", gridTile, kwargs = kwargs)
jterry64 (Member):

Does the whole list include multiple commodities, or is it just one commodity type per list? If the latter, it might simplify things to only pass one commodity in the tile, that's just parametrized by the commodity for that analysis.

danscales (Author):

Each location in the list can have a different commodity (or no commodity, but a yield specified). So, we need all the commodities as sources. But it is all lazy, so they are not opened or fetched for any particular window unless they are actually needed.
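The deferred-opening behavior described in this reply can be illustrated with plain Scala `lazy val`s. This is only a toy model: the actual layers are GeoTrellis raster sources whose tiles are fetched per window, not simple lazy fields, but the principle — declaring every commodity source up front costs nothing until one is actually read — is the same:

```scala
// Toy illustration of lazily-opened commodity layers. Names are
// assumptions; the real code constructs MapspamYield layers per commodity.
object LazySourcesSketch {
  class CommodityLayers {
    var opened: List[String] = Nil
    private def open(name: String): String = {
      opened = opened :+ name       // record that this source was touched
      s"source:$name"
    }

    // Declared for every commodity, but open() only runs on first access.
    lazy val coco: String = open("COCO")
    lazy val soyb: String = open("SOYB")
  }
}
```

Constructing `CommodityLayers` opens nothing; only reading `coco` or `soyb` triggers the corresponding open, mirroring why declaring all six commodity datasets per tile is cheap.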

val efList = for (i <- minLossYear to maxLossYear) yield {
val diff = i - umdTreeCoverLossYear
if (diff >= 0 && diff < 20) {
(i -> ((0.0975 - diff * 0.005) * emissionsCo2e) / (cropYield * totalArea))
jterry64 (Member):

nitpick since we use magic numbers elsewhere, but named constants would be good here

danscales (Author):

Done, replaced them with constants. Thanks for all the comments!

val gadmAdm1: Integer = raster.tile.gadmAdm1.getData(col, row)
val gadmAdm2: Integer = raster.tile.gadmAdm2.getData(col, row)
val gadmId: String = s"$gadmAdm0.$gadmAdm1.${gadmAdm2}_1"
//println(s"Empty ${featureId.commodity} default yield, checking gadm yield for $gadmId")
jterry64 (Member):

remove comment

danscales (Author):

Done.

val gadmAdm2: Integer = raster.tile.gadmAdm2.getData(col, row)
val gadmId: String = s"$gadmAdm0.$gadmAdm1.${gadmAdm2}_1"
//println(s"Empty ${featureId.commodity} default yield, checking gadm yield for $gadmId")
val backupArray = kwargs("backupYield").asInstanceOf[Broadcast[Array[Row]]].value
jterry64 (Member):

nitpick, I'd call this "backupYieldArray" consistently since "backup" can mean a lot of things

danscales (Author):

Good suggestion, done!


val groupKey = GHGRawDataGroup(umdTreeCoverLossYear, cropYield)

// if (umdTreeCoverLossYear > 0) {
jterry64 (Member):

remove comment

danscales (Author):

Done.

case feature =>
getSummaryForGeom(List(feature.data), feature.geom)
}
if (kwargs.get("includeFeatureId").isDefined) {
jterry64 (Member):

Maybe this would be messy, but if you wanted to keep the optimization and not split the logic, theoretically every tile should produce the same results per commodity unless the user specifies the yield manually. In which case, couldn't you just apply the yield constant to your summary result per feature after doing runPolygonalSummary?

danscales (Author):

This is analysis-independent code, so I would not want to put anything in here that is specific to yield/commodity/GHG, etc.

Since GHG specifies a commodity or yield, it must be done on specific farms with specific crops. So, it will not be called on very large areas (which would not all be one farm with one kind of crop). So, the full window optimization would never actually be applicable for GHG anyway.

So, I don't think it would be worth trying to get this optimization to work in a general way in the case the featureId is passed down, since it would never be used in the one case (GHG) that we have.
