GTC-3055 Added new GHG (greenhouse gas) analysis #252

Open · wants to merge 3 commits into master
Conversation

danscales (Collaborator):

This analysis computes the DLUC (direct land use change) GHG emissions factors for each location, based on the tree loss information for the location and the specified yield or commodity (which gives a default yield value). The emissions are calculated by a 20-year discounted formula, as described in GHGRawDataGroup.scala. We compute the emissions factors for each year in 2020-2023.

The new set of GHG analysis files is modeled most closely on the ForestChangeDiagnostic analysis files, but simplified as much as possible. The GHGSummary file contains the logic that computes the crop yield for a particular pixel location, using either the primary crop yield datasets or the backup yield CSV file. It also computes the emissions due to tree loss. The GHGRawDataGroup file does the emissions-factor computation.
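The 20-year discounted formula can be sketched as follows, using the coefficients visible in the GHGRawDataGroup excerpt quoted in the review below (object and parameter names here are illustrative, not the actual ones in the codebase). The linearly declining weights 0.0975, 0.0925, … sum to exactly 1.0 over the 20 years, so the full emissions are amortized across the window:

```scala
// Sketch of the 20-year discounted emissions-factor computation.
// Coefficients are taken from the GHGRawDataGroup excerpt in this PR;
// all names are assumptions for illustration.
object EmissionsFactorSketch {
  val BaseWeight = 0.0975        // weight applied in the loss year itself
  val AnnualDecline = 0.005      // weight decreases linearly each year after the loss
  val AmortizationYears = 20     // weights over 20 years sum to 1.0

  /** Emissions factor per reporting year, in CO2e per tonne of crop. */
  def emissionsFactors(
      lossYear: Int,
      emissionsCo2e: Double, // total CO2e emitted by the tree-loss event
      cropYield: Double,     // tonnes per hectare per year
      totalArea: Double,     // hectares
      minYear: Int,
      maxYear: Int
  ): Map[Int, Double] =
    (minYear to maxYear).flatMap { year =>
      val diff = year - lossYear
      if (diff >= 0 && diff < AmortizationYears)
        Some(year -> ((BaseWeight - diff * AnnualDecline) * emissionsCo2e) / (cropYield * totalArea))
      else None
    }.toMap
}
```

For example, a loss event emitting 100 t CO2e on a 5 ha plot yielding 2 t/ha contributes 0.975 CO2e per tonne of crop in the loss year, declining each year thereafter.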

Here are some of the other needed changes:

  • Added gross emissions datasets (already in use for some carbon flux analyses) to the Pro dataset catalog.
  • Added a new dataset per commodity (currently 6 commodities, will add more later) that gives the average crop yield in each 10km area. Added a generic MapspamYield layer to access these commodity datasets.
  • Added a CSV file with "backup" crop yields per GADM2 area, used when the primary crop yield rasters don't have a value for a particular pixel. This CSV file (which is under 26 MB) is broadcast once to each node (rather than sent with each task), so it is available for lookup without Spark tasks swamping the Data API with requests. The CSV file is specified by a command-line option and will be placed on S3.
  • Added a new Feature and FeatureId "gfwpro_ext" that includes "commodity" and "yield" fields, which are needed to specify the exact crop yield or commodity grown at each location.
  • Changed ErrorSummaryRDD to allow passing in the featureId to the polygonalSummary code as part of kwargs. This is needed so we can pass the yield and commodity information into the GHGSummary code. We need the yield/commodity during the per-pixel analysis.
  • Added a GHG test. The input file for the test case (ghg.tsv) covers a variety of commodities and yields, including the error case where no default yield for a commodity can be found for a location. A partial backup GADM2 yield file is included in the test files (part_yield_spam_gadm2.csv).
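A rough sketch of the backup-yield lookup described above. The three-column row shape (gadmId, commodity, yield) and all names are assumptions for illustration; the actual format is whatever part_yield_spam_gadm2.csv uses, and the real table is wrapped in a Spark `Broadcast` as the PR describes:

```scala
// Toy model of the per-GADM2 backup yield table. In the real code the
// parsed rows are broadcast once per node via SparkContext.broadcast and
// retrieved from kwargs inside the summary; here we just show the lookup.
object BackupYieldSketch {
  // Parse CSV lines of the (assumed) form "gadmId,commodity,yield"
  // into a map keyed by (GADM2 id, commodity). No error handling:
  // this is a sketch, not production parsing.
  def parseBackupYields(lines: Seq[String]): Map[(String, String), Double] =
    lines.map { line =>
      val fields = line.split(",")
      (fields(0), fields(1)) -> fields(2).toDouble
    }.toMap
}
```

With Spark, the parsed map would then be broadcast once (e.g. `spark.sparkContext.broadcast(parsed)`) and passed down through kwargs, so every task can consult it locally instead of calling the Data API.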

danscales requested a review from jterry64 on January 16, 2025.
jterry64 (Member) previously approved these changes on Jan 24, 2025 and left a comment:

Small nitpicks and suggestions, but otherwise looks good enough to merge to me. Let me know if you take any of these suggestions and I can look again.

// if there are any full window intersections, we only need to calculate
// the summary for the window, and then tie it to each feature ID
val fullWindowIds = fullWindowFeatures.map { case feature => feature.data}.toList
//if (fullWindowIds.size >= 2) {
jterry64 (Member):

Is this comment meant to be here still?

danscales (Author):

Removed it. Left over from a while ago.

val environmentVars = System.getenv().forEach {
case (key, value) => println(s"$key = $value")
// Print out environment variables (if needed for debugging)
if (false) {
jterry64 (Member):

is this meant to be here still?

danscales (Author):

Yes, I left the code in there (with 'if (false)') in case sometime later we want to print out environment variables for debugging purposes.

val treeCoverDensity2000: TreeCoverDensityPercent2000 = TreeCoverDensityPercent2000(gridTile, kwargs)
val grossEmissionsCo2eNonCo2: GrossEmissionsNonCo2Co2eBiomassSoil = GrossEmissionsNonCo2Co2eBiomassSoil(gridTile, kwargs = kwargs)
val grossEmissionsCo2eCo2Only: GrossEmissionsCo2OnlyCo2BiomassSoil = GrossEmissionsCo2OnlyCo2BiomassSoil(gridTile, kwargs = kwargs)
val mapspamCOCOYield: MapspamYield = MapspamYield("COCO", gridTile, kwargs = kwargs)
jterry64 (Member):

Does the whole list include multiple commodities, or is it just one commodity type per list? If the latter, it might simplify things to only pass one commodity in the tile, that's just parametrized by the commodity for that analysis.

danscales (Author):

Each location in the list can have a different commodity (or no commodity, but a yield specified). So, we need all the commodities as sources. But it is all lazy, so they are not opened or fetched for any particular window unless they are actually needed.
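The deferred-opening behavior described in this reply can be illustrated with plain Scala `lazy val`s. This is only a toy model: the actual layers are GeoTrellis raster sources whose tiles are fetched per window, not simple lazy fields, but the principle — declaring every commodity source up front costs nothing until one is actually read — is the same:

```scala
// Toy illustration of lazily-opened commodity layers. Names are
// assumptions; the real code constructs MapspamYield layers per commodity.
object LazySourcesSketch {
  class CommodityLayers {
    var opened: List[String] = Nil
    private def open(name: String): String = {
      opened = opened :+ name       // record that this source was touched
      s"source:$name"
    }

    // Declared for every commodity, but open() only runs on first access.
    lazy val coco: String = open("COCO")
    lazy val soyb: String = open("SOYB")
  }
}
```

Constructing `CommodityLayers` opens nothing; only reading `coco` or `soyb` triggers the corresponding open, mirroring why declaring all six commodity datasets per tile is cheap.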

val efList = for (i <- minLossYear to maxLossYear) yield {
val diff = i - umdTreeCoverLossYear
if (diff >= 0 && diff < 20) {
(i -> ((0.0975 - diff * 0.005) * emissionsCo2e) / (cropYield * totalArea))
jterry64 (Member):

nitpick since we use magic numbers elsewhere, but named constants would be good here

danscales (Author):

Done, replaced them with constants. Thanks for all the comments!

val gadmAdm1: Integer = raster.tile.gadmAdm1.getData(col, row)
val gadmAdm2: Integer = raster.tile.gadmAdm2.getData(col, row)
val gadmId: String = s"$gadmAdm0.$gadmAdm1.${gadmAdm2}_1"
//println(s"Empty ${featureId.commodity} default yield, checking gadm yield for $gadmId")
jterry64 (Member):

remove comment

danscales (Author):

Done.

val gadmAdm2: Integer = raster.tile.gadmAdm2.getData(col, row)
val gadmId: String = s"$gadmAdm0.$gadmAdm1.${gadmAdm2}_1"
//println(s"Empty ${featureId.commodity} default yield, checking gadm yield for $gadmId")
val backupArray = kwargs("backupYield").asInstanceOf[Broadcast[Array[Row]]].value
jterry64 (Member):

nitpick, I'd call this "backupYieldArray" consistently since "backup" can mean a lot of things

danscales (Author):

Good suggestion, done!


val groupKey = GHGRawDataGroup(umdTreeCoverLossYear, cropYield)

// if (umdTreeCoverLossYear > 0) {
jterry64 (Member):

remove comment

danscales (Author):

Done.

case feature =>
getSummaryForGeom(List(feature.data), feature.geom)
}
if (kwargs.get("includeFeatureId").isDefined) {
jterry64 (Member):

Maybe this would be messy, but if you wanted to keep the optimization and not split the logic, theoretically every tile should produce the same results per commodity unless the user specifies the yield manually. In which case, couldn't you just apply the yield constant to your summary result per feature after doing runPolygonalSummary?

danscales (Author):

This is analysis-independent code, so I would not want to put anything in here that is specific to yield/commodity/GHG, etc.

Since GHG specifies a commodity or yield, it must be done on specific farms with specific crops. So, it will not be called on very large areas (which would not all be one farm with one kind of crop). So, the full window optimization would never actually be applicable for GHG anyway.

So, I don't think it would be worth trying to get this optimization to work in a general way in the case the featureId is passed down, since it would never be used in the one case (GHG) that we have.
