---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
error = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
<!-- badges: start -->
<!-- badges: end -->
# mscomposer - An R package for creating and analysing multi-spectral image composites
## Authors
Larissa Torney<sup>1</sup>, Michael Blaschek<sup>2</sup>, Darius Görgen<sup>1</sup>
<sup>1</sup>Federal Institute for Geosciences and Natural Resources, Stilleweg 2, 30655 Hannover, Germany
<sup>2</sup>State Authority for Geology, Resources and Mining, Albertstraße 5, 79104 Freiburg, Germany
## About
The `{mscomposer}` package was created by BGR with a single use-case in mind: creating
a composite image of multi-spectral satellite imagery over a given area of interest.
This multi-spectral composite then serves as input to a machine learning
model which is trained on reference points collected on the ground. The package consists
of only three functions, which streamline this process. While its scope
is thus rather limited, we have put substantial effort into allowing you to adapt the
functions to your specific analysis. More specifically, this means that you can use
the package with almost any satellite data set and supply custom spectral indices and
thresholds for composite creation. For the model training process you can choose
between several cross-validation routines. However, it is advised to further analyse
the model validity for your study area, for instance via the [CAST](https://hannameyer.github.io/CAST/) package.
This package was created at the BGR - Federal Institute for Geosciences and Natural Resources - as part of the BopaBW project ([„Near-surface soil parameters Baden-Wuerttemberg“](https://www.bgr.bund.de/EN/Themen/GG_Fernerkundung/Projekte/laufend/Multispektral/Boden-Stoffgehalte/BOPA_Baden_Wuerttemberg_en.html?nn=1551022)).
## How to cite
Please cite as follows:
Torney, L., Blaschek, M., & Görgen, D. (2023). mscomposer - An R package for creating and analysing multi-spectral image composites (Version 0.0.1). https://doi.org/10.25928/5yx7-x825
## Installation
You can install `{mscomposer}` directly from GitHub with:
```{r, install, eval = FALSE}
remotes::install_github("BGR-Geowissenschaften-Rohstoffe/mscomposer")
```
## Usage
We will showcase the general workflow of `{mscomposer}` based on a small Sentinel-2
image collection and a simulated reference dataset. Note that the provided
routines rely heavily on functionalities provided by `{gdalcubes}`, `{sf}`,
`{terra}`, and `{caret}` under the hood. For more control over the processing,
experienced users of these packages might consider using them directly to create
their own custom workflow.
## Big-Data processing
If your study area covers a large spatio-temporal extent, you might
consider parallel processing for the composite creation. In that case, use
`gdalcubes_options()` to enable parallel processing before calling
`ms_compose()`, as sketched below. This way you can use the cores available on your machine to speed
up the composite generation. Also, if the original image collection
is quite large and you wish to re-run your analysis with different settings, make
sure to specify the `col` argument of `ms_compose()` so that the image collection
is created only once.
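A minimal sketch, assuming a recent `{gdalcubes}` version (older releases use a `threads` argument instead of `parallel`; the number of workers is an arbitrary example):
```{r, parallel, eval = FALSE}
# enable parallel processing in gdalcubes before creating the composite;
# adjust the number of worker processes to your machine
gdalcubes_options(parallel = 4)
```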
## Sentinel-2 Use Case
Within the `{mscomposer}` package we included a small subset of two Sentinel-2
tiles for the year 2018 over an agricultural area in Baden-Württemberg. In this
use-case, we are interested in retrieving a median composite of the visual channels
containing only those pixels for which we assume that bare soil is visible. We are going
to use a simple heuristic: pixels with an NDVI value below 0.3 are taken
to represent bare soil.
The data set was originally downloaded from the Copernicus Hub which provides
zip-files. In order for the data to fit into this package, we included only a
very small area between the overlapping tiles T32TNT and T32TMT and flattened
the directory structure so that we only keep those band files we need in the
scope of this tutorial. That also requires that we write our own collection
format. See this [link](https://github.com/gdalcubes/collection_formats) for
pre-defined formats that come with `{gdalcubes}`.
### True-Color Composite
In the first step, we will create a true-color composite over the entire year in
order to get a first impression of the area of interest. For that, we first have
to define a cube view, i.e. a definition of the spatio-temporal extent of our
output data set. We provide a routine called `ms_view()` that allows you to
specify an sf object to automatically derive the bounding box of your study area.
You can specify any valid CRS, but we also allow you to derive a
[LAEA](https://proj.org/operations/projections/laea.html) or
[UTM](https://proj.org/operations/projections/utm.html)
projection automatically from your AOI.
First, we are going to load the libraries required for the processing.
```{r setup}
library(sf)
library(caret)
library(terra)
library(gdalcubes)
library(mscomposer)
```
In the present case, we set the
temporal resolution of our cube view to `"P3M"`, meaning that the initial calculation
will be based on a temporal resolution of three months. Since we also specify a
final reducer, the actual output raster will represent a single layer per band
based on the median of the three-month composites. The spatial resolution of the
visible Sentinel-2 channels is set to 10 meters. We also specify methods
understood by GDAL to aggregate pixels falling into the same time-step as well as
a method to resample grid cells to the new output reference grid.
We then provide the input files, format, and view to the `compose()` function.
We also specify an image mask which corresponds to bad-quality pixel classes in the Sentinel-2
[Scene Classification Layer](https://sentinels.copernicus.eu/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm).
Since we are only interested in the color composite for now, we only include the
three visible channels in the composite generation. Finally, we declare a method
for the final reduction of the temporal dimension and declare the output filename
and directory.
```{r s2-comp-rgb}
outdir <- file.path(tempdir(), "mscomposer")
dir.create(outdir)
collection_path <- file.path(outdir, "s2_collection.sqlite")
# list the Sentinel-2 band files shipped with the package
s2_files <- list.files("inst/extdata/Sentinel2/", pattern = ".jp2$", full.names = TRUE, recursive = TRUE)
format <- system.file("Sentinel2_L2A_flat.json", package = "mscomposer")
head(basename(s2_files), 5)
aoi <- st_read(system.file("aoi_S2.gpkg", package = "mscomposer"))
# derive the cube view (spatio-temporal extent and resolution) from the AOI
cv <- ms_view(
aoi = aoi,
srs = "utm",
dt = "P3M",
dx = 10,
dy = 10,
start = "2018-01-01",
end = "2018-12-31",
agg = "min",
rsmp = "bilinear")
# create the true-color composite from the three visible bands
s2_rgb <- ms_compose(
s2_files,
format = format,
view = cv,
col = collection_path,
mask = image_mask("SCL", values=c(1,3,6,8,9,10,11)),
bands = c("B02", "B03", "B04"),
reducers = "median",
prefix = "rgb",
outdir = outdir
)
(s2_rgb <- rast(s2_rgb[1]))
plotRGB(s2_rgb, 3,2,1, stretch = "lin")
```
We can see that the dominant land use type in this area is agriculture. We also
observe some forest areas as well as a village with built-up areas. In the next
code example, we show how to calculate the NDVI and filter valid pixels based on
an NDVI threshold. This way, we wish to include only those pixels in the composite
calculation where we assume that the satellite saw bare soil.
### Bare Soil Composite
We supply the formula of the NDVI and also make sure
that we include all bands necessary for its calculation in the processing.
Additionally, we supply a filtering statement. You can supply your own spectral
indices formula and filters. The only restriction is that the bands you are using
in your index calculation are also present in the data set.
For the NDVI calculation, we need the Near-Infrared channel, which corresponds to
band 8 in the case of Sentinel-2. We include this band in the processing of the
composite, thus we can refer to it in the definition of our spectral index. We
also include a filter based on our index, meaning that only those pixels will be
used for the computation where the expression evaluates to `TRUE`.
```{r s2-comp-soil}
s2_filt <- ms_compose(
s2_files,
format = format,
view = cv,
col = collection_path,
mask = image_mask("SCL", values=c(1,3,6,8,9,10,11)),
bands = c("B02", "B03", "B04", "B08"),
indices = list(NDVI = "(B08-B04)/(B08+B04)"),
filters = list(NDVI = "NDVI < 0.3"),
reducers = "median",
prefix = "soil",
outdir = outdir
)
(s2_filt <- rast(s2_filt[1]))
plotRGB(s2_filt, 3,2,1, stretch = "lin")
```
In the result, we see that we mainly retained valid pixels for agricultural
fields. There are also pixels retained for a larger street, which is
characterized by low NDVI values during the entire year. We also see some remaining
artifacts, potentially originating from clouds or snow cover, which we should
take care of in a real-world analysis. While not perfect,
the above composite can well serve as input to further analysis, e.g. by
supplying a vector data set of field boundaries to only include pixels which
actually lie on fields (see the sketch below). Also, several spectral indices and filters might be
combined to suit the needs of the analysis you are conducting.
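A minimal sketch of such a post-processing step, assuming a hypothetical field-boundary file `field_boundaries.gpkg` (not shipped with the package) in the same CRS as the composite:
```{r, mask-fields, eval = FALSE}
# hypothetical field-boundary polygons; replace with your own dataset
fields <- terra::vect("field_boundaries.gpkg")
# keep only those composite pixels that fall within field polygons
s2_fields <- terra::mask(s2_filt, fields)
plotRGB(s2_fields, 3, 2, 1, stretch = "lin")
```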
### Model Training and Prediction
Continuing with our example, we now load a simulated reference data set that
we are using to train a Random Forest model.
```{r model-setup}
(reference <- system.file("corg-simulated-sample.gpkg",
package = "mscomposer") |>
read_sf())
plot(s2_filt[[1]])
plot(reference, add = TRUE, pch = 3, col = "red")
```
To train a model and use it for spatial prediction, we provide the `ms_predict()`
function. It expects an sf object containing only POINT geometries as input as
well as the specification of a column containing numeric values as the target
variable. Additionally, a multi-layer SpatRaster object is required to extract
the predictor variables at the reference locations.
Then a model is trained based on the outcome variable and the extracted predictors.
You can choose any machine learning model supported by `{caret}` (see [here](https://topepo.github.io/caret/available-models.html)),
but note that you might need to install additional dependencies to use them.
Also, you can choose between simple cross-validation patterns provided via
`trainControl()`'s `method` argument. Here we are going to train a Random Forest
model based on Leave-One-Observation-Out Cross-Validation.
```{r train}
results <- ms_predict(
reference,
colname = "corg",
predictors = s2_filt,
model_method = "rf",
cv_method = "LOOCV"
)
```
As you can see from the above warnings and messages, the function takes
care of reprojecting the sf object in case its CRS differs from that of the
predictor raster. In the above example, there are also some reference locations
for which no valid pixel values are available. Those locations are excluded
before model training. The output of the function is a list with several objects.
We can look at the structure of the list like this:
```{r results-str}
str(results, 1)
```
The first object is the predicted raster dataset while the trained model is the
second object. We also find some accuracy metrics in the third object while the
last object contains those reference locations for which valid pixels could be
extracted.
Let's have a look at the prediction raster. We see that we obtain predicted
values for all those raster cells that did not contain NA values.
```{r vis-prediction}
plot(results$prediction)
```
Next, we can inspect the model and the associated accuracies.
```{r model-accuracy}
print(results$model)
print(results$metrics)
```
We see that the cross-validation results indicated an R-squared value of about
0.41 while for the complete training data set we obtain an R-squared value of
about 0.90. Note that the package currently does not provide means to compare
the model against a validation dataset. However, since the trained model is part of the output,
you can validate it against a left-out dataset yourself, as sketched below.
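A minimal sketch of such an external validation. The hold-out split, its 20 % fraction, and the object names are hypothetical additions, and we assume that `ms_predict()` uses the predictor raster's layer names as predictor names, so that `predict()` on the extracted values works:
```{r, external-validation, eval = FALSE}
# hypothetical hold-out split: keep ~20 % of the reference points aside and
# train the model only on the remaining points
set.seed(42)
idx <- sample(nrow(reference), size = round(0.2 * nrow(reference)))
holdout <- reference[idx, ]
results_ho <- ms_predict(
  reference[-idx, ],
  colname = "corg",
  predictors = s2_filt,
  model_method = "rf",
  cv_method = "LOOCV"
)
# reproject the hold-out points to the raster's CRS and extract the
# predictor values at their locations
holdout_pts <- terra::project(terra::vect(holdout), terra::crs(s2_filt))
holdout_vals <- terra::extract(s2_filt, holdout_pts, ID = FALSE)
ok <- complete.cases(holdout_vals)
# predict with the trained caret model and compare predictions against
# the observed target values at the hold-out locations
preds <- predict(results_ho$model, newdata = holdout_vals[ok, ])
caret::postResample(pred = preds, obs = holdout$corg[ok])
```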
The final output is an sf object containing the reference locations with valid pixels in the predictor raster.
It includes the extracted values per layer as well as the reference variable and
the predictions (called `.obs` and `.preds`, respectively).
```{r extracted-values}
results$data
plot(results$data$.obs, results$data$.preds)
```