This project, developed during a mentorship at Alforriah, uses data science techniques to monitor cyanobacteria in Lake Guaíba, where blooms have significant implications for both health and the economy. The model is deployed on Streamlit.
Cyanobacteria, commonly known as "blue-green algae," can proliferate excessively in reservoirs and water bodies, especially those with a stagnant water regime, in a phenomenon known as algal blooms. These events can result in negative economic and health impacts [1]. With global warming, an increase in the frequency and intensity of these events is expected [2]. The increase in water quality monitoring data has the potential to assist the public and decision-makers in understanding the state of water resources in the face of this problem.
The goal of this project is to extract a historical series of cyanobacteria density from surface water resources through the analysis of satellite images. The chosen study area was Lake Guaíba, a vital water source for Porto Alegre, Rio Grande do Sul.
The system extracts Sentinel-2A data from Google Earth Engine and calculates the NDVI (Normalized Difference Vegetation Index) and NDCI (Normalized Difference Chlorophyll Index) for Lake Guaíba. These indices serve as predictors in a regression model whose target is the cyanobacteria density measured in water quality monitoring conducted by the health sector (SISAGUA). Specifically, the analysis was performed at one of the city's intake points (near the coordinates -30.012175, -51.215679). This methodology has been adopted in some studies [3, 4, 5], especially for chlorophyll-a monitoring.
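As a rough illustration of the two predictors, the sketch below computes NDVI and NDCI from per-pixel reflectances. It assumes the band assignments commonly used with Sentinel-2 (B4 red, B5 red edge, B8 near infrared); the project's own extraction code may differ, and the function names and reflectance values here are hypothetical.

```python
def normalized_difference(band_a: float, band_b: float) -> float:
    """Return (band_a - band_b) / (band_a + band_b), or NaN if the sum is zero."""
    total = band_a + band_b
    if total == 0:
        return float("nan")
    return (band_a - band_b) / total


def spectral_indices(b4_red: float, b5_red_edge: float, b8_nir: float) -> dict:
    """Compute NDVI and NDCI from surface reflectances.

    Assumed band mapping (not confirmed by this project):
      NDVI = (B8 - B4) / (B8 + B4)
      NDCI = (B5 - B4) / (B5 + B4)
    """
    return {
        "ndvi": normalized_difference(b8_nir, b4_red),
        "ndci": normalized_difference(b5_red_edge, b4_red),
    }


# Hypothetical reflectance values for a single water pixel.
indices = spectral_indices(b4_red=0.04, b5_red_edge=0.06, b8_nir=0.02)
print(indices)
```

Both indices range from -1 to 1. The resulting values would then be paired with the SISAGUA density measurements to form the regression dataset.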
Data updates occur weekly for Sentinel-2A imagery and monthly for water quality monitoring. Weekly batch inference is performed to estimate cyanobacteria density at the specified monitoring point.
- Authenticate with your Google and AWS accounts
- Extract data from SISAGUA using the Glue Job at
src/glue_jobs/vigi_to_s3.py
- Extract data from Google Earth Engine
python3 src/data/make_s2a_dataset.py
- Create a labeled dataset
python3 src/stages/data_label.py --config="params.yaml"
- Feature engineering
python3 src/stages/feat_eng.py --config="params.yaml"
- Training and evaluation
python3 src/stages/train.py --config="params.yaml"
- Training the selected model with the full dataset
python3 src/stages/train_full_data.py --config="params.yaml"
- Predicting new data
python3 src/stages/train_full_data.py --config="params.yaml"
- Extract data from SISAGUA using the Glue Job at
src/glue_jobs/vigi_to_s3.py
- Extract data from Google Earth Engine (via crontab)
python3 src/data/make_s2a_dataset.py
- Inference
- Locally:
python3 src/data/make_s2a_dataset.py
- Cloud: running Glue Job at
src/glue_jobs/predict_cyano.py
- Deploy it on Streamlit
streamlit run app.py
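The labeling, feature engineering, and training stages listed above can also be chained from a single script. The following is a minimal sketch, assuming it is run from the repository root and that each stage accepts the same `--config` argument shown in the steps; `build_commands` and `run_pipeline` are hypothetical helper names, not part of the project.

```python
import subprocess
import sys

# Stage scripts, in the order they appear in the steps above.
TRAINING_STAGES = [
    "src/stages/data_label.py",
    "src/stages/feat_eng.py",
    "src/stages/train.py",
    "src/stages/train_full_data.py",
]


def build_commands(config="params.yaml", python=sys.executable):
    """Return one argument list per stage, ready for subprocess.run."""
    return [[python, stage, f"--config={config}"] for stage in TRAINING_STAGES]


def run_pipeline(config="params.yaml"):
    """Run each stage in order, stopping at the first failure."""
    for cmd in build_commands(config):
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError if a stage exits with an error.
        subprocess.run(cmd, check=True)


# Call run_pipeline() from the repository root to execute all stages.
```

Stopping at the first failing stage avoids training on a stale or partially built dataset.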
[1] CETESB. Manual de cianobactérias planctônicas : legislação, orientações para o monitoramento e aspectos ambientais. 2013. https://cetesb.sp.gov.br/laboratorios/wp-content/uploads/sites/24/2015/01/manual-cianobacterias-2013.pdf
[2] Huisman, Jef; Codd, Geoffrey A.; Paerl, Hans W.; Ibelings, Bas W.; Verspagen, Jolanda M. H.; Visser, Petra M. Cyanobacterial blooms. Nature Reviews Microbiology. 2018. https://www.nature.com/articles/s41579-018-0040-1
[3] Zhao, H. et al. Monitoring Cyanobacteria Bloom in Dianchi Lake Based on Ground-Based Multispectral Remote-Sensing Imaging: Preliminary Results. Remote Sensing. 2021. https://www.mdpi.com/2072-4292/13/19/3970
[4] Lobo, F.d.L.; Nagel, G.W.; Maciel, D.A.; Carvalho, L.A.S.d.; Martins, V.S.; Barbosa, C.C.F.; Novo, E.M.L.d.M. AlgaeMAp: Algae Bloom Monitoring Application for Inland Waters in Latin America. Remote Sens. 2021, 13, 2874. https://doi.org/10.3390/rs13152874
[5] Ventura, D. et al. Long-Term Series of Chlorophyll-a Concentration in Brazilian Semiarid Lakes from Modis Imagery. Water. 2022. https://www.mdpi.com/2073-4441/14/3/400