NINAnor/scandcam-data-analysis

📊 scandcam-analysis

This repository contains an analysis of the data in the viltkamera (or scandcam) portal. It compares the timeseries produced by the previous aggregation method against the validated predictions.

An API has also been created so that it is possible to interact with the model.

📈 About the analysis

The data analysis can be found in notebooks/00_analysis. The main findings are summarized below:

  • 71% of the dataset has been validated as "nothing".
  • The other most prevalent classes are "menneske", "raadyr", "rev" and "kjoeretoy".
  • The longest timeseries with the largest variation of species seem to yield the highest number of incorrect predictions.
  • The previous aggregation method is correct 88% of the time and wrong 12% of the time, which leaves room for improvement.
  • The classes it struggles with the most are "hjort", "raadyr", "menneske", "sau", "elg" and "ku".
  • It seems to confuse "elg" and "raadyr", "raadyr" and "hjort", "hjort" and "villsvin", and "raadyr" and "villsvin".

Given this information, is there an algorithm that can produce a better aggregation from the timeseries? Yes, and it is explained in the next section.

🤖 About the machine learning model

In order to find a better aggregation method, several machine learning models have been explored: random forest, CatBoost, XGBoost, and a decision tree. The random forest model performed best for this case.
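As a rough illustration of the kind of model comparison described above, the sketch below trains a random forest and a decision tree on synthetic data and compares their accuracy. This uses scikit-learn only (CatBoost and XGBoost are omitted), with made-up data; the repository's actual training code may look quite different.

```python
# Illustrative sketch of comparing candidate models; data is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered timeseries features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (RandomForestClassifier(random_state=0),
              DecisionTreeClassifier(random_state=0)):
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{model.__class__.__name__}: {acc:.2f}")
```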

The model expects the following features; the order of this list is important:

```python
features = ['array_agg_length', 'set_of_species', 'ekorn', 'nothing', 'fugl',
    'grevling', 'ulv', 'villsvin', 'hjort', 'elg', 'raadyr', 'rev', 'maar',
    'katt', 'sau', 'rein', 'gaupe', 'hare', 'hest', 'hund', 'menneske',
    'kjoeretoey', 'motorsykkel', 'bjorn', 'ku',
    'most_consecutive_encoded', 'dominant_species_encoded',
    'diversity_category', 'unique_species_ratio', 'first_species_encoded',
    'last_species_encoded', 'transition_count']
```
| Feature Name | Description |
| --- | --- |
| `array_agg_length` | Total number of elements in the `array_agg` timeseries. |
| `set_of_species` | Number of unique species present in the `array_agg` timeseries. |
| `ekorn`, `nothing`, `fugl`, etc. | Counts of each species in the `array_agg` timeseries; one column per species in the dataset. |
| `most_consecutive_encoded` | Encoded value of the species that appears most consecutively in the `array_agg` timeseries. |
| `dominant_species_encoded` | Encoded value of the species with the highest total count in the timeseries. |
| `diversity_category` | A categorical value derived from `set_of_species` to indicate species diversity (e.g., low, medium, high). |
| `unique_species_ratio` | Ratio of unique species (`set_of_species`) to the total number of elements (`array_agg_length`). |
| `first_species_encoded` | Encoded value of the first species in the `array_agg` timeseries. |
| `last_species_encoded` | Encoded value of the last species in the `array_agg` timeseries. |
| `transition_count` | Number of transitions between different species in the `array_agg` timeseries (e.g., from "fugl" to "nothing"). |
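To make the table above concrete, here is a minimal sketch of how several of these features could be derived from a raw timeseries of species labels. The function name and the unencoded string outputs are illustrative assumptions, not the repository's actual implementation (which lives in feature_engineering.py and additionally encodes the categorical values).

```python
# Illustrative feature extraction from a species timeseries.
from collections import Counter
from itertools import groupby

def extract_basic_features(timeseries):
    counts = Counter(timeseries)
    # Species with the longest unbroken run of consecutive occurrences.
    most_consecutive = max(
        ((species, len(list(run))) for species, run in groupby(timeseries)),
        key=lambda pair: pair[1],
    )[0]
    # Count positions where the species changes from one element to the next.
    transitions = sum(a != b for a, b in zip(timeseries, timeseries[1:]))
    return {
        "array_agg_length": len(timeseries),
        "set_of_species": len(counts),
        "dominant_species": counts.most_common(1)[0][0],
        "most_consecutive": most_consecutive,
        "unique_species_ratio": len(counts) / len(timeseries),
        "first_species": timeseries[0],
        "last_species": timeseries[-1],
        "transition_count": transitions,
    }

feats = extract_basic_features(["fugl", "nothing", "fugl", "fugl"])
```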

🔄 Order importance

The order of these features is critical when passing data to the model, as the model expects them in this exact sequence.
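One way to guard against ordering mistakes is to build the input vector from a dictionary keyed by feature name, so the order is fixed in one place. This is a hedged sketch, not the repository's code; `FEATURE_ORDER` is truncated here for brevity and would in practice be the full `features` list above.

```python
# Enforce the expected feature order before passing data to the model.
FEATURE_ORDER = ["array_agg_length", "set_of_species", "transition_count"]

def to_ordered_vector(feature_dict):
    # Fail loudly if any expected feature is missing.
    missing = [name for name in FEATURE_ORDER if name not in feature_dict]
    if missing:
        raise ValueError(f"missing features: {missing}")
    # Values come out in FEATURE_ORDER regardless of the dict's order.
    return [feature_dict[name] for name in FEATURE_ORDER]

vector = to_ordered_vector(
    {"transition_count": 2, "array_agg_length": 4, "set_of_species": 2}
)
```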

🌐 API

The API uses the trained Random Forest model to predict outcomes based on timeseries data of species. It applies feature engineering to transform the raw input data into a format suitable for model inference. The docs are available at http://localhost:8000/docs if you have the API running.

The code can be found in src/, and it consists of four Python files:

  1. app.py: Defines the API endpoints.
  2. feature_engineering.py: Defines and adds the same features that were used during training.
  3. species_constants.py: Contains various constants for features and feature names.
  4. models.py: Contains the data types for the input and output of the API.

🛠️ Development

A Dockerfile has been created so the model can easily be run with the following commands:

  1. Build the Docker image:

```shell
docker build -t random-forest-api .
```

  2. Run the container:

```shell
docker run -p 8000:8000 -v $(pwd)/src:/app/src -e MODEL_PATH="/app/src/random_forest_model.pkl" random-forest-api
```

This mounts the src folder into the container and runs the API on port 8000; the model file should also be located in the src folder.

📍 Endpoints

There are two endpoints provided by this API: health and predict.

🩺 Health Check Endpoint

Endpoint: /health

Method: GET

Description: This endpoint checks the health status of the API. It is useful for ensuring that the API is running and accessible.

Response:

  • 200 OK: Indicates the API is healthy and operational.
  • 500 Internal Server Error: If an error occurs while checking the health status.

Example Request:

```shell
curl -X GET "http://localhost:8000/health"
```

🔮 Prediction Endpoint

Endpoint: /predict

Method: POST

Description: This endpoint accepts timeseries data as input and applies feature engineering to generate predictions using a pre-trained Random Forest model.

Request body:

The input should be a JSON object with the following key:

  • timeseries: A list of species representing the timeseries data (e.g., ["fugl", "nothing", "fugl"]).

Example Request:

To make a prediction for a given timeseries, you can use the following curl command:

```shell
curl -X POST "http://localhost:8000/predict" \
-H "Content-Type: application/json" \
-d '{"timeseries": ["fugl", "nothing", "fugl", "nothing"]}'
```

Response:

  • 200 OK: Returns the prediction and its decoded label.
  • 500 Internal Server Error: Returned if an error occurs during model prediction.

Example Response:

```json
{
  "prediction": 23,
  "label": "nothing"
}
```
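The same request can be made from Python. The sketch below uses only the standard library; the `build_payload` and `predict` helpers are illustrative names, and `predict` assumes the API is running on localhost:8000 as described above.

```python
# Minimal Python client sketch for the /predict endpoint.
import json
from urllib import request

def build_payload(timeseries):
    # The API expects a JSON object with a single "timeseries" key.
    return json.dumps({"timeseries": timeseries}).encode("utf-8")

def predict(timeseries, url="http://localhost:8000/predict"):
    req = request.Request(
        url,
        data=build_payload(timeseries),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        # Expected shape: {"prediction": <int>, "label": <str>}
        return json.loads(resp.read())

payload = build_payload(["fugl", "nothing", "fugl", "nothing"])
```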
