NINAnor/scandcam-data-analysis

📊 scandcam-analysis

This repository contains an analysis of the data in the viltkamera (or scandcam) portal. It compares the timeseries produced by the previous aggregation method against the validated predictions.

An API has also been created so that it is possible to interact with the model.

📈 About the analysis

The data analysis can be found in notebooks/00_analysis. The main findings are summarized below:

  • 71% of the dataset has been validated as "nothing".
  • The other most prevalent classes are "menneske", "raadyr", "rev" and "kjoeretoy".
  • The longest timeseries with the largest variation of species seem to yield the highest number of incorrect predictions.
  • The previous aggregation method is correct 88% of the time and wrong 12% of the time, which leaves room for improvement.
  • The classes it struggles with the most are "hjort", "raadyr", "menneske", "sau", "elg" and "ku".
  • It seems to confuse "elg" and "raadyr", "raadyr" and "hjort", "hjort" and "villsvin", and "raadyr" and "villsvin".

Given this information, is there an algorithm that can produce a better aggregation from the timeseries? Yes, and it is explained in the next section.

🤖 About the machine learning model

In order to find a better aggregation method, several machine learning models have been explored: random forest, CatBoost, XGBoost, and a decision tree. The random forest model performed best for this case.
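As a rough illustration of the kind of model comparison described above, the sketch below trains a random forest and a decision tree on synthetic data and compares their accuracy. This uses scikit-learn only (CatBoost and XGBoost are omitted), with made-up data; the repository's actual training code may look quite different.

```python
# Illustrative sketch of comparing candidate models; data is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered timeseries features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (RandomForestClassifier(random_state=0),
              DecisionTreeClassifier(random_state=0)):
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{model.__class__.__name__}: {acc:.2f}")
```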

The model expects the following features; the order of this list is important:

```python
features = ['array_agg_length', 'set_of_species', 'ekorn', 'nothing', 'fugl',
    'grevling', 'ulv', 'villsvin', 'hjort', 'elg', 'raadyr', 'rev', 'maar',
    'katt', 'sau', 'rein', 'gaupe', 'hare', 'hest', 'hund', 'menneske',
    'kjoeretoey', 'motorsykkel', 'bjorn', 'ku',
    'most_consecutive_encoded', 'dominant_species_encoded',
    'diversity_category', 'unique_species_ratio', 'first_species_encoded',
    'last_species_encoded', 'transition_count']
```
| Feature Name | Description |
| --- | --- |
| `array_agg_length` | Total number of elements in the `array_agg` timeseries. |
| `set_of_species` | Number of unique species present in the `array_agg` timeseries. |
| `ekorn`, `nothing`, `fugl`, etc. | Counts of each species in the `array_agg` timeseries; one column per species in the dataset. |
| `most_consecutive_encoded` | Encoded value of the species that appears most consecutively in the `array_agg` timeseries. |
| `dominant_species_encoded` | Encoded value of the species with the highest total count in the timeseries. |
| `diversity_category` | A categorical value derived from `set_of_species` to indicate species diversity (e.g., low, medium, high). |
| `unique_species_ratio` | Ratio of unique species (`set_of_species`) to the total number of elements (`array_agg_length`). |
| `first_species_encoded` | Encoded value of the first species in the `array_agg` timeseries. |
| `last_species_encoded` | Encoded value of the last species in the `array_agg` timeseries. |
| `transition_count` | Number of transitions between different species in the `array_agg` timeseries (e.g., from "fugl" to "nothing"). |
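To make the table above concrete, here is a minimal sketch of how several of these features could be derived from a raw timeseries of species labels. The function name and the unencoded string outputs are illustrative assumptions, not the repository's actual implementation (which lives in feature_engineering.py and additionally encodes the categorical values).

```python
# Illustrative feature extraction from a species timeseries.
from collections import Counter
from itertools import groupby

def extract_basic_features(timeseries):
    counts = Counter(timeseries)
    # Species with the longest unbroken run of consecutive occurrences.
    most_consecutive = max(
        ((species, len(list(run))) for species, run in groupby(timeseries)),
        key=lambda pair: pair[1],
    )[0]
    # Count positions where the species changes from one element to the next.
    transitions = sum(a != b for a, b in zip(timeseries, timeseries[1:]))
    return {
        "array_agg_length": len(timeseries),
        "set_of_species": len(counts),
        "dominant_species": counts.most_common(1)[0][0],
        "most_consecutive": most_consecutive,
        "unique_species_ratio": len(counts) / len(timeseries),
        "first_species": timeseries[0],
        "last_species": timeseries[-1],
        "transition_count": transitions,
    }

feats = extract_basic_features(["fugl", "nothing", "fugl", "fugl"])
```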

🔄 Order importance

The order of these features is critical when passing data to the model, as the model expects them in this exact sequence.
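One way to guard against ordering mistakes is to build the input vector from a dictionary keyed by feature name, so the order is fixed in one place. This is a hedged sketch, not the repository's code; `FEATURE_ORDER` is truncated here for brevity and would in practice be the full `features` list above.

```python
# Enforce the expected feature order before passing data to the model.
FEATURE_ORDER = ["array_agg_length", "set_of_species", "transition_count"]

def to_ordered_vector(feature_dict):
    # Fail loudly if any expected feature is missing.
    missing = [name for name in FEATURE_ORDER if name not in feature_dict]
    if missing:
        raise ValueError(f"missing features: {missing}")
    # Values come out in FEATURE_ORDER regardless of the dict's order.
    return [feature_dict[name] for name in FEATURE_ORDER]

vector = to_ordered_vector(
    {"transition_count": 2, "array_agg_length": 4, "set_of_species": 2}
)
```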

🌐 API

The API uses the trained Random Forest model to predict outcomes based on timeseries data of species. It applies feature engineering to transform the raw input data into a format suitable for model inference. The docs are available at http://localhost:8000/docs if you have the API running.

The code can be found in src/, and it consists of four Python files:

  1. app.py: Defines the API endpoints.
  2. feature_engineering.py: Defines and adds the same features that were used during training.
  3. species_constants.py: Contains various constants for features and feature names.
  4. models.py: Contains the data types for the input and output of the API.

🛠️ Development

A Dockerfile has been created so the model can easily be run with the following commands:

  1. Build the Docker image:

```shell
docker build -t random-forest-api .
```

  2. Run the container:

```shell
docker run -p 8000:8000 -v $(pwd)/src:/app/src -e MODEL_PATH="/app/src/random_forest_model.pkl" random-forest-api
```

This mounts the src folder into the container and runs the API on port 8000; the model file should also be located in the src folder.

📍 Endpoints

There are two endpoints provided by this API: health and predict.

🩺 Health Check Endpoint

Endpoint: /health

Method: GET

Description: This endpoint checks the health status of the API. It is useful for ensuring that the API is running and accessible.

Response:

  • 200 OK: Indicates the API is healthy and operational.
  • 500 Internal Server Error: If an error occurs while checking the health status.

Example Request:

```shell
curl -X GET "http://localhost:8000/health"
```

🔮 Prediction Endpoint

Endpoint: /predict

Method: POST

Description: This endpoint accepts timeseries data as input and applies feature engineering to generate predictions using a pre-trained Random Forest model.

Request body:

The input should be a JSON object with the following key:

  • timeseries: A list of species representing the timeseries data (e.g., ["fugl", "nothing", "fugl"]).

Example Request:

To make a prediction for a given timeseries, you can use the following curl command:

```shell
curl -X POST "http://localhost:8000/predict" \
-H "Content-Type: application/json" \
-d '{"timeseries": ["fugl", "nothing", "fugl", "nothing"]}'
```

Response:

  • 200 OK: Returns the prediction and its decoded label.
  • 500 Internal Server Error: Returned if an error occurs during model prediction.

Example Response:

```json
{
  "prediction": 23,
  "label": "nothing"
}
```
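The same request can be made from Python. The sketch below uses only the standard library; the `build_payload` and `predict` helpers are illustrative names, and `predict` assumes the API is running on localhost:8000 as described above.

```python
# Minimal Python client sketch for the /predict endpoint.
import json
from urllib import request

def build_payload(timeseries):
    # The API expects a JSON object with a single "timeseries" key.
    return json.dumps({"timeseries": timeseries}).encode("utf-8")

def predict(timeseries, url="http://localhost:8000/predict"):
    req = request.Request(
        url,
        data=build_payload(timeseries),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        # Expected shape: {"prediction": <int>, "label": <str>}
        return json.loads(resp.read())

payload = build_payload(["fugl", "nothing", "fugl", "nothing"])
```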
