This repository contains an analysis of the data in the viltkamera (or scandcam) portal. It analyses the timeseries by comparing the previous aggregation method against the validated predictions.
An API has also been created so that you can interact with the model.
The data analysis can be found in `notebooks/00_analysis`. The main findings are summarized below:
- 71% of the dataset has been validated as "nothing".
- The other most prevalent classes are "menneske", "raadyr", "rev" and "kjoeretoy".
- The longest timeseries with the greatest variety of species tend to produce the most incorrect predictions.
- The previous aggregation method is correct 88% of the time and wrong 12% of the time, which is a reasonable baseline.
- The classes it struggles with the most are "hjort", "raadyr", "menneske", "sau", "elg" and "ku".
- It tends to confuse "elg" with "raadyr", "raadyr" with "hjort", "hjort" with "villsvin", and "raadyr" with "villsvin".
Given this information, is there an algorithm that can produce a better aggregation from the timeseries? Yes, and it is explained in the next section.
To find a better aggregation method, a machine learning model has been created. Several models were explored: random forest, CatBoost, XGBoost, and a decision tree. The best model for this case was the random forest.
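The actual training code lives in the repository; as a rough illustration only (placeholder data and parameters, not the project's real setup), fitting a random forest classifier with scikit-learn looks like this:

```python
# Minimal sketch of fitting a random forest classifier; the feature
# matrix and labels here are synthetic placeholders, not project data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))            # placeholder feature matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

The same loop can be repeated for the other candidate models (CatBoost, XGBoost, a plain decision tree) to compare their scores on a held-out set.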
The model expects the following features; the order of this list is important:

```python
features = ['array_agg_length', 'set_of_species', 'ekorn', 'nothing', 'fugl',
            'grevling', 'ulv', 'villsvin', 'hjort', 'elg', 'raadyr', 'rev', 'maar',
            'katt', 'sau', 'rein', 'gaupe', 'hare', 'hest', 'hund', 'menneske',
            'kjoeretoey', 'motorsykkel', 'bjorn', 'ku',
            'most_consecutive_encoded', 'dominant_species_encoded',
            'diversity_category', 'unique_species_ratio', 'first_species_encoded',
            'last_species_encoded', 'transition_count']
```
Feature Name | Description |
---|---|
`array_agg_length` | Total number of elements in the `array_agg` timeseries. |
`set_of_species` | Number of unique species present in the `array_agg` timeseries. |
`ekorn`, `nothing`, `fugl`, etc. | Columns representing the counts of each species in the `array_agg` timeseries. Each column corresponds to one of the species in the dataset. |
`most_consecutive_encoded` | Encoded value of the species that appears most consecutively in the `array_agg` timeseries. |
`dominant_species_encoded` | Encoded value of the species with the highest total count in the timeseries. |
`diversity_category` | A categorical value derived from `set_of_species` to indicate species diversity (e.g., low, medium, high). |
`unique_species_ratio` | Ratio of unique species (`set_of_species`) to the total number of elements (`array_agg_length`). |
`first_species_encoded` | Encoded value of the first species in the `array_agg` timeseries. |
`last_species_encoded` | Encoded value of the last species in the `array_agg` timeseries. |
`transition_count` | Number of transitions between different species in the `array_agg` timeseries (e.g., from "fugl" to "nothing"). |
The order of these features is critical when passing data to the model, as the model expects them in this exact sequence.
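Several of these features can be derived directly from the raw timeseries. As an illustrative sketch only (the real implementation lives in `src/feature_engineering.py` and may differ in names and details), the count-style features could be computed like this:

```python
# Illustrative sketch of a few count-style features derived from a
# raw species timeseries; not the repository's actual implementation.
from collections import Counter

def basic_features(timeseries: list[str]) -> dict:
    counts = Counter(timeseries)
    return {
        "array_agg_length": len(timeseries),
        "set_of_species": len(counts),
        "unique_species_ratio": len(counts) / len(timeseries),
        # Number of adjacent pairs where the species changes.
        "transition_count": sum(a != b for a, b in zip(timeseries, timeseries[1:])),
        # Species with the highest total count (label-encoded downstream).
        "dominant_species": counts.most_common(1)[0][0],
    }

print(basic_features(["fugl", "nothing", "fugl", "fugl"]))
```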
The API uses the trained random forest model to predict outcomes based on timeseries data of species. It applies feature engineering to transform the raw input data into the format the model expects. The docs are available at http://localhost:8000/docs if you have the API running.
The code can be found in `src/`, and it consists of four Python files:
- `app.py`: The endpoints are defined here.
- `feature_engineering.py`: Defines and adds the same features that were used during training.
- `species_constants.py`: Contains various constants for features and feature names.
- `models.py`: Contains data types for the input and output of the API.
A `Dockerfile` has been created so the model can easily be run with the following commands:
- Build the Docker image:
```shell
docker build -t random-forest-api .
```
- Run the container:
```shell
docker run -p 8000:8000 -v $(pwd)/src:/app/src -e MODEL_PATH="/app/src/random_forest_model.pkl" random-forest-api
```
This mounts the `src` folder into the container and runs the API on port 8000; the model should also be located in the `src` folder.
There are two endpoints provided by this API: `health` and `predict`.
Endpoint: /health
Method: GET
Description: This endpoint checks the health status of the API. It is useful for ensuring that the API is running and accessible.
Response:
- 200 OK: Indicates the API is healthy and operational.
- 500 Internal Server Error: If an error occurs while checking the health status.
Example Request:
```shell
curl -X GET "http://localhost:8000/health"
```
Endpoint: /predict
Method: POST
Description: This endpoint accepts timeseries data as input and applies feature engineering to generate predictions using a pre-trained Random Forest model.
Request body:
The input should be a JSON object with the following key:
- `timeseries`: A list of species representing the timeseries data (e.g., `["fugl", "nothing", "fugl"]`).
Example Request:
To make a prediction for a given timeseries, you can use the following `curl` command:
```shell
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"timeseries": ["fugl", "nothing", "fugl", "nothing"]}'
```
Response:
- 200 OK: Returns the prediction as a JSON object containing the encoded class and its label.
- 500 Internal Server Error: Returned if an error occurs during model prediction.
Example Response:
```json
{
  "prediction": 23,
  "label": "nothing"
}
```
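The same request can also be made programmatically. This is a small client sketch using only the Python standard library; it assumes the API is running locally on port 8000, so the actual call is left commented out:

```python
# Sketch of calling the /predict endpoint from Python (stdlib only).
# Assumes the API is running at http://localhost:8000.
import json
from urllib import request

payload = {"timeseries": ["fugl", "nothing", "fugl", "nothing"]}
req = request.Request(
    "http://localhost:8000/predict",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Uncomment when the API is running:
# with request.urlopen(req) as resp:
#     result = json.load(resp)  # e.g. {"prediction": 23, "label": "nothing"}
#     print(result["label"])
```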