University of Buenos Aires
Faculty of Exact and Natural Sciences
Master in Data Mining and Knowledge Discovery
This study compares different approaches to recommendation based on collaborative and hybrid filtering (i.e., a combination of collaborative and content-based filters), explaining the advantages and disadvantages of each approach, as well as the architecture and operation of each proposed model. For hybrid models, experiments were conducted with ensembles of different types, including LLMs (large language models), content-based models, and collaborative filtering-based models. The MovieLens and TMDB datasets were chosen as the basis for building the dataset, as they are classic datasets commonly used for comparing recommendation models.
- Requisites
- Hypothesis
- Documents
- Models
- Metrics
- Data
- Notebooks
- Getting started
- Build dataset
- Recommendation Chatbot API
- References
- anaconda / miniconda / mamba
- mongodb
- chromadb
- airflow
- mongosh (Optional)
- Studio3T (Optional)
- Postman (Optional)
- GPU with 6 to 10 GB of memory for reasonable execution times (Optional)
- Do deep learning-based models achieve better results than non-deep learning-based models? What are the advantages and disadvantages of each approach?
- How can the cold-start problem be solved in a collaborative filtering-based recommendation approach? What solutions have been proposed?
- Specialization: Collaborative recommendation systems
- Thesis (In progress)
The following are the models to be compared. For more details, it is recommended to refer to the thesis document in the previous section.
- Memory based CF: Baseline or reference model.
- KNN (Cosine Distance)
- User-Based.
- Item-Based.
- Ensemble User/Item-Based.
- Model-Based CF: Collaborative filter models based on neural networks.
- Generalized Matrix Factorization (GMF): User/Item embeddings dot product.
- Biased Generalized Matrix Factorization (B-GMF): User/Item embeddings dot product + user/item biases.
- Neural Network Matrix Factorization: User/Item Embedding + flatten + Fully Connected.
- Deep Factorization Machine
- Ensembles
- Content-based and Collaborative-based models Stacking.
- Feature Weighted Linear Stacking.
- Multi-Bandit approach based on beta distribution.
- LLMs + collaborative filtering ensemble.
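As a rough illustration of the GMF and B-GMF entries in the list above, the scoring step can be sketched in plain Python. The embedding values, bias values, and function names are made up for this example; in the thesis the embeddings are learned by a neural network, and this sketch does not reproduce that training code.

```python
from typing import Sequence


def gmf_score(user_emb: Sequence[float], item_emb: Sequence[float]) -> float:
    """GMF-style score: dot product of a user embedding and an item embedding."""
    if len(user_emb) != len(item_emb):
        raise ValueError("embeddings must have the same dimension")
    return sum(u * i for u, i in zip(user_emb, item_emb))


def biased_gmf_score(
    user_emb: Sequence[float],
    item_emb: Sequence[float],
    user_bias: float,
    item_bias: float,
) -> float:
    """B-GMF-style score: dot product plus a per-user and a per-item bias."""
    return gmf_score(user_emb, item_emb) + user_bias + item_bias


# Hypothetical, hand-picked values (real embeddings are learned from ratings).
user = [1.0, 2.0, 0.5]
item = [0.5, 1.0, 2.0]

print(gmf_score(user, item))
print(biased_gmf_score(user, item, user_bias=0.25, item_bias=-0.5))
```

The biases let the model capture users who rate systematically high or low and items that are rated well regardless of who rates them, which the dot product alone does not express.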
To compare collaborative filtering models, the metrics Mean Average Precision at K (mAP@K) and Normalized Discounted Cumulative Gain at K (NDCG@K) are used. Ratings between 4 and 5 points belong to the positive class, and the rest belong to the negative class.
Other metrics used:
- FBetaScore@K
- Precision@K
- Recall@K
- RMSE
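The two ranking metrics named above can be sketched for a single user as follows. This is a minimal illustration, not the project's implementation: it assumes binary relevance derived from the 4-point threshold, and it normalizes average precision by the number of positives found in the top K, which is only one of several conventions in use. All function names are illustrative.

```python
import math
from typing import List

POSITIVE_THRESHOLD = 4.0  # ratings from 4 to 5 count as the positive class


def relevance(ratings_in_rank_order: List[float]) -> List[int]:
    """Map true ratings (ordered by the model's ranking) to 1/0 relevance."""
    return [1 if r >= POSITIVE_THRESHOLD else 0 for r in ratings_in_rank_order]


def average_precision_at_k(ratings: List[float], k: int) -> float:
    """AP@K for one user: mean of precision values at each positive position."""
    hits, total = 0, 0.0
    for position, rel in enumerate(relevance(ratings)[:k], start=1):
        if rel:
            hits += 1
            total += hits / position
    return total / hits if hits else 0.0


def ndcg_at_k(ratings: List[float], k: int) -> float:
    """NDCG@K for one user with binary gains and a log2 position discount."""
    rel = relevance(ratings)

    def dcg(values: List[int]) -> float:
        return sum(v / math.log2(pos + 1) for pos, v in enumerate(values, start=1))

    ideal = dcg(sorted(rel, reverse=True)[:k])
    return dcg(rel[:k]) / ideal if ideal else 0.0


def mean_average_precision_at_k(users_ratings: List[List[float]], k: int) -> float:
    """mAP@K: AP@K averaged over users."""
    return sum(average_precision_at_k(r, k) for r in users_ratings) / len(users_ratings)


# True ratings of the items a model ranked 1st, 2nd, 3rd and 4th for one user.
ranked_ratings = [5.0, 3.0, 4.0, 2.0]
print(average_precision_at_k(ranked_ratings, k=3))
print(ndcg_at_k(ranked_ratings, k=3))
```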
To conduct the necessary tests with both collaborative filtering (CF) and content-based (CB) approaches, we need:
- Ratings of each item (movies) by the users (CF).
- Item-specific features (CB).
Based on these requirements, the following datasets were combined:
- MovieLens 25M Dataset: It has practically no information about the movies, but it does have user ratings.
- TMDB Movie Dataset: It does not have personalized ratings like the previous dataset, but it has several features corresponding to the movies or items which will be necessary when training content-based models.
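Combining the two datasets amounts to a join between ratings and item features. The sketch below assumes the join goes through a MovieLens-style id mapping (MovieLens ships a links file that maps its movie ids to TMDB ids); the rows, field names on the TMDB side, and the function name are hypothetical, and the notebooks that actually build the dataset may proceed differently.

```python
from typing import Dict, List

# Tiny in-memory stand-ins for the real files (all values are hypothetical).
ratings = [
    {"userId": 1, "movieId": 10, "rating": 4.5},
    {"userId": 2, "movieId": 10, "rating": 3.0},
    {"userId": 1, "movieId": 20, "rating": 5.0},
]
links = [  # mapping from MovieLens movie ids to TMDB ids
    {"movieId": 10, "tmdbId": 1000},
    {"movieId": 20, "tmdbId": 2000},
]
tmdb_movies = [  # item features, only available on the TMDB side
    {"id": 1000, "title": "Movie A", "genres": ["action"]},
]


def join_ratings_with_features(
    ratings: List[Dict], links: List[Dict], tmdb_movies: List[Dict]
) -> List[Dict]:
    """Attach TMDB features to each rating; drop ratings without features."""
    tmdb_id_by_movie_id = {row["movieId"]: row["tmdbId"] for row in links}
    features_by_tmdb_id = {row["id"]: row for row in tmdb_movies}
    joined = []
    for rating in ratings:
        tmdb_id = tmdb_id_by_movie_id.get(rating["movieId"])
        features = features_by_tmdb_id.get(tmdb_id)
        if features is None:
            continue  # no content features for this movie
        joined.append({**rating, "title": features["title"], "genres": features["genres"]})
    return joined


dataset = join_ratings_with_features(ratings, links, tmdb_movies)
print(dataset)
```

Dropping ratings without features keeps the collaborative and content-based models on the same set of items, at the cost of discarding some interactions.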
- Recommendation Models Comparative
- Recommendation Chatbot API Evaluation
- Ensemble using Llama 2 as content-based sub-model
- Ensemble using Llama 3 as content-based sub-model
- Memory based
- Model based
- Supervised
- Generalized Matrix Factorization (GMF): Embeddings + dot product.
- Biased Generalized Matrix Factorization (B-GMF): Embeddings + dot product + user/item bias.
- Neural Network Matrix Factorization: User/Item Embedding + flatten + Fully Connected.
- Deep Factorization Machine
- Unsupervised
- Supervised
- Supervised Stacking Ensemble
- User Profile
- Item to Item
- Sparse Auto-Encoder + Distance Weighted Mean
- Sentence Transformer + Distance Weighted Mean
- Load movie items and interactions to chatbot database
- Update Users and Items embeddings using DeepFM model
- LLM/Collaborative Filtering recommender ensemble
- LLM Tests
- LLM Output Parser Tests
- Llama2 Ensemble Evaluation Results
- Llama3 Ensemble Evaluation Results
- Llama 2 vs Llama 3 Ensemble Evaluation Results
Step 1: Clone repo.
$ git clone https://github.com/adrianmarino/thesis-paper.git
$ cd thesis-paper
Step 2: Create environment.
$ conda env create -f environment.yml
Step 1: Enable project environment.
$ conda activate thesis
Step 2: Under the project directory boot jupyter lab.
$ jupyter lab
Jupyter Notebook 6.1.4 is running at:
http://localhost:8888/?token=45efe99607fa6......
Step 3: Go to http://localhost:8888.... as indicated in the shell output.
This process requires a MongoDB database engine installed and listening on localhost:27017, which is the default host and port for a local installation. For more instructions see:
Next, run the following two notebooks in order:
This creates two files in the datasets path:
movies.json
interactions.json
These files make up the project dataset and are used by all notebooks.
A chatbot API that recommends movies based on a user's text request, their profile data, and ratings. Papers on which the chatbot was based:
- Install chat-bot-api as a systemd daemon.
- Run the daemon with your regular user.
Note: systemd is an initialization and service management system for Linux. It is responsible for starting the system and managing the running processes and services. systemd has replaced traditional initialization systems like SysV init in many Linux distributions due to its greater efficiency and advanced features.
Step 1: Copy chat-bot-api.service to the user systemd config path:
$ cp chat-bot-api/chat-bot-api.service ~/.config/systemd/user/
Step 2: Refresh the systemd daemon with the updated config.
$ systemctl --user daemon-reload
Step 3: Enable the chat-bot-api daemon to start on boot.
$ systemctl --user enable chat-bot-api
Step 6: Start chat-bot-api as a systemd daemon.
$ systemctl --user start chat-bot-api
Step 7: Check chat-bot-api health.
$ chat-bot-api/bin/./health
{
"airflow" : {
"metadatabase" : true,
"scheduler" : true
},
"chatbot_api" : true,
"ollama_api" : true,
"choma_database" : true,
"mongo_database" : true
}
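When the health report is consumed by a script rather than read by eye, the failing components can be extracted as sketched below. The function name is hypothetical and the sample report is a hand-edited copy of the output above with one value set to false; only the general shape of the report (nested objects with boolean leaves) is taken from that output.

```python
import json
from typing import Dict, List


def unhealthy_components(report: Dict, prefix: str = "") -> List[str]:
    """Return dotted paths of every component whose value is not true."""
    failing = []
    for name, value in report.items():
        path = f"{prefix}{name}"
        if isinstance(value, dict):
            # Nested section (for example "airflow"): recurse into it.
            failing.extend(unhealthy_components(value, prefix=f"{path}."))
        elif value is not True:
            failing.append(path)
    return failing


# Hand-edited sample: same keys as the output above, scheduler set to false.
sample_report = json.loads("""
{
  "airflow": {"metadatabase": true, "scheduler": false},
  "chatbot_api": true,
  "ollama_api": true,
  "choma_database": true,
  "mongo_database": true
}
""")

print(unhealthy_components(sample_report))
```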
config.conf:
# -----------------------------------------------------------------------------
# Python
# -----------------------------------------------------------------------------
CONDA_PATH="/opt/miniconda3"
CONDA_ENV="thesis"
# -----------------------------------------------------------------------------
#
#
#
# -----------------------------------------------------------------------------
# API
# -----------------------------------------------------------------------------
HOME_PATH="$(pwd)"
PARENT_PATH="$(dirname "$HOME_PATH")"
SERVICE_NAME="Recommendation ChatBot API"
PROCESS_NAME="uvicorn"
export API_HOST="0.0.0.0"
export API_PORT="8080"
# -----------------------------------------------------------------------------
#
#
#
# -----------------------------------------------------------------------------
# Mongo DB
# -----------------------------------------------------------------------------
export MONGODB_DATABASE="chatbot"
export MONGODB_HOST="0.0.0.0"
export MONGODB_PORT="27017"
export MONGODB_URL="mongodb://$MONGODB_HOST:$MONGODB_PORT"
# -----------------------------------------------------------------------------
#
#
#
# -----------------------------------------------------------------------------
# Chroma DB
# -----------------------------------------------------------------------------
export CHROMA_HOST="0.0.0.0"
export CHROMA_PORT="9090"
# -----------------------------------------------------------------------------
#
#
#
# -----------------------------------------------------------------------------
# Training Jobs
# -----------------------------------------------------------------------------
export TMP_PATH="$PARENT_PATH/tmp"
export DATASET_PATH="$PARENT_PATH/datasets"
export WEIGHTS_PATH="$PARENT_PATH/weights"
export METRICS_PATH="$PARENT_PATH/metrics"
# -----------------------------------------------------------------------------
#
#
#
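The exported variables above are visible to any process started from that shell. How the API itself reads them is not shown here, so the following is only a sketch of one way a Python process could pick them up; the function names are hypothetical and the fallback values simply mirror the ones in config.conf.

```python
import os
from typing import Mapping, Tuple


def mongodb_url(env: Mapping[str, str] = os.environ) -> str:
    """Build the MongoDB URL the same way config.conf does, with fallbacks."""
    host = env.get("MONGODB_HOST", "0.0.0.0")
    port = env.get("MONGODB_PORT", "27017")
    # An explicit MONGODB_URL wins over the host/port pair.
    return env.get("MONGODB_URL", f"mongodb://{host}:{port}")


def api_address(env: Mapping[str, str] = os.environ) -> Tuple[str, int]:
    """Return the (host, port) pair the API is expected to bind to."""
    return env.get("API_HOST", "0.0.0.0"), int(env.get("API_PORT", "8080"))


# With an empty environment the config.conf defaults are used.
print(mongodb_url({}))
print(api_address({}))
```

Passing the environment as an argument keeps the functions easy to exercise without touching the real process environment.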
cp dags/cf_emb_update_dag.py $AIRFLOW_HOME/dags
Step 1: Create a user profile.
curl --location 'http://nonosoft.ddns.net:8080/api/v1/profiles' \
--header 'Content-Type: application/json' \
--data-raw '{
"name": "Adrian",
"email": "adrianmarino@gmail.com",
"metadata": {
"studies" : "Engineering",
"age" : 42,
"genre" : "Male",
"nationality" : "Argentina",
"work" : "Software Engineer",
"prefered_movies": {
"release": {
"from" : "1970"
},
"genres": [
"thriller",
"suspense",
"science fiction",
"love",
"comedy"
]
}
}
}'
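The same request can be issued from Python with the standard library. This is a sketch under assumptions: the base URL is a placeholder for wherever the API is running, the helper name is hypothetical, and the profile is a shortened version of the one above. The request is only built here, not sent.

```python
import json
import urllib.request


def build_profile_request(base_url: str, profile: dict) -> urllib.request.Request:
    """Build (without sending) the POST request that creates a user profile."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/profiles",
        data=json.dumps(profile).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Shortened profile; field names are copied from the curl example above.
profile = {
    "name": "Adrian",
    "email": "adrianmarino@gmail.com",
    "metadata": {"prefered_movies": {"release": {"from": "1970"}, "genres": ["comedy"]}},
}

# Placeholder base URL: replace it with the host and port of your deployment.
request = build_profile_request("http://localhost:8080", profile)
print(request.get_method(), request.full_url)

# To actually send it:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```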
Step 2: Query supported LLM models.
curl --location 'http://nonosoft.ddns.net:8080/api/v1/recommendations/models'
{
"models": [
"phi3:mini",
"llama3-rec:latest",
"mxbai-embed-large:latest",
"snowflake-arctic-embed:latest",
"llama3:text",
"llama3:instruct",
"llama3-8b-instruct:latest",
"mistral:latest",
"gemma-7b:latest",
"gemma:7b",
"llama2-7b-chat:latest",
"mistral-instruct:latest",
"mistral:instruct",
"mixtral:latest",
"llama2:7b-chat"
]
}
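Step 3: Ask for recommendations. The lines starting with // in the request body list alternative values; remove them before sending, since standard JSON does not allow comments.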
curl --location 'http://nonosoft.ddns.net:8080/api/v1/recommendations' \
--header 'Content-Type: application/json' \
--data-raw '{
"message": {
"author": "adrianmarino@gmail.com",
"content": "I want to see marvel movies"
},
"settings": {
"llm" : "llama3:instruct",
// llama2:7b-chat, mistral:instruct
"retry" : 3,
"plain" : false,
"include_metadata" : true,
"rag": {
"shuffle" : false,
"candidates_limit" : 30,
"llm_response_limit" : 30,
"recommendations_limit" : 5,
"similar_items_augmentation_limit" : 5,
"not_seen" : true
},
"collaborative_filtering": {
"shuffle" : false,
"candidates_limit" : 100,
"llm_response_limit" : 30,
"recommendations_limit" : 5,
"similar_items_augmentation_limit" : 2,
"text_query_limit" : 5000,
"k_sim_users" : 10,
"random_selection_items_by_user" : 0.5,
"max_items_by_user" : 10,
"min_rating_by_user" : 3.5,
"not_seen" : true,
"rank_criterion" : "user_sim_weighted_pred_rating_score"
// user_sim_weighted_rating_score
// user_item_sim
// pred_user_rating
}
}
}'
{
"items": [
{
"title": "Thor",
"poster": "http://image.tmdb.org/t/p/w500/pIkRyD18kl4FhoCNQuWxWu5cBLM.jpg",
"release": "2011",
"description": "Chris hemsworth stars as the norse god of thunder, who must reclaim his rightful place on the throne and defeat an evil nemesis.",
"genres": [
"action",
"adventure",
"drama",
"fantasy",
"imax"
],
"votes": [
"http://nonosoft.ddns.net:8080/api/v1/interactions/make/adrianmarino@gmail.com/86332/1",
"http://nonosoft.ddns.net:8080/api/v1/interactions/make/adrianmarino@gmail.com/86332/2",
"http://nonosoft.ddns.net:8080/api/v1/interactions/make/adrianmarino@gmail.com/86332/3",
"http://nonosoft.ddns.net:8080/api/v1/interactions/make/adrianmarino@gmail.com/86332/4",
"http://nonosoft.ddns.net:8080/api/v1/interactions/make/adrianmarino@gmail.com/86332/5"
]
},
{
"title": "Avengers, The",
"poster": "http://image.tmdb.org/t/p/w500/RYMX2wcKCBAr24UyPD7xwmjaTn.jpg",
"release": "2012",
"description": "Earth's mightiest heroes team up to save the world from an alien invasion in this epic superhero movie.",
"genres": [
"action",
"adventure",
"sci-fi",
"imax"
],
"votes": [
"http://nonosoft.ddns.net:8080/api/v1/interactions/make/adrianmarino@gmail.com/89745/1",
"http://nonosoft.ddns.net:8080/api/v1/interactions/make/adrianmarino@gmail.com/89745/2",
"http://nonosoft.ddns.net:8080/api/v1/interactions/make/adrianmarino@gmail.com/89745/3",
"http://nonosoft.ddns.net:8080/api/v1/interactions/make/adrianmarino@gmail.com/89745/4",
"http://nonosoft.ddns.net:8080/api/v1/interactions/make/adrianmarino@gmail.com/89745/5"
]
},
{
"title": "Marvel One-Shot: A Funny Thing Happened on the Way to Thor's Hammer",
"poster": "http://image.tmdb.org/t/p/w500/njrOqsmFH4pxBrhcoslqLfw2OGk.jpg",
"release": "2011",
"description": "Chris hemsworth stars as the norse god of thunder, who must reclaim his rightful place on the throne and defeat an evil nemesis.",
"genres": [
"fantasy",
"sci-fi"
],
"votes": [
"http://nonosoft.ddns.net:8080/api/v1/interactions/make/adrianmarino@gmail.com/168040/1",
"http://nonosoft.ddns.net:8080/api/v1/interactions/make/adrianmarino@gmail.com/168040/2",
"http://nonosoft.ddns.net:8080/api/v1/interactions/make/adrianmarino@gmail.com/168040/3",
"http://nonosoft.ddns.net:8080/api/v1/interactions/make/adrianmarino@gmail.com/168040/4",
"http://nonosoft.ddns.net:8080/api/v1/interactions/make/adrianmarino@gmail.com/168040/5"
]
}
]
}
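Each entry under votes in the response appears to encode the user, the item id, and the rating as the last three path segments. Assuming that layout holds, a client could pick the URL for a given rating as sketched below; the helper names and the sample host are hypothetical.

```python
from urllib.parse import urlparse


def parse_vote_url(url: str) -> dict:
    """Split a vote URL into user, item id and rating.

    Assumes the path ends with .../<user>/<item_id>/<rating>.
    """
    user, item_id, rating = urlparse(url).path.strip("/").split("/")[-3:]
    return {"user": user, "item_id": int(item_id), "rating": int(rating)}


def vote_url_for(item: dict, rating: int) -> str:
    """Return the vote URL of a recommended item that records the given rating."""
    for url in item["votes"]:
        if parse_vote_url(url)["rating"] == rating:
            return url
    raise ValueError(f"no vote URL for rating {rating}")


# Sample item shaped like the response above (host is a placeholder).
sample_item = {
    "title": "Thor",
    "votes": [
        f"http://localhost:8080/api/v1/interactions/make/adrianmarino@gmail.com/86332/{r}"
        for r in range(1, 6)
    ],
}

print(parse_vote_url(sample_item["votes"][0]))
print(vote_url_for(sample_item, rating=5))
```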
Step 1: Back up user interactions in mongodb.
mongoexport -d chatbot -c interactions --out interactions.json --jsonArray
Step 2: Remove all chatbot user interactions in mongodb.
db.getCollection('interactions').deleteMany({ 'user_id': { $regex: /@/ }})
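The filter above deletes every interaction whose user_id contains an @ character. A plausible reading, stated here as an assumption, is that chatbot users are identified by email while users coming from the original dataset are not. Before running the delete, the effect of the filter can be previewed on an exported sample with a few lines of Python; the rows and the function name below are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Same condition as the mongosh filter: user_id contains "@".
CHATBOT_USER_PATTERN = re.compile(r"@")


def split_interactions(interactions: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Return (rows the filter would delete, rows it would keep)."""
    to_delete, to_keep = [], []
    for row in interactions:
        if CHATBOT_USER_PATTERN.search(str(row["user_id"])):
            to_delete.append(row)
        else:
            to_keep.append(row)
    return to_delete, to_keep


# Hypothetical sample rows.
sample = [
    {"user_id": "adrianmarino@gmail.com", "item_id": 86332, "rating": 5},
    {"user_id": "1234", "item_id": 89745, "rating": 4},
]

deleted, kept = split_interactions(sample)
print(len(deleted), len(kept))
```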
Step 3: Back up and remove all user profiles in mongodb.
mongoexport -d chatbot -c profiles --out profiles.json --jsonArray
db.getCollection('profiles').drop();
Step 4: Back up and remove all predicted interactions in mongodb.
mongoexport -d chatbot -c pred_interactions --out pre_interactions.json --jsonArray
db.getCollection('pred_interactions').drop();
Step 5: Remove the users' search history in mongodb.
db.getCollection('histories').drop();
Step 6: Remove all collections in the chroma database.
cd chat-bot-api
bin/./chroma-delete-all
ENV: thesis
2024-06-08 13:31:53,826 - INFO - Start: Delete all chroma db collections...
2024-06-08 13:31:58,376 - INFO - ==> "items_cf" collection deleted...
2024-06-08 13:31:58,685 - INFO - ==> "items_content" collection deleted...
2024-06-08 13:31:59,130 - INFO - ==> "users_cf" collection deleted...
2024-06-08 13:31:59,130 - INFO - Finish: 3 collections deleted
Step 7: Restart the chatbot API.
systemctl --user restart chat-bot-api
Step 8: Rebuild item text embeddings used to search items by free text (Retrieval Augmented Generation).
curl --location --request PUT 'http://nonosoft.ddns.net:8080/api/v1/items/embeddings/content/build?batch_size=5000'
The following command can be used to follow the reindex process logs:
tail -f /var/tmp/chat-bot-api.log
Step 9: Restart chat-bot-api
systemctl --user restart chat-bot-api
systemctl --user status chat-bot-api
● chat-bot-api.service - Recommendation Chatbot API for adrian user
Loaded: loaded (/home/adrian/.config/systemd/user/chat-bot-api.service; enabled; preset: enabled)
Active: active (exited) since Sat 2024-06-08 13:35:12 -03; 3s ago
Process: 4092833 ExecStart=/home/adrian/chat-bot-api/bin/start (code=exited, status=0/SUCCESS)
Main PID: 4092833 (code=exited, status=0/SUCCESS)
Tasks: 26 (limit: 38212)
Memory: 514.4M (peak: 515.0M)
CPU: 4.855s
CGroup: /user.slice/user-1000.slice/user@1000.service/app.slice/chat-bot-api.service
├─4092894 python -m uvicorn api:app --reload --host 0.0.0.0 --port 8080
├─4092897 /home/adrian/.conda/envs/thesis/bin/python -c "from multiprocessing.resource_tracker import main;main(4)"
└─4092898 /home/adrian/.conda/envs/thesis/bin/python -c "from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=5, pipe_handle=7)" --multiprocessing-fork
jun 08 13:35:12 skynet systemd[1467]: Starting Recommendation Chatbot API for adrian user...
jun 08 13:35:12 skynet start[4092833]: ENV: thesis
jun 08 13:35:12 skynet start[4092833]: Start Recommendation ChatBot API...
jun 08 13:35:12 skynet systemd[1467]: Finished Recommendation Chatbot API for adrian user.
jun 08 13:35:12 skynet start[4092894]: INFO: Will watch for changes in these directories: ['/home/adrian/development/personal/maestria/thesis-paper/chat-bot-api']
jun 08 13:35:12 skynet start[4092894]: INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
jun 08 13:35:12 skynet start[4092894]: INFO: Started reloader process [4092894] using WatchFiles
Step 10: Remove previous model weights.
rm -rf weights
Step 11: Start Jupyter Lab, open notebooks/chat-bot/6_evaluation-llama3.ipynb and run the notebook.
cd ..
jupyter lab
Note: The evaluation process takes between 4 and 5 days.
- References
- Using or based on