GroupLens Research has collected and made available rating data sets from the MovieLens web site (http://movielens.org). The data sets were collected over various periods of time, depending on the size of the set.
Recommendation engines suggest products or services to people. In effect, a recommender narrows down the available choices by presenting the items a person is most likely to buy or use. These datasets can be used to model user interests and recommend movies accordingly. We have implemented the recommendation system with content-based, collaborative, and hybrid models, and we have used the Surprise library to measure the RMSE and accuracy of the model.
- We have used the MovieLens dataset to build the recommender systems.
- The dataset can be downloaded from: https://www.kaggle.com/rounakbanik/the-movies-dataset
Full dataset: This dataset consists of 26,000,000 ratings by 270,000 users for 45,000 movies.
Sample dataset: We have taken a subset of 100,000 ratings from 700 users on 9,000 movies to build the recommender system.
The data are contained in the following files:
- credits.csv --> This dataset contains columns - cast, crew, id
- keywords.csv --> This dataset contains columns - id, keywords
- links_small.csv --> This dataset contains columns - movieId, imdbId, tmdbId
- movies_metadata.csv --> This dataset contains columns - adult, belongs_to_collection, budget, genres, homepage, id, imdb_id, original_language, original_title, overview, release_date, revenue, runtime, spoken_languages, status, tagline, title, video, vote_average, vote_count
- ratings_small.csv --> This dataset contains columns - userId, movieId, rating, timestamp
- ratings.csv --> This dataset contains columns - userId, movieId, rating, timestamp
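Several of the metadata columns listed above (for example `genres`, `keywords`, `cast`, and `crew`) appear to be stored as stringified Python lists of dictionaries, which is presumably why the `ast` module is among the requirements. The following is a minimal sketch of that parsing step under that assumption. The sample rows, the `parse_names` helper, and the `load_genres` helper are hypothetical and only illustrate the idea; they are not the project's own pre-processing code.

```python
import ast
import csv
import io

# Tiny in-memory stand-in for movies_metadata.csv (illustrative rows, not real data).
SAMPLE_CSV = '''id,title,genres
1,Example Movie A,"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
2,Example Movie B,"[{'id': 18, 'name': 'Drama'}]"
3,Example Movie C,[]
'''


def parse_names(cell):
    """Turn a stringified list of dicts into a plain list of 'name' values."""
    try:
        items = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []  # malformed or empty cell
    if not isinstance(items, list):
        return []
    return [item["name"] for item in items
            if isinstance(item, dict) and "name" in item]


def load_genres(csv_text):
    """Map each movie id (as a string) to its list of genre names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["id"]: parse_names(row["genres"]) for row in reader}


genres_by_id = load_genres(SAMPLE_CSV)
print(genres_by_id)
```

With the real files, the same `parse_names` helper could be applied to each cell of the relevant column after reading the CSV.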
Jupyter notebooks and scripts have been developed for this project. After pre-processing the dataset, we carried out the following analyses with the proposed models:
Content Based Model: A content-based recommendation model built from movie descriptions and taglines. We have used TF-IDF to implement this model.
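To illustrate the idea behind the content-based model, here is a minimal TF-IDF and cosine-similarity sketch. It uses only the standard library so that it runs anywhere, whereas the project itself lists scikit-learn among its requirements. The titles, descriptions, and function names are hypothetical, and the smoothed IDF formula is one common variant chosen for this sketch.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace, keeping alphabetic tokens only."""
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption of this sketch): ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: count * idf[t] for t, count in Counter(tokens).items()}
            for tokens in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(titles, docs, query_title, top_n=2):
    """Rank the other movies by description similarity to query_title."""
    vectors = tfidf_vectors(docs)
    q = titles.index(query_title)
    scores = [(titles[i], cosine(vectors[q], vectors[i]))
              for i in range(len(titles)) if i != q]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical descriptions standing in for overview + tagline text.
titles = ["Space Quest", "Star Voyage", "Garden Tales"]
docs = [
    "astronauts explore a distant planet in space",
    "a crew of astronauts travel through space to a new planet",
    "a gardener grows roses in a quiet village",
]
print(most_similar(titles, docs, "Space Quest"))
```

In this toy example, "Star Voyage" ranks above "Garden Tales" for the query "Space Quest" because it shares more of the distinctive terms.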
Collaborative Model: A collaborative model built from movie descriptions, taglines, keywords, cast, director, and genres. Using all of these variables, we have implemented item-based collaborative filtering.
Surprise Library: We have applied collaborative filtering through the Surprise library API to measure the RMSE and accuracy.
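For reference, RMSE is the square root of the mean squared difference between the true and predicted ratings. The sketch below computes it by hand with the standard library, rather than through Surprise, so the metric itself is visible. MAE is included as a second error measure on the assumption that "accuracy" refers to error metrics of this kind. The rating values are made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(actual, predicted):
    """Mean absolute error between two equal-length rating lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical true ratings and model predictions on a held-out set.
true_ratings = [4.0, 3.0, 5.0, 2.0]
predicted_ratings = [3.5, 3.0, 4.0, 3.0]

print(rmse(true_ratings, predicted_ratings))  # 0.75
print(mae(true_ratings, predicted_ratings))   # 0.625
```

A lower value is better for both metrics; RMSE penalizes large individual errors more heavily than MAE does.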
Hybrid Model: Using the features from the content-based and collaborative models, we have implemented a hybrid recommendation system.
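One simple way to combine the two signals is a weighted blend of the content similarity and the predicted rating from the collaborative model. The sketch below shows that approach; it is an assumption made for illustration, not necessarily how the project's notebooks combine the models. The `hybrid_rank` function, the `alpha` weight, the 5-star maximum, and all scores are hypothetical.

```python
def hybrid_rank(content_scores, predicted_ratings, alpha=0.5, max_rating=5.0):
    """Blend content similarity (0..1) with a predicted rating (0..max_rating).

    alpha = 1.0 uses only the content score; alpha = 0.0 uses only the
    collaborative prediction. Movies missing a prediction are skipped.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    ranked = []
    for title, similarity in content_scores.items():
        if title not in predicted_ratings:
            continue
        # Rescale the rating to 0..1 so both signals are comparable.
        cf_score = predicted_ratings[title] / max_rating
        ranked.append((title, alpha * similarity + (1.0 - alpha) * cf_score))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical inputs: similarity to a seed movie, and one user's predicted ratings.
content_scores = {"Movie A": 0.9, "Movie B": 0.6, "Movie C": 0.3}
user_predictions = {"Movie A": 2.0, "Movie B": 4.5, "Movie C": 4.0}

for title, score in hybrid_rank(content_scores, user_predictions):
    print(title, round(score, 2))
```

With equal weights, "Movie B" moves ahead of "Movie A": it is less similar to the seed movie, but the user is predicted to rate it much higher. Setting `alpha=1.0` recovers the pure content-based ordering.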
- Jupyter Notebook
- Python >=3.5
- pandas
- numpy
- scipy
- ast module
- nltk
- scikit-learn
- scikit-surprise