Movie-Rating-Data-Analysis

Movie Rating Data Analysis using Apache Spark (PySpark)💥🐝

📝 Gain the skills

Languages and Tools:

Cloud:

Version Control System:

Programming Language: Python

Big Data Tools and Software:

📙 Project Structure:

Problem Statement:
The objective of this project is to analyze movie data using Hadoop, Spark, and Hive. We aim to derive insights from ratings, tags, and movie details in order to understand user preferences and popular trends in the world of movies.
Project Introduction:
Welcome to my movie data analysis project. To start, I set up a Hadoop cluster with Hadoop YARN, Hive, and Apache Spark. The cluster can run either locally with Docker or on a cloud platform such as AWS, GCP, or Azure.
Data Loading:
First, I loaded the three input files (movies.csv, ratings.csv, and tags.csv) into the Hadoop Distributed File System (HDFS).
Spark Data Analysis:
This is the core of the project. I wrote Spark jobs to answer the following analytical questions about the movie data.
I showed the aggregated number of ratings per year.
I displayed the average monthly number of ratings.
I visualized the distribution of rating levels.
I identified the 18 movies that were tagged but not rated.
I found movies that had ratings but no tags.
For rated untagged movies with more than 30 user ratings, I displayed the top 10 movies in terms of average rating and number of ratings.
I calculated the average number of tags per movie and the average number of tags per user in tagsDF, and compared the per-user average with the average number of tags a user assigns to a single movie.
I also identified the users that tagged movies without rating them.
I calculated the average number of ratings per user in the ratings DataFrame and the average number of ratings per movie.
I determined the predominant (frequency-based) genre per rating level.
I found the predominant tag per genre and the most tagged genres.
I identified the predominant (popularity-based) movies.
Finally, I listed the top 10 movies in terms of average rating (provided more than 30 users reviewed them).
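Three of the analyses above can be sketched without Spark as a small reference check on sample rows: ratings per year, movies present in one file but missing from the other (what a `left_anti` join expresses in PySpark), and the top movies by average rating above a minimum rating count. This is an illustrative plain-Python sketch, not the project's Spark code. The column names (`userId`, `movieId`, `rating`, `timestamp`, `tag`), the assumption that `timestamp` holds Unix seconds, the function names, and the sample rows are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone


def ratings_per_year(ratings):
    """Count ratings per calendar year (UTC), assuming Unix-second timestamps."""
    years = Counter(
        datetime.fromtimestamp(int(r["timestamp"]), tz=timezone.utc).year
        for r in ratings
    )
    return dict(sorted(years.items()))


def anti_join_keys(left, right, key):
    """Sorted key values that appear in `left` rows but in no `right` row."""
    return sorted({row[key] for row in left} - {row[key] for row in right})


def top_by_average(ratings, min_ratings=30, limit=10):
    """Top movies by mean rating among those with more than `min_ratings` ratings."""
    total = defaultdict(float)
    count = Counter()
    for r in ratings:
        total[r["movieId"]] += float(r["rating"])
        count[r["movieId"]] += 1
    stats = [
        (movie, total[movie] / n, n) for movie, n in count.items() if n > min_ratings
    ]
    # Highest average first; ties broken by rating count, then movie id.
    stats.sort(key=lambda s: (-s[1], -s[2], s[0]))
    return stats[:limit]


# Made-up sample rows, for illustration only.
sample_ratings = [
    {"userId": 1, "movieId": 10, "rating": 4.0, "timestamp": 946684800},  # 2000-01-01 UTC
    {"userId": 2, "movieId": 10, "rating": 5.0, "timestamp": 978307200},  # 2001-01-01 UTC
    {"userId": 1, "movieId": 20, "rating": 3.0, "timestamp": 978307200},  # 2001-01-01 UTC
]
sample_tags = [
    {"userId": 1, "movieId": 10, "tag": "classic"},
    {"userId": 3, "movieId": 30, "tag": "obscure"},
]

print(ratings_per_year(sample_ratings))
print("tagged, not rated:", anti_join_keys(sample_tags, sample_ratings, "movieId"))
print("rated, not tagged:", anti_join_keys(sample_ratings, sample_tags, "movieId"))
print(top_by_average(sample_ratings, min_ratings=1))
```

Running the same functions on a small extract of the real files gives numbers that can be compared against the Spark output before trusting the full-scale results.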
Data Storage:
For each problem statement, I wrote the output to a single CSV file with a header row under the output path in HDFS.
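Spark's CSV writer produces a directory of `part-*` files rather than one file, so a single-file result usually means the DataFrame was reduced to one partition before writing. The sketch below is a plain-Python check, run against a local copy of an output directory, that exactly one CSV part file exists and that it starts with the expected header. The `part-*.csv` naming pattern, the local copy, the helper name `check_single_csv`, and the example columns are assumptions, not details taken from the project.

```python
import csv
import tempfile
from pathlib import Path


def check_single_csv(output_dir, expected_header):
    """Return the only CSV part file in output_dir after verifying its header."""
    parts = sorted(Path(output_dir).glob("part-*.csv"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly 1 CSV part file, found {len(parts)}")
    with parts[0].open(newline="") as handle:
        header = next(csv.reader(handle), None)
    if header != list(expected_header):
        raise ValueError(f"unexpected header row: {header!r}")
    return parts[0]


# Demo on a throwaway directory that imitates a Spark output folder.
with tempfile.TemporaryDirectory() as tmp:
    part = Path(tmp) / "part-00000.csv"  # hypothetical file name
    part.write_text("year,num_ratings\n2000,1\n2001,2\n")
    print(check_single_csv(tmp, ["year", "num_ratings"]).name)
```

The check raises `ValueError` when the directory holds zero or several part files, or when the first row differs from the expected header, which makes it easy to run once per problem statement.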
Key Takeaway:
This project shows how Hadoop, Spark, and Hive can be combined to extract insights from movie data. It also highlights why well-organized data matters for further exploration, decision-making, and a better understanding of user preferences and trends in the movie industry.

Repository files: README.md, movies.csv, ratings.csv, tags.csv, spark movie rating project.ipynb
