Cloud:
Version Control System:
Programming Language - Python:
Big Data Tools and Software:
-
Problem Statement:
-
The objective of this project is to analyze movie data using Hadoop, Spark, and Hive. The goal is to derive insights from ratings, tags, and movie details in order to understand user preferences and popular trends in movies.
-
Project Introduction:
-
Welcome to my movie data analysis project. To kick things off, I set up a Hadoop cluster with Hadoop YARN, Hive, and Apache Spark. The cluster can run either locally with Docker or on a cloud platform such as AWS, GCP, or Azure.
-
Data Loading:
-
First and foremost, I loaded three crucial data files (movies.csv, ratings.csv, and tags.csv) into my Hadoop Distributed File System (HDFS).
-
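The upload step can be sketched as a small helper that builds the `hdfs dfs` commands for the three files. This is only an illustration: the target directory `/user/hadoop/movies/input` and the helper name `hdfs_upload_commands` are assumptions, since the project does not state the HDFS path it used. The commands are printed rather than executed, so the sketch runs on a machine without a Hadoop client.

```python
import shlex

# Hypothetical HDFS target directory; the project does not name one.
HDFS_INPUT_DIR = "/user/hadoop/movies/input"
DATA_FILES = ["movies.csv", "ratings.csv", "tags.csv"]


def hdfs_upload_commands(files, target_dir):
    """Build the argument lists that create target_dir and upload each file."""
    commands = [["hdfs", "dfs", "-mkdir", "-p", target_dir]]
    for name in files:
        # -f overwrites an existing copy, so the upload can be re-run.
        commands.append(["hdfs", "dfs", "-put", "-f", name, target_dir])
    return commands


if __name__ == "__main__":
    for argv in hdfs_upload_commands(DATA_FILES, HDFS_INPUT_DIR):
        # Printed only; on a cluster node each argv could be passed to
        # subprocess.run(argv, check=True) instead.
        print(shlex.join(argv))
```
-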
Spark Data Analysis:
-
The heart of the project! I wrote Spark jobs to tackle specific analytical challenges in the movie data.
-
I showed the aggregated number of ratings per year.
-
I displayed the average monthly number of ratings.
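To make the two time-based aggregations concrete, here is a plain-Python sketch of the logic on a handful of rows. It is not the Spark job itself. It assumes a MovieLens-style layout in which each rating row carries a `timestamp` in Unix epoch seconds, and it reads "average monthly number of ratings" as the mean count over the (year, month) buckets that contain at least one rating. Both are assumptions, and the sample rows are made up.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical sample rows: (userId, movieId, rating, timestamp in epoch seconds).
SAMPLE_TIMED_RATINGS = [
    (1, 10, 4.0, 946684800),  # 2000-01-01 UTC
    (1, 11, 3.5, 946771200),  # 2000-01-02 UTC
    (2, 10, 5.0, 951868800),  # 2000-03-01 UTC
    (3, 12, 2.0, 978307200),  # 2001-01-01 UTC
]


def ratings_per_year(ratings):
    """Count ratings per calendar year (UTC), ordered by year."""
    years = (
        datetime.fromtimestamp(ts, tz=timezone.utc).year for _, _, _, ts in ratings
    )
    return dict(sorted(Counter(years).items()))


def average_ratings_per_month(ratings):
    """Mean number of ratings over the (year, month) buckets that have any."""
    per_month = Counter()
    for _, _, _, ts in ratings:
        moment = datetime.fromtimestamp(ts, tz=timezone.utc)
        per_month[(moment.year, moment.month)] += 1
    if not per_month:
        return 0.0
    return sum(per_month.values()) / len(per_month)


print(ratings_per_year(SAMPLE_TIMED_RATINGS))
print(average_ratings_per_month(SAMPLE_TIMED_RATINGS))
```

Months without any ratings do not appear as buckets here, so they do not pull the average down. Whether that matches the intended definition depends on how the original job counted empty months.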
-
I visualized the distribution of rating levels.
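The distribution itself is a count per rating value. A minimal plain-Python sketch of that logic follows, with a text bar per level standing in for the chart. The sample values and the helper name `rating_distribution` are illustrative assumptions, not taken from the project.

```python
from collections import Counter

# Hypothetical sample of rating values.
SAMPLE_RATING_VALUES = [4.0, 3.5, 5.0, 4.0, 2.0, 4.0]


def rating_distribution(values):
    """Map each rating level to (count, share of all ratings), ordered by level."""
    counts = Counter(values)
    total = sum(counts.values())
    return {level: (n, n / total) for level, n in sorted(counts.items())}


for level, (count, share) in rating_distribution(SAMPLE_RATING_VALUES).items():
    # One '#' per rating gives a rough text histogram.
    print(f"{level:>4}  {'#' * count}  {share:.0%}")
```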
-
I identified the 18 movies that were tagged but not rated.
-
I found movies that had ratings but no tags.
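Both "tagged but not rated" and "rated but not tagged" are anti-joins: keep the movie ids from one table that have no match in the other. In Spark this is typically written as a join with `how="left_anti"`. The plain-Python sketch below shows the same logic with sets. The (userId, movieId) row shape and the sample data are assumptions.

```python
# Hypothetical sample rows: (userId, movieId).
SAMPLE_RATED = [(1, 10), (2, 10), (2, 11)]
SAMPLE_TAGGED = [(1, 10), (3, 12)]


def movies_only_in(left_rows, right_rows):
    """Return the sorted movie ids present in left_rows but absent from right_rows."""
    left_ids = {movie_id for _, movie_id in left_rows}
    right_ids = {movie_id for _, movie_id in right_rows}
    return sorted(left_ids - right_ids)


tagged_not_rated = movies_only_in(SAMPLE_TAGGED, SAMPLE_RATED)
rated_not_tagged = movies_only_in(SAMPLE_RATED, SAMPLE_TAGGED)
print(tagged_not_rated, rated_not_tagged)
```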
-
For movies that were rated but not tagged and that have more than 30 user ratings, I displayed the top 10 movies in terms of average rating and number of ratings.
-
I calculated the average number of tags per movie in tagsDF and the average number of tags per user, comparing it with the average number of tags a user assigns to a movie.
-
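The three tag averages differ only in what the total tag count is divided by: distinct movies, distinct users, or distinct (user, movie) pairs. A plain-Python sketch of that arithmetic is shown here, assuming tag rows of the form (userId, movieId, tag). The sample rows and the helper name `tag_averages` are hypothetical.

```python
# Hypothetical sample rows: (userId, movieId, tag).
SAMPLE_TAGS = [
    (1, 10, "funny"),
    (1, 10, "classic"),
    (1, 11, "dark"),
    (2, 10, "funny"),
    (3, 10, "slow"),
]


def tag_averages(tags):
    """Average number of tags per movie, per user, and per (user, movie) pair."""
    total = len(tags)
    if total == 0:
        return {"per_movie": 0.0, "per_user": 0.0, "per_user_movie": 0.0}
    movies = {movie_id for _, movie_id, _ in tags}
    users = {user_id for user_id, _, _ in tags}
    pairs = {(user_id, movie_id) for user_id, movie_id, _ in tags}
    return {
        "per_movie": total / len(movies),
        "per_user": total / len(users),
        "per_user_movie": total / len(pairs),
    }


print(tag_averages(SAMPLE_TAGS))
```
-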
I also identified the users that tagged movies without rating them.
-
I calculated the average number of ratings per user in the ratings DataFrame and the average number of ratings per movie.
-
I determined the predominant (frequency-based) genre per rating level.
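One way to read "predominant genre per rating level" is: for each rating value, count how often each genre appears among the rated movies and keep the most frequent one. The plain-Python sketch below follows that reading. It assumes MovieLens-style genre strings separated by `|`, and it breaks ties alphabetically, which is an arbitrary choice made for this illustration. The sample data is made up.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: movieId -> pipe-separated genres, and (movieId, rating) rows.
SAMPLE_GENRES = {10: "Comedy|Romance", 11: "Drama", 12: "Comedy"}
SAMPLE_MOVIE_RATINGS = [(10, 4.0), (12, 4.0), (11, 4.0), (11, 2.0)]


def predominant_genre_per_rating(ratings, genres_by_movie):
    """Return the most frequent genre for each rating level."""
    counts = defaultdict(Counter)
    for movie_id, rating in ratings:
        # A movie with several genres contributes one count to each of them.
        for genre in genres_by_movie.get(movie_id, "").split("|"):
            if genre:
                counts[rating][genre] += 1
    return {
        # Highest count wins; ties fall back to alphabetical order.
        rating: min(genre_counts.items(), key=lambda item: (-item[1], item[0]))[0]
        for rating, genre_counts in sorted(counts.items())
    }


print(predominant_genre_per_rating(SAMPLE_MOVIE_RATINGS, SAMPLE_GENRES))
```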
-
I found the predominant tag per genre and the most tagged genres.
-
I identified the predominant (popularity-based) movies.
-
Finally, I listed the top 10 movies in terms of average rating (provided more than 30 users reviewed them).
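The top-10 ranking combines a group-by, a minimum-count filter, and a sort. A plain-Python sketch of that logic is given below. The function name `top_movies`, the (movieId, rating) row shape, and the tie-breaking order (more ratings first, then lower movie id) are assumptions made for the illustration.

```python
from collections import defaultdict


def top_movies(ratings, min_ratings=30, limit=10):
    """Rank movies with more than min_ratings ratings by average rating."""
    totals = defaultdict(lambda: [0.0, 0])  # movieId -> [sum of ratings, count]
    for movie_id, rating in ratings:
        totals[movie_id][0] += rating
        totals[movie_id][1] += 1
    qualified = [
        (movie_id, rating_sum / count, count)
        for movie_id, (rating_sum, count) in totals.items()
        if count > min_ratings  # strictly more than the threshold
    ]
    # Best average first; ties go to the movie with more ratings, then lower id.
    qualified.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return qualified[:limit]


# Hypothetical data: movie 2 has exactly 30 ratings, so it is excluded.
sample = [(1, 4.0)] * 31 + [(2, 5.0)] * 30 + [(3, 3.0)] * 40
print(top_movies(sample))
```

The same function would cover the rated-but-untagged variant if the input rows are first restricted to movies that have no tags.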
-
Data Storage:
-
For each analysis task, I stored the output as a single CSV file with headers in the output HDFS path.
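In PySpark, a single headered CSV is commonly produced by reducing the DataFrame to one partition before writing. The helper below is a minimal sketch of that pattern, not necessarily what the project did. The name `write_single_csv`, the overwrite mode, and the example path are assumptions. It takes an existing Spark DataFrame and uses only the standard `coalesce`, `write`, `mode`, `option`, and `csv` calls.

```python
def write_single_csv(df, output_path):
    """Write a Spark DataFrame as one headered CSV part file under output_path."""
    (
        df.coalesce(1)             # one partition, so Spark emits one part file
        .write.mode("overwrite")   # assumption: replace output from earlier runs
        .option("header", True)    # first line holds the column names
        .csv(output_path)
    )


# Example call with a hypothetical DataFrame and path:
# write_single_csv(ratings_per_year_df, "hdfs:///user/hadoop/movies/output/ratings_per_year")
```

Spark treats `output_path` as a directory and places a single `part-*.csv` file inside it, so the result is one CSV file but not a file with a name of your choosing. Pulling everything into one partition is fine for small aggregated outputs like these but can be slow for large results.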
-
Key Takeaway:
-
This project shows how Hadoop, Spark, and Hive can be combined to extract valuable insights from movie data. It also emphasizes the importance of data organization for further exploration, decision-making, and a better understanding of user preferences and trends in the movie industry.