This Python project extracts comprehensive movie data from IMDb, focusing on the top-grossing movies from 1960 to 2024. It collects detailed information, including box office performance, cast & crew, awards, and other key metrics, organizing the data in a structured format for analysis.
I uploaded the Dataset to Kaggle. Check it out.
Here is the Merged Dataset alongside several notebooks related to its study: Check this Kaggle link
The scraper collects the following details for each year:
- Basic Information: Title, year, duration, MPA rating, IMDb rating, and user votes.
- Financial Data: Budget, worldwide gross, US/Canada gross, and opening weekend revenue.
- Credits: Directors, writers, and main cast.
- Additional Details: Genres, production countries, filming locations, production companies, and available languages.
- Awards: Wins, nominations, and Oscar statistics.
- Release Information: Official release dates.
For each year, the script generates:
imdb_movies_[year].csv
: Contains basic movie information.advanced_movies_details_[year].csv
: Contains detailed movie data.merged_movies_data_[year].csv
: Combines data from the other two files for a complete dataset.
All files follow a consistent template across years, enabling seamless analysis. Below are the templates for each file:
-
imdb_movies_[year].csv
:
Title, Movie Link, Year, Duration, MPA, Rating, Votes
-
advanced_movies_details_[year].csv
:
Movie Link, budget, grossWorldWide, gross_US_Canada, opening_weekend_Gross, directors, writers, stars, genres, countries_origin, filming_locations, production_companies, Languages, wins, nominations, oscars, release_date
-
merged_movies_data_[year].csv
:
Title, Movie Link, Year, Duration, MPA, Rating, Votes, budget, grossWorldWide, gross_US_Canada, opening_weekend_Gross, directors, writers, stars, genres, countries_origin, filming_locations, production_companies, Languages, wins, nominations, oscars, release_date
The dataset is updated annually every December to include the most recent data.
Ensure you have the following installed:
python 3
Jupyter
selenium
beautifulsoup4
pandas
-
Install the required Python packages:
pip install selenium beautifulsoup4 pandas
-
Download a WebDriver:
- The script uses Microsoft Edge WebDriver by default.
- Modify the code to use Chrome, Firefox, or another WebDriver if preferred.
- Place the WebDriver executable in the project directory or add it to your PATH.
The scraper organizes the output into a Data
directory, structured by year. For each year, it generates 3 CSV files:
imdb_movies_[year].csv
: Basic movie informationadvanced_movies_details_[year].csv
: Detailed movie datamerged_movies_data_[year].csv
: Combined dataset
Use the following commands to run the scraper:
# Run for a specific year
crawl_imdb_movies(2023)
# Run for multiple years
years_to_crawl = range(1960, 2024)
for year in years_to_crawl:
crawl_imdb_movies(year)
Contains essential movie information, including:
- Title: Movie title.
- Movie Link: IMDb URL for the movie.
- Year: Year of release.
- Duration: Movie runtime (in minutes).
- MPA: Motion Picture Association rating (e.g., PG, R).
- Rating: IMDb rating (on a scale of 1–10).
- Votes: Number of user votes on IMDb.
Provides detailed movie data, including:
- Movie Link: IMDb URL for the movie.
- budget: Production budget (in USD).
- grossWorldWide: Total global box office earnings.
- gross_US_Canada: North American box office earnings.
- opening_weekend_Gross: Opening weekend revenue.
- directors: List of directors.
- writers: List of writers.
- stars: Main cast members.
- genres: Movie genres.
- countries_origin: Countries of production.
- filming_locations: Primary shooting locations.
- production_companies: Associated production companies.
- Languages: Available languages for the movie.
- wins: Number of award wins.
- nominations: Number of award nominations.
- Oscars: Number of Oscar nominations.
- release_date: Official release date.
Combines all columns from imdb_movies_[year].csv
and advanced_movies_details_[year].csv
for a unified dataset.
This dataset is ideal for:
- Analyzing trends in the movie industry over time.
- Building predictive models for box office performance or awards.
- Developing recommendation systems based on movie attributes.
-
WebDriver Errors:
Adjust browser options:options = Options() options.add_argument("--disable-gpu") options.add_argument("--no-sandbox")
-
Rate Limiting:
- Increase
time.sleep()
delays between requests. - Use proxy rotation for high-volume scraping.
- Increase
-
Incomplete Data:
- Some older movies may lack detailed financial or award information.
For large date ranges:
- Process data in smaller batches.
- Clear DataFrame objects regularly using
del
andgc.collect()
.
We welcome contributions to improve the scraper or extend its functionality!
-
Fork this repository.
-
Create a feature branch:
git checkout -b feature-name
-
Commit changes:
git commit -m "Add feature"
-
Submit a pull request.
This project is licensed under the MIT License. See the LICENSE
file for details.
For issues, suggestions, or feature requests:
- Open an issue on GitHub.
- Reach out with ideas to improve code readability, robustness, and performance.