IMDB Movie Data Scraper

This Python project extracts comprehensive movie data from IMDb, focusing on the top-grossing movies from 1960 to 2024. It collects detailed information, including box office performance, cast & crew, awards, and other key metrics, organizing the data in a structured format for analysis.

I uploaded the Dataset to Kaggle. Check it out.

Here is the Merged Dataset alongside several notebooks related to its study: Check this Kaggle link

Features

Data Collection

The scraper collects the following details for each year:

Basic Information: Title, year, duration, MPA rating, IMDb rating, and user votes.
Financial Data: Budget, worldwide gross, US/Canada gross, and opening weekend revenue.
Credits: Directors, writers, and main cast.
Additional Details: Genres, production countries, filming locations, production companies, and available languages.
Awards: Wins, nominations, and Oscar statistics.
Release Information: Official release dates.

Output Structure

For each year, the script generates:

imdb_movies_[year].csv: Contains basic movie information.
advanced_movies_details_[year].csv: Contains detailed movie data.
merged_movies_data_[year].csv: Combines data from the other two files for a complete dataset.

Consistent Data Structure

All files follow a consistent template across years, enabling seamless analysis. Below are the templates for each file:

imdb_movies_[year].csv:
Title, Movie Link, Year, Duration, MPA, Rating, Votes
advanced_movies_details_[year].csv:
Movie Link, budget, grossWorldWide, gross_US_Canada, opening_weekend_Gross, directors, writers, stars, genres, countries_origin, filming_locations, production_companies, Languages, wins, nominations, oscars, release_date
merged_movies_data_[year].csv:
Title, Movie Link, Year, Duration, MPA, Rating, Votes, budget, grossWorldWide, gross_US_Canada, opening_weekend_Gross, directors, writers, stars, genres, countries_origin, filming_locations, production_companies, Languages, wins, nominations, oscars, release_date

Updates

The dataset is updated annually every December to include the most recent data.

Requirements

Ensure you have the following installed:

python 3
Jupyter
selenium
beautifulsoup4
pandas

Setup

Install the required Python packages:

pip install selenium beautifulsoup4 pandas

Download a WebDriver:
- The script uses Microsoft Edge WebDriver by default.
- Modify the code to use Chrome, Firefox, or another WebDriver if preferred.
- Place the WebDriver executable in the project directory or add it to your PATH.

Usage

Run the Scraper

The scraper organizes the output into a Data directory, structured by year. For each year, it generates 3 CSV files:

imdb_movies_[year].csv: Basic movie information
advanced_movies_details_[year].csv: Detailed movie data
merged_movies_data_[year].csv: Combined dataset

Use the following commands to run the scraper:

# Run for a specific year
crawl_imdb_movies(2023)

# Run for multiple years
years_to_crawl = range(1960, 2024)
for year in years_to_crawl:
    crawl_imdb_movies(year)

Data Structure and Files

`imdb_movies_[year].csv`

Contains essential movie information, including:

Title: Movie title.
Movie Link: IMDb URL for the movie.
Year: Year of release.
Duration: Movie runtime (in minutes).
MPA: Motion Picture Association rating (e.g., PG, R).
Rating: IMDb rating (on a scale of 1–10).
Votes: Number of user votes on IMDb.

`advanced_movies_details_[year].csv`

Provides detailed movie data, including:

Movie Link: IMDb URL for the movie.
budget: Production budget (in USD).
grossWorldWide: Total global box office earnings.
gross_US_Canada: North American box office earnings.
opening_weekend_Gross: Opening weekend revenue.
directors: List of directors.
writers: List of writers.
stars: Main cast members.
genres: Movie genres.
countries_origin: Countries of production.
filming_locations: Primary shooting locations.
production_companies: Associated production companies.
Languages: Available languages for the movie.
wins: Number of award wins.
nominations: Number of award nominations.
Oscars: Number of Oscar nominations.
release_date: Official release date.

`merged_movies_data_[year].csv`

Combines all columns from imdb_movies_[year].csv and advanced_movies_details_[year].csv for a unified dataset.

Applications

This dataset is ideal for:

Analyzing trends in the movie industry over time.
Building predictive models for box office performance or awards.
Developing recommendation systems based on movie attributes.

Troubleshooting

Common Issues

WebDriver Errors:
Adjust browser options:

options = Options()
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox")

Rate Limiting:
- Increase time.sleep() delays between requests.
- Use proxy rotation for high-volume scraping.
Incomplete Data:
- Some older movies may lack detailed financial or award information.

Memory Management

For large date ranges:

Process data in smaller batches.
Clear DataFrame objects regularly using del and gc.collect().

Contributing

We welcome contributions to improve the scraper or extend its functionality!

Fork this repository.
Create a feature branch:
```
git checkout -b feature-name
```
Commit changes:
```
git commit -m "Add feature"  
```
Submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For issues, suggestions, or feature requests:

Open an issue on GitHub.
Reach out with ideas to improve code readability, robustness, and performance.

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
Data		Data
LICENSE		LICENSE
README.MD		README.MD
edgedriver.exe		edgedriver.exe
scrapping.ipynb		scrapping.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

IMDB Movie Data Scraper

Features

Data Collection

Output Structure

Consistent Data Structure

Updates

Requirements

Setup

Usage

Run the Scraper

Data Structure and Files

`imdb_movies_[year].csv`

`advanced_movies_details_[year].csv`

`merged_movies_data_[year].csv`

Applications

Troubleshooting

Common Issues

Memory Management

Contributing

License

Contact

About

Releases 2

Packages

Languages

License

RaedAddala/Scraping-IMDB

Folders and files

Latest commit

History

Repository files navigation

IMDB Movie Data Scraper

Features

Data Collection

Output Structure

Consistent Data Structure

Updates

Requirements

Setup

Usage

Run the Scraper

Data Structure and Files

imdb_movies_[year].csv

advanced_movies_details_[year].csv

merged_movies_data_[year].csv

Applications

Troubleshooting

Common Issues

Memory Management

Contributing

License

Contact

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 2

Packages 0

Languages

`imdb_movies_[year].csv`

`advanced_movies_details_[year].csv`

`merged_movies_data_[year].csv`

Packages