Skip to content

Latest commit

 

History

History
33 lines (20 loc) · 2.17 KB

File metadata and controls

33 lines (20 loc) · 2.17 KB

Building a Refined Movie Dataset: A Streamlined Approach

We have refined our approach to create a more precise and reliable movie dataset. Here's a concise overview of the steps we took to construct this new dataset.

1. Initial Dataset and Strategy

  • Original Plan: Initially, we integrated four datasets: "The Movies Industry Dataset," "IMDb 5000 Movies Multiple Genres Dataset," "IMDb 5000 Movies Dataset," and "Top 500 Movies Budget," as defined in the README for the old dataset.
  • Challenges: This approach led to suboptimal performance in our predictive models due to forced integration and potential confusion from movies with similar titles.

2. Transition to a New Dataset

  • Re-evaluation: We paused to re-evaluate our methodology, focusing on a deeper analysis of the data to improve model accuracy.
  • New Approach: We opted for a refined dataset derived solely from "The Movies Industry Dataset," excluding entries with missing values.

3. Rationale Behind the Transition

  • Performance Issues: The initial dataset's performance was hindered by lower accuracy rates due to inadequate variable selection and forced dataset integration.
  • Complexities: Merging datasets with similar movie titles added unnecessary complexity and noise.

4. Benefits of the Optimized Dataset

  • Enhanced Accuracy: The new dataset includes additional input variables such as Year, Production Company, and Votes, which enhance the accuracy of our machine learning model in predicting movie revenues.
  • Single Source: Sourcing the dataset from a single origin eliminated extraneous noise and improved data integrity.
  • Reliability: By eliminating entries with null values, we ensured the dataset's reliability before subjecting it to our predictive models.

5. Final Input and Output Features

  • Input Features: name, rating, genre, year, released, score, votes, director, writer, star, country, budget, company, runtime
  • Output Feature: gross

Our refined approach has resulted in a more robust and accurate dataset, better suited for predicting movie revenues and providing valuable insights into the cinematic world. 🎬📊