Module 16 Big Data
-
Cover what constitutes big data and how it's handled.
-
Use PySpark to perform the ETL process: extract the dataset, transform the data, connect to an AWS RDS instance, and load the transformed data into a PostgreSQL database accessed through pgAdmin.
-
Use PySpark, Pandas, or SQL to determine if there is any bias toward favorable reviews from Vine members in your dataset.
1- How many Vine reviews and non-Vine reviews were there?
There were 104,975 reviews in total.
2- How many Vine reviews were 5 stars? How many non-Vine reviews were 5 stars?
There were 52,255 five-star reviews.
3- What percentage of Vine reviews were 5 stars? What percentage of non-Vine reviews were 5 stars?
Around 50% of Vine reviews were 5 stars.
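
The three questions above can be answered with a single grouped query. The following is a minimal sketch, not the code used for this module: it runs the SQL against an in-memory SQLite database so it is self-contained, and it assumes a hypothetical table named `vine_table` with columns `star_rating` and `vine` (`Y` or `N`). The sample rows are made up for illustration and are not taken from the dataset.

```python
import sqlite3

# Hypothetical sample rows as (star_rating, vine). NOT the real dataset.
SAMPLE_ROWS = [
    (5, "Y"), (4, "Y"), (5, "Y"), (3, "Y"),
    (5, "N"), (5, "N"), (1, "N"), (4, "N"), (5, "N"), (2, "N"),
]

# One row per group: total reviews, 5-star reviews, and 5-star percentage.
# Table and column names are assumptions for this sketch.
QUERY = """
SELECT vine,
       COUNT(*) AS total_reviews,
       SUM(CASE WHEN star_rating = 5 THEN 1 ELSE 0 END) AS five_star_reviews,
       ROUND(100.0 * SUM(CASE WHEN star_rating = 5 THEN 1 ELSE 0 END)
             / COUNT(*), 2) AS five_star_pct
FROM vine_table
GROUP BY vine
ORDER BY vine;
"""


def summarize_vine_reviews(rows):
    """Return {vine_flag: (total, five_star, five_star_pct)} for the given rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE vine_table (star_rating INTEGER, vine TEXT)")
        conn.executemany("INSERT INTO vine_table VALUES (?, ?)", rows)
        return {
            vine: (total, five_star, pct)
            for vine, total, five_star, pct in conn.execute(QUERY)
        }
    finally:
        conn.close()


summary = summarize_vine_reviews(SAMPLE_ROWS)
for vine_flag, (total, five_star, pct) in summary.items():
    print(f"vine={vine_flag}: {total} reviews, {five_star} five-star ({pct}%)")
```

The `SELECT` statement uses only standard aggregate functions, so it should carry over to the PostgreSQL database with little or no change, provided the real table and column names are substituted.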
1- Based on the results, there seems to be no bias.
2- There does not appear to be any evident positive bias toward reviews in the Vine program. Statistical tests in R could be run to support this statement.
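
One test that could support or challenge this conclusion is a two-proportion z-test comparing the share of 5-star reviews among Vine and non-Vine reviews. The sketch below is written in Python rather than R to match the other tools in this module, and it is only one of several possible tests. The function name and the counts in the usage example are hypothetical and are not results from this analysis.

```python
import math


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Two-sided z-test for the difference between two proportions.

    Returns (z, p_value). Uses the pooled proportion under the null
    hypothesis that both groups share the same underlying rate.
    """
    if total_a <= 0 or total_b <= 0:
        raise ValueError("group sizes must be positive")

    p_a = successes_a / total_a
    p_b = successes_b / total_b
    pooled = (successes_a + successes_b) / (total_a + total_b)

    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        raise ValueError("pooled proportion is 0 or 1; the test is undefined")

    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts for illustration only: (5-star reviews, total reviews).
z_stat, p_val = two_proportion_z_test(48, 100, 5200, 10000)
print(f"z = {z_stat:.3f}, p = {p_val:.3f}")
```

A large p-value would be consistent with the "no bias" reading above, while a small one would suggest the 5-star rates differ between the two groups. With a small number of Vine reviews, an exact test may be a better choice than this normal approximation.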