Analyze whether reviews from Amazon's Vine program are trustworthy
Many Amazon shoppers rely on product reviews when deciding on a purchase. Amazon makes its review datasets publicly available. However, they are quite large and can exceed what a local machine can handle. One dataset alone contains over 1.5 million rows; with over 40 datasets, this can be quite taxing on the average local computer.
The first goal for this assignment will be to perform the ETL process completely in the cloud (Google Colab) and upload a DataFrame to an RDS instance. The second goal will be to use PySpark or SQL to perform a statistical analysis of selected data.
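The ETL step described above could look roughly like the sketch below. It is only an illustration, not the notebooks' actual code: it assumes a PostgreSQL RDS instance with the PostgreSQL JDBC driver available to Spark, a tab-separated source file with a header row, and the column names `review_id`, `star_rating`, `helpful_votes`, `total_votes`, and `vine`. The function names and the table name `vine_table` are hypothetical.

```python
def jdbc_url(host: str, port: int, database: str) -> str:
    """Build a JDBC URL for the RDS endpoint (PostgreSQL is an assumption)."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def run_etl(source_path: str, url: str, user: str, password: str) -> None:
    """Extract a review file, keep the rating columns, and append them to RDS."""
    # Imported lazily so the URL helper above works without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("vine-etl").getOrCreate()

    # Extract: the file is assumed to be tab-separated with a header row.
    raw = spark.read.csv(source_path, sep="\t", header=True, inferSchema=True)

    # Transform: keep the columns needed for the Vine analysis,
    # drop incomplete rows, and make sure the rating is an integer.
    reviews = (
        raw.select("review_id", "star_rating", "helpful_votes", "total_votes", "vine")
        .dropna()
        .withColumn("star_rating", col("star_rating").cast("int"))
    )

    # Load: append the DataFrame to a table on the RDS instance.
    # "vine_table" is a hypothetical table name.
    reviews.write.jdbc(
        url=url,
        table="vine_table",
        mode="append",
        properties={
            "user": user,
            "password": password,
            "driver": "org.postgresql.Driver",
        },
    )
```

In a Colab notebook, `run_etl` would be called with the path of the downloaded review file and the connection details of the RDS instance; credentials are best kept out of the notebook itself.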
The following files are attached:
- Big_Data_Level_1.ipynb: Level 1 ETL with Luggage Reviews
- Big_Data_Level_1_2.ipynb: Level 1 ETL with Gift Card Reviews
- Big_Data_Level_2.ipynb: Big Data Analysis on Vine Reviews
- The percentage of 5-star reviews among Vine reviews is very close to that among non-Vine reviews (51% vs. 50.5%).
- Although the number of Vine reviews is fairly low, the sample still appears to represent the product. However, the average rating from Vine customers is 4.38 (standard deviation 0.78), which is much higher than the 3.77 (standard deviation 1.51) from non-Vine customers. Non-Vine ratings are clearly more spread out than Vine ratings, which suggests that Vine reviewers may be motivated to give higher ratings.
- I believe Vine customers tend to give higher ratings and to concentrate their ratings at the high end of the scale. For that reason, reviews from Vine customers are not that trustworthy to me.
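
The comparison above rests on three per-group statistics: the share of 5-star reviews, the mean rating, and the standard deviation. A minimal plain-Python sketch of that calculation follows. The rows are made-up toy data, not the actual review data, and the function name `summarize` is hypothetical; the real analysis was done with PySpark or SQL.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize(rows):
    """Group (vine_flag, star_rating) pairs by flag and summarize each group."""
    by_flag = defaultdict(list)
    for vine_flag, star_rating in rows:
        by_flag[vine_flag].append(star_rating)

    summary = {}
    for vine_flag, ratings in by_flag.items():
        summary[vine_flag] = {
            "count": len(ratings),
            "mean": mean(ratings),
            # The sample standard deviation needs at least two ratings.
            "std": stdev(ratings) if len(ratings) > 1 else 0.0,
            "five_star_share": sum(r == 5 for r in ratings) / len(ratings),
        }
    return summary


# Toy data for illustration only: "Y" marks a Vine review, "N" a non-Vine one.
toy_rows = [
    ("Y", 5), ("Y", 4), ("Y", 5), ("Y", 4),
    ("N", 5), ("N", 1), ("N", 3), ("N", 5),
]

for flag, stats in summarize(toy_rows).items():
    print(flag, stats)
```

In this toy sample both groups have the same share of 5-star ratings, yet the Vine group has a higher mean and a smaller spread, which is the same pattern described in the findings above.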