Utilize cloud tools to create database, storage, and run analysis on amazon reviews.
- Spark, SQL, AWS, RDS, S3, PostgresSQL. JDBC, HDFS, pgAdmin
AWS Setup
- Create a new RDS for project
- Created with public access
- Setup Security group rules
- PostgreSQL database
- Create a new Bucket for project
- Enable ACLs
- Modify access of bucket
- Upload data and make it readable for importing into project
Google Colab Setup
# Spark
spark = SparkSession.builder.appName("AWS-Amazon-Reviews").config("spark.driver.extraClassPath","/content/postgresql-42.2.16.jar").getOrCreate()
# Get files
url = "https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz"
spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get("amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz"), sep="\t", header=True, inferSchema=True)
df.show()
Spark
- Review ID Table
review_id_df = df.select(["review_id", "customer_id", "product_id", "product_parent", to_date("review_date", 'yyyy-MM-dd').alias("review_date")])
- Product ID Table
products_df = df.select(["product_id", "product_title"])
products_df = df.select(["product_id", "product_title"]).drop_duplicates()
- Customer Table
customers_df = df.groupby("customer_id").agg({"customer_id": "count"}).withColumnRenamed("count(customer_id)", "customer_count")
- Vine Table
vine_df = df.select(["review_id", "star_rating", "helpful_votes", "total_votes", "vine", 'verified_purchase'])
Database
- Create schema for data
CREATE TABLE review_id_table (
review_id TEXT PRIMARY KEY NOT NULL,
customer_id INTEGER,
product_id TEXT,
product_parent INTEGER,
review_date DATE
);
CREATE TABLE products_table (
product_id TEXT PRIMARY KEY NOT NULL UNIQUE,
product_title TEXT
);
CREATE TABLE customers_table (
customer_id INT PRIMARY KEY NOT NULL UNIQUE,
customer_count INT
);
CREATE TABLE vine_table (
review_id TEXT PRIMARY KEY,
star_rating INTEGER,
helpful_votes INTEGER,
total_votes INTEGER,
vine TEXT,
verified_purchase TEXT
);
- Use JDBC to write spark data to AWS database.
from config import aws_db_conn, aws_username, aws_password
# Configure settings for RDS
mode = "append"
jdbc_url="aws_db_conn"
config = {"user":"aws_username",
"password": "aws_password",
"driver":"org.postgresql.Driver"}
# Write review_id_df to the table in RDS
review_id_df.write.jdbc(url=jdbc_url, table='review_id_table', mode=mode, properties=config)
# Write products_df to the table in RDS
products_df.write.jdbc(url=jdbc_url, table='products_table', mode=mode, properties=config)
# Write customers_df to the table in RDS
customers_df.write.jdbc(url=jdbc_url, table='customers_table', mode=mode, properties=config)
# Write vine_df to the table in RDS
vine_df.write.jdbc(url=jdbc_url, table='vine_table', mode=mode, properties=config)
Analysis
I ran the analysis from both SQL and Spark for practice.
In order to explore bias in the dataset I want to Analyze the Vine Table by filtering the data by paid and unpaid reviews.
Vine program contributes about 1000 reviews compared to the almost 50000 reviews from unpaid reviewers.
42% of the paid reviews were 5 stars.
46% of the unpaid reviews were 5 stars.
I believe these are within acceptable range to declare no bias in the dataset.