
Movies-ETL

Utilizing Python and SQL to build out ETL pipelines that clean, transform, and load datasets into a database.

Resources

Overview

The purpose of this project is to create an intuitive, refactorable ETL pipeline that helps automate the processing of large datasets.

Primary steps & stages of the pipeline

  • Extract
    This stage involves the initial retrieval and reading of data in various formats (CSV, JSON) from within a Python environment that can parse the data (a minimal code sketch follows the screenshot below).

Screenshot: pipeline_1 (Extract stage)
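The sketch below illustrates one way the Extract stage can look: pandas reads the CSV inputs and the built-in json module reads the JSON input. The file names (movies_metadata.csv, ratings.csv, wikipedia_movies.json) are placeholder assumptions, not necessarily this repository's actual inputs.

```python
# Minimal Extract sketch -- file names are placeholders.
import json
import pandas as pd

# Read tabular data from CSV files into DataFrames.
movies_metadata = pd.read_csv("movies_metadata.csv", low_memory=False)
ratings = pd.read_csv("ratings.csv")

# Read raw JSON records into a list of dictionaries.
with open("wikipedia_movies.json", mode="r") as file:
    wiki_movies_raw = json.load(file)

print(f"{len(movies_metadata)} metadata rows, {len(wiki_movies_raw)} JSON records")
```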

  • Transform
    This stage involves several more granular steps, including but not limited to (a minimal code sketch follows the screenshot below):
    • Cleaning data: assessing missing values and any corrupt data, formatting
    • Transforming: filtering, formatting, classifying (data types are redefined/changed to better suit analysis and interpretation), and merging data

Screenshot: pipeline_2 (Transform stage)
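As an illustration of the Transform stage, the sketch below shows cleaning (dropping missing titles, coercing corrupt dates), reformatting and retyping a currency column, and merging two frames. The column names and sample values are invented for the example and are not taken from this project's data.

```python
# Minimal Transform sketch -- column names and sample data are assumptions.
import pandas as pd

raw = pd.DataFrame({
    "title": ["Movie A", "Movie B", None],
    "release_date": ["1995-10-30", "not a date", "2001-07-20"],
    "box_office": ["$1,000,000", None, "$2,500,000"],
})

# Cleaning: drop rows with a missing title; coerce corrupt dates to NaT.
clean = raw.dropna(subset=["title"]).copy()
clean["release_date"] = pd.to_datetime(clean["release_date"], errors="coerce")

# Transforming: strip currency formatting and redefine the dtype for analysis.
clean["box_office"] = (
    clean["box_office"].str.replace(r"[$,]", "", regex=True).astype(float)
)

# Merging: combine with another (assumed) dataset on the title column.
ratings = pd.DataFrame({"title": ["Movie A"], "rating": [4.5]})
merged = clean.merge(ratings, on="title", how="left")
print(merged.dtypes)
```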

  • Load
    This stage involves connecting to a database/server from the Python environment and loading the data into the appropriate tables/schemas (a minimal code sketch follows the screenshot below).

Screenshot: pipeline_4 (Load stage)
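The sketch below shows one way the Load stage can be wired up with SQLAlchemy and pandas: create an engine from a connection string and write a cleaned DataFrame to a table. The connection string, credentials, database name (movie_data), and table name (movies) are placeholders, not this project's actual configuration.

```python
# Minimal Load sketch -- connection details and table name are placeholders.
import pandas as pd
from sqlalchemy import create_engine

db_string = "postgresql://postgres:password@localhost:5432/movie_data"
engine = create_engine(db_string)

movies_df = pd.DataFrame({"title": ["Movie A"], "release_date": ["1995-10-30"]})

# Write the cleaned DataFrame into the target table, replacing prior contents.
movies_df.to_sql(name="movies", con=engine, if_exists="replace", index=False)
```

For very large inputs (e.g. a multi-gigabyte ratings file), pandas' to_sql also accepts a chunksize argument so rows can be loaded in batches rather than all at once.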