Using Python and SQL to build ETL pipelines that clean, transform, and load datasets into a database.
- Python 3.7.6, JupyterLab 2.2.6
- PostgreSQL 12.2, pgAdmin 4.20
- Movie data sourced from IMDB and Kaggle (note: due to the size of the raw data files, they are not included in this repo)
The purpose of this project is to create a refactorable, intuitive ETL pipeline that helps automate the processing of large datasets.
- Extract
This stage covers the initial retrieval and reading of data in various formats (CSV, JSON) into a Python environment that can interpret it, as sketched below.
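A minimal extract sketch using pandas and the standard `json` module; the file paths are placeholders, since the raw IMDB/Kaggle files are not included in this repo:

```python
import json

import pandas as pd

def extract(wiki_json_path, kaggle_csv_path):
    """Read the raw source files into in-memory structures."""
    # JSON source: parsed into a list of dictionaries
    with open(wiki_json_path, mode="r") as file:
        wiki_movies = json.load(file)
    # CSV source: read into a DataFrame; low_memory=False avoids
    # mixed-type inference warnings on large files
    kaggle_metadata = pd.read_csv(kaggle_csv_path, low_memory=False)
    return wiki_movies, kaggle_metadata
```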
- Transform
This stage involves several more granular steps, including but not limited to the following (a sketch follows this list):
- Cleaning data: assessing missing values and corrupt data, fixing formatting
- Transforming data: filtering, formatting, and classifying (data types are redefined/changed to better suit analysis), then merging the sources
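A transform sketch of those steps in pandas; the column names (`imdb_id`, `budget`, `release_date`) and the 90% null threshold are assumptions for illustration, not the actual schema:

```python
import pandas as pd

def transform(wiki_movies, kaggle_metadata):
    """Clean, reclassify, and merge the raw inputs into one DataFrame."""
    wiki_df = pd.DataFrame(wiki_movies)

    # Cleaning: drop rows missing the join key, then drop columns
    # that are more than 90% null (assumed threshold)
    wiki_df = wiki_df.dropna(subset=["imdb_id"])
    keep = [col for col in wiki_df.columns if wiki_df[col].isnull().mean() < 0.9]
    wiki_df = wiki_df[keep]

    # Classifying: redefine data types to better suit analysis
    kaggle_metadata["budget"] = pd.to_numeric(kaggle_metadata["budget"], errors="coerce")
    kaggle_metadata["release_date"] = pd.to_datetime(
        kaggle_metadata["release_date"], errors="coerce"
    )

    # Merging: combine the two sources on the shared identifier
    return pd.merge(wiki_df, kaggle_metadata, on="imdb_id", suffixes=("_wiki", "_kaggle"))
```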
- Load
This stage involves connecting to a database server from the Python environment and loading the data into the appropriate tables/schemas, as shown below.
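A load sketch using SQLAlchemy against the PostgreSQL instance listed above; the database name (`movie_data`), table name (`movies`), and credentials are placeholders to adjust for your setup:

```python
from sqlalchemy import create_engine

def load(movies_df, db_password):
    """Write the cleaned DataFrame to a PostgreSQL table."""
    # Assumed connection string: local PostgreSQL on the default port
    # with a database named "movie_data"
    engine = create_engine(
        f"postgresql://postgres:{db_password}@localhost:5432/movie_data"
    )
    # if_exists="replace" drops and recreates the table each run,
    # keeping the pipeline rerunnable end to end
    movies_df.to_sql(name="movies", con=engine, if_exists="replace", index=False)
```

Note that `to_sql` requires a PostgreSQL driver such as psycopg2 to be installed alongside SQLAlchemy.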