⚙️ Automated SQL data cleaning

This repository contains an automated data cleaning project using SQL within MySQL Workbench.

Table of content

Introduction

Task automation consists of using technology to manage repetitive, mundane and time-consuming tasks in order to improve productivity. Automation aims to optimize workflows, reduce manual effort and human intervention, and improve efficiency while freeing the individual for higher value tasks.

The advantages of automating boring and time-consuming tasks are:

Clean and standardize data
Error reduction
Efficiency
Reliability
Productivity
Scalability

Goal

The main objective of this project is to automate the cleaning and preprocessing workflow of a given data set.

The goal of this automation is to ensure data quality and consistency by removing duplicates, correcting typos and standardizing data. The process is implemented using SQL stored procedures, events and triggers, ensuring that the data cleaning workflow is run automatically each time data is added to the dataset and periodically every 30 days.

Project overview

Create stored procedure
Data copy and transformation
Data cleaning
- duplicate removal
- data standardization
Data cleaning scheduling
Data cleaning in real-time
Validation

Dependencies

The following software is required to carry out this project:

MySQL Workbench 8.0 CE

Technical skills

The following skills were used throughout the implementation of this project:

Data definition
Data manipulation
Data querying

Dataset

The US-geographic-location dataset from the U.S household income project is used for this project. This dataset consists of 32292 rows and 16 columns.

The dataset for this project can be found uploaded in this repository as US-geographic-location.cvs.

Create stored procedure

A stored procedure named copy_and_clean_data was used to automate the data cleaning process.

Stored procedures in MySQL are precompiled SQL code that is stored on the database server. They allow the user to organize complex SQL queries and logic into a single unit that can be executed multiple times without having to recompile the code each time.

Data copy and transformation

This automated project is part of the US Household Income project, so the database was already created and the original datasets loaded into MySQL Workbench. To create the schema and load the dataset into the workbench, see the readme file for that project HERE.

The first step of the project was to create a new table that will contain the cleaned data. All the data contained in the original dataset is copied into the new table and a timestamp is added to keep track of when the data was processed.

Data cleaning: duplicate removal

Duplicate data were identified and removed using a subquery that assigned row numbers based on the Id and timestamp. Rows with a number greater than one were removed to maintain unique entries.

Data cleaning: data standardization

To maintain data uniformity and integrity, typos have been corrected and data have been standardized:

Fix of misspellings in state names
Standardize type names
Fix typos in place names
Convert all text entries to uppercase for consistency

Data cleaning scheduling

To schedule the data cleaning process, an SQL event name run_data_cleaning was implemented to run the cleaning procedure copy_and_clean_data every 30 days. This ensures that the data remains up-to-date and is cleaned regularly.

Data cleaning in real-time

To ensure that new data added to the data set is immediately cleaned, a SQL trigger named transfer_clean_data is implemented. This procedure ensures that all the data (recentrly added or not) in cleaned.

Validation

At the end of the project, SQL queries are provided to verify the effectiveness of the cleaning process by checking for remaining duplicates and summarizing data counts.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
README.md		README.md
SQL-automated-data-cleaning.sql		SQL-automated-data-cleaning.sql
US-geographic-location.csv		US-geographic-location.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

⚙️ Automated SQL data cleaning

Table of content

Introduction

Goal

Project overview

Dependencies

Technical skills

Dataset

Create stored procedure

Data copy and transformation

Data cleaning: duplicate removal

Data cleaning: data standardization

Data cleaning scheduling

Data cleaning in real-time

Validation

About

herrerovir/SQL-automated-data-cleaning

Folders and files

Latest commit

History

Repository files navigation

⚙️ Automated SQL data cleaning

Table of content

Introduction

Goal

Project overview

Dependencies

Technical skills

Dataset

Create stored procedure

Data copy and transformation

Data cleaning: duplicate removal

Data cleaning: data standardization

Data cleaning scheduling

Data cleaning in real-time

Validation

About

Topics

Resources

Stars

Watchers

Forks