This project sets up an ETL pipeline to load Citibike trip data from an AWS S3 bucket into Snowflake. It establishes a secure integration with S3, defines a CSV file format, stages the data, and loads it into a Snowflake table for analysis.

Citibike Data Integration and ETL Pipeline

This repository contains SQL scripts and configurations for a data integration and ETL pipeline that loads Citibike trip data from an Amazon S3 bucket into a Snowflake database. This pipeline uses Snowflake's external storage integration and staging capabilities to access, format, and transform Citibike data for analysis.

Features

  • AWS S3 and Snowflake Integration: Securely connects to a designated S3 bucket using Snowflake's STORAGE INTEGRATION and IAM roles, ensuring a reliable and efficient data connection.
  • Data Staging and File Formatting: Defines a reusable file format for consistent parsing of CSV files, including error handling for null values and missing data.
  • Automated Data Loading: Copies raw CSV data from the S3-backed stage into a structured Snowflake table using the COPY INTO command.
  • Scalable ETL Process: Easily scalable for additional Citibike datasets or other data formats, supporting future data ingestion and transformations.
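
The S3 connection described above can be sketched with Snowflake's STORAGE INTEGRATION DDL. This is a minimal example, not the repository's exact script: the integration name, AWS account ID, role name, and bucket path are placeholders you would replace with your own values.

```sql
-- Create a storage integration that lets Snowflake assume an IAM role
-- with read access to the Citibike bucket (names/ARNs are placeholders).
CREATE OR REPLACE STORAGE INTEGRATION s3_citibike_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<account_id>:role/<snowflake_s3_role>'
  STORAGE_ALLOWED_LOCATIONS = ('s3://<your-bucket>/citibike/');

-- Retrieve the external ID and IAM user ARN Snowflake generates;
-- these go into the IAM role's trust policy on the AWS side.
DESC INTEGRATION s3_citibike_integration;
```

After running DESC INTEGRATION, the STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID values must be added to the trust relationship of the IAM role so that Snowflake can assume it.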

Table Structure

The pipeline loads data into a trips table, which includes columns for trip duration, start and end times, station details, bike IDs, and user demographic information. This structured format enables quick querying and supports further analytics.
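
As a rough sketch, a trips table matching the classic Citibike CSV layout might look like the following. The exact column names and types in this repository's scripts may differ; this is an illustration of the structure described above, not the canonical DDL.

```sql
-- Illustrative schema for the trips table (column names assumed
-- from the standard Citibike trip-data CSV layout).
CREATE OR REPLACE TABLE trips (
  tripduration            INTEGER,        -- trip length in seconds
  starttime               TIMESTAMP,
  stoptime                TIMESTAMP,
  start_station_id        INTEGER,
  start_station_name      STRING,
  start_station_latitude  FLOAT,
  start_station_longitude FLOAT,
  end_station_id          INTEGER,
  end_station_name        STRING,
  end_station_latitude    FLOAT,
  end_station_longitude   FLOAT,
  bikeid                  INTEGER,
  usertype                STRING,         -- e.g. Subscriber / Customer
  birth_year              INTEGER,
  gender                  INTEGER         -- encoded demographic field
);
```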

Usage

  1. Configure Storage Integration: Ensure Snowflake has permissions to access the S3 bucket by configuring the STORAGE_AWS_ROLE_ARN with appropriate IAM roles.
  2. Run SQL Scripts: Execute the SQL scripts in the repository to create the integration, file format, stage, and table, then load the data from S3.
  3. Query the Data: Run queries on the trips table to analyze Citibike trip data.
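
The steps above can be sketched end to end as follows. Object names (the file format, stage, and bucket path) are placeholders assumed for illustration; substitute the names used in the repository's scripts.

```sql
-- 1. Reusable CSV file format with null handling, as described in Features.
CREATE OR REPLACE FILE FORMAT csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1
  NULL_IF = ('NULL', '')
  FIELD_OPTIONALLY_ENCLOSED_BY = '"';

-- 2. External stage pointing at the bucket via the storage integration.
CREATE OR REPLACE STAGE citibike_stage
  URL = 's3://<your-bucket>/citibike/'
  STORAGE_INTEGRATION = s3_citibike_integration
  FILE_FORMAT = csv_format;

-- 3. Load the staged CSV files into the trips table.
COPY INTO trips
  FROM @citibike_stage
  ON_ERROR = 'CONTINUE';

-- 4. Example analysis query against the loaded data.
SELECT start_station_name, COUNT(*) AS trip_count
FROM trips
GROUP BY start_station_name
ORDER BY trip_count DESC
LIMIT 10;
```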
