Description: this data warehouse was designed following the Inmon approach: all data is integrated into a single warehouse, from which several data marts are created for the sectors of the government system
- Data Source: multiple databases from different systems in the governmental sector (see the extraction sketch after this list)
- Medallion Architecture: refining data across layers (bronze -> silver -> gold) to improve the structure and quality of the data for better insights and analysis
- Staging Area: ensuring independence between the source databases and the data warehouse when performing transformations and aggregations
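Below is a minimal sketch of the extraction step into the staging/bronze layer, assuming a PostgreSQL source; the connection URL, credentials, table name, and output path are illustrative placeholders rather than the project's actual configuration:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("bronze_extraction")
    # JDBC driver for the assumed PostgreSQL source
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
    .getOrCreate()
)

# Read one source table over JDBC without altering its structure
residents_raw = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-db:5432/government")
    .option("dbtable", "public.residents")
    .option("user", "etl_user")
    .option("password", "etl_password")
    .load()
)

# Land the raw copy in the bronze layer, so later transformations
# never run directly against the operational source database
residents_raw.write.mode("overwrite").parquet("/data/bronze/residents")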
Every step in this project was designed as part of a data pipeline that can be automated: raw data is loaded from the sources, refined through the medallion layers to ensure data quality, and finally loaded into the warehouse and data marts, as sketched below.
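A minimal sketch of the silver and gold refinement steps with PySpark; the column names, cleaning rules, and paths are assumptions for illustration:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion_refinement").getOrCreate()

# Silver: clean and standardize the bronze copy
bronze = spark.read.parquet("/data/bronze/residents")
silver = (
    bronze.dropDuplicates(["resident_id"])
    .withColumn("full_name", F.trim(F.col("full_name")))
    .filter(F.col("resident_id").isNotNull())
)
silver.write.mode("overwrite").parquet("/data/silver/residents")

# Gold: aggregate into an analysis-ready table for a data mart
gold = silver.groupBy("district").agg(
    F.count("resident_id").alias("resident_count")
)
gold.write.mode("overwrite").parquet("/data/gold/residents_by_district")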
- Scheduler: leveraging Apache Airflow to automate the end-to-end integration process (a DAG sketch follows this list)
- Transformation: using the Apache Spark engine via the PySpark package in Python to process and aggregate information
- Environment: the process was deployed in Docker containers, including the database server and Airflow
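A minimal sketch of how the end-to-end process could be wired into an Airflow DAG; the task callables, DAG id, and schedule are illustrative assumptions, not the project's actual DAGs:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_to_bronze():
    ...  # run the JDBC extraction shown earlier

def refine_silver_gold():
    ...  # run the silver/gold PySpark refinement

def load_warehouse_and_marts():
    ...  # load gold tables into the warehouse and data marts

with DAG(
    dag_id="dwh_integration",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    bronze = PythonOperator(task_id="extract_to_bronze", python_callable=extract_to_bronze)
    refine = PythonOperator(task_id="refine_silver_gold", python_callable=refine_silver_gold)
    load = PythonOperator(task_id="load_warehouse_and_marts", python_callable=load_warehouse_and_marts)

    # Bronze extraction must finish before refinement; gold feeds the loads
    bronze >> refine >> load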
Dockerfile for Airflow and Spark
FROM apache/airflow:2.9.1-python3.11

USER root

# Install OpenJDK 17 (PySpark requires a JVM) and Ant
RUN apt-get update && \
    apt-get install -y --no-install-recommends openjdk-17-jdk ant && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Set JAVA_HOME so Spark can locate the JVM
ENV JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64/

# The official Airflow image expects pip installs to run as the airflow user
USER airflow

# Sync files from local to Docker image
COPY ./airflow/dags /opt/airflow/dags
COPY requirements.txt .

# Install Python dependencies, including the PySpark package
RUN pip install --no-cache-dir -r requirements.txt && rm -f requirements.txt
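A note on this Dockerfile: JAVA_HOME is set with ENV rather than an export inside a RUN step, because each RUN executes in its own shell and only ENV persists into the final image; the build also switches back to the airflow user before pip install, since the official Airflow image expects Python packages to be installed unprivileged.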
DAGs for data warehouse integration
DAGs for Resident data mart integration
DAGs for Time and Location integration