Data Engineering Challenges


Assignment Overview

Scenario A: Input from Kafka, Output to Amazon S3 [Selected]

  • Input Source: Apache Kafka
  • Output Destination: Amazon S3
  • Daily Data Volume: Approximately 400-500 GiB
  • Task: Ingest low-latency streaming data from Kafka, process it, and store the transformed data in Amazon S3.
  • Pipeline Diagram: see the architecture image in the repository (a minimal pipeline sketch also follows this list).
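
The README does not spell out the processing engine in text, so the following is only a minimal sketch of the Kafka-to-S3 flow, assuming Spark Structured Streaming; the broker address, topic name, and bucket paths are hypothetical placeholders.

from pyspark.sql import SparkSession

# Sketch only: broker, topic, and bucket names below are placeholders.
spark = SparkSession.builder.appName("kafka-to-s3").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "user-events")
       .load()
       .selectExpr("CAST(value AS STRING) AS json"))

# The parse/flatten step described in Part 1 would sit here before writing.
(raw.writeStream
 .format("parquet")
 .option("path", "s3a://events-bucket/user_actions/")
 .option("checkpointLocation", "s3a://events-bucket/_checkpoints/")
 .start()
 .awaitTermination())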

Part 1: Data Pipeline Implementation (Done)

Data Description

  • Data Type: User action events.

Data Fields

  • timestamp: The time at which the event occurred.
  • event_name: The type of the event (e.g., click, view, purchase).
  • user_identifiers:
    • platform: The platform used by the user (e.g., web, mobile).
    • id: Unique identifier for the user.
  • product_information:
    • name: Name of the product involved in the event.
    • SKU: Stock Keeping Unit identifier of the product.

Sample Data

{
  "timestamp": "2023-10-23T12:34:56Z",
  "event_name": "view",
  "user_identifiers": {
    "platform": "web",
    "id": "user_12345"
  },
  "product_information": {
    "name": "Wireless Mouse",
    "SKU": "SKU-1001"
  }
}
{
  "timestamp": "2023-10-23T12:35:10Z",
  "event_name": "click",
  "user_identifiers": {
    "platform": "mobile",
    "id": "user_67890"
  },
  "product_information": {
    "name": "Bluetooth Headphones",
    "SKU": "SKU-2002"
  }
}
{
  "timestamp": "2023-10-23T12:36:05Z",
  "event_name": "purchase",
  "user_identifiers": {
    "platform": "web",
    "id": "user_12345"
  },
  "product_information": {
    "name": "Mechanical Keyboard",
    "SKU": "SKU-3003"
  }
}
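
For local testing, events of this shape can be published to Kafka with a few lines of Python. A minimal sketch, assuming the kafka-python client and a hypothetical user-events topic (the repository's actual producer may differ):

import json
from datetime import datetime, timezone
from kafka import KafkaProducer  # kafka-python

# Broker address and topic name are placeholders, not taken from the repo.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    "event_name": "view",
    "user_identifiers": {"platform": "web", "id": "user_12345"},
    "product_information": {"name": "Wireless Mouse", "SKU": "SKU-1001"},
}
producer.send("user-events", event)
producer.flush()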

Data Processing

  • Transformation Requirements:
    • Parsing and Validation: each record is validated to ensure it contains all of the expected fields with the data types below:
    schema = {
      "event_datetime": Datetime,
      "event_date": Date,
      "event_name": String,
      "user_id": String,
      "platform": String,
      "product_name": String,
      "SKU": String,
    }
    
    • Data Enrichment (Optional): an event_date field is derived from the timestamp during transformation
    • Handling Missing or Null Values: fields missing from a record are set to null/None (a transform sketch covering these steps follows this list)
  • Output Format:
    • Data is stored in Amazon S3 in Parquet format, partitioned by event_date (a write sketch follows the example below).
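
As noted above, here is a minimal sketch of the per-record transform in plain Python; the function name and the exact validation behavior are illustrative assumptions, not taken verbatim from the repo.

from datetime import datetime

def transform(record: dict) -> dict:
    """Flatten a raw event into the target schema; missing fields become None."""
    users = record.get("user_identifiers") or {}
    product = record.get("product_information") or {}
    ts = record.get("timestamp")
    # Validation: the timestamp must parse as ISO-8601 (e.g. 2023-10-23T12:34:56Z).
    event_dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ") if ts else None
    return {
        "event_datetime": event_dt,
        "event_date": event_dt.date() if event_dt else None,  # enrichment step
        "event_name": record.get("event_name"),
        "user_id": users.get("id"),
        "platform": users.get("platform"),
        "product_name": product.get("name"),
        "SKU": product.get("SKU"),
    }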

Example of Transformed Data

{
  "event_datetime": "2023-10-23T12:34:56Z",
  "event_date": "2023-10-23",
  "event_name": "view",
  "user_id": "user_12345",
  "platform": "web",
  "product_name": "Wireless Mouse",
  "SKU": "SKU-1001"
}
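
To produce the partitioned layout described above, one option is pyarrow's dataset writer; a minimal sketch with a hypothetical bucket name (the repo may use a different writer):

import pyarrow as pa
import pyarrow.parquet as pq

# One already-transformed record, matching the example above.
records = [{
    "event_datetime": "2023-10-23T12:34:56Z",
    "event_date": "2023-10-23",
    "event_name": "view",
    "user_id": "user_12345",
    "platform": "web",
    "product_name": "Wireless Mouse",
    "SKU": "SKU-1001",
}]

# Writes to .../event_date=2023-10-23/<file>.parquet, one folder per day.
pq.write_to_dataset(
    pa.Table.from_pylist(records),
    root_path="s3://events-bucket/user_actions/",  # hypothetical bucket
    partition_cols=["event_date"],
)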

Optional Step: Monitoring and Alerting (Done)

  • Tools and Mechanisms:
    • Monitoring:
      • Prometheus (metrics collection; dashboards shown in the repository screenshots)
      • Grafana (visualization; dashboards shown in the repository screenshots)
    • Alerting:
      • Telegram notifications with success/error counts (screenshot in the repository; a counter-and-alert sketch follows below)
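
A minimal sketch of how such counters and a Telegram alert could be wired up, using prometheus_client and the Telegram Bot API; the metric names, port, token, and chat id are hypothetical:

import requests
from prometheus_client import Counter, start_http_server

# Hypothetical metric names; the dashboards above may track different ones.
RECORDS_OK = Counter("records_processed_total", "Successfully processed records")
RECORDS_ERR = Counter("records_failed_total", "Records that failed validation")

start_http_server(8000)  # expose /metrics for Prometheus to scrape

# In the pipeline loop: RECORDS_OK.inc() on success, RECORDS_ERR.inc() on failure.

def send_telegram_alert(token: str, chat_id: str, success: int, errors: int) -> None:
    """Post a success/error summary to a Telegram chat via the Bot API."""
    requests.post(
        f"https://api.telegram.org/bot{token}/sendMessage",
        json={"chat_id": chat_id,
              "text": f"Pipeline run finished: {success} ok, {errors} errors"},
    )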


Part 2: Infrastructure Deployment (Done)

  • Tasks:
    • Airflow setup on Kubernetes (deployment results shown in the repository screenshots; a DAG sketch follows this list)
    • CI/CD integration (pipeline run shown in the repository screenshots)
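
The text does not show how the pipeline is scheduled inside Airflow, so this is only a minimal DAG sketch; the dag_id, schedule, and task body are assumptions (Airflow 2.x API):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_pipeline() -> None:
    # Hypothetical entry point: consume a Kafka batch, transform, write to S3.
    ...

with DAG(
    dag_id="kafka_to_s3_events",       # hypothetical id
    start_date=datetime(2023, 10, 1),
    schedule_interval="@hourly",       # assumed cadence
    catchup=False,
) as dag:
    PythonOperator(task_id="ingest_and_store", python_callable=run_pipeline)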

Additional Deployment

  • Kafka (deployment screenshot in the repository)

  • MinIO
    • Used to emulate Amazon S3 storage locally (a client sketch follows below)
    • Repository screenshots show the bucket contents after the partitioned Parquet output is stored
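
Because MinIO speaks the S3 API, the same client code can target it by pointing boto3 at a custom endpoint; a minimal sketch with placeholder endpoint, credentials, bucket, and key:

import boto3

# Endpoint, credentials, bucket, and key below are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)
s3.upload_file(
    "part-0000.parquet",
    "events-bucket",
    "user_actions/event_date=2023-10-23/part-0000.parquet",
)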

