- Assignment Overview
- Part 1: Data Pipeline Implementation
- Part 2: Infrastructure Deployment
- Deliverables
- How to Submit the Assignment
- Additional Guidelines
- Evaluation Criteria
- Input Source: Apache Kafka
- Output Destination: Amazon S3
- Daily Data Volume: Approximately 400-500 GiB
- Task: Ingest low-latency streaming data from Kafka, process it, and store the transformed data in Amazon S3.
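For sizing, the stated daily volume can be turned into an average ingest rate. The sketch below assumes traffic is spread evenly across the day, which real event streams rarely are, so treat the result as a floor and provision for peaks separately. The function name is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_mib_per_second(gib_per_day: float) -> float:
    """Convert a daily volume in GiB into an average rate in MiB/s."""
    return gib_per_day * 1024 / SECONDS_PER_DAY


low = average_mib_per_second(400)
high = average_mib_per_second(500)
print(f"{low:.1f} to {high:.1f} MiB/s")  # prints "4.7 to 5.9 MiB/s"
```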
- Pipeline
- Data Type: User action events.
- timestamp: The time at which the event occurred.
- event_name: The type of the event (e.g., click, view, purchase).
- user_identifiers:
- platform: The platform used by the user (e.g., web, mobile).
- id: Unique identifier for the user.
- product_information:
- name: Name of the product involved in the event.
- SKU: Stock Keeping Unit identifier of the product.
{
"timestamp": "2023-10-23T12:34:56Z",
"event_name": "view",
"user_identifiers": {
"platform": "web",
"id": "user_12345"
},
"product_information": {
"name": "Wireless Mouse",
"SKU": "SKU-1001"
}
}
{
"timestamp": "2023-10-23T12:35:10Z",
"event_name": "click",
"user_identifiers": {
"platform": "mobile",
"id": "user_67890"
},
"product_information": {
"name": "Bluetooth Headphones",
"SKU": "SKU-2002"
}
}
{
"timestamp": "2023-10-23T12:36:05Z",
"event_name": "purchase",
"user_identifiers": {
"platform": "web",
"id": "user_12345"
},
"product_information": {
"name": "Mechanical Keyboard",
"SKU": "SKU-3003"
}
}
- Transformation Requirements:
- Parsing and Validation: Ensure that each record contains all fields and that each field has the data type listed below.
schema = { "event_datetime": Datetime, "event_date": Date, "event_name": String, "user_id": String, "platform": String, "product_name": String, "SKU": String, }
- Data Enrichment (Optional): Add an event_date field, derived from the event timestamp, during transformation.
- Handling Missing or Null Values: Fields that do not exist in the input record are set to null/None.
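The transformation requirements can be sketched as a single function that flattens one input event into the target schema. This is a minimal illustration rather than a reference solution: the names `transform_event`, `_parse_timestamp`, and `_as_str` are hypothetical, the timestamp format is inferred from the sample events, and treating a wrongly typed value as null is an assumption the brief does not spell out.

```python
from datetime import date, datetime, timezone
from typing import Any, Optional


def _parse_timestamp(value: Any) -> Optional[datetime]:
    """Parse a UTC timestamp like 2023-10-23T12:34:56Z; return None if invalid."""
    if not isinstance(value, str):
        return None
    try:
        parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return None
    return parsed.replace(tzinfo=timezone.utc)


def _as_str(value: Any) -> Optional[str]:
    """Keep string values; map anything else (assumption) to None."""
    return value if isinstance(value, str) else None


def transform_event(raw: dict) -> dict:
    """Flatten a raw user action event into the output schema."""
    user = raw.get("user_identifiers")
    if not isinstance(user, dict):
        user = {}
    product = raw.get("product_information")
    if not isinstance(product, dict):
        product = {}

    event_dt = _parse_timestamp(raw.get("timestamp"))
    return {
        "event_datetime": event_dt,
        # Enrichment: event_date is derived from the parsed timestamp.
        "event_date": event_dt.date() if event_dt else None,
        "event_name": _as_str(raw.get("event_name")),
        "user_id": _as_str(user.get("id")),
        "platform": _as_str(user.get("platform")),
        "product_name": _as_str(product.get("name")),
        "SKU": _as_str(product.get("SKU")),
    }
```

Every output key is always present, so a missing input field shows up as `None` rather than as an absent column, which keeps the schema stable across records.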
- Output Format:
- Data is stored in Amazon S3 in a partitioned Parquet format (partitioned by event_date).
{
"event_datetime": "2023-10-23T12:34:56Z",
"event_date": "2023-10-23",
"event_name": "view",
"user_id": "user_12345",
"platform": "web",
"product_name": "Wireless Mouse",
"SKU": "SKU-1001"
}
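One way to picture the partitioning is to group transformed records by the S3 prefix they would be written under. The sketch below assumes a Hive-style `event_date=<value>/` directory layout, which the brief does not mandate; the bucket name and base prefix are hypothetical, and the placeholder used for records without a date follows the Hive/Spark convention for null partition values.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Optional

# Placeholder for records whose event_date is null (assumed convention).
NULL_PARTITION = "__HIVE_DEFAULT_PARTITION__"


def partition_prefix(base_prefix: str, event_date: Optional[date]) -> str:
    """Build the Hive-style prefix for one event_date partition."""
    value = event_date.isoformat() if event_date else NULL_PARTITION
    return f"{base_prefix.rstrip('/')}/event_date={value}/"


def group_by_partition(records: List[dict], base_prefix: str) -> Dict[str, List[dict]]:
    """Group transformed records by the partition prefix they belong to."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        groups[partition_prefix(base_prefix, record.get("event_date"))].append(record)
    return dict(groups)


records = [
    {"event_date": date(2023, 10, 23), "event_name": "view"},
    {"event_date": date(2023, 10, 23), "event_name": "click"},
    {"event_date": date(2023, 10, 24), "event_name": "purchase"},
]
# "s3://example-bucket/events" is a made-up location for illustration.
for prefix, rows in group_by_partition(records, "s3://example-bucket/events").items():
    print(prefix, len(rows))
```

In practice a Parquet writer that supports partition columns would produce this layout itself; the grouping is shown only to make the resulting object paths concrete.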
- Tools and Mechanisms: