This repo is dedicated to testing out data technologies and showcasing my proficiency in building various types of data pipelines.
To start, I will explore simpler use cases that combine technologies I have varying amounts of experience with. This will let me learn the nuances and functionality of technologies I have less experience with (e.g., streaming data use cases) while also learning how to piece them together with technologies I have more experience with (e.g., batch data processing).
I will run these pipelines on cloud infrastructure to simulate "production" conditions as closely as possible.
I plan to apply what I learn to more complex use cases (including domains I am generally interested in), which will be placed in separate repos.
All of these pipelines will be guided by simulated "business use cases" that might be posed to a data engineer by a product organization, team of analysts, etc.
One option for generating pseudo-real data (realistic-looking data that is synthetically generated): EventSim
This section will be updated as I build out each of the pipelines:
- Kafka Spark Streaming Pipeline with data from Coinbase API