Data lineage (simple) #4935

limx0 · 2021-09-03T01:40:44Z

I'd like to have a discussion (brain dump) about data lineage in Prefect: how best to implement something simple, and whether this is on the core roadmap at all (or out of scope).

Background

Prefect is a fantastic tool, and for my personal use case is only missing a couple of features related to data lineage. What exactly do I mean by "data lineage"? I'm referring to storing metadata about task and flow runs that tracks their inputs (parameters/kwargs) and outputs (results), so that we can determine the exact chain of events (in Prefect's case, the DAG) that produced a specific output. Some examples of projects that currently implement this kind of thing are DVC and Pachyderm. Some details about the DVC implementation can be found here. I like the term "lock", which many projects and package managers (DVC, poetry, npm etc.) use to refer to an immutable instance. I'll use the term below in my own way (hopefully it makes sense).

So why not just use these tools? Basically I think both tools require much more effort on top of my own code than using just Prefect (happy to elaborate more here), and I think I could incorporate this into Prefect reasonably easily.

Motivation

There are many reasons for wanting data lineage (reproducibility, compliance etc.) but my own specific reasons are more related to data science workflows: wanting to know I have the "most up to date / correct" data, and reducing computation time. Typically these tasks produce data that I would like to cache, and the final output from the flows is either a dataset I will use elsewhere or a model (but I believe this has wider applicability).

A few specific (very simplified) examples:

Graph Optimisation

Currently Prefect checks for cached tasks via the target functionality, loading values as it walks the DAG. I believe we could do some simple graph optimisations if we implemented some data lineage.

For example, given a DAG:

A(a) -> B(b) -> C(c)

Say we run this DAG with some input parameter a=1, assuming each stage is cached, and we produce a "lock" file. If we rerun this flow, we should be able to substitute any parameters we know at the start of the graph build (a=1) and walk the lock file, realising that we can totally skip/drop A from the graph: B is cached and all hashes match up to this point, so there is no need to even load the data for A. This may not matter in this trivial example, but we may be able to walk the majority of a huge graph using this methodology, saving lots of IO/compute.
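
A minimal sketch of that lock-file walk, under some assumptions of my own: the DAG is a plain dict in topological order, every value is JSON-serialisable, and the lock file maps each task to a hash of its inputs and a hash of its output. The names `digest`, `input_hash`, `record_run` and `plan` are hypothetical and are not part of Prefect.

```python
import hashlib
import json


def digest(value) -> str:
    """Stable hash of any JSON-serialisable value."""
    payload = json.dumps(value, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


def input_hash(task_params, upstream_output_hashes) -> str:
    """Hash of everything a task consumes: its params and upstream outputs."""
    return digest({"params": task_params, "upstream": upstream_output_hashes})


def record_run(dag, params, outputs):
    """Build a lock file (a dict here) from a completed run."""
    lock = {}
    for task, upstream in dag.items():  # dag is in topological order
        lock[task] = {
            "inputs": input_hash(
                params.get(task, {}), [lock[u]["output"] for u in upstream]
            ),
            "output": digest(outputs[task]),
        }
    return lock


def plan(dag, params, lock):
    """Decide which tasks must run and which cached results must be loaded.

    Only hashes are compared, so no task data is read while planning.
    """
    trusted = {}  # task -> output hash taken from the lock file
    to_run = []
    for task, upstream in dag.items():
        if all(u in trusted for u in upstream):
            expected = input_hash(
                params.get(task, {}), [trusted[u] for u in upstream]
            )
            entry = lock.get(task)
            if entry is not None and entry["inputs"] == expected:
                trusted[task] = entry["output"]
                continue
        to_run.append(task)
    # A cached result is loaded only if it feeds a task that has to run.
    to_load = sorted({u for t in to_run for u in dag[t] if u in trusted})
    return to_run, to_load


dag = {"A": [], "B": ["A"], "C": ["B"]}
params = {"A": {"a": 1}, "B": {"b": 2}, "C": {"c": 3}}
lock = record_run(dag, params, {"A": 10, "B": 20, "C": 30})

# Only C's parameter changes: C must run, B must be loaded, A is never touched.
to_run, to_load = plan(dag, {**params, "C": {"c": 99}}, lock)
```

This sketch is deliberately conservative: once a task has to run, everything downstream of it is scheduled to run as well, even if the recomputed output would turn out to be identical.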

Better/faster Experimentation (branching)

Another thing I would love to be able to use confidently is branching my data like I branch my code: being able to determine in my flow that the source code of a particular task has changed, and to run the flow and compute new data (persisting to a Result under a new hash). I think this would require munging Result locations to include hashes of the source code.

This relates to graph optimisation above because the "faster" refers to being able to fire off a flow run and only re-compute those tasks for which I have made changes in my branch.
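
A rough sketch of what munging a location with a source hash could look like. The helper names `source_hash` and `result_location` and the path layout are hypothetical, not Prefect API. The bytecode fallback is an assumption I added so the sketch still works where source text is unavailable (for example in a REPL).

```python
import hashlib
import inspect


def source_hash(fn) -> str:
    """Short fingerprint of a function's implementation."""
    try:
        text = inspect.getsource(fn)
    except (OSError, TypeError):
        # Source text is not always available; fall back to bytecode + constants.
        code = fn.__code__
        text = repr(code.co_code) + repr(code.co_consts)
    return hashlib.sha256(text.encode()).hexdigest()[:12]


def result_location(fn, inputs_hash: str) -> str:
    """Hypothetical layout: <task name>/<source hash>/<inputs hash>.pkl"""
    return f"{fn.__name__}/{source_hash(fn)}/{inputs_hash}.pkl"


def clean(x):
    return x + 1


location_main = result_location(clean, "abc123")


# Simulate editing the task on a branch: same name, different body.
def clean(x):
    return x + 2


location_branch = result_location(clean, "abc123")
```

With this layout the two versions of `clean` write to different locations, so data computed on a branch would not overwrite data computed on main. Hashing only the task's own source would miss changes in helper functions it calls, which a real implementation would need to decide how to handle.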

Implementation

I think most of the above could be implemented by subclassing FlowRunner and TaskRunner, but there is a question of whether this should live in Prefect itself, or whether it's a better fit for a separate package/implementation due to its specificity. It will need access to some sort of storage for the lock files. This could be a whole host of things (local/network store, KV store, redis etc.) - TBD.

I've started a WIP implementation here that includes some basic functionality and a couple of tests that I will be playing with over the next few days. It's very quick and dirty and doesn't support all of the above, but should give a little bit of an idea of what I'm thinking of.
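
Since the storage backend is still TBD, one option is to hide it behind a tiny interface so a local store can be swapped for a KV store later. The following is only a sketch under that assumption. `LockStore` and `LocalLockStore` are hypothetical names, and the one-JSON-file-per-flow layout is just an illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional, Protocol


class LockStore(Protocol):
    """Anything that can persist one lock document per flow."""

    def read(self, flow_name: str) -> Optional[dict]: ...

    def write(self, flow_name: str, lock: dict) -> None: ...


class LocalLockStore:
    """Keeps each flow's lock as a JSON file in a directory."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, flow_name: str) -> Path:
        return self.root / f"{flow_name}.lock.json"

    def read(self, flow_name: str) -> Optional[dict]:
        path = self._path(flow_name)
        if not path.exists():
            return None  # the flow has never been locked
        return json.loads(path.read_text())

    def write(self, flow_name: str, lock: dict) -> None:
        text = json.dumps(lock, indent=2, sort_keys=True)
        self._path(flow_name).write_text(text)


# Usage: round-trip a lock through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    store = LocalLockStore(tmp)
    missing = store.read("my_flow")
    store.write("my_flow", {"A": {"inputs": "h1", "output": "h2"}})
    loaded = store.read("my_flow")
```

A redis- or KV-backed store would only need the same two methods.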

I'd love to hear if anyone else has similar issues, ideas for implementations or any other thoughts on the above.

shearer12345 · 2021-09-03T08:58:20Z

Hi @limx0
Have you come across https://openlineage.io/? I came across it only last week.
Designed in the same vein as opentelemetry and already being integrated into airflow.
Wrt data versioning, if you have a data lake, you might be interested in https://projectnessie.org/ along with Apache Iceberg or https://delta.io/

5 replies

jacobdanovitch Sep 5, 2021

The hard part about using Delta/Iceberg/Hudi is that they're all dependent on Spark, which means we can't really use it from within Prefect aside from launching a shell task that runs spark-submit. This proposal would be a really nice start as an alternative.

limx0 Sep 7, 2021
Author

Hey @shearer12345 - I hadn't seen openlineage, nice find! I'm going to have a play with this and see what I think. It does mean deploying another component (but realistically that likely needs to happen under my proposal in a production setting anyway - the question is whether people would benefit from being able to just hack around with a local cache/lock files or using the builtin KV store)

I have seen nessie and delta.io, and +1 to @jacobdanovitch they're very spark and table-centric from what I can see - where do I dump my sklearn models?

What I really like about Prefect is that it doesn't get in your way and that it's "just Python", making it quite hackable. I can imagine extending my proposal in the future to support locking/hashing things that aren't necessarily data (example: my workflow depends on this Docker image for whatever reason -> I'm going to define a lock method linked to the Docker SHA so I can trigger a recalc if the image gets updated) - I wonder if we will have the flexibility to do this in something like openlineage.io?

Appreciate the comments (and I'm really just thinking aloud here - so definitely keen to keep hearing other people's thoughts)
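
To make the "lock method" idea a bit more concrete, here is a small sketch where each non-data dependency supplies its own fingerprint function. Everything here is hypothetical (`Dependency`, `snapshot`, `changed`), and the image digests are hard-coded placeholders. A real lock method would have to query the Docker daemon or a registry, which is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Dependency:
    """Something a flow depends on that is not necessarily data."""

    name: str
    fingerprint: Callable[[], str]  # returns the dependency's current identity


def snapshot(deps: Iterable[Dependency]) -> Dict[str, str]:
    """Record the current fingerprint of every dependency."""
    return {dep.name: dep.fingerprint() for dep in deps}


def changed(previous: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Names of dependencies whose fingerprint differs from the lock."""
    return sorted(
        name for name, value in current.items() if previous.get(name) != value
    )


# Placeholder digests standing in for a real image lookup.
image_digest = {"value": "sha256:aaa"}
deps = [Dependency("docker-image", lambda: image_digest["value"])]

locked = snapshot(deps)               # stored alongside the lock file
image_digest["value"] = "sha256:bbb"  # the image gets updated
needs_recalc = changed(locked, snapshot(deps))
```

Any task whose dependencies appear in `needs_recalc` would then be treated as a cache miss.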

jacobdanovitch Sep 14, 2021

@limx0 Another tool that does this in a pretty general way is Azure ML pipelines. Each pipeline is a sequence of steps with inputs and outputs, and each step only re-runs when either the code or the input data changes. That can be a typical ETL kind of job (have a step that joins two tables, and only re-run it when the code changes or one of the input tables changes), but it can also be closer to what you're saying (have a step that takes an sklearn model and a dataset and does batch inference, and only re-runs when the model, dataset, or inference code changes).

Also, Ray Workflows look like they're headed in this direction, but it's still very new.

limx0 Sep 15, 2021
Author

Hey @jacobdanovitch - Ray Workflows definitely does look interesting! I'm less keen on anything tied to a specific cloud provider, because those offerings are generally quite rigid in their implementations and not hackable to my own needs.

After doing some more thinking & investigation (playing with DVC some more) I keep coming back to wanting more control than can be achieved with existing projects (from what I see).

@shearer12345 I've been looking at openlineage.io in a little more detail and it looks like a really impressive project. I really like the flexibility of Facets and the project in general. I think I'm going to have a go at implementing a Prefect integration in my https://github.com/limx0/caching_flow_runner repo

davzucky Sep 15, 2021

We are exploring the option of using OpenLineage on our side. It looks to be a good open standard, and it already supports Apache Spark, which we are using.

The way I imagine that at the moment is the following:

Flow.lineage: On the flow there would be a new property that lets us set the underlying lineage implementation we want. We could have JSON as a default and an OpenLineage implementation. This property should also be added to the flow context so you can retrieve it from your custom task, like you do with your logger today.
Task extension: Task should be extended with a new property, Lineage (or Tracking), on which we can set some of the following inputs:
- TaskActionType: Source, Destination, Transform
- Location: Defines the action; could be a SQL query or a table name
- Action: Explains the task transform (this is your business logic)
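
The task-level metadata listed above could be sketched roughly like this. It is only an illustration of the proposal, not existing Prefect or OpenLineage API. The class name `TaskLineage`, the field names, and the JSON shape are assumptions.

```python
import json
from dataclasses import dataclass
from enum import Enum


class TaskActionType(Enum):
    SOURCE = "Source"
    DESTINATION = "Destination"
    TRANSFORM = "Transform"


@dataclass(frozen=True)
class TaskLineage:
    """Lineage metadata a task could carry (hypothetical shape)."""

    action_type: TaskActionType
    location: str  # e.g. a SQL query or a table name
    action: str    # free-text description of the business logic

    def to_json(self) -> str:
        """Serialise to JSON, the default format suggested in the proposal."""
        return json.dumps(
            {
                "action_type": self.action_type.value,
                "location": self.location,
                "action": self.action,
            },
            sort_keys=True,
        )


lineage = TaskLineage(
    action_type=TaskActionType.TRANSFORM,
    location="analytics.daily_orders",
    action="Aggregate raw orders per day",
)
payload = json.loads(lineage.to_json())
```

An OpenLineage-backed implementation would translate the same three fields into whatever event format that client expects.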

There is an implementation of OpenLineage for Airflow that could give us an idea of what we would like to extract: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow.

The current limitation with Prefect is that the extension points are limited, and I expect this is something that will need to be implemented in Prefect Core.

limx0 · 2021-09-17T00:42:03Z

Author

I've opened a very WIP PR in OpenLineage (PR: OpenLineage/OpenLineage#293, Issue: OpenLineage/OpenLineage#81) to discuss a Prefect integration. Would appreciate any thoughts anyone has!

0 replies

wmeints · 2023-01-02T06:13:13Z

I'm looking at openlineage and quite happy with the idea. However, I'm wondering about integrating this into the tool.

Prefect is a very good neighbor to a lot of existing tools like Dask, Ray, and many more because the level of integration is just enough to get it going but doesn't get in the way. The blocks are a great way of integrating more tools into it.

Another strong point, I think, is that you write Python: some of it declarative, like the task config and flow config, and some imperative, like the implementation of the tasks. I think this makes for a great developer experience as we can debug the code locally and there's little magic involved.

For openlineage support I would love to keep these two main features:

- It needs to integrate just enough so it doesn't get in the way, but I can log the required data.
- It needs to work on my machine without additional setup, so I can run unit tests and integration tests.

The integrations that have been made and abandoned so far offer a nice looking integration but with quite a bit of magic.
I personally prefer a method where I can spin up an openlineage client and then explicitly call the client to log the data I want.
I'm aware this means that we will have more boilerplate code, but it also makes integration a lot easier. And, I think it leaves
the door open to other types of lineage data logging.
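
As an illustration of the explicit style described above, here is a sketch with a stand-in client. `InMemoryLineageClient`, its `emit` method, and the event fields are hypothetical. This is not the OpenLineage client API, just the shape of "create a client, then call it yourself" in a form that runs locally with no extra setup.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class InMemoryLineageClient:
    """Stand-in client that keeps events in a list, handy for unit tests."""

    events: List[dict] = field(default_factory=list)

    def emit(self, event: dict) -> None:
        self.events.append(event)


def load_orders(client, source: str, destination: str) -> int:
    """A task body that logs its own lineage explicitly, with no magic."""
    rows = [{"id": 1}, {"id": 2}]  # stand-in for the real work
    client.emit(
        {
            "job": "load_orders",
            "inputs": [source],
            "outputs": [destination],
            "event_time": datetime.now(timezone.utc).isoformat(),
        }
    )
    return len(rows)


client = InMemoryLineageClient()
row_count = load_orders(client, "raw.orders", "clean.orders")
```

The cost is one explicit `emit` call per task, which is the extra boilerplate mentioned above. In exchange, a test can assert on `client.events` directly.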

1 reply

wmeints Jan 2, 2023

Apparently, there is a Python client for OpenLineage: https://pypi.org/project/openlineage-python/ so that's good news (for me at least).
