Replies: 3 comments 6 replies
-
Hi @limx0 |
Beta Was this translation helpful? Give feedback.
-
I've opened a very WIP PR in openlineage (PR: OpenLineage/OpenLineage#293, Issue: OpenLineage/OpenLineage#81) to discuss a prefect integration. Would appreciate any thoughts anyone has! |
Beta Was this translation helpful? Give feedback.
-
I'm looking at openlineage and quite happy with the idea. However, I'm wondering about integrating this into the tool. Prefect is a very good neighbor to a lot of existing tools like Dask, Ray, and many more because the level integration is just enough to get it going but doesn't get in the way. The blocks are a great way of integrating more tools into it. Another strong point I think is that you write python, some declarative like the task config and flow config, some imperative, like the implementation of the tasks. I think this makes for a great developer experience as we can debug the code locally and there's little magic involved. For openlineage support I would love to keep these two main features:
The integrations that have been made and abandoned so far offer a nice looking integration but with quite a bit of magic. |
Beta Was this translation helpful? Give feedback.
-
I'd like to have a discussion (brain dump) about data lineage in prefect; how best to implement something simple, and whether this is on the core roadmap at all (or out of scope).
Background
Prefect is a fantastic tool, and for my personal use case is only missing a couple of features related to data lineage. What exactly do I mean by "data lineage"? I'm referring to storing metadata about task and flow runs that tracks the inputs (parameters/kwargs) and outputs (results) of task and flow runs to determine the exact chain of events (in prefects case; the DAG) that produced a specific output. Some examples of projects that currently implement this kind of thing are DVC or Pachyderm. Some details about the DVC implementation can be found here. I like the use of the term
lock
to refer to an immutable instance that many projects and package managers use (DVC, poetry, npm etc). I'll use the term below in my own way (hopefully it makes sense).So why not just use these tools? Basically I think both tools require much more effort on top my own code than using just prefect (happy to elaborate more here), and I think I could incorporate this into prefect reasonably easily.
Motivation
There are many reasons for wanting data lineage (reproducibility, compliance reasons etc) but my own specific reasons are more related to data science workflows; wanting to knowing I have the "most up to date / correct" data and reducing computation time. Typically these tasks produce data that I would like to cache, and the final output from the flows is either a dataset I will use elsewhere, or a model (but I believe this has wider applicability)
A few specific (very simplified) examples:
Graph Optimisation
Currently prefect checks for cached tasks via the
target
functionality, loading values as it walks the DAG. I believe we could do some simple graph optimisations if we implemented some data lineage.For example, given a DAG:
Say we run this DAG with some input parameter
a=1
, assuming each stage is cached, and we produce a "lock" file. If we rerun this flow, we should be able to substitute any parameters we know at the start of the graph build (a=1) and walk the lock file, realising that we can totally skip/dropA
from the graph;B
is cached and all hashes will match up to this point, so no need to even load the data forA
. This may not matter in this trivial example, but we may be able to walk the majority of a huge graph using this methodology, saving lots of IO/compute.Better/faster Experimentation (branching)
Another thing I would love to be able to use confidently is branching my data like I branch my code; to be able to determine in my flow that the source code on a particular task has changed, and to run the flow and compute new data (persisting to a Result under a new hash). I think this would require munging Result locations to include hashes of the source code.
This related to graph optimisation above because the "faster" above refers to being able to fire off a flow run and only re-compute those tasks for which I have made changes in my branch.
Implementation
I think most of the above could be implemented by subclassing
FlowRunner
andTaskRunner
, but there is a question of whether this should live in prefect itself, or if it's a better fit in a separate package/implementation due to its specificity. It will need access to some sort of storage for the lock files. This could be a whole host of things (local/network store, KV store, redis etc) - TBD.I've started a WIP implementation here that includes some basic functionality and a couple of tests that I will be playing with the next few days. It's a very quick and dirty and doesn't support all of the above, but should give a little bit of an idea of what I'm thinking of.
I'd love to hear if anyone else has similar issues, ideas for implementations or any other thoughts on the above.
Beta Was this translation helpful? Give feedback.
All reactions