
Re-write DataModule from scratch enabling support for Spark DataFrames, Polars, and larger than memory dataframes #402

Open
manujosephv opened this issue Feb 14, 2024 · 4 comments


@manujosephv
Owner

Is your feature request related to a problem? Please describe.
When the data size is quite large, we often need to work with data that does not fit in RAM. Using an engine like Polars would also speed things up considerably.

Describe the solution you'd like
Re-write the DataModule to be more performant. Out-of-core processing with Spark DataFrames, or Polars combined with NVTabular, might be a good solution. A rough sketch of this kind of access pattern is included below.

Describe alternatives you've considered
Currently it is impossible to load larger-than-memory datasets.
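
For illustration, here is a minimal sketch of the kind of out-of-core access pattern this could build on, using Polars' lazy API over Parquet. The file path and column names are hypothetical, and this is not tied to the current DataModule:

```python
import polars as pl

# Lazily scan a Parquet dataset that may be larger than RAM; nothing is
# loaded until collect() is called. (Path and column names are hypothetical.)
lazy_df = pl.scan_parquet("data/large_dataset/*.parquet")

# Filters and column selection are pushed down to the scan, so only the
# needed data is materialized.
train = (
    lazy_df
    .filter(pl.col("split") == "train")
    .select(["num_col_1", "num_col_2", "cat_col", "target"])
    .collect(streaming=True)  # streaming engine processes the data in chunks;
                              # the exact flag name depends on the Polars version
)
```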

@saankhya-mondal

Thank you for creating the issue. Hoping for a quick resolution and the addition of support for Spark DataFrames.

@huylenguyen

Hi @manujosephv! I am currently working on a replacement for TabularDataModule for my own use case, loading larger-than-memory datasets from external sources, and I have a question related to this issue.

What exactly is the purpose of the cache_data functionality? The other parameters here are well documented, but the use of the cache is a bit unclear to me. Is it to avoid performing the data transformations repeatedly during training? If so, are there any benchmark results comparing the performance drawbacks of performing each data transformation for each batch during training?

@manujosephv
Owner Author

That's awesome. I hope you can contribute it back here once you have it working...

cache_data is a parameter I added very recently. By default, the datamodule holds on to the raw data as attributes, and when saving the model we also save the datamodule. For very large datasets, that poses a problem.

cache_data was added to let the user choose where to keep the data (in memory, on disk, or not at all).

In your case, I think we can ignore that parameter and its functionality, because if the dataset is assumed not to fit in memory, this whole mechanism isn't needed anymore.
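To illustrate the idea behind cache_data (this is a loose, hypothetical sketch of the pattern described above, not the actual PyTorch Tabular API; the class and parameter handling are made up for illustration): the raw frame can be kept in memory, spilled to disk, or dropped entirely so that it is not pickled along with the model.

```python
from pathlib import Path
import pandas as pd


class CachingDataModuleSketch:
    """Hypothetical sketch of the cache_data idea: 'memory', a disk path, or None."""

    def __init__(self, train: pd.DataFrame, cache_data="memory"):
        self.cache_data = cache_data
        if cache_data == "memory":
            # Held as an attribute, so it gets saved together with the model.
            self.train = train
        elif cache_data is None:
            # Nothing retained; smallest possible checkpoint.
            self.train = None
        else:
            # Spill to disk and keep only the path (requires a parquet engine
            # such as pyarrow).
            path = Path(cache_data) / "train.parquet"
            path.parent.mkdir(parents=True, exist_ok=True)
            train.to_parquet(path)
            self.train_path = path
            self.train = None
```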

@huylenguyen

huylenguyen commented Apr 25, 2024

I will get back to you if I figure it out :)

There are a few tricky parts, like the transforms, which require access to the data. I haven't looked at all the available transforms yet, but a naive option for very large datasets is to sample from the external data source to build an approximation dataset that is used to fit the transforms, then call .transform() when the data stream is consumed by the DataLoader during training (a rough sketch is below). Depending on the sampling, the approximation dataset may not be representative, but that is up to the user to decide.

There are more comprehensive options, such as letting the user provide already-fitted data transformation objects; however, I am not sure how well that fits with this project's philosophy of least friction.
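
To make the sampling idea above concrete, here is a rough sketch. The paths, column names, and the sampling strategy (first row group only) are hypothetical, and this is not tied to the existing TabularDataModule: fit the transform on a sampled approximation dataset, then apply .transform() batch-by-batch as the DataLoader streams the data.

```python
import numpy as np
import pyarrow.parquet as pq
import torch
from sklearn.preprocessing import StandardScaler
from torch.utils.data import DataLoader, IterableDataset

# 1. Fit the transform on a sampled "approximation" dataset
#    (crude sample: first row group only; path is hypothetical).
source = pq.ParquetFile("data/large_dataset.parquet")
sample = source.read_row_group(0).to_pandas()
scaler = StandardScaler().fit(sample[["f1", "f2"]])


# 2. Stream the full dataset batch-by-batch and apply the fitted transform.
class StreamingTabularDataset(IterableDataset):
    def __init__(self, parquet_path, scaler, batch_size=1024):
        self.parquet_path = parquet_path
        self.scaler = scaler
        self.batch_size = batch_size

    def __iter__(self):
        pf = pq.ParquetFile(self.parquet_path)
        for batch in pf.iter_batches(
            batch_size=self.batch_size, columns=["f1", "f2", "target"]
        ):
            df = batch.to_pandas()
            x = self.scaler.transform(df[["f1", "f2"]]).astype(np.float32)
            y = df["target"].to_numpy(dtype=np.float32)
            yield torch.from_numpy(x), torch.from_numpy(y)


# batch_size=None because the dataset already yields pre-batched tensors.
loader = DataLoader(
    StreamingTabularDataset("data/large_dataset.parquet", scaler), batch_size=None
)
```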
