
Re-write DataModule from scratch enabling support for Spark DataFrames, Polars, and larger than memory dataframes #402

Open
manujosephv opened this issue Feb 14, 2024 · 4 comments


@manujosephv
Owner

Is your feature request related to a problem? Please describe.
When the data size is quite large, we often need to work with data that does not fit in RAM. Using an engine like Polars would also speed things up considerably.

Describe the solution you'd like
Re-write the DataModule to be more performant. Out-of-core processing with Spark DataFrames, or Polars combined with NVTabular, might be a good solution. A rough sketch of this kind of access pattern is included below.

Describe alternatives you've considered
Currently it is impossible to load larger-than-memory datasets.
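
For illustration, here is a minimal sketch of the kind of out-of-core access pattern this could build on, using Polars' lazy API over Parquet. The file path and column names are hypothetical, and this is not tied to the current DataModule:

```python
import polars as pl

# Lazily scan a Parquet dataset that may be larger than RAM; nothing is
# loaded until collect() is called. (Path and column names are hypothetical.)
lazy_df = pl.scan_parquet("data/large_dataset/*.parquet")

# Filters and column selection are pushed down to the scan, so only the
# needed data is materialized.
train = (
    lazy_df
    .filter(pl.col("split") == "train")
    .select(["num_col_1", "num_col_2", "cat_col", "target"])
    .collect(streaming=True)  # streaming engine processes the data in chunks;
                              # the exact flag name depends on the Polars version
)
```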

@saankhya-mondal

Thank you for creating the issue. Hoping for a quick resolution and the addition of support for Spark DataFrames.

@huylenguyen

Hi @manujosephv! I am currently working on a replacement for TabularDataModule for my own use case, loading larger-than-memory datasets from external sources, and I have a question related to this issue.

What exactly is the purpose of the cache_data functionality? The other parameters here are well documented, but the use of the cache is a bit unclear to me. Is it to avoid performing the data transformations repeatedly during training? If so, are there any benchmark results comparing the performance drawbacks of performing each data transformation for each batch during training?

@manujosephv
Owner Author

That's awesome. I hope you can contribute it back here once you have it working...

cache_data is a parameter I added very recently. By default, the datamodule holds on to the raw data as attributes, and when saving the model we also save the datamodule. For very large datasets, that poses a problem.

cache_data was added to let the user choose where to keep the data (in memory, on disk, or not at all).

In your case, I think we can ignore that parameter and its functionality, because if the dataset is assumed not to fit in memory, this whole mechanism isn't needed anymore.
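To illustrate the idea behind cache_data (this is a loose, hypothetical sketch of the pattern described above, not the actual PyTorch Tabular API; the class and parameter handling are made up for illustration): the raw frame can be kept in memory, spilled to disk, or dropped entirely so that it is not pickled along with the model.

```python
from pathlib import Path
import pandas as pd


class CachingDataModuleSketch:
    """Hypothetical sketch of the cache_data idea: 'memory', a disk path, or None."""

    def __init__(self, train: pd.DataFrame, cache_data="memory"):
        self.cache_data = cache_data
        if cache_data == "memory":
            # Held as an attribute, so it gets saved together with the model.
            self.train = train
        elif cache_data is None:
            # Nothing retained; smallest possible checkpoint.
            self.train = None
        else:
            # Spill to disk and keep only the path (requires a parquet engine
            # such as pyarrow).
            path = Path(cache_data) / "train.parquet"
            path.parent.mkdir(parents=True, exist_ok=True)
            train.to_parquet(path)
            self.train_path = path
            self.train = None
```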

@huylenguyen

huylenguyen commented Apr 25, 2024

I will get back to you if I figure it out :)

There are a few tricky parts, like the transforms, which require access to the data. I haven't looked at all the available transforms yet, but a naive option for very large datasets is to sample from the external data source to build an approximation dataset that is used to fit the transforms, then call .transform() when the data stream is consumed by the DataLoader during training (a rough sketch is below). Depending on the sampling, the approximation dataset may not be representative, but that is up to the user to decide.

There are more comprehensive options, such as letting the user provide already-fitted data transformation objects; however, I am not sure how well that fits with this project's philosophy of least friction.
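
To make the sampling idea above concrete, here is a rough sketch. The paths, column names, and the sampling strategy (first row group only) are hypothetical, and this is not tied to the existing TabularDataModule: fit the transform on a sampled approximation dataset, then apply .transform() batch-by-batch as the DataLoader streams the data.

```python
import numpy as np
import pyarrow.parquet as pq
import torch
from sklearn.preprocessing import StandardScaler
from torch.utils.data import DataLoader, IterableDataset

# 1. Fit the transform on a sampled "approximation" dataset
#    (crude sample: first row group only; path is hypothetical).
source = pq.ParquetFile("data/large_dataset.parquet")
sample = source.read_row_group(0).to_pandas()
scaler = StandardScaler().fit(sample[["f1", "f2"]])


# 2. Stream the full dataset batch-by-batch and apply the fitted transform.
class StreamingTabularDataset(IterableDataset):
    def __init__(self, parquet_path, scaler, batch_size=1024):
        self.parquet_path = parquet_path
        self.scaler = scaler
        self.batch_size = batch_size

    def __iter__(self):
        pf = pq.ParquetFile(self.parquet_path)
        for batch in pf.iter_batches(
            batch_size=self.batch_size, columns=["f1", "f2", "target"]
        ):
            df = batch.to_pandas()
            x = self.scaler.transform(df[["f1", "f2"]]).astype(np.float32)
            y = df["target"].to_numpy(dtype=np.float32)
            yield torch.from_numpy(x), torch.from_numpy(y)


# batch_size=None because the dataset already yields pre-batched tensors.
loader = DataLoader(
    StreamingTabularDataset("data/large_dataset.parquet", scaler), batch_size=None
)
```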
