Re-write DataModule from scratch enabling support for Spark DataFrames, Polars, and larger than memory dataframes #402
Comments
Thank you for creating the issue. Hoping for a quick resolution and the addition of support for Spark DataFrames.
Hi @manujosephv! I am currently working on a replacement of What exactly is the use of the
That's awesome. I hope you can contribute it back here once you have it working. And cache_data is a parameter I added very recently. By default, the datamodule holds on to the raw data as attributes, and when we save the model we also save the datamodule. For very large datasets, that poses a problem. cache_data was added to let the user choose where the data is kept (in memory, on disk, or not at all). In your case, I think we can ignore that param and its functionality: if the dataset is larger than memory, this whole feature isn't needed anymore.
I will get back to you if I figure it out :) There are a few tricky parts, like the transforms, which require access to the data. I haven't looked at all the available transforms yet, but a naive option for very large datasets is to sample from the external data source to build an approximation dataset that is used to fit the transforms, then use the There are more comprehensive options, such as letting the user provide already fitted data transformation objects; however, I am not sure how well this fits with the project's philosophy of least friction.
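To make the sampling idea above concrete, here is a minimal sketch using only the Python standard library. It is not this project's API: `iter_chunks`, `reservoir_sample`, and `SampledStandardScaler` are hypothetical names, and the synthetic chunk generator stands in for whatever external source (Spark, Parquet files, a database) would actually supply the data. It assumes the source can be read twice, once to sample and once to transform.

```python
import random
import statistics
from typing import Iterable, Iterator, List


def iter_chunks(n_rows: int, chunk_size: int) -> Iterator[List[float]]:
    """Hypothetical stand-in for an external source that yields one numeric
    column in chunks, each small enough to fit in memory."""
    rng = random.Random(0)
    for start in range(0, n_rows, chunk_size):
        size = min(chunk_size, n_rows - start)
        yield [rng.gauss(50.0, 10.0) for _ in range(size)]


def reservoir_sample(chunks: Iterable[List[float]], k: int, seed: int = 0) -> List[float]:
    """Draw a uniform random sample of up to k values in a single pass
    (reservoir sampling, Algorithm R), without holding the full data."""
    rng = random.Random(seed)
    sample: List[float] = []
    seen = 0
    for chunk in chunks:
        for value in chunk:
            seen += 1
            if len(sample) < k:
                sample.append(value)
            else:
                # Replace a random slot with probability k / seen.
                j = rng.randrange(seen)
                if j < k:
                    sample[j] = value
    return sample


class SampledStandardScaler:
    """Standardization whose statistics come from a sample, so they only
    approximate the statistics of the full dataset."""

    def fit(self, sample: List[float]) -> "SampledStandardScaler":
        self.mean_ = statistics.fmean(sample)
        # Fall back to 1.0 for a constant column to avoid dividing by zero.
        self.std_ = statistics.pstdev(sample) or 1.0
        return self

    def transform(self, chunk: List[float]) -> List[float]:
        return [(v - self.mean_) / self.std_ for v in chunk]


# Pass 1: fit the transform on a sample of the external data.
sample = reservoir_sample(iter_chunks(100_000, 10_000), k=5_000)
scaler = SampledStandardScaler().fit(sample)

# Pass 2: stream the data again and transform it one chunk at a time.
transformed_chunks = (scaler.transform(c) for c in iter_chunks(100_000, 10_000))
first_chunk = next(transformed_chunks)
```

The trade-off is that the fitted statistics are approximate. Transforms that support incremental fitting could instead be updated chunk by chunk, at the cost of a full pass over the data before training.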
Is your feature request related to a problem? Please describe.
When the data is quite large, we often need to work with larger-than-RAM datasets. Also, using an engine like Polars will speed things up a lot.
Describe the solution you'd like
Re-write the DataModule to be more performant. Out-of-core processing with Spark DataFrames or Polars, combined with NVTabular, might be a good solution.
Describe alternatives you've considered
Currently it's impossible to load larger-than-memory datasets.