Log output
(Apologies, I’m using Google Colab and it doesn’t seem to respect this env var)
Issue description
The column created by scan_parquet when passed the include_file_paths argument does not get limited by the values of n_rows or a chained .head() method call if it is the only collected column.
This shows up unexpectedly in more convoluted examples where you might transform the selected file path on its own like:
What should happen in this case is that:
- you take the first file in the directory listing because of the n_rows=1
- you extract its file name from the URL with the regex
- you select just that file name column
- you collect
What you end up with is not a single-row, single-column DataFrame but a DataFrame with as many rows as the first file in that directory.
I’m also not super familiar with Google Colab’s resource limits, but when I run .unique() on this single column (which I expect to be identical per file) on just 2 files, it runs out of RAM and crashes, even with low_memory_usage set to True. Something seems off here.
Checks
Reproducible example
You’d expect such a redundant use of n_rows=1 and .head(1) to guarantee that you .collect() a DataFrame with just one row, but you’d be wrong! :-)
Unfortunately this column seems to be misbehaving and cannot be limited before collection time.
Expected behavior
I would expect head(1) to return 1 row regardless of whether it is called before or after the LazyFrame is collected.
Installed versions