In dask-deltatable, when calling dd.read_parquet, perhaps we can reuse the metadata already preserved in the Delta JSON log, instead of collecting it from the Parquet files all over again.
I think adding dataset={"schema": dt.schema().to_pyarrow()} as a keyword to this read_parquet call should do the trick, though it would be nice if someone could confirm that this is the case.
I think the Delta log also contains column stats, so maybe we can avoid gathering those as well.
Here: dask-deltatable/dask_deltatable/core.py, line 196 (commit cd731a9).
It looks like dd.read_parquet will have to go through the Parquet files to read the metadata, but the DeltaTable should have all that info already.