Output npy or hdf5 file using the processor #475

ico1036 · 2021-03-26T03:24:19Z

Hi, I'm using Coffea in my physics analysis.

I'd like to know how to write npy or hdf5 files from a processor.
I'm using run_uproot_job with multiple workers.

As I understand it, the accumulator can only accumulate histograms and write histogram output.
Is there a way to output arrays as npy, hdf5, or other formats?

Thanks

Answered by kondratyevd

Mar 26, 2021

nsmith- · 2021-03-26T18:16:12Z

Maintainer

@kondratyevd contributed a feature in #368 that lets run_uproot_job optionally return a Dask dataframe, which can then be written to a file or set of files in a variety of formats, e.g. with to_hdf.

kondratyevd · 2021-03-26T18:32:07Z

Hi @ico1036,
if you can convert the outputs of your processor to pandas DataFrames, you should be able to use the Dask executor with the argument use_dataframes=True.

The output will be a distributed Dask dataframe.
If you want to keep working with it, or print it as a single dataframe, you will also need to call output.compute() after you retrieve the output from run_uproot_job. Otherwise, you can save chunks of the output dataframe directly as Parquet files using dd.to_parquet(df=output, path=...), where path is an output directory of your choice.
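
Since the original question asked about npy output, here is a minimal sketch of the last step, assuming output.compute() has already returned an ordinary pandas DataFrame. Nothing in it is coffea-specific: the helper events_to_frame, the column names dataset, muon_pt and muon_eta, and the file name events.npy are all hypothetical, and the small frame built here only stands in for the real job output.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def events_to_frame(dataset, muon_pt, muon_eta):
    """Pack flat per-event arrays into a DataFrame with one row per event.

    Hypothetical helper: it stands in for whatever code converts the
    processor outputs to a pandas DataFrame in a real analysis.
    """
    return pd.DataFrame(
        {
            "dataset": dataset,  # a scalar is repeated on every row
            "muon_pt": np.asarray(muon_pt, dtype=np.float64),
            "muon_eta": np.asarray(muon_eta, dtype=np.float64),
        }
    )


# Stand-in for the pandas DataFrame returned by output.compute().
computed = events_to_frame("ZJets", [25.1, 40.3, 31.7], [0.1, -1.2, 2.0])

# A .npy file stores one homogeneous array, so keep only the numeric
# columns and remember their order separately.
numeric_columns = ["muon_pt", "muon_eta"]
values = computed[numeric_columns].to_numpy()  # shape (n_events, 2)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "events.npy")
    np.save(path, values)
    restored = np.load(path)

# Rebuild a DataFrame from the plain array and the remembered column names.
roundtrip = pd.DataFrame(restored, columns=numeric_columns)
```

Note that a .npy file holds neither column names nor mixed types, so a string column such as dataset would have to be stored separately; formats like Parquet or HDF5 keep that information alongside the data.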

Please let me know if you run into any issues, I will be happy to help.

1 reply

kondratyevd Mar 28, 2021

@ico1036 I just remembered - I've added a test example showing how to use this setup, see here: https://github.com/CoffeaTeam/coffea/blob/master/tests/test_dask_pandas.py

ico1036 · 2021-03-30T07:10:58Z

Author

@nsmith @kondratyevd
Thank you very much for the help!
I will test it and let you know if I run into any other issues.
