Increase read speed by x20-100 for most data #78
Comments
Thanks @lvarriano for raising this. I was indeed suspecting it! @oschulz: this is surely something to discuss together |
I don't think writing all columns into a flat array is a good solution. For one, columns may contain different data types. Also, column entries do not always have the same size, e.g. compressed waveforms, SiPM hit lists, etc., so each row can have a different total data size. Of course this could be handled by using a vector of vectors (backed by a single flat vector) to store the rows. This would however make efficient column access impossible. Even in the "flat matrix" approach, accessing just a few columns will be slow, since it would result in highly scattered reads. Also, this approach basically amounts to writing a custom binary on-disk format. That's not a task we should take on; there are already lots of established options (Parquet, Arrow, the new ROOT RNTuple, etc.). Columnar storage is the right storage model for modern computing architectures, so the entries of one column should be as contiguous on disk and in memory as possible. A single flat-array approach can't provide this. HDF5 may not be the pinnacle of performance anymore (one hears good stuff about RNTuple, for example), but it's definitely not slow, supports memory mapping (when not using chunks) and it's very widely supported. What is not a great choice is writing per-channel datasets, though. A single dataset for each column, spanning all channels (at least of a given subsystem), would greatly reduce the number of datasets in our HDF5 files and increase the size of each dataset. It would be much more friendly to the HDF5 caching model as well. I've suggested this from the beginning, and repeatedly, but per-channel datasets were considered more user-friendly. |
Yes, in the solution Sam and I are working on, we store the data type of each column so that it can be converted back to the correct data type after reading from disk. From my perspective, it is really not a custom object - HDF5 natively supports (it seems) writing an array that contains different dtypes in each column and Each row in a We have seen since #29 that reading particular rows from Certainly, we could instead make an array that has all channels in it for a single column. I don't see a difference between making that (single variable, all channel) array and a (single channel, all variable) array. These have all the same issues that you raise, other than that the dtype for the data is a single type and is correct on disk. In this (single variable, all channel) array, the sub-array containing the variables for a single channel would not be contiguous. We could chunk so that a chunk holds either all the events for a single channel or all the channels for a single event, but there would be a significant penalty to reading event-level data with the per-channel chunking. Given that our experiment adds channels as time moves forward, it does not seem like having files with datasets whose arrays have different numbers of columns would be easy to work with. |
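For illustration, here is a minimal sketch of this by-dtype packing; the dataset names follow the scalarfloat64 / scalarint32 pattern used in the test code later in this thread, but the file path, channel name, column names, and the attribute used to record column order are placeholders, not the actual implementation:

```python
import numpy as np
import h5py

# Hypothetical hit-tier columns for one channel, one scalar per event.
cols = {
    "energy":    np.random.normal(1000, 50, 10_000),    # float64
    "timestamp": np.arange(10_000, dtype=np.float64),   # float64
    "hit_id":    np.arange(10_000, dtype=np.int32),     # int32
}

# Group the columns by dtype and remember their order, so the read side
# can split the packed arrays back into named columns.
by_type, order = {}, {}
for name, arr in cols.items():
    by_type.setdefault(arr.dtype.name, []).append(arr)
    order.setdefault(arr.dtype.name, []).append(name)

with h5py.File("bytype.lh5", "w") as f:
    for dt, arrs in by_type.items():
        ds = f.create_dataset(f"ch1027200/scalar{dt}", data=np.stack(arrs, axis=1))
        ds.attrs["columns"] = order[dt]   # column-order metadata

# Reading is now one dataset access per dtype instead of one per column.
with h5py.File("bytype.lh5", "r") as f:
    ds = f["ch1027200/scalarfloat64"]
    block = ds[:]
    restored = {name: block[:, i] for i, name in enumerate(ds.attrs["columns"])}
```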
Yes, HDF5 has a table format, but to my knowledge it uses row-based storage on disk. And I don't think it will allow for columns that contain vectors, and so on. And I very much doubt that it can handle "ragged" arrays (let alone nested ones, like we're starting to use now).
Yes, that's due to the way HDF5 caches datasets. I wonder if this is solvable using memory-mapped reads (does h5py support this?) and not using chunking. In general, we probably will want to write most files (except huge files with waveforms) unchunked.
Well, they have a completely different on-disk storage layout, for one. And in the "single channel, all variable" version, if stored flat instead of like we do it now, access to a subset of the columns would be very inefficient.
Ah - no, the channels do not become columns, of course. Instead, there would be a single additional column for the channel ID, and the current per-channel tables would be concatenated vertically. Interleaving them (first a bunch of channel 1 events, then a bunch of channel 2 events and so on, and then again a bunch of channel 1 events) would actually be best, because it would allow doing both per-channel and across-channel waveform-level operations in an efficient manner. |
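A rough sketch of what that concatenated layout could look like in memory (channel IDs and field names are made up; interleaving per-channel blocks in time order would follow the same structure, just with a different ordering of the rows):

```python
import numpy as np

# Per-channel hit tables, as we have them today (toy values).
per_channel = {
    1027200: {"energy": np.array([100.2, 215.7]), "timestamp": np.array([10.0, 11.5])},
    1027201: {"energy": np.array([87.3]),         "timestamp": np.array([10.2])},
}

# Stack the tables vertically and add one extra column with the channel ID,
# so a single dataset per column spans all channels.
channel = np.concatenate(
    [np.full(len(tbl["energy"]), ch, dtype=np.int32) for ch, tbl in per_channel.items()]
)
energy = np.concatenate([tbl["energy"] for tbl in per_channel.values()])
timestamp = np.concatenate([tbl["timestamp"] for tbl in per_channel.values()])

# Per-channel access becomes a selection on the channel column.
energy_ch = energy[channel == 1027200]
```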
Chunking is required for data compression (the whole column can be one chunk, but then you have to read in the whole thing before decompressing). It's true that we don't really need compression for anything but the waveforms, so maybe the rest doesn't need to be chunked. I think that maybe this is what h5py implements as what you are calling memory-mapped reads? https://docs.h5py.org/en/stable/high/dataset.html#multi-block-selection (I'm not familiar myself.) So I guess if we had a slightly more complicated way to store data, then this could be useful.
Given that storing them as such results in a speed up of x100 compared to what we do now, I would not call this "very inefficient" compared to our current strategy. It is very fast to access specific columns in memory compared to on disk, so I would expect that these have comparable times if you try to access 2-3 fields currently.
It sounds to me like we would need to implement the correct striding, which would then depend on the number of channels. If some channel is not present for one or a few files, or new channels are added, then the striding would change, which sounds problematic to deal with. We do almost exclusively single-channel analysis until we finally combine channels into events, so I'm not sure there would be any speed-up from combining channels into one column. Maybe I misunderstand, though. Perhaps we can make some sketches to show any proposed data layouts - I'd like to make sure I understand, and it is easy to parse the wrong thing from text. |
It is for HDF5 compression, but not for user compression. I hope that longer-term, we'll get away from the waveform compression scheme we use now and move to a custom entropy coding that's optimized for our use case.
No, memory mapped reads is this: https://juliaio.github.io/HDF5.jl/stable/#Memory-mapping I would hope that h5py supports this too, in some fashion?
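h5py does not expose a direct equivalent of HDF5.jl's memory mapping, but for a contiguous (unchunked, uncompressed) dataset one can get the same effect with numpy. A sketch, with made-up file and dataset names:

```python
import numpy as np
import h5py

fname = "l200-hit-file.lh5"                 # placeholder file name
with h5py.File(fname, "r") as f:
    ds = f["ch1027200/hit/energy"]          # placeholder dataset path
    assert ds.chunks is None                # memory mapping needs contiguous layout
    # Byte offset of the raw dataset data within the file, plus shape/dtype.
    offset, dtype, shape = ds.id.get_offset(), ds.dtype, ds.shape

# Map the raw bytes directly; pages are only pulled in when actually accessed.
energy = np.memmap(fname, mode="r", dtype=dtype, shape=shape, offset=offset)
```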
I would be very surprised if we could achieve a factor 100 speed increase, for compressed data, compared to what we do now. We also need to consider the maximum I/O speed. I assume you're currently benchmarking on a local SSD? In actual data production we'll be limited by the network I/O of the compute nodes, which has to be divided by the number of parallel I/O operations we want to run.
We would store channel information for each row, so it would be fully dynamic. I don't see us moving to such a scheme anytime soon though. |
If, in the end, we discover that HDF5 performance is limiting our analysis throughput substantially, and if this can't be fixed by using techniques like memory mapping, then we should move to another well-established and broadly supported file format, based on columnar storage. We should not roll our own low-level table storage format. |
So I don't think it's HDF5 that's slow, I think it's Python that's slow. I added some cProfile lines to Louis's code:

...
# read normal (compressed) LH5 file
start = time.time()
with cProfile.Profile() as pr:
    for ch in geds_list:
        chobj, _ = store.read(f'{ch}/hit', input_file)
stats = pstats.Stats(pr)
stats.sort_stats('tottime').print_stats(10)
print('read normal (compressed) LH5 file ', time.time() - start)

# read normal uncompressed LH5 file
start = time.time()
with cProfile.Profile() as pr:
    for ch in geds_list:
        chobj, _ = store.read(f'{ch}/hit', input_file_uncompressed)
stats = pstats.Stats(pr)
stats.sort_stats('tottime').print_stats(10)
print('read normal uncompressed LH5 file ', time.time() - start)

# read uncompressed file organized by type
start = time.time()
with cProfile.Profile() as pr:
    with h5py.File(output_file_bytype, mode='r') as f:
        for ch in geds_list:
            data = f[ch]['scalarfloat64'][:]
            data = f[ch]['scalarint32'][:]
stats = pstats.Stats(pr)
stats.sort_stats('tottime').print_stats(10)
print('read uncompressed file organized by type ', time.time() - start)
print(np.shape(data))

# read compressed file organized by type
start = time.time()
with cProfile.Profile() as pr:
    with h5py.File(output_file_bytype_compressed, mode='r') as f:
        for ch in geds_list:
            data = f[ch]['scalarfloat64'][:]
            data = f[ch]['scalarint32'][:]
stats = pstats.Stats(pr)
stats.sort_stats('tottime').print_stats(10)
print('read compressed file organized by type ', time.time() - start)

Here's the output for the "normal" compressed file.
Here's the output for the compressed file organized by type:
Notice that the |
So we just need to avoid making small files! I think we should seriously reconsider our hit data storing scheme and have all channels in a single table, at some point. |
At the event level (I think that's what you call hit level?), I do use a single table in Julia, that's worked out quite Ok so far. |
We might be able to make small files at least a bit faster...Looking at the profile, it's dominated by python dict lookups in the HDF5 groups. We might get a speedup by making the read less recursive/nested (looking up |
I investigated some of the suggestions for memory-mapping (using Python's
Reading the HDF5 attributes (data type, shape, attributes, etc.) seems slow, as @iguinn and Luke (https://indico.legend-exp.org/event/1242/#9-chunking-block-reads-and-fil) pointed out. @gipert noticed this also in #71 where some of the most time-consuming calls are to the attribute reading. I timed The memory mapping reading is much faster but comes with some limitations, namely that the data must be contiguous and thus we can't append rows to it. I was encouraged to see an Per a suggestion from @SamuelBorden, it might be possible to shave off some time on the I really have no idea why it takes so long to create LGDO objects out of
|
I tried writing a very simple version of the test where it takes a predefined LGDO structure and reads the data in directly, with fewer nested lookups. Here's that code:

def get_obj_paths(lgdo, base):
    """Crawl through the lh5 tree and find the objects we need to read in"""
    try:
        # leaf node: an LGDO Array exposes its numpy buffer as .nda
        buf = lgdo.nda
        return {base: buf}
    except AttributeError:
        # otherwise recurse into the table/struct members
        vals = {}
        for key in lgdo:
            vals.update(get_obj_paths(lgdo[key], f"{base}/{key}"))
        return vals

#print(get_obj_paths(chobj, f'{ch}/hit'))
start = time.time()
ct = 0
# read normal (compressed) LH5 file to pre-defined obj buffer
with cProfile.Profile() as pr:
    for ch in geds_list:
        lgdo2read = get_obj_paths(chobj, f'{ch}/hit')
        h5f = store.gimme_file(input_file)
        try:
            for key, buf in lgdo2read.items():
                buf.resize(len(h5f[key]))
                h5f[key].read_direct(buf)
            ct += 1
        except Exception:
            # skip channels whose table is missing an expected column
            pass
stats = pstats.Stats(pr)
stats.sort_stats('tottime').print_stats(10)
print(f'read compressed file into pre-defined buffer ({ct} tables read) ', time.time() - start)

Here's the profile:
So now, the greatest single time contributor is If we did want to go this route, I think we'd have to separate out the numpy array I/O from the attrs and the construction of the LGDO array. You could imagine splitting |
What is the total time spent on these nested lookups in a realistic use case, per file? Is it actually limiting analysis performance, esp. if files are read over the network, instead from SSD? |
I tried Louis's script on NERSC, and the difference is more pronounced. It takes 5-6 seconds to read things group-by-group and 0.1/0.4 s to read things as a structured table. I also tried adding my modification and it took 1.8 s, so the distance between this and the structured table increased by quite a bit. Finally, I also tried moving the files into

NERSC has some pages with advice:
|
I tried following some of NERSC's advice, and it made a pretty big improvement. Here's what I found:
From CFS:
So using the low level API made a big improvement (by a factor of ~7-8). When I print out the profiling information, it looks like the CPU time is very low with the low level API and almost all of the time is spent on file I/O. That said, there's still another factor of ~3 (more if we ignore decompression) to be gained by having larger datasets (either by merging columns as Louis did, or by having all channels in a single group as Oliver and Luigi have suggested). The "compact file" uses an HDF5 setting that is designed for small datasets, where you write the data along with the dataset header, and it gains a factor of 2 (which is about what NERSC suggests). However, this does not allow chunking, which is required for the built-in gzip compression, so we'd have to think about whether or not that trade-off is worth it (which is a moot question if we move to larger datasets). The low level functions are basically direct calls to the HDF5 C library (with a few added convenience functions and dunder methods). The low level functions also make all of the optimizations and tweaks possible with HDF5 available (which may or may not be the case with the more pythonic part of h5py). The documentation is here: https://api.h5py.org/. Here's what reading looked like:
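(The snippet itself did not survive the copy here; below is a rough, illustrative sketch of what a whole-dataset read through the low-level API looks like, with made-up file and dataset names, not the exact code referred to above.)

```python
import numpy as np
import h5py

# Open file and dataset through the low-level (C-like) API; names are bytes.
fid = h5py.h5f.open(b"l200-hit-file.lh5", h5py.h5f.ACC_RDONLY)
did = h5py.h5d.open(fid, b"ch1027200/hit/energy")   # placeholder dataset path

# Allocate a buffer with the dataset's shape/dtype and read everything into it.
shape = did.get_space().get_simple_extent_dims()
buf = np.empty(shape, dtype=did.dtype)
did.read(h5py.h5s.ALL, h5py.h5s.ALL, buf)

fid.close()   # remaining object handles are released when garbage-collected
```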
And here's what writing looked like (for copying the data into the compact file):
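(Likewise, the writing snippet is missing here; the following is a sketch of creating a COMPACT-layout dataset through the low-level API. With this layout the raw data lives in the object header, so no chunking or built-in compression is possible and the data must stay under the ~64 kB header limit. All names are placeholders.)

```python
import numpy as np
import h5py

data = np.arange(1000, dtype=np.float64)   # small hypothetical column

fid = h5py.h5f.create(b"compact.lh5", h5py.h5f.ACC_TRUNC)

# Dataset creation property list requesting COMPACT layout: the data is stored
# alongside the dataset header, so reading it needs no extra seek.
dcpl = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
dcpl.set_layout(h5py.h5d.COMPACT)

space = h5py.h5s.create_simple(data.shape)
dtype = h5py.h5t.py_create(data.dtype)
did = h5py.h5d.create(fid, b"energy", dtype, space, dcpl=dcpl)
did.write(h5py.h5s.ALL, h5py.h5s.ALL, data)

fid.close()
```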
TL;DR: By switching over to the low level API within h5py, we can greatly speed up our code by a factor of almost 10. A further speedup of 2x is available from using "compact" writing of small groups. A speedup of 3-4 is available from writing larger groups, either by saving tables as a single dataset or by saving all channels in a single dataset. |
I would suggest that we unchunk (via |
I think the waveforms are the only things that are worth chunking/compressing. I just tested h5repack using the same file Louis was using for testing (raw was 1.6 G, dsp was 113M, hit was 35M, all others <1M). After repacking to compact, it ended up at 112M, even though there was no gzip compression; after re-gzipping with h5repack, it actually ended up bigger (I didn't bother trying to optimize the chunking/compression parameters). The pht tier was 35M compact and 33M after re-compressing. So compression is gaining us essentially nothing with these files (which doesn't surprise me too much). My guess is the boolean arrays are the only things that are seeing a significant benefit from compression, since these are stored using 8-bits; we could easily write a compression routine to store 8-bools per char instead of 1. If we switched to the low-level API for h5py and compacted most of our data, this would be ~15x faster than our current read speeds; Louis's tests show that there's more improvement possible, but it's not nearly as big as a difference as comparing to our current scheme. For the waveforms, chunking/compressing will make a real difference; I'll be interested in seeing how your algorithm performs. Some neutron folks here at ORNL also developed an algorithm that might make an improvement for our files as well. Edit: Actually, if we are not compressing, the better comparison isn't Louis's 20x speedup with the compressed file, but the 100x speedup with the uncompressed one. It's worth noting that this test doesn't include splitting up the structure into the dict-like table, which may be noticeable, but I'd guess it would still be quite a bit faster than the compressed read. |
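The 8-bools-per-byte idea needs no custom codec on the numpy side; packbits/unpackbits already do exactly this. A small sketch (the column name is hypothetical):

```python
import numpy as np

# Hypothetical boolean quality-cut column, one flag per event.
flags = np.random.random(10_000) > 0.5

packed = np.packbits(flags)                                 # 1250 bytes instead of 10000
restored = np.unpackbits(packed, count=flags.size).astype(bool)

assert np.array_equal(flags, restored)
```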
Won't be anything fancy, just a simple fixed LPC followed by an ANS entropy coder. But since the diffs (resp. LPC residuals) of our waveforms follow very predictable distributions, I think we can get near-optimal performance. I think it can also be quite fast. Just the ANS implementation is a bit tricky; I want it very simple, to make it easy to port across languages, without getting too memory-hungry. |
@iguinn as far as I understand switching to the HDF5 low-level API has no downside and will speed up things significantly. Should we proceed with the implementation? |
@iguinn Can you post your full script or check if I've implemented what you wrote correctly? My understanding from your recent comment where you used NERSC's advice is that this only tested the read, not getting attributes or creating the LGDO objects (with the idea being that you read into the predefined buffer of an existing LGDO object, I guess). I added the code you posted to my script in an attempt to check this, but maybe I am misunderstanding how to do it correctly. I see only a ~30% increase in read speed using the low-level HDF5 read versus the h5py read. My full script is here (readtest.txt) but I removed some outputs related to memory mapped reads. For a standard (compressed) LH5 pht file, the read speed with h5py is 0.9 seconds (you show in an earlier comment that compressed and uncompressed speeds are about the same). With the low-level HDF5 read, it takes 0.6 seconds. This is comparing reads only, not collecting attributes or making LGDO objects. In addition, the h5py reads I am doing are creating new numpy arrays, not reading directly into a buffer, so there could be a little improvement there anyway. (Also, maybe you know already why the low level HDF5 call can't read 6 channels.)
|
I think one important caveat is that I haven't done any testing of the low-level API on attrs, so I'm not sure if we should expect similar gains there. In my tests, I was using the already defined lgdo structure to fetch this kind of information. Since this involves small variable-length strings, I could believe this will not be that fast. If reading the attrs cuts into the gains from using the low level API, we could also consider separating out the reading of attrs to form the LGDO structures from the actual reading of data into the structures. This would be a benefit when reading a large number of files (which is when the file read speed is a real issue). However, this would also require refactoring

For the actual reading of data into the arrays, I think we should proceed. The API is more "C-like" than pythonic and involves a little bit more boilerplate, like explicitly opening datasets within files and setting up dataspace and property structures. The overall flow of things should be very similar, though. |
TestIO.zip
It actually looks like you are getting better performance than me in just the data reading portion of h5py (I commented out the tests, but those took a factor of 2-3 longer than the low level interface for me). In fact you're getting quite a bit better performance out of h5py generally than I did (5 s vs 2 s)
My low level code does the same. I think this is somewhat unavoidable because we have to resize the buffers even if we reuse them. One possible solution would be to do what C++ vectors do, and have a capacity (the size of the buffer) that can be different from the length of the array itself, which makes resizing much quicker as long as the capacity is already larger than the resize. This would require refactoring a lot of our data structures though.
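A minimal sketch of that capacity-vs-length idea (not an existing LGDO class, just to illustrate the resize behavior being described):

```python
import numpy as np

class GrowableArray:
    """Keep a backing buffer that only ever grows and expose a view of the
    first `length` elements, so repeated resizes avoid reallocating memory."""

    def __init__(self, capacity=1024, dtype=np.float64):
        self._buf = np.empty(capacity, dtype=dtype)
        self.length = 0

    def resize(self, n):
        if n > len(self._buf):
            # grow geometrically, like a C++ vector, and copy the valid part
            new = np.empty(max(n, 2 * len(self._buf)), dtype=self._buf.dtype)
            new[: self.length] = self._buf[: self.length]
            self._buf = new
        self.length = n

    @property
    def nda(self):
        return self._buf[: self.length]
```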
My low level example fails on those channels because they are missing a column and I was too lazy to handle that case properly. |
Thanks, Ian!
I see - this is because your script uses the

I would suggest we do not make any changes to the read speed for the moment and have a discussion at the collaboration meeting. I would not tackle this problem piecemeal; we should come up with a comprehensive approach. I'll spend some time in advance of the collaboration meeting putting together a summary and proposal. |
Reading more about HDF5, I've found another thing I'd like to try, which is paged aggregation which is described here: https://docs.hdfgroup.org/hdf5/rfc/paged_aggregation.pdf. Basically what this does is combine multiple datasets into a single memory page, which seems very similar to your output by type strategy, so I think it may see similar performance. I'll try to put a test together before the meeting next week. |
That sounds interesting - is that just a proposal, or is that part of current HDF5 versions? |
It's part of current HDF5. It's also in this document https://github.com/HDFGroup/arch-doc/blob/main/An_Overview_of_the_HDF5_Library_Architecture.v2.pdf which may be more up to date. It seems like h5py has a way to use it in both the high and low level interfaces, too |
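For reference, recent h5py versions expose the paged file-space strategy and the page buffer as File keyword arguments (the fs_* options need HDF5 >= 1.10.1 and a reasonably recent h5py; page_buf_size is newer still). A sketch with made-up names and sizes:

```python
import numpy as np
import h5py

# Writing: request the PAGE file-space strategy so small datasets and metadata
# get packed together into fixed-size pages.
with h5py.File("paged.lh5", "w", fs_strategy="page", fs_page_size=65536) as f:
    f.create_dataset("ch1027200/hit/energy", data=np.random.normal(1000, 50, 10_000))

# Reading: enable a page buffer so whole pages are cached on first access.
with h5py.File("paged.lh5", "r", page_buf_size=16 * 65536) as f:
    energy = f["ch1027200/hit/energy"][:]
```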
On the Julia side, it seems one can set a "page size" - I hope that's it.
I've finally had some time to try testing some strategies for speeding things up over the last few days. Here's a few things I think we can do:
So between all of these, there may be up to an 8-fold speedup. |
Thanks, Ian! 1. is very clever and we should certainly do that. Could you post some of your test code for the other items? That could help me understand how it interacts with #94 and whether that PR is still useful. @isaackunen may be interested as well. |
Yeah, I'm definitely interested in this as well. One thing that's been hard for me to wrap my head around is what exactly are the scenarios that we care about -- and as Louis's testing for his PR showed, the performance profile is very dependent on exactly what's being done. I think it would be helpful for us to pull together a shared, representative script. I'll be out of town (and mostly out of touch) for a few weeks, but I could pick this up when I get back. |
Here's the test file. Note you will want to change the file extension to ipynb (github doesn't accept ipynb attachments). After rerunning it I got less benefit than before...Now it's looking like a factor of 3 between my second two suggestions, and it seems like the paged buffering is making less of a difference than it did earlier. Basically, I wrote a few test functions for reading in a file and then did %timeit on them. I also have a cell where I profile one of them (right now it's the LGDO one; this was how I identified the issue of how we were recursing through things, it was calling I also found that for the most part using the low level interface wasn't difficult. The two things that are significantly more annoying than the high level interface are indexing, where you have to go through the dataspace class ( |
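As an illustration of the dataspace-based indexing just mentioned, reading a row range through the low-level API looks roughly like this (file and dataset names are placeholders):

```python
import numpy as np
import h5py

start, count = 1000, 500   # read rows [start, start + count)

fid = h5py.h5f.open(b"l200-hit-file.lh5", h5py.h5f.ACC_RDONLY)
did = h5py.h5d.open(fid, b"ch1027200/hit/energy")

fspace = did.get_space()
fspace.select_hyperslab((start,), (count,))     # selection in the file
mspace = h5py.h5s.create_simple((count,))       # matching in-memory dataspace

buf = np.empty((count,), dtype=did.dtype)
did.read(mspace, fspace, buf)
fid.close()
```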
Here's another pull request with performance improvements: #100. The biggest change was changing read to use the low level API. After all of these changes, lgdo ends up taking ~900 ms for the test file (compared to ~3 s before all changes, so overall a bit less than I expected). The overall breakdown of the timing is now less than half in reading attributes, but those are still the biggest factor (it's ~350 ms for attrs, 250 ms for datasets/ndarrays, and 200 ms for navigating groups). Comparing LGDO post-changes to my bare-minimum reading, it took almost twice as long (~500 ms for the bare minimum); I'm not exactly sure where the rest of the time goes, but I think it's doing the same set of h5py operations, which would imply that it's from creating LGDO structures, allocating memory, and parsing attributes. That seems long compared to what Louis found before, though. While trying out different tweaks for hdf5, I also noticed an interesting feature that would work very well with @lvarriano's structured table above, which is virtual datasets (https://docs.hdfgroup.org/hdf5/v1_14/_v_d_s.html, https://docs.h5py.org/en/stable/vds.html). Basically, this lets you add a dataset to a file that doesn't contain its own data but instead references one or more other datasets. The idea of using this with the structured table would be to have one dataset contain all of the table data, and then for each field have a virtual dataset that references a block of the big table, but otherwise acts like a normal LGDO Array. This would have two really nice features:
The way this would be structured would be something like:
If possible it would be great to have the dataset just be bytes, and then the l_0, l_1... in the above would be based on the size of the datatype. That said, the HDF5 documentation seems to recommend against type casting; in principle, if it's stored as bytes you could do a trivial type cast, but I'm not sure if HDF5 supports this. When accessing this, if you are loading the full array, LGDO would directly read the dataset and then use numpy to sort it into different LGDO objects the way Louis described above, and when reading a single element, you could just use the virtual datasets. I haven't tried any of this though, so I'm not sure if it would work as seamlessly as I'm hoping. We're still quite far from the performance Louis was able to demonstrate above, so this could easily make another factor of >5 improvement. |
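A sketch of how the virtual-dataset idea could look in h5py (untested, like the proposal itself; names and shapes are made up, and the backing table is stored field-major here so each virtual field maps to one contiguous row of it):

```python
import numpy as np
import h5py

nrows = 10_000
fields = ["energy", "timestamp", "baseline"]   # hypothetical float64 columns

with h5py.File("vds.lh5", "w") as f:
    # One backing dataset holds the whole table, one row per field.
    f.create_dataset("hit/table", data=np.random.random((len(fields), nrows)))

    # Each field becomes a virtual dataset that just references its row of the
    # backing table ("." means the same file), but reads like a normal 1D Array.
    src = h5py.VirtualSource(".", "hit/table", shape=(len(fields), nrows))
    for i, name in enumerate(fields):
        layout = h5py.VirtualLayout(shape=(nrows,), dtype="f8")
        layout[:] = src[i]
        f.create_virtual_dataset(f"hit/{name}", layout)

with h5py.File("vds.lh5", "r") as f:
    whole_table = f["hit/table"][:]   # bulk read: one dataset access
    energy = f["hit/energy"][:]       # per-field read via the virtual dataset
```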
That's great, Ian! It's great that the lower-level API made such a big impact. I haven't had time to look very much at your script yet, but I'll try to take a look at your newer version soon. I made some timing checks with the #94 metadata pull request already, and I'll see about this new pull request when I get a chance. If the speed-up is generally so good, it probably doesn't make sense to use #94 and add such complexity - the virtual dataset idea sounds very interesting, especially if you could store attributes of the other datasets with the virtual dataset and avoid having to load them. Maybe it handles that already. Regarding my previous testing - just in case you are looking at the memory mapping times, Isaac (@isaackunen) told me that I was not doing something correctly and the times are un-physically small for the memory-mapped reads that I report (and he seems right when comparing to my SSD specs). I'm still not exactly sure where I went wrong (lazy reading of data?), but I just wanted to note it in case those are the times you are aiming for. I think the other times I reported for the different reading strategies should still be correct. |
I'm pretty sure that in Linux, memory mapping is lazy, so unless you actually read a page, the system won't actually pull it in. In my tests, I made sure to force a read and saw some speed improvement with memory mapping, but <2x. It seems possible to do better than that, but what bandwidth are you seeing? This feels like it might be higher than the disk bandwidth. -Isaac
|
After doing some more tests, I've found that the gain is not that great when reading a lot of files from NERSC. A big reason for this is that when using %timeit to test these, the files are getting cached by the DVS system, so reopening them on subsequent runs is much faster. Another big reason for this is that when reading a ton of files, we usually don't read the whole files, but a few datasets from them. Here's the timing, where I first try 6 datasets from one channel in all of DS8's calibration runs:
So basically, the above changes are really only going to make a difference if we are loading a large fraction of the data from the file, so that this loading time dominates the caching time. One thing that helped a little was opening the files before reading from them (reduced these times to 150 s and 18 s). The good news is there is a good way to drastically speed up this caching process, using the read only mount (see https://www.nersc.gov/assets/Uploads/DVS-for-NUG.pdf and https://docs.nersc.gov/performance/io/dvs/). This is done by reading from
And the other is when constructing the file:

Here's the notebook I was using to test this:

TL;DR: We should have LGDO open files without file locking in read mode, and tell people to read data from |
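A sketch of what that could look like on the reader side (the /dvs_ro path and dataset name are placeholders; locking=False needs h5py >= 3.5, and the same effect can be had by setting the HDF5_USE_FILE_LOCKING=FALSE environment variable before opening):

```python
import h5py

# Read-only DVS mount on NERSC, so pages are cached node-side (placeholder path).
path = "/dvs_ro/cfs/cdirs/legend/some-production/l200-hit-file.lh5"

# Open read-only and disable HDF5 file locking, which is unnecessary for
# read-only access and problematic on some network file systems.
with h5py.File(path, "r", locking=False) as f:
    energy = f["ch1027200/hit/energy"][:]   # placeholder dataset path
```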
An Array is a dictionary of attributes and an ndarray. It is written to disk as an HDF5 dataset and attributes. A Table is a collection of Arrays, where each Array is written separately to disk. This allows flexibility in writing new objects into a Table without having to open the Table in memory.

However, this is a very poor choice for writing small datasets to disk, because each HDF5 dataset has to be paged and read separately, as they may not be written on contiguous sectors. This means that, for a Table that contains many Arrays, each Array must incur the overhead associated with a read even if the Array is extremely small.

In LEGEND, we access and store most of our data in Arrays, including at the raw, hit, evt, and skm tiers. Only the waveforms in the raw tier are arrays of large size. In particular, at the hit tier in the latest production, we store 50 Arrays in each germanium channel. Each Array is very small (~10 kB for a single phy file), so this incurs a significant read penalty relative to data size due to how the data is organized on disk.

If, instead, the data were stored in a single array on disk, the read speed can be increased significantly. Here are the results of trying this on a LEGEND hit tier file; code follows. At the hit tier, a read of all of the germanium channels goes from taking 3.6 seconds on my laptop's SSD to taking 36 ms. I stored these data in two ndarrays of single dtype, one float64 and one int32, by promoting the Array dtypes, since the impact on storage size is unimportant for such small files. I imagine that this can have a very large impact for users on a computing cluster sharing access to data files. The actual speed-up should scale as the number of Arrays stored in the Table (at the raw tier, I saw a speed-up of a factor of 20 for non-waveform data).

For the vast majority of physics experiments, as in LEGEND, many small variables are stored per event and need to be retrieved often. This means that the current behavior and storage strategy of Table significantly impacts the utility of LGDO for other uses.

@SamuelBorden and I have been working on an intermediary LGDO object to allow for storage on disk as fixed-type arrays while maintaining the functionality of a Table in memory.