read_object() with idx parameter is slow #29
I suggest we add a warning to the documentation if we do not want to address this in the code (which would possibly increase the amount of memory required to load files). The user can always load the entire object and then perform the slice themselves. |
One thing that probably has an effect here is that when data is read from disk, it tends to be read in contiguous 4 kB (4096-byte) pages. This means that if the spacing between entries is less than one page, it still ends up reading the full file. The waveforms are big enough that they should be spread across pages, while the other fields probably use only one or two pages for the full file. So a couple of other things might be worth checking:
|
Did we ever think carefully about how to chunk datasets on disk? This is also relevant for HDF5 compression. https://docs.h5py.org/en/stable/high/dataset.html#chunked-storage |
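For context, here's a minimal sketch of how chunking is requested through h5py when a dataset is created (the dataset names, shapes and file name are made up for illustration; this is not how legend-pydataobj currently writes data):

```python
import h5py
import numpy as np

data = np.random.normal(size=(1000, 2048))  # e.g. 1000 waveforms of 2048 samples each

with h5py.File("chunking_example.h5", "w") as f:
    # no chunks/compression requested: contiguous layout
    f.create_dataset("contiguous", data=data)

    # explicit chunk shape: each chunk holds 100 full waveforms
    f.create_dataset("chunked", data=data, chunks=(100, 2048))

    # compression implies chunking; chunks=True lets h5py pick ("auto-chunk") a shape
    f.create_dataset("compressed", data=data, chunks=True, compression="gzip")
```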
It is the same slowdown. Loading just the waveforms (using the same
For reference, I changed lines in the script above to e.g. `chobj, _ = store.read_object(test_channel+'/raw', raw_file, idx=theseind, field_mask=['waveform'])`.
I used a
Again, changing line 732 in
```python
import lgdo.lh5_store as lh5
import numpy as np
from matplotlib import pyplot as plt
import time

raw_file = 'l200-p03-r003-cal-20230331T223328Z-tier_raw.lh5'
test_channel = 'ch1104002'
numtests = 10

store = lh5.LH5Store()

# time reading the entire object
times_readall = []
for i in range(numtests):
    start = time.time()
    chobj, _ = store.read_object(test_channel+'/raw', raw_file)
    end = time.time()
    times_readall.append(end-start)
print(f"read entire: {np.mean(times_readall):.5f} +- {np.std(times_readall):.5f} s")

totind = len(chobj)
allind = np.arange(totind)

# time reading by idx with all rows selected
times_readallidx = []
for i in range(numtests):
    start = time.time()
    chobj, _ = store.read_object(test_channel+'/raw', raw_file, idx=allind)
    end = time.time()
    times_readallidx.append(end-start)
print(f"read by idx, all {len(allind)} rows: {np.mean(times_readallidx):.5f} +- {np.std(times_readallidx):.5f} s")

# time reading by idx with increasing numbers and sizes of gaps in the selection
numbreaks = np.array([1, 2, 5, 10, 20])
breaksizes = np.array([1, 10, 100, 1000])
for k in range(len(breaksizes)):
    for j in range(len(numbreaks)):
        breaks = np.round(np.linspace(0, len(allind) - 1, breaksizes[k]*numbreaks[j])).astype(int)
        theseind = np.delete(allind, breaks)
        times_readsome = []
        for i in range(numtests):
            start = time.time()
            chobj, _ = store.read_object(test_channel+'/raw', raw_file, idx=theseind)
            end = time.time()
            times_readsome.append(end-start)
        print(f"read by idx, {len(theseind)} rows to read with {numbreaks[j]} breaks of size {breaksizes[k]}: {np.mean(times_readsome):.5f} +- {np.std(times_readsome):.5f} s")
```
|
I take this back - I think we should change |
From the h5py docs:
And we are not setting any chunking option! See `legend-pydataobj/src/lgdo/lh5_store.py`, lines 1097 to 1099 at commit 6901bf3.
So we should definitely do something about it (even if turning on HDF5 compression should automatically do it? But how?). Continuing:
Should we use this feature? @oschulz might also have an opinion about all this. |
We're doing something on the Julia side to give the user control over chunking, but it's not really finished yet (legend-exp/LegendHDF5IO.jl#35). In general, non-chunked files are faster to read, since memory-mapping can be used, which will be good for ML applications. For "standard" data storage, chunking makes it possible to write files bit by bit, so the writer doesn't need to have a large amount of memory available. Also, as you point out, compression (I think) always implies chunking on HDF5. |
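To illustrate the memory-mapping point, here is a sketch only (file and dataset names are placeholders, and it assumes an uncompressed, contiguous dataset): h5py's low-level `Dataset.id.get_offset()` gives the byte offset of a contiguous dataset inside the file, which numpy can then map directly without reading it:

```python
import h5py
import numpy as np

fname = "chunking_example.h5"  # placeholder file from the sketch above
with h5py.File(fname, "r") as f:
    dset = f["contiguous"]
    offset = dset.id.get_offset()        # byte offset of the raw data block in the file
    shape, dtype = dset.shape, dset.dtype
    contiguous = dset.chunks is None     # the offset is only meaningful for contiguous storage

if contiguous and offset is not None:
    # map the data lazily: only the pages actually touched are read from disk
    wf = np.memmap(fname, mode="r", dtype=dtype, offset=offset, shape=shape)
    print(wf[10, :5])
```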
So I was wrong: we do always set it (see `legend-pydataobj/src/lgdo/lh5_store.py`, line 1080 at commit 6901bf3), so as per the h5py docs all our datasets are chunked. I don't know how they are chunked (I could not find documentation). Maybe the "auto-chunking" feature is used? I still have no clue what that means. |
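For what it's worth, h5py exposes the chunk layout it ended up using, so one way to answer "how are they chunked" is to inspect a written file. A sketch (the internal dataset path is a guess at the LH5 waveform layout, not a confirmed path):

```python
import h5py

with h5py.File("l200-p03-r003-cal-20230331T223328Z-tier_raw.lh5", "r") as f:
    dset = f["ch1104002/raw/waveform/values"]  # hypothetical path to the waveform samples
    print("chunks:     ", dset.chunks)         # None => contiguous, otherwise the chunk shape
    print("compression:", dset.compression)    # e.g. 'gzip', 'lzf' or None
```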
Note that this implies: it's impossible to write contiguous datasets to disk with legend-pydataobj at the moment. |
Here's a Stack Overflow question that was answered by one of the h5py developers: https://stackoverflow.com/questions/21766145/h5py-correct-way-to-slice-array-datasets. Apparently fancy indexing is not directly supported in HDF5 (unlike slicing, which is), so they did their own implementation of it, which he admits is very slow for >1000 elements ("abysmal" is his choice of word). So I think this supports the idea that we may want to accept higher memory usage and do fancy indexing through numpy instead. If we really wanted to optimize this, we could try to implement it with chunks in mind: look at our index list, figure out which indices are in a chunk, grab those from the chunk, and then move on to the next one. This would limit the extra memory burden to the (smaller) chunk size. It would also let us skip chunks that don't contain any indices in our fancy index, which could speed up sparse data reading (the waveform browser could potentially benefit from this). Of course, this would be a lot of extra work, so for now let's just read in the whole group and then assess whether this would be worth it. |
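For the record, here is a rough sketch of the chunk-aware idea described above (not what legend-pydataobj implements): walk the dataset one chunk of rows at a time, skip chunks that contain none of the requested indices, and apply the fancy indexing with numpy on each chunk-sized block. It assumes a sorted 1D index array and chunking along the first axis only; names are placeholders:

```python
import h5py
import numpy as np

def read_rows_chunkwise(dset, idx):
    """Read rows `idx` (sorted, 1D) from an h5py dataset, one chunk of rows at a time."""
    idx = np.asarray(idx)
    n_rows = dset.shape[0]
    rows_per_chunk = dset.chunks[0] if dset.chunks else n_rows
    out = np.empty((len(idx),) + dset.shape[1:], dtype=dset.dtype)

    filled = 0
    for start in range(0, n_rows, rows_per_chunk):
        stop = min(start + rows_per_chunk, n_rows)
        lo, hi = np.searchsorted(idx, [start, stop])  # requested indices falling in [start, stop)
        if lo == hi:
            continue                                  # nothing requested here: skip this chunk
        block = dset[start:stop]                      # one contiguous (fast) HDF5 slice read
        out[filled:filled + hi - lo] = block[idx[lo:hi] - start]  # numpy fancy indexing in memory
        filled += hi - lo
    return out

# hypothetical usage:
# with h5py.File(raw_file, "r") as f:
#     wfs = read_rows_chunkwise(f["ch1104002/raw/waveform/values"], np.sort(theseind))
```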
Looks like this behavior is even properly documented and I just ignored it, or didn't worry about it, when we implemented this long ago. Louis's fix seems to handle all use cases with just a moderate burden due to memory reallocation, and it's way better than what we have. I propose we just accept his fix for now and come back to this when/if it's the bottleneck again. |
What a pity! I agree we clearly need Louis' workaround, at this point. I'm curious to see the effect on |
For interest/posterity, here's a comparison of the new implementation and the previous implementation (accessed by using the new `use_h5idx` option). There's currently a factor of 2x penalty in speed if the user passes in

```python
import lgdo.lh5_store as lh5
import numpy as np

raw_file = 'l200-p03-r003-cal-20230331T223328Z-tier_raw.lh5'
test_channel = 'ch1104002'

store = lh5.LH5Store()
chobj, _ = store.read_object(test_channel+'/raw', raw_file)
allind = np.arange(len(chobj))
theseind = np.sort(np.random.choice(allind, size=5000, replace=False))
del chobj

# "new_*" functions use the new default (read contiguously, then index in numpy);
# "orig_*" functions recover the previous behaviour via use_h5idx=True

@profile
def new_readall(files):
    chobj, _ = store.read_object(test_channel+'/raw', files)

@profile
def orig_readall(files):
    chobj, _ = store.read_object(test_channel+'/raw', files, use_h5idx=True)

@profile
def new_readallidx(files, idxs):
    chobj, _ = store.read_object(test_channel+'/raw', files, idx=idxs)

@profile
def orig_readallidx(files, idxs):
    chobj, _ = store.read_object(test_channel+'/raw', files, idx=idxs, use_h5idx=True)

@profile
def new_readsomeidx(files, idxs):
    chobj, _ = store.read_object(test_channel+'/raw', files, idx=idxs)

@profile
def orig_readsomeidx(files, idxs):
    chobj, _ = store.read_object(test_channel+'/raw', files, idx=idxs, use_h5idx=True)

if __name__ == '__main__':
    new_readall([raw_file, raw_file, raw_file])
    orig_readall([raw_file, raw_file, raw_file])
    new_readallidx([raw_file, raw_file, raw_file], [allind, allind, allind])
    orig_readallidx([raw_file, raw_file, raw_file], [allind, allind, allind])
    new_readsomeidx([raw_file, raw_file, raw_file], [theseind, theseind, theseind])
    orig_readsomeidx([raw_file, raw_file, raw_file], [theseind, theseind, theseind])
```
|
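(A side note, in case someone wants to reproduce the comparison: the bare `@profile` decorator above suggests a line-by-line profiler such as memory_profiler, which is an assumption since the script doesn't import it explicitly; with memory_profiler the script would be run as `python -m memory_profiler script.py`, with `script.py` standing in for whatever the file is called, to get per-line memory usage for each of the six functions.)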
In pull request #35, Luigi gave me an idea to convert |
Using the `idx` parameter in `read_object()` makes the read about 20x slower. From the `lh5_store.py` code, this is related to line 732, `nda = h5f[name][source_sel]`. I thought that this might be related to the number of discontinuities in the data, so I tested for that, and it seems somehow related. I don't understand the sudden jump in read time, though.

If, instead, the entire object is read into memory before the list indexing, i.e. `nda = h5f[name][...][source_sel]`, then the speed is about the same. Note that this line does not apply if `read_object` is passed an `obj_buf` to read the data into, and some other changes will need to be made.

We suspect this may be related to DataLoader's performance (legend-exp/pygama#521), but it does not account for all of the slowdown. This seems to be a fundamental issue with HDF5 files (from the linked issue: "It looks like this performance issue could be related to these: https://forum.hdfgroup.org/t/performance-reading-data-with-non-contiguous-selection/8979 and h5py/h5py#1597").

This is demonstrated on a single channel in a single file by the timing script above, which prints:

If line 732 is replaced as above (to `nda = h5f[name][...][source_sel]`) to read the whole object before the list indexing, it prints instead: