Segmentation Fault when reading multiple fast5 files #27

Open

BlkPingu opened this issue Sep 10, 2023 · 4 comments

BlkPingu commented Sep 10, 2023

To assist reproducing bugs, please include the following:

  • Operating System: macOS 13.4.1
  • Python version: 3.11.5
  • Where Python was acquired: Homebrew
  • h5py version: 3.9.0
  • HDF5 version: 1.12.2
  • The full traceback/stack trace shown: below

Crash report:

Thread 0 Crashed::  Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib        	       0x1a41e8724 __pthread_kill + 8
1   libsystem_pthread.dylib       	       0x1a421fc28 pthread_kill + 288
2   libsystem_c.dylib             	       0x1a40f646c raise + 32
3   Python                        	       0x10114f8b0 faulthandler_fatal_error + 440
4   libsystem_platform.dylib      	       0x1a424ea24 _sigtramp + 56
5   libvbz_hdf_plugin_m1.dylib    	       0x126905744 StreamVByteWorkerV0<short, true>::decompress(gsl::span<char const>, gsl::span<char>) + 180
6   libvbz_hdf_plugin_m1.dylib    	       0x1269070f8 vbz_decompress + 340
7   libvbz_hdf_plugin_m1.dylib    	       0x126903fbc vbz_filter(unsigned int, unsigned long, unsigned int const*, unsigned long, unsigned long*, void**) + 748
8   libhdf5.200.dylib             	       0x1020d1438 H5Z_pipeline + 508
9   libhdf5.200.dylib             	       0x101e670c8 H5D__chunk_lock + 884
10  libhdf5.200.dylib             	       0x101e622ac H5D__chunk_read + 780
11  libhdf5.200.dylib             	       0x101e855b8 H5D__read + 1224
12  libhdf5.200.dylib             	       0x1020c2f58 H5VL__native_dataset_read + 116
13  libhdf5.200.dylib             	       0x1020ab1f0 H5VL_dataset_read + 180
14  libhdf5.200.dylib             	       0x101e8454c H5Dread + 744
15  defs.cpython-311-darwin.so    	       0x102d6bfbc __pyx_f_4h5py_4defs_H5Dread + 96
16  _selector.cpython-311-darwin.so	       0x10372cfe4 __pyx_pw_4h5py_9_selector_6Reader_3read + 336
17  Python                        	       0x1010e988c _PyEval_EvalFrameDefault + 46824
18  Python                        	       0x1010ec72c _PyEval_Vector + 116
19  _objects.cpython-311-darwin.so	       0x102d392f8 __pyx_pw_4h5py_8_objects_9with_phil_1wrapper + 564
20  Python                        	       0x10100fe34 _PyObject_MakeTpCall + 128
21  Python                        	       0x1010133bc method_vectorcall + 564
22  Python                        	       0x1010798c8 vectorcall_method + 128
23  Python                        	       0x101078d00 slot_mp_subscript + 52
24  Python                        	       0x1010dfa34 _PyEval_EvalFrameDefault + 6288
25  Python                        	       0x1010dd660 PyEval_EvalCode + 168
26  Python                        	       0x1011300ec run_eval_code_obj + 84
27  Python                        	       0x101130050 run_mod + 112
28  Python                        	       0x10112fe90 pyrun_file + 148
29  Python                        	       0x10112f8e4 _PyRun_SimpleFileObject + 268
30  Python                        	       0x10112f274 _PyRun_AnyFileObject + 216
31  Python                        	       0x10114b16c pymain_run_file_obj + 220
32  Python                        	       0x10114aaac pymain_run_file + 72
33  Python                        	       0x10114a38c Py_RunMain + 704
34  Python                        	       0x10114b4c4 Py_BytesMain + 40
35  dyld                          	       0x1a3ec7f28 start + 2236

Faulthandler:

Current thread 0x00000001ff1f9e00 (most recent call first):
  File "/Users/Tobias/Documents/Studium/Master_AI/Semester/Masterarbeit/Masterarbeit_Versuch_2/Code/.venv/lib/python3.11/site-packages/h5py/_hl/dataset.py", line 758 in __getitem__
  File "/Users/Tobias/Documents/Studium/Master_AI/Semester/Masterarbeit/Masterarbeit_Versuch_2/Code/.venv/lib/python3.11/site-packages/ont_fast5_api/fast5_read.py", line 527 in _load_raw
  File "/Users/Tobias/Documents/Studium/Master_AI/Semester/Masterarbeit/Masterarbeit_Versuch_2/Code/.venv/lib/python3.11/site-packages/ont_fast5_api/fast5_read.py", line 161 in get_raw_data
  File "/Users/Tobias/Documents/Studium/Master_AI/Semester/Masterarbeit/Masterarbeit_Versuch_2/Code/testcode/barcode_to_parquet.py", line 17 in raw_data_to_numpy_array
  File "/Users/Tobias/Documents/Studium/Master_AI/Semester/Masterarbeit/Masterarbeit_Versuch_2/Code/testcode/barcode_to_parquet.py", line 54 in combine_files
  File "/Users/Tobias/Documents/Studium/Master_AI/Semester/Masterarbeit/Masterarbeit_Versuch_2/Code/testcode/barcode_to_parquet.py", line 62 in iter_run
  File "/Users/Tobias/Documents/Studium/Master_AI/Semester/Masterarbeit/Masterarbeit_Versuch_2/Code/testcode/barcode_to_parquet.py", line 74 in run_all
  File "/Users/Tobias/Documents/Studium/Master_AI/Semester/Masterarbeit/Masterarbeit_Versuch_2/Code/testcode/barcode_to_parquet.py", line 83 in main
  File "/Users/Tobias/Documents/Studium/Master_AI/Semester/Masterarbeit/Masterarbeit_Versuch_2/Code/testcode/barcode_to_parquet.py", line 86 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.h5r, h5py.utils, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5t, h5py._conv, h5py.h5z, h5py._proxy, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5o, h5py.h5l, h5py._selector, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.hashing, pyarrow.lib, pyarrow._hdfsio, pandas._libs.tslib, pandas._libs.ops, numexpr.interpreter, pyarrow._compute, pandas._libs.arrays, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.index, pandas._libs.internals, pandas._libs.join, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs (total: 87)
[1]    88801 segmentation fault  python barcode_to_parquet.py

Script:

from ont_fast5_api.fast5_interface import get_fast5_file
import numpy as np
import h5py
import pandas as pd
import os
import faulthandler




def raw_data_to_numpy_array(readfile):
    data = []

    fast5_filepath = readfile # This can be a single- or multi-read file
    with get_fast5_file(fast5_filepath, mode="r") as f5:
        for read in f5.get_reads():
            raw_data = read.get_raw_data()
            tup = read.read_id, raw_data
            data.append(tup)

    return pd.DataFrame(data, columns=['read_id', 'raw_data'])


def get_paths(source_dir, iter):
    barcode_iter = 'barcode' + iter
    barcode_pass = os.path.join(source_dir, 'fast5_pass', barcode_iter)
    barcode_fail = os.path.join(source_dir, 'fast5_fail', barcode_iter)
    if os.path.exists(barcode_pass) and os.path.exists(barcode_fail):
        print(barcode_pass)
        print(barcode_fail)

        return [barcode_pass, barcode_fail]
    else:
        print('No barcodes found')

def walk_through_files(path, file_extension='.fast5'):
    for (dirpath, dirnames, filenames) in os.walk(path):
        for filename in filenames:
            if filename.endswith(file_extension):
                yield os.path.join(dirpath, filename)


def combine_files(source_dir, iterator):
    barcode_list = get_paths(source_dir, iterator)
    if barcode_list is None:
        print('File not found')
        return None
    barcode_parts = []
    for barcode in barcode_list:
        for fname in walk_through_files(barcode):
            barcode_part = raw_data_to_numpy_array(fname)
            barcode_parts.append(barcode_part)

    return pd.concat(barcode_parts)


def iter_run(run_path, run_id):
    for iterator in ["{0:02}".format(i) for i in range(1, 100)]:
        df = combine_files(run_path, iterator)
        if df is None:
            continue
        df.to_parquet(run_id + '_barcode_' + iterator + '.parquet', engine='pyarrow', compression='snappy')
        df = None

def run_all(source_dir, run_id_list):
    dirs = os.listdir(source_dir)
    print(dirs)
    for run_id in run_id_list:
        print(run_id)
        run_folder_name = [d for d in dirs if d.endswith(run_id)]
        print(run_folder_name[0])
        run_path = os.path.join(source_dir, run_folder_name[0])
        iter_run(run_path, run_id)

test_path = 'Path/to/data/MinION_sample_data/RawData/RKI/MinION_fast5/unpacked/'
prod_path = 'Path/to/data/extracted/'

run_id_list = ['024', '084', '085', '086', '107', '123', '135']

def main():
    faulthandler.enable()
    run_all(source_dir=prod_path, run_id_list=run_id_list)

if __name__ == '__main__':
    main()

Basically, when reading multiple fast5 files my script segfaults. I have no idea why, but faulthandler points me in the direction of dataset.py, and the crash report indicates the segfault occurred in the library's plugin for VBZ compression, libvbz_hdf_plugin_m1.dylib.

The script's purpose is to combine data from multiple barcodes, each with passes and fails, into one parquet file per barcode. For some reason it segfaults after writing a few GB worth of data. Please don't roast me for the terrible code quality; it's just there to process some data into a different random-access format.
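In case it helps, this is roughly the smallest loop that still goes through the VBZ decompression path; the file path below is just a placeholder, not one of my actual inputs:

from ont_fast5_api.fast5_interface import get_fast5_file

# Placeholder path; substitute any single- or multi-read fast5 file.
fast5_path = 'some_reads.fast5'

with get_fast5_file(fast5_path, mode="r") as f5:
    for read in f5.get_reads():
        # get_raw_data() reads the raw signal dataset, which is where the
        # VBZ filter (libvbz_hdf_plugin) decompresses the chunks.
        signal = read.get_raw_data()
        print(read.read_id, signal.shape)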

Any advice?

@0x55555555

Hi @BlkPingu,

Have you tried on multiple input datasets, or only one?

I'll give it a go now with some input data I have locally.

  • George

@0x55555555

Right @BlkPingu,

It does all seem to work as expected on my end; I converted ~10 GB of data through the script and it didn't crash.

Is it possible there is a specific part of the input datasets that is corrupted?

  • George

@BlkPingu

Hello George, that is wild. Did you run the script on a macOS device or something else? It could very well be that the data is corrupted, at least partially. Thanks for the suggestion.
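To narrow it down, I'll try logging every path right before it's read. Since the segfault can't be caught in Python, the last path printed should point at the suspect file. Rough sketch (the directory argument is a placeholder):

import os
import sys
from ont_fast5_api.fast5_interface import get_fast5_file

def check_files(root):
    # Walk a directory of fast5 files and read every raw signal; the last
    # path printed before a crash identifies the file to look at.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith('.fast5'):
                continue
            path = os.path.join(dirpath, name)
            print(path, flush=True)
            with get_fast5_file(path, mode="r") as f5:
                for read in f5.get_reads():
                    read.get_raw_data()

if __name__ == '__main__':
    check_files(sys.argv[1])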

@0x55555555

Hi @BlkPingu,

It was on an Apple M1 Max with 32 GB of memory. I did notice the script using > 40 GB of memory while running, which was quite exciting.

If you find it reproduces specifically with one file I could have a look at the file?

  • George
