
dorado basecalling bam file end up corrupt during sort #1197

Open
uribertocchi opened this issue Dec 26, 2024 · 7 comments
@uribertocchi

I am experiencing an issue with Dorado where the generated BAM files appear corrupted. The basecalling process completes without any reported errors, but subsequent attempts to sort the BAM files (using both Sambamba sort and Samtools sort) result in errors such as:

sambamba-sort: [zlib] buf error
sambamba-sort: [zlib] data error

I expect to produce a correctly formatted and usable BAM file, but the sorting process consistently fails.

Steps to reproduce the issue:

dorado basecaller dna_r10.4.1_e8.2_400bps_hac@v5.0.0 --modified-bases 6mA --reference /path/to/reference.fa -x cuda:all --recursive /path/to/pod5_pass > output.bam

Run environment:

Dorado version: v0.9.0 (previous versions tested with similar results)

  • Operating system:

  • Hardware (CPUs, Memory, GPUs):
    CPUs: 96-core processor
    Memory: 1.5 TiB
    GPUs: NVIDIA L40 (4 GPUs)

  • Source data location (on device or networked drive - NFS, etc.): networked drive

  • Details about data: runs producing output BAM files over roughly 30GB usually end up corrupt.

@HalfPhoton
Collaborator

Hi @uribertocchi,
Can you add some more information, please?

  • Could you share more detailed logs from samtools, and more information about your system (e.g. OS, CPU architecture)?
  • Can you reproduce the issue with a single read, with/without the reference (alignment)?

Best regards,
Rich

@uribertocchi
Author

Hi @rich,

Thank you for your response! Here is the additional information:

System Details:
Operating System: Linux (specific version: Ubuntu 22.04.1 LTS, kernel 6.8.0-49-generic)
CPU Architecture: AMD EPYC 9654, 96-core processor
Memory: 1.5 TiB
GPUs: NVIDIA L40 (4 GPUs)

Samtools sort logs:
[E::bgzf_uncompress] CRC32 checksum mismatch
[E::bgzf_read_block] BGZF decode jobs returned error 1 for block offset 3605376433
[E::bgzf_read] Read block operation failed with error 1 after 0 of 4 bytes
samtools sort: truncated file. Aborting
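Those bgzf errors point at one specific bad compressed block. To pin down where in the file corruption starts (independent of samtools), here is a minimal Python sketch that walks the BGZF blocks one by one and verifies each block's stored CRC32, raising at the offset of the first bad block. It assumes a well-formed BGZF container up to that point; the block layout is taken from the SAM/BAM specification.

```python
import struct
import zlib

def scan_bgzf(path):
    """Walk every BGZF block in `path`; verify each block's stored
    CRC32 and uncompressed size. Returns the number of valid blocks,
    or raises ValueError with the byte offset of the first bad block."""
    blocks = 0
    with open(path, "rb") as fh:
        while True:
            offset = fh.tell()
            header = fh.read(12)
            if not header:
                return blocks                      # clean end of file
            if len(header) < 12 or header[:2] != b"\x1f\x8b":
                raise ValueError(f"bad gzip magic at offset {offset}")
            (xlen,) = struct.unpack("<H", header[10:12])
            extra = fh.read(xlen)
            bsize = None                           # total block size - 1
            i = 0
            while i + 4 <= len(extra):             # scan extra subfields
                slen = struct.unpack("<H", extra[i + 2:i + 4])[0]
                if extra[i:i + 2] == b"BC" and slen == 2:
                    bsize = struct.unpack("<H", extra[i + 4:i + 6])[0]
                i += 4 + slen
            if bsize is None:
                raise ValueError(f"no BC subfield at offset {offset} (not BGZF?)")
            cdata = fh.read(bsize + 1 - 12 - xlen - 8)
            crc, isize = struct.unpack("<II", fh.read(8))
            data = zlib.decompress(cdata, wbits=-15)
            if len(data) != isize or zlib.crc32(data) != crc:
                raise ValueError(f"CRC mismatch in block at offset {offset}")
            blocks += 1
```

If the reported offset is reproducible across re-runs of the scan but the basecalling run itself produces a bad block at a different offset each time, that would point toward the storage/transfer layer rather than a deterministic writer bug.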

@HalfPhoton
Collaborator

Can you reproduce the issue with a single read, with/without the reference (alignment)?

@uribertocchi
Author

Basecalling works fine for individual reads or small pod5/fast5 files. However, the issue arises when basecalling many reads and aligning them into BAM or SAM output. I have successfully produced working BAM or SAM files up to 30GB, but larger files fail. Unfortunately, most of my experiments exceed 60GB, so most of my files end up corrupted.

@HalfPhoton
Collaborator

Hi @uribertocchi,
I believe this problem is likely to be an issue with your system or network storage configuration and not an issue with Dorado.

Could you check the following:

  • Are you using a recent version of samtools?
  • Can you check that your BAM files pass samtools quickcheck <FILE>?
  • Are you able to basecall 2+ ~30G datasets and samtools merge them into a valid BAM file?
  • Does your system / network file store have a maximum file size?
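On the quickcheck point: among other things, samtools quickcheck verifies that the file ends with the fixed 28-byte BGZF EOF marker, which a truncated or interrupted write would lose. That one check can be sketched in a few lines of Python (the marker bytes are the EOF block defined in the SAM/BAM specification):

```python
# The fixed 28-byte BGZF end-of-file block from the SAM/BAM specification.
BGZF_EOF = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)

def has_bgzf_eof(path):
    """True if `path` ends with the BGZF end-of-file marker."""
    with open(path, "rb") as fh:
        fh.seek(0, 2)                  # seek to end of file
        if fh.tell() < 28:
            return False
        fh.seek(-28, 2)                # back up over the last 28 bytes
        return fh.read(28) == BGZF_EOF
```

Note this only catches truncation at the tail; a CRC error in the middle of the file (as in your logs) would still pass this check, so run the full quickcheck as well.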

Kind regards,
Rich

@uribertocchi
Author

Thank you, Rich. I'll check everything soon. How can I limit the size of the pod5 folder to 30GB, as it currently generates about 80GB?

@HalfPhoton
Collaborator

HalfPhoton commented Jan 2, 2025

No worries @uribertocchi - This is quite a puzzling issue that I've not come across and which hasn't been reported before as far as I can see. Hopefully we can get to the bottom of it.

To create smaller outputs you can either:

  • Batch your input data into smaller datasets (i.e. avoid using --recursive and collect input data into collections of N pod5 files). This could be easiest if your input data is already a collection of small datasets and it's just a matter of basecalling each separately.
  • Specify batches of reads in the form of lists of read ids to basecall from your overall datasets. Continue using --recursive but also use --read-ids <FILE> for N lists of read ids. You can create batches of read ids using the bash snippet below:
    # Set BATCH_SIZE_SAMPLES: 100G samples is roughly 10GB of output BAM
    BATCH_SIZE_SAMPLES=100000000000

    pod5 view -H --include "read_id, num_samples" --recursive --output view.txt reads/

    # Tally the 2nd column (num_samples) and calculate the number of sub-batches
    TOTAL_SAMPLES=$(awk --bignum '{sum+=$2;} END {print sum;}' view.txt)
    BATCHES=$(($TOTAL_SAMPLES / $BATCH_SIZE_SAMPLES  ))
    BATCHES=$(($BATCHES + 1 ))

    # Extract only the read ids (first column)
    awk '{print $1}' view.txt > read_ids.txt
    # split the read ids into separate files approximately even batches
    split read_ids.txt -n l/$BATCHES ids. -a 4 -d --additional-suffix .txt

Then call dorado in a loop over the id files, writing to separate output.$N.bam files:

for F in $(find -maxdepth 1 -iname "ids*txt"); do 
    N=$(echo $F | grep -oP "\d+")
    echo $N $F
    dorado basecaller hac,6mA /path/to/pod5_pass/ --recursive --reference /path/to/reference.fa -x cuda:all --read-ids $F > output.${N}.bam
done
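For reference, split just divides read_ids.txt into evenly sized line-count batches. If more even output BAM sizes matter, a sample-aware variant can be sketched in Python; the rows are (read_id, num_samples) pairs as in view.txt, and batch_read_ids is a hypothetical helper, not part of the pod5 tooling:

```python
def batch_read_ids(rows, max_samples):
    """Greedily group (read_id, num_samples) rows so that each batch's
    total sample count stays at or under max_samples (oversized single
    reads still get a batch of their own)."""
    batches, current, total = [], [], 0
    for read_id, n in rows:
        if current and total + n > max_samples:
            batches.append(current)    # flush the full batch
            current, total = [], 0
        current.append(read_id)
        total += n
    if current:
        batches.append(current)        # flush the trailing partial batch
    return batches
```

Each returned batch can then be written to its own ids.N.txt file and passed to --read-ids exactly as in the loop above.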

Best regards,
Rich
