dorado basecalling bam file end up corrupt during sort #1197
Comments
Hi @uribertocchi,
Best regards, |
Hi @rich, Thank you for your response! Here is the additional information: System Details: Samtools sort logs: |
Can you reproduce the issue with a single read, with/without the reference (alignment)? |
Basecalling is fine for individual reads or small pod5/fast5 files. However, the issue arises when attempting to basecall many reads and align them to BAM or SAM files. I have successfully produced functioning BAM or SAM files up to 30GB but encountered issues with larger files. Unfortunately, most of my experiments exceed 60GB, resulting in most files becoming corrupted. |
Hi @uribertocchi, Could you check the following:
Kind regards, |
Thank you, Rich. I'll check everything soon. How can I limit the size of the pod5 folder to 30GB, as it currently generates about 80GB? |
No worries @uribertocchi - This is quite a puzzling issue that I've not come across and which hasn't been reported before as far as I can see. Hopefully we can get to the bottom of it. To create smaller outputs you can either;
# Set BATCH_SIZE_SAMPLES: 100G samples ~ 10GB output BAM
BATCH_SIZE_SAMPLES=100000000000
pod5 view -H --include "read_id, num_samples" --recursive --output view.txt reads/
# Tally the 2nd column (num_samples) and calculate the number of sub-batches
TOTAL_SAMPLES=$(awk --bignum '{sum+=$2;} END {print sum;}' view.txt)
BATCHES=$(($TOTAL_SAMPLES / $BATCH_SIZE_SAMPLES ))
BATCHES=$(($BATCHES + 1 ))
# Extract only the read ids
awk '{print $1}' view.txt > read_ids.txt
# Split the read ids into separate files of approximately even size
split read_ids.txt -n l/$BATCHES ids. -a 4 -d --additional-suffix .txt
and call them in some loop over each file, writing to separate output.$N.bam files:
for F in $(find -maxdepth 1 -iname "ids*txt"); do
    N=$(echo $F | grep -oP "\d+")
    echo $N $F
    dorado basecaller hac,6mA /path/to/pod5_pass/ --recursive --reference /path/to/reference.fa -x cuda:all --read-ids $F > output.${N}.bam
done
Best regards, |
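As a small worked example of the batch arithmetic in the script above, the sketch below counts batches for a toy read table. The helper name count_batches and the sample values are made up for illustration. Unlike the script above (integer division plus one), it uses ceiling division, so a total that is an exact multiple of the batch size does not produce an extra batch.

```shell
# Hypothetical helper: number of batches needed for a table of
# "read_id num_samples" lines, given a batch size in samples.
# Ceiling division: an exact multiple of the batch size adds no extra batch.
count_batches() {
    view_file=$1
    batch_size=$2
    awk -v size="$batch_size" '
        { total += $2 }
        END {
            n = int((total + size - 1) / size)
            if (n < 1) n = 1
            print n
        }
    ' "$view_file"
}

# Toy input: three reads with 400, 350 and 300 samples (made-up values).
example_view=$(mktemp)
printf 'read-a\t400\nread-b\t350\nread-c\t300\n' > "$example_view"

# 1050 samples in total, at most 500 per batch: three batches.
count_batches "$example_view" 500
```

The printed number can then be passed to split in place of $BATCHES.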
I am experiencing an issue with Dorado where the generated BAM files appear corrupted. The basecalling process completes without any reported errors, but subsequent attempts to sort the BAM files (using both Sambamba sort and Samtools sort) result in errors such as:
sambamba-sort: [zlib] buf error
sambamba-sort: [zlib] data error
I expect to produce a correctly formatted and usable BAM file, but the sorting process consistently fails.
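One quick way to narrow down this kind of failure is to check whether the file was truncated while being written. A complete BAM file ends with a fixed 28-byte empty BGZF block, as defined in the SAM/BAM specification. The sketch below compares the last 28 bytes of each file against that marker. The function name has_bgzf_eof is hypothetical, and the check only detects a missing end-of-file block; it would not catch corruption in the middle of the file. samtools quickcheck performs a comparable check.

```shell
# The 28-byte BGZF end-of-file marker from the SAM/BAM specification, as hex.
BGZF_EOF_HEX="1f8b08040000000000ff0600424302001b0003000000000000000000"

# has_bgzf_eof FILE (hypothetical helper): succeed if FILE ends with the marker.
has_bgzf_eof() {
    tail_hex=$(tail -c 28 "$1" | od -An -v -tx1 | tr -d ' \n')
    [ "$tail_hex" = "$BGZF_EOF_HEX" ]
}

# Report on every file passed as an argument, e.g. output.bam.
for bam in "$@"; do
    if has_bgzf_eof "$bam"; then
        echo "$bam: BGZF EOF marker present"
    else
        echo "$bam: BGZF EOF marker missing (file may be truncated)"
    fi
done
```

If the marker is missing, the file was most likely cut short during writing rather than damaged later by the sort.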
Steps to reproduce the issue:
dorado basecaller dna_r10.4.1_e8.2_400bps_hac@v5.0.0 --modified-bases 6mA --reference /path/to/reference.fa -x cuda:all --recursive /path/to/pod5_pass > output.bam
Run environment:
Dorado version: v0.9.0 (previous versions tested with similar results)
Operating system:
Hardware (CPUs, Memory, GPUs):
CPUs: 96-core processor
Memory: 1.5 TiB
GPUs: NVIDIA L40 (4 GPUs)
Source data location (on device or networked drive - NFS, etc.): networked drive
Details about data: Output BAM files larger than about 30GB are usually corrupt.