Abbreviation | In full |
---|---|
BAM |
Binary Alignment Map |
CIGAR |
Compact Idiosyncratic Gapped Alignment Report |
SAM |
Sequence Alignment Map |
- Discuss maybe solution: unsure of more of headers needs to change:
# Add CB:s in the header
sed 's/SN:ERCC-/CB:SN:ERCC-/g' SS2_19_037-H13-CB_chr2.sam > SS2_19_037-H13-CB_chr2_with_cbs.sam
# Convert to BAM
samtools view -S -b SS2_19_037-H13-CB_chr2_with_cbs.sam > SS2_19_037-H13-CB_chr2_with_cbs.bam
This was indeed the solution. The user will add this to her scripts herself.
- Documented jvarkit and how to put it in a Singularity container: maybe user is interested?
In run.sh, the line that would change would be:
./jvarkit.sif java -jar /opt/jvarkit/dist/jvarkit.jar samjdk -e 'String c=record.getReadName(); int h=0; int s=21; record.setAttribute(\"CB\",c.substring(h,s));return record;' ${LINE} > ${outdir1}${cell}-CB_chr2.sam
(note the escaped double-quotes)
User thinks this is useful.
- Requested a feature request to samtools to detect when a SAM has a BAM filename extension at samtools/samtools#2094
User seemed happy to have contributed indirectly to samtools
Bash script from user run.sh.
The problem is in an early line:
samtools view -b ${bamO} chr2 > SS2_19_037-H13_chr2.bam
This is a problem as a samtools view is not in the same format as a BAM file. Using this line instead brings us closer to the truth:
samtools view -b ${bamO} chr2 > SS2_19_037-H13_chr2.sam
This is because a SAM file is a plaintext format that samtools
renders to. However, the user does use the samtools -b
flag,
which, according to the documentation
states -b, --bam Output in the BAM format
I hypothesize that -b
is ignored when an output file is not
specified (with the -o
flag),
which can be tested by seeing if this would indeed create a BAM file:
samtools view --bam ${bamO} chr2 --output SS2_19_037-H13_chr2.bam
- Test hypothesis
This is, however, not the solution. Instead, I predict this will work:
samtools view ${bamO} chr2 > SS2_19_037-H13_chr2.sam
# Work on the SAM file
samtools view SS2_19_037-H13_chr2.sam --bam --output SS2_19_037-H13_chr2.bam
- Test hypothesis
User text on how to get a FASTA:
Fasta (too large to send as attachment) downloaded from https://www.gencodegenes.org/human/release_37.html (Fasta files > genome sequence)
Indeed, the FASTA file can be downloaded from https://www.gencodegenes.org/human/release_37.html, which links to https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/GRCh38.p13.genome.fa.gz. I put that in a script get_fasta.sh.
- Verify
SS2_19_037-H13_chr2.bam
is a valid BAM filecat SS2_19_037-H13_chr2.bam
indeed shows a binary filesamtools quickcheck SS2_19_037-H13_chr2.bam
gives no errorsamtools quickcheck -vvv SS2_19_037-H13_chr2.bam
gives no error and some good output:
richel@richel-N141CU:~/GitHubs/ticket_296259$ samtools quickcheck -vvv SS2_19_037-H13_chr2.bam
verbosity set to 3
checking SS2_19_037-H13_chr2.bam
opened SS2_19_037-H13_chr2.bam
SS2_19_037-H13_chr2.bam is sequence data
SS2_19_037-H13_chr2.bam has 116 targets in header.
SS2_19_037-H13_chr2.bam has good EOF block.
- Run use script, reproduce problem
Does not work yet:
[richel@rackham1 ticket_296259]$ cat slurm-49075340.out
[DEBUG][SamJdk] Compiling :
1 import java.util.*;
2 import java.util.stream.*;
3 import java.util.function.*;
4 import htsjdk.samtools.*;
5 import htsjdk.samtools.util.*;
6 public class SamJdkCustom173338094 extends com.github.lindenb.jvarkit.tools.samjs.SamJdk.AbstractFilter {
7 public SamJdkCustom173338094(final SAMFileHeader header) {
8 super(header);
9 }
10 @Override
11 public Object apply(final SAMRecord record) {
12 /** user's code starts here */
13 String c=record.getReadName(); int h=0; int s=21; record.setAttribute(CB,c.substring(h,s));return record;
14 /** user's code ends here */
15 }
16 }
/tmp/jvarkit.tmp3660225531974008139/SamJdkCustom173338094.java:13: error: cannot find symbol
String c=record.getReadName(); int h=0; int s=21; record.setAttribute(CB,c.substring(h,s));return record;
^
symbol: variable CB
location: class SamJdkCustom173338094
1 error
[SEVERE][SamJdk]java.lang.RuntimeException: Cannot compile
java.lang.RuntimeException: java.lang.RuntimeException: Cannot compile
at com.github.lindenb.jvarkit.lang.OpenJdkCompiler$DefaultOpenJdkCompiler.compileClass(OpenJdkCompiler.java:250)
at com.github.lindenb.jvarkit.tools.samjs.SamJdk.doWork(SamJdk.java:584)
at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:819)
at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:982)
at com.github.lindenb.jvarkit.tools.samjs.SamJdk.main(SamJdk.java:809)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at com.github.lindenb.jvarkit.tools.jvarkit.JvarkitCentral$Command.execute(JvarkitCentral.java:260)
at com.github.lindenb.jvarkit.tools.jvarkit.JvarkitCentral.run(JvarkitCentral.java:716)
at com.github.lindenb.jvarkit.tools.jvarkit.JvarkitCentral.main(JvarkitCentral.java:727)
Caused by: java.lang.RuntimeException: Cannot compile
at com.github.lindenb.jvarkit.lang.OpenJdkCompiler$DefaultOpenJdkCompiler.exec(OpenJdkCompiler.java:164)
at com.github.lindenb.jvarkit.lang.OpenJdkCompiler$DefaultOpenJdkCompiler.compileClass(OpenJdkCompiler.java:227)
... 11 more
[INFO][Launcher]samjdk Exited with failure (-1)
[main_samview] fail to read the header from "./SS2_19_037-H13-CB_chr2.sam".
samtools index: "SS2_19_037-H13_chr2_bam.bam" is in a format that cannot be usefully indexed
Problem seems that the text CB
disappears. Let's fix it by "CB"
to \"CB\"
.
Works:
[richel@rackham1 ticket_296259]$ cat slurm-49075347.out
[DEBUG][SamJdk] Compiling :
1 import java.util.*;
2 import java.util.stream.*;
3 import java.util.function.*;
4 import htsjdk.samtools.*;
5 import htsjdk.samtools.util.*;
6 public class SamJdkCustom1568608340 extends com.github.lindenb.jvarkit.tools.samjs.SamJdk.AbstractFilter {
7 public SamJdkCustom1568608340(final SAMFileHeader header) {
8 super(header);
9 }
10 @Override
11 public Object apply(final SAMRecord record) {
12 /** user's code starts here */
13 String c=record.getReadName(); int h=0; int s=21; record.setAttribute("CB",c.substring(h,s));return record;
14 /** user's code ends here */
15 }
16 }
[richel@rackham1 ticket_296259]$ cat log
11:08:41 [ERROR] Input file ./00.0_chrom_seq_removed_GRCh38.primary_assembly.genome-nochrY_ERCC92-chr2.fa does not exist
[richel@rackham1 ticket_296259]$
New problem is that the FASTA file cannot be found, I've added that in the script too.
[richel@rackham1 ticket_296259]$ cat log
11:31:38 [ERROR] Input file ./00.0_chrom_seq_removed_GRCh38.primary_assembly.genome-nochrY_ERCC92-chr2.fa does not exist
[richel@rackham1 ticket_296259]$
``
- [x] [FAILS] Confirm that `samtools quickcheck` on the `SS2_19_037-H13_chr2.bam` created by the original script
indicates a problem
```bash
[richel@rackham1 ticket_296259]$ samtools quickcheck SS2_19_037-H13_chr2.bam
[richel@rackham1 ticket_296259]$
- Send a feature request to samtools to detect this. Done at samtools/samtools#2094
samtools
determines the type of file from the input file, not from its filename
- Confirm that
java -jar $PICARD ValidateSamFile --INPUT SS2_19_037-H13_chr2.bam
indicates a problem
[richel@rackham1 ticket_296259]$ java -jar $PICARD ValidateSamFile --INPUT SS2_19_037-H13_chr2.bam
Aug 06, 2024 1:14:05 PM com.intel.gkl.NativeLibraryLoader load
INFO: Loading libgkl_compression.so from jar:file:/sw/bioinfo/picard/3.1.1/rackham/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Tue Aug 06 13:14:05 CEST 2024] ValidateSamFile --INPUT SS2_19_037-H13_chr2.bam --MODE VERBOSE --MAX_OUTPUT 100 --IGNORE_WARNINGS false --VALIDATE_INDEX true --INDEX_VALIDATION_STRINGENCY EXHAUSTIVE --IS_BISULFITE_SEQUENCED false --MAX_OPEN_TEMP_FILES 8000 --SKIP_MATE_VALIDATION false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Tue Aug 06 13:14:05 CEST 2024] Executing as richel@rackham1.uppmax.uu.se on Linux 3.10.0-1160.119.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 17+35-2724; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:3.1.1
WARNING 2024-08-06 13:14:05 ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur.
ERROR::MISSING_READ_GROUP:Read groups is empty
WARNING::RECORD_MISSING_READ_GROUP:Read name Run0484_ACDWKAANXX_L4_R1_T2103_C1510673, A record is missing a read group
WARNING::MISSING_TAG_NM:Record 1, Read name Run0484_ACDWKAANXX_L4_R1_T2103_C1510673, NM tag (nucleotide differences) is missing
WARNING::RECORD_MISSING_READ_GROUP:Read name Run0484_ACDWKAANXX_L4_R1_T1209_C1324596, A record is missing a read group
[...]
Fix problem in script:
mv SS2_19_037-H13_chr2.bam SS2_19_037-H13_chr2.sam
samtools view -S -b SS2_19_037-H13_chr2.sam > SS2_19_037-H13_chr2.bam
After a lot of trying: there is no need to fix the script: the user already did so! Cleaning up...
I now have the SAM file, SS2_19_037-H13-CB_chr2.sam that needs to be modified...
[richel@rackham1 ticket_296259]$ head SS2_19_037-H13-CB_chr2.sam
@HD VN:1.6 SO:coordinate
@SQ SN:ERCC-00002 LN:1061
@SQ SN:ERCC-00003 LN:1023
@SQ SN:ERCC-00004 LN:523
@SQ SN:ERCC-00009 LN:984
@SQ SN:ERCC-00012 LN:994
@SQ SN:ERCC-00013 LN:808
@SQ SN:ERCC-00014 LN:1957
@SQ SN:ERCC-00016 LN:844
@SQ SN:ERCC-00017 LN:1136
I checked, I can shamelessly replace SN:ERCC-
by CB:SN:ERCC-
sed 's/SN:ERCC-/CB:SN:ERCC-/g' SS2_19_037-H13-CB_chr2.sam > SS2_19_037-H13-CB_chr2_with_cbs.sam
This does leave some .sam
unaffected:
@SQ SN:chr1 LN:248956422
@SQ SN:chr10 LN:133797422
@SQ SN:chr11 LN:135086622
@SQ SN:chr12 LN:133275309
@SQ SN:chr13 LN:114364328
@SQ SN:chr14 LN:107043718
@SQ SN:chr15 LN:101991189
@SQ SN:chr16 LN:90338345
@SQ SN:chr17 LN:83257441
@SQ SN:chr18 LN:80373285
@SQ SN:chr19 LN:58617616
@SQ SN:chr2 LN:242193529
@SQ SN:chr20 LN:64444167
@SQ SN:chr21 LN:46709983
@SQ SN:chr22 LN:50818468
@SQ SN:chr3 LN:198295559
@SQ SN:chr4 LN:190214555
@SQ SN:chr5 LN:181538259
@SQ SN:chr6 LN:170805979
@SQ SN:chr7 LN:159345973
@SQ SN:chr8 LN:145138636
@SQ SN:chr9 LN:138394717
@SQ SN:chrM LN:16569
@SQ SN:chrX LN:156040895
- Ask user: is that OK?
If it is OK, the line that needs to be added is:
# Add CB:s in the header
sed 's/SN:ERCC-/CB:SN:ERCC-/g' SS2_19_037-H13-CB_chr2.sam > SS2_19_037-H13-CB_chr2_with_cbs.sam
# Convert to BAM
samtools view -S -b SS2_19_037-H13-CB_chr2_with_cbs.sam > SS2_19_037-H13-CB_chr2_with_cbs.bam
Discuss:
- Goal: how to be able to fix the problem, maybe the VarTrix developers are better to ask
It seems fixable: we concluded that the user's script converted a BAM file to a text file. I can fix that script :-)
- Validate BAM file
module load bioinfo-tools
module load vartrix/1.1.22
module load picard/3.1.1
java -jar $PICARD ValidateSamFile --INPUT user_filename.bam
We found out the BAM file was not a binary file at all!
Instead, it was a text file in the same format as samtools view
gives.
- [/] What is the history of the BAM file? Goal: reproduce problem on public BAM file
The user will send me her script
- I've creates an Issue at VarTrix: Help find problematic file with error 'Failed to seek to offset'
Not needed.
-
Decide upon plan
- User sends script and subsetted BAM file before 14:00
- Richel works on the fix
- Next meeting: Tuesday Aug 6th 14:00
I predict there is something wrong with the bam input file. The user thinks so to (see communication of 2024-08-02). The bam input file cannot be shared.
However, it can be validated (code from the UPPMAX documentation on Picard):
module load bioinfo-tools
module load vartrix/1.1.22
module load picard/3.1.1
java -jar $PICARD ValidateSamFile --INPUT $VARTRIX_TEST/test_dna.bam
The user tried to delete the analysis files and started the analysis from scratch again, but I am unsure if this was correct (see communication of 2024-08-02)
- I've creates an Issue at VarTrix: Help find problematic file with error 'Failed to seek to offset'
I am trying to run vartrix in Bianca. The program seems to work fine when I try the test run (files in $VARTRIX_TEST) but crashes when I use my input files. The errors I get: "Failed to seek to offset ((some number))" "Segmentation fault. Core dump" and I get a core.XXXX file. When I check the core.XXXX with gdb it says that it "Failed to read a valid object file image from memory" and "Program terminated with signal 6" I have no idea what it is trying to tell me. Could you maybe help me? (attached screenshots)
I assume this is about the Rust program called vartrix at https://github.com/10XGenomics/vartrix.
A GitHub seach on Failed to seek to offset
:
- [UNRESOLVED] mflamand/Bullseye#2 (comment)
- [UNRESOLVED] GMOD/jbrowse-components#1990 (comment)
I feel the problem is in https://github.com/samtools/htslib, due to this PR:
Lines of code are at:
- https://github.com/samtools/htslib/blob/master/hts.c#L4182
- https://github.com/samtools/htslib/blob/master/hts.c#L4203
Both work on a BGZF
pointer, a structure
defined at https://github.com/samtools/htslib/blob/develop/htslib/bgzf.h#L68.
As the user can get to work testing files, I feel it is most likely that the user's input files are in an invalid format.
I contacted the VarTrix maintainers to help me and the user diagnose the faulty file at this Issue.
From the user:
The command I used to run a test in Vartrix:
$ module load bioinfo-tools
$ module load vartrix/1.1.22
$ ls -ltr $VARTRIX_TEST/*
$ vartrix --bam $VARTRIX_TEST/test_dna.bam \
--cell-barcodes $VARTRIX_TEST/dna_barcodes.tsv \
--fasta $VARTRIX_TEST/test_dna.fa \
--vcf $VARTRIX_TEST/test_dna.vcf
Trying to remove all analysis files: there are none created, I checked with '''ls -altr''' and I see nothing
Input files: I agree in that it is, most likely, my input files. My input bams have been tinkered with, because my bam files are output coming from RSEM and they are missing the CB tag needed by Vartrix (as Vartrix takes bam files with CellRanger format) I used java/jvarkit to add the CB tag to my RSEM bams I also modified the vcf file to tailor it to what we need, but when I compare it to the test vcf file they seem to have the same format
Sharing my inputs: I'm not sure I can share my bam as it is sensitive information, but I have it in Bianca, so maybe you can access that directory? For the remaining files I could send you a link to a tar in OneDrive, would that work?