Skip to content

UPPMAX/ticket_296259

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

50 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ticket_296259

Abbreviation In full
BAM Binary Alignment Map
CIGAR Compact Idiosyncratic Gapped Alignment Report
SAM Sequence Alignment Map

2024-08-06

Meeting 16:00

  • Discuss maybe solution: unsure of more of headers needs to change:

# Add CB:s in the header
sed 's/SN:ERCC-/CB:SN:ERCC-/g' SS2_19_037-H13-CB_chr2.sam > SS2_19_037-H13-CB_chr2_with_cbs.sam
# Convert to BAM
samtools view -S -b SS2_19_037-H13-CB_chr2_with_cbs.sam > SS2_19_037-H13-CB_chr2_with_cbs.bam

This was indeed the solution. The user will add this to her scripts herself.

In run.sh, the line that would change would be:

./jvarkit.sif java -jar /opt/jvarkit/dist/jvarkit.jar samjdk -e 'String c=record.getReadName(); int h=0; int s=21; record.setAttribute(\"CB\",c.substring(h,s));return record;' ${LINE}  > ${outdir1}${cell}-CB_chr2.sam

(note the escaped double-quotes)

User thinks this is useful.

  • Requested a feature request to samtools to detect when a SAM has a BAM filename extension at samtools/samtools#2094

User seemed happy to have contributed indirectly to samtools

Notes

Bash script from user run.sh.

The problem is in an early line:

samtools view -b ${bamO} chr2 > SS2_19_037-H13_chr2.bam

This is a problem as a samtools view is not in the same format as a BAM file. Using this line instead brings us closer to the truth:

samtools view -b ${bamO} chr2 > SS2_19_037-H13_chr2.sam

This is because a SAM file is a plaintext format that samtools renders to. However, the user does use the samtools -b flag, which, according to the documentation states -b, --bam Output in the BAM format

I hypothesize that -b is ignored when an output file is not specified (with the -o flag), which can be tested by seeing if this would indeed create a BAM file:

samtools view --bam ${bamO} chr2 --output SS2_19_037-H13_chr2.bam
  • Test hypothesis

This is, however, not the solution. Instead, I predict this will work:

samtools view ${bamO} chr2 > SS2_19_037-H13_chr2.sam
# Work on the SAM file
samtools view SS2_19_037-H13_chr2.sam --bam --output SS2_19_037-H13_chr2.bam
  • Test hypothesis

User text on how to get a FASTA:

Fasta (too large to send as attachment) downloaded from https://www.gencodegenes.org/human/release_37.html (Fasta files > genome sequence)

Indeed, the FASTA file can be downloaded from https://www.gencodegenes.org/human/release_37.html, which links to https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/GRCh38.p13.genome.fa.gz. I put that in a script get_fasta.sh.

  • Verify SS2_19_037-H13_chr2.bam is a valid BAM file
    • cat SS2_19_037-H13_chr2.bam indeed shows a binary file
    • samtools quickcheck SS2_19_037-H13_chr2.bam gives no error
    • samtools quickcheck -vvv SS2_19_037-H13_chr2.bam gives no error and some good output:
richel@richel-N141CU:~/GitHubs/ticket_296259$ samtools quickcheck -vvv SS2_19_037-H13_chr2.bam
verbosity set to 3
checking SS2_19_037-H13_chr2.bam
opened SS2_19_037-H13_chr2.bam
SS2_19_037-H13_chr2.bam is sequence data
SS2_19_037-H13_chr2.bam has 116 targets in header.
SS2_19_037-H13_chr2.bam has good EOF block.
  • Run use script, reproduce problem

Does not work yet:

[richel@rackham1 ticket_296259]$ cat slurm-49075340.out
[DEBUG][SamJdk] Compiling :
         1  import java.util.*;
         2  import java.util.stream.*;
         3  import java.util.function.*;
         4  import htsjdk.samtools.*;
         5  import htsjdk.samtools.util.*;
         6  public class SamJdkCustom173338094 extends com.github.lindenb.jvarkit.tools.samjs.SamJdk.AbstractFilter {
         7    public SamJdkCustom173338094(final SAMFileHeader header) {
         8    super(header);
         9    }
        10    @Override
        11    public Object apply(final SAMRecord record) {
        12     /** user's code starts here */
        13  String c=record.getReadName(); int h=0; int s=21; record.setAttribute(CB,c.substring(h,s));return record;
        14  /** user's code ends here */
        15     }
        16  }
/tmp/jvarkit.tmp3660225531974008139/SamJdkCustom173338094.java:13: error: cannot find symbol
String c=record.getReadName(); int h=0; int s=21; record.setAttribute(CB,c.substring(h,s));return record;
                                                                      ^
  symbol:   variable CB
  location: class SamJdkCustom173338094
1 error
[SEVERE][SamJdk]java.lang.RuntimeException: Cannot compile
java.lang.RuntimeException: java.lang.RuntimeException: Cannot compile
	at com.github.lindenb.jvarkit.lang.OpenJdkCompiler$DefaultOpenJdkCompiler.compileClass(OpenJdkCompiler.java:250)
	at com.github.lindenb.jvarkit.tools.samjs.SamJdk.doWork(SamJdk.java:584)
	at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:819)
	at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:982)
	at com.github.lindenb.jvarkit.tools.samjs.SamJdk.main(SamJdk.java:809)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at com.github.lindenb.jvarkit.tools.jvarkit.JvarkitCentral$Command.execute(JvarkitCentral.java:260)
	at com.github.lindenb.jvarkit.tools.jvarkit.JvarkitCentral.run(JvarkitCentral.java:716)
	at com.github.lindenb.jvarkit.tools.jvarkit.JvarkitCentral.main(JvarkitCentral.java:727)
Caused by: java.lang.RuntimeException: Cannot compile
	at com.github.lindenb.jvarkit.lang.OpenJdkCompiler$DefaultOpenJdkCompiler.exec(OpenJdkCompiler.java:164)
	at com.github.lindenb.jvarkit.lang.OpenJdkCompiler$DefaultOpenJdkCompiler.compileClass(OpenJdkCompiler.java:227)
	... 11 more
[INFO][Launcher]samjdk Exited with failure (-1)
[main_samview] fail to read the header from "./SS2_19_037-H13-CB_chr2.sam".
samtools index: "SS2_19_037-H13_chr2_bam.bam" is in a format that cannot be usefully indexed

Problem seems that the text CB disappears. Let's fix it by "CB" to \"CB\".

Works:

[richel@rackham1 ticket_296259]$ cat slurm-49075347.out
[DEBUG][SamJdk] Compiling :
         1  import java.util.*;
         2  import java.util.stream.*;
         3  import java.util.function.*;
         4  import htsjdk.samtools.*;
         5  import htsjdk.samtools.util.*;
         6  public class SamJdkCustom1568608340 extends com.github.lindenb.jvarkit.tools.samjs.SamJdk.AbstractFilter {
         7    public SamJdkCustom1568608340(final SAMFileHeader header) {
         8    super(header);
         9    }
        10    @Override
        11    public Object apply(final SAMRecord record) {
        12     /** user's code starts here */
        13  String c=record.getReadName(); int h=0; int s=21; record.setAttribute("CB",c.substring(h,s));return record;
        14  /** user's code ends here */
        15     }
        16  }
[richel@rackham1 ticket_296259]$ cat log 
11:08:41 [ERROR] Input file ./00.0_chrom_seq_removed_GRCh38.primary_assembly.genome-nochrY_ERCC92-chr2.fa does not exist
[richel@rackham1 ticket_296259]$ 

New problem is that the FASTA file cannot be found, I've added that in the script too.

[richel@rackham1 ticket_296259]$ cat log 
11:31:38 [ERROR] Input file ./00.0_chrom_seq_removed_GRCh38.primary_assembly.genome-nochrY_ERCC92-chr2.fa does not exist
[richel@rackham1 ticket_296259]$ 
``


- [x] [FAILS] Confirm that `samtools quickcheck` on the `SS2_19_037-H13_chr2.bam` created by the original script
  indicates a problem

```bash
[richel@rackham1 ticket_296259]$ samtools quickcheck SS2_19_037-H13_chr2.bam
[richel@rackham1 ticket_296259]$ 

samtools determines the type of file from the input file, not from its filename

  • Confirm that java -jar $PICARD ValidateSamFile --INPUT SS2_19_037-H13_chr2.bam indicates a problem
[richel@rackham1 ticket_296259]$ java -jar $PICARD ValidateSamFile --INPUT SS2_19_037-H13_chr2.bam 
Aug 06, 2024 1:14:05 PM com.intel.gkl.NativeLibraryLoader load
INFO: Loading libgkl_compression.so from jar:file:/sw/bioinfo/picard/3.1.1/rackham/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Tue Aug 06 13:14:05 CEST 2024] ValidateSamFile --INPUT SS2_19_037-H13_chr2.bam --MODE VERBOSE --MAX_OUTPUT 100 --IGNORE_WARNINGS false --VALIDATE_INDEX true --INDEX_VALIDATION_STRINGENCY EXHAUSTIVE --IS_BISULFITE_SEQUENCED false --MAX_OPEN_TEMP_FILES 8000 --SKIP_MATE_VALIDATION false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Tue Aug 06 13:14:05 CEST 2024] Executing as richel@rackham1.uppmax.uu.se on Linux 3.10.0-1160.119.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 17+35-2724; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:3.1.1
WARNING	2024-08-06 13:14:05	ValidateSamFile	NM validation cannot be performed without the reference. All other validations will still occur.
ERROR::MISSING_READ_GROUP:Read groups is empty
WARNING::RECORD_MISSING_READ_GROUP:Read name Run0484_ACDWKAANXX_L4_R1_T2103_C1510673, A record is missing a read group
WARNING::MISSING_TAG_NM:Record 1, Read name Run0484_ACDWKAANXX_L4_R1_T2103_C1510673, NM tag (nucleotide differences) is missing
WARNING::RECORD_MISSING_READ_GROUP:Read name Run0484_ACDWKAANXX_L4_R1_T1209_C1324596, A record is missing a read group
[...]

Fix problem in script:

mv SS2_19_037-H13_chr2.bam SS2_19_037-H13_chr2.sam
samtools view -S -b SS2_19_037-H13_chr2.sam > SS2_19_037-H13_chr2.bam

After a lot of trying: there is no need to fix the script: the user already did so! Cleaning up...

I now have the SAM file, SS2_19_037-H13-CB_chr2.sam that needs to be modified...

[richel@rackham1 ticket_296259]$ head SS2_19_037-H13-CB_chr2.sam
@HD	VN:1.6	SO:coordinate
@SQ	SN:ERCC-00002	LN:1061
@SQ	SN:ERCC-00003	LN:1023
@SQ	SN:ERCC-00004	LN:523
@SQ	SN:ERCC-00009	LN:984
@SQ	SN:ERCC-00012	LN:994
@SQ	SN:ERCC-00013	LN:808
@SQ	SN:ERCC-00014	LN:1957
@SQ	SN:ERCC-00016	LN:844
@SQ	SN:ERCC-00017	LN:1136

I checked, I can shamelessly replace SN:ERCC- by CB:SN:ERCC-

sed 's/SN:ERCC-/CB:SN:ERCC-/g' SS2_19_037-H13-CB_chr2.sam > SS2_19_037-H13-CB_chr2_with_cbs.sam

This does leave some .sam unaffected:

@SQ	SN:chr1	LN:248956422
@SQ	SN:chr10	LN:133797422
@SQ	SN:chr11	LN:135086622
@SQ	SN:chr12	LN:133275309
@SQ	SN:chr13	LN:114364328
@SQ	SN:chr14	LN:107043718
@SQ	SN:chr15	LN:101991189
@SQ	SN:chr16	LN:90338345
@SQ	SN:chr17	LN:83257441
@SQ	SN:chr18	LN:80373285
@SQ	SN:chr19	LN:58617616
@SQ	SN:chr2	LN:242193529
@SQ	SN:chr20	LN:64444167
@SQ	SN:chr21	LN:46709983
@SQ	SN:chr22	LN:50818468
@SQ	SN:chr3	LN:198295559
@SQ	SN:chr4	LN:190214555
@SQ	SN:chr5	LN:181538259
@SQ	SN:chr6	LN:170805979
@SQ	SN:chr7	LN:159345973
@SQ	SN:chr8	LN:145138636
@SQ	SN:chr9	LN:138394717
@SQ	SN:chrM	LN:16569
@SQ	SN:chrX	LN:156040895
  • Ask user: is that OK?

If it is OK, the line that needs to be added is:

# Add CB:s in the header
sed 's/SN:ERCC-/CB:SN:ERCC-/g' SS2_19_037-H13-CB_chr2.sam > SS2_19_037-H13-CB_chr2_with_cbs.sam
# Convert to BAM
samtools view -S -b SS2_19_037-H13-CB_chr2_with_cbs.sam > SS2_19_037-H13-CB_chr2_with_cbs.bam

2024-08-05

Meeting Monday 2024-08-05 10:00

Discuss:

  • Goal: how to be able to fix the problem, maybe the VarTrix developers are better to ask

It seems fixable: we concluded that the user's script converted a BAM file to a text file. I can fix that script :-)

  • Validate BAM file
module load bioinfo-tools
module load vartrix/1.1.22
module load picard/3.1.1
java -jar $PICARD ValidateSamFile --INPUT user_filename.bam

We found out the BAM file was not a binary file at all! Instead, it was a text file in the same format as samtools view gives.

  • [/] What is the history of the BAM file? Goal: reproduce problem on public BAM file

The user will send me her script

Not needed.

  • Decide upon plan

    • User sends script and subsetted BAM file before 14:00
    • Richel works on the fix
    • Next meeting: Tuesday Aug 6th 14:00

Solution

I predict there is something wrong with the bam input file. The user thinks so to (see communication of 2024-08-02). The bam input file cannot be shared.

However, it can be validated (code from the UPPMAX documentation on Picard):

module load bioinfo-tools
module load vartrix/1.1.22
module load picard/3.1.1
java -jar $PICARD ValidateSamFile --INPUT $VARTRIX_TEST/test_dna.bam

The user tried to delete the analysis files and started the analysis from scratch again, but I am unsure if this was correct (see communication of 2024-08-02)

2024-08-02

Problem

I am trying to run vartrix in Bianca. The program seems to work fine when I try the test run (files in $VARTRIX_TEST) but crashes when I use my input files. The errors I get: "Failed to seek to offset ((some number))" "Segmentation fault. Core dump" and I get a core.XXXX file. When I check the core.XXXX with gdb it says that it "Failed to read a valid object file image from memory" and "Program terminated with signal 6" I have no idea what it is trying to tell me. Could you maybe help me? (attached screenshots)

Notes

I assume this is about the Rust program called vartrix at https://github.com/10XGenomics/vartrix.

A GitHub seach on Failed to seek to offset:

I feel the problem is in https://github.com/samtools/htslib, due to this PR:

Lines of code are at:

Both work on a BGZF pointer, a structure defined at https://github.com/samtools/htslib/blob/develop/htslib/bgzf.h#L68.

As the user can get to work testing files, I feel it is most likely that the user's input files are in an invalid format.

I contacted the VarTrix maintainers to help me and the user diagnose the faulty file at this Issue.

Communication

From the user:

The command I used to run a test in Vartrix:

$ module load bioinfo-tools
$ module load vartrix/1.1.22
$ ls -ltr $VARTRIX_TEST/*
$ vartrix --bam $VARTRIX_TEST/test_dna.bam \
  --cell-barcodes $VARTRIX_TEST/dna_barcodes.tsv \
  --fasta $VARTRIX_TEST/test_dna.fa \
  --vcf $VARTRIX_TEST/test_dna.vcf

Trying to remove all analysis files: there are none created, I checked with '''ls -altr''' and I see nothing

Input files: I agree in that it is, most likely, my input files. My input bams have been tinkered with, because my bam files are output coming from RSEM and they are missing the CB tag needed by Vartrix (as Vartrix takes bam files with CellRanger format) I used java/jvarkit to add the CB tag to my RSEM bams I also modified the vcf file to tailor it to what we need, but when I compare it to the test vcf file they seem to have the same format

Sharing my inputs: I'm not sure I can share my bam as it is sensitive information, but I have it in Bianca, so maybe you can access that directory? For the remaining files I could send you a link to a tar in OneDrive, would that work?