Skip to content

Bioinformatics

Andrea Telatin edited this page Apr 15, 2020 · 9 revisions

Short reads mapping with BWA

The "SAM" format

Today we introduce the "SAM" format, to store sequence mappings. It's a highly structured TSV file (tab-delimited) with a header section (lines starting with '@').

Understanding a program

When testing a program it's always a very good idea to make a tiny dataset that we fully understand. That's why your homework has been to create a simple FASTA file with pieces of sequences coming from the Lambda genome. If you want you can copy and paste the sequences presented at the bottom of this page, saving them as ~/reads.fa.

The program that we will try today is a popular mapper called bwa and developed by Heng Li.

Indexing the genome

This is a one-step procedure: when you download a new FASTA file to be used as a reference for the alignment, you have to index it first.

bwa index  ~/learn_bash/phage/vir_genomic.fna

Check with ls what happened inside the "phage" directory. If you want to sort the files by creation date, what switches can you use (you can check with man).

Alignment: first test!

Now we can align sequences, we'll use the small dataset that we should have in ~/reads.fa.

bwa mem ~/learn_bash/phage/vir_genomic.fna ~/reads.fa > ~/first_alignment.sam

Time to inspect your first SAM file!

Alignment: a dataset

bwa mem  ~/learn_bash/phage/vir_genomic.fna ~/learn_bash/phage/vir_reads1.fq > ~/phage.sam

Samtools primer

samtools is the swiss-army knife for manipulating SAM files. We will see only the minimal pipeline to convert a SAM file to its binary version (BAM), sorting it by coordinate an finally indexing it.

This is a mock workflow, where I did put {placeholders} inside curly brackets. Try to adapt it for your needs:

# Convert SAM to BAM: you can provide the reference but it's not mandatory. The following commands are thus _alternatives_
samtools view -b -T {reference} {sam_file} > {bam_output}
samtools view -b -S {sam_file} > {bam_output}
 
# Sort a BAM file 
samtools sort -o {sorted_bam} {unsorted_bam} 
 
# Indexing a _sorted_ BAM file
samtools index {sorted_bam}
# see with an 'ls' that a new file has been created

Now that you have a (sorted) BAM file, you can have a look using samtools view:

samtools view {bam_file} | less -S 

reads.fa

Lowercase letters are artificially added "errors". "INS" means an insertion was done (lowercase), "DEL" a deletion. "RC" means reverse complemented.

>0-Lambda:980-1040
GCCTACTTTATAGAGCATAAGCAGCGCAACACCCTTATCTGGTTGCCGACGGATGGTGATG
>1-Lambda:100-200
CTCTGAAAAGAAAGGAAACGACAGGTGCTGAAAGCGAGGCTTTTTGGCCTCTGTCGTTTCCTTTCTCTGTTTTTGTCCGTGGAATGAACAATGGAAGTCAA
>2a-Lambda:100-200_INS
CTCTGAAAAGAAAGGAAACGACAGGTGCTGAAAGCGAGGCTTTTTGGCCTCTcagataGTCGTTTCCTTTCTCTGTTTTTGTCCGTGGAATGAACAATGGAAGTCAA
>2b-Lambda:100-200_DEL
CTCTGAAAAGAAAGGAAACGACAGGTGCTGAAAGCGAGGCTTTTTGGCCTCTTTCCTTTCTCTGTTTTTGTCCGTGGAATGAACAATGGAAGTCAA
>3-Lambda:100-200_RC
TTGACTTCCATTGTTCATTCCACGGACAAAAACAGAGAAAGGAAACGACAGAGGCCAAAAAGCCTCGCTTTCAGCACCTGTCGTTTCCTTTCTTTTCAGAG
>4-Lambda:500-540
ACTCCGCTGAAGTGGTGGAAACCGCATTCTGTACTTTCGTG
>5-Lambda:500-540_RC
CACGAAAGTACAGAATGCGGTTTCCACCACTTCAGCGGAGT
>6-Lambda:900-940_980-1040_Errors
TGGGCAGCGACTACATaCGTGAGGTGAATGTGGTGAAGTCTGCCTACaaaaATAGAGCATAAGCAGCGCAACACCCTTATCTGGTTGCCGtcttcgGGTcAaG
>7-Lambda:1900-1940_1980-2040
CGCCGGTATCGACTCCCAGCTGGACCGCTACGAAATGCGCGnACCGGCAGATTATTATGGGCCGCCACGACGATGAACAGACGCTGCTGCGTGTGGATGAGGC

Menu

Clone this wiki locally