-
Notifications
You must be signed in to change notification settings - Fork 4
Bioinformatics
Try doing this de novo section in 15 minutes.
First, install the needed tools:
conda install -y -c bioconda spades n50
To assemble paired ends, the syntax is:
spades.py -1 {first_pair_fastq} -2 {second_pair_fastq} -o {new_output_directory}
I used placeholders surrounded by curly brackets. I will specify that the output directory must be ~/assembly
.
- Locate in your ~/learn_bash/ directory a set of paired end reads with ".fq" extension
- How many reads are there in those files? Save the answer in "~/stats/reads_count.txt" (you will need to create the "stats" directory first)
- Run spades with the freshly identified ".fq" files.
- Check with ls the content of the output folder
- Count the reads present in the FASTA files in the output folder, saving the result in "~/stats/fasta_count.txt"
If you finished, try invoking the n50
program without parameters. Follow its instruction to read the manual (it will behave like man so you know how to exit).
Use n50 to calculate the n50 statistics of your contigs, saving the output in ~/stats/n50_test.txt
.
Today we will need the some packages that we already installed: bwa, samtools, bedtools. For an extra part we will need spades, n50 and covtobed. If you didn't install them you can do it now (using Miniconda):
conda install -y -c bioconda bwa samtools bedtools covtobed
Today we introduce the "SAM" format, to store sequence mappings. It's a highly structured TSV file (tab-delimited) with a header section (lines starting with '@').
When testing a program it's always a very good idea to make a tiny dataset that we fully understand. That's why your homework has been to create a simple FASTA file with pieces of sequences coming from the Lambda genome. If you want you can copy and paste the sequences presented at the bottom of this page, saving them as ~/reads.fa
.
The program that we will try today is a popular mapper called bwa and developed by Heng Li.
This is a one-step procedure: when you download a new FASTA file to be used as a reference for the alignment, you have to index it first.
bwa index ~/learn_bash/phage/vir_genomic.fna
Check with ls what happened inside the "phage" directory. If you want to sort the files by creation date, what switches can you use (you can check with man).
Now we can align sequences, we'll use the small dataset that we should have in ~/reads.fa
.
bwa mem ~/learn_bash/phage/vir_genomic.fna ~/reads.fa > ~/first_alignment.sam
Time to inspect your first SAM file!
bwa mem ~/learn_bash/phage/vir_genomic.fna ~/learn_bash/phage/vir_reads1.fq > ~/phage.sam
samtools is the swiss-army knife for manipulating SAM files. We will see only the minimal pipeline to convert a SAM file to its binary version (BAM), sorting it by coordinate an finally indexing it.
This is a mock workflow, where I did put {placeholders}
inside curly brackets. Try to adapt it for your needs:
# Convert SAM to BAM: you can provide the reference but it's not mandatory. The following commands are thus _alternatives_
samtools view -b -T {reference} {sam_file} > {bam_output}
samtools view -b -S {sam_file} > {bam_output}
# Sort a BAM file
samtools sort -o {sorted_bam} {unsorted_bam}
# Indexing a _sorted_ BAM file
samtools index {sorted_bam}
# see with an 'ls' that a new file has been created
Now that you have a (sorted) BAM file, you can have a look using samtools view:
samtools view {bam_file} | less -S
Lowercase letters are artificially added "errors". "INS" means an insertion was done (lowercase), "DEL" a deletion. "RC" means reverse complemented.
>0-Lambda:980-1040
GCCTACTTTATAGAGCATAAGCAGCGCAACACCCTTATCTGGTTGCCGACGGATGGTGATG
>1-Lambda:100-200
CTCTGAAAAGAAAGGAAACGACAGGTGCTGAAAGCGAGGCTTTTTGGCCTCTGTCGTTTCCTTTCTCTGTTTTTGTCCGTGGAATGAACAATGGAAGTCAA
>2a-Lambda:100-200_INS
CTCTGAAAAGAAAGGAAACGACAGGTGCTGAAAGCGAGGCTTTTTGGCCTCTcagataGTCGTTTCCTTTCTCTGTTTTTGTCCGTGGAATGAACAATGGAAGTCAA
>2b-Lambda:100-200_DEL
CTCTGAAAAGAAAGGAAACGACAGGTGCTGAAAGCGAGGCTTTTTGGCCTCTTTCCTTTCTCTGTTTTTGTCCGTGGAATGAACAATGGAAGTCAA
>3-Lambda:100-200_RC
TTGACTTCCATTGTTCATTCCACGGACAAAAACAGAGAAAGGAAACGACAGAGGCCAAAAAGCCTCGCTTTCAGCACCTGTCGTTTCCTTTCTTTTCAGAG
>4-Lambda:500-540
ACTCCGCTGAAGTGGTGGAAACCGCATTCTGTACTTTCGTG
>5-Lambda:500-540_RC
CACGAAAGTACAGAATGCGGTTTCCACCACTTCAGCGGAGT
>6-Lambda:900-940_980-1040_Errors
TGGGCAGCGACTACATaCGTGAGGTGAATGTGGTGAAGTCTGCCTACaaaaATAGAGCATAAGCAGCGCAACACCCTTATCTGGTTGCCGtcttcgGGTcAaG
>7-Lambda:1900-1940_1980-2040
CGCCGGTATCGACTCCCAGCTGGACCGCTACGAAATGCGCGnACCGGCAGATTATTATGGGCCGCCACGACGATGAACAGACGCTGCTGCGTGTGGATGAGGC
· Bioinformatics at the Command Line - Andrea Telatin, 2017-2020
Menu