Add start from bam #193

Open · wants to merge 10 commits into base: dev
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -3,10 +3,11 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## TDB
## X.X.X - TBD [XXXX-XX-XX]

### `Added`

- Added the option of starting from a bam file [#193](https://github.com/genomic-medicine-sweden/tomte/pull/193)
- Optionally run Peddy for per-sample sex- and heterozygosity checks [#190](https://github.com/genomic-medicine-sweden/tomte/pull/190)
- Optionally calculate percentage mapping to hemoglobin genes (or any other set of genes provided) [#190](https://github.com/genomic-medicine-sweden/tomte/pull/190)
- Added the option of providing sex as 0, 1, or 2 as in the raredisease pipeline [#192](https://github.com/genomic-medicine-sweden/tomte/pull/192)
28 changes: 27 additions & 1 deletion assets/schema_input.json
@@ -41,6 +41,20 @@
}
]
},
"bam": {
"errorMessage": "BAM file cannot contain spaces and must have extension '.bam'",
"type": "string",
"pattern": "^\\S+\\.bam$",
"format": "file-path",
"exists": true
},
"bai": {
"errorMessage": "BAM index file cannot contain spaces and must have extension '.bai'",
"type": "string",
"pattern": "^\\S+\\.bai$",
"format": "file-path",
"exists": true
},
Comment on lines +44 to +57

Contributor: We can add it later, but it would be nice to have the possibility to start from CRAM as well.

"strandedness": {
"type": "string",
"meta": ["strandedness"],
@@ -63,6 +77,18 @@
"errorMessage": "The valid input for sample sex is M, F, NA, other, 0, 1 or 2"
}
},
"required": ["case", "sample", "fastq_1", "strandedness"]
"anyOf": [
{
"dependentRequired": {
"lane": ["fastq_1"]
}
},
{
"dependentRequired": {
"lane": ["bam"]
}
}
],
"required": ["case", "sample", "strandedness"]
}
}
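The `anyOf` of two `dependentRequired` blocks above is what lets a samplesheet row supply either FASTQ or BAM input. Below is a minimal pure-Python sketch of the rule this fragment encodes — a simplified re-implementation for illustration only, not the actual nf-schema validator used by the pipeline:

```python
import re

def validate_row(row: dict) -> list:
    """Simplified mirror of the schema fragment above."""
    errors = []
    # "required": ["case", "sample", "strandedness"]
    for key in ("case", "sample", "strandedness"):
        if not row.get(key):
            errors.append(f"missing required field: {key}")
    # "pattern" checks for bam/bai, as in the JSON schema
    if row.get("bam") and not re.fullmatch(r"\S+\.bam", row["bam"]):
        errors.append("BAM file cannot contain spaces and must have extension '.bam'")
    if row.get("bai") and not re.fullmatch(r"\S+\.bai", row["bai"]):
        errors.append("BAM index file cannot contain spaces and must have extension '.bai'")
    # the anyOf/dependentRequired pair: a row with a lane must also carry
    # fastq_1 or bam
    if row.get("lane") and not (row.get("fastq_1") or row.get("bam")):
        errors.append("a lane entry requires fastq_1 or bam")
    return errors

row = {"case": "fam_1", "sample": "CONTROL_REP1", "strandedness": "reverse",
       "bam": "AEG588A1.bam", "bai": "AEG588A1.bam.bai"}
print(validate_row(row))  # → []
```

The regex checks mirror the schema's `pattern` keys; the `format`/`exists` checks are omitted for brevity.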
2 changes: 2 additions & 0 deletions conf/test.config
@@ -26,6 +26,8 @@ params {
// Skip when GITHUB actions
skip_drop_ae = System.getenv("GITHUB_ACTIONS").equals(null) ? false : true
skip_drop_as = System.getenv("GITHUB_ACTIONS").equals(null) ? false : true
skip_downsample = false
skip_subsample_region = true

// Peddy looks for sites beyond chromosome 21 and thus crashes for this test case
skip_peddy = true
1 change: 1 addition & 0 deletions docs/output.md
@@ -58,6 +58,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
#### Salmon

[`Salmon`](https://salmon.readthedocs.io/en/latest/) quantifies reads.
Note that as Salmon has been set up to start from FASTQ files, it will not run if the pipeline starts from BAM files.
Contributor: I don't know Salmon very well. Would it make sense to convert BAM to FASTQ in order to run Salmon when starting from BAM?

Collaborator (author): Yes, it would; however, I think that should not be done in this PR, perhaps in the next one.


<details markdown="1">
<summary>Output files</summary>
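The thread above leaves BAM-to-FASTQ conversion for a later PR. For reference, a common approach is `samtools collate` piped into `samtools fastq`; this sketch only assembles the command line (the file names are hypothetical, and none of this is part of the PR):

```python
def bam_to_fastq_cmd(bam, r1, r2):
    """Build the conventional samtools pipeline for regenerating paired
    FASTQ files from a BAM, so Salmon could still be run."""
    # samtools collate groups mates together; -u keeps the stream
    # uncompressed, -O writes to stdout.
    collate = ["samtools", "collate", "-u", "-O", bam]
    # samtools fastq splits the collated stream into R1/R2; singletons and
    # unflagged reads are discarded to /dev/null here.
    fastq = ["samtools", "fastq", "-1", r1, "-2", r2,
             "-0", "/dev/null", "-s", "/dev/null", "-n", "-"]
    # In a shell this is: samtools collate -u -O in.bam | samtools fastq ...
    return collate + ["|"] + fastq

print(" ".join(bam_to_fastq_cmd("sample.bam", "sample_R1.fastq.gz", "sample_R2.fastq.gz")))
```

Only the command construction is shown; actually executing it requires samtools on the PATH.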
25 changes: 17 additions & 8 deletions docs/usage.md
@@ -98,17 +98,19 @@ Running the pipeline involves three steps:

#### Samplesheet

A samplesheet is used to pass the information about the sample(s), such as the path to the FASTQ files and other meta data (sex, phenotype, etc.,) to the pipeline in csv format.
A samplesheet is used to pass information about the sample(s), such as the paths to the FASTQ/BAM files and other metadata (sex, phenotype, etc.), to the pipeline in CSV format.
Contributor: In long-read sequencing, unaligned reads are often stored as BAM files rather than FASTQ. Is it necessary to specify aligned reads for BAM, or would it be obvious to the user that BAM equals aligned reads?

Collaborator (author): I am unsure, to be honest; I even wonder whether it would work with uBAM.

Contributor: I guess for most people BAM = aligned reads, so I think you can ignore my question.


genomic-medicine-sweden/tomte requires the information given below.

| Fields | Description |
| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `case` | Case ID, for the analysis used when generating a family VCF. |
| `sample` | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`). |
| `fastq_1` | Absolute path to FASTQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". |
| `fastq_2` | Absolute path to FASTQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". |
| `strandedness` | Sample strandness |
| Fields | Description | Mandatory? |
| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------- |
| `case` | Case ID, for the analysis used when generating a family VCF. | Mandatory |
| `sample` | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`). | Mandatory |
| `fastq_1` | Absolute path to FASTQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". | Provide either fastq_1 or bam |
| `fastq_2` | Absolute path to FASTQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". | Provide either fastq_2 or bai |
| `strandedness` | Sample strandedness                                                                                                                                                                    | Mandatory                     |
| `bam` | Full path to BAM file. | Provide either fastq_1 or bam |
| `bai` | Full path to BAM index file. | Provide either fastq_2 or bai |
Comment on lines +112 to +113

Contributor: Do the descriptions match here? "Provide either fastq_2 or bai"

Collaborator (author): That part states whether the file is mandatory or not; it's the third column.

Contributor: I thought it was a copy-paste error. Should bai not always be mandatory when you have bam, or will it work without bai?

Collaborator (author): You are right.


It is also possible to include multiple runs of the same sample in a samplesheet. For example, when you have re-sequenced the same sample more than once to increase sequencing depth. In that case, the `sample` identifiers in the samplesheet have to be the same. The pipeline will align the raw read/read-pairs independently before merging the alignments belonging to the same sample. Below is an example for a trio with the proband sequenced across two lanes:

Expand All @@ -119,6 +121,13 @@ It is also possible to include multiple runs of the same sample in a samplesheet
| fam_1 | PATIENT_1 | AEG588A3_S1_L001_R1_001.fastq.gz | AEG588A3_S1_L001_R2_001.fastq.gz | reverse |
| fam_1 | PATIENT_1 | AEG588A3_S1_L002_R1_001.fastq.gz | AEG588A3_S1_L002_R2_001.fastq.gz | reverse |

Here is an example of a samplesheet where BAM files are provided:

| case | sample | fastq_1 | fastq_2 | strandedness | bam | bai |
| ----- | ------------ | ------- | ------- | ------------ | ------------ | ---------------- |
| fam_1 | CONTROL_REP1 | | | reverse | AEG588A1.bam | AEG588A1.bam.bai |
| fam_1 | CONTROL_REP2 | | | reverse | AEG588A2.bam | AEG588A2.bam.bai |

If you would like to see more examples of what a typical samplesheet looks like for a duo, follow this link: [sample_sheet](https://github.com/genomic-medicine-sweden/tomte/blob/master/test_data/samplesheet_chr21.csv)

#### Reference files and parameters
31 changes: 25 additions & 6 deletions modules/local/drop_sample_annot.nf
@@ -24,19 +24,36 @@
task.ext.when == null || task.ext.when

script:
def id = "${ids}".replace("[","").replace("]","").replace(",","")
def single_end = "${single_ends}".replace("[","").replace("]","").replace(",","")
def sex_drop = "${sex}".replace("[","").replace("]","").replace(",","").replace("1","M").replace("2","F").replace("0","NA").replace("other","NA")
def strandedness = "${strandednesses}".replace("[","").replace("]","").replace(",","")
def id = ids.join(' ')
def single_end = single_ends.join(' ')
def sex_drop = sex.collect { it.replace("1","M").replace("2","F").replace("0","NA").replace("other","NA") }.join(' ')
def strandedness = strandednesses.join(' ')
Comment on lines -27 to +30

Contributor: Why was this changed?

Collaborator (author): It is more concise; just to prettify.

Contributor: Okay, so the .replace(...) calls are no longer necessary?

Collaborator (author): The join does the same, as it results in a single value separated by " " that Python understands.

def drop_group = "${drop_group_samples_ae},${drop_group_samples_as}".replace(" ","").replace("[","").replace("]","")
def reference_count_file = ref_gene_counts ? "--ref_count_file ${ref_gene_counts}" : ''
def reference_annotation = ref_annot ? "--ref_annot ${ref_annot}" : ''
"""
SINGLE_ENDS=(${single_end})
BAMS=(${bam.join(' ')})

# Check if single_end values are provided
updated_single_ends=()
for ((i=0; i<\${#SINGLE_ENDS[@]}; i++)); do
if [[ "\${SINGLE_ENDS[i]}" == "null" ]]; then
result=\$(samtools view -c -f 1 "\${BAMS[i]}" | awk '{print \$1 == 0 ? "true" : "false"}')
updated_single_ends+=("\$result")
else
updated_single_ends+=("\${SINGLE_ENDS[i]}")
fi
done

# Convert updated_single_ends array to space-separated string and save to file
echo "\${updated_single_ends[*]}" > updated_single_ends.txt

Comment on lines +35 to +51

Contributor: It's hard for me to understand what's happening here. Are you checking whether there are any paired-end reads in the BAM file and then printing true or false? Is this information not already in the meta/single_end input to this process?

Collaborator (author): It will be there if one starts from FASTQ, but it won't be if you start from BAM; that's why I added it. I am very open to any suggestion on how to make it more readable, because I totally agree 😄

drop_sample_annot.py \\
--bam ${bam} \\
--bam ${bam.join(' ')} \\
--samples $id \\
--strandedness $strandedness \\
--single_end $single_end \\
--single_end \$(cat updated_single_ends.txt) \\
--sex $sex_drop \\
$reference_count_file \\
$reference_annotation \\
@@ -46,6 +63,7 @@
cat <<-END_VERSIONS > versions.yml
"${task.process}":
drop_sample_annot: \$(drop_sample_annot.py --version)
samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
END_VERSIONS
"""

@@ -56,6 +74,7 @@
cat <<-END_VERSIONS > versions.yml
"${task.process}":
drop_sample_annot: \$(drop_sample_annot.py --version)
samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
END_VERSIONS
"""
}
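The bash loop in the script above infers the library layout from the BAM itself: `samtools view -c -f 1` counts reads carrying the "read paired" SAM flag (0x1), and a count of zero is treated as single-end. The same flag arithmetic, sketched in Python on plain flag integers (a standalone illustration, not pipeline code):

```python
FLAG_PAIRED = 0x1  # SAM flag bit: template has multiple segments (read is paired)

def is_single_end(flags):
    """Mirror of `samtools view -c -f 1 | awk '{print $1 == 0 ? "true" : "false"}'`:
    if no read carries the paired bit, the library is treated as single-end."""
    paired_count = sum(1 for f in flags if f & FLAG_PAIRED)
    return paired_count == 0

# Flags as they might appear in a BAM: 99/147 form a proper pair,
# 0/16 are unpaired forward/reverse reads.
print(is_single_end([99, 147]))  # → False
print(is_single_end([0, 16]))    # → True
```

This only replaces the detection step when `single_end` is `null`, i.e. when the run started from BAM and the metadata was never set from FASTQ input.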
21 changes: 14 additions & 7 deletions subworkflows/local/alignment.nf
@@ -13,7 +13,8 @@ include { SAMTOOLS_VIEW } from '../../modules/nf-core/samtools/view/main'

workflow ALIGNMENT {
take:
reads // channel: [mandatory] [ val(meta), [path(reads)] ]
ch_fastq_reads // channel: [optional] [ val(meta), [path(reads)] ]
ch_bam_bai_reads // channel: [optional] [ val(meta), [path(bam), path(bai)] ]
Comment on lines +16 to +17

Contributor suggested aligning the whitespace of the channel comments:

    ch_fastq_reads      // channel: [optional] [ val(meta), [path(reads)] ]
    ch_bam_bai_reads    // channel: [optional] [ val(meta), [path(bam), path(bai)] ]

star_index // channel: [mandatory] [ val(meta), path(star_index) ]
ch_gtf // channel: [mandatory] [ val(meta), path(gtf) ]
ch_platform // channel: [mandatory] [ val(platform) ]
@@ -28,7 +29,7 @@
main:
ch_versions = Channel.empty()

ch_fastq = branchFastqToSingleAndMulti(reads)
ch_fastq = branchFastqToSingleAndMulti(ch_fastq_reads)

CAT_FASTQ(ch_fastq.multiple_fq)
.reads.mix(ch_fastq.single_fq)
@@ -38,13 +39,19 @@

STAR_ALIGN(FASTP.out.reads, star_index, ch_gtf, false, ch_platform, false)

ch_bam_reads = ch_bam_bai_reads.map { meta, bambai -> [ meta, bambai[0] ] }
ch_bai_reads = ch_bam_bai_reads.map { meta, bambai -> [ meta, bambai[1] ] }

ch_bam_aligned=ch_bam_reads.mix(STAR_ALIGN.out.bam_sorted_aligned)
Contributor suggested spaces around the assignment:

    ch_bam_aligned = ch_bam_reads.mix(STAR_ALIGN.out.bam_sorted_aligned)

Contributor: There can now also be aligned reads in ch_bam_reads, right? ch_bam_aligned and ch_bai are "equivalents"? I think I follow, but perhaps there could be more expressive names; it's a bit hard for me to see the difference between ch_bam_aligned + ch_bai, ch_bam_bai, and ch_bam_bai_out together with the different ifs.


SAMTOOLS_INDEX( STAR_ALIGN.out.bam_sorted_aligned )
ch_bai = ch_bai_reads.mix(SAMTOOLS_INDEX.out.bai)

ch_bam_bai = Channel.empty()
ch_bam_bai_out = Channel.empty()

if (!skip_subsample_region) {
RNA_SUBSAMPLE_REGION( STAR_ALIGN.out.bam_sorted_aligned, subsample_bed, seed_frac)
RNA_SUBSAMPLE_REGION( ch_bam_aligned, subsample_bed, seed_frac)
ch_bam_bai = ch_bam_bai.mix(RNA_SUBSAMPLE_REGION.out.bam_bai)
ch_versions = ch_versions.mix(RNA_SUBSAMPLE_REGION.out.versions.first())
if (skip_downsample) {
@@ -55,17 +62,17 @@
ch_versions = ch_versions.mix(RNA_DOWNSAMPLE.out.versions.first())
}
} else {
ch_bam_bai = ch_bam_bai.mix(STAR_ALIGN.out.bam_sorted_aligned.join(SAMTOOLS_INDEX.out.bai))
ch_bam_bai = ch_bam_bai.mix(ch_bam_aligned.join(ch_bai))
if (skip_downsample) {
ch_bam_bai_out = STAR_ALIGN.out.bam_sorted_aligned.join(SAMTOOLS_INDEX.out.bai)
ch_bam_bai_out = ch_bam_aligned.join(ch_bai)
} else {
RNA_DOWNSAMPLE( ch_bam_bai, num_reads)
ch_bam_bai_out = RNA_DOWNSAMPLE.out.bam_bai
ch_versions = ch_versions.mix(RNA_DOWNSAMPLE.out.versions.first())
}
}

SAMTOOLS_VIEW( STAR_ALIGN.out.bam_sorted_aligned.join(SAMTOOLS_INDEX.out.bai), ch_genome_fasta, [] )
SAMTOOLS_VIEW( ch_bam_aligned.join(ch_bai), ch_genome_fasta, [] )

SALMON_QUANT( FASTP.out.reads, salmon_index, ch_gtf.map{ meta, gtf -> gtf }, [], false, 'A')

@@ -79,7 +86,7 @@
emit:
merged_reads = CAT_FASTQ.out.reads // channel: [ val(meta), path(fastq) ]
fastp_report = FASTP.out.json // channel: [ val(meta), path(json) ]
bam = STAR_ALIGN.out.bam_sorted_aligned // channel: [ val(meta), path(bam) ]
bam = ch_bam_aligned // channel: [ val(meta), path(bam) ]
Contributor: Could you align the // comments?

bam_bai = ch_bam_bai // channel: [ val(meta), path(bam), path(bai) ]
bam_ds_bai = ch_bam_bai_out // channel: [ val(meta), path(bam), path(bai) ]
gene_counts = STAR_ALIGN.out.read_per_gene_tab // channel: [ val(meta), path(tsv) ]
43 changes: 24 additions & 19 deletions subworkflows/local/utils_nfcore_tomte_pipeline/main.nf
@@ -72,45 +72,49 @@ workflow PIPELINE_INITIALISATION {
// Create channel from input file provided through params.input
//

Channel
Channel
.fromList(samplesheetToList(params.input, "${projectDir}/assets/schema_input.json"))
.tap { ch_original_input }
.map { meta, fastq_1, fastq_2 -> meta.id }
.reduce([:]) { counts, sample -> // get counts of each sample in the samplesheet - for groupTuple
.map { meta, fastq_1, fastq_2, bam, bai -> meta.id }
.reduce([:]) { counts, sample ->
counts[sample] = (counts[sample] ?: 0) + 1
counts
}
.combine ( ch_original_input )
.map { counts, meta, fastq_1, fastq_2 ->
if (!fastq_2) {
return [ meta + [ single_end:true, fq_pairs:counts[meta.id] ], [ fastq_1 ] ]
.map { counts, meta, fastq_1, fastq_2, bam, bai ->
if (bam) {
return [ meta + [ single_end:false, fq_pairs:counts[meta.id], is_bam:true ], [ bam, bai ] ]
} else if (!fastq_2) {
return [ meta + [ single_end:true, fq_pairs:counts[meta.id], is_bam:false ], [ fastq_1 ] ]
} else {
return [ meta + [ single_end:false, fq_pairs:counts[meta.id] ], [ fastq_1, fastq_2 ] ]
return [ meta + [ single_end:false, fq_pairs:counts[meta.id], is_bam:false ], [ fastq_1, fastq_2 ] ]
}
}
.tap { ch_input_counts }
.map { meta, fastqs -> fastqs }
.reduce([:]) { counts, fastqs -> // get number of fastq sets in the run - for creating unique ID:s
counts[fastqs] = counts.size() + 1
.map { meta, files -> [meta.id, files] }
.reduce([:]) { counts, id_files ->
counts[id_files[0]] = (counts[id_files[0]] ?: [:])
counts[id_files[0]][id_files[1]] = counts[id_files[0]].size() + 1
return counts
}
.combine( ch_input_counts )
.map { lineno, meta, fastqs -> // append line number to sample id for unique set ids
new_meta = meta + [id:meta.id+"_id"+lineno[fastqs]]
return [ new_meta, fastqs ]
.map { lineno, meta, files ->
new_meta = meta + [id:meta.id+"_id"+lineno[meta.id][files]]
return [ new_meta, files ]
}
.tap { ch_samplesheet } // Output, the rest is just for validation
.map { meta, fastqs ->
return [ meta.sample, groupKey( meta + [id:meta.sample], meta.fq_pairs ), fastqs ]
.tap { ch_samplesheet }
.map { meta, files ->
return [ meta.sample, groupKey( meta + [id:meta.sample], meta.fq_pairs ), files ]
}
.groupTuple()
.map {
validateInputSamplesheet(it)
}

emit:
samplesheet = ch_samplesheet
versions = ch_versions
ch_samplesheet.view()
emit:
samplesheet = ch_samplesheet
versions = ch_versions
Contributor: Indent everything (from line 75 to here)?

}
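The reworked channel chain above does two passes of bookkeeping: it counts rows per sample (stored as `meta.fq_pairs` for the later `groupTuple`), then numbers each distinct file set within a sample so repeated runs get unique IDs such as `PATIENT_1_id1`, `PATIENT_1_id2`. A sketch of the same logic in Python (the helper name and tuple layout are illustrative, not part of the PR):

```python
def assign_unique_ids(rows):
    """rows: list of (sample, files) tuples, files being a tuple of paths.
    Returns (unique_id, sample, fq_pairs, files) per row, mirroring the
    reduce/combine chain in the subworkflow."""
    # First pass: rows per sample -> meta.fq_pairs (groupTuple group size)
    counts = {}
    for sample, _ in rows:
        counts[sample] = counts.get(sample, 0) + 1
    # Second pass: per-sample numbering of distinct file sets -> unique id
    seen = {}
    out = []
    for sample, files in rows:
        per_sample = seen.setdefault(sample, {})
        if files not in per_sample:
            per_sample[files] = len(per_sample) + 1
        out.append((f"{sample}_id{per_sample[files]}", sample, counts[sample], files))
    return out

rows = [("PATIENT_1", ("L001_R1.fastq.gz", "L001_R2.fastq.gz")),
        ("PATIENT_1", ("L002_R1.fastq.gz", "L002_R2.fastq.gz")),
        ("CONTROL_REP1", ("AEG588A1.bam", "AEG588A1.bam.bai"))]
for r in assign_unique_ids(rows):
    print(r)
```

Keying the second pass on `meta.id` as well as the file set is what the PR changes: previously the counter was global over FASTQ sets, now it is per sample, which also works for BAM/BAI pairs.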

/*
@@ -186,6 +190,7 @@ def validateInputSamplesheet(input) {

return [ metas[0], fastqs ]
}

//
// Get attribute from genome config file e.g. fasta
//
16 changes: 14 additions & 2 deletions workflows/tomte.nf
@@ -121,14 +121,26 @@ workflow TOMTE {
).set { ch_references }
ch_versions = ch_versions.mix(PREPARE_REFERENCES.out.versions.first())

// Prepare input
ch_samplesheet
.branch {
fastq: it[1].any { it.toString().endsWith('.fastq.gz') || it.toString().endsWith('.fq.gz') }
bam: it[1].any { it.toString().endsWith('.bam') }
}
.set { ch_input_branch }

ch_bam_reads = ch_input_branch.bam
ch_fastq_reads = ch_input_branch.fastq

FASTQC (
ch_samplesheet
ch_fastq_reads
)
ch_multiqc_files = ch_multiqc_files.mix(FASTQC.out.zip.collect{it[1]})
ch_versions = ch_versions.mix(FASTQC.out.versions.first())

ALIGNMENT(
ch_samplesheet,
ch_fastq_reads,
ch_bam_reads,
ch_references.star_index,
ch_references.gtf,
ch_platform,
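The `.branch` above routes each samplesheet entry by file extension: FASTQ entries go through FastQC and the FASTQ arm of `ALIGNMENT`, while BAM entries feed the new BAM arm directly. The routing rule, sketched in Python (an illustration of the branching logic only):

```python
def branch_by_extension(entries):
    """entries: list of (meta, files); mirrors the .branch in the workflow,
    where the first matching branch claims the entry."""
    fastq, bam = [], []
    for meta, files in entries:
        if any(f.endswith((".fastq.gz", ".fq.gz")) for f in files):
            fastq.append((meta, files))
        elif any(f.endswith(".bam") for f in files):
            bam.append((meta, files))
    return fastq, bam

fastq, bam = branch_by_extension([
    ({"id": "s1"}, ["s1_R1.fq.gz", "s1_R2.fq.gz"]),
    ({"id": "s2"}, ["s2.bam", "s2.bam.bai"]),
])
print(len(fastq), len(bam))  # → 1 1
```

As in Nextflow's `.branch`, an entry lands in exactly one branch, which is why the FASTQ check runs first.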