Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add start from bam #193

Open
wants to merge 9 commits into
base: dev
Choose a base branch
from
Open

Add start from bam #193

wants to merge 9 commits into from

Conversation

Lucpen
Copy link
Collaborator

@Lucpen Lucpen commented Jan 4, 2025

PR checklist

  • Add possibility to start from bam
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

Copy link

github-actions bot commented Jan 4, 2025

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 94eeb9d

+| ✅ 182 tests passed       |+
#| ❔  26 tests were ignored |#
!| ❗   8 tests had warnings |!

❗ Test warnings:

  • files_exist - File not found: .github/workflows/awstest.yml
  • files_exist - File not found: .github/workflows/awsfulltest.yml
  • nextflow_config - Config manifest.version should end in dev: 3.0.0
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in base.config: Check the defaults for all processes
  • pipeline_todos - TODO string in base.config: Customise requirements for specific processes.
  • pipeline_todos - TODO string in test_full.config: Specify the paths to your full test data ( on nf-core/test-datasets or directly in repositories, e.g. SRA)
  • pipeline_todos - TODO string in test_full.config: Give any required params for the test so that command line flags are not needed

❔ Tests ignored:

  • files_exist - File is ignored: assets/nf-core-tomte_logo_light.png
  • files_exist - File is ignored: docs/images/nf-core-tomte_logo_light.png
  • files_exist - File is ignored: docs/images/nf-core-tomte_logo_dark.png
  • files_exist - File is ignored: docs/images/tomte_logo.eps
  • files_exist - File is ignored: docs/images/tomte_pipeline_metromap.svg
  • files_exist - File is ignored: docs/images/tomte_pipeline_metromap.png
  • files_exist - File is ignored: conf/modules.config
  • files_exist - File is ignored: .github/ISSUE_TEMPLATE/config.yml
  • nextflow_config - Config variable ignored: manifest.name
  • nextflow_config - Config variable ignored: manifest.homePage
  • files_unchanged - File ignored due to lint config: .github/CONTRIBUTING.md
  • files_unchanged - File ignored due to lint config: .github/ISSUE_TEMPLATE/bug_report.yml
  • files_unchanged - File does not exist: .github/ISSUE_TEMPLATE/config.yml
  • files_unchanged - File ignored due to lint config: .github/ISSUE_TEMPLATE/feature_request.yml
  • files_unchanged - File ignored due to lint config: .github/PULL_REQUEST_TEMPLATE.md
  • files_unchanged - File ignored due to lint config: .github/workflows/branch.yml
  • files_unchanged - File ignored due to lint config: .github/workflows/linting.yml
  • files_unchanged - File ignored due to lint config: assets/email_template.html
  • files_unchanged - File ignored due to lint config: assets/email_template.txt
  • files_unchanged - File ignored due to lint config: assets/sendmail_template.txt
  • files_unchanged - File does not exist: assets/nf-core-tomte_logo_light.png
  • files_unchanged - File does not exist: docs/images/nf-core-tomte_logo_light.png
  • files_unchanged - File does not exist: docs/images/nf-core-tomte_logo_dark.png
  • files_unchanged - File ignored due to lint config: docs/README.md
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/tomte/tomte/.github/workflows/awstest.yml
  • modules_config - modules_config

✅ Tests passed:

Run details

  • nf-core/tools version 3.0.2
  • Run at 2025-01-09 09:09:34

@Lucpen Lucpen marked this pull request as ready for review January 9, 2025 09:11
@Lucpen Lucpen requested a review from a team as a code owner January 9, 2025 09:11
@Lucpen Lucpen linked an issue Jan 9, 2025 that may be closed by this pull request
@Lucpen Lucpen added enhancement New feature or request Ready for review Ready for review labels Jan 9, 2025
@fellen31 fellen31 self-requested a review January 9, 2025 12:21
Copy link
Contributor

@fellen31 fellen31 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, starting from BAM sounds like an improvement! Do you think adding a BAM start as a CI test is necessary?

@@ -58,6 +58,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
#### Salmon

[`Salmon`](https://salmon.readthedocs.io/en/latest/) quantifies reads.
Note that as Salmon has been setup to start from fastq files, it will not run if the pipeline starts from bam files.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know Salmon very well. Would it make sense to convert bam to fastq in order to run Salmon when starting from BAM?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it would, however, I think that this should not be done in this PR, perhaps on the next one.

Comment on lines -27 to +30
def id = "${ids}".replace("[","").replace("]","").replace(",","")
def single_end = "${single_ends}".replace("[","").replace("]","").replace(",","")
def sex_drop = "${sex}".replace("[","").replace("]","").replace(",","").replace("1","M").replace("2","F").replace("0","NA").replace("other","NA")
def strandedness = "${strandednesses}".replace("[","").replace("]","").replace(",","")
def id = ids.join(' ')
def single_end = single_ends.join(' ')
def sex_drop = sex.collect { it.replace("1","M").replace("2","F").replace("0","NA").replace("other","NA") }.join(' ')
def strandedness = strandednesses.join(' ')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this changed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is more concise, just to prettify

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so the .replace(...) are no longer necessary?

Comment on lines +35 to +51
SINGLE_ENDS=(${single_end})
BAMS=(${bam.join(' ')})

# Check if single_end values are provided
updated_single_ends=()
for ((i=0; i<\${#SINGLE_ENDS[@]}; i++)); do
if [[ "\${SINGLE_ENDS[i]}" == "null" ]]; then
result=\$(samtools view -c -f 1 "\${BAMS[i]}" | awk '{print \$1 == 0 ? "true" : "false"}')
updated_single_ends+=("\$result")
else
updated_single_ends+=("\${SINGLE_ENDS[i]}")
fi
done

# Convert updated_single_ends array to space-separated string and save to file
echo "\${updated_single_ends[*]}" > updated_single_ends.txt

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard for me to understand what's happening here. Are you checking if there are any paired ends in the BAM-file and then printing a false or true? Is this information not already in the meta/single_end input to this process?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be there if one starts from fastq, but it won't if you start from bam, that's why I added it. I am very open to any suggestion on how to make it more readable, because I totally agree 😄

Comment on lines +16 to +17
ch_fastq_reads // channel: [optional] [ val(meta), [path(reads)] ]
ch_bam_bai_reads // channel: [optional] [ val(meta), [path(bam) path(bai)] ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ch_fastq_reads // channel: [optional] [ val(meta), [path(reads)] ]
ch_bam_bai_reads // channel: [optional] [ val(meta), [path(bam) path(bai)] ]
ch_fastq_reads // channel: [optional] [ val(meta), [path(reads)] ]
ch_bam_bai_reads // channel: [optional] [ val(meta), [path(bam) path(bai)] ]

ch_bam_reads = ch_bam_bai_reads.map { meta, bambai -> [ meta, bambai[0] ] }
ch_bai_reads = ch_bam_bai_reads.map { meta, bambai -> [ meta, bambai[1] ] }

ch_bam_aligned=ch_bam_reads.mix(STAR_ALIGN.out.bam_sorted_aligned)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ch_bam_aligned=ch_bam_reads.mix(STAR_ALIGN.out.bam_sorted_aligned)
ch_bam_aligned = ch_bam_reads.mix(STAR_ALIGN.out.bam_sorted_aligned)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There can now also be aligned reads in ch_bam_reads right? ch_bam_aligned and ch_bai are "equivalents"?

I think I follow, but perhaps there could be more expressive names. It's a bit hard for me to see the difference between ch_bam_aligned + ch_bai, ch_bam_bai and ch_bam_bai_out together with the different ifs.


ch_samplesheet
.map { meta, files ->
[meta.sample, groupKey(meta + [id: meta.sample], meta.fq_pairs ?: 1), files]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does meta.fq_pairs ?: 1 work for the bam branch?

versions = ch_versions
emit:
samplesheet = ch_samplesheet
versions = ch_versions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indent everything (from line 75 to here)?

Comment on lines +225 to +228
def expectedNumber = meta[0].single_end ? 1 : 2
def sampleNumber = files.flatten().size()
if (expectedNumber != sampleNumber) {
error("Samplesheet contains incorrect number of fastq files for sample ${sample}. Expected ${expectedNumber}, got ${sampleNumber}.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Comment on lines +112 to +113
| `bam` | Full path to BAM file. | Provide either fastq_1 or bam |
| `bai` | Full path to BAM index file. | Provide either fastq_2 or bai |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the descriptions match here? "Provide either fastq_2 or bai"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That part is whether the file is mandatory or not, its the third column

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was a copy paste error. Should bai not always be mandatory when you have bam then, or will it work without bai?

@@ -98,17 +98,19 @@ Running the pipeline involves three steps:

#### Samplesheet

A samplesheet is used to pass the information about the sample(s), such as the path to the FASTQ files and other meta data (sex, phenotype, etc.,) to the pipeline in csv format.
A samplesheet is used to pass the information about the sample(s), such as the path to the FASTQ/BAM files and other meta data (sex, phenotype, etc.,) to the pipeline in csv format.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In long read unaligned reads are often stored as BAM files rather than fastq. Is it necessary to specify aligned reads for BAM, or would it be obvious to the user that BAM equals aligned reads?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unsure to be honest, I even wonder if it would work uBAM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess for most people BAM = aligned reads, so I think you can ignore my question.

Comment on lines +44 to +57
"bam": {
"errorMessage": "BAM file cannot contain spaces and must have extension '.bam'",
"type": "string",
"pattern": "^\\S+\\.bam$",
"format": "file-path",
"exists": true
},
"bai": {
"errorMessage": "BAM index file cannot contain spaces and must have extension '.bai'",
"type": "string",
"pattern": "^\\S+\\.bai$",
"format": "file-path",
"exists": true
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add it later but it would be nice with the possibility to start from cram as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Ready for review Ready for review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable bam input
3 participants