Commit

Merge branch 'dev' into TEMPLATE
LilyAnderssonLee authored Jan 17, 2024
2 parents 0673056 + 7c8e671 commit dbf5dff
Showing 3 changed files with 86 additions and 62 deletions.
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -5,7 +5,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## v1.0dev - [date]

Initial release of nf-core/metaval, created with the [nf-core](https://nf-co.re/) template.
Initial release of metaval, created with the [nf-core](https://nf-co.re/) template.

### `Added`

33 changes: 15 additions & 18 deletions README.md
@@ -1,32 +1,30 @@
# ![nf-core/metaval](docs/images/nf-core-metaval_logo_light.png#gh-light-mode-only) ![nf-core/metaval](docs/images/nf-core-metaval_logo_dark.png#gh-dark-mode-only)

[![GitHub Actions CI Status](https://github.com/nf-core/metaval/workflows/nf-core%20CI/badge.svg)](https://github.com/nf-core/metaval/actions?query=workflow%3A%22nf-core+CI%22)
[![GitHub Actions Linting Status](https://github.com/nf-core/metaval/workflows/nf-core%20linting/badge.svg)](https://github.com/nf-core/metaval/actions?query=workflow%3A%22nf-core+linting%22)[![AWS CI](https://img.shields.io/badge/CI%20tests-full%20size-FF9900?labelColor=000000&logo=Amazon%20AWS)](https://nf-co.re/metaval/results)[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.XXXXXXX-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.XXXXXXX)
# metaval

[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A523.04.0-23aa62.svg)](https://www.nextflow.io/)
[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
[![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)
[![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)
[![Launch on Nextflow Tower](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Nextflow%20Tower-%234256e7)](https://tower.nf/launch?pipeline=https://github.com/nf-core/metaval)

[![Get help on Slack](http://img.shields.io/badge/slack-nf--core%20%23metaval-4A154B?labelColor=000000&logo=slack)](https://nfcore.slack.com/channels/metaval)[![Follow on Twitter](http://img.shields.io/badge/twitter-%40nf__core-1DA1F2?labelColor=000000&logo=twitter)](https://twitter.com/nf_core)[![Follow on Mastodon](https://img.shields.io/badge/mastodon-nf__core-6364ff?labelColor=FFFFFF&logo=mastodon)](https://mstdn.science/@nf_core)[![Watch on YouTube](http://img.shields.io/badge/youtube-nf--core-FF0000?labelColor=000000&logo=youtube)](https://www.youtube.com/c/nf-core)

## Introduction

**nf-core/metaval** is a bioinformatics pipeline that ...
**metaval** is a bioinformatics pipeline that verifies the organisms predicted by the nf-core/taxprofiler pipeline using metagenomic data, including both Illumina shotgun sequencing and Nanopore sequencing data.

<!-- TODO nf-core:
Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
major pipeline sections and the types of output it produces. You're giving an overview to someone new
to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
-->
At the moment, metaval only checks the classification results from three classifiers: `Kraken2`, `Centrifuge`, and `diamond`.

## Pipeline summary

1. Extract classified reads for organisms of interest, such as all identified viruses or a predefined list of organisms (see the illustrative sketch after this list).

2. Use `BLAST` to identify the closest reference genome for the extracted reads (downsampling if there are more than 200 reads).

3. Map the extracted reads to reference genomes using `Bowtie2` for Illumina reads and `minimap2` for Nanopore reads.

4. Construct consensus sequences for the mapped reads.

<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. -->
<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->
5. Generate coverage plots.

1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
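
Below is a minimal, hypothetical Python sketch of step 1 (referenced from the list above). It is not the pipeline's own code, which performs read extraction with dedicated Nextflow modules; it only illustrates the idea, assuming the standard Kraken2 per-read output format (tab-separated: classification status, read ID, taxID, read length, LCA mapping). The file names and the example taxID are placeholders, and a real implementation would also include reads assigned to descendant taxa.

```python
#!/usr/bin/env python
"""Illustrative sketch only: extract reads assigned to a taxID of interest."""

import gzip
import sys


def classified_read_ids(kraken2_classifiedreads, target_taxids):
    """Collect IDs of reads that Kraken2 assigned to any of the target taxIDs."""
    read_ids = set()
    with open(kraken2_classifiedreads) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue
            status, read_id, taxid = fields[0], fields[1], fields[2]
            if status == "C" and taxid in target_taxids:
                read_ids.add(read_id)
    return read_ids


def filter_fastq(fastq_in, fastq_out, read_ids):
    """Write only the FASTQ records (4 lines each) whose ID is in read_ids."""
    with gzip.open(fastq_in, "rt") as fin, gzip.open(fastq_out, "wt") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]
            if not record[0]:
                break
            # FASTQ header is '@<id> <description>'; Kraken2 reports the bare ID.
            if record[0][1:].split()[0] in read_ids:
                fout.writelines(record)


if __name__ == "__main__":
    # Placeholder file names; taxID 10239 is the 'Viruses' root in the NCBI taxonomy.
    ids = classified_read_ids("sample.kraken2.classifiedreads.txt", {"10239"})
    filter_fastq("sample_R1.fastq.gz", "sample_viral_R1.fastq.gz", ids)
    sys.stderr.write(f"extracted {len(ids)} read IDs\n")
```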

## Usage

@@ -74,7 +72,7 @@ For more details about the output files and reports, please refer to the

## Credits

nf-core/metaval was originally written by LilyAnderssonLee.
nf-core/metaval was originally written by @LilyAnderssonLee.

We thank the following people for their extensive assistance in the development of this pipeline:

@@ -84,7 +82,6 @@ We thank the following people for their extensive assistance in the development

If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md).

For further information or help, don't hesitate to get in touch on the [Slack `#metaval` channel](https://nfcore.slack.com/channels/metaval) (you can join with [this invite](https://nf-co.re/join/slack)).

## Citations

113 changes: 70 additions & 43 deletions bin/check_samplesheet.py
@@ -11,7 +11,6 @@

logger = logging.getLogger()


class RowChecker:
"""
Define a service that can validate and transform each given row.
@@ -27,9 +26,17 @@ class RowChecker:
def __init__(
self,
sample_col="sample",
first_col="fastq_1",
second_col="fastq_2",
single_col="single_end",
run_accession_col="run_accession",
instrument_platform_col="instrument_platform",
reads_type_col="reads_type",
fastq_1_col="fastq_1",
fastq_2_col="fastq_2",
fasta_col="fasta",
kraken2_report_col="kraken2_report",
kraken2_classifiedout_col="kraken2_classifiedout",
centrifuge_out_col="centrifuge_out",
centrifuge_result_col="centrifuge_result",
diamond_col="diamond",
**kwargs,
):
"""
@@ -38,20 +45,43 @@ def __init__(
Args:
sample_col (str): The name of the column that contains the sample name
(default "sample").
first_col (str): The name of the column that contains the first (or only)
FASTQ file path (default "fastq_1").
second_col (str): The name of the column that contains the second (if any)
FASTQ file path (default "fastq_2").
single_col (str): The name of the new column that will be inserted and
records whether the sample contains single- or paired-end sequencing
reads (default "single_end").
run_accession_col (str): The name of the column that contains the run accession
(default "run_accession").
instrument_platform_col (str): The name of the column that contains the instrument platform
(default "instrument_platform").
reads_type_col (str): The name of the column that contains the reads type, shortread or longread
(default "reads_type").
fastq_1_col (str): The name of the column that contains the first (or only)
FASTQ file path (default "fastq_1") from bowtie2 unmapped read 1 against the human genome.
fastq_2_col (str): The name of the column that contains the second (if any)
FASTQ file path (default "fastq_2") from bowtie2 unmapped read 2 against the human genome.
fasta_col (str): The name of the column that contains the FASTA information
(default "fasta") from minimap2 unmapped read against the human genome.
kraken2_report_col (str): The name of the column that contains the kraken2 report
(default "kraken2_report").
kraken2_classifiedout_col (str): The name of the column that contains the kraken2 classifiedout
(default "kraken2_classifiedout") with the format extension "kraken2.kraken2.classifiedreads.txt".
centrifuge_out_col (str): The name of the column that contains the centrifuge out (kraken2-like report)
(default "centrifuge_out").
centrifuge_result_col (str): The name of the column that contains the centrifuge result
(default "centrifuge_result") with the format extension "centrifuge.results.txt".
diamond_col (str): The name of the column that contains the diamond information
(default "diamond") with the format extension ".csv".
"""
super().__init__(**kwargs)
self._sample_col = sample_col
self._first_col = first_col
self._second_col = second_col
self._single_col = single_col
self._run_accession_col = run_accession_col
self._instrument_platform_col = instrument_platform_col
self._reads_type_col = reads_type_col
self._fastq_1_col = fastq_1_col
self._fastq_2_col = fastq_2_col
self._fasta_col = fasta_col
self._kraken2_report_col = kraken2_report_col
self._kraken2_classifiedout_col = kraken2_classifiedout_col
self._centrifuge_out_col = centrifuge_out_col
self._centrifuge_result_col = centrifuge_result_col
self._diamond_col = diamond_col
self._seen = set()
self.modified = []
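    # Note (illustrative, not part of the original script): __init__ above only
    # records the expected column names. check_samplesheet() further below feeds
    # each parsed row through this checker, roughly as:
    #
    #     checker = RowChecker()
    #     for row in csv.DictReader(handle):
    #         checker.validate_and_transform(row)
    #     checker.validate_unique_samples()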

@@ -68,7 +98,7 @@ def validate_and_transform(self, row):
self._validate_first(row)
self._validate_second(row)
self._validate_pair(row)
self._seen.add((row[self._sample_col], row[self._first_col]))
self._seen.add((row[self._sample_col], row[self._fastq_1_col]))
self.modified.append(row)

def _validate_sample(self, row):
@@ -80,25 +110,22 @@ def _validate_sample(self, row):

def _validate_first(self, row):
"""Assert that the first FASTQ entry is non-empty and has the right format."""
if len(row[self._first_col]) <= 0:
if len(row[self._fastq_1_col]) <= 0:
raise AssertionError("At least the first FASTQ file is required.")
self._validate_fastq_format(row[self._first_col])
self._validate_fastq_format(row[self._fastq_1_col])

def _validate_second(self, row):
"""Assert that the second FASTQ entry has the right format if it exists."""
if len(row[self._second_col]) > 0:
self._validate_fastq_format(row[self._second_col])
if len(row[self._fastq_2_col]) > 0:
self._validate_fastq_format(row[self._fastq_2_col])

def _validate_pair(self, row):
"""Assert that read pairs have the same file extension. Report pair status."""
if row[self._first_col] and row[self._second_col]:
row[self._single_col] = False
first_col_suffix = Path(row[self._first_col]).suffixes[-2:]
second_col_suffix = Path(row[self._second_col]).suffixes[-2:]
if row[self._fastq_1_col] and row[self._fastq_2_col]:
first_col_suffix = Path(row[self._fastq_1_col]).suffixes[-1]
second_col_suffix = Path(row[self._fastq_2_col]).suffixes[-1]
if first_col_suffix != second_col_suffix:
raise AssertionError("FASTQ pairs must have the same file extensions.")
else:
row[self._single_col] = True

def _validate_fastq_format(self, filename):
"""Assert that a given filename has one of the expected FASTQ extensions."""
@@ -113,7 +140,7 @@ def validate_unique_samples(self):
Assert that the combination of sample name and FASTQ filename is unique.
In addition to the validation, also rename all samples to have a suffix of _T{n}, where n is the
number of times the same sample exist, but with different FASTQ files, e.g., multiple runs per experiment.
number of times the same sample exists, but with different FASTQ files, e.g., multiple runs per experiment.
"""
if len(self._seen) != len(self.modified):
@@ -159,7 +186,7 @@ def sniff_format(handle):

def check_samplesheet(file_in, file_out):
"""
Check that the tabular samplesheet has the structure expected by nf-core pipelines.
Check that the tabular samplesheet has the structure expected by metaval.
Validate the general shape of the table, expected columns, and each row. Also add
an additional column which records whether one or two FASTQ reads were found.
@@ -169,26 +196,26 @@ def check_samplesheet(file_in, file_out):
CSV, TSV, or any other format automatically recognized by ``csv.Sniffer``.
file_out (pathlib.Path): Where the validated and transformed samplesheet should
be created; always in CSV format.
Example:
This function checks that the samplesheet follows the following structure,
see also the `viral recon samplesheet`_::
sample,fastq_1,fastq_2
SAMPLE_PE,SAMPLE_PE_RUN1_1.fastq.gz,SAMPLE_PE_RUN1_2.fastq.gz
SAMPLE_PE,SAMPLE_PE_RUN2_1.fastq.gz,SAMPLE_PE_RUN2_2.fastq.gz
SAMPLE_SE,SAMPLE_SE_RUN1_1.fastq.gz,
.. _viral recon samplesheet:
https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv
"""
required_columns = {"sample", "fastq_1", "fastq_2"}
required_columns = [
"sample",
"run_accession",
"instrument_platform",
"reads_type",
"fastq_1",
"fastq_2",
"fasta",
"kraken2_report",
"kraken2_classifiedout",
"centrifuge_out",
"centrifuge_result",
"diamond",
]
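    # Hypothetical example (not taken from the repository) of a samplesheet that
    # satisfies the strict, order-sensitive header check below. Every column must
    # be present and in this exact order; here the `fasta` field is left empty,
    # as it would be for Illumina data:
    #
    #   sample,run_accession,instrument_platform,reads_type,fastq_1,fastq_2,fasta,kraken2_report,kraken2_classifiedout,centrifuge_out,centrifuge_result,diamond
    #   sample1,run1,ILLUMINA,shortread,s1_R1.fastq.gz,s1_R2.fastq.gz,,s1.kraken2.report.txt,s1.kraken2.classifiedreads.txt,s1.centrifuge.output.txt,s1.centrifuge.results.txt,s1.diamond.csv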
# See https://docs.python.org/3.9/library/csv.html#id3 to read up on `newline=""`.
with file_in.open(newline="") as in_handle:
reader = csv.DictReader(in_handle, dialect=sniff_format(in_handle))
# Validate the existence of the expected header columns.
if not required_columns.issubset(reader.fieldnames):
if not required_columns == reader.fieldnames:
req_cols = ", ".join(required_columns)
logger.critical(f"The sample sheet **must** contain these column headers: {req_cols}.")
sys.exit(1)
@@ -202,7 +229,7 @@ def check_samplesheet(file_in, file_out):
sys.exit(1)
checker.validate_unique_samples()
header = list(reader.fieldnames)
header.insert(1, "single_end")
# header.insert(1, "single_end")
# See https://docs.python.org/3.9/library/csv.html#id3 to read up on `newline=""`.
with file_out.open(mode="w", newline="") as out_handle:
writer = csv.DictWriter(out_handle, header, delimiter=",")
