Nextflow pipeline for the analysis of metagenomic short reads.

TOFU-MAaPO

Taxonomic Or FUnctional Metagenomic Assembly and PrOfiling = TOFU-MAaPO 

TOFU-MAaPO is a Nextflow pipeline designed for the analysis of metagenomic short reads.

It provides comprehensive functionalities for:

  • Quality control
  • Taxonomic profiling and microbial abundance estimation
  • Metabolic pathway analysis
  • Assembly of metagenome-assembled genomes (MAGs)

The pipeline is compatible with any Linux system and requires only two dependencies:

  • Nextflow (workflow manager)
  • Singularity (as the container engine)

No software installation step is needed: Nextflow automatically downloads all necessary containers and tools.

Pipeline Structure

Overview of TOFU-MAaPO 1.5.0

Key features

Input data

TOFU-MAaPO accepts the following types of input:

  • Single- or paired-end metagenomic shotgun sequencing FASTQ files
  • A CSV file listing samples and their associated FASTQ files
  • Direct download of sequencing data from SRA using project, sample or run IDs

Database management

The pipeline can download and install the required databases for GTDB-Tk, MetaPhlAn and HUMAnN. Refer to the usage documentation for more details.

The following tools require their databases to be created or downloaded manually:

Quality control and preprocessing

The quality control includes:

  • Pre-QC quality assessment with FastQC
  • Read trimming with BBTools or fastp
  • PhiX and artifact removal with BBTools
  • Optional host decontamination with Bowtie2
  • Post-QC quality assessment with FastQC
  • MultiQC report

Downstream analysis

TOFU-MAaPO can perform the following analyses:

Taxonomic Profiling

Generate taxonomic abundance profiles for your samples with:

  • MetaPhlAn4
  • Sylph
  • Salmon and/or
  • Kraken2 (with optional Bracken)

Metabolic gene/pathway analysis

HUMAnN (v3.6) is used to identify microbial metabolic genes and pathways.

Genome assembly

Assembly

Reads are assembled into contigs using MEGAHIT (for individual samples, grouped samples, or combined samples).
Contigs are catalogued and indexed using Minimap2.

Binning

Binning is performed with up to five tools:

  • MetaBAT2
  • CONCOCT
  • MaxBin2
  • SemiBin2 and/or
  • VAMB

Bin refinement

The bins are refined and merged where appropriate using MAGScoT: Single-copy microbial marker genes from the Genome Taxonomy Database (GTDB) are used to profile bins. Hybrid candidate bins are created by comparing marker gene profiles across different binning algorithms (based on user-defined thresholds).

Annotation and Quality Check

The following steps are performed on all refined bins:

  • Taxonomic annotation with GTDB-Tk
  • Quality assessment with CheckM
  • Coverage analysis

Quick start

Prerequisites:

TOFU-MAaPO requires significant computational resources. Ensure your system meets the following minimum requirements:

  • CPU: At least 16 cores.
  • RAM: At least 128 GB (e.g., SemiBin2 may require up to 200 GB, and GTDB-Tk up to 100 GB).

For large datasets, it is recommended to run the pipeline on a high-performance computing (HPC) system.
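
A quick pre-flight check can confirm that a machine reaches these minimums before a run is started. The sketch below is not part of TOFU-MAaPO: the function names `meets_minimum` and `total_mem_gb` are hypothetical, the default thresholds are simply the values quoted above, and it assumes a Linux system that provides `nproc` and `/proc/meminfo`.

```shell
#!/usr/bin/env bash
# Pre-flight resource check (illustrative helper, not part of TOFU-MAaPO).

# meets_minimum CORES MEM_GB [MIN_CORES] [MIN_MEM_GB]
# Returns 0 if both values reach the minimums (defaults: 16 cores, 128 GB).
meets_minimum() {
    local cores=$1 mem_gb=$2 min_cores=${3:-16} min_mem_gb=${4:-128}
    [ "$cores" -ge "$min_cores" ] && [ "$mem_gb" -ge "$min_mem_gb" ]
}

# Total RAM in whole GB, read from /proc/meminfo (which reports kB).
total_mem_gb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
}

# Only run the check where the required tools are available.
if [ -r /proc/meminfo ] && command -v nproc >/dev/null 2>&1; then
    if meets_minimum "$(nproc)" "$(total_mem_gb)"; then
        echo "OK: this machine meets the stated minimum requirements."
    else
        echo "Warning: fewer than 16 cores or less than 128 GB RAM detected."
    fi
fi
```

The kernel reports slightly less memory than is physically installed, so a warning on a machine sold as "128 GB" should be treated as advisory rather than as a hard failure.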

Installing dependencies

Step 1: Install Nextflow

Nextflow requires Java. We recommend using SDKMAN for easy Java installation:

# Install SDKMAN
curl -s https://get.sdkman.io | bash
# Load SDKMAN into the current shell (or open a new terminal)
source "$HOME/.sdkman/bin/sdkman-init.sh"
# Install Java Temurin with SDKMAN (other Java versions might cause bugs)
sdk install java 17.0.10-tem
# Confirm that Java 17.0.10 (Temurin) is the active version
java -version
# If another Java version is shown, create and activate an SDKMAN environment in the directory where you want to run the pipeline
sdk env init
sdk env

To install and test Nextflow:

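# Install Nextflow in your current directory:
curl -s https://get.nextflow.io | bash
# Make Nextflow executable:
chmod +x nextflow
# Optionally move the launcher to a directory on your PATH so it can be called as `nextflow`
# Try a simple Nextflow demo from the current directory:
./nextflow run hello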

Step 2: Install Singularity (Apptainer)

You can install Singularity via conda:

# Create a new conda environment for Singularity
conda create --name sing_env -c conda-forge -c bioconda singularity=3.8 
# Activate environment
conda activate sing_env
# Check whether Singularity has been successfully installed
singularity --version
# Also make sure you can run an example container
singularity run library://sylabsed/examples/lolcow

Downloading TOFU-MAaPO

Use the following command to download or update the pipeline:

nextflow pull ikmb/TOFU-MAaPO

You will find the pipeline code stored in ${HOME}/.nextflow/assets/ikmb/TOFU-MAaPO.

Configuration

Quickstart profile

TOFU-MAaPO includes a pre-configured quickstart profile for local testing:

  • Cores: Limited to 4 per process.
  • RAM: Limited to 32 GB.
  • Directory: Designed to run in the user's home directory.

Note: The quickstart profile is not recommended for analyzing real metagenomic data.

Custom configuration

To fully utilize TOFU-MAaPO on an HPC or other systems, you must create a custom configuration file specifying:

  • Available CPU cores and memory.
  • Scheduler settings (e.g., local or SLURM).
  • Paths for reference databases.

Refer to the installation and configuration documentation for details.
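
As a rough starting point, the three items above map onto standard Nextflow configuration scopes. The sketch below writes a minimal `tofu.config` from the shell. It uses only generic Nextflow settings (`params`, `process`, `executor`, `singularity`); all paths, the partition name and the queue size are placeholders, and the database parameter names are assumed to mirror the command-line flags used in the examples of this README. The pipeline's own settings for CPU and memory limits are not shown here and should be taken from the installation and configuration documentation.

```shell
#!/usr/bin/env bash
# Write a minimal custom Nextflow config (all values are placeholders).
cat > tofu.config <<'EOF'
// Database locations; parameter names mirror the command-line flags
// (--metaphlan_db, --humann_db, --gtdbtk_reference).
params {
    metaphlan_db     = '/path/to/store/metaphlan/db'
    humann_db        = '/path/to/store/humann/db'
    gtdbtk_reference = '/path/to/download/gtdbtk_db/to'
}

// Scheduler: 'local' for a single machine, 'slurm' for a SLURM cluster.
process {
    executor = 'slurm'
    queue    = 'your_partition'   // hypothetical partition name
}

// Cap how many jobs Nextflow submits at once.
executor {
    queueSize = 20
}

// Run all tools in Singularity containers.
singularity {
    enabled    = true
    autoMounts = true
    cacheDir   = '/path/to/singularity/cache'   // placeholder
}

// CPU and memory limits: see the installation and configuration
// documentation for the pipeline's own settings.
EOF
echo "Wrote tofu.config"
```

The resulting file is passed to the pipeline with `-c tofu.config`, as in the example workflows.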

Example workflows:

Running quality control

TOFU-MAaPO offers the following input options:

  • FASTQ (.fastq.gz) files: Single or paired-end reads stored locally.
  • SRA IDs: Run, sample, or project IDs (comma-separated).

With Local FASTQ Files

  1. Create your working directory and download an example dataset:
mkdir -p ${HOME}/tofu-quickstart && cd ${HOME}/tofu-quickstart
wget https://ibdmdb.org/downloads/raw/HMP2/MGX/2018-05-04/PSM6XBR1.tar
tar -xvf PSM6XBR1.tar && rm PSM6XBR1.tar
  2. Run the pipeline for quality control:
nextflow run ikmb/TOFU-MAaPO \
    -profile quickstart \
    --reads '*_R{1,2}.fastq.gz' \
    --cleanreads \
    --outdir results

The --cleanreads flag copies quality controlled FASTQ files to the results directory.
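
Because the `--reads` pattern above matches files by their `_R1`/`_R2` suffix, it can help to check beforehand that every forward read file has a partner. This is a minimal sketch, assuming the `_R1.fastq.gz` / `_R2.fastq.gz` naming used in the example; the function name `unpaired_reads` is hypothetical and not part of the pipeline.

```shell
#!/usr/bin/env bash
# List R1 files that lack a matching R2 file (illustrative helper).

# unpaired_reads DIR
# Prints each *_R1.fastq.gz in DIR whose *_R2.fastq.gz partner is missing.
unpaired_reads() {
    local dir=$1 r1 r2
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue          # glob matched nothing
        r2=${r1%_R1.fastq.gz}_R2.fastq.gz
        [ -e "$r2" ] || echo "$r1"
    done
}

# Check the current working directory; no output means all files are paired.
unpaired_reads .
```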

With SRA IDs

  1. Obtain your personal NCBI API key:
    Go to NCBI -> Account -> Account Settings -> API Key Management.
  2. Run the pipeline using an SRA ID:
nextflow run ikmb/TOFU-MAaPO \
    -profile quickstart \
    --sra 'SRX3105436' \
    --apikey YOUR_NCBI_API_KEY \
    --cleanreads \
    --outdir results
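
Since several SRA IDs are passed as one comma-separated value, a longer list is easier to maintain in a text file. The helper below is a sketch, not a pipeline feature: `join_ids`, the file name `sra_ids.txt` and the shell variable `NCBI_API_KEY` are all hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# join_ids FILE
# Reads one accession per line, skips blank lines and '#' comments,
# and prints the IDs as a single comma-separated string.
join_ids() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" | paste -sd, -
}

# Example use (file and variable names are placeholders):
#   nextflow run ikmb/TOFU-MAaPO \
#       -profile quickstart \
#       --sra "$(join_ids sra_ids.txt)" \
#       --apikey "$NCBI_API_KEY" \
#       --cleanreads \
#       --outdir results
```

Keeping the API key in a shell variable rather than typing it into the command also keeps it out of copied command lines.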

Running metabolic gene/pathway estimation with HUMAnN

In the first run, include the following flags to download required databases and run quality control and HUMAnN:

nextflow run ikmb/TOFU-MAaPO \
    -profile custom \
    -c tofu.config \
    --reads '*_R{1,2}.fastq.gz' \
    --humann \
    --updatehumann \
    --updatemetaphlan \
    --metaphlan_db '/path/to/store/metaphlan/db' \
    --humann_db '/path/to/store/humann/db' \
    --outdir results

In subsequent runs, exclude the database update flags --updatehumann and --updatemetaphlan:

nextflow run ikmb/TOFU-MAaPO \
    -profile custom \
    -c tofu.config \
    --reads '*_R{1,2}.fastq.gz' \
    --humann \
    --metaphlan_db '/path/to/store/metaphlan/db' \
    --humann_db '/path/to/store/humann/db' \
    --outdir results

Hint: The database paths can also be set in the config file, so that they no longer need to be passed on the command line.
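
The first-run versus later-run distinction can be handled by a small wrapper that adds an update flag only while the database directory is still missing or empty. This is a heuristic sketch under that assumption: the function name `update_flag_if_empty` and the variables `HUMANN_DB` and `METAPHLAN_DB` are hypothetical, and a non-empty directory left behind by an interrupted download would wrongly count as installed.

```shell
#!/usr/bin/env bash
# update_flag_if_empty DIR FLAG
# Prints FLAG when DIR is missing or empty, so the database download
# is requested only on the first run. Prints nothing otherwise.
update_flag_if_empty() {
    local dir=$1 flag=$2
    if [ ! -d "$dir" ] || [ -z "$(ls -A "$dir" 2>/dev/null)" ]; then
        echo "$flag"
    fi
}

# Example use (paths are placeholders; the command substitutions are left
# unquoted on purpose so that an empty result adds no argument):
#   HUMANN_DB=/path/to/store/humann/db
#   METAPHLAN_DB=/path/to/store/metaphlan/db
#   nextflow run ikmb/TOFU-MAaPO \
#       -profile custom \
#       -c tofu.config \
#       --reads '*_R{1,2}.fastq.gz' \
#       --humann \
#       $(update_flag_if_empty "$HUMANN_DB" --updatehumann) \
#       $(update_flag_if_empty "$METAPHLAN_DB" --updatemetaphlan) \
#       --metaphlan_db "$METAPHLAN_DB" \
#       --humann_db "$HUMANN_DB" \
#       --outdir results
```

The same helper works for the `--updategtdbtk` flag used in the assembly workflow.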

Running metagenome assembly

  1. In the first run, include the flag --updategtdbtk for the initial database setup:
nextflow run ikmb/TOFU-MAaPO \
    -profile custom \
    -c tofu.config \
    --reads '*_R{1,2}.fastq.gz' \
    --assembly \
    --updategtdbtk \
    --gtdbtk_reference '/path/to/download/gtdbtk_db/to' \
    --outdir results
  2. For subsequent runs, exclude the database update flag:
nextflow run ikmb/TOFU-MAaPO \
    -profile custom \
    -c tofu.config \
    --reads '*_R{1,2}.fastq.gz' \
    --assembly \
    --gtdbtk_reference '/path/to/download/gtdbtk_db/to' \
    --outdir results

Running taxonomic abundance estimation with MetaPhlAn

  1. In the first run, add the --updatemetaphlan flag to download the required databases:
nextflow run ikmb/TOFU-MAaPO \
    -profile custom \
    -c tofu.config \
    --reads '*_R{1,2}.fastq.gz' \
    --metaphlan \
    --updatemetaphlan \
    --metaphlan_db '/path/to/store/metaphlan/db' \
    --outdir results
  2. In subsequent runs, omit the update flag:
nextflow run ikmb/TOFU-MAaPO \
    -profile custom \
    -c tofu.config \
    --reads '*_R{1,2}.fastq.gz' \
    --metaphlan \
    --metaphlan_db '/path/to/store/metaphlan/db' \
    --outdir results

For detailed usage options, refer to the usage documentation.

Documentation

All further documentation about the pipeline can be found in the docs/ directory or under the links below:

  1. Installation and configuration
  2. Add host genomes to TOFU-MAaPO
  3. Available options
  4. Outputs structure

Funding

The project was funded by the German Research Foundation (DFG) Research Unit 5042 - miTarget INF.