Taxonomic Or FUnctional Metagenomic Assembly and PrOfiling = TOFU-MAaPO
TOFU-MAaPO is a Nextflow pipeline designed for the analysis of metagenomic short reads.
It provides comprehensive functionalities for:
- Quality control
- Taxonomic profiling and microbial abundance estimation
- Metabolic pathway analysis
- Assembly of metagenome-assembled genomes (MAGs)
The pipeline is compatible with any Linux system and requires only two dependencies:
- Nextflow (workflow manager)
- Singularity (as the container engine)
No software installation step is needed — Nextflow automatically downloads all necessary containers and tools.
TOFU-MAaPO accepts the following types of input:
- Single- or paired-end metagenomic shotgun sequencing FASTQ files
- A CSV file listing samples and their associated FASTQ files
- Direct download of sequencing data from SRA using project, sample or run IDs
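As a rough sketch of the CSV option, a sample sheet could be generated from a directory of FASTQ files as shown here. The header `id,read1,read2` and the `_R1`/`_R2` file suffixes are assumptions made for illustration; the usage documentation states the columns the pipeline actually expects.

```shell
# Write a sample sheet with one row per *_R1.fastq.gz file in a directory.
# ASSUMPTION: the header "id,read1,read2" is illustrative only.
make_samplesheet() (
    dir=$1
    echo "id,read1,read2"
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue              # the glob matched nothing
        r2=${r1%_R1.fastq.gz}_R2.fastq.gz
        id=$(basename "$r1" _R1.fastq.gz)
        [ -e "$r2" ] || r2=""                 # single-end sample: leave read2 empty
        echo "$id,$r1,$r2"
    done
)

# Usage: make_samplesheet /path/to/fastq > samples.csv
```

The function runs in a subshell, so its variables do not leak into the calling shell.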
The pipeline can download and install the required databases for GTDB-Tk, MetaPhlAn and HUMAnN. Refer to the usage documentation for more details.
The following tools require manual creation or download of their databases:
The quality control includes:
- PreQC quality assessment with FastQC
- Read trimming with BBtools or fastp
- PhiX and artifact removal with BBtools
- Optional host decontamination with Bowtie2
- PostQC quality assessment with FastQC
- MultiQC Report
TOFU-MAaPO can perform the following analyses:
Generate taxonomic abundance profiles for your samples with:
- MetaPhlAn4
- Sylph
- Salmon and/or
- Kraken2 (with optional Bracken)
HUMAnN (v3.6) is used to identify microbial metabolic genes and pathways.
Reads are assembled into contigs using MEGAHIT (for individual samples, grouped samples, or combined samples).
Contigs are catalogued and indexed using Minimap2.
Binning is performed with up to five tools:
- MetaBAT2
- CONCOCT
- MaxBin2
- SemiBin2 and/or
- VAMB
The bins are refined and merged where appropriate using MAGScoT: Single-copy microbial marker genes from the Genome Taxonomy Database (GTDB) are used to profile bins. Hybrid candidate bins are created by comparing marker gene profiles across different binning algorithms (based on user-defined thresholds).
The following steps are performed on all refined bins:
- Taxonomic annotation with GTDB-Tk
- Quality assessment with CheckM
- Coverage analysis
TOFU-MAaPO requires significant computational resources. Ensure your system meets the following minimum requirements:
- CPU: At least 16 cores.
- RAM: At least 128 GB (for example, SemiBin2 may require up to 200 GB and GTDB-Tk up to 100 GB).
For large datasets, it is recommended to run the pipeline on a high-performance computing (HPC) system.
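The stated minimums can be compared against the current machine with a short check like the one below. This is a minimal sketch that assumes a Linux host with `nproc` and `/proc/meminfo`; the helper names `meets_minimum` and `report` are hypothetical.

```shell
MIN_CORES=16
MIN_RAM_GB=128

# Succeeds if the available value ($1) is at least the required value ($2)
meets_minimum() {
    [ "$1" -ge "$2" ]
}

# Print one report line: label, measured value, unit, required minimum
report() {
    if meets_minimum "$2" "$4"; then
        echo "$1: $2 $3 (ok)"
    else
        echo "$1: $2 $3 (below the stated minimum of $4)"
    fi
}

cores=$(nproc 2>/dev/null || echo 0)
# MemTotal is given in kB; convert to whole GB, rounding down
ram_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo 2>/dev/null || echo 0)

report "CPU" "$cores" "cores" "$MIN_CORES"
report "RAM" "${ram_gb:-0}" "GB" "$MIN_RAM_GB"
```

`MemTotal` excludes memory reserved by the kernel, so a machine with 128 GB installed may report slightly less and be flagged here. On an HPC system the relevant numbers are those of the compute nodes, not the login node.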
Nextflow requires Java. We recommend using SDKMAN for easy Java installation:
# Install SDKMAN
curl -s https://get.sdkman.io | bash
# Load SDKMAN into the current shell (or open a new terminal instead)
source "$HOME/.sdkman/bin/sdkman-init.sh"
# Install Java Temurin with SDKMAN (other Java versions might cause bugs)
sdk install java 17.0.10-tem
# Confirm that Java 17.0.10 (Temurin) is the active version
java -version
# If another Java version is shown, create and activate an SDKMAN environment in the directory where you will run the Nextflow pipeline
sdk env init
sdk env
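Because other Java versions might cause bugs, it can help to check the active major version in a script before launching the pipeline. The helper `java_major` below is a hypothetical name, and the sketch assumes the usual `version "x.y.z"` format on the first line of `java -version`.

```shell
# Extract the major Java version from the output of `java -version`.
# Handles modern ("17.0.10") and legacy ("1.8.0_392") version strings.
java_major() (
    ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    major=${ver%%.*}
    if [ "$major" = "1" ]; then
        rest=${ver#1.}                # legacy scheme: 1.8.x means Java 8
        major=${rest%%.*}
    fi
    echo "${major%%[!0-9]*}"          # drop suffixes such as "-ea"
)

# java prints its version to stderr, hence the 2>&1
if command -v java >/dev/null 2>&1; then
    echo "Java major version: $(java_major "$(java -version 2>&1)")"
else
    echo "java not found on PATH"
fi
```

With the installation above, the reported major version should be 17.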
To install and test Nextflow:
# Install Nextflow in your current directory:
curl -s https://get.nextflow.io | bash
# Make Nextflow executable:
chmod +x nextflow
# Try a simple Nextflow demo (call ./nextflow instead if the binary is not on your PATH)
nextflow run hello
You can install Singularity via:
- the Singularity Quickstart Guide or
- Conda (no sudo rights required):
# Create a new conda environment for Singularity
conda create --name sing_env -c conda-forge -c bioconda singularity=3.8
# Activate environment
conda activate sing_env
# Check whether Singularity has been successfully installed
singularity --version
# Also make sure you can run an example container
singularity run library://sylabsed/examples/lolcow
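Once all three prerequisites are installed, a small pre-flight check can confirm that they are reachable from the current shell. This is only a sketch; `missing_tools` is a hypothetical helper name.

```shell
# Print every command from the argument list that is not found on PATH.
# Returns non-zero if at least one command is missing.
missing_tools() {
    mt_status=0
    for mt_tool in "$@"; do
        if ! command -v "$mt_tool" >/dev/null 2>&1; then
            echo "$mt_tool"
            mt_status=1
        fi
    done
    return "$mt_status"
}

if missing=$(missing_tools java nextflow singularity); then
    echo "All prerequisites found"
else
    echo "Missing from PATH:" $missing
fi
```

If Nextflow was installed into the current directory and not moved to a directory on `PATH`, it is reported as missing here even though `./nextflow` works.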
Use the following command to download or update the pipeline:
nextflow pull ikmb/TOFU-MAaPO
The pipeline code is stored in ${HOME}/.nextflow/assets/ikmb/TOFU-MAaPO.
TOFU-MAaPO includes a pre-configured quickstart profile for local testing:
- Cores: Limited to 4 per process.
- RAM: Limited to 32 GB.
- Directory: Designed to run in the user's home directory.
Note: The quickstart profile is not recommended for analyzing real metagenomic data.
To fully utilize TOFU-MAaPO on an HPC or other systems, you must create a custom configuration file specifying:
- Available CPU cores and memory.
- Scheduler settings (e.g., local or SLURM).
- Paths for reference databases.
Refer to the installation and configuration documentation for details.
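To give an idea of the scheduler part of such a file, the sketch below writes a few standard Nextflow settings (`process.executor`, `process.queue`, `executor.queueSize`) to a config file. The queue name and job limit are placeholders, and the function name is hypothetical; the settings the pipeline's custom profile actually reads are described in the installation and configuration documentation.

```shell
# Write a minimal Nextflow scheduler configuration to the file given as $1.
# ASSUMPTION: the queue name and queueSize are placeholders, not pipeline defaults.
write_scheduler_config() {
    cat > "$1" <<'EOF'
process {
    executor = 'slurm'      // use 'local' to run on a single machine
    queue    = 'all'        // hypothetical partition name
}

executor {
    queueSize = 50          // maximum number of jobs queued at once
}
EOF
}

# Usage (overwrites an existing file): write_scheduler_config tofu.config
```

The resulting file is passed to Nextflow with `-c tofu.config`, as in the commands further down.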
TOFU-MAaPO offers the following input options:
- FASTQ (.fastq.gz) files: Single or paired-end reads stored locally.
- SRA IDs: Run, sample, or project IDs (comma-separated).
- Create your working directory and download an example dataset:
mkdir -p ${HOME}/tofu-quickstart && cd ${HOME}/tofu-quickstart
wget https://ibdmdb.org/downloads/raw/HMP2/MGX/2018-05-04/PSM6XBR1.tar
tar -xvf PSM6XBR1.tar && rm PSM6XBR1.tar
- Run the pipeline for quality control:
nextflow run ikmb/TOFU-MAaPO \
-profile quickstart \
--reads '*_R{1,2}.fastq.gz' \
--cleanreads \
--outdir results
The --cleanreads flag copies the quality-controlled FASTQ files to the results directory.
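Before starting a paired-end run with the `*_R{1,2}.fastq.gz` pattern, it may be worth checking that every file has its mate. The following is a minimal sketch with a hypothetical helper name, assuming the `_R1`/`_R2` suffixes used in the command above.

```shell
# List files whose mate is missing, based on the _R1/_R2 naming convention.
unpaired_samples() (
    dir=$1
    for f in "$dir"/*_R1.fastq.gz "$dir"/*_R2.fastq.gz; do
        [ -e "$f" ] || continue               # the glob matched nothing
        case $f in
            *_R1.fastq.gz) mate=${f%_R1.fastq.gz}_R2.fastq.gz ;;
            *)             mate=${f%_R2.fastq.gz}_R1.fastq.gz ;;
        esac
        [ -e "$mate" ] || basename "$f"       # report files without a mate
    done
)

# Check the current working directory
unpaired_samples .
```

Empty output means every file found has a matching mate.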
- Obtain your personal NCBI API key:
Go to NCBI -> Account -> Account Settings -> API Key Management.
- Run the pipeline using an SRA ID:
nextflow run ikmb/TOFU-MAaPO \
-profile quickstart \
--sra 'SRX3105436' \
--apikey YOUR_NCBI_API_KEY \
--cleanreads \
--outdir results
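SRA accessions encode their type in the prefix, which makes it easy to check what kind of ID is being passed. The sketch below classifies an accession following the INSDC naming scheme; `sra_id_type` is a hypothetical helper, and which types the pipeline accepts is described in the usage documentation.

```shell
# Classify an accession by its prefix:
# SRR/ERR/DRR = run, SRX/ERX/DRX = experiment, SRS/ERS/DRS = sample,
# SRP/ERP/DRP = study, PRJNA/PRJEB/PRJDB = BioProject.
sra_id_type() {
    case $1 in
        [SED]RR[0-9]*) echo run ;;
        [SED]RX[0-9]*) echo experiment ;;
        [SED]RS[0-9]*) echo sample ;;
        [SED]RP[0-9]*) echo study ;;
        PRJNA[0-9]*|PRJEB[0-9]*|PRJDB[0-9]*) echo bioproject ;;
        *) echo unknown ;;
    esac
}

# Classify each ID in a comma-separated list
echo "SRX3105436" | tr ',' '\n' | while read -r id; do
    echo "$id: $(sra_id_type "$id")"
done
```

The ID from the command above, `SRX3105436`, is classified as an experiment accession by this scheme.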
In the first run, include the following flags to download required databases and run quality control and HUMAnN:
nextflow run ikmb/TOFU-MAaPO \
-profile custom \
-c tofu.config \
--reads '*_R{1,2}.fastq.gz' \
--humann \
--updatehumann \
--updatemetaphlan \
--metaphlan_db /path/to/store/metaphlan/db \
--humann_db '/path/to/store/humann/db' \
--outdir results
In subsequent runs, exclude the database update flags --updatehumann and --updatemetaphlan:
nextflow run ikmb/TOFU-MAaPO \
-profile custom \
-c tofu.config \
--reads '*_R{1,2}.fastq.gz' \
--humann \
--metaphlan_db /path/to/store/metaphlan/db \
--humann_db '/path/to/store/humann/db' \
--outdir results
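The difference between the first and later runs can be handled by a small wrapper that adds an update flag only when the database directory is missing or empty. This is a sketch under a loud assumption: a non-empty directory is treated as a complete database, which an interrupted download would violate. The helper name is hypothetical and the command is only printed, not executed.

```shell
# Print the update flag ($2) only if the database directory ($1) is missing or empty.
# ASSUMPTION: a non-empty directory is treated as a complete database.
update_flag_if_needed() {
    if [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]; then
        return 0                      # directory has content: emit no flag
    fi
    echo "$2"
}

METAPHLAN_DB=/path/to/store/metaphlan/db
HUMANN_DB=/path/to/store/humann/db

# Dry run: print the command instead of executing it
echo nextflow run ikmb/TOFU-MAaPO \
    -profile custom -c tofu.config \
    --reads "'*_R{1,2}.fastq.gz'" \
    --humann \
    $(update_flag_if_needed "$HUMANN_DB" --updatehumann) \
    $(update_flag_if_needed "$METAPHLAN_DB" --updatemetaphlan) \
    --metaphlan_db "$METAPHLAN_DB" \
    --humann_db "$HUMANN_DB" \
    --outdir results
```

Removing the leading `echo` (and the extra double quotes around the `--reads` pattern) turns the dry run into a real invocation.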
Hint: The database paths can also be set in the config file, so that you no longer need to pass them on the command line.
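Nextflow maps a command-line option such as `--metaphlan_db` to the parameter `params.metaphlan_db`, so the paths can be placed in a `params` block. The sketch below appends such a block to a config file; the function name is hypothetical, and whether the pipeline expects these entries inside a specific profile is not covered here.

```shell
# Append a params block with the database paths to a Nextflow config file.
# $1 = config file, $2 = MetaPhlAn database path, $3 = HUMAnN database path
append_db_params() {
    cat >> "$1" <<EOF
params {
    metaphlan_db = '$2'
    humann_db    = '$3'
}
EOF
}

# Usage:
# append_db_params tofu.config /path/to/store/metaphlan/db /path/to/store/humann/db
```

Values given on the command line still take precedence over those in the config file, so a single run can override a path without editing the file.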
- In the first run, include the flag --updategtdbtk for the initial database setup:
nextflow run ikmb/TOFU-MAaPO \
-profile custom \
-c tofu.config \
--reads '*_R{1,2}.fastq.gz' \
--assembly \
--updategtdbtk \
--gtdbtk_reference '/path/to/download/gtdbtk_db/to' \
--outdir results
- For subsequent runs, exclude the database update flag:
nextflow run ikmb/TOFU-MAaPO \
-profile custom \
-c tofu.config \
--reads '*_R{1,2}.fastq.gz' \
--assembly \
--gtdbtk_reference '/path/to/download/gtdbtk_db/to' \
--outdir results
- In your first run, add the --updatemetaphlan flag to download the required databases:
nextflow run ikmb/TOFU-MAaPO \
-profile custom \
-c tofu.config \
--reads '*_R{1,2}.fastq.gz' \
--metaphlan \
--updatemetaphlan \
--metaphlan_db '/path/to/store/metaphlan/db' \
--outdir results
- In subsequent runs, skip the update flag:
nextflow run ikmb/TOFU-MAaPO \
-profile custom \
-c tofu.config \
--reads '*_R{1,2}.fastq.gz' \
--metaphlan \
--metaphlan_db '/path/to/store/metaphlan/db' \
--outdir results
For detailed usage options, refer to the usage documentation.
All further documentation about the pipeline can be found in the docs/ directory or under the links below:
The project was funded by the German Research Foundation (DFG) Research Unit 5042 - miTarget INF.