Command line options

SHANG Jiayu edited this page Dec 26, 2024 · 33 revisions

Commands

A command is given as a parameter on the command line and sets the task the program runs.

The help options list can be printed on the console via:

# help for general options
phabox2 --help

# help for specific options
phabox2 --task [task] -h

# Example:
phabox2 --task phamer -h
phabox2 --task phagcn -h

The options are also listed below for reference:

📚 General options

The following parameters are common when running phabox2:

--task
    Select a program to run:
    end_to_end    || Run phamer, phagcn, phatyp, phavip, and cherry once (default)
    phamer        || Virus identification
    phagcn        || Taxonomy classification
    phatyp        || Lifestyle prediction
    cherry        || Host prediction
    phavip        || Protein annotation
    contamination || Contamination/provirus detection
    votu          || vOTU grouping (ANI-based or AAI-based)
    tree          || Build phylogenetic trees based on marker genes


--dbdir
    Path of downloaded phabox2 database directory (required)

--outpth 
    Root path for the output folder (required)
    All results, including intermediate files and final predictions, are stored in this folder.

--contigs
    Path of the input FASTA file (required)

--proteins 
    FASTA file of predicted proteins. (optional)

--midfolder
    Name of the folder for intermediate files (optional)
    This folder is created inside --outpth to store intermediate files.

--len
    Minimum contig length || default: 3000
    Contigs shorter than this value are filtered out and not processed

--threads  
    Number of threads to use || default: all available threads
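
As a rough illustration of how the general options combine, the sketch below assembles an end_to_end call. The paths (phabox_db_v2, phabox_out, contigs.fa) and the thread count are placeholders chosen for this example, not values shipped with phabox2. The script only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Placeholder paths (assumptions for illustration); point them at your own data.
DBDIR="phabox_db_v2"    # downloaded phabox2 database directory
OUTDIR="phabox_out"     # root path for the output folder
CONTIGS="contigs.fa"    # input FASTA file

# Assemble the call from the general options documented above.
# --len 3000 is the documented default; --threads 8 is an arbitrary example value.
CMD="phabox2 --task end_to_end --dbdir $DBDIR --outpth $OUTDIR --contigs $CONTIGS --len 3000 --threads 8"

# Print the command for review; run it once the paths point at real files.
echo "$CMD"
```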

📚 Special options

Please note that the end_to_end task runs phamer, phagcn, cherry, phatyp, and phavip together, so each task's options can also be used with end_to_end.

In addition, contigs predicted as non-virus or predicted with low confidence are not passed on to the downstream taxonomy, host, and lifestyle prediction tasks.

The following parameters will be used in specific tasks:

📕 PhaMer (Virus identification)

Usage: phabox2 --task phamer [options]

In-task options:

--reject 
    Reject sequences in which the percentage of proteins aligned to known phages is smaller than this value.
    Default: 10
    Range from 0 to 20

If this proportion is too low, predictions in downstream analyses will be unreliable.
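
The snippet below is a small sketch of checking a --reject value against the documented range before building a phamer call. The helper in_range and the paths are hypothetical names introduced for this example; phabox2 may perform its own validation.

```shell
#!/bin/sh
# Hypothetical helper: succeed if integer $1 lies within [$2, $3].
in_range() {
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

REJECT=10   # documented default; documented range is 0 to 20

if in_range "$REJECT" 0 20; then
    # Placeholder paths (assumptions for illustration).
    CMD="phabox2 --task phamer --dbdir phabox_db_v2 --outpth phabox_out --contigs contigs.fa --reject $REJECT"
    echo "$CMD"
else
    echo "--reject must be between 0 and 20, got $REJECT" >&2
fi
```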

📕 PhaGCN (Taxonomy)

Usage: phabox2 --task phagcn [options]

In-task options:

The options below are used to generate a network of virus-virus connections. The default parameters are optimized for ICTV 2024 and are highly accurate for grouping genus-level vOTUs. Change them only if you fully understand their effect.

--aai
    Average amino acid identity || default: 75 || range from 0 to 100

--share
    Minimum shared number of proteins || default: 15 || range from 0 to 100

--pcov
    Protein-based coverage || default: 80 || range from 0 to 100

--draw
    Draw network examples for the query virus relationship. || default: N || Y or N

--draw plots sub-networks containing the query virus. We use it to generate visualizations for our web server. However, it only plots the 10 largest sub-networks, so we do not recommend it for general use. The complete network is provided for visualization (network_edges.tsv and network_nodes.tsv files); please check it out via: here
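
For reference, a phagcn call that spells out the documented network defaults might look like the sketch below. The paths are placeholders assumed for this example; the threshold values are the defaults listed above.

```shell
#!/bin/sh
# Documented defaults for the virus-virus network.
AAI=75      # average amino acid identity (0 to 100)
SHARE=15    # minimum number of shared proteins
PCOV=80     # protein-based coverage (0 to 100)

# Placeholder paths (assumptions for illustration).
CMD="phabox2 --task phagcn --dbdir phabox_db_v2 --outpth phabox_out --contigs contigs.fa --aai $AAI --share $SHARE --pcov $PCOV"

# Print the command for review before running it.
echo "$CMD"
```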

📕 CHERRY (Host)

Usage: phabox2 --task cherry [options]

In-task options:

The options below are used to generate a network of virus-virus connections. The default parameters are optimized for ICTV 2024 and are highly accurate for grouping genus-level vOTUs. Change them only if you fully understand their effect.

--aai
    Average amino acid identity || default: 75 || range from 0 to 100

--share
    Minimum shared number of proteins || default: 15 || range from 0 to 100

--pcov
    Protein-based coverage || default: 80 || range from 0 to 100

--draw
    Draw network examples for the query virus relationship. || default: N || Y or N

--draw plots sub-networks containing the query virus. We use it to generate visualizations for our web server. However, it only plots the 10 largest sub-networks, so we do not recommend it for general use. The complete network is provided for visualization (network_edges.tsv and network_nodes.tsv files); please check it out via: here

The options below are used to predict CRISPRs based on MAGs.

--bfolder
    Path to the folder that contains MAGs || default: None

The options below are used to align contigs to CRISPRs.

--cpident
    Alignment identity for CRISPRs || default: 90 || range from 90 to 100

--ccov
    Alignment coverage for CRISPRs || default: 90 || range from 0 to 100

--blast
    BLAST program for CRISPRs || default: blastn || blastn or blastn-short
    blastn-short gives more sensitive results but takes longer to run

The default parameters are optimized for predicting prokaryotic hosts of viruses with 98% accuracy (data from the NCBI RefSeq database). Change them only if you fully understand their effect.

--magonly
    Only predict hosts based on the provided MAGs || default: N || Y or N
    Y will only predict the host based on the provided MAGs
    N will predict the host based on the MAGs and the reference database
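
The CRISPR and MAG options above can be combined in one cherry call. The sketch below assumes a hypothetical folder my_mags containing MAGs and placeholder database, output, and contig paths; the chosen values (--cpident 95, --blast blastn-short, --magonly Y) are examples within the documented ranges, not recommendations.

```shell
#!/bin/sh
# Hypothetical folder of MAGs (assumption for illustration).
MAGS="my_mags"

# CRISPR alignment settings within the documented ranges.
CPIDENT=95              # identity, documented range 90 to 100
CCOV=90                 # coverage, documented default
BLAST="blastn-short"    # more sensitive but slower than blastn

# Placeholder paths (assumptions for illustration).
# --magonly Y restricts host prediction to the provided MAGs.
CMD="phabox2 --task cherry --dbdir phabox_db_v2 --outpth phabox_out --contigs contigs.fa --bfolder $MAGS --cpident $CPIDENT --ccov $CCOV --blast $BLAST --magonly Y"

echo "$CMD"
```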

📕 PhaTYP (Lifestyle)

Usage: phabox2 --task phatyp [options]

In-task options:

There are no additional options for lifestyle prediction; only the general options apply.

📕 PhaVIP (Annotation)

Please note that running the end_to_end, phamer, phagcn, phatyp, or cherry task automatically runs phavip. The output files are the same.

Usage: phabox2 --task phavip [options]

📕 End to End (run the above-mentioned tools once)

Usage: phabox2 --task end_to_end [options]

In-task options:

The end-to-end task can skip PhaMer (virus identification). If the input already consists of viral contigs, run the end-to-end task with --skip Y to skip virus identification.

--skip  
    Whether to skip virus identification (PhaMer) || default: N || Y or N

However, please note that the default is --skip N. With --skip N, if PhaMer detects no viruses, the end-to-end task stops the downstream pipelines and writes a log message saying so.
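
As a sketch of the --skip Y case, the call below assumes the input file already contains only viral contigs. The file name viral_contigs.fa and the other paths are placeholders introduced for this example.

```shell
#!/bin/sh
# Hypothetical input that already contains only viral contigs.
CONTIGS="viral_contigs.fa"

# Placeholder database and output paths (assumptions for illustration).
# --skip Y bypasses PhaMer; the documented default is --skip N.
CMD="phabox2 --task end_to_end --dbdir phabox_db_v2 --outpth phabox_out --contigs $CONTIGS --skip Y"

echo "$CMD"
```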

📗 Contamination

Usage: phabox2 --task contamination [options]

In-task options:

--sensitive  
    Use sensitive mode when searching for prokaryotic genes || default: N || Y or N
    Y gives more sensitive results but takes longer to run
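
A contamination run with the sensitive search turned on could be written as in the sketch below; the paths are placeholders assumed for this example.

```shell
#!/bin/sh
# Placeholder paths (assumptions for illustration).
# --sensitive Y trades run time for a more sensitive prokaryotic gene search.
CMD="phabox2 --task contamination --dbdir phabox_db_v2 --outpth phabox_out --contigs contigs.fa --sensitive Y"

echo "$CMD"
```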

📘 vOTU grouping

Usage: phabox2 --task votu [options]

In-task options:

--mode  
    Clustering mode: ANI-based or AAI-based || default: ANI || ANI or AAI

AAI-based options:

--aai  
    Average amino acid identity for AAI-based genus grouping || default: 75 || range from 0 to 100

--pcov  
    Protein-level coverage for AAI-based genus grouping || default: 80 || range from 0 to 100

--share  
    Minimum number of shared proteins for AAI-based genus grouping || default: 15 || range from 0 to 100

ANI-based options:

--ani
    Alignment identity for ANI-based clustering  || default: 95 || range from 0 to 100

--tcov
    Alignment coverage for ANI-based clustering || default: 85 || range from 0 to 100
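
Because each mode has its own thresholds, it can help to keep the two option sets apart. The sketch below uses a hypothetical helper, votu_cmd, that prints a votu call with the documented defaults for the requested mode; the paths are placeholders assumed for this example.

```shell
#!/bin/sh
# Hypothetical helper: print a votu call for the given mode (ANI or AAI)
# using the documented default thresholds for that mode.
votu_cmd() {
    # Placeholder paths (assumptions for illustration).
    base="phabox2 --task votu --dbdir phabox_db_v2 --outpth phabox_out --contigs contigs.fa"
    case "$1" in
        ANI) echo "$base --mode ANI --ani 95 --tcov 85" ;;
        AAI) echo "$base --mode AAI --aai 75 --pcov 80 --share 15" ;;
        *)   echo "unknown mode: $1 (expected ANI or AAI)" >&2; return 1 ;;
    esac
}

votu_cmd ANI
votu_cmd AAI
```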

📙 Phylogenetic tree

Usage: phabox2 --task tree [options]

In-task options:

--marker
    List of markers used to generate the tree || default: terl portal
    You can choose one or more of the markers below:
    
    The marker genes were obtained from RefSeq 2024:
        endolysin      || 91% of prokaryotic viruses have endolysin
        holin          || 75% of prokaryotic viruses have holin
        head           || 77% of prokaryotic viruses have a major head protein
        portal         || 84% of prokaryotic viruses have portal
        terl           || 92% of prokaryotic viruses have terminase large subunit

        Combining these markers can improve the accuracy of the tree
        but decreases the number of sequences in the tree.


--mcov
    Alignment coverage for matching marker genes || default: 50 || range from 0 to 100

--mpident
    Alignment identity for matching marker genes || default: 25 || range from 0 to 100

--msa
    Whether to run multiple sequence alignment (MSA) || default: N || Y or N
    Y runs MSA on the marker genes using MAFFT,
    which takes more time

--tree
    Whether to build a tree || default: N || Y or N
    Y generates the tree from the marker genes using FastTree,
    which takes more time
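
Putting the tree options together, a call that aligns two markers and builds a tree might look like the sketch below. It assumes multiple markers are passed space-separated after --marker, as the default "terl portal" suggests; check phabox2 --task tree -h to confirm. The paths are placeholders.

```shell
#!/bin/sh
# Markers to use; space-separated form is an assumption based on the
# documented default "terl portal".
MARKERS="terl portal"

# Placeholder paths (assumptions for illustration).
# --mcov 50 and --mpident 25 are the documented defaults;
# --msa Y and --tree Y enable the MAFFT alignment and FastTree steps.
CMD="phabox2 --task tree --dbdir phabox_db_v2 --outpth phabox_out --contigs contigs.fa --marker $MARKERS --mcov 50 --mpident 25 --msa Y --tree Y"

echo "$CMD"
```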