diff --git a/README.md b/README.md index 0ec7d22..2e0a222 100644 --- a/README.md +++ b/README.md @@ -51,6 +51,7 @@ Alignment: - clustalo 1.2.4 (http://www.clustal.org/omega/) - mafft 7.464 (https://mafft.cbrc.jp/alignment/software/) +- MUSCLE 5.1 (https://drive5.com/muscle5/) Trimming: diff --git a/docs/conf.py b/docs/conf.py index 677a873..46e83ca 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -30,6 +30,7 @@ # NOTE: All the lines are after this are the theme-specific ones. These are # written as part of the site generation pipeline for this project. # !! MARKER !! + html_theme = "sphinx_rtd_theme" #html_theme = "furo" html_static_path = ['_static'] diff --git a/docs/getting-help/citation.rst b/docs/getting-help/citation.rst new file mode 100644 index 0000000..39a63eb --- /dev/null +++ b/docs/getting-help/citation.rst @@ -0,0 +1,5 @@ +.. _getting_help-citation:: + +==================== +Citing phylociraptor +==================== diff --git a/docs/getting-help/problems.rst b/docs/getting-help/problems.rst index 114212a..534c277 100644 --- a/docs/getting-help/problems.rst +++ b/docs/getting-help/problems.rst @@ -40,6 +40,12 @@ A possible way to solve this is to run this command before running phylociraptor This command disables the standard storage location of newly pulled singularity containers and uses a system wide tmp directory instead. +---------------------------------------------- +Singularity is not installed on my HPC Cluster +---------------------------------------------- + +Please get in touch with your local HPC support team and ask them to install Singularity for you. + ----------------------------------------------------- ImportError: Unable to import required dependencies: ----------------------------------------------------- @@ -60,9 +66,9 @@ Error running slurm prolog: This means that SLURM encountered a problem executing SLURM internal scripts prior to the job execution. There can be several different error codes associated with this problem and it may be necessary to contact your HPC support team to resolve what specifically goes wrong. In our experience it can already be enough to simply resubmit the job for the error to disappear. --------------------------- +------------------------------------------------------------------- When running phylociraptor I get the error IncompleteFilesException --------------------------- +------------------------------------------------------------------- An example of how such an error can look like is this: @@ -102,3 +108,14 @@ Several of my mafft alignment jobs fail --------------------------------------- We have encountered this when many long amino-acid sequences should be aligned. In such a case mafft can become quite memory hungry. When phylociraptor is run on a HPC cluster and mafft reaches the memory limit on a node the job will crash. A workaround is to add the flag :bash:`--memsave` to the mafft options. + +------------------------------------------ +My Maximum Likelihood trees don't complete +------------------------------------------ + +It is usually a good idea to check the respective log files in the :bash:`log` directory. On HPC systems with batch job submission there is usually a wall-time limit preventing jobs to run only for a maximum amount of time. +Especially with larg phylogenomic datasets this limit can be reached quite easily. The solution to this problem is specific to the cluster environment and you may have to contact your local HPC support team for advice on how to extend the wall time limit. + +On SLURM based systems another solution is to add the line :bash:`dependency: "singleton"` to the iqtree or raxmlng (depending on what you use) section in your cluster config file. Next you can run the phylociraptor mltree step several times. The singleton dependency setting will tell SLURM to create a chain of jobs that have the same name. Look `here https://slurm.schedmd.com/sbatch.html`_ for additional details. + + diff --git a/docs/getting-help/tips.rst b/docs/getting-help/tips.rst new file mode 100644 index 0000000..b1ff55b --- /dev/null +++ b/docs/getting-help/tips.rst @@ -0,0 +1,39 @@ +.. _getting_help-faqs:: + +=============== +Tips and Tricks +=============== + +On this page you can find short answers to frequently asked questions about phylociraptor. + +-------------------------- +What is phylociraptor? +-------------------------- + +Phylociraptor is a bioinformatics pipeline to create phylogenomic trees for a specified set of species using different alignment, trimming and tree reconstruction methods. For more info look `here <../introduction/about.html>`_ . + +----------------------------------- +Where can I download phylociraptor? +----------------------------------- + +Phylocirpator is freely available on GitHub. You can get it `here `_ . + +------------------------------------------------------------- +Why have some of the genomes I specified not been downloaded? +------------------------------------------------------------- + +A failed genome download can happen due to serveral reasons. phylociraptor will always check if the taxon id of the specified species and the available genomes match. However there are cases in which they will not: + +1. The species name was misspelled +2. The genome is deposited under a different name: If the genome was deposited under an infraspecific name say a strain oder subspecies, you have to provide the full name including infraspecific names. +3. There is a problem with the connection to NCBI. +4. No genome is available for the specified species. + +If some of the species you specified are not present, you can always look into the log/genome_download.log file to see what has happened. + +---------------------------------------------- +Controlling the number of threads in -t serial +---------------------------------------------- + +By default phylociraptor will use the maximum number of available threads when it is run in :bash:`-t serial` mode. If you would like to change this you can use :bash:`-t serial=4` to restrict it to use only four threads for eaxmple. This only applies to parts which alread use multithreading, such as the tree calculation steps. + diff --git a/docs/indepth/output.rst b/docs/indepth/output.rst new file mode 100644 index 0000000..705ba69 --- /dev/null +++ b/docs/indepth/output.rst @@ -0,0 +1,243 @@ + +.. role:: bash(code) + :language: bash + + +.. _BUSCO: https://busco-archive.ezlab.org/ +.. _YAML: https://en.wikipedia.org/wiki/YAML +.. _Augustus: http://bioinf.uni-greifswald.de/augustus/ +.. _mafft: https://mafft.cbrc.jp/alignment/server/ +.. _trimal: http://trimal.cgenomics.org/ +.. _raxml-ng: https://github.com/amkozlov/raxml-ng +.. _iqtree: http://www.iqtree.org/ +.. _astral: https://github.com/smirarab/ASTRAL +.. _NCBI Genome Browser: https://www.ncbi.nlm.nih.gov/genome/browse#!/overview/ +.. _biomartr: https://github.com/ropensci/biomartr + +==================================== +Understanding phylociraptor's output +==================================== + +Phylociraptor produces a lot of different output. The philsophy behind this is to make each step transpartent so that every file can be traced back to the original input files. Phylociraptor uses a pre-defined directory structure which is identical for every run. All results are contained in the :bash:`results` directory in the phylociraptor folder. + +This is the content of the results folder when every step of phylociraptor has finished successfully: + +.. code-block:: bash + + $ ls results/ + alignments assemblies checkpoints downloaded_genomes modeltest orthology phylogeny-0 phylogeny-60 phylogeny-70 report.html report_figure.pdy statistics + +We will go through every folder now in alphabethical order. + +---------- +alignments +---------- + +This folder will only show up in results after runnning :bash:`phylociraptor align`. + +.. code-block:: bash + + $ ls alignments/ + filtered full trimmed + +At the end of a phylociraptor run the folder alignments should include three folders. Each of them include another set of folders depending on which aligners have been used.Let's have a look into the full folder: + +.. code-block:: bash + + $ ls alignments/full + clustalo mafft muscle + +The folder :bash:`full` includes a folder for each aligner. These folders contain the actual full (untrimmed) alignments for each orthologous gene in FASTA format. + +The same logic applies to the :bash:`trimmed` and :bash:`filtered` folders, however the subfolders there are named slightly different to also incorporate the used trimmer: + +.. code-block:: bash + + $ ls alignments/trimmed + clustalo-aliscore clustalo-trimal mafft-aliscore mafft-trimal muscle-aliscore muscle-trimal + +As you can see we have six folders now. This is because phylocirpator will perform trimming for every aligner and trimmer combination and in this example we used three aligners (clustalo, mafft and muscle) and two trimmers (trimal, aliscore). + +---------- +assemblies +---------- + +This folder is created during the :bash:`phylociraptor setup` step. + +It contains symlinks to all input data used for a phylociraptor run. Regardless of its name, it may contain (a mixture of) genomes, transcriptomes and proteoms. + +.. code-block:: bash + + $ ls -l assemblies + total 2 + lrwxrwxrwx 1 reslp p71312 101 30. Mär 10:14 Amanita_muscaria.fasta.gz -> /binfl/lv71312/reslp/phylociraptor/results/downloaded_genomes/Amanita_muscaria_genomic_genbank.fna.gz + lrwxrwxrwx 1 reslp p71312 102 30. Mär 10:14 Neurospora_crassa.fasta.gz -> /binfl/lv71312/reslp/phylociraptor/results/downloaded_genomes/Neurospora_crassa_genomic_genbank.fna.gz + lrwxrwxrwx 1 reslp p71312 103 30. Mär 10:14 Tuber_melanosporum.fasta.gz -> /binfl/lv71312/reslp/phylociraptor/results/downloaded_genomes/Tuber_melanosporum_genomic_genbank.fna.gz + lrwxrwxrwx 1 reslp p71312 101 30. Mär 10:14 Usnea_hakonensis.fasta.gz -> /binfl/lv71312/reslp/phylociraptor/results/downloaded_genomes/Usnea_hakonensis_genomic_genbank.fna.gz + +As you can see from the listing above, in this case the directory contains symlinks to four genomes. You can also see the actual location of the files is in :bash:`results/downloaded_genomes` we will get to this folder later. Phylociraptor uses symlinks here to save disk-space will maintaining a consitent location and naming scheme for all files in the directory. + + +----------- +checkpoints +----------- + +This folder contains a set of files which help phylociraptor to keep track about which parts have been finished. Usually these files are irrelevant for the user. They are only for internal use of the pipeline. + +------------------ +downloaded_genomes +------------------ + +This folder will only appear when :bash:`phylociraptor setup` has finished. As mentioned above, it contains downloaded genomes from the NCBI database and additional information for each genome which will be displayed in the run report. + +.. code-block:: bash + + $ ls results/downloaded_genomes + Amanita_muscaria_db_genbank.tsv download_overview.txt not_downloaded.txt Tuber_melanosporum_genomic_genbank.fna.gz + Amanita_muscaria_genomic_genbank.fna.gz Neurospora_crassa_db_genbank.tsv successfully_downloaded_1.txt Usnea_hakonensis_db_genbank.tsv + assembly_summary_genbank.txt Neurospora_crassa_genomic_genbank.fna.gz successfully_downloaded.txt Usnea_hakonensis_genomic_genbank.fna.gz + download_overview_1.txt not_downloaded_1.txt Tuber_melanosporum_db_genbank.tsv + +As you can see there are several files in this folder. The fna.gz file is the gzipped genome assembly downloaded from NCBI. It is accompanied by a .tsv file which contains information about the assembly such was the accession number, the download URL, the Institution which uploaded the genome and more. The information in all .tsv files will be summarized in the HTML report. + +Additionally, the folder contains files which are for internal use in phylociraptor. They may only be relevant when debugging problems and are not directly relevant for the user. All these files end with .txt. + +--------- +modeltest +--------- + +This folder contains all results from modeltesting and gene-tree calculations. It will only appear after running :bash:`phylociraptor modeltest`. + +.. code-block:: bash + + $ ls results/modeltest + best_models_clustalo_aliscore.txt best_models_muscle_aliscore.txt genetree_filter_clustalo_aliscore.txt genetree_filter_muscle_aliscore.txt muscle-aliscore + best_models_clustalo_trimal.txt best_models_muscle_trimal.txt genetree_filter_clustalo_trimal.txt genetree_filter_muscle_trimal.txt muscle-trimal + best_models_mafft_aliscore.txt clustalo-aliscore genetree_filter_mafft_aliscore.txt mafft-aliscore + best_models_mafft_trimal.txt clustalo-trimal genetree_filter_mafft_trimal.txt mafft-trimal + +This folder contains several text files which summarize the best substitution models estimated by the iqtree MFP algorithm according to BIC in .txt format. These files are used for subsequent steps when creating a partitioned concatenated alignment for Maximum-Likelihood tree calculation. + +Additionally the subfolders contain all gene-tree calculation results. They are consistantly named with a combination of the name of the aligner and trimmer. For example the folder clustalo-trimal contains all gene tree results for each orthologous gene aligned with clustalo and trimmed with trimal: + +.. code-block:: bash + + $ ls results/modeltest/clustalo-trimal + EOG092C5OAL EOG092C5OPO EOG092C5Q82 EOG092C5S4U EOG092C5U93 EOG092C5UXM EOG092C5V3D EOG092C608 + $ ls results/modeltest/clustalo-trimal/EOG092C5OAL + EOG092C5OAL.bionj EOG092C5OAL.ckp.gz EOG092C5OAL.contree EOG092C5OAL.iqtree EOG092C5OAL.log EOG092C5OAL.mldist EOG092C5OAL.model.gz EOG092C5OAL.splits.nex EOG092C5OAL.treefil + +The .treefile from each individual analysis will be used later when a speciestree is calculated and when the bootstrap cut-off is applied. + +--------- +orthology +--------- + +In this folder all results from the :bash:`phylociraptor orthology` step is stored. + +.. code-block:: bash + + $ ls results/orthology + busco + $ ls results/orthology/busco + busco_runs busco_sequences busco_sequences_deduplicated busco_set busco_table.txt + + +On the first layer there is only a single folder :bash:`busco`, which is currently the only method to infer orthologous genes for the phylogeny. +The :bash:`busco` folder contains several additional folders: + +:bash:`busco_runs` contains the results of the individual busco runs for each sample in separate folders. + +:bash:`busco_sequences` contains extracted BUSCO sequences from each sample combined into a single file for each BUSCO. + +:bash:`busco_sequences_deduplicated` BUSCO sometimes reports more than one sequences even for single-copy BUSCOs. This folder contains FASTA files of each BUSCO gene from :bash:`busco_sequences` but with duplicates removed. + +:bash:`busco_set` contains the used busco set as spcified in the config.yaml file. + +Important for the pipline is the file :bash:`busco_table.txt`. It is a condensed summary of all individual BUSCO runs and is used to determine which samples will be used in subsequent steps based on BUSCO completeness. + +----------- +phylogeny-* +----------- + +This is probably the most important output folder. It contains the phylogenomic trees. When phylociraptor is finish calculating trees you should see one or more folders starting with :bash:`phylogeny` in your results folder. These folder contain results from phylogenomic tree calculations. The reason there are more than one is because you can subset which genes should be used based on the mean bootstrap support value of the gene trees. So, the folder :bash:`phylogeny-80` contains phylogenetic results only based on genes for which the mean bootstrap support was >80%. + +Let's have a look at a phylogeny folder more closely: + +.. code-block:: bash + + $ ls results/phylogeny-0 + astral concatenate iqtree njtree raxml + +As you can see there are three folders. Let us investigate the concatenate folder first, which contains information on concatenated alignments: + +.. code-block:: bash + + $ ls results/phylogeny-0/concatenate + clustalo-aliscore clustalo-trimal mafft-aliscore mafft-trimal muscle-aliscore muscle-trimal + $ ls results/phylogeny-0/concatenate/clustalo-aliscore + algn concat.fas concat.phy concat.sto partitions.txt partitions_unformated.txt statistics.txt + +The concatenate folder contains several subfolders for each aligner and trimmer combination. Each of these folders contains the following items: + +The folder :bash:`align`, which contains individual trimmed and filtered single gene alignments which are used to create the concatenated alignment. + +The files :bash:`concat.fas`, :bash:`concat.phy` and :bash:`concat.sto` contain the resulting concatenated alignment in FASTA, PHYLIP and STOCKHOLM format. + +The files :bash:`partitions.txt` and :bash:`partitions_unformated.txt` contain the partitioning scheme in a format raxml understands. These two files could differ slightly. This is because the names used to specify substitution models differs slightly between raxml and iqtree (which we use to estimate the model). Phylociraptor converts the modelnames from iqtree notation (:bash:`partitions_unformated.txt`) to raxml notation (:bash:`partitions.txt`). + +The :bash:`statistics.txt` file includes statistics of the genes used in the concatenated alignment, such as parsimony informative sites, length, etc. + + + +The next folder we can have a look at is the :bash:`iqtree` folder. It includes results from iqtree maximum-likelihood calculations based on the concatenated datasets: + +.. code-block:: bash + + $ ls results/phylogeny-0/iqtree + clustalo-aliscore clustalo-trimal mafft-aliscore mafft-trimal muscle-aliscore muscle-trimal + $ ls results/phylogeny-0/iqtree/mafft-aliscore + algn concat.best_model.nex concat.bionj concat.ckp.gz concat.contree concat.iqtree concat.log concat.mldist concat.nex concat.splits.nex concat.treefil + +Again this folder contains subfolder for each aligner and trimmer combination you specified in the config file. Each subfolder contains all output from iqtree as well as a folder (:bash:`algn`) with all alignments used to calculate the tree. + +The folder :bash:`raxml` contains all output from raxml for the different aligner and trimmer combinations. + +.. code-block:: bash + + $ ls results/phylogeny-0/raxml + clustalo-aliscore clustalo-trimal mafft-aliscore mafft-trimal muscle-aliscore muscle-trimal + $ ls results/phylogeny-0/raxml/mafft-trimal + concat.fas raxmlng.raxml.bestModel raxmlng.raxml.bootstraps raxmlng.raxml.mlTrees raxmlng.raxml.startTree + partitions.txt raxmlng.raxml.bestTree raxmlng.raxml.log raxmlng.raxml.rba raxmlng.raxml.support + +It includes the :bash:`concat.fas` file which is the concatenated alignment along with the :bash:`partitions.txt` file with partition specifications as well as the output from raxml. + +In the folder :bash:`astral` you will find species tree results from Astral: + +.. code-block:: bash + + $ ls results/phylogeny-0/astral + clustalo-aliscore clustalo-trimal mafft-aliscore mafft-trimal muscle-aliscore muscle-trimal + $ ls results/phylogeny-0/astral/muscle-trimal + species_tree.tre trees_muscle_trimal.tre + +Again there is a folder for each aligner and trimmer combination. Each folder contains two .tre files in Newick format. The :bash:`species_tree.tre` file is the species tree calculated by astral and the file :bash:`trees_muscle_trimal.tre` contains all individual gene-trees used to calculate the species tree. + +The last phylogeny results folder is the :bash:`njtree` folder which contains Neighbor-Joining trees calculated with quicktree. + +.. code-block:: bash + + $ ls results/phylogeny-0/njtree + clustalo-aliscore clustalo-trimal mafft-aliscore mafft-trimal muscle-aliscore muscle-trimal + $ ls results/phylogeny-0/njtree/muscle-aliscore + njtree.tre + +For each aligner and trimmer combinations it contains a single Newick tree file with the calculated NJ-tree. + + + + + + + diff --git a/docs/indepth/setup.rst b/docs/indepth/setup.rst index 5cf63d6..0025a8f 100644 --- a/docs/indepth/setup.rst +++ b/docs/indepth/setup.rst @@ -13,6 +13,7 @@ .. _biomartr: https://github.com/ropensci/biomartr .. _snakemake: https://snakemake.github.io/ .. _singularity: https://sylabs.io/ +.. _docker: https://docker.com/ ===================== @@ -25,9 +26,9 @@ Requirements Phylocirpator only has two requirements: -* `snakemake`_ 5.19.3+ +* `snakemake`_ 6.0.2 -* `singularity`_ 3.4.1+ +* `singularity`_ 3.4.1+ or `docker`_ All other software used is prepackaged in containers and no installation is needed. @@ -43,13 +44,13 @@ There are three simple options how to install phylocirator: .. code:: bash - git clone git@github.com:reslp/phylociraptor.git + git clone --recursive git@github.com:reslp/phylociraptor.git or .. code:: bash - git clone https://github.com/reslp/phylociraptor.git + git clone --recursive https://github.com/reslp/phylociraptor.git 3. If you don't have git, you can download the latest version also as ZIP file: diff --git a/docs/index.rst b/docs/index.rst index 9393150..50d9fdb 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -66,6 +66,7 @@ If you run into problems with phylociraptor please follow these steps: indepth/setup.rst indepth/config_files.rst indepth/indepth.rst + indepth/output.rst .. toctree:: :caption: Getting help @@ -75,4 +76,6 @@ If you run into problems with phylociraptor please follow these steps: getting-help/problems.rst getting-help/faqs.rst + getting-help/tips.rst getting-help/contact.rst + getting-help/citation.rst diff --git a/docs/introduction/minimal.rst b/docs/introduction/minimal.rst index dc2ae2a..3280c24 100644 --- a/docs/introduction/minimal.rst +++ b/docs/introduction/minimal.rst @@ -5,7 +5,7 @@ Tutorial - A minimal example ============================ -This document guides you through a first example on how to run phylociraptor. This minmal dataset consists of 6 fungal genomes, three belonging to the Ascomycota and three to +This document guides you through a first example on how to run phylociraptor. This minimal dataset consists of 6 fungal genomes, three belonging to the Ascomycota and three to Basidiomycota which will be downloaded during the first step of the run. Downstream analyses are carried out on 10 single-copy orthologous genes which will be identified in each genome with a modified BUSCO set. The so recovered genes will be aligned and trimmed and we will calculate single-gene trees for each gene. After this, we will perform a Maximum-Likelihood analyses on the full dataset and infer a species tree. The whole analysis should take about 1 hour to complete.