Latest updates from dev (#74)

* update hmmer to working version with PGAP * update hmmer to working version with PGAP (#59) (#60) * Adding option for using `bakta` (#58) * first bakta addition test * Fixing error when phigaro output was empty (#55) (#56) * adding checkIfExists rules to input loading * Update build.sh * fixed report if else statement * updated version and changelog * fix example samplesheet url * Create nf-core-bacannot-compare_logo_dark.png * Update phigaro.nf * Update CHANGELOG.md * Update CHANGELOG.md * Update .gitignore * Update bacannot.nf * workflow runs from top to bottom * workflow correctly working * changing version manifest * explaining how to use bakta * adding variables to generic annotation * adding victors backup db * report_general understands bakta * report_resistance understands bakta * update hmmer to working version with PGAP * update hmmer to working version with PGAP (#59) * update hmmer to working version with PGAP (#59) (#60) * Update resfinder2gff.py * Update addBedtoolsIntersect.R * Update addBedtoolsIntersect.R * Update addBedtoolsIntersect.R * Update addBedtoolsIntersect.R * Update docker.config * download db if not available * Update merge_annotations.nf * Update merge_annotations.nf * update process label * Update merge_annotations.nf * fix docker file ownerships * added retry label for bakta * update changelog * changing to new version * add falmeida-py scripts to docker * Create generate_dockers.sh * Minor improvements for custom database annotation (#67) * Add step making Blast DB when annotating custom db Blast database throw error for no alias or index file found for nucleotide database, when using blastn. Thus, make blast database before execute blastn to custom database on genome. * Remove space when generate link in Shiny markdown Whitespace occurs when generated links for custom database report in Shiny markdown. Specify seperate string to emtpy string to fix this. * fix: script was calling an out of bounds value * handle exceptions and missing values * increment test profile * also build for proteins * Use correct version of customdb_gffs in REPORTS module Co-authored-by: fmalmeida <felipemarques89@gmail.com> * fixed docker images * resfinder now uses conda env * rgi does not have --exclude_nudge anymore * add circos conf files * fix indentation * add circos sub-workflow calling * add circos tool * added karyotype and gc_skew modules * fix indentation * modules now have a channel suitable for summary generation * include summary generation module * update docker image * add aro index tsv * add amrfinder2tsv module * add bacannot json summary generation * update docker images * add new modules * added binaries for circos plot * update docker files * add label to module * finalise modules preparing data for circos * include whole circos 'pathway' in workflow * add circos module * circos plot currently functional * re-organised circos outputs * ameliorate default version of circos plot * fixed docker image * fixed custom annotation problem in server * added circos plot with easy_circos * circos plot properly being generated * update changelog * update readme * update documentation * fix tool name * fix dir * update indentation * fixed schema * add pipefail * fix docker image Co-authored-by: Cy <38537368+lam-c@users.noreply.github.com>
fmalmeida · Dec 19, 2022 · aa871de · aa871de
1 parent fe1c085
commit aa871de
Show file tree

Hide file tree

Showing 62 changed files with 6,658 additions and 785 deletions.
diff --git a/.zenodo.json b/.zenodo.json
@@ -2,7 +2,7 @@
     "description": "<p>The pipeline</p>\n\n<p>bacannot, is a customisable, easy to use, pipeline that uses state-of-the-art software for comprehensively annotating prokaryotic genomes having only Docker and Nextflow as dependencies. It is able to annotate and detect virulence and resistance genes, plasmids, secondary metabolites, genomic islands, prophages, ICEs, KO, and more, while providing nice an beautiful interactive documents for results exploration.</p>", 
     "license": "other-open", 
     "title": "fmalmeida/bacannot: A generic but comprehensive bacterial annotation pipeline", 
-    "version": "v3.1.7", 
+    "version": "v3.2", 
     "upload_type": "software",
     "creators": [
         {

diff --git a/README.md b/README.md
@@ -53,6 +53,7 @@ Its main steps are:
 | Custom annotation from formatted FASTA or NCBI protein IDs | [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs) |
 | Merge of annotation results | [bedtools](https://bedtools.readthedocs.io/en/latest/) |
 | Genome Browser renderization | [JBrowse](http://jbrowse.org/) |
+| Circos plot generation | [easy_circos](https://easy_circos.readthedocs.io/en/latest/index.html) |
 | Renderization of automatic reports and shiny app for results interrogation | [R Markdown](https://rmarkdown.rstudio.com/), [Shiny](https://shiny.rstudio.com/) and [SequenceServer](https://sequenceserver.com/) |
 
 🎯 In order to increase the accuracy of prokka annotation, this pipeline includes an additional HMM database to prokka's defaults. It can be either TIGRFAM (smaller but curated) or PGAP (bigger comprehensive NCBI database that contains TIGRFAM).
@@ -89,10 +90,10 @@ These images have been kept separate to not create massive Docker image and to a
     * After installed, you need to download the required Docker images
 
     ```bash
-    docker pull fmalmeida/bacannot:v3.1_misc    ;
-    docker pull fmalmeida/bacannot:v3.1_perlenv ;
-    docker pull fmalmeida/bacannot:v3.1_pyenv   ;
-    docker pull fmalmeida/bacannot:v3.1_renv    ;
+    docker pull fmalmeida/bacannot:v3.2_misc    ;
+    docker pull fmalmeida/bacannot:v3.2_perlenv ;
+    docker pull fmalmeida/bacannot:v3.2_pyenv   ;
+    docker pull fmalmeida/bacannot:v3.2_renv    ;
     docker pull fmalmeida/bacannot:jbrowse      ;
     ```
 

diff --git a/assets/aro_index.tsv b/assets/aro_index.tsv
diff --git a/bin/GCcalc.py b/bin/GCcalc.py
@@ -0,0 +1,85 @@
+#!/usr/bin/env python3
+
+## calculate GC
+from __future__ import division
+
+# __author__ = "Wenchao Lin"
+# __copyright__ = "Copyright 2014,Tianjin Biochip Corp."
+# __license__ = "GPL"
+# __version__ = "1.0.0"
+# __maintainer__ = "Wenchao Lin"
+# __email__ = "linwenchao@yeah.net"
+
+
+from Bio import SeqIO
+import sys
+from optparse import OptionParser
+import numpy as np
+
+def GC_content_window(s):
+
+    """
+    Return GC content of input sequence
+    """
+
+    gc = sum(s.count(x) for x in ['G','C','g','c','S','s'])
+    gc_content = gc/float(len(s))
+    return round(gc_content,4)
+
+
+def GC_skew_window(s):
+
+    """
+    Reuturn GC skew of a sequence
+    """
+
+    g = s.count('G')+s.count('g')
+    c = s.count('C')+s.count('c')
+
+    try:
+        skew = (g-c)/float(g+c)
+    except ZeroDivisionError:
+        skew = 0
+    return round(skew,4)
+
+
+def main():
+
+    """
+    Calculate GC content and GC skew of input Fasta sequence
+    """
+
+    usage = "usage: %prog -f input.fa [-w 1000] [-s 1000]"
+    parser = OptionParser(usage = usage)
+    parser.add_option("-f","--file",dest="filename",help="Input Fasta format file",metavar="FASTA")
+    parser.add_option("-w","--window",dest="WindowSize",help="default:1000 WindowSize to calculate",default=1000,type='int')
+    parser.add_option("-s","--step",dest="StepSize",help="default:1000 StepSize for slide widows",default=1000,type='int')
+    (options,args) = parser.parse_args()
+
+    window = options.WindowSize
+    step = options.StepSize
+    seqobj = SeqIO.parse(options.filename,'fasta')
+
+    for record in seqobj:
+        pos_array = []
+        gc_content_value_array = []
+        gc_skew_value_array = []
+        name = record.id
+        seq = record.seq
+        start = 0
+        end = 0
+        gc = 0
+        gc_skew = 0
+
+        for i in range(0,len(seq),step):
+            subseq = seq[i:i+window]
+            gc_content = (GC_content_window(subseq))
+            gc_skew = (GC_skew_window(subseq))
+            start = (i + 1 if (i+1<=len(seq)) else i)
+            end = ( i + step if (i+ step<=len(seq)) else len(seq))
+            print ("%s\t%s\t%s\t%s\t%s" % (name,start,end,gc_content,gc_skew))
+
+
+if __name__ == '__main__':
+    main()
+    sys.exit()
diff --git a/bin/amrfinder2tsv.R b/bin/amrfinder2tsv.R
@@ -0,0 +1,129 @@
+#!/usr/bin/env Rscript
+# A supporting script for writing amrfinderplus results as csv!
+# To be used with fmalmeida/bacannot-compare
+# Author: Felipe Marques de Almeida <almeidafmarques@gmail.com>
+
+#################################################
+### Setting Help message and input parameters ###
+#################################################
+'usage: amrfinder2csv.R [ --input=<file> --output=<chr> --aro=<file> --sample=<chr> ]
+options:
+-i, --input=<file>       Bacannot JSON summary
+-o, --output=<chr>       Output file [default: out.tsv]
+-a, --aro=<file>         CARD ARO categories
+-s, --sample=<chr>       Filter results of only this sample [default: all]
+' -> doc
+
+# Parse parameters
+suppressMessages(library(docopt))
+opt <- docopt(doc)
+
+if (is.null(opt$input)){
+    stop("At least one argument must be supplied (input file)\n", call.=FALSE)
+}
+
+#
+# load libraries
+#
+library(jsonlite)
+library(stringr)
+library(dplyr)
+
+#
+# functions for CARD metadata
+#
+aro_index <- 
+    read.csv( opt$aro, sep = "\t" )
+aro_index$ARO.Accession <- 
+    str_remove(aro_index$ARO.Accession, "ARO:")
+
+get_aro_category <- function(aro) {
+    #aro <- 3003923
+    df <- aro_index %>% dplyr::filter(ARO.Accession == aro)
+    result <- str_split(df$Drug.Class, ";")[[1]][1] # tentar padronizar pegando so o primeiro
+    return(result)
+}
+
+get_aro_gene_name <- function(aro) {
+    # aro <- 3003923
+    df <- aro_index %>% filter(ARO.Accession == aro)
+    result <- df$CARD.Short.Name
+    return(result)
+}
+
+#
+# read annotation summary file
+#
+summary_file <-
+    read_json( opt$input )
+samples <- 
+    names( summary_file ) # get available samples in file
+
+#
+# start empty data.frames for parsing
+#
+identified_genes <- 
+    setNames(
+        data.frame(
+        matrix(ncol = 6, nrow = 0)
+        ), c("sample", "contig", "start", "end", "gene", "subclass")
+    )
+    without_aro      <- 
+    setNames(
+        data.frame(
+        matrix(ncol = 6, nrow = 0)
+        ), c("sample", "contig", "start", "end", "gene", "subclass")
+    )
+
+#
+# parse AMRFinderPlus Results
+#
+for (sample in samples) {
+    subset <- summary_file[[sample]]$resistance$amrfinderplus
+    all_entries <- names(subset)
+    for (entry in all_entries) {
+        if (entry != "total") {
+
+            if (subset[[entry]][['type']] == 'AMR') {
+
+                contig = subset[[entry]]$contig
+                start  = subset[[entry]]$start
+                end    = subset[[entry]]$end
+
+                if (is.null(subset[[entry]][['card_aro']])) {
+                subclass = subset[[entry]]$subclass
+                without_aro[nrow(without_aro) + 1,] <- 
+                    c( sample, contig, start, end, subset[[entry]]$gene, subclass )
+                } else {
+                if (is.null(subset[[entry]][['subclass']])) {
+                    subclass = "None"
+                } else {
+                    subclass = get_aro_category(subset[[entry]]$card_aro)
+                }
+                identified_genes[nrow(identified_genes) + 1,] <- 
+                    c( sample, contig, start, end, subset[[entry]]$gene, subclass )  
+                }
+
+            }
+        }
+    } 
+}
+
+#
+# filter results
+#
+if (as.character(opt$sample) != 'all') {
+    identified_genes <-
+        identified_genes %>% dplyr::filter( as.character(sample) == as.character(opt$sample) )
+}
+
+#
+# save table as file
+#
+write.table(
+    identified_genes,
+    file = as.character( opt$output ),
+    quote = FALSE,
+    sep = "\t",
+    row.names = FALSE
+)
diff --git a/bin/run_blasts.py b/bin/run_blasts.py
@@ -146,8 +146,16 @@ def summary(output):
         db=line["sseqid"].split('~~~')[0]
         gene=line["sseqid"].split('~~~')[1]
         acc=line["sseqid"].split('~~~')[2]
-        prodc=line["sseqid"].split('~~~')[3]
-        desc=line["sseqid"].split('~~~')[4]
+
+        if len(line["sseqid"].split('~~~')) == 5:
+            prodc=line["sseqid"].split('~~~')[3]
+            desc=line["sseqid"].split('~~~')[4]
+        else:
+            prodc=line["sseqid"].split('~~~')[3].split(' ')[0]
+            try:
+                desc=' '.join(line["stitle"].split('~~~')[3].split(' ')[1:-1])
+            except:
+                desc='Not found'
         # Subject coverage
         cov=round((100 * (line["length"] - line["gaps"]) / line["slen"]), 2)
         # Identity

diff --git a/bin/vfdb2tsv.R b/bin/vfdb2tsv.R
@@ -0,0 +1,93 @@
+#!/usr/bin/env Rscript
+# A supporting script for writing VFDB results as csv!
+# To be used with fmalmeida/bacannot-compare
+# Author: Felipe Marques de Almeida <almeidafmarques@gmail.com>
+
+#################################################
+### Setting Help message and input parameters ###
+#################################################
+'usage: vfdb2csv.R [ --input=<file> --sample=<chr> --output=<chr> ]
+options:
+-i, --input=<file>       Bacannot JSON summary
+-o, --output=<chr>       Output file [default: out.tsv]
+-s, --sample=<chr>       Filter results of only this sample [default: all]
+' -> doc
+
+# Parse parameters
+suppressMessages(library(docopt))
+opt <- docopt(doc)
+
+if (is.null(opt$input)){
+    stop("At least one argument must be supplied (input file)\n", call.=FALSE)
+}
+
+#
+# load libraries
+#
+library(jsonlite)
+library(stringr)
+library(dplyr)
+
+#
+# read annotation summary file
+#
+summary_file <-
+    read_json( opt$input )
+samples <- 
+    names( summary_file ) # get available samples in file
+
+#
+# start empty data.frames for parsing
+#
+identified_genes <- 
+    setNames(
+        data.frame(
+        matrix(ncol = 6, nrow = 0)
+        ), c("sample", "contig", "start", "end", "gene", "subclass")
+    )
+
+#
+# parse VFDB Results
+#
+for (sample in samples) {
+    subset <- summary_file[[sample]]$virulence$VFDB
+    all_entries <- names(subset)
+    for (entry in all_entries) {
+        if (entry != "total") {
+
+        contig   = subset[[entry]]$chr
+        start    = subset[[entry]]$start
+        end      = subset[[entry]]$end
+        subclass = subset[[entry]]$virulence_factor
+        gene     = subset[[entry]]$gene
+
+        identified_genes[nrow(identified_genes) + 1,] <- 
+        c( sample, contig, start, end, gene, subclass )
+
+        }
+    }
+}
+
+identified_genes <- 
+    identified_genes %>% 
+    distinct()
+
+#
+# filter results
+#
+if (as.character(opt$sample) != 'all') {
+    identified_genes <-
+        identified_genes %>% 
+        dplyr::filter( as.character(sample) == as.character(opt$sample) )
+}
+
+#
+# save table as file
+#
+write.table(
+    identified_genes,
+    file = as.character( opt$output ),
+    quote = FALSE,
+    sep = "\t",
+    row.names = FALSE
+)
diff --git a/conf/defaults.config b/conf/defaults.config
@@ -168,6 +168,10 @@ params {
 // User's custom database coverage threshold
   blast_custom_mincov = 65
 
+          /*
+           * Resources allocation configuration
+           * Defaults only, expecting to be overwritten
+           */
 // Select versions of bioconda quay.io additional tools
 // Tools that are not part of the core of the pipeline,
 // but can eventually be used by users
@@ -176,7 +180,6 @@ params {
   bakta_version     = '1.6.1--pyhdfd78af_0'
 
 // Max resource options
-// Defaults only, expecting to be overwritten
   max_memory                 = '20.GB'
   max_cpus                   = 16
   max_time                   = '40.h'