Skip to content

Commit

Permalink
Latest updates from dev (#74)
Browse files Browse the repository at this point in the history
* update hmmer to working version with PGAP

* update hmmer to working version with PGAP (#59) (#60)

* Adding option for using `bakta` (#58)

* first bakta addition test

* Fixing error when phigaro output was empty (#55) (#56)

* adding checkIfExists rules to input loading

* Update build.sh

* fixed report if else statement

* updated version and changelog

* fix example samplesheet url

* Create nf-core-bacannot-compare_logo_dark.png

* Update phigaro.nf

* Update CHANGELOG.md

* Update CHANGELOG.md

* Update .gitignore

* Update bacannot.nf

* workflow runs from top to bottom

* workflow correctly working

* changing version manifest

* explaining how to use bakta

* adding variables to generic annotation

* adding victors backup db

* report_general understands bakta

* report_resistance understands bakta

* update hmmer to working version with PGAP

* update hmmer to working version with PGAP (#59)

* update hmmer to working version with PGAP (#59) (#60)

* Update resfinder2gff.py

* Update addBedtoolsIntersect.R

* Update addBedtoolsIntersect.R

* Update addBedtoolsIntersect.R

* Update addBedtoolsIntersect.R

* Update docker.config

* download db if not available

* Update merge_annotations.nf

* Update merge_annotations.nf

* update process label

* Update merge_annotations.nf

* fix docker file ownerships

* added retry label for bakta

* update changelog

* changing to new version

* add falmeida-py scripts to docker

* Create generate_dockers.sh

* Minor improvements for custom database annotation (#67)

* Add step making Blast DB when annotating custom db

Blast database throw error for no alias or index file found for
nucleotide database, when using blastn. Thus, make blast database
before execute blastn to custom database on genome.

* Remove space when generate link in Shiny markdown

Whitespace occurs when generated links for custom database report in
Shiny markdown. Specify seperate string to emtpy string to fix this.

* fix: script was calling an out of bounds value

* handle exceptions and missing values

* increment test profile

* also build for proteins

* Use correct version of customdb_gffs in REPORTS module

Co-authored-by: fmalmeida <felipemarques89@gmail.com>

* fixed docker images

* resfinder now uses conda env

* rgi does not have --exclude_nudge anymore

* add circos conf files

* fix indentation

* add circos sub-workflow calling

* add circos tool

* added karyotype and gc_skew modules

* fix indentation

* modules now have a channel suitable for summary generation

* include summary generation module

* update docker image

* add aro index tsv

* add amrfinder2tsv module

* add bacannot json summary generation

* update docker images

* add new modules

* added binaries for circos plot

* update docker files

* add label to module

* finalise modules preparing data for circos

* include whole circos 'pathway' in workflow

* add circos module

* circos plot currently functional

* re-organised circos outputs

* ameliorate default version of circos plot

* fixed docker image

* fixed custom annotation problem in server

* added circos plot with easy_circos

* circos plot properly being generated

* update changelog

* update readme

* update documentation

* fix tool name

* fix dir

* update indentation

* fixed schema

* add pipefail

* fix docker image

Co-authored-by: Cy <38537368+lam-c@users.noreply.github.com>
  • Loading branch information
fmalmeida and lam-c authored Dec 19, 2022
1 parent fe1c085 commit aa871de
Show file tree
Hide file tree
Showing 62 changed files with 6,658 additions and 785 deletions.
2 changes: 1 addition & 1 deletion .zenodo.json
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
"description": "<p>The pipeline</p>\n\n<p>bacannot, is a customisable, easy to use, pipeline that uses state-of-the-art software for comprehensively annotating prokaryotic genomes having only Docker and Nextflow as dependencies. It is able to annotate and detect virulence and resistance genes, plasmids, secondary metabolites, genomic islands, prophages, ICEs, KO, and more, while providing nice an beautiful interactive documents for results exploration.</p>",
"license": "other-open",
"title": "fmalmeida/bacannot: A generic but comprehensive bacterial annotation pipeline",
"version": "v3.1.7",
"version": "v3.2",
"upload_type": "software",
"creators": [
{
Expand Down
9 changes: 5 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,7 @@ Its main steps are:
| Custom annotation from formatted FASTA or NCBI protein IDs | [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs) |
| Merge of annotation results | [bedtools](https://bedtools.readthedocs.io/en/latest/) |
| Genome Browser renderization | [JBrowse](http://jbrowse.org/) |
| Circos plot generation | [easy_circos](https://easy_circos.readthedocs.io/en/latest/index.html) |
| Renderization of automatic reports and shiny app for results interrogation | [R Markdown](https://rmarkdown.rstudio.com/), [Shiny](https://shiny.rstudio.com/) and [SequenceServer](https://sequenceserver.com/) |

🎯 In order to increase the accuracy of prokka annotation, this pipeline includes an additional HMM database to prokka's defaults. It can be either TIGRFAM (smaller but curated) or PGAP (bigger comprehensive NCBI database that contains TIGRFAM).
Expand Down Expand Up @@ -89,10 +90,10 @@ These images have been kept separate to not create massive Docker image and to a
* After installed, you need to download the required Docker images

```bash
docker pull fmalmeida/bacannot:v3.1_misc ;
docker pull fmalmeida/bacannot:v3.1_perlenv ;
docker pull fmalmeida/bacannot:v3.1_pyenv ;
docker pull fmalmeida/bacannot:v3.1_renv ;
docker pull fmalmeida/bacannot:v3.2_misc ;
docker pull fmalmeida/bacannot:v3.2_perlenv ;
docker pull fmalmeida/bacannot:v3.2_pyenv ;
docker pull fmalmeida/bacannot:v3.2_renv ;
docker pull fmalmeida/bacannot:jbrowse ;
```

Expand Down
5,022 changes: 5,022 additions & 0 deletions assets/aro_index.tsv

Large diffs are not rendered by default.

85 changes: 85 additions & 0 deletions bin/GCcalc.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
#!/usr/bin/env python3

## calculate GC
from __future__ import division

# __author__ = "Wenchao Lin"
# __copyright__ = "Copyright 2014,Tianjin Biochip Corp."
# __license__ = "GPL"
# __version__ = "1.0.0"
# __maintainer__ = "Wenchao Lin"
# __email__ = "linwenchao@yeah.net"


from Bio import SeqIO
import sys
from optparse import OptionParser
import numpy as np

def GC_content_window(s):

"""
Return GC content of input sequence
"""

gc = sum(s.count(x) for x in ['G','C','g','c','S','s'])
gc_content = gc/float(len(s))
return round(gc_content,4)


def GC_skew_window(s):

"""
Reuturn GC skew of a sequence
"""

g = s.count('G')+s.count('g')
c = s.count('C')+s.count('c')

try:
skew = (g-c)/float(g+c)
except ZeroDivisionError:
skew = 0
return round(skew,4)


def main():

"""
Calculate GC content and GC skew of input Fasta sequence
"""

usage = "usage: %prog -f input.fa [-w 1000] [-s 1000]"
parser = OptionParser(usage = usage)
parser.add_option("-f","--file",dest="filename",help="Input Fasta format file",metavar="FASTA")
parser.add_option("-w","--window",dest="WindowSize",help="default:1000 WindowSize to calculate",default=1000,type='int')
parser.add_option("-s","--step",dest="StepSize",help="default:1000 StepSize for slide widows",default=1000,type='int')
(options,args) = parser.parse_args()

window = options.WindowSize
step = options.StepSize
seqobj = SeqIO.parse(options.filename,'fasta')

for record in seqobj:
pos_array = []
gc_content_value_array = []
gc_skew_value_array = []
name = record.id
seq = record.seq
start = 0
end = 0
gc = 0
gc_skew = 0

for i in range(0,len(seq),step):
subseq = seq[i:i+window]
gc_content = (GC_content_window(subseq))
gc_skew = (GC_skew_window(subseq))
start = (i + 1 if (i+1<=len(seq)) else i)
end = ( i + step if (i+ step<=len(seq)) else len(seq))
print ("%s\t%s\t%s\t%s\t%s" % (name,start,end,gc_content,gc_skew))


if __name__ == '__main__':
main()
sys.exit()
129 changes: 129 additions & 0 deletions bin/amrfinder2tsv.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,129 @@
#!/usr/bin/env Rscript
# A supporting script for writing amrfinderplus results as csv!
# To be used with fmalmeida/bacannot-compare
# Author: Felipe Marques de Almeida <almeidafmarques@gmail.com>

#################################################
### Setting Help message and input parameters ###
#################################################
'usage: amrfinder2csv.R [ --input=<file> --output=<chr> --aro=<file> --sample=<chr> ]
options:
-i, --input=<file> Bacannot JSON summary
-o, --output=<chr> Output file [default: out.tsv]
-a, --aro=<file> CARD ARO categories
-s, --sample=<chr> Filter results of only this sample [default: all]
' -> doc

# Parse parameters
suppressMessages(library(docopt))
opt <- docopt(doc)

if (is.null(opt$input)){
stop("At least one argument must be supplied (input file)\n", call.=FALSE)
}

#
# load libraries
#
library(jsonlite)
library(stringr)
library(dplyr)

#
# functions for CARD metadata
#
aro_index <-
read.csv( opt$aro, sep = "\t" )
aro_index$ARO.Accession <-
str_remove(aro_index$ARO.Accession, "ARO:")

get_aro_category <- function(aro) {
#aro <- 3003923
df <- aro_index %>% dplyr::filter(ARO.Accession == aro)
result <- str_split(df$Drug.Class, ";")[[1]][1] # tentar padronizar pegando so o primeiro
return(result)
}

get_aro_gene_name <- function(aro) {
# aro <- 3003923
df <- aro_index %>% filter(ARO.Accession == aro)
result <- df$CARD.Short.Name
return(result)
}

#
# read annotation summary file
#
summary_file <-
read_json( opt$input )
samples <-
names( summary_file ) # get available samples in file

#
# start empty data.frames for parsing
#
identified_genes <-
setNames(
data.frame(
matrix(ncol = 6, nrow = 0)
), c("sample", "contig", "start", "end", "gene", "subclass")
)
without_aro <-
setNames(
data.frame(
matrix(ncol = 6, nrow = 0)
), c("sample", "contig", "start", "end", "gene", "subclass")
)

#
# parse AMRFinderPlus Results
#
for (sample in samples) {
subset <- summary_file[[sample]]$resistance$amrfinderplus
all_entries <- names(subset)
for (entry in all_entries) {
if (entry != "total") {

if (subset[[entry]][['type']] == 'AMR') {

contig = subset[[entry]]$contig
start = subset[[entry]]$start
end = subset[[entry]]$end

if (is.null(subset[[entry]][['card_aro']])) {
subclass = subset[[entry]]$subclass
without_aro[nrow(without_aro) + 1,] <-
c( sample, contig, start, end, subset[[entry]]$gene, subclass )
} else {
if (is.null(subset[[entry]][['subclass']])) {
subclass = "None"
} else {
subclass = get_aro_category(subset[[entry]]$card_aro)
}
identified_genes[nrow(identified_genes) + 1,] <-
c( sample, contig, start, end, subset[[entry]]$gene, subclass )
}

}
}
}
}

#
# filter results
#
if (as.character(opt$sample) != 'all') {
identified_genes <-
identified_genes %>% dplyr::filter( as.character(sample) == as.character(opt$sample) )
}

#
# save table as file
#
write.table(
identified_genes,
file = as.character( opt$output ),
quote = FALSE,
sep = "\t",
row.names = FALSE
)
12 changes: 10 additions & 2 deletions bin/run_blasts.py
Original file line number Diff line number Diff line change
Expand Up @@ -146,8 +146,16 @@ def summary(output):
db=line["sseqid"].split('~~~')[0]
gene=line["sseqid"].split('~~~')[1]
acc=line["sseqid"].split('~~~')[2]
prodc=line["sseqid"].split('~~~')[3]
desc=line["sseqid"].split('~~~')[4]

if len(line["sseqid"].split('~~~')) == 5:
prodc=line["sseqid"].split('~~~')[3]
desc=line["sseqid"].split('~~~')[4]
else:
prodc=line["sseqid"].split('~~~')[3].split(' ')[0]
try:
desc=' '.join(line["stitle"].split('~~~')[3].split(' ')[1:-1])
except:
desc='Not found'
# Subject coverage
cov=round((100 * (line["length"] - line["gaps"]) / line["slen"]), 2)
# Identity
Expand Down
93 changes: 93 additions & 0 deletions bin/vfdb2tsv.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,93 @@
#!/usr/bin/env Rscript
# A supporting script for writing VFDB results as csv!
# To be used with fmalmeida/bacannot-compare
# Author: Felipe Marques de Almeida <almeidafmarques@gmail.com>

#################################################
### Setting Help message and input parameters ###
#################################################
'usage: vfdb2csv.R [ --input=<file> --sample=<chr> --output=<chr> ]
options:
-i, --input=<file> Bacannot JSON summary
-o, --output=<chr> Output file [default: out.tsv]
-s, --sample=<chr> Filter results of only this sample [default: all]
' -> doc

# Parse parameters
suppressMessages(library(docopt))
opt <- docopt(doc)

if (is.null(opt$input)){
stop("At least one argument must be supplied (input file)\n", call.=FALSE)
}

#
# load libraries
#
library(jsonlite)
library(stringr)
library(dplyr)

#
# read annotation summary file
#
summary_file <-
read_json( opt$input )
samples <-
names( summary_file ) # get available samples in file

#
# start empty data.frames for parsing
#
identified_genes <-
setNames(
data.frame(
matrix(ncol = 6, nrow = 0)
), c("sample", "contig", "start", "end", "gene", "subclass")
)

#
# parse VFDB Results
#
for (sample in samples) {
subset <- summary_file[[sample]]$virulence$VFDB
all_entries <- names(subset)
for (entry in all_entries) {
if (entry != "total") {

contig = subset[[entry]]$chr
start = subset[[entry]]$start
end = subset[[entry]]$end
subclass = subset[[entry]]$virulence_factor
gene = subset[[entry]]$gene

identified_genes[nrow(identified_genes) + 1,] <-
c( sample, contig, start, end, gene, subclass )

}
}
}

identified_genes <-
identified_genes %>%
distinct()

#
# filter results
#
if (as.character(opt$sample) != 'all') {
identified_genes <-
identified_genes %>%
dplyr::filter( as.character(sample) == as.character(opt$sample) )
}

#
# save table as file
#
write.table(
identified_genes,
file = as.character( opt$output ),
quote = FALSE,
sep = "\t",
row.names = FALSE
)
5 changes: 4 additions & 1 deletion conf/defaults.config
Original file line number Diff line number Diff line change
Expand Up @@ -168,6 +168,10 @@ params {
// User's custom database coverage threshold
blast_custom_mincov = 65

/*
* Resources allocation configuration
* Defaults only, expecting to be overwritten
*/
// Select versions of bioconda quay.io additional tools
// Tools that are not part of the core of the pipeline,
// but can eventually be used by users
Expand All @@ -176,7 +180,6 @@ params {
bakta_version = '1.6.1--pyhdfd78af_0'

// Max resource options
// Defaults only, expecting to be overwritten
max_memory = '20.GB'
max_cpus = 16
max_time = '40.h'
Expand Down
Loading

0 comments on commit aa871de

Please sign in to comment.