Home
Jupyter, RCurl, rjson, IRkernel, pheatmap, ggplot2, RColorBrewer, XML, foreach, parallel, doParallel, data.table, utils, rlist, crul, jsonlite, R.utils, rvest, colorspace, recommenderlab, RAM
Install the R package dependencies using install.packages(c("RCurl", "rjson", "IRkernel", "pheatmap", "ggplot2", "RColorBrewer", "XML", "foreach", "doParallel", "data.table", "rlist", "crul", "jsonlite", "R.utils", "rvest", "colorspace", "recommenderlab", "RAM")). Package names must be quoted. Jupyter is not an R package and has to be installed separately; parallel and utils ship with R and need no installation.
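As a minimal sketch, the dependencies can be checked before installing so that only the missing ones are fetched. The helper name `missing.packages` is hypothetical, and the list assumes that "IRKernal" refers to the CRAN package IRkernel. The install line is left commented out so the snippet has no side effects.

```r
# Hypothetical helper: returns the subset of `pkgs` that is not installed.
missing.packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# R package dependencies (Jupyter is installed outside of R; parallel and
# utils are part of the base distribution).
deps <- c("RCurl", "rjson", "IRkernel", "pheatmap", "ggplot2", "RColorBrewer",
          "XML", "foreach", "doParallel", "data.table", "rlist", "crul",
          "jsonlite", "R.utils", "rvest", "colorspace", "recommenderlab", "RAM")

to_install <- missing.packages(deps)
# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```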
BacDiveApiCrawler.R
BacMapCrawler.R
CleanProTrait.R
CombineData.R
CreateDataTable.R
ParseIJSEM.R
Utility.R
bacdive.crawler()
retrieves information from the BacDive API, organizing it into a formatted table
bacdive.crawler(usrname, pass, num_requests = 10, save_file = TRUE)
usrname the username for a verified BacDive account
pass the password for a corresponding BacDive account
num_requests the number of bacterial entries to asynchronously download
save_file if true, saves a .csv to the working directory containing the information extracted from the BacDive API
Designed to traverse the API provided by BacDive. The BacDive API provides a database that can be easily queried and returns microbial physiology data in JSON format. Each species has its own 'page', which details information such as taxonomy, morphology, strain information, and more. This script currently records only a selected set of traits, so more data could be extracted if support for it were implemented.
Returns a data.frame containing information extracted from BacDive
Because this traverses the site's API, it is limited by internet speed and by the rate at which the site's server responds, which can slow the script down. In addition, if num_requests is set too high, it may be too demanding for BacDive's server. Lastly, BacDive contains many more strains than species. All of these strains are collected even though this method was implemented to extract only species information.
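Since every strain is collected, one way to reduce the result to a single row per species is sketched below. The toy table and its column names (`species`, `strain`, `gram`) are assumptions for illustration; the actual columns returned by bacdive.crawler() may differ.

```r
# Toy stand-in for the data.frame returned by bacdive.crawler();
# the column names here are assumptions for illustration only.
bacdive <- data.frame(
  species = c("Escherichia coli", "Escherichia coli", "Bacillus subtilis"),
  strain  = c("K-12", "O157:H7", "168"),
  gram    = c("negative", "negative", "positive"),
  stringsAsFactors = FALSE
)

# Keep the first strain encountered for each species.
species_only <- bacdive[!duplicated(bacdive$species), ]
```

Keeping the first row is the simplest choice. Merging the strain rows instead would preserve more information at the cost of more code.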
bacmap.crawler()
scrapes the BacMap Database for microbial phenotypic information
bacmap.crawler(url = "http://bacmap.wishartlab.com/", num_requests = 10)
url the page of the BacMap website from which to begin scraping. Must be a page containing a table of microbial entries. NOTE: The default homepage is the most stable starting point, and because the method is fast under 'normal' internet connections, changing the starting page will not noticeably affect runtime.
num_requests the number of bacterial entries to asynchronously download
Scrapes the BacMap website for microbial phenotype information according to the website's structure as of August 2019. Expects a table from which to determine all URLs that lead to microbial phenotype information. Expects every microbial phenotype page to contain a table with the same phenotype entries. Condenses this information from every microbe into one table.
Returns a data.frame containing information extracted from BacMap. Also saves a copy of this table locally as a .csv
clean.protrait()
retrieves information from a file downloaded from the ProTrait Atlas, formatting it into a table
clean.protrait(save_file = TRUE)
save_file if true, saves a .csv to the working directory containing the information extracted from the ProTrait Atlas
Designed to extract information from a table created by ProTrait. The ProTrait table lacks a format that generalizes traits, instead listing each type of trait (gram-positive, pathogenic in animals, aerobe, etc.) as its own column. Therefore, this script reorganizes the table into generalized traits, making it easy to use for purposes such as annotation. It first checks whether the ProTrait file already exists in the working directory. If it does not, it downloads the file to the working directory and then starts formatting it.
Returns a data.frame containing information extracted from ProTrait
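The reorganization described above can be sketched as follows. This is a toy illustration, not the script's actual logic: the input columns, the 0/1 encoding, the `groups` list, and the helper name `generalize` are all assumptions.

```r
# Toy table in the one-column-per-trait layout (column names are assumed).
protrait <- data.frame(
  organism      = c("A", "B", "C"),
  gram_positive = c(1, 0, 0),
  gram_negative = c(0, 1, 0),
  aerobe        = c(1, 0, 1),
  anaerobe      = c(0, 1, 0),
  stringsAsFactors = FALSE
)

# Hypothetical grouping of specific trait columns under generalized traits.
groups <- list(
  gram_stain         = c(positive = "gram_positive", negative = "gram_negative"),
  oxygen_requirement = c(aerobe = "aerobe", anaerobe = "anaerobe")
)

# For each generalized trait, join the labels of every column flagged 1.
generalize <- function(tbl, groups) {
  out <- data.frame(organism = tbl$organism, stringsAsFactors = FALSE)
  for (trait in names(groups)) {
    cols <- groups[[trait]]
    out[[trait]] <- apply(tbl[, cols, drop = FALSE], 1, function(flags) {
      hits <- names(cols)[flags == 1]
      if (length(hits) == 0) NA_character_ else paste(hits, collapse = ", ")
    })
  }
  out
}

clean <- generalize(protrait, groups)
```

Organisms with no flagged column in a group receive NA for that generalized trait.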
parse.ijsem()
is a method for parsing the International Journal of Systematic and Evolutionary Microbiology (IJSEM) database, which contains phenotypic information about microbes
parse.ijsem()
This function first looks for a local copy of the raw IJSEM data table. If none exists, it downloads a copy and then begins parsing it
Returns a data.frame containing information extracted from the metadata and saves a .csv locally
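The "use the local copy, otherwise download" behaviour can be sketched as below. The helper name `fetch.if.missing` and the URL are placeholders, not the function's real internals; the demonstration uses a file that already exists so no network access takes place.

```r
# Hypothetical helper: return `path`, downloading from `url` only when no
# local copy exists yet.
fetch.if.missing <- function(path, url) {
  if (!file.exists(path)) {
    download.file(url, destfile = path, mode = "wb")
  }
  path
}

# Demonstration with a file that already exists, so nothing is downloaded.
local_copy <- tempfile(fileext = ".csv")
write.csv(data.frame(species = "Escherichia coli"), local_copy, row.names = FALSE)

# The URL below is a placeholder and is never contacted in this example.
raw <- read.csv(fetch.if.missing(local_copy, "https://example.org/placeholder.csv"),
                stringsAsFactors = FALSE)
```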
read.excel()
is a method for reading an Excel spreadsheet and converting it into a data frame representation
read.excel(path)
path the path to the desired excel spreadsheet
This can be used to read an Excel spreadsheet containing additional user-supplied data about microbes. The result can then be passed to combine.data() to merge it with the online databases into a single, comprehensive data table.
Returns a data frame representation
combine.data()
combines the given tables into a single, formatted table
combine.data(data, save_file = TRUE)
data a list of tables that will be combined
save_file if true, saves a .csv to the working directory containing the information resulting from the combined table
CombineData.R is a script that merges given tables. This script is required because the column labels produced for each of these tables are different and there are different traits extracted in general. This script works to create one cohesive table. It also runs the following, additional methods for cleaning up the table: merging repetitive species entries, correcting traits that have synonyms, renaming nutrition requirements, and ordering each entry with multiple traits alphabetically.
In addition, this function requires an external file that contains the names of each column (each representing a type of trait such as oxygen requirement, gram stain, etc.). This defaults to an internal file. A supplied file must be a .csv with the following specifications: the first row must contain the desired column names of the final table, and the entries under each name are the column names of the data tables that are going to be merged. For example, if two of the tables to be combined hold "Oxygen Requirement" information, one in a column called "Oxygen tolerance" and the other in a column called "Oxygen preference", create a column named "Oxygen Requirement" and list "Oxygen tolerance" and "Oxygen preference" as entries under it.
Returns a data.frame containing information from the combined tables
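A minimal sketch of how such a column-name mapping could be applied is shown below. The mapping is built inline rather than read from a .csv, the source tables are toy data, and the helper name `standardize` is hypothetical. The additional cleanup steps of combine.data() are not reproduced.

```r
# Mapping in the layout described above: each column name is a final trait
# name and the entries beneath it are the source column names to fold in.
mapping <- data.frame(
  "Oxygen Requirement" = c("Oxygen tolerance", "Oxygen preference"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Two toy source tables that label the same trait differently.
a <- data.frame(Species = "Escherichia coli", "Oxygen tolerance" = "facultative",
                check.names = FALSE, stringsAsFactors = FALSE)
b <- data.frame(Species = "Bacillus subtilis", "Oxygen preference" = "aerobe",
                check.names = FALSE, stringsAsFactors = FALSE)

# Rename every source column that appears in the mapping to its final name.
standardize <- function(tbl, mapping) {
  for (final in names(mapping)) {
    hit <- names(tbl) %in% mapping[[final]]
    names(tbl)[hit] <- final
  }
  tbl
}

combined <- rbind(standardize(a, mapping), standardize(b, mapping))
```

Once the columns share one name, the tables can be stacked with rbind(), as here, or merged by species.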
ParseHMDB.R
parse.hmdb()
downloads all metabolite information from the HMDB website and parses through the XML file, converting it into a readable data table that is saved locally to a .csv.
parse.hmdb(file = 'hmdb_metabolites.xml', link = 'http://www.hmdb.ca/system/downloads/current/hmdb_metabolites.zip')
file the name of the local file for the hmdb data table
link the web address from where to download the hmdb database
Parses the HMDB data table, which is in XML format, into a data frame containing species as rows and traits as columns. In the original XML format, the information is nested and difficult to read. This method flattens all the nested information and reformats it as a table. It does not look for specific information, but keeps every category it finds.
Returns a data.frame containing information extracted from HMDB
This method currently does not keep multiple entries under the same trait category. For example, if the data has multiple entries for disease, it will only keep the first entry under 'disease'.
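The flattening step and the first-entry limitation can be illustrated with base R on a nested list. This is only a sketch: the record below is invented toy data standing in for one parsed XML entry, and `flatten.entry` is a hypothetical helper, not part of ParseHMDB.R.

```r
# Toy nested record standing in for one parsed XML entry (field names assumed).
entry <- list(
  name     = "metabolite X",
  taxonomy = list(kingdom = "kingdom Y", class = "class Z"),
  disease  = "disease A",
  disease  = "disease B"
)

# unlist() flattens nested lists and joins nested names with a dot.
flatten.entry <- function(entry) {
  flat <- unlist(entry)
  # Keep only the first value per category, mirroring the limitation above.
  flat[!duplicated(names(flat))]
}

record <- as.data.frame(as.list(flatten.entry(entry)), stringsAsFactors = FALSE)
```

Collapsing repeated categories with paste(..., collapse = "; ") instead of dropping them would be one way to retain all entries.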
HeatMap.R
save.figure()
is a method for saving ggplots to a png image
save.figure(figure, file_location = '', width = 5, height = 6)
figure the ggplot to save
file_location a path directing where to save the figure
width the width of the saved image
height the height of the saved image
Saves a png image to the given path
load.abundance.data()
is a method for loading an abundance table from a .csv file in the appropriate format for use with the heat map creating functions
load.abundance.data(path, column = 1)
path the path from the working directory to the .csv file containing the abundance table
column the column number containing the feature names
The abundance table needs to be loaded into R in such a way that the row names are the feature names, the sample names are the column names, and all its values are numerics.
Returns a numerical matrix created from the abundance table
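The required layout (feature names as row names, sample names as column names, numeric values) can be produced along these lines. The reader below is a hypothetical stand-in for load.abundance.data(), shown on a small temporary file.

```r
# Write a small abundance table to a temporary .csv for the demonstration.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(feature = c("taxon_a", "taxon_b"),
                     sample1 = c(10, 0), sample2 = c(3, 7)),
          path, row.names = FALSE)

# Hypothetical reader: `column` is the column number holding the feature names.
read.abundance <- function(path, column = 1) {
  tbl <- read.csv(path, check.names = FALSE, stringsAsFactors = FALSE)
  mat <- as.matrix(tbl[, -column, drop = FALSE])
  rownames(mat) <- tbl[[column]]
  storage.mode(mat) <- "double"  # make sure every value is numeric
  mat
}

abundance <- read.abundance(path)
```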
load.meta.data()
is a method for loading metadata in .csv files in the appropriate format for use with the heat map creating functions
load.meta.data(path, tax_column = 1)
path the path from the working directory to the .csv file containing the metadata
tax_column the column number containing the taxonomical or sample (i.e. identifying) name for the metadata
This can be used to load feature or sample metadata. Metadata needs to be loaded in such a way that the row names are the identifying names and the traits are the column names.
Returns a data.frame containing information extracted from the metadata
This will eliminate all duplicate entries from the metadata without merging their data, which can result in data loss.
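The data loss mentioned above is easy to see in a toy case. The sketch below assumes that duplicates are dropped by keeping the first row per identifying name; the helper `dedupe.meta` and the table are illustrative, not the function's actual code.

```r
meta <- data.frame(
  name   = c("taxon_a", "taxon_a", "taxon_b"),
  oxygen = c("aerobe", NA, "anaerobe"),
  gram   = c(NA, "negative", "positive"),
  stringsAsFactors = FALSE
)

# Hypothetical loader step: keep the first row per identifying name and
# move the identifying column into the row names.
dedupe.meta <- function(tbl, tax_column = 1) {
  kept <- tbl[!duplicated(tbl[[tax_column]]), , drop = FALSE]
  rownames(kept) <- kept[[tax_column]]
  kept[, -tax_column, drop = FALSE]
}

feature_meta <- dedupe.meta(meta)
# The gram stain recorded only on the second taxon_a row is lost.
```

If both rows matter, merging them before loading avoids the loss.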
create.correlogram()
creates a heat map based on the correlation of features given an abundance table and feature metadata.
create.correlogram(data, feature_meta, show = TRUE, omit = TRUE, cluster_distance_method = 'euclidean', unique_colors = TRUE)
data abundance data in a numerical matrix
feature_meta a data.frame containing feature metadata
show if true, will display the graph upon completion
omit if true, will delete all microbial entries that are missing metadata
cluster_distance_method the distance method used to cluster samples and features (must be one of euclidean, maximum, manhattan, canberra, binary or minkowski)
unique_colors if true, will attempt to choose the most distinct colors; this does not work well for continuous values
The features need to be the rows of the abundance data.
Returns a pheatmap with the following components: row hclusters, column hclusters, kmeans, and gtable
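The computation behind a feature correlogram can be sketched in base R without plotting. The matrix is toy data, and clustering the correlation matrix with hclust() is an assumption about the approach rather than a description of what create.correlogram() does internally. The accepted cluster_distance_method values are the methods of stats::dist().

```r
# Toy abundance matrix: 4 features (rows) by 6 samples (columns).
data <- matrix(c(1, 2, 3, 4, 5, 6,
                 2, 4, 6, 8, 10, 13,
                 6, 5, 4, 3, 2, 1,
                 1, 3, 2, 5, 4, 6),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("feature", 1:4), paste0("sample", 1:6)))

# Feature-by-feature correlation; cor() correlates columns, hence t().
feature_cor <- cor(t(data))

# Hierarchical clustering of features with the chosen distance method.
cluster_distance_method <- "euclidean"
feature_tree <- hclust(dist(feature_cor, method = cluster_distance_method))
```

The correlation matrix and the tree are the kind of inputs a heat map function such as pheatmap() can draw and annotate with feature metadata.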
multi.correlogram()
creates multiple correlograms using multiple data sets
multi.correlogram(data_tables, sample_datas, omit = FALSE)
data_tables a list of matrices (sample data like abundance tables, intensity values, etc.)
sample_datas a list of data frames with annotation data (microbial physiology data, metabolite physicochemical data, etc.)
omit whether to eliminate samples with no annotation data
The features need to be the rows of the sample data.
Returns a list of correlograms, one for each combination of sample data
one.v.all()
uses the create.heatmap function, but filters the metadata such that it labels only a single feature category and type, labeling all others as 'other'
one.v.all(data, sample_meta, feature_meta, which = 2, percentile = 0.75, show = FALSE, column, trait, cluster_distance_method = "euclidean")
data abundance data in a numerical matrix
sample_meta a data.frame containing sample metadata
feature_meta a data.frame containing feature metadata
which a number representing whether to filter the sample(1) or feature(2) metadata
percentile a filter for displaying only entries with a threshold correlation
show if true, will display the graph upon completion
column the column number with the feature category
trait the specific feature type to use
cluster_distance_method the distance method used to cluster samples and features (must be one of euclidean, maximum, manhattan, canberra, binary or minkowski)
Compares only one feature type against all others in a feature category (e.g. aerobic respiration vs. all other oxygen requirements). The features need to be the rows of the abundance data. Any number of feature categories can be supplied, but only one will be used.
Returns a pheatmap with the following components: row hclusters, column hclusters, kmeans, and gtable
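The relabeling that turns one category into "trait versus other" can be sketched as follows. The helper name `one.v.rest` and the metadata are illustrative assumptions; the percentile filtering and plotting steps are not shown.

```r
feature_meta <- data.frame(
  oxygen = c("aerobe", "anaerobe", "facultative", "aerobe"),
  row.names = paste0("taxon", 1:4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep `trait` in column `column`, label the rest 'other'.
one.v.rest <- function(meta, column, trait) {
  meta <- meta[, column, drop = FALSE]
  meta[[1]] <- ifelse(meta[[1]] == trait, trait, "other")
  meta
}

filtered <- one.v.rest(feature_meta, column = 1, trait = "aerobe")
```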
all.one.v.all()
uses the one.v.all function to create a heatmap for every feature type found
all.one.v.all(data, sample_meta, feature_meta, which = 2, percentile = 0.75, show = FALSE, column, directory = '', cluster_distance_method = "euclidean")
data abundance data in a numerical matrix
sample_meta a data.frame containing sample metadata
feature_meta a data.frame containing feature metadata
which a number representing whether to filter the sample(1) or feature(2) metadata
percentile a filter for displaying only entries with a threshold correlation
show if true, will display the graph upon completion
column the column number with the feature category
directory the path from the working directory to where the file should be saved
cluster_distance_method the distance method used to cluster samples and features (must be one of euclidean, maximum, manhattan, canberra, binary or minkowski)
Creates a heatmap for every feature type found (e.g. 3 forms of oxygen requirements). The features need to be the rows of the abundance data. Any number of feature categories can be supplied, but only one will be used. The files are named automatically based on the trait.
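A rough sketch of the per-trait loop is given below. The output directory, the `.png` naming scheme, and the metadata are assumptions; the actual call to one.v.all() is left as a comment because it depends on real abundance data.

```r
feature_meta <- data.frame(
  oxygen = c("aerobe", "anaerobe", "facultative", "aerobe"),
  stringsAsFactors = FALSE
)
column    <- 1
directory <- "figures"  # hypothetical output directory

# Every distinct feature type in the chosen category.
traits <- sort(unique(feature_meta[[column]]))

# One output path per trait; the .png naming scheme is an assumption.
files <- file.path(directory, paste0(traits, ".png"))

for (i in seq_along(traits)) {
  # A call such as one.v.all(..., column = column, trait = traits[i]) would go
  # here, followed by saving the result to files[i].
  message("would plot '", traits[i], "' to ", files[i])
}
```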