PathAnalyser is a flexible and user-friendly R package that provides functionality for assessing ER and HER2 pathway activity in breast cancer transcriptomic datasets using a gene expression signature. Typically, gene signatures can be broadly classified into the following three categories:
- gene-sets - list of genes without information regarding the strength or direction of association with a phenotype
- weighted gene lists - lists of genes with numerical weights representing strength and direction of association with a phenotype
- direction-associated gene signatures - lists of up-regulated and down-regulated genes, containing only the direction of association with the phenotype
Several currently available packages / algorithms classify samples using gene-sets (e.g. GSVA and GSEA) or weighted gene lists (e.g. the PAM50 algorithm). However, despite the third type of signature (direction-associated gene signatures) being reported by numerous publications, there are currently no software tools available for classifying samples based on these signatures. PathAnalyser addresses this need by providing functionality for classifying samples by pathway activity using these highly reported and widely available gene signatures.
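To make the three signature categories concrete, here is a minimal sketch of how each might be held in plain R objects. The gene names and weights are illustrative only and are not taken from any published signature or from PathAnalyser itself.

```r
# Illustrative values only: names and weights are hypothetical.

# 1. Gene-set: gene names with no strength or direction attached.
gene_set <- c("ESR1", "GATA3", "FOXA1")

# 2. Weighted gene list: the sign gives the direction and the
#    magnitude gives the strength of association with the phenotype.
weighted_genes <- c(ESR1 = 1.8, GATA3 = 0.9, EGFR = -1.2)

# 3. Direction-associated signature: two gene lists, direction only.
direction_sig <- list(
  up   = c("ESR1", "GATA3"),
  down = c("EGFR")
)

# A weighted list can be reduced to a direction-associated signature
# by keeping only the sign of each weight.
derived_sig <- list(
  up   = names(weighted_genes)[weighted_genes > 0],
  down = names(weighted_genes)[weighted_genes < 0]
)
```

The third form is the one PathAnalyser works with: an up-regulated gene list plus a down-regulated gene list.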
- Summary
- PathAnalyser workflow
- Installation
- Input File Formats
- Quick Start
- Accessing help
- Questions, bug reports or issues
- If you wish to know more
The typical workflow for using the package is outlined below:
Note: Assessment of classification is optional and can only occur if true pathway class labels (e.g. "positive", "negative" or "uncertain") are available for the transcriptomic dataset.
PathAnalyser needs the following:
- R (tested on version 4.1.1)
- GSVA Bioconductor package (1.42.0) required by classification algorithm
- The following R libraries (the number in brackets is the version tested during development):
  - GSVA (1.40.1)
  - edgeR (3.34.1)
  - limma (3.48.3)
  - plotly (4.10.0)
  - reader (1.0.6)
  - ggplot2 (3.3.5)
  - reshape2 (1.4.4)
Note: The package is platform-independent; it was developed and runs on multiple operating systems (Windows, macOS, Linux).
All dependencies should be installed together with the PathAnalyser package; however, they can also be installed separately. To install all required CRAN dependencies of PathAnalyser, type the following in R:
install.packages(c("ggfortify", "ggplot2", "plotly", "reader", "reshape2"))
All Bioconductor dependencies can be installed by typing the following in R:
# if not previously installed
install.packages("BiocManager")
BiocManager::install(c("GSVA","edgeR", "limma"))
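Before installing, it can help to check which of the dependencies named in the install commands are already present. The following is a minimal base R sketch; the helper name is_installed is hypothetical and not part of PathAnalyser.

```r
# Dependency names copied from the two install commands.
cran_deps <- c("ggfortify", "ggplot2", "plotly", "reader", "reshape2")
bioc_deps <- c("GSVA", "edgeR", "limma")

# Hypothetical helper: system.file() returns "" when a package is not
# installed, so this checks for presence without loading anything.
is_installed <- function(pkgs) {
  vapply(pkgs, function(p) nzchar(system.file(package = p)), logical(1))
}

missing_cran <- cran_deps[!is_installed(cran_deps)]
missing_bioc <- bioc_deps[!is_installed(bioc_deps)]

message("Missing CRAN packages: ", paste(missing_cran, collapse = ", "))
message("Missing Bioconductor packages: ", paste(missing_bioc, collapse = ", "))
```

Only the packages reported as missing then need to be passed to install.packages() or BiocManager::install().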
You can download the latest source tarball file (ending in .tar.gz) from the latest release section on the PathAnalyser GitHub repository page.
Then to install this local source package type the following in R:
library(utils)
install.packages("PathAnalyser_1.0.0.tar.gz", repos = NULL, type = "source", dependencies=TRUE)
For instructions for more advanced installations (e.g. for developers), please consult the vignette.
PathAnalyser can read two types of input data files:
- A gene expression data set file containing a table with sample names or IDs as column names and gene names as row names.
An example gene expression data set file:
- Gene signature files, which must be provided in either the gene-set file format (GRP) (shown below) or the gene matrix transposed file format (GMT). Currently, PathAnalyser requires two gene signature files (one for the up-regulated gene-set and one for the down-regulated gene-set) for a gene signature of a given pathway.
An example gene signature file for the up-regulated gene-set of a gene signature:
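As a rough illustration of the two file layouts, the sketch below writes a toy tab-separated expression table and a toy GRP file to a temporary directory and reads them back with base R. All gene names, sample IDs and values are made up, and the exact layout accepted by the PathAnalyser readers (for instance the empty header cell above the row names) is an assumption here, so check the vignette for the authoritative format.

```r
# Hypothetical toy data: genes in rows, samples in columns.
expr <- matrix(c(5.1, 3.2, 0.4, 7.8, 6.0, 2.5),
               nrow = 3,
               dimnames = list(c("ESR1", "GATA3", "EGFR"),
                               c("sample_1", "sample_2")))

# Write a tab-separated table; col.names = NA adds an empty header
# cell above the gene-name column.
expr_file <- file.path(tempdir(), "gene_expr.txt")
write.table(expr, expr_file, sep = "\t", quote = FALSE, col.names = NA)

# GRP layout: one gene per line; lines starting with "#" are comments.
up_file <- file.path(tempdir(), "up_gene_sig.grp")
writeLines(c("# up-regulated genes (toy example)", "ESR1", "GATA3"), up_file)

# Read both files back with base R only (not the PathAnalyser readers).
expr_back <- as.matrix(read.delim(expr_file, row.names = 1,
                                  check.names = FALSE))
up_genes <- readLines(up_file)
up_genes <- up_genes[!grepl("^#", up_genes) & nzchar(up_genes)]
```

In practice the files would be read with read_expression_data and read_signature; the base R calls are used here only so the sketch runs without the package installed.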
Once the package is installed, to start using PathAnalyser simply load the PathAnalyser package in R:
library(PathAnalyser)
To read a gene expression data set file (a tab- or comma-separated file, i.e. a file with the extension .tsv, .csv or .txt) into a matrix format for classification, type the following in R:
data_mat <- read_expression_data("gene_expr.txt")
To read the two signature files comprising the up-regulated gene-set and down-regulated gene-set of the gene signature, type the following in R, using the up_sig_file and down_sig_file parameters for the up-regulated and down-regulated gene-set of the signature respectively:
sig_df <- read_signature(up_sig_file="up_gene_sig.grp", down_sig_file="down_gene_sig.grp")
The classification functions of PathAnalyser require the expression dataset to be normalised.
For unnormalised RNA-seq data only, perform a logCPM transformation on the unnormalised (raw count-containing) gene expression matrix (data_mat) generated by read_expression_data, using log_cpm_transform:
norm_data <- log_cpm_transform(data_mat)
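For readers unfamiliar with logCPM, the following base R sketch shows the general idea on a toy count matrix: scale each sample to counts per million, then take log2. The pseudocount of 1 is an assumption made for this illustration; the package's log_cpm_transform may handle zero counts differently (edgeR's cpm function, for example, offers log = TRUE with a prior.count argument).

```r
# Toy raw counts (hypothetical): genes in rows, samples in columns.
counts <- matrix(c(10, 0, 90, 200, 300, 500),
                 nrow = 3,
                 dimnames = list(c("gene_a", "gene_b", "gene_c"),
                                 c("sample_1", "sample_2")))

# Library size: total number of counts in each sample.
lib_sizes <- colSums(counts)

# Counts per million: divide each column by its library size.
cpm <- sweep(counts, 2, lib_sizes, "/") * 1e6

# Log-transform; the pseudocount of 1 (an assumption) avoids log2(0).
log_cpm <- log2(cpm + 1)
```

After this scaling every sample sums to one million before the log step, which removes differences in sequencing depth between samples.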
Note: Microarray datasets must be normalised prior to performing classification using PathAnalyser, as the package currently does not contain functionality for normalising microarray datasets.
For further quality control and data pre-processing, including filtering out genes from the gene expression matrix that are not present in the gene signature data frame, or those genes lacking expression values in < 10% of the total number of samples, call the check_signature_vs_dataset function with the logCPM transformed gene expression matrix (norm_data) and gene signature data frame (sig_df):
norm_data <- check_signature_vs_dataset(norm_data, sig_df)
Note: Genes with multiple names (typically found in microarray data sets), often delimited with "///", will be dropped from the gene expression matrix, regardless of the presence of one of these gene names in the gene signature.
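The kind of filtering described here can be sketched in base R as follows. This is not the implementation of check_signature_vs_dataset. In particular, the missing-value rule is an assumed reading of the 10% criterion (genes with expression values in fewer than 10% of samples are dropped), and all data are hypothetical.

```r
# Hypothetical toy matrix: genes in rows, samples in columns.
expr <- matrix(c(5.1, 4.8,  NA, 6.0,   # ESR1
                  NA,  NA,  NA,  NA,   # GATA3: no expression values
                 1.2, 0.9, 1.1, 1.0,   # EGFR
                 9.0, 9.1, 8.9, 9.2,   # ACTB: not in the signature
                 3.0, 3.1, 2.9, 3.3),  # multi-name row
               nrow = 5, byrow = TRUE,
               dimnames = list(c("ESR1", "GATA3", "EGFR", "ACTB",
                                 "TFF1 /// TFF3"),
                               paste0("sample_", 1:4)))
sig_genes <- c("ESR1", "GATA3", "TFF1", "EGFR")

# Drop rows whose name lists several genes separated by "///".
keep_single <- !grepl("///", rownames(expr), fixed = TRUE)

# Keep only genes that appear in the signature.
keep_in_sig <- rownames(expr) %in% sig_genes

# Assumed rule: require values in at least 10% of samples.
frac_present <- rowMeans(!is.na(expr))
keep_present <- frac_present >= 0.10

filtered <- expr[keep_single & keep_in_sig & keep_present, , drop = FALSE]
```

Note that the "TFF1 /// TFF3" row is dropped even though TFF1 is in the toy signature, matching the behaviour described in the note on multi-name genes.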
Pathway-based classification using percentile thresholds can be performed by using the classify_gsva_percent function with a normalised gene expression matrix and gene signature data frame:
# Using default percentile threshold (quartile = 25%)
classes_df <- classify_gsva_percent(norm_data, sig_df)
A custom percentile threshold can be provided by the user for tuning the pathway-based classification, by adding the percent_thresh parameter:
# Using a 50th percentile threshold (50%)
classes_df <- classify_gsva_percent(norm_data, sig_df, percent_thresh=50)
The generated output (classes_df) of the classification function is a data frame containing sample names as the first column and the predicted activity class for a given pathway ("Active", "Inactive" or "Uncertain") as the second column.
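To give an intuition for percentile-based classification, the sketch below labels samples from two hypothetical per-sample scores, one for the up-regulated gene-set and one for the down-regulated gene-set. The decision rule (Active when the up score is in the top percentile band and the down score is in the bottom band, Inactive for the reverse, Uncertain otherwise) is an assumption made for illustration; it is not confirmed to be what classify_gsva_percent does internally, and the helper classify_by_percentile is hypothetical.

```r
# Hypothetical per-sample scores for the up and down gene-sets.
samples    <- paste0("sample_", 1:8)
up_score   <- c( 2.0,  1.5, 0.1, -0.2, -1.4, -2.1,  0.3, 1.9)
down_score <- c(-1.8, -1.2, 0.0,  0.4,  1.6,  2.2, -0.1, 0.2)

# Hypothetical helper implementing the assumed rule.
classify_by_percentile <- function(up, down, percent_thresh = 25) {
  p <- percent_thresh / 100
  up_high   <- up   >= quantile(up, 1 - p)
  up_low    <- up   <= quantile(up, p)
  down_high <- down >= quantile(down, 1 - p)
  down_low  <- down <= quantile(down, p)

  classes <- rep("Uncertain", length(up))
  classes[up_high & down_low] <- "Active"
  classes[up_low & down_high] <- "Inactive"
  classes
}

# Same two-column shape as the output described in the text.
toy_classes_df <- data.frame(
  sample = samples,
  class  = classify_by_percentile(up_score, down_score),
  stringsAsFactors = FALSE
)
```

With these toy scores, sample_1 is labelled Active, sample_5 and sample_6 are labelled Inactive, and the rest are Uncertain. Under this rule, raising percent_thresh widens the bands, so fewer samples end up Uncertain.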
An interactive PCA plot for visualising the pathway-based classification of samples can be produced by using the classes_PCA function with the normalised expression matrix (norm_data), the data frame produced by the classify_gsva_percent function (classes_df) and the pathway of interest:
classes_PCA(norm_data, classes_df, pathway = "ER")
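The data behind such a plot can be approximated with base R's prcomp. The sketch below computes the first two principal components for a simulated matrix and attaches hypothetical class labels; it does not reproduce the interactive plot drawn by classes_PCA, and the scaling choices are assumptions.

```r
set.seed(42)

# Simulated normalised expression: 50 genes by 6 samples (hypothetical).
expr <- matrix(rnorm(50 * 6), nrow = 50,
               dimnames = list(paste0("gene_", 1:50),
                               paste0("sample_", 1:6)))

# Hypothetical predicted classes in the two-column layout.
toy_classes <- data.frame(
  sample = colnames(expr),
  class  = c("Active", "Active", "Inactive",
             "Inactive", "Uncertain", "Uncertain"),
  stringsAsFactors = FALSE
)

# prcomp() expects observations in rows, so transpose the matrix
# to put samples in rows and genes in columns.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# First two components per sample, joined to the class labels.
pca_df <- data.frame(sample = rownames(pca$x),
                     PC1 = pca$x[, 1],
                     PC2 = pca$x[, 2],
                     stringsAsFactors = FALSE)
pca_df <- merge(pca_df, toy_classes, by = "sample")

# Proportion of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
```

Plotting PC1 against PC2 and colouring points by class gives a static version of the same view.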
If true pathway class labels are available for the classified dataset, users can obtain evaluation metrics for the classification, such as accuracy, sensitivity and specificity, using the calculate_accuracy function with the following arguments:
- true_labels: a data frame containing sample names as the first column and true pathway class labels ("positive", "negative", "uncertain") in a column named after the pathway of interest
- predicted_labels: a data frame containing the same sample names in the first column, named "sample", and predicted pathway class labels ("Active", "Inactive" or "Uncertain") as the second column, named "class"
- pathway: pathway of interest (Note: this pathway name must be identical to the name of the column containing the true pathway activity labels).
For example:
confusion_matrix <- calculate_accuracy("Sample_labels.txt", classes_df,
pathway = "ER")
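The metrics themselves can be illustrated with a small base R sketch on hypothetical labels. Excluding samples that are uncertain in either the true or the predicted labels is an assumption made here; calculate_accuracy may treat them differently.

```r
# Hypothetical true and predicted labels in the layouts described.
true_labels <- data.frame(
  sample = paste0("s", 1:8),
  ER = c("positive", "positive", "positive", "negative",
         "negative", "negative", "positive", "uncertain"),
  stringsAsFactors = FALSE
)
predicted_labels <- data.frame(
  sample = paste0("s", 1:8),
  class = c("Active", "Active", "Inactive", "Inactive",
            "Inactive", "Active", "Uncertain", "Active"),
  stringsAsFactors = FALSE
)

merged <- merge(true_labels, predicted_labels, by = "sample")

# Assumption: drop samples that are uncertain on either side.
merged <- merged[merged$ER != "uncertain" & merged$class != "Uncertain", ]

lv <- c("positive", "negative")
actual    <- factor(merged$ER, levels = lv)
predicted <- factor(ifelse(merged$class == "Active", "positive", "negative"),
                    levels = lv)

# Confusion matrix: rows are predictions, columns are true labels.
cm <- table(Predicted = predicted, Actual = actual)
tp <- cm["positive", "positive"]; fn <- cm["negative", "positive"]
tn <- cm["negative", "negative"]; fp <- cm["positive", "negative"]

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
```

Here "Active" is mapped to "positive" and "Inactive" to "negative" so that predicted and true labels share one vocabulary before the confusion matrix is built.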
For further examples of using PathAnalyser in pathway-based classification analysis, please refer to the demo script (under /demo folder) and use the provided supplementary data, or read the vignette.
To access help pages for any of the functions or built-in data provided by PathAnalyser, prefix the name of the function or data set with a question mark, e.g. to get additional information on the read_signature function, type the following in R:
?read_signature
For any questions, feature requests, bug reports or issues regarding the latest version of PathAnalyser, please use the "issues" tab located at the top-left of the GitHub repository page.
Provided the PathAnalyser GitHub repository is public, the vignette can be read online here: http://ozlemkaradeniz.github.io/PathAnalyser/
Alternatively, the vignette can be accessed within R, by typing the following (after PathAnalyser has been successfully installed):
browseVignettes(package = "PathAnalyser")