Merge pull request #2 from EBI-Metabolights/MTBLS-175-0.2-documentation

Mtbls 175 0.2 documentation
EBI-Metabolights · Sep 29, 2022 · a1f6d93 · a1f6d93
2 parents efd7452 + fad4dd6
commit a1f6d93
Show file tree

Hide file tree

Showing 24 changed files with 192 additions and 92 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,7 +1,11 @@
 Package: ImperialNMRTool
 Type: Package
 Title: Tool for Nuclear Magnetic Resonance based spectrography.
-Version: 0.0.1.0
+Description: Workflow for automatic processing and annotation of 1D 1H-NMR spectra from MetaboLights repository.
+    Entrypoint to the workflow is hurricane.r, and it takes an argument which is the location 
+    of a parameters file, which itself contains the location of the files the workflow needs,
+    and the workflow parameters. To read more visit the github repository at github.com/EBI-Metabolights/icl_nmr_R
+Version: 0.2.0.0
 Date: 2022-09-12
 Authors@R: c(
     person("Michael T. Judge", "PostDoc / Developer", role = c("aut")),
@@ -11,5 +15,4 @@ Maintainer: Callum Martin <cmartin@ebi.ac.uk>
 Depends: R (>= 4.0.0)
 Imports: optparse, BiocManager, stringr, MassSpecWavelet, yaml, pbapply, devtools, batch, flexclust, rlang, pracma, RcppHungarian, grid, lattice, modeltools, stats4
 License: GPL-2
-Description: Workflows for automatic processing and annotation of 1D 1H-NMR spectra from MetaboLights repository.
 RoxygenNote: 7.2.1
diff --git a/R/annotation_matching.R b/R/annotation_matching.R
@@ -1,4 +1,4 @@
-#' Annotation Module
+#' Annotation Matching Module
 #'
 #' Given picked peaks from statistical decomposition module, do a simple
 #' comparison between each query chemical shift list (STOCSY-derived)

diff --git a/R/detectSpecPeaks.R b/R/detectSpecPeaks.R
@@ -1,14 +1,21 @@
 #' Peak detection for spectra
 #'
-#' Divide the whole spectra into smaller segments and detect peaks by using MassSpecWavelet package.
-#' Note that, the peak lists could be found by using other methods, this function is just a choice.
+#' Divide the whole spectra into smaller segments and detect
+#' peaks by using MassSpecWavelet package.
+#' Note that, the peak lists could be found by using other
+#' methods, this function is just a choice.
 #'
 #'
-#' @param X The spectral dataset in matrix format in which each row contains a single sample
+#' @param X The spectral dataset in matrix format in which each row contains
+#' a single sample
 #' @param nDivRange The size of a single small segment after division of spectra
-#' @param scales The parameter of peakDetectionCWT function of MassSpecWavelet package, look it up in the original function.
-#' @param baselineThresh It will remove all peaks under an intensity set by baselineThresh.
-#' @param SNR.Th The parameter of peakDetectionCWT function of MassSpecWavelet package, look it up in the original function. If you set -1, the function will itself re-compute this value.
+#' @param scales The parameter of peakDetectionCWT function of MassSpecWavelet
+#' package, look it up in the original function.
+#' @param baselineThresh It will remove all peaks under an intensity set by
+#' baselineThresh.
+#' @param SNR.Th The parameter of peakDetectionCWT function of
+#' MassSpecWavelet package, look it up in the original function.
+#' If you set -1, the function will itself re-compute this value.
 #' @param verbose A boolean value to allow print out process information.
 #'
 #' @return The peak lists of the spectra

diff --git a/R/exportMatches.R b/R/exportMatches.R
@@ -1,4 +1,4 @@
-#' exportMatches
+#' Export Matches to filesystem.
 #'
 #' Export matches to file for visualization and storage
 #' Ideally, this function will allow:
@@ -8,21 +8,23 @@
 #' (including those with null matches):
 #'   - scores_matrix
 #'     - data frame containing 1 row for each non-zero scored match, and columns
-#'       with match info/metadata such as hmdb_id, ref spectrum id, (Jaccard) scores,
-#'      number of matches, number of possible matches in ref, compound name, instrument
-#'       properties, experiment properties, etc.
+#'       with match info/metadata such as hmdb_id, ref spectrum id, (Jaccard)
+#'      scores,
+#'      number of matches, number of possible matches in ref, compound name,
+#'       instrument, properties, experiment properties, etc.
 #'   - filteredResults
 #'     - list of lists, each list contains additional info for each row in
 #'       scores_matrix which could not fit into a data frame. Some info is
 #'       duplicated in the scores_matrix, but the important data here is:
 #'       - peaklists: reference and target peaklists (matched lists)
-#'       - matchlist: detailed match info for the matched peaks only (both ref and target):
+#'       - matchlist: detailed match info for the matched peaks only
+#'         (both ref and target):
 #'         - "ppms" of the peaks
 #'         - relative "intensities" of the reference peaks (if known)
 #'         - "inds" in the starting ref or target list, respectively
 #' Extract data and put in tall matrix (all matched peaks).
-#' Define Format: In order to plot the matches as two lists of scatter points and
-#' lines connecting the matched target and ref points, we need:
+#' Define Format: In order to plot the matches as two lists of scatter
+#' points and lines connecting the matched target and ref points, we need:
 #'   - all ref peak positions and intensities
 #'   - all target peaks and intensities
 #'   - the connection from each matched ref to its target
@@ -45,7 +47,6 @@ exportMatches <- function(matches,
   matchPairs <- NULL
   referenceList <- NULL
   targetList <- NULL
-  # length(matches)
 
   # Figure out which matches are empty and exclude them
   hasData <- unlist(lapply(

diff --git a/R/hurricane.R b/R/hurricane.R
@@ -7,15 +7,19 @@
 #' 4] visualisation [in progress in upstream repository]
 #' Also saves to RDS outputs from statistical decomposition
 #' A default params.yaml file will be used if none is supplied.
-#' The params file looks like this:
-#' 
+#' The outline of the params file can be found in the repo readme.md
+#'
 #' @param params_loc location of the params.yaml file containing
 #'  pipeline run parameters.
+#' @param params_obj A populated yaml object containing the parameters.
+#' This parameter is intended for use when being run from galaxy.
 #' @return N/A but .RDS and .TSV files are saved, see description.
 #' @export
-hurricane <- function(params_loc) {
+hurricane <- function(params_loc, params_obj) {
     default <- FALSE
-    if (missing(params_loc)) {
+    if (isFALSE(missing(params_obj))) {
+        run_params <- params_obj
+    } else if (missing(params_loc)) {
         # load default params
         filepath <- system.file(
             "extdata", "default_params.yaml", package="ImperialNMRTool")
@@ -25,9 +29,7 @@ hurricane <- function(params_loc) {
         # load supplied params
         run_params <- yaml::yaml.load_file(params_loc)
     }
-    # it may be that a user supplies their own params file but it
-    # is not complete. What do we do then? Fill in any missing fields
-    # with the corresponding values from default params?
+    # currently loading default files does not work.
     if (isTRUE(default)) {
         peaks_path <- system.file(
             "extdata", "peaks.RDS", package="ImperialNMRTool")

diff --git a/R/ind2subR.R b/R/ind2subR.R
@@ -6,7 +6,8 @@
 #' https://stackoverflow.com/questions/4452039/converting-between-matrix-subscripts-and-linear-indices-like-ind2sub-sub2ind-in
 #'
 #' MTJ 28FEB2022
-#' @param inds The linear indices to be converted IE linear index (1 = [1,1], 2 = [2,1], 3 = [3,1], 4 = [1,2], 5 = [2,2])
+#' @param inds The linear indices to be converted IE linear index
+#'  (1 = [1,1], 2 = [2,1], 3 = [3,1], 4 = [1,2], 5 = [2,2])
 #' @param m The number of rows IE m = nrow(matrix)
 #' @return List of row and column indices.
 #' @export

diff --git a/R/matchCluster2.R b/R/matchCluster2.R
@@ -1,16 +1,28 @@
 
-#' matchCluster2
-#' 
-#' description
+#' Match Cluster 2
+#'
+#' provide an alternative matching method for target and ref clusters.
+#' Calculate all pairwise distances between all target cluster peaks
+#' and reference cluster peaks and filter for tolerance.
+#' Match each peak with its closest ref peak:
+#'   A) start with lowest overall distance, assign target[i] to ref[r]
+#'     peak, record ref[r] as unavailable (remove its matches in the
+#'     distance matrix)
+#'     find next lowest distance match, and repeat until no more matches
+#'   B) find the best set of matches which satisfy the following:
+#'     ppm ranks are preserved (no criss-crossing of match lines)
+#'     ref peaks can be assigned to at most one target peak
+#'     best match set determined by sum of distances
+#'
 #' @param t target peaks.
 #' @param r reference peaks.
 #' @param tol tolerance in ppm.
 #' @param method matching methodology as a string, defaults to 'itmin'.
 #' Can also be substructure based 'substruct'. optimal 'opt' not implemented.
-#' @return List of matches.
+#' @return List of match objects.
 #' @export
 matchCluster2 <- function(t, r, tol = 0.02, method = "itmin") {
-      # Initialize reporting vectors
+  # Initialize reporting vectors
 
   matched_targets <- NA
   length(matched_targets) <- length(t)
@@ -164,4 +176,4 @@ matchCluster2 <- function(t, r, tol = 0.02, method = "itmin") {
     distances = na.omit(distances),
     method = method
   ))
-}
+}
diff --git a/R/matchPeaksToRef.R b/R/matchPeaksToRef.R
@@ -1,4 +1,4 @@
-#' matchPeaksToRef
+#' Match Peaks to Reference Spectra
 #'
 #' Function to match target and reference spectral peaks by chemical shift
 #' Goncalo Graca, 16 February 2021, g.gomes-da-graca-at-imperial.ac.uk
@@ -15,7 +15,7 @@
 #' @param Itol tbc
 #' @param matchMethod matching methodology as a string, defaults to 'basic'
 #' @param intensity flag to indicate whether to factor in intensity to matching.
-#' @return list of resulting matches.
+#' @return list of resulting matches and metadata objects.
 #' @export
 matchPeaksToRef <- function(
   target, driver_ppm, reference, tol = 0.02, Itol = 20,

diff --git a/R/matchPeaksToRef2.R b/R/matchPeaksToRef2.R
@@ -1,4 +1,4 @@
-#' matchPeaksToRef2
+#' Match Peaks to reference Spectra 2.
 #'
 #' Function to match target and reference spectral peaks by chemical shift
 #' Goncalo Graca, 16 February 2021, g.gomes-da-graca-at-imperial.ac.uk
@@ -20,8 +20,6 @@
 matchPeaksToRef2 <- function(target, driver_ppm, reference, tol = 0.02,
                              Itol = 20, matchMethod = "basic", intensity = FALSE) {
 
-  # Clean up params
-
   # To keep same argument form as before, if intensity arg is used,
   # switch matchMethod
   if (intensity == TRUE) {
@@ -31,9 +29,7 @@ matchPeaksToRef2 <- function(target, driver_ppm, reference, tol = 0.02,
   # extract relevant metadata from reference
   hmdb_name <- reference$metadata$hmdb_name
   hmdb_id <- reference$metadata$hmdb_id
-  # <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
-  # print(hmdb_name)
-  # <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
+
   spectrum_id <- reference$metadata$spectrum_id
   reference <- reference$spectrum_peaks # NOTE: this overwrites reference
 
@@ -56,8 +52,6 @@ matchPeaksToRef2 <- function(target, driver_ppm, reference, tol = 0.02,
     r <- reference[, 1]
   }
 
-  #########################################################
-
   # initialize matches and matched_targets reference and target peaks vectors
   matches <- NULL
   matches <- 0
@@ -71,7 +65,6 @@ matchPeaksToRef2 <- function(target, driver_ppm, reference, tol = 0.02,
 
   matchList <- list()
 
-  #########################################################
   ## There are several matching methods that can be used:
 
   # Match using chemical shift and intensity
@@ -103,7 +96,6 @@ matchPeaksToRef2 <- function(target, driver_ppm, reference, tol = 0.02,
     }
   }
 
-  #########################################################
   # Match using chemical shift only (2021 algorithm)
   if (matchMethod == "basic") {
     # browser()
@@ -162,11 +154,6 @@ matchPeaksToRef2 <- function(target, driver_ppm, reference, tol = 0.02,
           method = matchMethod
         )
 
-        # debug when match is obtained
-        #>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
-        # if (!is.null(matchList)){browser()}
-        #<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
-
         matched_targets <- matchList$matchedTargets
         target_inds <- matchList$matchedTargetInds
         matched_refs <- matchList$matchedRefs
@@ -180,13 +167,10 @@ matchPeaksToRef2 <- function(target, driver_ppm, reference, tol = 0.02,
         }
       }
   }
-
-  ########################################################
   # Gather up scores
 
   score <- matches / (reference_peaks + target_peaks - matches)
 
-  ########################################################
   # Produce result struct(s)
 
   # Troubleshooting

diff --git a/R/stat_decomp.R b/R/stat_decomp.R
@@ -1,11 +1,26 @@
 #' Statistical Decomposition for spectra.
 #'
-#' Description
-#' @param peaks rds peaks file loaded into memory.
-#' @param spec rds spec file loaded into memory.
-#' @param params yaml object containg parameters for the pipeline.
+#' Given a spectral matrix and a list of picked peaks, loop
+#' through each picked peak and perform a STOCSY using it as
+#' a driver peak. Threshold the result at r >= 0.8, giving
+#' pseudospectra (target list). Pick peak these using standard settings,
+#' and store the peak clusters in ppeaks.
 #'
-#' @return list of stuff
+#' @param peaks picked peaks for representative spectrum, numeric vector
+#'  containing ppm indices of peaks
+#' @param spec spectral matrix consisting of n columns (one for each ppm
+#'  value). row  1 (named 'ppm') is the vector of ppm values. rows 2:m+1
+#' (names undefined) are the m sample measurements for each ppm value.
+#' @param params yaml object containg parameters for the pipeline.
+#' @return list of:
+#'   target.RDS: list with one element for each cluster, containing:
+#'       [1] ppm vector
+#'       [2] thresholded intensities for each ppm value
+#'   ppeaks.RDS: list with picked peaks from each cluster.
+#'   One element per cluster, containing:
+#'       numeric vector
+#'         col 1: (named "chemical-shift") ppm values
+#'         col 2: (named "intensity") intensities (covariance from STOCSY)
 #' @export
 stat_decomp <- function(peaks, spec, params) {
 

diff --git a/R/stocsydec.R b/R/stocsydec.R
@@ -1,6 +1,7 @@
-#' stocsydec
+#' Stocsy Decomposition
 #'
-#' Function to run 'STOCSY decomposition' Goncalo Graca, 15 September 2020, g.gomes-da-graca/at/imperial.ac.uk
+#' Function to run 'STOCSY decomposition' Goncalo Graca, 15 September
+#'  2020, g.gomes-da-graca/at/imperial.ac.uk
 #' @param spectra individual spectra file to run through decomp
 #' @param idx index number of spextra files to be run through decomp
 #' @param cthr cutoff threshold

diff --git a/R/sub2indR.R b/R/sub2indR.R
@@ -11,7 +11,5 @@
 #' @return linear indices
 #' @export
 sub2indR <- function(rows, cols, m) {
-
-  return((cols-1) * m + rows)
-
-}
+  return((cols - 1) * m + rows)
+}
diff --git a/README.md b/README.md
@@ -2,6 +2,7 @@
 
 A package 'wrapper' for the tools and functions written by michael judge and goncalo graca, as part of efforts to make the overall NMR pipeline functional in galaxy.
 
+
 ## Installation
 You can install the package using the R CLI straight from this repository.
 
@@ -13,4 +14,38 @@ You can install the package using the R CLI straight from this repository.
 
 4 - run `library(ImperialNMRTool)` to load the package into the session.
 
-To quickly verify if the package is installed you can run `?indsubr` . This will bring up the manual page for this function, and if loaded means that the package itself was installed and loaded successfully. The manual pages for each function contain the information you will need to run them. More detailed instruction may be added to either this repository or the manual pages at a later date.
+To quickly verify if the package is installed you can run `??hurricane` . This will bring up the manual page for this function, and if loaded means that the package itself was installed and loaded successfully. The manual pages for each function contain the information you will need to run them. 
+
+## Running the workflow
+
+If you are runnning this in a normal (non cluster / workflow manager) environment, it is easiest to do so in RStudio. If you have followed the installation steps above, the package is primed to be run. The entrypoint to the package, `ImperialNMRTool::hurricane()` takes one parameter, a params.yaml file. This file needs to be replete with the parameters you want to run the workflow with. This package has some example data that you can make use of, found in `/inst/extdata`.
+
+### Parameters file
+The parameters .yaml file for the workflow takes the following structure:
+
+
+```
+general_pars:
+  peaks_location: peaks.RDS     # location of peaks.RDS file.
+  spec_location: spec.RDS       # location of spec.RDS file.
+  output_dir: ~/                # tells the workflow where to write the output files.
+
+sd_pars:
+ cutoff : 0.8                   # standard cutoff for STOCSY
+ 
+am_pars:
+ rank_limit: 5                  # rank limit for exporting matches IE take the top 5 matches discard the rest
+ dist_thresh: 0.02              # no sense in allowing < precision of db peaks (0.01 ppm for hmdb)
+ matchMethod: hungarian_scaled  # or basic, itmin, hungarian, hungarian_scaled
+ refdb_file: hmdb_spectra_28FEB2022.RDS # location of reference spectra .RDS file.
+ 
+vis_pars:
+ matlab_root:  /Applications/MATLAB_R2021b.app/bin # deprecated, will be removed in future release
+``` 
+ The four parameters that need to be configured to your local setup in order for the workflow to run are peaks_location, spec_location, output_dor and refdb_file.
+
+ Once this file is prepared, you can run the workflow from RStudio / the command line via `ImperialNMRTool::hurricane("/path/to/your/params.yaml")`
+
+
+#### Help / contact
+This repository is maintained by the [MetaboLights](https://www.ebi.ac.uk/metabolights/) team at the EMBL-EBI. You can send any requests or queries to metabolights-help@ebi.ac.uk , quoting this repository.
diff --git a/man/annotation_matching.Rd b/man/annotation_matching.Rd