Version 2.0.3 on CRAN (#160)

* Add Zenodo DOI Badge. (#118) [Closes #74]
  * Add Zenodo DOI Badge.
  * Fix link [Closes #74]
* speed up travis builds (#125)
  * removed the distribution and sudo entries from travis config - faster?
  * adding back sduo false and adding cache packages option
* small updates to contributing guide (#133)
  * add example for running styler move contributing to .github folder
  * ignore the .github path
* fix indexing order of operations error [fixes #119]. (#134)
* added functionality to change mtry and sparsity in Urerf (#120)
  * added functionality to change mtry and sparsity in Urerf
  * ran styler on modified files and removed white space.
  * added tests for new RandMat functions.
* Added the functionality to splitting based on BIC score using Mclust (#124)
  * added functionality to change mtry and sparsity in Urerf
  * Added functionality to split based on BIC score
  * Add LinearCombo arg to the Urerf fn
  * Add fast version of BIC
* fix some minor errors (#141) Ran through styler and fixed some roxygen import and documentation.
* fix issue #91 based on discussion in the comments. (#140)
  * fix issue #91 based on discussion in the comments. add some helper functions add test for new way of computing feature importance
  * remove need for library(Matrix) and update function parameteres. fix documentation typos [issue #91]
  * update test-FeatureImportance move `flipWeights` to helperFunctions
  * update Feature Importance to be more readable [@ben]. Merge RunFeature* into the same file. Update README with correct output names.
* check-as-cran warning will now cause TravisCI to fail. (#142)
* Print tree (#136)
  * added functionality to change mtry and sparsity in Urerf
  * ran styler on modified files and removed white space.
  * added tests for new RandMat functions.
  * added PrintTree function and modified NAMESPACE file to call PrintTree (I'm not sure this last step was necessary but it doesn't hurt.
  * Add documentation and adjust the formatting of the output.
* the double comparison now relies on machine epsilon. (#149)
  * the double comparison now relies on machine epsilon.
  * fix for test not passing
* move an assignment out of an if condition. (#151) Fixes issue #135
* Packed forest submodule (#152)
  * add packedForest submodule
  * update submodule to latest commit add readme for submodule operations
  * update submodule readme
  * update submodule
* update submodule (#154)
* update submodule (#155)
* Draft of v2.0.3 for CRAN (#156)
  * Draft of v2.0.3 for CRAN no warnings, errors, or notes on my Mac.
  * run README.Rmd
* update submodule (#159)
neurodata · Feb 6, 2019 · 9873b68
1 parent ae07faa
commit 9873b68
Showing 41 changed files with 1,201 additions and 116 deletions.
diff --git a/.#build.sh b/.#build.sh
@@ -6,7 +6,7 @@ Rscript -e "Rcpp::compileAttributes()"
 # Rscript -e "install.packages('devtools', repos = 'http://cran.us.r-project.org')"
 # Rscript -e "install.packages('roxygen2', repos = 'http://cran.us.r-project.org')"
 ## RUN styler on directory
-# Rscript -e "styler::style_dir(style = tidyverse_style)"
+# Rscript -e "styler:::style_pkg(style=tidyverse_style)"
 Rscript -e "devtools::document('R')"
 
 R CMD build --resave-data .

diff --git a/.#build_win.ps1 b/.#build_win.ps1
@@ -5,7 +5,7 @@ Rscript.exe -e "Rcpp::compileAttributes()"
 # Rscript.exe -e "install.packages('devtools', repos = 'http://cran.us.r-project.org')"
 # Rscript.exe -e "install.packages('roxygen2', repos = 'http://cran.us.r-project.org')"
 ## RUN styler on directory
-# Rscript.exe -e "styler::style_dir(style = tidyverse_style)"
+# Rscript.exe -e "styler:::style_pkg(style=tidyverse_style)"
 Rscript.exe -e "devtools::document('R')"
 
 R.exe CMD build --resave-data .

diff --git a/.Rbuildignore b/.Rbuildignore
@@ -10,4 +10,6 @@
 ^.*\.so$
 ^.*\.Rproj$
 ^\.Rproj\.user$
-^CONTRIBUTING.md$
+^\.github$
+^src/packedForest$
+^src/submodule_readme.md$
diff --git a/CONTRIBUTING.md → .github/CONTRIBUTING.md b/CONTRIBUTING.md → .github/CONTRIBUTING.md
@@ -18,6 +18,13 @@ You are here to help on R-RerF?  First off, thank you!  Please read the followin
 ### Formatting
 
 * Run your code through [styler](http://styler.r-lib.org/) auto-formater
+
+    ```R
+    install.packages("styler")
+    library(styler)
+    styler:::style_pkg(style=tidyverse_style)
+    ```
+
 * Avoid modifying formatting outside the scope of your pull request
 * Use **TRUE** and **FALSE**, not **T** and **F**
 * Check for unnecessary whitespace with `git diff --check` before committing
@@ -28,7 +35,6 @@ We use the [testthat](https://github.com/r-lib/testthat) library for testing in
 
 * New features need tests
 * Tests should be fast, ideally each test should complete in under 5 seconds
-* Mark longer running tests with 
 * Bug fixes need [testthat](https://github.com/r-lib/testthat) functions (test the condition that was failing)
 
 ### Make your Pull Request

diff --git a/.gitmodules b/.gitmodules
@@ -0,0 +1,4 @@
+[submodule "src/packedForest"]
+	path = src/packedForest
+	url = https://github.com/neurodata/packedForest
+	branch = master
diff --git a/.travis.yml b/.travis.yml
@@ -1,11 +1,14 @@
 language: r
-dist: trusty
+sudo: false
+cache: packages
+
+env:
+  global:
+    - WARNINGS_ARE_ERRORS=1
 
 r:
   - release
 
-sudo: false
-
 addons:
   apt:
     packages:

diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,8 +1,8 @@
 Package: rerf
 Type: Package
 Title: Randomer Forest
-Version: 2.0.2
-Date: 2018-12-03
+Version: 2.0.3
+Date: 2019-02-06
 Authors@R: c(
         person("Jesse", "Patsolic", role = c("ctb", "cre"), email = "software@neurodata.io"),
         person("Benjamin", "Falk", role = "ctb", email = "falk.ben@jhu.edu"),
@@ -26,11 +26,11 @@ Description: R-RerF (aka Randomer Forest (RerF) or Random Projection
   algorithms is where the random linear combinations occur: Forest-RC
   combines features at the per tree level whereas RerF takes linear
   combinations of coordinates at every node in the tree.
-Depends: R (>= 3.3.0)
+Depends: R (>= 3.3.0), Rcpp (>= 1.0.0)
 License: Apache License 2.0 | file LICENSE
 URL: https://github.com/neurodata/R-RerF
 BugReports: https://github.com/neurodata/R-RerF/issues
-Imports: parallel, RcppZiggurat, utils, stats, dummies
+Imports: parallel, RcppZiggurat, utils, stats, dummies, mclust
 Suggests: roxygen2 (>= 5.0.0), testthat
 LinkingTo: Rcpp, RcppArmadillo
 SystemRequirements: GNU make

diff --git a/NAMESPACE b/NAMESPACE
@@ -5,6 +5,7 @@ export(FeatureImportance)
 export(OOBPredict)
 export(PackPredict)
 export(Predict)
+export(PrintTree)
 export(RandMatBinary)
 export(RandMatContinuous)
 export(RandMatCustom)
@@ -18,8 +19,11 @@ export(RandMatTSpatch)
 export(RerF)
 export(StrCorr)
 export(Urerf)
+import(Rcpp)
 importFrom(RcppZiggurat,zrnorm)
 importFrom(dummies,dummy)
+importFrom(mclust,Mclust)
+importFrom(mclust,mclustBIC)
 importFrom(parallel,clusterEvalQ)
 importFrom(parallel,clusterExport)
 importFrom(parallel,clusterSetRNGStream)
@@ -39,5 +43,6 @@ importFrom(stats,sd)
 importFrom(utils,combn)
 importFrom(utils,flush.console)
 importFrom(utils,object.size)
+importFrom(utils,tail)
 importFrom(utils,write.table)
 useDynLib(rerf)
diff --git a/NEWS.md b/NEWS.md
@@ -1,12 +1,30 @@
-Changes in 2.0.2:
+## Changes in 2.0.3:
+
+* The `PrintTree` function has been added to aid in viewing the
+  cut-points, features, and other statistics in a particular tree of a
+  forest.
+
+* Urerf now supports using the Bayesian information criterion (BIC) from
+  the `mclust` package for determining the best split.
+
+* Feature importance calculations now correctly handle features whose
+  weight vectors parametrize the same line.  Also, when the projection
+  weights are continuous we tabulate how many times a unique combination
+  of features was used, ignoring the weights.
+
+* An issue where the `split.cpp` function split the data `A` into `{A, {}}`
+  has been resolved by computing equivalence within some factor of
+  machine precision instead of exactly.
+
+## Changes in 2.0.2:
 
 * The option `rho` in the RerF function has been re-named to `sparsity`
   to match with the algorithm explanation.
 
 * The default parameters sent to the RandMat\* functions now properly
   account for categorical columns.
 
-* The defualts have changed for the following parameters:
+* The defaults have changed for the following parameters:
   * `min.parent = 1`
   * `max.depth  = 0`
   * `stratify   = TRUE`
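
The 2.0.3 NEWS entry above mentions comparing values "within some factor of machine precision instead of exactly". The actual change lives in the C++ split code, which is not shown in this part of the diff. As a rough illustration of the idea only, here is a minimal R sketch; the helper `nearly_equal` and its default `factor = 8` are assumptions made for this example and are not taken from rerf.

```R
# Hypothetical helper for illustration; not part of rerf.
# Treats a and b as equal when they differ by at most a small
# multiple of machine epsilon, scaled by their magnitude.
nearly_equal <- function(a, b, factor = 8) {
  abs(a - b) <= factor * .Machine$double.eps * max(1, abs(a), abs(b))
}

a <- 0.1 + 0.2
b <- 0.3

exact    <- a == b              # exact comparison of doubles
tolerant <- nearly_equal(a, b)  # comparison with a tolerance

print(exact)
print(tolerant)
```

With an exact comparison, two values that should be the same cut-point can be judged different because of rounding, which is the kind of situation a tolerance-based check is meant to avoid.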

diff --git a/R/BuildTree.R b/R/BuildTree.R
@@ -298,19 +298,6 @@ BuildTree <- function(X, Y, FUN, paramList, min.parent, max.depth, bagging, repl
     # them accordingly
     MoveLeft <- Xnode[1L:NdSize] <= ret$BestSplit
 
-    # Move samples left or right based on split
-    if (sum(MoveLeft) == 0 || sum(!MoveLeft) == 0) {
-      treeMap[CurrentNode] <- currLN <- currLN - 1L
-      ClassProb[currLN * -1, ] <- ClProb
-      NodeStack <- NodeStack[-1L] # pop node off stack
-      Assigned2Node[[CurrentNode]] <- NA # remove saved indexes
-      CurrentNode <- NodeStack[1L] # point to top of stack
-      if (is.na(CurrentNode)) {
-        break
-      }
-      next
-    }
-
     Assigned2Node[[NextUnusedNode]] <- Assigned2Node[[CurrentNode]][MoveLeft]
     Assigned2Node[[NextUnusedNode + 1L]] <- Assigned2Node[[CurrentNode]][!MoveLeft]
 
@@ -354,7 +341,7 @@ BuildTree <- function(X, Y, FUN, paramList, min.parent, max.depth, bagging, repl
   currLN <- currLN * -1L
   # create tree structure and populate with mandatory elements
   tree <- list(
-    "treeMap" = treeMap[1L:NextUnusedNode - 1L], "CutPoint" = CutPoint[1L:currIN], "ClassProb" = ClassProb[1L:currLN, , drop = FALSE],
+    "treeMap" = treeMap[1L:(NextUnusedNode - 1L)], "CutPoint" = CutPoint[1L:currIN], "ClassProb" = ClassProb[1L:currLN, , drop = FALSE],
     "matAstore" = matAstore[1L:matAindex[currIN + 1L]], "matAindex" = matAindex[1L:(currIN + 1L)], "ind" = NULL, "rotmat" = NULL,
     "rotdims" = NULL, "delta.impurity" = NULL
   )
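
The `treeMap` change in this hunk comes down to R operator precedence: `:` binds more tightly than binary `-`, so `1L:n - 1L` is parsed as `(1L:n) - 1L`. A small standalone sketch of the difference, using a made-up vector `x` and length `n` rather than the package's data:

```R
n <- 4L
x <- c(10L, 20L, 30L, 40L, 50L)  # stand-in data, not a real treeMap

idx_old <- 1L:n - 1L     # parsed as (1L:n) - 1L, i.e. 0 1 2 3
idx_new <- 1L:(n - 1L)   # the intended range, i.e. 1 2 3

print(idx_old)
print(idx_new)

# R silently drops a zero index, so both select the same elements here.
print(x[idx_old])
print(x[idx_new])
```

Because the zero index is dropped without a warning, the unparenthesized form can go unnoticed for `n > 1`; the parenthesized form states the intended range explicitly.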

diff --git a/R/FeatureImportance.R b/R/FeatureImportance.R
@@ -4,32 +4,115 @@
 #'
 #' @param forest a forest trained using the RerF function with argument store.impurity = TRUE
 #' @param num.cores number of cores to use. If num.cores = 0, then 1 less than the number of cores reported by the OS are used. (num.cores = 0)
+#' @param type character string specifying which method to use in
+#' calculating feature importance.
+#' \describe{
+#'   \item{'C'}{specifies that unique combinations of features
+#'   should be *c*ounted across trees.}
+#'   \item{'R'}{feature importance will be calculated as in *R*andomForest.}
+#'   \item{'E'}{calculates the unique projections up to *e*quivalence if
+#'   the vector of projection weights parametrizes the same line in
+#'   \eqn{R^p}.}
+#' }
 #'
-#' @return feature.imp
+#' @return a list with 3 elements, 
+#' \describe{
+#'   \item{\code{imp}}{The vector of scores/counts, corresponding to each feature.}
+#'   \item{\code{features}}{The features/projections used.}
+#'   \item{\code{type}}{The code for the method used.}
+#'   }
 #'
 #' @examples
 #' library(rerf)
+#' num.cores <- 1L
 #' forest <- RerF(as.matrix(iris[, 1:4]), iris[[5L]], num.cores = 1L, store.impurity = TRUE)
-#' feature.imp <- FeatureImportance(forest, num.cores = 1L)
+#'
+#' imp.C <- FeatureImportance(forest, num.cores, "C")
+#' imp.R <- FeatureImportance(forest, num.cores, "R")
+#' imp.E <- FeatureImportance(forest, num.cores, "E")
+#'
+#' fRF <- RerF(as.matrix(iris[, 1:4]), iris[[5L]],
+#'             FUN = RandMatRF, num.cores = 1L, store.impurity = TRUE)
+#'
+#' fRF.imp <- FeatureImportance(forest = fRF, num.cores = num.cores)
+#'
 #' @export
 #' @importFrom parallel detectCores makeCluster clusterExport parSapply stopCluster
 #' @importFrom utils object.size
 
-FeatureImportance <- function(forest, num.cores = 0L) {
+FeatureImportance <- function(forest, num.cores = 0L, type = NULL) {
+
+  ## choose method to use for calculating feature importance
+  if(is.null(type)){
+    if(identical(forest$params$fun, rerf::RandMatRF)){
+      type <- "R"
+    } else if (identical(forest$params$fun, rerf::RandMatBinary)) {
+      type <- "E"
+    } else {
+      type <- "C"
+    }
+  }
+
   num.trees <- length(forest$trees)
   num.splits <- sapply(forest$trees, function(tree) length(tree$CutPoint))
 
-  unique.projections <- vector("list", sum(num.splits))
+  forest.projections <- vector("list")
 
-  idx.start <- 1L
+  ## Iterate over trees in the forest to save all projections used
   for (t in 1:num.trees) {
-    idx.end <- idx.start + num.splits[t] - 1L
-    unique.projections[idx.start:idx.end] <- lapply(1:num.splits[t], function(nd) forest$trees[[t]]$matAstore[(forest$trees[[t]]$matAindex[nd] + 1L):forest$trees[[t]]$matAindex[nd + 1L]])
-    idx.start <- idx.end + 1L
+    tree.projections <- 
+      lapply(1:num.splits[t], function(nd) {
+             forest$trees[[t]]$matAstore[(forest$trees[[t]]$matAindex[nd] + 1L):forest$trees[[t]]$matAindex[nd + 1L]]
+      })
+
+    forest.projections <- c(forest.projections, tree.projections)
   }
-  unique.projections <- unique(unique.projections)
 
-  CompImportanceCaller <- function(tree, ...) RunFeatureImportance(tree = tree, unique.projections = unique.projections)
+  ## Calculate the unique projections used according to the distribution
+  ## of weights
+  if (identical(type, "C")) {
+    message("Message: Computing feature importance as counts of unique feature combinations.\n")
+    ## compute the unique combinations of features used in the
+    ## projections 
+    unique.projections <- unique(lapply(forest.projections, getFeatures))
+
+    CompImportanceCaller <- function(tree, ...) {
+      RunFeatureImportanceCounts(tree = tree, unique.projections = unique.projections)
+    }
+    varlist <- c("unique.projections", "RunFeatureImportanceCounts")
+  } 
+
+  if (identical(type, "R")) {
+    message("Message: Computing feature importance for RandMatRF.\n")
+    ## Compute the unique projections without the need to account for
+    ## 180-degree rotations.
+    unique.projections <- unique(forest.projections)
+
+    CompImportanceCaller <- function(tree, ...) {
+      RunFeatureImportance(tree = tree, unique.projections = unique.projections)
+    }
+    varlist <- c("unique.projections", "RunFeatureImportance")
+  }
+
+  if (identical(type, "E")) {
+    message("Message: Computing feature importance for RandMatBinary.\n")
+    ## compute the unique projections properly accounting for
+    ## projections that differ by a 180-degree rotation.
+    unique.projections <- uniqueByEquivalenceClass(
+      forest$params$paramList$p,
+      unique(forest.projections)
+    )
+
+    CompImportanceCaller <- function(tree, ...) {
+      RunFeatureImportanceBinary(
+        tree = tree,
+        unique.projections = unique.projections
+      )
+    }
+    varlist <- c("unique.projections", "RunFeatureImportanceBinary")
+  }
+
+
 
   if (num.cores != 1L) {
     if (num.cores == 0L) {
@@ -41,7 +124,7 @@ FeatureImportance <- function(forest, num.cores = 0L) {
     if ((utils::object.size(forest) > 2e9) |
       .Platform$OS.type == "windows") {
       cl <- parallel::makeCluster(spec = num.cores, type = "PSOCK")
-      parallel::clusterExport(cl = cl, varlist = c("unique.projections", "RunFeatureImportance"), envir = environment())
+      parallel::clusterExport(cl = cl, varlist = varlist, envir = environment())
       feature.imp <- parallel::parSapply(cl = cl, forest$trees, FUN = CompImportanceCaller)
     } else {
       cl <- parallel::makeCluster(spec = num.cores, type = "FORK")
@@ -58,5 +141,6 @@ FeatureImportance <- function(forest, num.cores = 0L) {
   sort.idx <- order(feature.imp, decreasing = TRUE)
   feature.imp <- feature.imp[sort.idx]
   unique.projections <- unique.projections[sort.idx]
-  return(feature.imp <- list(imp = feature.imp, proj = unique.projections))
+
+  return(feature.imp <- list(imp = feature.imp, features = unique.projections, type = type))
 }
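
The `type = "E"` branch above relies on `uniqueByEquivalenceClass`, whose body is not part of this portion of the diff. To illustrate what "projections that differ by a 180-degree rotation" means, here is a minimal base-R sketch under stated assumptions: projections are represented as plain dense weight vectors, and the helper `canonical_sign` is hypothetical, written for this example only. It is not the package's implementation.

```R
# Hypothetical helper for illustration; not part of rerf.
# Flips the sign of a weight vector so that its first nonzero
# entry is positive, giving w and -w the same representative.
canonical_sign <- function(w) {
  nz <- w[w != 0]
  if (length(nz) > 0L && nz[1L] < 0) -w else w
}

# Made-up projections: the first two describe the same line.
projections <- list(
  c(1, -1, 1),
  c(-1, 1, -1),
  c(1, 1, 1)
)

unique_lines <- unique(lapply(projections, canonical_sign))
print(length(unique_lines))
```

Calling plain `unique()` on the raw list would keep all three vectors, whereas canonicalizing the sign first merges the pair that points along the same line.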