- Recode some parallel algorithms with OpenMP. For now, functions `big_prodVec()`, `big_cprodVec()`, `big_colstats()` and `big_univLinReg()` have been recoded.
- Now detects and errors if there is not enough disk space to create an FBM.
- Fix `pcor()` for singular systems, e.g. when `x` has all the same values.
- Fix `summary()` and `plot()` for old (< v1.3) `big_sp_list` models.
- Add function `pcor()` to compute partial correlations.
- Add two options in `big_spLinReg()` and `big_spLogReg()`: `power_scale` for using a different scaling for LASSO, and `power_adaptive` for using adaptive LASSO (where larger marginal effects are penalized less). See documentation for details.
- `big_(c)prodVec()` and `big_(c)prodMat()` (re)gain an `ncores` parameter. Note that for `big_(c)prodMat()`, it might be beneficial to use the BLAS parallelism (with `bigparallelr::set_blas_ncores()`) instead of this parameter, especially when the matrix `A` is large-ish.
- Function `big_colstats()` can now be run in parallel (added parameter `ncores`).
- It is now possible to use C++ FBM accessors without linking to {RcppArmadillo}.
- Functions `big_(c)prodMat()` and `big_(t)crossprodSelf()` now use much less memory, and may be faster.
- Add `covar_from_df()` to convert a data frame with factors/characters to a numeric matrix using one-hot encoding.
- Remove some 'Suggests' dependencies.
- Add a new column `$all_conv` to the output of `summary()` for `big_spLinReg()` and `big_spLogReg()` to check whether all models have stopped because of "no more improvement". Also add a new parameter `sort` to `summary()`.
- Now `warn` (enabled by default) if some models may not have reached a minimum when using `big_spLinReg()` and `big_spLogReg()`.
- Fix `In .self$nrow * .self$ncol : NAs produced by integer overflow`.
- Make two different memory-mappings: one that is read-only (using `$address`) and one where it is possible to write (using `$address_rw`). This makes it possible to use file permissions to prevent modifying data.
- Also add a new field `$is_read_only` to be used to prevent modifying data (at least with `<-`) even when you have write permissions to it. Functions creating an FBM now gain a parameter `is_read_only`.
- Make vector accessors (e.g. `X[1:10]`) faster.
- Move some code to new packages {bigassertr} and {bigparallelr}.
- `big_randomSVD()` gains arguments related to matrix-vector multiplication.
- `assert_noNA()` is faster.
- Add `big_increment()`.
- In `plot.big_SVD()`,
  - Can now plot many PCA scores (more than two) at once.
  - Use `coord_fixed()` when plotting PCA scores because it is good practice.
  - Use log-scale in scree plot to better see small differences in singular values.
- Reexport `cowplot::plot_grid()` to merge multiple ggplots.
`AUCBoot()` is now 6-7 times faster.
- Add parameters `center` and `scale` to products.
- Fix a bug in `big_univLogReg()` for variables with no variation. IRLS was not converging, so `glm()` was used instead. The problem is that `glm()` drops dimensions that cause singularities, so the Z-score of the first covariate (or intercept) was used instead of a missing value.
- Use mio instead of boost for memory-mapping.
- Add a parameter `base.row` to `predict.big_sp_list()` and automatically detect if needed (as well as for `covar.row`).
- Possibility to subset a `big_sp_list` without losing attributes, so that one can access one model (corresponding to one alpha) even if it is not the 'best'.
- Add parameters `pf.X` and `pf.covar` in `big_sp***Reg()` to provide different penalization for each variable (possibly no penalization at all).
Add `%*%`, `crossprod` and `tcrossprod` operations for 'double' FBMs.

Now also returns the number of non-zero variables (`$nb_active`) and the number of candidate variables (`$nb_candidate`) for each step of the regularization paths of `big_spLinReg()` and `big_spLogReg()`.
- Parameters `warn` and `return.all` of `big_spLinReg()` and `big_spLogReg()` are deprecated; now always return the maximum information. Now provide two methods (`summary` and `plot`) to get a quick assessment of the fitted models.
- Check of missing values for input vectors (indices and targets) and matrices (covariables).
- `AUC()` is now stricter: it accepts only 0s and 1s for `target`.
`$bm()` and `$bm.desc()` have been added in order to get an `FBM` as a `filebacked.big.matrix`. This enables using {bigmemory} functions.
- Type `float` added.

`big_write` added.
- `big_read` now has a `filter` argument to filter rows, and argument `nrow` has been removed because it is now determined when reading the first block of data.
- Removed the `save` argument from `FBM` (and others); now, you must use `FBM(...)$save()` instead of `FBM(..., save = TRUE)`.
- You can now fill an FBM using a data frame. Note that factors will be used as integers.
- Package {bigreadr} has been developed and is now used by `big_read`.
- There have been some changes regarding how conversion between types is checked. Before, you would get a warning for any possible loss of precision (without actually checking it). Now, any loss of precision due to conversion between types is reported as a warning, and only in this case. If you want to disable this feature, you can use `options(bigstatsr.downcast.warning = FALSE)`, or you can use `without_downcast_warning()` to disable this warning for one call.
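
The behaviour described above can be sketched as follows. This is a minimal illustration, not taken from the package documentation: the matrix size and the values are arbitrary, and the helper variable `res` is only there to record whether a warning was raised.

```r
library(bigstatsr)

# A small integer FBM backed by a temporary file
X <- FBM(10, 10, type = "integer", init = 0L)

# Whole-number doubles convert to integers without loss: no warning expected
X[, 1] <- as.numeric(1:10)

# 1.5, 2.5, ... cannot be stored exactly as integers: a warning is expected
res <- tryCatch({
  X[, 2] <- 1:10 + 0.5
  "no warning"
}, warning = function(w) "warning")

# Silence the warning for a single call
without_downcast_warning(X[, 2] <- 1:10 + 0.5)

# Or disable it globally, restoring the previous setting afterwards
old <- options(bigstatsr.downcast.warning = FALSE)
X[, 3] <- 1:10 + 0.5
options(old)
```

Wrapping a single call in `without_downcast_warning()` keeps the check active everywhere else, which is usually preferable to changing the global option.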
- change `big_read` so that it is faster (corresponding vignette updated).
- possibility to add a "base predictor" for `big_spLinReg` and `big_spLogReg`.
- don't store the whole regularization path (as a sparse matrix) in `big_spLinReg` and `big_spLogReg` anymore because it caused major slowdowns.
- directly average the K predictions in `predict.big_sp_best_list`.
- only use the "PSOCK" type of cluster because "FORK" can leave zombies behind. You can change this with `options(bigstatsr.cluster.type = "PSOCK")`.
- Fix a bug in `big_spLinReg` related to the computation of summaries.
- Now provides function `plus` to be used as the `combine` argument in `big_apply` and `big_parallelize` instead of `'+'`.
- Before, this package used only the "PSOCK" type of cluster, which has some significant overhead. Now, it uses the "FORK" type on non-Windows systems. You can change this with `options(bigstatsr.cluster.type = "PSOCK")`. Uses "PSOCK" in 0.4.0.
- you can now provide multiple $\alpha$ values (as a numeric vector) in `big_spLinReg` and `big_spLogReg`. One will be chosen by grid-search.
- fixed a bug in `big_prodMat` when using a dimension of 1 or 0.
- Package {bigstatsr} is published in Bioinformatics
- no scaling is used by default for `big_crossprod`, `big_tcrossprod`, `big_SVD` and `big_randomSVD` (before, there was no default at all)
- Integrate Cross-Model Selection and Averaging (CMSA) directly in `big_spLinReg` and `big_spLogReg`, a procedure that automatically chooses the value of the $\lambda$ hyper-parameter.
- Speed up `big_spLinReg` and `big_spLogReg` (issue #12)
- Speed up AUC computations
- No longer use the `big.matrix` format of package bigmemory