Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introductory, model family, benchmarks, and technical specification vignettes #16

Merged
merged 65 commits into from
Aug 13, 2018

Conversation

jolars
Copy link
Owner

@jolars jolars commented Jul 24, 2018

This pull request adds three vignettes to the package and updates one:

  • A brief introductory vignette featuring a short example of fitting, plotting and predicting.
  • The benchmarks vignette has been updated with results based on running simulations on an amazon ec2 server. It now features benchmarks from all of the current families (gaussian, binomial and multinomial). Note that the results from the gaussian model are nonsensical at the moment. I believe that this is because the package isn't at all optimized for dense problems. These benchmarks will be updated before release.
  • A technical documentation featuring pseudocode for the algorithm and descriptions of the various preprocessing steps.
  • A vignette for the various model families, featuring an example for each one.

These are still a work in progress so I don't expect to merge this one right now, but I'd love for you to pitch in if you have any feedback.

Edit (18/13/8)

I have now also generated a pkgdown-site for the project, which I intend to link to as the final work product of gsoc.

@@ -13,7 +13,7 @@ test_that("test that all combinations run without errors", {
family = c("gaussian", "binomial", "multinomial"),
intercept = TRUE,
sparse = FALSE,
alpha = c(0, 0.5, 1),
alpha = c(0, 0.75, 1),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I test with alpha at 0.5 elsewhere, so I just figured it might not hurt to change it up.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough.

@schmidt2017. SAGA handles both strongly and
non-strongly convex objectives -- even in the composite case -- making it
applicable to a wide range of problems such as generalized
linear regression with elastic net regularization, which is the problem
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generalized linear models is a bit more common terminology

Cite Zou and Hastie (2005) for the elastic net.


Before the algorithm is set loose on data, its parameters have to be set up
properly and its data (possibly) preprocessed. For illustrative purposes,
we will proceed by example and use Gaussian univariate multiple regression
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd drop multiple here - it's easily confused with family = "mgaussian" and the fact you are using sparsity / regularization should make the fact we have more than one predictor implicitly clear.

```{r}
sgdnet_sd <- function(x) sqrt(sum((x - mean(x))^2)/length(x))
n <- nrow(x)
x_bar <- colSums(x)/n
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

colMeans(x)


We then construct the $\lambda$ path to be a $\log_e$-spaced sequence starting
at $\lambda_{\text{max}}$ and finishing at
$\frac{\lambda_{\text{max}}}{\lambda_{\text{min ratio}}}$. Thus we have
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lambda_max times lambda_min_ratio , not divided by


## Background and motivation

**sgdnet** was developed as a Google Summer of Code 2018. Its goal was to be
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can leave GSoC mentions until the end as a funding acknowledgement. I expect you'll keep maintaining and improving this past the end of GSoC-2018

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup!

where $\mathcal{L}(\beta_0,\beta; \mathbf{y}, \mathbf{X})$ is the
log-likelihood of the model $\lambda$ is the regularization strength, and
$\alpha$,is the elastic net mixing parameter, such that
$\alpha = 1$ results in the lasso and $\alpha = 0$ the ridge penalty.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Citations


$$
\text{Pr}(Y_i = c) =
\frac{e^{\beta_{0_c}+\beta_c^\intercal \mathbf{x}_i}}{\sum_{k = 1}^m{e^{\beta_{0_k}+\beta_k^\intercal \mathbf{x}_i}}},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe have K for the number of classes instead of m


which is the overspecified version of this model. As in **glmnet**
[@friedman2010],
which much of this packages functinality is modeled after, however,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

functionality

[@friedman2010],
which much of this packages functinality is modeled after, however,
we rely on the regularization of model to take care of
this [@hastie2015, pp. 36-7].
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarify this a bit more. (I believe you are saying that regularization implicitly induces a sum-to-zero constraint)

@michaelweylandt
Copy link
Collaborator

This is a good start - I only reviewed the vignettes since I think the other changes are already under review as #15. I particularly appreciated the technical write up - I finally (sort of) get the point of the lagged update structure.

@michaelweylandt
Copy link
Collaborator

Did you want another round of comments on this?

@jolars
Copy link
Owner Author

jolars commented Aug 8, 2018

Hm, thanks but let's wait a little. I noticed that I had missed seeing to a couple of your comments too.

README.Rmd Outdated

## Versioning

**eulerr** uses [semantic versioning](https://semver.org/).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong package name

README.Rmd Outdated
[ridge](https://en.wikipedia.org/wiki/Tikhonov_regularization)
and [lasso](https://en.wikipedia.org/wiki/Lasso_(statistics)) penalties.

**sgdnet** automatically fits the model across an automatically computed
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use "automatically" twice in one sentence.

inst/COPYRIGHTS Outdated
@@ -19,7 +19,8 @@ along with this program. If not, see <https://www.gnu.org/licenses/>.
# scikit-learn

Parts of this package are modified translations of source code from the
Python package scikit-learn, which is licensed under the New BSD License
Python package scikit-learn (including the contributed module lightning),
which is licensed under the New BSD License
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly add URLs here

volume = {2},
shorttitle = {{{SAGA}}},
abstract = {In this work we introduce a new optimisation method called SAGA in the spirit of SAG, SDCA, MISO and SVRG, a set of recently proposed incremental gradient algorithms with fast linear convergence rates. SAGA improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and has support for composite objectives where a proximal operator is used on the regulariser. Unlike SDCA, SAGA supports non-strongly convex problems directly, and is adaptive to any inherent strong convexity of the problem. We give experimental results showing the effectiveness of our method.},
journal = {Advances in Neural Information Processing Systems},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIPS isn't a journal. Does Zotero support the InProceedings type?

abstract = {We analyze the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method's iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values the SAG method achieves a faster convergence rate than black-box SG methods. The convergence rate is improved from O(1/k$\surd$)O(1/k)O(1/$\backslash$sqrt\{k\}) to O(1 / k) in general, and when the sum is strongly-convex the convergence rate is improved from the sub-linear O(1 / k) to a linear convergence rate of the form O($\rho$k)O($\rho$k)O($\backslash$rho \^k) for $\rho{}<$1$\rho{}<$1$\backslash$rho $<$ 1. Further, in many cases the convergence rate of the new method is also faster than black-box deterministic gradient methods, in terms of the number of gradient evaluations. This extends our earlier work Le Roux et al. (Adv Neural Inf Process Syst, 2012), which only lead to a faster rate for well-conditioned strongly-convex problems. Numerical experiments indicate that the new algorithm often dramatically outperforms existing SG and deterministic gradient methods, and that the performance may be further improved through the use of non-uniform sampling strategies.},
language = {en},
number = {1-2},
journal = {Math. Program.},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full journal name

penalty.

When the lasso penalty is in place, the regularization imposed on the
coefficients takes the shape of an octahedron.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not generally (only in $p=3$ dimensions)


The goal of **sgdnet** is to be
a one-stop solution for fitting elastic net-penalized generalized linear
models. This is of course not novel in and by itself. Several packages,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is too broad - I think it's unlikely we'll outperform glmnet in the p \gg n setting or the p \approx n \gg 1 setting. I'd focus our attention on the n \gg p setting.


By default, the feature matrix in **sgdnet** is centered and scaled to unit
variance. Here, however, we used the *biased* population standard deviation
and divide by $n$ rather than $n-1$.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd move this to a footnote and rephrase. We're using the sample standard deviation here since we are only concerned with fitting the model to the data at hand. Since we aren't concerned with a population, biasedness (or not) is irrelevant


The step size, $\gamma$, in SAGA is constant throughout the algorithm.
For non-strongly convex objectives it is set to $1/(3L)$, where $L$ is the
*Lipschitz* constant. Since we are picking a single sample at a time,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be precise here - $L$ is the maximum Lipschitz constant of the term of the log-likelihood corresponding to a single observation, not the overall Lipschitz constant of the problem

nz <- Nonzeros(X[j]) # nonzero indices of X[j]

# Perform just-in-time updates of coefficients X is sparse
X[j] <- LaggedUpdate(j,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For simplicity, I might focus on the dense (no lagged update) case first and then describe the lagged updates near the end.

@jolars jolars merged commit 7e9261a into master Aug 13, 2018
@jolars jolars deleted the algorithm-vignette branch August 13, 2018 19:13
@jolars
Copy link
Owner Author

jolars commented Aug 13, 2018

I'll merge this one in now and upload the pkgdown site.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants