Approximation of Lipschitz constant for use in step size #7
In the call to […], I think the SGD works by conceiving of the loss as […]. Put another way, […]
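Presumably the reasoning is the finite-sum view of the objective: the stochastic solver updates with per-sample gradients, so the relevant Lipschitz constant is that of each per-sample gradient rather than of the full gradient. A minimal sketch for plain least squares (notation mine, not taken from the thread):

$$
f(\beta) = \frac{1}{n}\sum_{i=1}^{n} f_i(\beta), \qquad
f_i(\beta) = \tfrac{1}{2}\left(x_i^\top \beta - y_i\right)^2,
$$

so $\nabla^2 f_i(\beta) = x_i x_i^\top$ and the Lipschitz constant of $\nabla f_i$ is $\lVert x_i \rVert^2$; taking the worst case over samples gives $L = \max_i \lVert x_i \rVert^2$, i.e. the maximum squared row norm.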
Ah, yes, that makes sense. Thanks! What do you think about using an adaptive step size, as the lightning package optionally does, @tdhock and @michaelweylandt?
Potentially interesting, and it might buy some speed, but I think there's probably more low-hanging fruit performance-wise elsewhere. The paper they reference describes the step-size search in a fully smooth setting (for a slightly different algorithm), so I'd want to check the math before implementing, but it seems like it would work.
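For reference, an adaptive step size in the smooth case usually amounts to some form of backtracking: start from a step-size guess and shrink it until a sufficient-decrease condition holds. A rough sketch of that idea in R (a generic Armijo-style check on the full least-squares objective; `backtrack_step` is a made-up helper here, not lightning's actual implementation):

```r
# Generic Armijo-style backtracking (illustrative sketch only):
# shrink the step until a sufficient-decrease condition holds.
backtrack_step <- function(beta, grad, f, step = 1, shrink = 0.5, max_iter = 50) {
  g2 <- sum(grad^2)
  for (i in seq_len(max_iter)) {
    if (f(beta - step * grad) <= f(beta) - 0.5 * step * g2) break
    step <- step * shrink
  }
  step
}

# Toy least-squares example
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
f <- function(b) sum((X %*% b - y)^2) / (2 * n)
beta0 <- rep(0, p)
grad0 <- as.numeric(crossprod(X, X %*% beta0 - y)) / n
backtrack_step(beta0, grad0, f)
```

A SAGA-type solver would presumably apply a similar check per sample, or to a running Lipschitz estimate, which is exactly the part of the math that would need verifying.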
Currently, we approximate the Lipschitz constant as scikit-learn does, using the maximum squared row norm of the design matrix, whereas the theoretically best choice, if I am not mistaken, is the largest eigenvalue of the Hessian. The reason for this, as I understand it, is that the eigendecomposition is computationally demanding, but it does seem suboptimal, particularly since we are using a constant step size.
Here is a short demonstration (in R) of the different methods for least squares:
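A minimal sketch of such a comparison, assuming a simulated standard-normal design and the usual 1/(2n) least-squares scaling (illustrative code, not the snippet originally posted):

```r
set.seed(1)
n <- 1000
p <- 50
X <- matrix(rnorm(n * p), n, p)

# Bound used in the scikit-learn-style approximation:
# the maximum squared row norm of X
L_rownorm <- max(rowSums(X^2))

# Lipschitz constant of the full least-squares gradient:
# largest eigenvalue of the Hessian X'X / n
L_eigen <- max(eigen(crossprod(X) / n, symmetric = TRUE, only.values = TRUE)$values)

c(max_row_norm = L_rownorm, max_eigenvalue = L_eigen)
```

With these dimensions the row-norm bound comes out dozens of times larger than the largest Hessian eigenvalue, and it is always at least as large, since $\lambda_{\max}\big(\tfrac{1}{n}\sum_i x_i x_i^\top\big) \le \tfrac{1}{n}\sum_i \lVert x_i \rVert^2 \le \max_i \lVert x_i \rVert^2$.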
I plan to run proper tests on this to figure out the best solution, but I thought I would put it up here so I remember to revisit it later.
Also, I wanted to check that I am not missing something: the difference seems huge.