
Slow performance compared to torch.linalg.householder_product in forward pass #7

Open
bytesnake opened this issue Dec 17, 2021 · 6 comments


Problem

I'm using an orthonormal constrained matrix in a learnable filterbank setting. Now I want to optimize training, so I ran some profiling with torch, but I'm getting strange results. I just want to double-check here whether I'm doing something wrong.

Code

I'm constructing the matrix during the forward pass like this:

    def __init__(self, ..):
        [..]

        # Householder decomposition
        decomp, tau = torch.geqrf(filters)

        # assume that the DCT is orthogonal
        filters = decomp.tril(diagonal=-1) + torch.eye(decomp.shape[0], decomp.shape[1])

        # register everything as a parameter and set gradient flags
        self.filters = torch.nn.Parameter(filters, requires_grad=True)
        self.register_parameter('filter_q', self.filters)

    def filters(self):
        valid_coeffs = self.filters.tril(diagonal=-1)
        tau = 2. / (1 + torch.norm(valid_coeffs, dim=0) ** 2)
        return torch.linalg.householder_product(valid_coeffs, tau)
        #return torch_householder_orgqr(valid_coeffs, tau)
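For reference, what `householder_product` computes here can be sketched in plain NumPy (a minimal illustration of the construction above, not the library's actual implementation): each strictly lower-triangular column, together with an implicit unit entry on the diagonal, defines one reflector, and with `tau = 2 / (1 + ||v||^2)` every factor is an exact Householder reflection, so the product is orthogonal:

```python
import numpy as np

def householder_product(v, tau):
    # v: (d, r) reflector columns (only the strictly lower-triangular
    # part is used, with an implicit 1 on the diagonal);
    # tau: (r,) per-reflector scaling factors.
    d, r = v.shape
    q = np.eye(d)
    for i in range(r):
        w = np.zeros(d)
        w[i] = 1.0              # implicit unit diagonal entry
        w[i + 1:] = v[i + 1:, i]  # strictly lower-triangular part
        # H_i = I - tau_i * w w^T is an exact reflector when
        # tau_i = 2 / ||w||^2 = 2 / (1 + ||v_i||^2)
        q = q @ (np.eye(d) - tau[i] * np.outer(w, w))
    return q

rng = np.random.default_rng(0)
v = np.tril(rng.standard_normal((8, 8)), k=-1)
tau = 2.0 / (1.0 + np.linalg.norm(v, axis=0) ** 2)
q = householder_product(v, tau)
print(np.allclose(q.T @ q, np.eye(8), atol=1e-10))  # → True
```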

Profiles

All profiles were created with the PyTorch profiler, using a warmup of one and two trial runs:

Profile torch householder_product (matrix 512x512 f32)

  • forward pass: ~823us
  • backward pass: ~790ms

The forward and backward passes are marked in light green:

[screenshot: profiler trace]

Profile torch-householder (matrix 512x512 f32)

  • forward pass: ~240ms
  • backward pass: ~513ms

[screenshot: profiler trace]

Questions

I'm not an expert in torch and don't follow its development closely. There is an issue pytorch/pytorch#50104 about integrating CUDA support into orgqr; could that cause the difference in timing?

  • Why is the torch-householder library much slower in the forward pass?
  • Is this performance expected when differentiating a matrix w.r.t. its Householder parametrization, or am I doing something wrong here?
  • Why do the numbers add up to roughly 800 ms in both cases? This makes me suspect my profiling is doing something wrong, but I couldn't find a cause.

I'm also happy to share the traces with you, please just ping me :)

toshas (Owner) commented Dec 17, 2021

Thanks for this analysis, a couple of things to double check:

  • Can you please match the input sizes against the charts in the https://github.com/toshas/torch-householder README plots? This will give you an estimate of the runtime speed and memory consumption ratios as of the last time I checked.
  • Are you performing warmup in both cases? The first invocation of a function can be orders of magnitude slower than subsequent ones, e.g. due to extension compilation, CUDA initialization, etc. To get a better feel for performance, please make sure to specify warmup flags for the profiler and measure over several runs to average out fluctuations.
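The warmup point can be made concrete with a minimal timing harness (illustrative names, not part of either library; on GPU you would also call `torch.cuda.synchronize()` before reading the clock, since CUDA kernels launch asynchronously):

```python
import time

def bench(fn, warmup=4, repeats=20):
    # Discard the first few calls: lazy one-time costs (CUDA context
    # creation, extension compilation, kernel autotuning) can make the
    # first invocation orders of magnitude slower than steady state.
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - t0) / repeats  # mean seconds per call
```

Inside the PyTorch profiler, `torch.profiler.schedule(wait=..., warmup=..., active=...)` achieves the same separation of warmup and measured iterations.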

bytesnake (Author) commented Dec 17, 2021

This will give you the estimate of runtime speed and memory consumption ratios at the time I checked last

The evaluation uses "thin" matrices, but in my case all matrices are square, for perfect reconstruction. This means that for m=32 we would need to go to d=8192, which is eight times beyond the performance comparison chart.

Are you performing warmup in both cases?

Yes, I'm doing a warmup of one to launch all CUDA kernels etc., but I'll try increasing it.

EDIT: tried that; there is no difference with warmup=4.

@bytesnake (Author)

Probably my expectations are too high and I need to reparametrize the model, but I still want to double-check that I'm using the Householder parametrization correctly.

toshas (Owner) commented Dec 17, 2021

Can you please run https://github.com/toshas/torch-householder/blob/master/tests/benchmark_one.py with the values of width (r), height (d), and batch size (b) of interest?

bytesnake (Author) commented Dec 17, 2021

    $ python3 benchmark_one.py --repeats 5 --method hh_ours --b 1 --d 512 --r 512
    70.72763475589454 1.2755393754559918e-06
    $ python3 benchmark_one.py --repeats 5 --method hh_pt --b 1 --d 512 --r 512
    318.1826988235116 1.3709068298339844e-06

To be really comparable, I had to modify the test: it creates B decompositions, but in my case a single one is shared across the whole batch.

Now with a single orthonormal matrix and B batches of data (I believe the decomposition works fine ;)):

    $ python3 benchmark_one.py --repeats 20 --method hh_ours --b 20 --d 512 --r 512
    19.521982199512422
    $ python3 benchmark_one.py --repeats 20 --method hh_pt --b 20 --d 512 --r 512
    2.61343854945153

still not really comparable ..
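The shared-matrix case described above can be sketched as follows (a NumPy stand-in, with a QR-based orthogonal matrix in place of the Householder product): build the orthonormal matrix once per step and apply it to the whole batch with a single matmul, rather than materializing B copies of the reflectors:

```python
import numpy as np

rng = np.random.default_rng(0)
d, b = 512, 20
# Stand-in for the single shared orthonormal matrix (built once per
# forward pass, e.g. via a Householder product); here a QR-based
# orthogonal matrix keeps the sketch self-contained.
q, _ = np.linalg.qr(rng.standard_normal((d, d)))
x = rng.standard_normal((b, d))  # batch of b input vectors

# Shared transform: one (b,d) @ (d,d) matmul, no per-sample copies of q.
y = x @ q.T
assert y.shape == (b, d)
# An orthogonal transform preserves per-sample norms.
assert np.allclose(np.linalg.norm(y, axis=1), np.linalg.norm(x, axis=1))
```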

Attachment: patch.txt

toshas (Owner) commented Dec 24, 2021

I applied the patch, and I'm not sure what the modified benchmark is supposed to verify. As originally implemented, it timed pure calls to the functions. After patching, an extra inp tensor is created and directly multiplied by the param tensor, while the transformation result out is never used.
