
Slow performance compared to torch.linalg.householder_product in forward pass #7

Open
bytesnake opened this issue Dec 17, 2021 · 6 comments


Problem

I'm using an orthonormal constrained matrix in a learnable filterbank setting. Now I want to optimize training, so I ran some profiling with torch, but I'm getting strange results. I just want to double-check here whether I'm doing something wrong.

Code

I'm constructing the matrix during the forward pass like this:

    def __init__(self, ..):
        [..]

        # Householder decomposition
        decomp, tau = torch.geqrf(filters)

        # assume that the DCT is orthogonal
        filters = decomp.tril(diagonal=-1) + torch.eye(decomp.shape[0], decomp.shape[1])

        # register everything as a parameter and set gradient flags
        self.filters = torch.nn.Parameter(filters, requires_grad=True)
        self.register_parameter('filter_q', self.filters)

    def filters(self):
        valid_coeffs = self.filters.tril(diagonal=-1)
        tau = 2. / (1 + torch.norm(valid_coeffs, dim=0) ** 2)
        return torch.linalg.householder_product(valid_coeffs, tau)
        #return torch_householder_orgqr(valid_coeffs, tau)
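For reference, what `householder_product` computes here can be sketched in plain NumPy (a minimal illustration of the construction above, not the library's actual implementation): each strictly lower-triangular column, together with an implicit unit entry on the diagonal, defines one reflector, and with `tau = 2 / (1 + ||v||^2)` every factor is an exact Householder reflection, so the product is orthogonal:

```python
import numpy as np

def householder_product(v, tau):
    # v: (d, r) reflector columns (only the strictly lower-triangular
    # part is used, with an implicit 1 on the diagonal);
    # tau: (r,) per-reflector scaling factors.
    d, r = v.shape
    q = np.eye(d)
    for i in range(r):
        w = np.zeros(d)
        w[i] = 1.0              # implicit unit diagonal entry
        w[i + 1:] = v[i + 1:, i]  # strictly lower-triangular part
        # H_i = I - tau_i * w w^T is an exact reflector when
        # tau_i = 2 / ||w||^2 = 2 / (1 + ||v_i||^2)
        q = q @ (np.eye(d) - tau[i] * np.outer(w, w))
    return q

rng = np.random.default_rng(0)
v = np.tril(rng.standard_normal((8, 8)), k=-1)
tau = 2.0 / (1.0 + np.linalg.norm(v, axis=0) ** 2)
q = householder_product(v, tau)
print(np.allclose(q.T @ q, np.eye(8), atol=1e-10))  # → True
```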

Profiles

All profiles were created with the PyTorch profiler, using a warmup of one and two trial runs:

Profile torch householder_product (matrix 512x512 f32)

  • forward pass: ~823us
  • backward pass: ~790ms

The forward and backward passes are marked in light green:

[screenshot: profiler trace]

Profile torch-householder (matrix 512x512 f32)

  • forward pass: ~240ms
  • backward pass: ~513ms

[screenshot: profiler trace]

Questions

I'm not an expert in torch and don't follow its development closely. There is an issue pytorch/pytorch#50104 about integrating CUDA support into orgqr; could that cause the difference in timing?

  • Why is the torch-householder library much slower in the forward pass?
  • Is this performance expected when differentiating a matrix w.r.t. its Householder parametrization, or am I doing something wrong here?
  • Why do the numbers add up to roughly 800 ms in both cases? This makes me suspect my profiling is doing something wrong, but I couldn't find a cause.

I'm also happy to share the traces with you, please just ping me :)

toshas (Owner) commented Dec 17, 2021

Thanks for this analysis, a couple of things to double check:

  • Can you please match the input sizes against the charts in the https://github.com/toshas/torch-householder README plots? This will give you an estimate of the runtime speed and memory consumption ratios as of the last time I checked.
  • Are you performing warmup in both cases? The first invocation of a function can be orders of magnitude slower than subsequent ones, e.g. due to extension compilation, CUDA initialization, etc. To get a better feel for performance, please make sure to specify warmup flags for the profiler and measure over several runs to average out fluctuations.
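The warmup point can be made concrete with a minimal timing harness (illustrative names, not part of either library; on GPU you would also call `torch.cuda.synchronize()` before reading the clock, since CUDA kernels launch asynchronously):

```python
import time

def bench(fn, warmup=4, repeats=20):
    # Discard the first few calls: lazy one-time costs (CUDA context
    # creation, extension compilation, kernel autotuning) can make the
    # first invocation orders of magnitude slower than steady state.
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - t0) / repeats  # mean seconds per call
```

Inside the PyTorch profiler, `torch.profiler.schedule(wait=..., warmup=..., active=...)` achieves the same separation of warmup and measured iterations.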

bytesnake (Author) commented Dec 17, 2021

This will give you the estimate of runtime speed and memory consumption ratios at the time I checked last

The evaluation uses "thin" matrices, but in my case all matrices are square, for perfect reconstruction. This means that for m=32 we would need to go to d=8192, which is eight times beyond the performance comparison chart.

Are you performing warmup in both cases?

Yes, I'm doing a warmup of one to launch all CUDA kernels etc., but I'll try increasing it.

EDIT: tried that; there is no difference with warmup=4.

@bytesnake (Author)

Probably my expectations are too high and I need to reparametrize the model, but I still want to double-check that I'm using the Householder parametrization correctly.

toshas (Owner) commented Dec 17, 2021

Can you please run https://github.com/toshas/torch-householder/blob/master/tests/benchmark_one.py with the values of width (r), height (d), and batch size (b) of interest?

bytesnake (Author) commented Dec 17, 2021

    $ python3 benchmark_one.py --repeats 5 --method hh_ours --b 1 --d 512 --r 512
    70.72763475589454 1.2755393754559918e-06
    $ python3 benchmark_one.py --repeats 5 --method hh_pt --b 1 --d 512 --r 512
    318.1826988235116 1.3709068298339844e-06

To be really comparable, I had to modify the test: it creates B decompositions, but in my case a single one is shared across the whole batch.

Now with a single orthonormal matrix and B batches of data (I believe the decomposition works fine ;)):

    $ python3 benchmark_one.py --repeats 20 --method hh_ours --b 20 --d 512 --r 512
    19.521982199512422
    $ python3 benchmark_one.py --repeats 20 --method hh_pt --b 20 --d 512 --r 512
    2.61343854945153

still not really comparable ..
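The shared-matrix case described above can be sketched as follows (a NumPy stand-in, with a QR-based orthogonal matrix in place of the Householder product): build the orthonormal matrix once per step and apply it to the whole batch with a single matmul, rather than materializing B copies of the reflectors:

```python
import numpy as np

rng = np.random.default_rng(0)
d, b = 512, 20
# Stand-in for the single shared orthonormal matrix (built once per
# forward pass, e.g. via a Householder product); here a QR-based
# orthogonal matrix keeps the sketch self-contained.
q, _ = np.linalg.qr(rng.standard_normal((d, d)))
x = rng.standard_normal((b, d))  # batch of b input vectors

# Shared transform: one (b,d) @ (d,d) matmul, no per-sample copies of q.
y = x @ q.T
assert y.shape == (b, d)
# An orthogonal transform preserves per-sample norms.
assert np.allclose(np.linalg.norm(y, axis=1), np.linalg.norm(x, axis=1))
```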

Attachment: patch.txt

toshas (Owner) commented Dec 24, 2021

I applied the patch, and I'm not sure what the modified benchmark is supposed to verify. As originally implemented, it timed pure calls to the functions. After patching, an extra inp tensor is created and directly multiplied by the param tensor, while the transformation result out is never used.
