Slow performance compared to torch.linalg.householder_product in forward pass
#7
Comments
Thanks for this analysis, a couple of things to double-check:

- The evaluation uses "thin" matrices, but in my case all of them are square for perfect reconstruction. This means that for
- Yes, I'm doing a warmup of one to launch all CUDA kernels etc., but I will try to increase it. EDIT: tried that, but there is no difference for
- Probably my expectations are too high and I need to reparametrize the model, but I still want to double-check that I'm using the Householder parametrization correctly.

Can you please run https://github.com/toshas/torch-householder/blob/master/tests/benchmark_one.py with the values of width (r), height (d), and batch size (b) of interest?
$ python3 benchmark_one.py --repeats 5 --method hh_ours --b 1 --d 512 --r 512
70.72763475589454 1.2755393754559918e-06
$ python3 benchmark_one.py --repeats 5 --method hh_pt --b 1 --d 512 --r 512
318.1826988235116 1.3709068298339844e-06

To be really comparable, I have to modify the test. As implemented, it creates B decompositions, but in my case a single one is shared for the whole batch. Now with a single orthonormal matrix and B batches of data (I believe that the decomposition works fine ;)):

$ python3 benchmark_one.py --repeats 20 --method hh_ours --b 20 --d 512 --r 512
19.521982199512422
$ python3 benchmark_one.py --repeats 20 --method hh_pt --b 20 --d 512 --r 512
2.61343854945153

Still not really comparable ..
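The shared-decomposition setting from the modified benchmark can be sketched as follows. This is a hypothetical illustration, not the benchmark's actual code: the Householder parametrization (reflector vectors in the strictly lower triangle, tau chosen for exact orthogonality) is an assumption.

```python
import torch

# Hypothetical sketch of the modified setting: ONE orthonormal matrix Q
# shared across the whole batch, instead of B independent decompositions.
b, d = 20, 512

A = torch.randn(d, d)
lower = torch.tril(A, diagonal=-1)            # reflector vectors, implicit unit diagonal
tau = 2.0 / (1.0 + lower.pow(2).sum(dim=0))   # tau_i = 2/||v_i||^2 => each reflector is orthogonal
Q = torch.linalg.householder_product(lower + torch.eye(d), tau)

x = torch.randn(b, d)   # a batch of data vectors
y = x @ Q.T             # the SAME Q applied to every batch element
```

Because Q is orthonormal, the transform preserves norms, which is what makes the single-matrix case suitable for perfect reconstruction.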
I applied the patch, and I'm not sure what the modified benchmark is supposed to verify. The way it was implemented, it timed pure calls to the functions. After patching there is an extra
Problem
I'm using an orthonormal constrained matrix in a learnable filterbank setting. Now I want to optimize the training and ran some profiling with torch, but I'm getting strange results. I just want to double-check here whether I'm doing something wrong.
Code
I'm constructing the matrix during the forward pass like this:
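The original snippet was not captured in this excerpt. As a minimal sketch of what such a forward-pass construction might look like with `torch.linalg.householder_product` (the parameter shape and the choice of tau are assumptions for illustration):

```python
import torch

# Minimal sketch (assumption, not the issue's actual code): an orthonormal
# 512x512 matrix parametrized by Householder reflectors, rebuilt on every
# forward pass from a learnable parameter A.
d = 512
A = torch.randn(d, d, requires_grad=True)

def make_orthonormal(A: torch.Tensor) -> torch.Tensor:
    # Only the strictly lower-triangular part of A is used; the i-th
    # reflector vector is e_i plus the i-th column of that triangle.
    lower = torch.tril(A, diagonal=-1)
    # tau_i = 2/||v_i||^2 makes every reflector, and hence Q, orthogonal.
    tau = 2.0 / (1.0 + lower.pow(2).sum(dim=0))
    return torch.linalg.householder_product(lower + torch.eye(d, device=A.device), tau)

Q = make_orthonormal(A)   # Q.T @ Q ~ I, and gradients flow back into A
```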
Profiles
All profiles are created with the pytorch profiler with warmup of one and two trial runs:
Profile torch householder_product (matrix 512x512 f32)
Marked forward pass and backward pass visible in light green:
Profile torch-householder (matrix 512x512 f32)
Questions
I'm not an expert in torch and do not follow the development closely:

- There is an issue pytorch/pytorch#50104 for integrating CUDA support into orgqr; may this cause the difference in time?
- Why is the torch-householder library much slower in the forward pass?

I'm also happy to share the traces with you, please just ping me then :)