fix: compute fbank on selected device #1529
@hbredin Based on my previous testing, besides moving compute_fbank to the GPU, torch.vstack also runs very slowly on the CPU.
pyannote-audio/pyannote/audio/pipelines/speaker_diarization.py Lines 341 to 345 in 0b45103
|
I don't observe this behavior on Google Colab (T4):
import torch
waveforms = [torch.randn(1, 160000) for i in range(32)]
%%timeit
torch.vstack(waveforms)
# 3.63 ms ± 380 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
gpu = torch.device("cuda")
%%timeit
torch.vstack([w.to(gpu) for w in waveforms])
# 6.37 ms ± 55.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Can you please double check? |
Sure, tomorrow when I'm at work, I'll run the program and see the results. |
@hbredin Hello, I've tested the behavior of torch.vstack on the CPU. Running the following code consumes a significant amount of CPU resources: it occupies 128 cores. When multiple processes run torch.vstack on the CPU in parallel, performance becomes very slow. You can run this code on your machine and check the CPU usage with the 'top' command.
import torch
from loguru import logger
import time
a = torch.randn((1, 1, 160000))
b = (a,)
start = time.perf_counter()
for i in range(1000000):
    torch.vstack(b)
end = time.perf_counter()
logger.info(f"takes {end-start:>.2f}")
# output
# 2023-11-06 20:05:48.691 | INFO | __main__:<module>:13 - takes 44.06 |
Thanks. But I think we are still missing a comparison with GPU, aren't we? |
GPU A100:
import torch
from loguru import logger
import time
a = torch.randn((1, 1, 160000))
a = a.cuda()
b = (a,)
start = time.perf_counter()
for i in range(1000000):
    torch.vstack(b)
end = time.perf_counter()
logger.info(f"takes {end-start:>.2f}")
# output
# 2023-11-06 20:34:31.094 | INFO | __main__:<module>:14 - takes 12.19 |
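CUDA kernels are launched asynchronously, so a wall-clock timer read right after the loop may under-report GPU time. A minimal sketch of the same comparison with explicit synchronization follows. The helper name time_vstack and the reduced iteration count are assumptions made for illustration, not part of the scripts above, and the GPU branch only runs when CUDA is available.

```python
import time

import torch


def time_vstack(device: torch.device, n_iters: int = 1000) -> float:
    """Return the seconds spent on n_iters calls to torch.vstack on `device`."""
    a = torch.randn((1, 1, 160000), device=device)
    b = (a,)
    if device.type == "cuda":
        # Finish any pending GPU work before starting the clock.
        torch.cuda.synchronize(device)
    start = time.perf_counter()
    for _ in range(n_iters):
        torch.vstack(b)
    if device.type == "cuda":
        # Wait for all queued kernels before stopping the clock.
        torch.cuda.synchronize(device)
    return time.perf_counter() - start


# Always time the CPU; add the GPU only when one is present.
devices = [torch.device("cpu")]
if torch.cuda.is_available():
    devices.append(torch.device("cuda"))

timings = {d.type: time_vstack(d) for d in devices}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.2f}s")
```

With synchronization in place, the two numbers measure completed work on both devices rather than kernel launch time on the GPU.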
Can you please check what happens when reducing |
With export OMP_NUM_THREADS=64, it occupies 64 cores. The runtime has been reduced, which came as a pleasant surprise to me.
# output
# 2023-11-07 14:20:28.237 | INFO | __main__:<module>:13 - takes 25.43 |
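The thread count can also be capped from inside Python with torch.set_num_threads, which sets the number of threads used for intra-op parallelism on CPU. The sketch below assumes that this intra-op pool is what drives the core usage seen in top. The helper name time_vstack_with_threads and the thread counts tried are illustrative choices, not values from this thread.

```python
import time

import torch


def time_vstack_with_threads(num_threads: int, n_iters: int = 1000) -> float:
    """Time torch.vstack on CPU with the intra-op thread count capped."""
    previous = torch.get_num_threads()
    torch.set_num_threads(num_threads)
    try:
        b = (torch.randn((1, 1, 160000)),)
        start = time.perf_counter()
        for _ in range(n_iters):
            torch.vstack(b)
        return time.perf_counter() - start
    finally:
        # Restore the earlier setting so later code is unaffected.
        torch.set_num_threads(previous)


# Hypothetical thread counts; pick values that suit the machine.
results = {n: time_vstack_with_threads(n) for n in (1, 4)}
for n, seconds in results.items():
    print(f"{n} thread(s): {seconds:.2f}s")
```

Running several such processes side by side, each with a small cap, is one way to check whether oversubscription explains the slowdown under parallel load.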
@asr-pub: this is a more generic attempt at solving #1522, as it uses the internal self.device that can be set by WeSpeakerPretrainedEmbeddding.to(device) (and does not force using GPU when available). Can you please try it out and confirm that this solves your issue?
@juanmc2005: a side effect of this PR is that it should solve #1518. Can you please try it out and confirm?
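The pattern described here (keep a device attribute, let .to(device) set it, and move inputs to it before computing features) can be sketched as follows. This is not the code from this PR: the class DeviceAwareFeatureExtractor is hypothetical, and a log power spectrogram from torch.stft stands in for the real fbank computation so that the example depends only on torch.

```python
import torch


class DeviceAwareFeatureExtractor:
    """Hypothetical extractor that computes features on a caller-chosen device."""

    def __init__(self, device=None):
        # Default to CPU; a GPU is never picked automatically.
        self.device = torch.device("cpu") if device is None else torch.device(device)

    def to(self, device):
        self.device = torch.device(device)
        return self

    def compute_features(self, waveforms: torch.Tensor) -> torch.Tensor:
        """waveforms: (batch, num_samples) -> (batch, frames, bins)."""
        # Move the input to the selected device before any heavy computation.
        waveforms = waveforms.to(self.device)
        window = torch.hann_window(400, device=self.device)
        # Stand-in for fbank: 25 ms windows with a 10 ms hop at 16 kHz.
        spec = torch.stft(
            waveforms,
            n_fft=512,
            hop_length=160,
            win_length=400,
            window=window,
            return_complex=True,
        )
        return torch.log(spec.abs().pow(2) + 1e-6).transpose(1, 2)


extractor = DeviceAwareFeatureExtractor()
features = extractor.compute_features(torch.randn(4, 16000))
print(features.shape, features.device)
```

Calling extractor.to(torch.device("cuda")) before compute_features would run the same computation on the GPU, and only because the caller asked for it.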