Is it possible to run multiple files using FileAudioSource and get one final inference result? #253

esphoenixc · 2024-11-25T07:11:11Z

I attempted to increase the duration (audio chunk size) for processing, but when I set it too high, errors like NaN values occur, and I never receive any diarization results.

audio_source = FileAudioSource( file="long_audio.wav", sample_rate=16000, block_duration=60.0 )

block_duration: int Duration of each emitted chunk in seconds. Defaults to 0.5 seconds.

self.block_size = int(np.rint(block_duration * self.sample_rate))
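To make the arithmetic in the line above concrete, here is a minimal sketch of the same computation. It uses Python's built-in `round` in place of `np.rint` so it runs without NumPy (both round halves to the nearest even integer); the helper name `block_size` is only illustrative and is not part of diart.

```python
def block_size(block_duration: float, sample_rate: int) -> int:
    """Number of samples in one emitted block (duration in seconds times sample rate)."""
    # Built-in round() stands in for np.rint here; this is an assumption of the sketch.
    return int(round(block_duration * sample_rate))


# Default block duration quoted above: 0.5 s at 16 kHz.
default_block = block_size(0.5, 16000)

# The 60 s setting from the snippet above, at the same sample rate.
large_block = block_size(60.0, 16000)

# How many times larger each emitted block becomes.
ratio = large_block // default_block
```

With these values, each emitted block goes from 8,000 samples to 960,000 samples, that is, 120 times larger per emission.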

My idea is to keep the duration (audio chunk size) that the model was designed and trained for, such as 1 second. Instead of processing a single long audio file, I plan to split it into multiple smaller files. For example, a 10-minute audio file would be divided into five 2-minute chunks. Then I would run FileAudioSource on each of these smaller files and combine the results into a final inference for the entire 10-minute audio.

Is this approach feasible? If so, how might it impact the accuracy and processing speed compared to using FileAudioSource on the entire audio file at once? Specifically:

Accuracy: Will splitting the audio into smaller chunks affect the diarization performance? Could it potentially improve or degrade the results?
Performance: How will processing multiple smaller files compare in terms of speed and resource usage versus processing one large file?
Any insights or recommendations based on similar experiences would be greatly appreciated!

juanmc2005 · 2024-12-13T10:19:51Z

Maintainer

Hi @esphoenixc,
The block duration of FileAudioSource is simply the size of the chunks that will be emitted to the streaming pipeline. These are supposed to be very short and equal to the duration of the stride/step, hence the default of 500 ms. Increasing it doesn't really make sense and can cause inconsistencies, because the pipeline will receive audio in larger chunks than it can handle.

If you want to increase the duration of audio sent to the model at once, you should take a look at duration in SpeakerDiarizationConfig. The chunks sent to the model will of course overlap according to the step, which you can also modify.
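As a rough illustration of how duration and step relate, the sketch below lists the windows a sliding scheme would cover. It is plain Python and does not call diart; the function name `chunk_windows` and the 5-second duration are assumptions chosen for the example, while the 0.5-second step matches the default mentioned above.

```python
def chunk_windows(total: float, duration: float, step: float):
    """Return (start, end) times in seconds of each full window of length
    `duration`, advanced by `step`, that fits inside `total` seconds of audio."""
    windows = []
    n = 0
    # Small tolerance guards against floating-point error at the last window.
    while n * step + duration <= total + 1e-9:
        start = n * step
        windows.append((start, start + duration))
        n += 1
    return windows


# Illustrative values: 7 s of audio, 5 s windows (assumed), 0.5 s step.
windows = chunk_windows(total=7.0, duration=5.0, step=0.5)

# Consecutive windows share everything except one step of new audio.
overlap = 5.0 - 0.5
```

In this example, consecutive windows share 4.5 seconds of audio, so each new window brings in only one step of new signal.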

Concerning splitting a file into smaller files to process sequentially: if you keep using the same SpeakerDiarization instance without calling reset() (which I think StreamingInference does, but I'd have to check), the state from the previous file will be kept.
However, you should pay attention to the start and end padding that is added in FileAudioSource.
If you want to simulate the processing of a single long file with many files, you should pad the beginning of the (n)th file with the ending of the (n-1)th file for it to make sense.
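The padding idea can be sketched on a plain list of samples. This is only an illustration of prepending the tail of the previous segment to the next one; `split_with_context`, `segment_len` and `context_len` are hypothetical names, and how such context interacts with the padding that FileAudioSource adds on its own is not checked here.

```python
def split_with_context(samples, segment_len, context_len):
    """Split `samples` into consecutive segments of `segment_len` samples,
    prefixing each segment after the first with the last `context_len`
    samples of the audio that precedes it."""
    segments = []
    for start in range(0, len(samples), segment_len):
        # Reach back into the previous segment, but never before sample 0.
        ctx_start = max(0, start - context_len)
        segments.append(samples[ctx_start:start + segment_len])
    return segments


# Toy "audio" of 10 samples so the overlap is easy to see.
audio = list(range(10))
parts = split_with_context(audio, segment_len=4, context_len=2)
```

Here the second and third parts each begin with the last two samples of the part before them, so any output produced over that repeated context would have to be accounted for when merging results.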

May I ask why you want to split a file into multiple ones? The processing time will not be impacted by this splitting when you process in streaming. The only advantage I see is that you'd be able to "pause" and "resume" the pipeline.

3 replies

esphoenixc Jan 2, 2025
Author

Thank you for your response @juanmc2005

The reason for splitting a file into multiple chunks is to minimize the perceived processing time. FileAudioSource with a single long audio file is already quite fast compared to the original pyannote implementation, which took over 7 minutes to process an hour-long file. Splitting the file would combine the speed advantage of streaming with the accuracy of FileAudioSource. In my tests, FileAudioSource has consistently shown better accuracy than real-time streaming.

Let’s say we are recording an hour-long audio file. Instead of waiting for the full hour to finish before starting the processing, we can process the audio in smaller chunks as they become available. For example, once the first 30-minute segment is recorded, we can process it immediately. By the time the next 30-minute segment is recorded, we can process that as well. This way, each segment takes only its own duration (e.g., 30 minutes) to process, instead of waiting for the entire 1-hour recording to be completed and processed all at once. This method acts as a form of pseudo-streaming, enabling quicker access to results without compromising accuracy.

juanmc2005 Jan 2, 2025
Maintainer

I understand that strategy in a non-streaming scenario like pyannote.audio to accelerate processing, but for diart I don't think it brings anything to the table.

In fact, I think what you might want to do is simply recover diarization results from the audio as soon as possible, which is exactly what diart allows you to do.
Of course, you can run diart and wait for it to finish to recover the outputs as RTTM, but diart is built explicitly for the use case where you don't want to wait. As a matter of fact, you can specify how much you're willing to wait for your results with the latency parameter.

Using StreamingInference, you can use the attach_hooks() method to execute custom code as soon as a new diarization output is available, which will happen every step seconds (500ms by default).

Does this make sense or did I misunderstand?

esphoenixc Jan 8, 2025
Author

Thank you for your response, @juanmc2005.

I’d like to clarify the reasoning behind my approach. My goal isn’t to access intermediate results or reduce the overall processing time per se. Instead, it’s to reduce the perceived waiting time. Here’s the key idea:

Let’s say processing a 1-hour-long audio file takes 10 seconds. If we split the file into two 30-minute chunks, each chunk would take approximately 5 seconds to process (assuming linear scaling).

This means, when the first 30-minute chunk is available, we can process it in 5 seconds.
When the second 30-minute chunk is ready (i.e., at the end of the full hour), we only need another 5 seconds to process it.
In contrast, processing the entire 1-hour file in one go would take 10 seconds, which increases the perceived waiting time when the file is ready. By splitting the file, the final waiting time is halved while maintaining the same inference results because the diarization process wasn’t stopped or reset between chunks.
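The arithmetic in this example can be written out as a small sketch. It rests on the same assumptions stated above (processing time scales linearly with audio length, and every chunk except the last is fully processed before the recording ends); the function name `final_wait` is made up for illustration.

```python
def final_wait(total_processing_s: float, num_chunks: int) -> float:
    """Waiting time after the recording ends, assuming linear scaling and
    that only the last of `num_chunks` equal chunks remains to be processed."""
    return total_processing_s / num_chunks


# One-hour file processed in one go: the full 10 s are spent after recording ends.
wait_single = final_wait(10.0, 1)

# Two 30-minute chunks: only the second chunk's share remains at the end.
wait_split = final_wait(10.0, 2)
```

Under these assumptions the total processing time stays at 10 seconds in both cases; only the portion that falls after the end of the recording changes.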

This approach doesn’t change the overall processing time but significantly improves the user experience by breaking it into smaller tasks and reducing the delay between recording completion and result availability.

Does this distinction make sense? I’d love to hear your thoughts on whether this aligns with diart’s features or if there are better ways to achieve this perceived latency reduction.
