Understanding the Role of self.audio_alignment in Audio-Visual Cross-Modal Models #28

15253a · 2025-01-13T19:20:27Z

Hello, dear author. I'm a beginner, and I feel a bit embarrassed to bother you. I'm somewhat confused about this self.audio_alignment attribute. From what I understand, it's basically an estimate of how many audio tokens correspond roughly to one video frame. I also noticed that in your code, you truncate the audio tokens based on the video frames and this self.audio_alignment attribute, and I'm not entirely sure how that works. Thank you very much for your work, which has really broadened my understanding.

snoop2head · 2025-01-14T02:01:52Z

Dear @15253a ,

No worries at all—thanks for reaching out!

The attribute self.audio_alignment determines how many audio tokens (from the neural audio codec, e.g., vq-wav2vec or wav2vec2) correspond to a single video frame. We arrived at this by looking at the audio and video rates (16 kHz for audio, 25 fps for video) and selecting a hop size that keeps both streams in sync. Concretely, vq-wav2vec emits vector-quantized tokens at 100 Hz, so one video frame at 25 fps matches about four audio tokens (100 / 25 = 4).
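
As a rough illustration of that arithmetic, here is a minimal sketch. The constant names are hypothetical and chosen for readability; they are not taken from the repository, and only the three rates come from the explanation above.

```python
# Illustrative arithmetic only; names are hypothetical, not the repository's.
AUDIO_SAMPLE_RATE_HZ = 16_000  # audio sampling rate stated above
VIDEO_FPS = 25                 # video frame rate stated above
AUDIO_TOKEN_RATE_HZ = 100      # vq-wav2vec token rate stated above

# Raw audio samples covered by one audio token (the hop size).
hop_samples = AUDIO_SAMPLE_RATE_HZ // AUDIO_TOKEN_RATE_HZ

# Raw audio samples covered by one video frame.
samples_per_frame = AUDIO_SAMPLE_RATE_HZ // VIDEO_FPS

# Audio tokens per video frame: the quantity self.audio_alignment describes.
audio_alignment = AUDIO_TOKEN_RATE_HZ // VIDEO_FPS

print(hop_samples, samples_per_frame, audio_alignment)
```

With these rates, each token spans 160 samples, each frame spans 640 samples, and the ratio works out to 4 tokens per frame.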

We truncate the audio tokens at the end of the sequence to handle any remaining tokens that don't align neatly with the final video frame. If you check the “Audio Reconstruction Loss” section of our paper, you’ll see more detail on how and why this alignment is performed.
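
The truncation can be pictured roughly as follows. This is a sketch under the assumption that the audio sequence is cut to at most `num_video_frames * audio_alignment` tokens; the function name and signature are hypothetical, and the actual code in the repository may differ.

```python
def truncate_audio_tokens(audio_tokens, num_video_frames, audio_alignment=4):
    """Drop trailing audio tokens that have no matching video frame.

    Hypothetical helper: keeps at most num_video_frames * audio_alignment
    tokens, counted from the start of the sequence.
    """
    max_tokens = num_video_frames * audio_alignment
    return audio_tokens[:max_tokens]


# Example: 10 audio tokens but only 2 video frames -> keep the first 8.
tokens = list(range(10))
aligned = truncate_audio_tokens(tokens, num_video_frames=2)
print(len(aligned))
```

Slicing from the front means only the leftover tokens at the end are discarded, so every remaining group of four tokens still lines up with one frame.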

Thank you for your interest, and I hope this helps!

snoop2head · 2025-01-26T09:57:45Z

@15253a
I will close this issue for now. Please re-open the issue or file another one if you have any other questions!

snoop2head closed this as completed Jan 26, 2025
