Understanding the Role of self.audio_alignment in Audio-Visual Cross-Modal Models #28

Closed
15253a opened this issue Jan 13, 2025 · 2 comments

Comments

@15253a

15253a commented Jan 13, 2025

Hello, dear author. I'm a beginner, and I feel a bit embarrassed to bother you. I'm somewhat confused about this self.audio_alignment attribute. From what I understand, it's basically an estimate of how many audio tokens correspond roughly to one video frame. I also noticed that in your code, you truncate the audio tokens based on the video frames and this self.audio_alignment attribute, and I'm not entirely sure how that works. Thank you very much for your work, which has really broadened my understanding.

@snoop2head
Collaborator

Dear @15253a ,

No worries at all—thanks for reaching out!

The attribute self.audio_alignment is the number of audio tokens (from the neural audio codec, e.g., vq-wav2vec or wav2vec2) that correspond to a single video frame. We arrived at this from the two rates involved: the audio is sampled at 16 kHz and the video runs at 25 fps, and we selected a hop size that keeps both streams in sync. Concretely, vq-wav2vec emits vector-quantized tokens at 100 Hz, so one video frame matches about four audio tokens (100 / 25 = 4).
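To make the arithmetic concrete, here is a minimal sketch of how the ratio falls out of the rates mentioned above. The function name `tokens_per_frame` and the 160-sample hop are illustrative assumptions (160 is simply 16 kHz divided by the stated 100 Hz token rate); this is not the repository's actual code.

```python
def tokens_per_frame(audio_token_rate_hz: int, video_fps: int) -> int:
    """Return how many audio tokens span one video frame.

    Hypothetical helper: raises if the rates do not divide evenly,
    since a fractional ratio would let the streams drift apart.
    """
    if audio_token_rate_hz % video_fps != 0:
        raise ValueError("token rate must be an integer multiple of the frame rate")
    return audio_token_rate_hz // video_fps


SAMPLE_RATE_HZ = 16_000  # audio sampling rate stated in the thread
HOP_SAMPLES = 160        # assumption: 16 kHz / 100 Hz token rate
VIDEO_FPS = 25           # video frame rate stated in the thread

token_rate_hz = SAMPLE_RATE_HZ // HOP_SAMPLES  # 100 tokens per second
audio_alignment = tokens_per_frame(token_rate_hz, VIDEO_FPS)
print(audio_alignment)
```

With a different codec hop size or video frame rate, the same division would give a different alignment value.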

We truncate the audio tokens at the end of the sequence to handle any remaining tokens that don't align neatly with the final video frame. If you check the “Audio Reconstruction Loss” section of our paper, you’ll see more detail on how and why this alignment is performed.
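The truncation step can be sketched roughly as follows. This is only an illustration under assumptions: `align_audio_to_video` is a hypothetical name, tokens are plain Python integers rather than tensors, and the real implementation in the repository may differ.

```python
from typing import List, Sequence


def align_audio_to_video(
    audio_tokens: Sequence[int],
    num_video_frames: int,
    audio_alignment: int = 4,
) -> List[List[int]]:
    """Group audio tokens per video frame, dropping trailing leftovers.

    Hypothetical sketch: keeps at most `num_video_frames * audio_alignment`
    tokens, then discards any tail that does not fill a whole frame.
    """
    # Cut off tokens that extend past the last video frame.
    max_tokens = num_video_frames * audio_alignment
    trimmed = list(audio_tokens[:max_tokens])

    # Drop a partial group at the end so every frame has a full set of tokens.
    usable_frames = len(trimmed) // audio_alignment
    trimmed = trimmed[: usable_frames * audio_alignment]

    return [
        trimmed[i * audio_alignment : (i + 1) * audio_alignment]
        for i in range(usable_frames)
    ]


# 10 tokens against 2 video frames: the last 2 tokens are truncated.
groups = align_audio_to_video(list(range(10)), num_video_frames=2)
print(groups)
```

Each inner list holds the audio tokens paired with one video frame, so the two streams end on the same frame boundary.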

Thank you for your interest, and I hope this helps!

@snoop2head
Collaborator

@15253a
I will close this issue for now. Please re-open the issue or file another one if you have any other questions!
