Hello, dear author. I'm a beginner, and I feel a bit embarrassed to bother you. I'm somewhat confused about this self.audio_alignment attribute. From what I understand, it's basically an estimate of how many audio tokens correspond roughly to one video frame. I also noticed that in your code, you truncate the audio tokens based on the video frames and this self.audio_alignment attribute, and I'm not entirely sure how that works. Thank you very much for your work, which has really broadened my understanding.
The attribute self.audio_alignment is the number of audio tokens (from the audio tokenizer, e.g., vq-wav2vec or wav2vec2) that correspond to a single video frame. We arrived at it from the audio and video rates (16 kHz for audio and 25 fps for video), selecting a hop size that keeps both streams in sync. Concretely, vq-wav2vec emits vector-quantized tokens at 100 Hz, so one video frame typically matches about four audio tokens (100 tokens per second / 25 frames per second = 4).
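The arithmetic above can be sketched in a few lines. This is only an illustration, not the repository's code: `tokens_per_frame` is a hypothetical helper, and the hop of 160 samples is an assumption derived from the stated 100 Hz token rate at 16 kHz.

```python
def tokens_per_frame(audio_token_rate_hz: int, video_fps: int) -> int:
    """Hypothetical helper: number of audio tokens spanning one video frame."""
    if audio_token_rate_hz % video_fps != 0:
        # A non-integer ratio would let the two streams drift apart over time.
        raise ValueError("token rate is not an integer multiple of the frame rate")
    return audio_token_rate_hz // video_fps


SAMPLE_RATE = 16_000  # audio samples per second (from the reply above)
HOP_SAMPLES = 160     # assumed hop size: 16 000 / 160 = 100 tokens per second
VIDEO_FPS = 25        # video frames per second (from the reply above)

token_rate = SAMPLE_RATE // HOP_SAMPLES            # 100 tokens per second
audio_alignment = tokens_per_frame(token_rate, VIDEO_FPS)
print(audio_alignment)  # 4
```

With a tokenizer that runs at a different rate, the same division would give a different value, so the attribute should be read as a property of the chosen tokenizer and frame rate rather than a fixed constant.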
We truncate the audio tokens at the end of the sequence so that the audio length matches the number of video frames multiplied by self.audio_alignment; this drops any trailing tokens that don't align with the final video frame. If you check the “Audio Reconstruction Loss” section of our paper, you’ll see more detail on how and why this alignment is performed.
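As a rough sketch of that truncation step (not the actual implementation): the function name `truncate_audio_tokens` and the plain Python list standing in for the token tensor are assumptions made for illustration only.

```python
def truncate_audio_tokens(audio_tokens, num_video_frames, audio_alignment):
    """Hypothetical helper: keep only tokens covered by the video frames."""
    # Each video frame owns `audio_alignment` consecutive audio tokens.
    max_tokens = num_video_frames * audio_alignment
    # Slicing drops trailing tokens past the last frame; a sequence that is
    # already shorter than `max_tokens` is returned unchanged.
    return audio_tokens[:max_tokens]


audio_tokens = list(range(43))  # 43 tokens, e.g. a clip slightly longer than 10 frames
trimmed = truncate_audio_tokens(audio_tokens, num_video_frames=10, audio_alignment=4)
print(len(trimmed))  # 40

# After truncation the tokens can be grouped frame by frame.
per_frame = [trimmed[i:i + 4] for i in range(0, len(trimmed), 4)]
print(per_frame[0])  # [0, 1, 2, 3]
```

Here the three leftover tokens (40, 41, 42) are discarded because they do not fill a complete frame. This sketch only trims the audio side; if the audio were shorter than the video, the video frames would need trimming instead.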
Thank you for your interest, and I hope this helps!