Hi authors,
Thanks for releasing the code for your great work! After reading the paper, I am trying to replicate the simultaneous translation experiment that uses the CIF output as the pre-decision policy.
However, I have a question about the underlying model, wav2vec 2.0. As I understand it, it requires the complete audio input to encode features, since it uses standard (non-causal) convolutions and a Transformer encoder and does not support streaming. How, then, is inference performed for the wav2vec_cif model? From the paper, it seems that no additional training was done to make wav2vec 2.0 support partial input during streaming inference.
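For concreteness, below is a rough sketch of the only evaluation setup I can currently imagine with an unmodified wav2vec 2.0: re-encoding the entire audio prefix received so far at every READ step. This is just my assumption, not code from this repo; I use Hugging Face's Wav2Vec2Model as a stand-in for your encoder, and the chunk size and names are purely illustrative.

```python
# Hypothetical prefix re-encoding with an unmodified wav2vec 2.0 encoder.
# Hugging Face's Wav2Vec2Model is used as a stand-in for the encoder in this
# repo; the CIF module / pre-decision policy is left as a placeholder.
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

def encode_prefix(waveform: torch.Tensor, num_samples: int) -> torch.Tensor:
    """Re-encode the audio prefix received so far (non-causal, full self-attention)."""
    prefix = waveform[:, :num_samples]            # (batch=1, samples)
    with torch.no_grad():
        return model(prefix).last_hidden_state    # (1, frames, hidden)

# Simulated streaming: a new chunk (assumed 320 ms here) arrives at each READ
# action, and the whole prefix is re-encoded from scratch before the CIF-based
# pre-decision is made on the updated encoder states.
sample_rate = 16000
audio = torch.randn(1, 5 * sample_rate)           # dummy 5-second utterance
chunk = int(0.32 * sample_rate)                    # illustrative chunk size
for received in range(chunk, audio.size(1) + 1, chunk):
    states = encode_prefix(audio, received)
    # ... run CIF / the pre-decision policy on `states` here ...
    print(f"received {received} samples -> {states.size(1)} encoder frames")
```

Is this roughly what happens during evaluation, or does it work differently?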
I would greatly appreciate it if you could provide additional details about how evaluation is performed for the streaming case!