
List of publications using Lingvo.

Translation

[1] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean, “Google's neural machine translation system: Bridging the gap between human and machine translation,” tech. rep., 2016. [ pdf ]
[2] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean, “Google's multilingual neural machine translation system: Enabling zero-shot translation,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 339–351, 2017. [ DOI | pdf ]
[3] A. Eriguchi, M. Johnson, O. Firat, H. Kazawa, and W. Macherey, “Zero-shot cross-lingual classification using multilingual neural machine translation,” arXiv preprint arXiv:1809.04686, 2018. [ pdf ]
[4] A. Bapna, M. X. Chen, O. Firat, Y. Cao, and Y. Wu, “Training deeper neural machine translation models with transparent attention,” in Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018. [ pdf ]
[5] C. Cherry, G. Foster, A. Bapna, O. Firat, and W. Macherey, “Revisiting character-based neural machine translation with capacity and compression,” in Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018. [ pdf ]
[6] M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, M. Schuster, N. Shazeer, N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, Z. Chen, Y. Wu, and M. Hughes, “The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation,” in Proc. Annual Meeting of the Association for Computational Linguistics (ACL), 2018. [ pdf ]
[7] J. Kuczmarski and M. Johnson, “Gender-aware natural language translation,” 2018. [ pdf ]
[8] R. Aharoni, M. Johnson, and O. Firat, “Massively multilingual neural machine translation,” 2019. [ pdf ]
[9] J. Luo, Y. Cao, and R. Barzilay, “Neural decipherment via minimum-cost flow: From Ugaritic to Linear B,” 2019. [ http ]
[10] N. Arivazhagan, C. Cherry, W. Macherey, C.-C. Chiu, S. Yavuz, R. Pang, W. Li, and C. Raffel, “Monotonic infinite lookback attention for simultaneous machine translation,” in Proc. Annual Meeting of the Association for Computational Linguistics (ACL), 2019. [ http ]
[11] M. Freitag, I. Caswell, and S. Roy, “APE at scale and its implications on MT evaluation biases,” 2019. [ pdf | http ]
[12] N. Arivazhagan, A. Bapna, O. Firat, D. Lepikhin, M. Johnson, M. Krikun, M. X. Chen, Y. Cao, G. Foster, C. Cherry, W. Macherey, Z. Chen, and Y. Wu, “Massively multilingual neural machine translation in the wild: Findings and challenges,” 2019. [ arXiv | http ]
[13] Y. Huang, Y. Cheng, A. Bapna, O. Firat, M. X. Chen, D. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, and Z. Chen, “GPipe: Efficient training of giant neural networks using pipeline parallelism,” in Advances in Neural Information Processing Systems, 2019. [ http ]

Speech recognition

[1] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, K. Gonina, N. Jaitly, B. Li, J. Chorowski, and M. Bacchiani, “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018. [ pdf ]
[2] S. Toshniwal, T. N. Sainath, R. J. Weiss, B. Li, P. Moreno, E. Weinstein, and K. Rao, “Multilingual speech recognition with a single end-to-end model,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018. [ pdf ]
[3] B. Li, T. N. Sainath, K. Sim, M. Bacchiani, E. Weinstein, P. Nguyen, Z. Chen, Y. Wu, and K. Rao, “Multi-Dialect Speech Recognition With a Single Sequence-to-Sequence Model,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018. [ pdf ]
[4] T. N. Sainath, R. Prabhavalkar, S. Kumar, S. Lee, A. Kannan, D. Rybach, V. Schogol, P. Nguyen, B. Li, Y. Wu, Z. Chen, and C. C. Chiu, “No Need for a Lexicon? Evaluating the Value of the Pronunciation Lexica in End-to-End Models,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018. [ pdf ]
[5] D. Lawson, C. C. Chiu, G. Tucker, C. Raffel, K. Swersky, and N. Jaitly, “Learning hard alignments with variational inference,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018. [ pdf ]
[6] A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar, “An analysis of incorporating an external language model into a sequence-to-sequence model,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018. [ pdf ]
[7] R. Prabhavalkar, T. N. Sainath, Y. Wu, P. Nguyen, Z. Chen, C. C. Chiu, and A. Kannan, “Minimum Word Error Rate Training for Attention-based Sequence-to-sequence Models,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018. [ pdf ]
[8] T. N. Sainath, C. C. Chiu, R. Prabhavalkar, A. Kannan, Y. Wu, P. Nguyen, and Z. Chen, “Improving the Performance of Online Neural Transducer Models,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018. [ pdf ]
[9] C. C. Chiu and C. Raffel, “Monotonic Chunkwise Attention,” in Proc. International Conference on Learning Representations (ICLR), 2018. [ pdf ]
[10] I. Williams, A. Kannan, P. Aleksic, D. Rybach, and T. N. Sainath, “Contextual Speech Recognition in End-to-End Neural Network Systems using Beam Search,” in Proc. Interspeech, 2018. [ pdf ]
[11] C. C. Chiu, A. Tripathi, K. Chou, C. Co, N. Jaitly, D. Jaunzeikare, A. Kannan, P. Nguyen, H. Sak, A. Sankar, J. Tansuwan, N. Wan, Y. Wu, and X. Zhang, “Speech recognition for medical conversations,” in Proc. Interspeech, 2018. [ pdf ]
[12] R. Pang, T. N. Sainath, R. Prabhavalkar, S. Gupta, Y. Wu, S. Zhang, and C. C. Chiu, “Compression of End-to-End Models,” in Proc. Interspeech, 2018. [ pdf ]
[13] S. Toshniwal, A. Kannan, C. C. Chiu, Y. Wu, T. N. Sainath, and K. Livescu, “A comparison of techniques for language model integration in encoder-decoder speech recognition,” in Proc. IEEE Spoken Language Technology Workshop (SLT), 2018. [ pdf ]
[14] G. Pundak, T. N. Sainath, R. Prabhavalkar, A. Kannan, and D. Zhao, “Deep context: End-to-end contextual speech recognition,” in Proc. IEEE Spoken Language Technology Workshop (SLT), 2018. [ pdf ]
[15] B. Li, Y. Zhang, T. N. Sainath, Y. Wu, and W. Chan, “Bytes are all you need: End-to-end multilingual speech recognition and synthesis with bytes,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019. [ pdf ]
[16] J. Guo, T. N. Sainath, and R. J. Weiss, “A spelling correction model for end-to-end speech recognition,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019. [ pdf ]
[17] U. Alon, G. Pundak, and T. N. Sainath, “Contextual speech recognition with difficult negative training examples,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019. [ pdf ]
[18] Y. Qin, N. Carlini, I. Goodfellow, G. Cottrell, and C. Raffel, “Imperceptible, robust, and targeted adversarial examples for automatic speech recognition,” in Proc. International Conference on Machine Learning (ICML), 2019. [ pdf ]
[19] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,” in arXiv, 2019. [ pdf ]
[20] B. Li, T. N. Sainath, R. Pang, and Z. Wu, “Semi-supervised training for end-to-end models via weak distillation,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019. [ pdf ]
[21] S.-Y. Chang, R. Prabhavalkar, Y. He, T. N. Sainath, and G. Simko, “Joint endpointing and decoding with end-to-end models,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019. [ pdf ]
[22] J. Heymann, K. C. Sim, and B. Li, “Improving CTC using stimulated learning for sequence modeling,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019. [ pdf ]
[23] A. Bruguier, R. Prabhavalkar, G. Pundak, and T. N. Sainath, “Phoebe: Pronunciation-aware contextualization for end-to-end speech recognition,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019. [ pdf ]
[24] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. C. Sim, T. Bagby, S.-Y. Chang, K. Rao, and A. Gruenstein, “Streaming end-to-end speech recognition for mobile devices,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019. [ pdf ]
[25] K. Irie, R. Prabhavalkar, A. Kannan, A. Bruguier, D. Rybach, and P. Nguyen, “On the choice of modeling unit for sequence-to-sequence speech recognition,” in Proc. Interspeech, 2019. [ pdf ]
[26] C. Peyser, H. Zhang, T. N. Sainath, and Z. Wu, “Improving Performance of End-to-End ASR on Numeric Sequences,” in Proc. Interspeech, 2019. [ pdf ]
[27] D. Zhao, T. N. Sainath, D. Rybach, D. Bhatia, B. Li, and R. Pang, “Shallow-fusion end-to-end contextual biasing,” in Proc. Interspeech, 2019. [ pdf ]
[28] T. N. Sainath, R. Pang, D. Rybach, Y. He, R. Prabhavalkar, W. Li, M. Visontai, Q. Liang, T. Strohman, Y. Wu, I. McGraw, and C.-C. Chiu, “Two-pass end-to-end speech recognition,” in Proc. Interspeech, 2019. [ pdf ]
[29] C.-C. Chiu, W. Han, Y. Zhang, R. Pang, S. Kishchenko, P. Nguyen, A. Narayanan, H. Liao, S. Zhang, A. Kannan, R. Prabhavalkar, Z. Chen, T. Sainath, and Y. Wu, “A comparison of end-to-end models for long-form speech recognition,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019. [ pdf ]
[30] A. Narayanan, R. Prabhavalkar, C. Chiu, D. Rybach, T. Sainath, and T. Strohman, “Recognizing long-form speech using streaming end-to-end models,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019. [ pdf ]
[31] T. N. Sainath, R. Pang, R. Weiss, Y. He, C.-C. Chiu, and T. Strohman, “An attention-based joint acoustic and text on-device end-to-end model,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020.
[32] Z. Lu, L. Cao, Y. Zhang, C.-C. Chiu, and J. Fan, “Speech sentiment analysis via pre-trained features from end-to-end ASR models,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020.
[33] D. Park, Y. Zhang, C.-C. Chiu, Y. Chen, B. Li, W. Chan, Q. Le, and Y. Wu, “SpecAugment on large scale datasets,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020. [ pdf ]
[34] T. Sainath, Y. He, B. Li, A. Narayanan, R. Pang, A. Bruguier, S.-Y. Chang, W. Li, R. Alvarez, Z. Chen, C.-C. Chiu, D. Garcia, A. Gruenstein, K. Hu, M. Jin, A. Kannan, Q. Liang, I. McGraw, C. Peyser, R. Prabhavalkar, G. Pundak, D. Rybach, Y. Shangguan, Y. Sheth, T. Strohman, M. Visontai, Y. Wu, Y. Zhang, and D. Zhao, “A streaming on-device end-to-end model surpassing server-side conventional model quality and latency,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020.
[35] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Interspeech, 2020. [ pdf ]
[36] W. Han, Z. Zhang, Y. Zhang, J. Yu, C.-C. Chiu, J. Qin, A. Gulati, R. Pang, and Y. Wu, “ContextNet: Improving convolutional neural networks for automatic speech recognition with global context,” in Proc. Interspeech, 2020. [ pdf ]
[37] W. Li, J. Qin, C.-C. Chiu, R. Pang, and Y. He, “Parallel rescoring with transformer for streaming on-device speech recognition,” in Proc. Interspeech, 2020.
[38] D. S. Park, Y. Zhang, Y. Jia, W. Han, C.-C. Chiu, B. Li, Y. Wu, and Q. V. Le, “Improved noisy student training for automatic speech recognition,” in Proc. Interspeech, 2020. [ pdf ]
[39] Y. Zhang, J. Qin, D. S. Park, W. Han, C.-C. Chiu, R. Pang, Q. V. Le, and Y. Wu, “Pushing the limits of semi-supervised learning for automatic speech recognition,” in NeurIPS 2020 Workshop on Self-Supervised Learning for Speech and Audio Processing, 2020. [ pdf ]
[40] C.-C. Chiu, A. Narayanan, W. Han, R. Prabhavalkar, Y. Zhang, N. Jaitly, R. Pang, T. N. Sainath, P. Nguyen, L. Cao, and Y. Wu, “RNN-T models fail to generalize to out-of-domain audio: Causes and solutions,” in Proc. IEEE Spoken Language Technology Workshop (SLT), 2020. [ pdf ]
[41] S. Panchapagesan, D. S. Park, C.-C. Chiu, Y. Shangguan, Q. Liang, and A. Gruenstein, “Efficient knowledge distillation for RNN-transducer models,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021. [ pdf ]
[42] A. Narayanan, T. N. Sainath, R. Pang, J. Yu, C.-C. Chiu, R. Prabhavalkar, E. Variani, and T. Strohman, “Cascaded encoders for unifying streaming and non-streaming ASR,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021. [ pdf ]
[43] B. Li, A. Gulati, J. Yu, T. N. Sainath, C.-C. Chiu, A. Narayanan, S.-Y. Chang, R. Pang, Y. He, J. Qin, W. Han, Q. Liang, Y. Zhang, T. Strohman, and Y. Wu, “A better and faster end-to-end model for streaming ASR,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021. [ pdf ]
[44] T. Doutre, W. Han, M. Ma, Z. Lu, C.-C. Chiu, R. Pang, A. Narayanan, A. Misra, Y. Zhang, and L. Cao, “Improving streaming automatic speech recognition with non-streaming model distillation on unsupervised data,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021. [ pdf ]
[45] T. Doutre, W. Han, C.-C. Chiu, R. Pang, O. Siohan, and L. Cao, “Bridging the gap between streaming and non-streaming ASR systems by distilling ensembles of CTC and RNN-T models,” in Proc. Interspeech, 2021. [ pdf ]
[46] J. Yu, C.-C. Chiu, B. Li, S.-Y. Chang, T. N. Sainath, Y. He, A. Narayanan, W. Han, A. Gulati, Y. Wu, and R. Pang, “FastEmit: Low-latency streaming ASR with sequence-level emission regularization,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021. [ pdf ]
[47] Z. Lu, W. Han, Y. Zhang, and L. Cao, “Exploring targeted universal adversarial perturbations to end-to-end ASR models,” in Proc. Interspeech, 2021. [ pdf ]
[48] Q. Li, Y. Zhang, B. Li, L. Cao, and P. C. Woodland, “Residual energy-based models for end-to-end speech recognition,” in Proc. Interspeech, 2021. [ pdf ]
[49] Q. Li, D. Qiu, Y. Zhang, B. Li, Y. He, P. C. Woodland, L. Cao, and T. Strohman, “Confidence estimation for attention-based sequence-to-sequence models for speech recognition,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021. [ pdf ]

Language understanding

[1] A. Kannan, K. Chen, D. Jaunzeikare, and A. Rajkomar, “Semi-Supervised Learning for Information Extraction from Dialogue,” in Proc. Interspeech, 2018. [ pdf ]
[2] S. Yavuz, C. C. Chiu, P. Nguyen, and Y. Wu, “CaLcs: Continuously Approximating Longest Common Subsequence for Sequence Level Optimization,” in Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018. [ pdf ]
[3] P. Haghani, A. Narayanan, M. Bacchiani, G. Chuang, N. Gaur, P. Moreno, R. Prabhavalkar, Z. Qu, and A. Waters, “From Audio to Semantics: Approaches to End-to-End Spoken Language Understanding,” in Proc. IEEE Spoken Language Technology Workshop (SLT), 2018. [ pdf ]
[4] M. X. Chen, B. N. Lee, G. Bansal, Y. Cao, S. Zhang, J. Lu, J. Tsay, Y. Wang, A. M. Dai, Z. Chen, T. Sohn, and Y. Wu, “Gmail smart compose: Real-time assisted writing,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Association for Computing Machinery, 2019. [ pdf | http ]

Speech synthesis

[1] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018. [ sound examples | pdf ]
[2] J. Chorowski, R. J. Weiss, R. A. Saurous, and S. Bengio, “On using backpropagation for speech texture generation and voice conversion,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018. [ sound examples | pdf ]
[3] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. Lopez-Moreno, and Y. Wu, “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” in Advances in Neural Information Processing Systems, 2018. [ sound examples | pdf ]
[4] W. N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang, “Hierarchical generative modeling for controllable speech synthesis,” in Proc. International Conference on Learning Representations (ICLR), 2019. [ sound examples | pdf ]
[5] W. N. Hsu, Y. Zhang, R. J. Weiss, Y. A. Chung, Y. Wang, Y. Wu, and J. Glass, “Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization,” in NeurIPS 2018 Workshop on Interpretability and Robustness in Audio, Speech, and Language, 2018. [ pdf ]
[6] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “LibriTTS: A corpus derived from LibriSpeech for text-to-speech,” in Proc. Interspeech, 2019. [ data | pdf ]
[7] F. Biadsy, R. J. Weiss, P. Moreno, D. Kanevsky, and Y. Jia, “Parrotron: An end-to-end speech-to-speech conversion model and its applications to hearing-impaired speech and speech separation,” in Proc. Interspeech, 2019. [ sound examples | pdf ]
[8] Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Z. Chen, R. J. Skerry-Ryan, Y. Jia, A. Rosenberg, and B. Ramabhadran, “Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning,” in Proc. Interspeech, 2019. [ sound examples | pdf ]
[9] G. Sun, Y. Zhang, R. J. Weiss, Y. Cao, H. Zen, and Y. Wu, “Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020. [ sound examples | pdf ]
[10] G. Sun, Y. Zhang, R. J. Weiss, Y. Cao, H. Zen, A. Rosenberg, B. Ramabhadran, and Y. Wu, “Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020. [ sound examples | pdf ]
[11] J. Shen, Y. Jia, M. Chrzanowski, Y. Zhang, I. Elias, H. Zen, and Y. Wu, “Non-attentive Tacotron: Robust and controllable neural TTS synthesis including unsupervised duration modeling,” arXiv preprint arXiv:2010.04301, 2020. [ sound examples | pdf ]
[12] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, “WaveGrad: Estimating gradients for waveform generation,” in Proc. International Conference on Learning Representations (ICLR), 2020. [ sound examples | pdf ]
[13] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, N. Dehak, and W. Chan, “WaveGrad 2: Iterative refinement for text-to-speech synthesis,” in Proc. Interspeech, 2021. [ sound examples | pdf ]
[14] I. Elias, H. Zen, J. Shen, Y. Zhang, Y. Jia, R. J. Weiss, and Y. Wu, “Parallel Tacotron: Non-autoregressive and controllable TTS,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 5709–5713, IEEE, 2021. [ sound examples | pdf ]
[15] I. Elias, H. Zen, J. Shen, Y. Zhang, Y. Jia, R. Skerry-Ryan, and Y. Wu, “Parallel Tacotron 2: A non-autoregressive neural TTS model with differentiable duration modeling,” in Proc. Interspeech, 2021. [ sound examples | pdf ]
[16] Y. Jia, H. Zen, J. Shen, Y. Zhang, and Y. Wu, “PnG BERT: Augmented BERT on phonemes and graphemes for neural TTS,” in Proc. Interspeech, 2021. [ sound examples | pdf ]

Speech translation

[1] R. J. Weiss, J. Chorowski, N. Jaitly, Y. Wu, and Z. Chen, “Sequence-to-sequence models can directly translate foreign speech,” in Proc. Interspeech, 2017. [ pdf ]
[2] Y. Jia, M. Johnson, W. Macherey, R. J. Weiss, Y. Cao, C. C. Chiu, N. Ari, S. Laurenzo, and Y. Wu, “Leveraging weakly supervised data to improve end-to-end speech-to-text translation,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019. [ pdf ]
[3] Y. Jia, R. J. Weiss, F. Biadsy, W. Macherey, M. Johnson, Z. Chen, and Y. Wu, “Direct speech-to-speech translation with a sequence-to-sequence model,” in Proc. Interspeech, 2019. [ sound examples | pdf ]
[4] Y. Jia, M. T. Ramanovich, T. Remez, and R. Pomerantz, “Translatotron 2: Robust direct speech-to-speech translation,” arXiv preprint arXiv:2107.08661, 2021. [ sound examples | pdf ]

Speech enhancement

[1] S. Ding, Y. Jia, K. Hu, and Q. Wang, “Textual echo cancellation,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021. [ sound examples | pdf ]
[2] A. Narayanan, C.-C. Chiu, T. O'Malley, Q. Wang, and Y. He, “Cross-attention conformer for context modeling in speech enhancement for ASR,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021. [ pdf ]
[3] T. O'Malley, A. Narayanan, Q. Wang, A. Park, J. Walker, and N. Howard, “A conformer-based ASR frontend for joint acoustic echo cancellation, speech enhancement and speech separation,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021. [ pdf ]

Speaker and language recognition

[1] Q. Wang, Y. Yu, J. Pelecanos, Y. Huang, and I. L. Moreno, “Attentive temporal pooling for conformer-based streaming language identification in long-form speech,” 2022. [ pdf ]
[2] J. Pelecanos, Q. Wang, Y. Huang, and I. L. Moreno, “Parameter-free attentive scoring for speaker verification,” 2022. [ code | pdf ]
[3] W. Xia, H. Lu, Q. Wang, A. Tripathi, Y. Huang, I. L. Moreno, and H. Sak, “Turn-to-Diarize: Online speaker diarization constrained by transformer transducer speaker turn detection,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022. [ code | pdf ]

Optimization

[1] R. Anil, V. Gupta, T. Koren, K. Regan, and Y. Singer, “Second order optimization made practical,” arXiv preprint arXiv:2002.09018, 2020. [ pdf ]
[2] N. Agarwal, R. Anil, E. Hazan, T. Koren, and C. Zhang, “Disentangling adaptive gradient methods from learning rates,” arXiv preprint arXiv:2002.11803, 2020. [ pdf ]
[3] R. Anil, V. Gupta, T. Koren, and Y. Singer, “Memory efficient adaptive optimization,” in Advances in Neural Information Processing Systems, pp. 9749–9758, 2019. [ pdf ]