- This is a college course project on diacritic restoration in Vietnamese text using deep-learning-based models.
- Techniques applied:
  - Word tokenization
  - Models: RNN-LSTM, GRU, and bidirectional RNN
  - Metrics: BLEU score and accuracy
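As a rough illustration of the accuracy metric, the sketch below computes word-level accuracy between a restored sentence and its reference. It assumes whitespace tokenization and position-by-position comparison, which is reasonable here because restoring diacritics does not change the number of words; the function name `word_accuracy` is hypothetical and this is not necessarily how the notebooks compute the score.

```python
def word_accuracy(predicted: str, reference: str) -> float:
    """Fraction of reference words reproduced exactly at the same position.

    Assumption: both strings are tokenized on whitespace and aligned by index.
    """
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    if not ref_tokens:
        return 0.0
    # zip stops at the shorter sequence, so missing words count as errors.
    matches = sum(p == r for p, r in zip(pred_tokens, ref_tokens))
    return matches / len(ref_tokens)


if __name__ == "__main__":
    # One of three words is missing its diacritics.
    print(word_accuracy("tôi đi hoc", "tôi đi học"))
```

Averaging this value over all test sentences gives a corpus-level accuracy; BLEU would normally come from an existing library rather than being reimplemented.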
- Dataset: A Large-scale Vietnamese News Text Classification Corpus
  - This dataset was used in the following paper: Cong Duy Vu Hoang, Dien Dinh, Le Nguyen Nguyen, Quoc Hung Ngo. A Comparative Study on Vietnamese Text Classification Methods. In Proceedings of IEEE International Conference on Research, Innovation and Vision for the Future (RIVF 2007) (long), 2007.
  - The source data is split into single sentences, giving a dataset of 500,000 data points.
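The sentence-level data preparation can be sketched as follows. This is a minimal example under stated assumptions: the punctuation-based splitting rule and the helper names `strip_diacritics`, `split_sentences`, and `make_pairs` are hypothetical, and building the model input by stripping diacritics from each original sentence is a common setup for this task rather than a confirmed detail of this repo.

```python
import re
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese diacritics, e.g. 'Tiếng Việt' becomes 'Tieng Viet'."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category Mn).
    base = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # 'đ' and 'Đ' have no Unicode decomposition, so map them explicitly.
    return base.replace("đ", "d").replace("Đ", "D")


def split_sentences(document: str) -> list[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def make_pairs(document: str) -> list[tuple[str, str]]:
    """Return (input without diacritics, target with diacritics) per sentence."""
    return [(strip_diacritics(s), s) for s in split_sentences(document)]


if __name__ == "__main__":
    for source, target in make_pairs("Tôi đi học. Trời mưa!"):
        print(source, "->", target)
```

Each pair gives the model an unaccented input sentence and the original sentence as the target to restore.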
- Feature extraction, model training, and the other steps in this repo are implemented in Google Colab.
- All code is organized in `name.ipynb` files.
- All references are cited in the report file.
@INPROCEEDINGS{9530818,
author={Tran, Quang-Linh and Lam, Gia-Huy and Duong, Van-Binh and Do, Trong-Hop},
booktitle={2021 IEEE International Conference on Communication, Networks and Satellite (COMNETSAT)},
title={A Study on Diacritic Restoration Problem in Vietnamese Text using Deep Learning based Models},
year={2021}, volume={}, number={}, pages={306-310}, doi={10.1109/COMNETSAT53002.2021.9530818}
}