An implementation of Dual Learning for Machine Translation and Joint Training for Neural Machine Translation Models with Monolingual Data in TensorFlow.
This project depends heavily on Nematus v0.3.
Nematus requires the following packages:
- Python >= 2.7 (after 2018-12-20, the master branch of Nematus is based on Python 3.5.)
- TensorFlow (I use 1.4.0.)
See the link above for more details about Nematus.
I use KenLM as the language model:
- KenLM (pip install https://github.com/kpu/kenlm/archive/master.zip)
It seems you need to compile KenLM from source to get the binary executables. See the link above for more details about KenLM.
The language-model-related code is independent, so you can plug in any other language model as long as it can score a sentence.
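For instance, with the pip-installed KenLM Python bindings, scoring a sentence looks like the sketch below (the model path is a placeholder):

```python
import kenlm

# Load a trained model (ARPA or KenLM binary format); the path is a placeholder.
lm = kenlm.Model('lm/train.en.blm')

# score() returns the log10 probability of the whole sentence,
# with begin-/end-of-sentence markers included by default.
print(lm.score('this is a test sentence', bos=True, eos=True))
```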
You need to prepare the following data and models:
- A small parallel dataset for the two languages.
- A large monolingual dataset for each of the two languages.
- Two NMT models (one per direction), trained with Nematus on the small parallel dataset.
- Two language models, trained with the script /LM/train_lm.py; you need to set KENLM_PATH and TEMP_DIR inside it (see the sketch after this list).
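For reference, below is a minimal sketch of training a KenLM model by calling the compiled KenLM executables directly; the paths and file names are placeholders, and this illustrates the general KenLM workflow rather than the repository's /LM/train_lm.py:

```python
import os
import subprocess

KENLM_PATH = '/path/to/kenlm/build/bin'  # placeholder: where the compiled KenLM binaries live

def train_kenlm(corpus, arpa='lm.arpa', binary='lm.blm', order=3):
    """Estimate an n-gram model with lmplz, then binarize it for fast loading."""
    with open(corpus) as fin, open(arpa, 'w') as fout:
        subprocess.check_call(
            [os.path.join(KENLM_PATH, 'lmplz'), '-o', str(order)],
            stdin=fin, stdout=fout)
    subprocess.check_call([os.path.join(KENLM_PATH, 'build_binary'), arpa, binary])

train_kenlm('data/mono.en')  # hypothetical monolingual corpus file
```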
I preprocessed the datasets into subword units.
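As a hedged example, applying learned BPE codes with subword-nmt could look like this (using subword-nmt is an assumption; any subword segmentation works, and the codes path is a placeholder):

```python
from subword_nmt.apply_bpe import BPE

# Codes file produced beforehand by subword-nmt's learn_bpe; path is a placeholder.
with open('data/bpe.codes') as codes:
    bpe = BPE(codes)

# Segments a tokenized sentence into subword units (joined with '@@' by default).
print(bpe.process_line('a tokenized sentence to segment'))
```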
Then set the parameters in /test/test_train_dual.sh, especially:
- LdataPath
- SdataPath
- modelDir
- LMDir
Their meanings follow from their names. You can also write your own training script; the newly added configuration options for dual learning are:
Parameter | Description |
---|---|
--dual | activate dual learning or joint training |
--para | also use the parallel dataset during dual learning |
--reinforce | activate dual learning |
--alpha | weight of the LM score in dual learning |
--joint | activate joint training |
--model_rev / --saveto_rev | file name of the reverse model |
--reload_rev | load an existing reverse model from this path; set to "latest_checkpoint" to reload the latest checkpoint in the same directory as --model |
--source_lm | language model (source) |
--target_lm | language model (target) |
--lms | language models (one for source, one for target) |
--source_dataset_mono | monolingual training corpus (source) |
--target_dataset_mono | monolingual training corpus (target) |
--datasets_mono | monolingual training corpora (one for source, one for target) |
To replicate the paper Dual Learning for Machine Translation, add --reinforce. To replicate the paper Joint Training for Neural Machine Translation Models with Monolingual Data, add --joint.
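Below is a hypothetical sketch of such an invocation, assembled from the flags above; the entry point, the standard Nematus flags, and all paths are assumptions, so consult /test/test_train_dual.sh for the real command:

```python
# Hypothetical training invocation assembled in Python; the entry point
# (nematus/nmt.py), the file paths, and the flag values are assumptions.
import subprocess

subprocess.check_call([
    'python', 'nematus/nmt.py',              # assumed Nematus v0.3 training entry point
    '--model', 'models/model-ende.npz',      # forward model
    '--model_rev', 'models/model-deen.npz',  # reverse model (added by this project)
    '--reload_rev', 'latest_checkpoint',
    '--datasets_mono', 'data/mono.bpe.en', 'data/mono.bpe.de',
    '--lms', 'lm/en.blm', 'lm/de.blm',
    '--dual',
    '--reinforce',                           # swap for '--joint' for joint training
    '--alpha', '0.005',
])
```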
I randomly picked 400,000 parallel sentence pairs from the Europarl German-English corpus and divided them into three parts (a splitting sketch follows this list):
- 80,000 pairs used as the parallel corpus.
- 300,000 pairs treated as the monolingual corpora.
- 20,000 pairs used as the validation set.
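A minimal sketch of such a split, assuming aligned one-sentence-per-line files with placeholder names:

```python
import random

# Read the aligned corpus; line i of each file forms one sentence pair.
# File names are placeholders for the Europarl German-English corpus.
with open('europarl.de') as f_de, open('europarl.en') as f_en:
    pairs = list(zip(f_de, f_en))

random.seed(1234)
random.shuffle(pairs)

splits = {
    'parallel': pairs[:80000],      # small parallel corpus
    'mono': pairs[80000:380000],    # treated as monolingual data
    'valid': pairs[380000:400000],  # validation set
}
for name, subset in splits.items():
    with open(name + '.de', 'w') as f_de, open(name + '.en', 'w') as f_en:
        for de, en in subset:
            f_de.write(de)
            f_en.write(en)
```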
Then I trained a pair of initial models on the 80,000 parallel pairs for 35 epochs.
The results of dual learning aren't good yet; I will push them later.
The joint training works well:
Model | Original | Epoch1 | Epoch2 | Epoch3 | Epoch4 | Epoch5 | Epoch6 | Epoch7 | Epoch8 | Epoch9 | Epoch10 | Epoch11 | Epoch12 | Epoch13 | Epoch14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
EN-DE | 3.5025 | 3.3009 | 3.6395 | 5.6207 | 9.0302 | 11.0943 | 12.5482 | 13.4416 | 14.0149 | 14.5954 | 14.8751 | 14.9155 | 15.0892 | 15.0941 | 15.1386 |
DE-EN | 4.8898 | 5.2038 | 6.3034 | 8.6047 | 13.0508 | 16.3928 | 18.4444 | 19.6504 | 20.3632 | 20.7215 | 21.0472 | 21.3191 | 21.6728 | 21.8632 | 22.0694 |
- Basically replicates the paper Dual Learning for Machine Translation.
- Bug fixes.
- Bug fixes and improved info display.
- Basically replicates the paper Joint Training for Neural Machine Translation Models with Monolingual Data.
- Bug fixes.
- Improved the efficiency of creating the fake corpus.
- Bug fixes.