Update the training guide #239

Merged
merged 31 commits into from
Nov 6, 2023
31 commits
ec14ae5
Update training guide
eu9ene Oct 31, 2023
b1b8bef
Merge branch 'main' into training_guide
eu9ene Oct 31, 2023
074163c
Fix docs
eu9ene Oct 31, 2023
7598359
Merge branch 'main' into training_guide
eu9ene Oct 31, 2023
01b5fc9
Add index file
eu9ene Oct 31, 2023
015f385
Merge remote-tracking branch 'origin/training_guide' into training_guide
eu9ene Oct 31, 2023
839cd84
Remove header
eu9ene Oct 31, 2023
5cbbf11
Fix docs link
eu9ene Nov 1, 2023
8a31214
Remove tensorboard section
eu9ene Nov 1, 2023
f7cb176
Add theme
eu9ene Nov 1, 2023
0fde7dc
Update navigation
eu9ene Nov 1, 2023
df4e819
Add logo
eu9ene Nov 1, 2023
0a2ae9b
Use absolute links
eu9ene Nov 1, 2023
7f16f62
Fix code links
eu9ene Nov 1, 2023
982df97
Fix code links
eu9ene Nov 1, 2023
157d679
Fix link
eu9ene Nov 2, 2023
37a5e28
Clarify what config is
eu9ene Nov 3, 2023
d63e50f
Fix note for bicleaner
eu9ene Nov 3, 2023
189fb00
Fix typo
eu9ene Nov 3, 2023
d605b77
Fix link
eu9ene Nov 3, 2023
8761ba5
Fix mentioning of Marian
eu9ene Nov 3, 2023
4558e30
Remove "my"
eu9ene Nov 3, 2023
416f799
Make note about snakemake more visible
eu9ene Nov 3, 2023
28c380c
Fix phrasing
eu9ene Nov 3, 2023
4a9803b
Add link to bilceaner paper
eu9ene Nov 4, 2023
7c31e16
Add clarifications
eu9ene Nov 4, 2023
d6c1693
Add links to default training configs
eu9ene Nov 4, 2023
68cb740
Add reference to bilceaner section
eu9ene Nov 4, 2023
9baa0aa
Small fixes
eu9ene Nov 4, 2023
da6a64b
Merge branch 'main' into training_guide
eu9ene Nov 4, 2023
4d15553
Merge branch 'main' into training_guide
eu9ene Nov 6, 2023
4 changes: 2 additions & 2 deletions Makefile
@@ -119,13 +119,13 @@ dag:
################################################

# OpusCleaner is a data cleaner for training corpus
# More details are in docs/opus-cleaner.md
# More details are in docs/cleaning.md
opuscleaner-ui:
poetry install --only opuscleaner
opuscleaner-server serve --host=0.0.0.0 --port=8000

# Utils to find corpus etc
install utils:
install-utils:
poetry install --only utils

# Black is a code formatter for Python files. Running this command will check that
2 changes: 1 addition & 1 deletion README.md
@@ -7,7 +7,7 @@ power the Firefox web page translation starting with version 118.

The pipeline was originally developed as a part of [Bergamot](https://browser.mt/) project that focuses on improving client-side machine translation in a web browser.

[Documentation](/docs)
[Documentation](https://mozilla.github.io/firefox-translations-training/)

## Pipeline

12 changes: 12 additions & 0 deletions docs/_config.yml
@@ -0,0 +1,12 @@
remote_theme: just-the-docs/just-the-docs
#color_scheme: dark
title: Firefox Translations Training
description: Documentation for the Firefox Translations training pipelines
heading_anchors: true
# doesn't work
favicon_ico: "img/logo.svg"
# Aux links for the upper right navigation
aux_links:
"GitHub":
- "https://github.com/mozilla/firefox-translations-training"

84 changes: 84 additions & 0 deletions docs/cleaning.md
@@ -0,0 +1,84 @@
---
layout: default
title: Data cleaning
nav_order: 5
---

# Data cleaning

Making datasets less noisy to improve quality of translation.

## Regular pipeline


Config setting:
```
use-opuscleaner: false
```

### Dataset fixing

Some datasets require fixes like detokenization.
Dataset and language specific fixes are implemented in [https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/fixes](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/fixes).
Naming convention:
- `<dataset_name>.sh` for parallel dataset cleaning
- `<dataset_name>.<lang>.sh` for language specific cleaning of parallel or monolingual dataset
- `/` in dataset name should be replaced with `_`
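
The naming convention above can be sketched in Python. This is a minimal illustration only: `fix_script_name` is a hypothetical helper, not a function from the pipeline, and it assumes the dataset name is passed exactly as it should appear in the file name apart from the `/` replacement.

```python
from typing import Optional


def fix_script_name(dataset_name: str, lang: Optional[str] = None) -> str:
    """Build the expected fix-script file name for a dataset (hypothetical helper)."""
    # "/" in the dataset name is replaced with "_"
    safe_name = dataset_name.replace("/", "_")
    if lang is None:
        # Fix for a parallel dataset: <dataset_name>.sh
        return f"{safe_name}.sh"
    # Language-specific fix: <dataset_name>.<lang>.sh
    return f"{safe_name}.{lang}.sh"
```

For example, `fix_script_name("ParaCrawl/v8", "ru")` yields `ParaCrawl_v8.ru.sh`.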

### Cleaning scripts

Make sure the language is present in the [clean_parallel](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/tools/clean_parallel.py#L19) script.


### Bicleaner

It is recommended to use Bicleaner ML models to filter noisy data.
See more details on how to configure it in the [Model training guide, Bicleaner section](training-guide.md/#bicleaner).


## OpusCleaner

Another option is to use an all-in-one cleaning tool, [OpusCleaner](https://github.com/hplt-project/OpusCleaner), by the HPLT project.

Config setting:
```
use-opuscleaner: true
```

## Custom filter configs
The idea behind the OpusCleaner is customizing filter rules for each language pair and dataset
to get a training corpus with less noise and train higher quality translation models.

Filtering rules can be tuned in an interactive UI.

### Installation

Install the OpusCleaner UI on a server.
See the installation instructions in the [OpusCleaner readme](https://github.com/hplt-project/OpusCleaner).

For local usage, run `make opuscleaner-ui` from a poetry shell.
Then go to `http://0.0.0.0:8000`.

### Making filters

Choose a language pair and download the required OPUS datasets.
They will correspond to `opus_...` training datasets in the training pipeline config.

Configure cleaning rules for the datasets in the UI.

Copy JSON files for the produced filters `data/train-parts/*.filter.json` to
`pipeline/clean/opuscleaner/configs/<src-lang-code>-<trg-lang-code>/`.
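
The copy step can also be scripted. The sketch below is an assumption-laden illustration rather than part of the pipeline: `copy_filters` is a hypothetical helper, `data_dir` is assumed to be the OpusCleaner `data` directory, and `repo_root` is assumed to be a local checkout of this repository.

```python
import shutil
from pathlib import Path
from typing import List


def copy_filters(data_dir: Path, repo_root: Path, src: str, trg: str) -> List[Path]:
    """Copy produced *.filter.json files into the language-pair config directory."""
    source_dir = data_dir / "train-parts"
    target_dir = (
        repo_root / "pipeline" / "clean" / "opuscleaner" / "configs" / f"{src}-{trg}"
    )
    # Create <src-lang-code>-<trg-lang-code>/ if it does not exist yet
    target_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    for filter_file in sorted(source_dir.glob("*.filter.json")):
        # shutil.copy2 returns the path of the newly created file
        copied.append(Path(shutil.copy2(filter_file, target_dir)))
    return copied
```

Calling `copy_filters(Path("data"), Path("."), "en", "ru")` from the repository root would place the filters under `pipeline/clean/opuscleaner/configs/en-ru/`.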

### Default config

If no custom config was specified for the dataset,
the [default config template](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/opuscleaner/configs/default.filters.json) will be used.

Modify if needed. Some rules require specifying source or target language.
The `<src>` and `<trg>` in the template will be automatically replaced with the trained language pair.
The generated default config will be copied to the target dataset cleaning directory.
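
The placeholder substitution can be pictured with a short Python sketch. It assumes plain string replacement of `<src>` and `<trg>`; the pipeline's actual implementation may differ, and `render_default_config` and the template fragment in the usage note are hypothetical.

```python
def render_default_config(template_text: str, src: str, trg: str) -> str:
    """Fill the <src> and <trg> placeholders of a filter config template."""
    # Plain textual replacement; the JSON structure of the template is left untouched
    return template_text.replace("<src>", src).replace("<trg>", trg)
```

With a made-up template fragment such as `{"lang1": "<src>", "lang2": "<trg>"}`, calling `render_default_config(fragment, "en", "ru")` produces `{"lang1": "en", "lang2": "ru"}`.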

### Running

Enable OpusCleaner in the training pipeline config and run the pipeline as usual.
OpusCleaner will replace the default [clean-corpus](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/clean-corpus.sh) script.
49 changes: 10 additions & 39 deletions docs/data.md
@@ -1,10 +1,12 @@
# Data
---
layout: default
title: Datasets
nav_order: 4
---

This section includes instructions on how to find and configure datasets and cleaning procedures.
# Dataset importers

## Dataset importers

Dataset importers can be used in `datasets` sections of the [training config](/configs/config.test.yml).
Dataset importers can be used in `datasets` sections of the [training config](https://github.com/mozilla/firefox-translations-training/tree/main/configs/config.test.yml).

Example:
```
@@ -25,7 +27,7 @@ Custom parallel | custom-corpus | /tmp/test-corpus | corpus | Custom parallel da
[Common crawl](https://commoncrawl.org/) | commoncrawl | wmt16 | mono | Huge web crawl datasets. The links are posted on [WMT21](https://www.statmt.org/wmt21/translation-task.html)
Custom mono | custom-mono | /tmp/test-mono | mono | Custom monolingual dataset that is already downloaded to a local disk. The dataset name is an absolute path prefix without ".lang.gz"

You can also use [find-corpus](/pipeline/utils/find-corpus.py) tool to find all datasets for an importer and get them formatted to use in config.
You can also use [find-corpus](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/utils/find-corpus.py) tool to find all datasets for an importer and get them formatted to use in config.

Set up a local [poetry](https://python-poetry.org/) environment.
```
@@ -36,38 +38,7 @@ python utils/find-corpus.py en ru sacrebleu
```
Make sure to check licenses of the datasets before using them.

### Adding a new importer
## Adding a new importer

Just add a shell script to [corpus](/pipeline/data/importers/corpus) or [mono](/pipeline/data/importers/mono) which is named as `<prefix>.sh`
Just add a shell script to [corpus](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/data/importers/corpus) or [mono](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/data/importers/mono) which is named as `<prefix>.sh`
and accepts the same parameters as the other scripts from the same folder.

## Dataset fixing

Some datasets require fixes like detokenization. Dataset and language specific fixes are implemented in [pipeline/clean/fixes](/pipeline/clean/fixes).
Naming convention:
- `<dataset_name>.sh` for parallel dataset cleaning
- `<dataset_name>.<lang>.sh` for language specific cleaning of parallel or monolingual dataset
- `/` in dataset name should be replaced with `_`

## Dataset cleaning
Some parallel datasets require more aggressive filtering.
Dataset specific Bicleaner thresholds can be set in config.
`0` means skipping filtering entirely (useful for Paracrawl).

Example:

```
experiment:
...
bicleaner:
default-threshold: 0.5
dataset-thresholds:
opus_ParaCrawl/v8: 0
mtdata_neulab_tedtalksv1_train: 0.6
```
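
The threshold lookup described above (a per-dataset value falling back to the default, with `0` disabling filtering) might look like the following sketch. Both function names are hypothetical, and the config is assumed to be already parsed into plain Python values.

```python
from typing import Dict, Optional


def bicleaner_threshold(
    dataset: str,
    default_threshold: float,
    dataset_thresholds: Optional[Dict[str, float]] = None,
) -> float:
    """Return the dataset-specific threshold, or the default if none is set."""
    thresholds = dataset_thresholds or {}
    # dict.get keeps an explicit 0 instead of falling back to the default
    return thresholds.get(dataset, default_threshold)


def should_filter(threshold: float) -> bool:
    """A threshold of 0 means Bicleaner filtering is skipped entirely."""
    return threshold != 0
```

With the example config, `opus_ParaCrawl/v8` resolves to `0` and is not filtered, while a dataset without an entry resolves to `0.5`.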

### OpusCleaner

Another option is to use an all-in-one cleaning tool [OpusCleaner](https://github.com/hplt-project/OpusCleaner) by HPLT project.

See more details in the [dedicated doc](opus-cleaner.md).
6 changes: 6 additions & 0 deletions docs/development.md
@@ -1,3 +1,9 @@
---
layout: default
title: Development
nav_order: 7
---

# Development

## Architecture
4 changes: 4 additions & 0 deletions docs/img/logo.svg
38 changes: 38 additions & 0 deletions docs/index.md
@@ -0,0 +1,38 @@
---
layout: default
title: Home
nav_order: 1
description: "Firefox Translations Training documentation."
permalink: /
---

# Firefox Translations training
Training pipelines for Firefox Translations machine translation models.

The trained models are hosted in the [firefox-translations-models](https://github.com/mozilla/firefox-translations-models/) repository,
are compatible with [bergamot-translator](https://github.com/mozilla/bergamot-translator), and
power the Firefox web page translation starting with version 118.

The pipeline was originally developed as a part of [Bergamot](https://browser.mt/) project that focuses on improving client-side machine translation in a web browser.

## Training pipeline

The pipeline is capable of training a translation model for a language pair end to end.
Translation quality depends on the chosen datasets, data cleaning procedures and hyperparameters.
Some settings, especially for low-resource languages, might require extra tuning.

We use [Marian](https://marian-nmt.github.io), the fast neural machine translation engine.

## Learning resources

- High level overview [post on Mozilla Hacks](https://hacks.mozilla.org/2022/06/training-efficient-neural-network-models-for-firefox-translations/)
- [Model training guide](training-guide.md) - practical advice on how to use the pipeline
- [Reference papers](references.md)


## Acknowledgements
This project uses materials developed by:
- Bergamot project ([github](https://github.com/browsermt), [website](https://browser.mt/)) that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825303
- HPLT project ([github](https://github.com/hplt-project), [website](https://hplt-project.org/)) that has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546]
- OPUS-MT project ([github](https://github.com/Helsinki-NLP/Opus-MT), [website](https://opus.nlpl.eu/))
- Many other open source projects and research papers (see [References](references.md))
47 changes: 0 additions & 47 deletions docs/opus-cleaner.md

This file was deleted.

21 changes: 21 additions & 0 deletions docs/orchestrators.md
@@ -0,0 +1,21 @@
---
layout: default
title: Orchestrators
nav_order: 6
has_children: true
has_toc: false
---

# Orchestrators

An orchestrator is responsible for workflow management and parallelization.

Supported orchestrators:

- [Taskcluster](https://taskcluster.net/) - Mozilla task execution framework. It is also used for Firefox CI.
It provides access to the hybrid cloud workers (GCP + on-prem) with increased scalability and observability.
[Usage instructions](task-cluster.md).
- [Snakemake](https://snakemake.github.io/) - a file based orchestrator that can be used to run the pipeline locally or on a Slurm cluster.
[Usage instructions](snakemake.md).

Mozilla is currently switching to Taskcluster and the Snakemake workflow will be less actively maintained in the future.
11 changes: 8 additions & 3 deletions docs/pipeline-steps.md
@@ -1,3 +1,8 @@
---
layout: default
title: Pipeline steps
nav_order: 3
---

# Pipeline steps

@@ -10,14 +15,14 @@ Step | Description | Bottleneck | Comments
--- | --- | --- | ---
Installation | Installing dependencies and compiling | CPU | Takes ~1 hour
Data downloading | Downloads datasets, samples sentences | Network, Disk | Time depends on dataset size, sampling of huge mono datasets (100M+ sentences) is the most intensive operation.
Data cleaning | Basic preprocessing, dataset specific, language specific, rule based and other attempts to clean noisy data in parallel and mono datasets | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient add it to [clean_parallel.py](/pipeline/clean/tools/clean_parallel.py).
Data cleaning | Basic preprocessing, dataset specific, language specific, rule based and other attempts to clean noisy data in parallel and mono datasets | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient add it to [clean_parallel.py](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/tools/clean_parallel.py).
Bicleaner | Filters noisy sentence pairs in a parallel corpus using [bicleaner](https://github.com/bitextor/bicleaner) or [bicleaner-ai](https://github.com/bitextor/bicleaner-ai) depending on available language packs. | CPU, GPU | If there are no pretrained language packs for bicleaner-ai, it uses bicleaner. If there are none for bicleaner either, this step is skipped. Cleaning thresholds are configurable per dataset, see [Dataset cleaning](##Dataset cleaning).
Merge and dedupe | Merges clean dataset and applies deduplication | CPU, Disk |
Training vocabulary | Trains [SentencePiece](https://github.com/google/sentencepiece) vocabulary/tokenizer model on parallel corpus. | CPU |
Training s2s | Trains a backward shallow s2s model, which is useful for back-translations and ce-filtering | GPU | Inspired by a [marian example](https://github.com/marian-nmt/marian-examples/tree/master/training-basics-sentencepiece).
Augmentation with back-translations | Translates mono corpus combined from monolingual datasets in target language using shallow s2s model. | GPU | It is more useful for low-resource languages and can be skipped for others.
Training teacher | Trains an ensemble of big transformer models on augmented dataset | GPU | You might want to adjust [early stopping](/pipeline/train/configs/training/teacher.train.yml) or `after-epochs` parameters depending on datasets size.
Fine-tuning teacher | Continue training an ensemble of teachers on parallel data only | GPU | You might want to adjust [early stopping](/pipeline/train/configs/training/teacher.train.yml) parameters depending on datasets size.
Training teacher | Trains an ensemble of big transformer models on augmented dataset | GPU | You might want to adjust [early stopping](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/configs/training/teacher.train.yml) or `after-epochs` parameters depending on datasets size.
Fine-tuning teacher | Continue training an ensemble of teachers on parallel data only | GPU | You might want to adjust [early stopping](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/configs/training/teacher.train.yml) parameters depending on datasets size.
Translation by teacher | Translates a corpus and monolingual data combined from configurable `dataset.mono-src` using the ensemble of teacher models | GPU | The slowest part of the pipeline. Can take days. It is possible to speed it up by using multiple nodes in cluster mode.
Cross-entropy filtering | Scores translated corpus with backward s2s model and removes a part of the corpus with the lowest scores to reduce noise | GPU, CPU, Disk | At this point we work with huge datasets. Very disk intensive.
Training alignments and shortlist | Trains alignments using [fast_align](https://github.com/clab/fast_align) and extracts lexical shortlist using [extract_lex](https://github.com/marian-nmt/extract-lex) tool | CPU, Disk | Some tools require uncompressed datasets on disk and they are huge at this point. Good CPU parallelization.
9 changes: 8 additions & 1 deletion docs/references.md
@@ -1,3 +1,9 @@
---
layout: default
title: References
nav_order: 8
---

# References

Here is a list of selected publications on which the training pipeline is based.
@@ -15,7 +21,6 @@ Lisboa, Portugal: European Association for Machine Translation, November 2020

3. Mölder F, Jablonski KP, Letcher B, et al. [Sustainable data analysis with Snakemake](https://pubmed.ncbi.nlm.nih.gov/34035898/). F1000Res. 2021;10:33. Published 2021 Jan 18. doi:10.12688/f1000research.29032.2


4. [Edinburgh’s Submissions to the 2020 Machine Translation Efficiency Task](https://aclanthology.org/2020.ngt-1.26) (Bogoychev et al., NGT 2020)

5. [From Research to Production and Back: Ludicrously Fast Neural Machine Translation](https://aclanthology.org/D19-5632) (Kim et al., EMNLP 2019)
@@ -32,3 +37,5 @@ Lisboa, Portugal: European Association for Machine Translation, November 2020
14. Chris Dyer, Victor Chahuneau, and Noah A. Smith. (2013). [A Simple, Fast, and Effective Reparameterization of IBM Model 2](http://www.ark.cs.cmu.edu/cdyer/fast_valign.pdf). In Proc. of NAACL.
15. [Neural Machine Translation of Rare Words with Subword Units](https://aclanthology.org/P16-1162) (Sennrich et al., ACL 2016)
16. [Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates](https://arxiv.org/abs/1804.10959) (Taku Kudo, 2018)
17. [Bicleaner AI: Bicleaner Goes Neural](https://aclanthology.org/2022.lrec-1.87.pdf) (Zaragoza-Bernabeu et al., LREC 2022)
18. [Sequence-Level Knowledge Distillation](https://arxiv.org/abs/1606.07947) (Yoon Kim, Alexander M. Rush, EMNLP 2016)
20 changes: 7 additions & 13 deletions docs/snakemake.md
@@ -1,3 +1,10 @@
---
layout: default
title: Snakemake
nav_order: 2
parent: Orchestrators
---

# Snakemake

This section includes the instructions on how to run the pipeline
@@ -284,16 +291,3 @@ The main directories inside `SHARED_ROOT` are:
│ └ ru-en
│ └ test
│ └ clean_corpus.log


## Utilities

### Tensorboard

To see training graphs run tensorboard:

```
make install-tensorboard
make tensorboard
```
Then port forward 6006.