This is a simple Python implementation of Unigram Tokenization.
- Use a Byte Pair Encoding (BPE) tokenizer to create an arbitrarily large vocabulary $\mathcal{V}$.
- Let the distribution over tokens be denoted as $p(x_i) = \frac{count(x_i)}{\sum_{j=1}^{N} count(x_j)}$.
- Use the hard EM algorithm to estimate the distribution $p(x_i)$, repeating these steps until convergence (a sketch of this loop follows after the list):
  - Employ the Viterbi algorithm to find the best tokenization $\mathcal{T}$.
  - Fix the best tokenization $\mathcal{T}$ and maximize the likelihood: $$P(X) = \prod_{i=1}^{N_{\mathcal{T}}} p(x_i)$$
- Shrink the vocabulary $\mathcal{V}$ by a factor $\alpha$ (a pruning sketch also follows below):
  - Calculate the loss in likelihood if token $x_i$ is removed and replaced with the Viterbi path over the remaining tokens.
  - Sort tokens by loss.
  - Shrink the vocabulary so that $|\mathcal{V}_{new}| = (1 - \alpha)|\mathcal{V}_{old}|$.
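The E-step of this hard EM loop is a Viterbi search over all ways to split a word into in-vocabulary pieces, and the M-step re-estimates $p(x_i)$ from the counts of the chosen pieces. Below is a minimal sketch of both steps; the function names (`viterbi_segment`, `hard_em`), the word-level corpus, and the uniform initialization are illustrative assumptions, not the exact code from unigram.ipynb.

```python
import math
from collections import Counter

def viterbi_segment(word, logp):
    """Return the most probable segmentation of `word` and its log-probability,
    given per-token log-probabilities `logp` (dict: token -> log p(token))."""
    n = len(word)
    best = [(-math.inf, 0)] * (n + 1)      # best[end] = (best log-prob, start of last token)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp and best[start][0] + logp[piece] > best[end][0]:
                best[end] = (best[start][0] + logp[piece], start)
    if best[n][0] == -math.inf:            # word cannot be built from this vocabulary
        return None, -math.inf
    tokens, end = [], n
    while end > 0:                          # backtrack through the best path
        start = best[end][1]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1], best[n][0]

def hard_em(corpus, vocab, n_iters=5):
    """Hard EM: alternate Viterbi tokenization (E-step) with count-based
    re-estimation of p(x_i) = count(x_i) / sum_j count(x_j) (M-step)."""
    logp = {tok: -math.log(len(vocab)) for tok in vocab}   # uniform start (assumption)
    for _ in range(n_iters):
        counts = Counter()
        for word in corpus:
            tokens, _ = viterbi_segment(word, logp)
            if tokens:
                counts.update(tokens)
        total = sum(counts.values())
        logp = {tok: math.log(c / total) for tok, c in counts.items()}
    return logp
```

This is the hard (single Viterbi path) variant described above; the SentencePiece trainer linked below instead accumulates expected counts over the whole segmentation lattice in its EM step.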
- You can find the full Unigram tokenization implementation in unigram.ipynb.
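The vocabulary-pruning step can be sketched in the same spirit. This deliberately naive version (an illustration, not the notebook's code; `prune_vocab`, the default $\alpha = 0.2$, and the protected single-character set are assumptions) reuses `viterbi_segment` from the sketch above: it removes each candidate token, re-segments the corpus so the token falls back to its Viterbi path, and records the drop in log-likelihood. Real trainers only re-segment the entries that actually contain the token.

```python
def corpus_log_likelihood(corpus, logp):
    """Total log-likelihood of the corpus under its current Viterbi segmentations."""
    return sum(viterbi_segment(word, logp)[1] for word in corpus)

def prune_vocab(corpus, logp, alpha=0.2):
    """Drop the alpha * |V| tokens whose removal costs the least log-likelihood,
    so that |V_new| = (1 - alpha) * |V_old|."""
    protected = {tok for tok in logp if len(tok) == 1}   # keep single characters so every string stays segmentable
    base = corpus_log_likelihood(corpus, logp)
    losses = []
    for tok in logp:
        if tok in protected:
            continue
        reduced = {t: lp for t, lp in logp.items() if t != tok}
        # loss = how much the likelihood drops when occurrences of `tok`
        # must be re-segmented via their Viterbi paths over the remaining tokens
        losses.append((base - corpus_log_likelihood(corpus, reduced), tok))
    losses.sort()                                        # cheapest tokens to remove first
    to_drop = {tok for _, tok in losses[:int(alpha * len(logp))]}
    return {tok: lp for tok, lp in logp.items() if tok not in to_drop}
```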
- Unigram tokenizer, an insanely good article: https://everdark.github.io/k9/notebooks/ml/natural_language_understanding/subword_units/subword_units.nb.html
- Unigram tokenizer, a Rust implementation with some theoretical background: https://guillaume-be.github.io/2020-05-30/sentence_piece
- SentencePiece library, the C++ implementation of the trainer: https://github.com/google/sentencepiece/blob/master/src/unigram_model_trainer.cc
- Unigram tokenizer in the Hugging Face NLP course: https://huggingface.co/learn/nlp-course/en/chapter6/7?fw=pt