NLP research experiments, built on PyTorch within the AllenNLP framework.

epwalsh/nlp-models

nlp-models

The goal of this project is to provide an example of a high-quality personal research library. It provides modularity, continuous integration, high test coverage, a code base that emphasizes readability, and a host of scripts that make reproducing any experiment here as easy as running a few make commands. I also strive to make nlp-models useful by implementing practical modules and models that extend AllenNLP. Sometimes I'll contribute pieces of what I work on here back to AllenNLP after they have been thoroughly tested.

Overview

At a high level, the structure of this project mimics that of AllenNLP. That is, the submodules in nlpete are organized in exactly the same way as in allennlp. I've also provided a set of scripts that automate frequently used command sequences, such as running tests or experiments. The Makefile serves as the common interface to these scripts:

  • make train: Train a model. This is basically a wrapper around allennlp train, but provides a default serialization directory and automatically creates subdirectories of the serialization directory for different runs of the same experiment.
  • make tensorboard: Run a tensorboard instance locally.
  • make test: Equivalent to running make typecheck, make lint, make unit-test, and make check-scripts.
  • make typecheck: Runs the mypy typechecker.
  • make lint: Runs pydocstyle, flake8, and black.
  • make unit-test: Runs all unit tests with pytest.
  • make check-scripts: Runs a few other scripts that check miscellaneous things not covered by the other tests.
  • make create-branch: A wrapper around the git functionality to create a new branch and push it upstream. You can name a branch after an issue number with make create-branch issue=NUM or give it an arbitrary name with make create-branch name="my-branch".
  • make data/DATASETNAME.tar.gz: Extract a dataset in the data/ directory. Just replace DATASETNAME with the basename of one of the .tar.gz files in that directory.
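
The per-run subdirectories that make train creates can be pictured with a short sketch. This is only an illustration of the idea, not the project's actual script: the function name next_run_dir and the "run_NNN" naming scheme are assumptions made for this example.

```python
from pathlib import Path


def next_run_dir(serialization_dir: Path, prefix: str = "run_") -> Path:
    """Create and return the next numbered run subdirectory.

    NOTE: the "run_NNN" naming scheme is an assumption for this sketch;
    the real scripts may name run directories differently.
    """
    serialization_dir.mkdir(parents=True, exist_ok=True)
    # Collect the numeric suffixes of run directories that already exist.
    existing = [
        int(p.name[len(prefix):])
        for p in serialization_dir.iterdir()
        if p.is_dir()
        and p.name.startswith(prefix)
        and p.name[len(prefix):].isdigit()
    ]
    # Number the new run one past the highest existing run (or 1 if none).
    run_dir = serialization_dir / f"{prefix}{max(existing, default=0) + 1:03d}"
    run_dir.mkdir()
    return run_dir
```

Calling next_run_dir twice on the same base directory yields two distinct subdirectories, so repeated runs of one experiment never overwrite each other's checkpoints and logs.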

Getting started

The recommended way to set up a Python environment is with Pipenv. After installing Pipenv, just run

pipenv install --pre --dev --python 3.6

from within the root of your clone of this repository.

Swap out 3.6 for 3.7 if you wish to use Python 3.7.

After your environment is set up, you can define your own experiments with an AllenNLP model config file, and then run

make train
# ... follow the prompts to specify the path to your model config and serialization directory.

As an example which you should be able to run immediately, I've provided an implementation of CopyNet and an artificial dataset to experiment with. To train this model, run the following:

# Extract data.
make data/greetings.tar.gz

# Train model. When prompted for the model file, enter "experiments/greetings/copynet.json".
# This took ~3-5 minutes on a single GTX 1070.
make train

NOTE: All of the model configs in the experiments/ folder are defined to run on GPU #0. So if you don't have a GPU available or want to use a different GPU, you'll need to modify the trainer.cuda_device field in the experiment's config file.
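
If you would rather not edit a config by hand, a small helper along these lines could write a modified copy. This is a sketch under two assumptions: the config is plain JSON (no comments or Jsonnet features), and set_cuda_device and the output path are hypothetical names chosen for this example. In AllenNLP, a cuda_device of -1 selects the CPU.

```python
import json
from pathlib import Path


def set_cuda_device(config_path: str, output_path: str, cuda_device: int = -1) -> dict:
    """Write a copy of a JSON config with trainer.cuda_device replaced.

    Assumes the config is plain JSON. A value of -1 means "run on CPU";
    a non-negative value is the index of the GPU to use.
    """
    config = json.loads(Path(config_path).read_text())
    # Create the "trainer" section if it is missing, then set the device.
    config.setdefault("trainer", {})["cuda_device"] = cuda_device
    Path(output_path).write_text(json.dumps(config, indent=2))
    return config
```

For example, set_cuda_device("experiments/greetings/copynet.json", "copynet_cpu.json") would produce a CPU-only copy (the output filename is arbitrary), which you could then supply when make train prompts for the model file.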

Models implemented

CopyNet: A sequence-to-sequence model that incorporates a copying mechanism, which enables the model to copy tokens from the source sentence into the target sentence even if they are not part of the target vocabulary. This architecture has shown promising results on machine translation and semantic parsing tasks. For examples in use, see

Datasets available

For convenience, this project provides a handful of training-ready datasets and scripts to pull and preprocess some other useful datasets. Here is the list so far:

Greetings: A simple made-up dataset of greetings (the source sentences) and replies (the target sentences). The greetings are things like "Hi, my name is Jon Snow" and the replies are in the format "Nice to meet you, Jon Snow!". This is completely artificial and is just meant to show the usefulness of the copy mechanism in CopyNet.

# Extract data.
make data/greetings.tar.gz
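
To see why copying helps on data like this, here is a toy sketch of how a copy-augmented output distribution can be formed. It is a simplified illustration, not the implementation in this repository: the scores are made-up numbers rather than outputs of a trained decoder, and copy_augmented_distribution is a hypothetical name. The idea shown is that generation scores (over the target vocabulary) and copy scores (over source positions) share one softmax, and a token's probability is the sum of its generation and copy probabilities.

```python
import math
from typing import Dict, List


def copy_augmented_distribution(
    generation_scores: Dict[str, float],
    copy_scores: List[float],
    source_tokens: List[str],
) -> Dict[str, float]:
    """Combine generation and copy scores with one shared softmax."""
    all_scores = list(generation_scores.values()) + copy_scores
    max_score = max(all_scores)  # subtract the max for numerical stability
    normalizer = sum(math.exp(s - max_score) for s in all_scores)

    probs: Dict[str, float] = {}
    # Probability mass from generating a token in the target vocabulary.
    for token, score in generation_scores.items():
        probs[token] = probs.get(token, 0.0) + math.exp(score - max_score) / normalizer
    # Probability mass from copying each source position; tokens that can be
    # both generated and copied accumulate mass from both modes.
    for token, score in zip(source_tokens, copy_scores):
        probs[token] = probs.get(token, 0.0) + math.exp(score - max_score) / normalizer
    return probs


# Made-up scores for a single decoding step (illustrative only).
source = ["Hi", ",", "my", "name", "is", "Jon", "Snow"]
generation_scores = {"Nice": 0.5, "to": 0.1, "meet": 0.1, "you": 0.1, ",": 0.2, "<unk>": 1.0}
copy_scores = [0.0, 0.2, 0.0, 0.0, 0.0, 3.0, 1.0]

probs = copy_augmented_distribution(generation_scores, copy_scores, source)
best = max(probs, key=probs.get)
print(best)
```

With these numbers the most likely token is "Jon", even though it is absent from the toy target vocabulary. A decoder without a copy mechanism could only emit the unknown-token placeholder at that step.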

NL2Bash: A challenging dataset that consists of bash one-liners along with corresponding expert descriptions. The goal is to translate the natural language descriptions into the bash commands.

# Extract data.
make data/nl2bash.tar.gz

WMT 2015: Hosted by fast.ai, this is a dataset of 22.5 million English / French sentence pairs that can be used to train an English to French or French to English machine translation system.

# Download, extract, and preprocess data (big file, may take around 10 minutes).
./scripts/data/pull_wmt.sh

Issues and improvements

If you've found a bug or have any questions, please feel free to submit an issue on GitHub. I always appreciate pull requests as well.