The code in this repo is based on this post by Nadbor. However, the repo is self-contained, so one can understand the process just by going through the files here.
We have adapted the code so it works with python 3, implement the process using pytorch
, and use this project to illustrate the differences between keras
and pytorch
Our aim here is the following: given some text that comes from multiple sources (e.g two different programming languages or two different writers), can we train a RNN to find which characters belong to each of the sources?
For example, the text below is comprised by bits of python
and scala
code stitched together. Can we find out where the python
ends/begins and the scala
begins/ends?
from __future__ import division
from functools import partial
import warnings
import numpy as np
from scipy import linalg
from scipy.sparse import issparse, csr_matr F[A])(implicit val F: MonadPlus[F]) extends Ops[F[A]] {
////
impoix
from . import check_random_state
from .fixrt Leibniz.===
def filter(f: A => Boolean): F[A] =
F.filter(self)(f)
def withFilter(f: A => Boolean): F[A] =
filter(f)
final def uniteU[T](implicit T: Unapply[Foldable, Aes import np_version
from ._logistic_sigmoid import _log_logistic_sigmoid
from ..extern]): F[T.A] =
F.uniteU(self)(T)
def unite[T[_], B](implicit ev: A === T[B], T: Foldable[T]): F[B] = {
val ftb: F[T[B]] = ev.subst(seals.six.moves import xrange
from .sparsefuncs_fast import csr_row_norms
from .validation import check_array
from ..exceptions import NonBLASDotWarning
Just so is all clear since the beginning, is not that we have a favourite frame. We like both, keras
and pytorch
. Our aim here is to illustrate their differences using a relatively simple model with a non-trivial data-preparation process. Having said that, here we go.
There is a run_all.sh
file that you could run by
bash run_all.sh
However, I would recommend to use run_all.sh
to follow the flow first and run script by script, or perhaps better, have a look to:
full_process_keras.ipynb
full_process_pytorch.ipynb
The full process is detailed in those two notebooks. I would recommend to go first through full_process_keras.ipynb
and then full_process_pytorch.ipynb
.
Nonetheless, the "flow of the code" is:
-
Get the data. We will download 2 sets of files. One to compare
python
andscala
sequences of text and another to compare Jane Austen's and Shakespeare's booksbash get_data.sh
-
Prepare input files:
python prepare_input_files.py data/austen 'austen.txt' data/austen_clean python prepare_input_files.py data/shakespeare/ 'shakespeare.txt' data/shakespeare_clean python prepare_input_files.py data/scikit-learn '*.py' data/sklearn_clean python prepare_input_files.py data/scalaz '*.scala' data/scalaz_clean
-
I have included a little script to split the books in a way that each partition "makes sense", meaning contains enough text and the text corresponds to chapters, episodes, etc:
python split_ebooks.py
-
train/test split
python train_test_split.py data/austen_clean/ 0.25 python train_test_split.py data/shakespeare_clean/ 0.25 python train_test_split.py data/sklearn_clean/ 0.25 python train_test_split.py data/scalaz_clean/ 0.25
-
Train the network, with
keras
orpytorch
, to distinguish betweenpython
orscala
code (just change the corresponding directories to compare Austen and Shakespeare)python train_keras.py models/model_keras data/sklearn_clean/ data/scalaz_clean --bidirectional python train_pytorch.py models/model_pytorch data/sklearn_clean/ data/scalaz_clean --bidirectional
-
Tag the characters as being
python
orscala
code:python apply_tagger_keras.py models/model_keras output/sklearn_or_scala_preds_keras data/sklearn_clean/ data/scalaz_clean python apply_tagger_pytorch.py models/model_pytorch output/sklearn_or_scala_preds_pytorch data/sklearn_clean/ data/scalaz_clean
-
And finally plot the output to pretty html files:
python plot_predictions.py output/sklearn_or_scala_preds_keras output/sklearn_or_scala_preds_keras_html python plot_predictions.py output/sklearn_or_scala_preds_pytorch output/sklearn_or_scala_preds_pytorch_html
Again, I recommend having a look to the notebooks.
Details on why this is "cool" or can be useful can be found in the already mentioned post by Nadbor.
Thanks and credit to Nadbor for having the idea and trigger a fun project.