TensorFlow implementation of *Entropy-SGD: Biasing Gradient Descent Into Wide Valleys* (Chaudhari et al., 2017). The entropy-SGD optimization algorithm uses geometric information about the energy landscape to bias optimization toward flat regions of the loss function, which may aid generalization.
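For orientation, here is a minimal NumPy sketch of one entropy-SGD outer step following Algorithm 1 of the paper: an inner SGLD loop samples around the current iterate to estimate the local-entropy gradient, and the outer update moves the weights toward the running average of those samples. This is an illustrative sketch, not the repository's TensorFlow code; hyperparameter names mirror the paper (`L`, `gamma`, `alpha`, noise scale `eps`).

```python
import numpy as np

def entropy_sgd_step(x, grad_f, L=5, gamma=0.03, eta=0.1,
                     eta_prime=0.1, alpha=0.75, eps=1e-4, rng=None):
    """One entropy-SGD outer step (sketch of Algorithm 1, Chaudhari et al.).

    grad_f: callable returning the (stochastic) gradient of the loss.
    The inner loop runs L SGLD iterations around x and keeps an exponential
    average mu of the samples; the outer update follows the local-entropy
    gradient direction gamma * (x - mu).
    """
    rng = rng or np.random.default_rng(0)
    x_prime = x.copy()
    mu = x.copy()
    for _ in range(L):
        # SGLD update: gradient of the modified loss plus Gaussian noise
        dx = grad_f(x_prime) - gamma * (x - x_prime)
        x_prime = (x_prime - eta_prime * dx
                   + np.sqrt(eta_prime) * eps * rng.standard_normal(x.shape))
        mu = (1 - alpha) * mu + alpha * x_prime
    # Outer update along the negative local-entropy gradient
    return x - eta * gamma * (x - mu)

# Toy usage: descend f(x) = 0.5 * ||x||^2, whose gradient is x itself
grad = lambda z: z
x = np.ones(3)
for _ in range(100):
    x = entropy_sgd_step(x, grad)
```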
The CIFAR-10 dataset is automatically downloaded and converted to TFRecord format on first run. By default, training runs on CIFAR-10 with the entropy-SGD optimizer, 20 Langevin iterations, and a wide residual network (WRN-28-10). For a complete list of options run `python3 train.py -h`. For example, to train on CIFAR-10 using the entropy-SGD optimizer with 5 Langevin iterations:
```shell
# Check command line arguments
$ python3 train.py -h

# Run
$ python3 train.py -opt entropy-sgd -L 5
```
The default hyperparameters (used to report all results) can be accessed and set in `config.py` under `config_train`. Most should be self-explanatory; for the parameters labelled 'entropy-sgd specific', refer to the original paper.
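As an illustration of the kind of settings involved, the sketch below shows a hypothetical `config_train` dictionary. The key names and values here are assumptions for exposition only and do not reproduce the repository's actual `config.py`.

```python
# Hypothetical example only: key names and values are illustrative
# assumptions, not the contents of the repository's config.py.
config_train = {
    # common options
    'optimizer': 'entropy-sgd',   # or 'sgd', 'momentum'
    'batch_size': 128,
    'epochs': 10,

    # entropy-sgd specific (see Chaudhari et al., 2017)
    'L': 20,          # Langevin iterations per outer step
    'gamma': 0.03,    # scope of the local-entropy term
    'epsilon': 1e-4,  # SGLD noise scale
    'alpha': 0.75,    # exponential-averaging coefficient for mu
}
```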
Checkpoints and Tensorboard scalars are saved beneath their respective directories.
Coming soon...
Both CIFAR-10 and CIFAR-100 models are trained with the same hyperparameters and learning rate schedule specified in the original paper. The dataset is subjected to mean-std preprocessing and random rotations and reflections. Convergence when training on both datasets is compared against vanilla SGD and SGD with Nesterov momentum. The reported accuracy is the average of 5 runs with random weight initializations.
Models trained without entropy-SGD run for 200 epochs; models trained with entropy-SGD run with L=20 for 10 epochs, using the hyperparameters from the CIFAR-10 experiment in the original paper.
Entropy-SGD currently appears to be outperformed by SGD with momentum. The models are being retrained with momentum applied to both the SGLD and outer-loop updates.
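The momentum modification mentioned above can be sketched as follows: instead of applying the noisy SGLD gradient directly, it is accumulated in a heavy-ball momentum buffer. This is a minimal NumPy sketch of that idea, not the repository's implementation; the function name and defaults are assumptions.

```python
import numpy as np

def sgld_momentum_step(x_prime, v, grad, lr=0.1, beta=0.9, eps=1e-4, rng=None):
    """One SGLD update routed through a heavy-ball momentum buffer (sketch).

    Plain SGLD perturbs each gradient step with Gaussian noise; here the
    noisy step is accumulated in the velocity v, the modification applied
    to both the inner (SGLD) and outer-loop updates.
    """
    rng = rng or np.random.default_rng(0)
    noise = np.sqrt(lr) * eps * rng.standard_normal(np.shape(x_prime))
    v = beta * v - lr * grad + noise
    return x_prime + v, v

# Toy usage: descend f(x) = 0.5 * ||x||^2, whose gradient at x is x
x = np.ones(3)
v = np.zeros(3)
for _ in range(5):
    x, v = sgld_momentum_step(x, v, grad=x)
```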
_Plots showing convergence of entropy-SGD vs. SGD to follow._
- Python 3.6
- TensorFlow 1.4
- Original Lua implementation.
- Simplifying neural nets by discovering flat minima. Related work by Hochreiter and Schmidhuber.
- Stochastic gradient Langevin dynamics. Used to approximate the expectation in the update step.
- PDEs for optimizing deep neural networks. Follow-up work by Chaudhari et al.