TensorFlow implementation of *Entropy-SGD: Biasing Gradient Descent Into Wide Valleys* (Chaudhari et al., 2017). The entropy-SGD optimization algorithm uses geometric information about the energy landscape to bias optimization toward flat regions of the loss function, which may aid generalization.
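For orientation, here is a minimal NumPy sketch of one entropy-SGD outer step following Algorithm 1 of the paper: an inner SGLD loop samples around the current iterate to estimate the local-entropy gradient, and the outer update moves the weights toward the running average of those samples. This is an illustrative sketch, not the repository's TensorFlow code; hyperparameter names mirror the paper (`L`, `gamma`, `alpha`, noise scale `eps`).

```python
import numpy as np

def entropy_sgd_step(x, grad_f, L=5, gamma=0.03, eta=0.1,
                     eta_prime=0.1, alpha=0.75, eps=1e-4, rng=None):
    """One entropy-SGD outer step (sketch of Algorithm 1, Chaudhari et al.).

    grad_f: callable returning the (stochastic) gradient of the loss.
    The inner loop runs L SGLD iterations around x and keeps an exponential
    average mu of the samples; the outer update follows the local-entropy
    gradient direction gamma * (x - mu).
    """
    rng = rng or np.random.default_rng(0)
    x_prime = x.copy()
    mu = x.copy()
    for _ in range(L):
        # SGLD update: gradient of the modified loss plus Gaussian noise
        dx = grad_f(x_prime) - gamma * (x - x_prime)
        x_prime = (x_prime - eta_prime * dx
                   + np.sqrt(eta_prime) * eps * rng.standard_normal(x.shape))
        mu = (1 - alpha) * mu + alpha * x_prime
    # Outer update along the negative local-entropy gradient
    return x - eta * gamma * (x - mu)

# Toy usage: descend f(x) = 0.5 * ||x||^2, whose gradient is x itself
grad = lambda z: z
x = np.ones(3)
for _ in range(100):
    x = entropy_sgd_step(x, grad)
```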
The CIFAR-10 dataset is automatically downloaded and converted to TFRecord format on first run. By default, training runs on CIFAR-10 with the entropy-SGD optimizer, 20 Langevin iterations, and a wide residual network (WRN-28-10). For a complete list of options run `python3 train.py -h`. For example, to train on CIFAR-10 using the entropy-SGD optimizer with 5 Langevin iterations:
```shell
# Check command line arguments
$ python3 train.py -h

# Run
$ python3 train.py -opt entropy-sgd -L 5
```
The default hyperparameters (used to report all results) can be accessed and set in `config.py` under `config_train`. Most should be self-explanatory; for the parameters labelled 'entropy-sgd specific', refer to the original paper.
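As an illustration of the kind of settings involved, the sketch below shows a hypothetical `config_train` dictionary. The key names and values here are assumptions for exposition only and do not reproduce the repository's actual `config.py`.

```python
# Hypothetical example only: key names and values are illustrative
# assumptions, not the contents of the repository's config.py.
config_train = {
    # common options
    'optimizer': 'entropy-sgd',   # or 'sgd', 'momentum'
    'batch_size': 128,
    'epochs': 10,

    # entropy-sgd specific (see Chaudhari et al., 2017)
    'L': 20,          # Langevin iterations per outer step
    'gamma': 0.03,    # scope of the local-entropy term
    'epsilon': 1e-4,  # SGLD noise scale
    'alpha': 0.75,    # exponential-averaging coefficient for mu
}
```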
Checkpoints and Tensorboard scalars are saved beneath their respective directories.
Coming soon...
Both CIFAR-10 and CIFAR-100 models are trained with the same hyperparameters and learning rate schedule specified in the original paper. The dataset is subjected to mean-std preprocessing and random rotations and reflections. Convergence when training on both datasets is compared against vanilla SGD and SGD with Nesterov momentum. The reported accuracy is the average of 5 runs with random weight initializations.
Models trained without entropy-SGD run for 200 epochs; models trained with entropy-SGD run with L=20 for 10 epochs, using the hyperparameters from the CIFAR-10 experiment in the original paper.
Entropy-SGD currently appears to be outperformed by SGD with momentum. The models are being retrained with momentum applied to both the SGLD and outer-loop updates.
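The momentum modification mentioned above can be sketched as follows: instead of applying the noisy SGLD gradient directly, it is accumulated in a heavy-ball momentum buffer. This is a minimal NumPy sketch of that idea, not the repository's implementation; the function name and defaults are assumptions.

```python
import numpy as np

def sgld_momentum_step(x_prime, v, grad, lr=0.1, beta=0.9, eps=1e-4, rng=None):
    """One SGLD update routed through a heavy-ball momentum buffer (sketch).

    Plain SGLD perturbs each gradient step with Gaussian noise; here the
    noisy step is accumulated in the velocity v, the modification applied
    to both the inner (SGLD) and outer-loop updates.
    """
    rng = rng or np.random.default_rng(0)
    noise = np.sqrt(lr) * eps * rng.standard_normal(np.shape(x_prime))
    v = beta * v - lr * grad + noise
    return x_prime + v, v

# Toy usage: descend f(x) = 0.5 * ||x||^2, whose gradient at x is x
x = np.ones(3)
v = np.zeros(3)
for _ in range(5):
    x, v = sgld_momentum_step(x, v, grad=x)
```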
_Plots showing convergence of entropy-SGD vs. SGD to follow._
- Python 3.6
- TensorFlow 1.4
- Original Lua implementation.
- Simplifying neural nets by discovering flat minima. Related work by Hochreiter and Schmidhuber.
- Stochastic gradient Langevin dynamics. Used to approximate the expectation in the update step.
- PDEs for optimizing deep neural networks. Follow-up work by Chaudhari et al.