This neural system for visual question answering is roughly based on the paper "Dynamic Memory Networks for Visual and Textual Question Answering" by Xiong et al. (ICML 2016). The input is an image and a question about the image, and the output is a one-word answer to this question. A convolutional neural network extracts visual features from the image, and a bi-directional GRU recurrent neural network fuses these features. Meanwhile, the question is encoded with either a GRU recurrent neural network or a positional encoding scheme. A dynamic memory network with an attention mechanism then generates the answer based on this information. This project is implemented using the TensorFlow library and allows end-to-end training of both the CNN and RNN parts.
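As a concrete illustration of the positional encoding scheme mentioned above (the DMN+ paper uses the scheme from Sukhbaatar et al.'s End-to-End Memory Networks), here is a minimal NumPy sketch; the function below is illustrative only and is not this repository's actual code:

```python
import numpy as np

def positional_encoding(word_embeddings):
    """Encode a word sequence into one fixed-size vector via positional
    weighting, following l_jd = (1 - j/M) - (d/D) * (1 - 2j/M).

    word_embeddings: float array of shape [M, D] (M words, D-dim embeddings).
    Returns a [D] vector: the element-wise weighted sum of the embeddings.
    """
    M, D = word_embeddings.shape
    j = np.arange(1, M + 1)[:, None]   # word positions 1..M, shape [M, 1]
    d = np.arange(1, D + 1)[None, :]   # embedding dims  1..D, shape [1, D]
    l = (1.0 - j / M) - (d / D) * (1.0 - 2.0 * j / M)  # weights, shape [M, D]
    return (l * word_embeddings).sum(axis=0)
```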
- TensorFlow
- NumPy
- OpenCV
- Natural Language Toolkit (NLTK)
- Pandas
- Matplotlib
- tqdm
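All of these can typically be installed with pip. The command below is a convenience suggestion, not taken from this repository; the package names are the usual PyPI ones (OpenCV is published as opencv-python), and depending on the project's age you may need a TensorFlow 1.x release:

```
pip install tensorflow numpy opencv-python nltk pandas matplotlib tqdm
```

If NLTK tokenizers are used for preprocessing the questions, you may additionally need their data, e.g. `python -c "import nltk; nltk.download('punkt')"`.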
- Preparation: Download the COCO train2014 and val2014 images here. Put the COCO train2014 images in the folder `train/images`, and put the COCO val2014 images in the folder `val/images`. Download the VQA v1 training and validation questions and annotations here. Put the files `mscoco_train2014_annotations.json` and `OpenEnded_mscoco_train2014_questions.json` in the folder `train`. Similarly, put the files `mscoco_val2014_annotations.json` and `OpenEnded_mscoco_val2014_questions.json` in the folder `val`. Furthermore, download the pretrained VGG16 net here or the ResNet50 net here if you want to use it to initialize the CNN part.
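After these preparation steps, the data layout should look roughly like this (assuming the default paths described above; the weight file name matches the `--cnn_model_file` flag used below):

```
.
├── train/
│   ├── images/                                   # COCO train2014 images
│   ├── mscoco_train2014_annotations.json
│   └── OpenEnded_mscoco_train2014_questions.json
├── val/
│   ├── images/                                   # COCO val2014 images
│   ├── mscoco_val2014_annotations.json
│   └── OpenEnded_mscoco_val2014_questions.json
├── test/
│   └── images/                                   # your own JPEG images for inference
└── vgg16_no_fc.npy                               # optional pretrained CNN weights
```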
- Training: To train a model using the VQA v1 training data, first set up the various parameters in the file `config.py`, and then run a command like this:
```
python main.py --phase=train \
    --load_cnn \
    --cnn_model_file='./vgg16_no_fc.npy' \
    [--train_cnn]
```
Turn on `--train_cnn` if you want to jointly train the CNN and RNN parts; otherwise, only the RNN part is trained. The checkpoints will be saved in the folder `models`. If you want to resume training from a checkpoint, run a command like this:
```
python main.py --phase=train \
    --load \
    --model_file='./models/xxxxxx.npy' \
    [--train_cnn]
```
To monitor the progress of training, run the following command:

```
tensorboard --logdir='./summary/'
```
- Evaluation: To evaluate a trained model using the VQA v1 validation data, run a command like this:

```
python main.py --phase=eval --model_file='./models/xxxxxx.npy'
```

The result will be shown in stdout. Furthermore, the generated answers will be saved in the file `val/results.json`.
- Inference: You can use the trained model to answer any questions about any JPEG images! Put such images in the folder `test/images`. Also, create a CSV file containing your questions (this file should have three fields: `image`, `question`, `question_id`) and put it in the folder `test`; an example CSV is shown below. Then run a command like this:

```
python main.py --phase=test --model_file='./models/xxxxxx.npy'
```

The generated answers will be saved in the folder `test/results`.
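For illustration, a questions CSV for two hypothetical images in `test/images` might look like this (the image filenames, questions, and IDs below are made up, not from this repository):

```
image,question,question_id
dog.jpg,What animal is in the picture?,1
kitchen.jpg,How many chairs are there?,2
```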
A pretrained model with the default configuration can be downloaded here. This model was trained solely on the VQA v1 training data, and it achieves an accuracy of 60.35% on the VQA v1 validation data.
- Dynamic Memory Networks for Visual and Textual Question Answering. Caiming Xiong, Stephen Merity, Richard Socher. ICML 2016.
- Visual Question Answering (VQA) dataset
- Implementing Dynamic memory networks by YerevaNN
- Dynamic memory networks in Theano
- Dynamic Memory Networks in Tensorflow