Welcome to PIXTALES - The story behind the picture, a deep learning approach to image captioning. The magic behind this repository is all about connecting visuals to language - generating a narrative for every image. Our project is driven by the power of neural networks and deep learning, aiming to create meaningful and accurate descriptions for any image.
PIXTALES is a project that uses a combination of Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) to create a model capable of describing the content of images. We utilize the CNN as an "encoder" to transform an input image into a complex feature representation, and the RNN acts as a "decoder", turning those features into a rich, human-readable text.
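To make the encoder-decoder idea concrete, here is a minimal PyTorch sketch of such a CNNtoRNN model. The ResNet backbone, LSTM decoder, and layer sizes are illustrative assumptions, not necessarily the exact architecture used in this repository:

```python
# Minimal encoder-decoder sketch (illustrative only; the backbone and sizes are assumptions).
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        with torch.no_grad():                  # keep the pretrained CNN frozen
            features = self.backbone(images)   # (B, 2048, 1, 1)
        return self.fc(features.flatten(1))    # (B, embed_size)

class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image feature as the first "token" of the sequence.
        embeddings = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hiddens, _ = self.lstm(embeddings)
        return self.fc(hiddens)                # logits over the vocabulary at each step

class CNNtoRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.encoder = EncoderCNN(embed_size)
        self.decoder = DecoderRNN(embed_size, hidden_size, vocab_size)

    def forward(self, images, captions):
        return self.decoder(self.encoder(images), captions)
```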
Our objective is to implement models with different configurations and architectures, train them, and highlight the best results. Another goal is to evaluate the model in different ways, using several metrics and datasets such as COCO or Flickr30k, although our primary dataset is Flickr8k.
All the data you will need is available in the following Google Drive folder, so you don't need to search for it.
First, clone this repository onto your local (or virtual) machine:
git clone git@github.com:DCC-UAB/dlnn-project_ia-group_05.git
You will need to set up an environment with the required libraries, including PyTorch, spaCy, and PIL (Pillow), among others. You will also need to download the en_core_web_sm language model using:
python -m spacy download en_core_web_sm
Next, to train either of the models, first change into the corresponding directory.
For the model without attention:
cd Pixtales/Image_captioning
For the model with attention:
cd Pixtales/Image_captioning_with_attention
Then run:
python train.py
If you want to use our already trained models, just load one of the checkpoints we have prepared in a Drive folder. There you can find checkpoints for different models, which you can download to test a model or train it a little more:
Checkpoints of different models
and run evaluation:
python evaluation.py
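For reference, loading one of these .pth checkpoints in your own script typically looks like the sketch below. The key names and file name are assumptions; check utils.py for the exact format used in this repository:

```python
# Hypothetical checkpoint-loading sketch; the stored key names may differ in utils.py.
import torch

checkpoint = torch.load("checkpoint.pth", map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])       # model = an instantiated CNNtoRNN
optimizer.load_state_dict(checkpoint["optimizer"])    # only needed when resuming training
model.eval()                                          # inference mode for evaluation
```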
The code in this repository primarily consists of a model implementation (CNNtoRNN), a dataset loading function (get_loader), a utils.py file used to save and load models, and a train.py file that trains the model and saves it as a .pth checkpoint; the training script relies on all of the previous files. There is also an evaluation.py file that loads a trained model, runs some tests, and shows examples. It provides utility functions for generating and visualizing image captions and for calculating BLEU scores, a popular metric for evaluating generated text against reference text.
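For orientation, a condensed sketch of how these pieces might fit together in a training loop is shown below. The get_loader signature, the vocabulary attributes, the file paths, and the hyperparameters are assumptions based on the description above, not the exact repo API:

```python
# Illustrative training skeleton; get_loader's signature, the vocabulary object,
# and the file paths are assumptions, so check the repo's own files for the real API.
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Assumed signature: returns a DataLoader plus the dataset (with its vocabulary).
loader, dataset = get_loader(
    root_folder="flickr8k/images",
    annotation_file="flickr8k/captions.txt",
    transform=transform,
    batch_size=32,
)

model = CNNtoRNN(embed_size=256, hidden_size=256, vocab_size=len(dataset.vocab))
criterion = nn.CrossEntropyLoss(ignore_index=dataset.vocab.stoi["<PAD>"])
optimizer = optim.Adam(model.parameters(), lr=3e-4)

for epoch in range(10):
    for images, captions in loader:
        outputs = model(images, captions[:, :-1])   # predict each next token
        loss = criterion(outputs.reshape(-1, outputs.shape[2]), captions.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Save progress as a .pth checkpoint, as described above.
    torch.save({"state_dict": model.state_dict(),
                "optimizer": optimizer.state_dict()}, "checkpoint.pth")
```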
Our evaluation focuses on two areas: the BLEU score and a visual inspection of the generated captions. The BLEU score gives us a quantitative measure of our model's performance, while the visual inspection of the generated captions lets us qualitatively assess the model's output. The original and generated captions are printed, and the average BLEU score across all images is computed.
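As a rough illustration of how a single generated caption can be scored with BLEU, the snippet below uses NLTK's implementation; evaluation.py may compute the score differently:

```python
# Hedged example: scoring one generated caption against its reference captions with NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a dog runs across the grassy field".split(),
    "a brown dog is running on the grass".split(),
]
candidate = "a dog is running through the grass".split()

smooth = SmoothingFunction().method1           # avoids zero scores on short captions
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```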
This project is an exciting journey into the intersection of computer vision and natural language processing. We hope that this project can serve as a helpful resource for those interested in image captioning, deep learning, and AI in general.
It's worth noting that our main goal has always been to learn as much as we can; that's why we chose this project. We thought it would be the most challenging as well as the most rewarding option, and so it has been: we put in a lot of effort, but it paid off completely, regardless of the mark.
We appreciate having had this experience, which has helped us understand how complicated a deep learning project can be, but also how exciting it is to get results after hard work.
CS 152 NN—25: Attention: Image Captioning by Neil Rhodes
Authors: Neil De La Fuente, Maiol Sabater, Daniel Vidal
Subject: Neural Networks and Deep Learning.
Degree in Artificial Intelligence, 2nd year.
UAB, 2023.