VERBAL TO VISUAL - TEXT TO IMAGE

In this project we have explored and implemented different models for text-to-image generation: the input text is converted into an embedding, which is then used to generate the corresponding image.
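
The overall flow is sketched below. This is a minimal illustration only; `text_encoder` and `generator` are placeholders for the concrete encoders (XLNet/BERT) and GAN generators described in the sections that follow, not classes from this repository.

```python
# Minimal sketch of the text-to-image pipeline used throughout this project.
# `text_encoder` and `generator` are placeholders for the concrete models
# (XLNet/BERT encoders and the GAN generators) described below.
import torch

def generate_image(prompt, text_encoder, generator, noise_dim=100):
    # 1. Convert the prompt into a fixed-size embedding vector.
    text_embedding = text_encoder(prompt)                  # shape: (1, embed_dim)
    # 2. Sample a noise vector and concatenate the text embedding as conditioning.
    noise = torch.randn(1, noise_dim)
    conditioned_input = torch.cat([noise, text_embedding], dim=1)
    # 3. The generator maps the conditioned vector to an image tensor.
    return generator(conditioned_input)                    # shape: (1, 3, H, W)
```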

USING XLNET + GANS

INPUT = "Yellow flower with black stamen"

OUTPUT: (generated flower image)

INPUT = "red flower"

OUTPUT: (generated flower image)

INPUT = "a pink flower"

OUTPUT: (generated flower image)

USING STACKGAN / STAGE 1

Flower dataset: (generated flower images)

Flickr8k: (generated image)

APPROACHES AND THEORY

In our project we explored several GAN-based approaches, namely StackGAN, DCGAN, and XLNet + GANs, and we also explored the modern Hopfield network model. Let's look at each approach.

DCGANS

  • In our project, DCGAN serves as the core architecture for both the generator and the discriminator.
  • The generator uses convolutional layers to upsample a noise vector, conditioned by the text embeddings. This allows the generator to create structured images that correspond to the text input, like generating a "blue sky with clouds."
  • The discriminator is a CNN-based model that learns to distinguish between real images (from the dataset) and the generated images. It helps the generator improve by providing feedback on how realistic the images are.
  • Contribution: DCGAN’s convolutional layers allow the model to generate images with local coherence, capturing detailed patterns and structures. Using DCGAN in our project improves the quality of the generated images by leveraging deep convolutional networks, which are more effective at producing visually consistent and realistic results (a minimal sketch of the text-conditioned generator follows this list).
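
As a concrete illustration of the points above, here is a minimal, hypothetical text-conditioned DCGAN generator in PyTorch. The layer sizes (noise_dim=100, 64x64 output) are assumptions for the sketch, not the exact configuration used in this repository.

```python
import torch
import torch.nn as nn

class TextConditionedGenerator(nn.Module):
    """DCGAN-style generator: upsamples a noise vector concatenated with a
    projected text embedding into a 64x64 RGB image (sizes are illustrative)."""
    def __init__(self, noise_dim=100, embed_dim=256, proj_dim=128, feat=64):
        super().__init__()
        # Project the raw text embedding down to a compact conditioning vector.
        self.project = nn.Sequential(nn.Linear(embed_dim, proj_dim), nn.LeakyReLU(0.2))
        self.net = nn.Sequential(
            nn.ConvTranspose2d(noise_dim + proj_dim, feat * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(feat * 8), nn.ReLU(True),   # 4x4
            nn.ConvTranspose2d(feat * 8, feat * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 4), nn.ReLU(True),   # 8x8
            nn.ConvTranspose2d(feat * 4, feat * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 2), nn.ReLU(True),   # 16x16
            nn.ConvTranspose2d(feat * 2, feat, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat), nn.ReLU(True),       # 32x32
            nn.ConvTranspose2d(feat, 3, 4, 2, 1, bias=False),
            nn.Tanh(),                                 # 64x64 RGB in [-1, 1]
        )

    def forward(self, noise, text_embedding):
        cond = self.project(text_embedding)            # (B, proj_dim)
        x = torch.cat([noise, cond], dim=1)            # (B, noise_dim + proj_dim)
        x = x.unsqueeze(-1).unsqueeze(-1)              # (B, C, 1, 1) for conv-transpose
        return self.net(x)
```

The discriminator is the mirror image of this network: strided convolutions downsample the image (concatenated with the text conditioning) to a real/fake score.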

STACK GANS

(StackGAN architecture diagram)

  • Stage 1: The model converts the text description into an embedding vector using a pre-trained text encoder (such as a pre-trained language model). This embedding is passed to the first stage of StackGAN, which generates a low-resolution image (e.g., 64x64 pixels) that roughly captures the overall shape, color, and basic structure of the object in the description.
  • Stage 2: The low-resolution image is passed into the second stage of StackGAN, along with the text embedding again. Here the model refines the image, adding finer details such as textures and lighting, and increases the resolution (e.g., to 256x256 pixels); residual blocks are used to capture these finer features. The result is a more photorealistic and detailed image that matches the textual description.
  • Contribution: StackGAN enables our project to generate progressively refined images, with each stage improving quality and resolution, making the generated images more realistic and better aligned with the input text (a minimal sketch of the two-stage flow follows this list).
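
The two-stage flow can be summarized with the sketch below. `stage1_gen` and `stage2_gen` are placeholders for the actual Stage 1 and Stage 2 generators, and the residual block is a generic example of the blocks used in Stage 2, not the repository's exact implementation.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Generic residual block of the kind Stage 2 uses to capture finer features."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, 1, 1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(True),
            nn.Conv2d(channels, channels, 3, 1, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)   # skip connection preserves the Stage 1 content

def stackgan_generate(text_embedding, stage1_gen, stage2_gen, noise_dim=100):
    """Two-stage generation: Stage 1 produces a rough 64x64 image from the text
    embedding and noise; Stage 2 refines it to 256x256 using the embedding again."""
    noise = torch.randn(text_embedding.size(0), noise_dim)
    low_res = stage1_gen(noise, text_embedding)      # (B, 3, 64, 64) rough image
    high_res = stage2_gen(low_res, text_embedding)   # (B, 3, 256, 256) refined image
    return low_res, high_res
```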

MODERN HOPFIELD NETWORK

(Modern Hopfield network diagram)

  • Modern Hopfield networks are used to store and retrieve complex patterns (in this case, embeddings) that correspond to text-image relationships. These networks iteratively update the embeddings to retrieve the most relevant stored patterns, which helps improve the fidelity of the generated images.
  • In our project, after obtaining an initial embedding from the text description (e.g., using XLNet), a Hopfield network layer might be used to refine or retrieve related patterns that can aid in the generation of images with more coherent details. The modern Hopfield network excels at associating and recalling these complex relationships, such as specific texture patterns, object shapes, or even lighting conditions described in the text.
  • Contribution: The use of modern Hopfield networks in our project ensures that the model can effectively recall detailed image patterns related to the input description. This results in more consistent and accurate generation, as the model can better preserve high-dimensional relationships between text and images (a minimal retrieval sketch follows this list).
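
A minimal sketch of the modern Hopfield retrieval step is given below, assuming stored patterns are rows of an embedding matrix; the memory contents, shapes, and `beta` value are illustrative assumptions, not the repository's configuration.

```python
import torch
import torch.nn.functional as F

def hopfield_retrieve(query, stored_patterns, beta=8.0, steps=1):
    """Modern (continuous) Hopfield retrieval update:
        query_new = softmax(beta * stored_patterns @ query) @ stored_patterns
    query: (d,) embedding to refine; stored_patterns: (N, d) memory of
    text/image embedding patterns; beta controls retrieval sharpness."""
    for _ in range(steps):
        similarity = stored_patterns @ query           # (N,) similarity to each pattern
        attention = F.softmax(beta * similarity, dim=0)
        query = attention @ stored_patterns            # (d,) retrieved pattern
    return query

# Hypothetical usage: refine a text embedding against stored caption embeddings.
# memory = torch.randn(500, 768)       # stored patterns (illustrative)
# refined = hopfield_retrieve(text_embedding.squeeze(0), memory)
```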

RESEARCH PAPER FOR MHN

XLNET + GANS

  • Using XLNet and GANs, we were able to generate images of flowers based on prompts. XLNet was used to generate the text embeddings from the captions present in the dataset.
  • The GAN then generated an image from the text embedding it was fed.
  • The combination of GANs + XLNet proves effective because the system can simultaneously learn from textual and visual data, enhancing its ability to generate images that are contextually aligned with the input text.
  • The discriminator in the GAN provides feedback to the generator, allowing it to refine the generated images iteratively. When the generator uses high-quality text embeddings from XLNet as input, this feedback loop becomes more effective, leading to better convergence and higher-quality outputs. XLNet outperformed BERT: with BERT embeddings we were not able to obtain images that corresponded to the prompt. (A minimal embedding-extraction sketch follows this list.)
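
For reference, caption embeddings can be obtained from a pre-trained XLNet with the Hugging Face transformers library roughly as follows; the mean-pooling strategy and the `xlnet-base-cased` checkpoint are assumptions for illustration, not necessarily what this repository uses.

```python
import torch
from transformers import XLNetTokenizer, XLNetModel

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetModel.from_pretrained("xlnet-base-cased")
model.eval()

def embed_caption(caption: str) -> torch.Tensor:
    """Return a fixed-size embedding for a caption by mean-pooling XLNet's
    last hidden states (pooling strategy is an assumption for this sketch)."""
    inputs = tokenizer(caption, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)   # shape: (1, 768)

text_embedding = embed_caption("a pink flower")   # fed to the GAN generator
```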

DATASETS

  • Flower dataset (used for the XLNet + GAN experiments and StackGAN Stage 1)
  • Flickr8k (used for StackGAN Stage 1)

FUTURE SCOPE

  • Stage 2 of StackGAN is still pending due to memory issues and limited GPU access; we aim to complete it in the near future.
  • The MHN model also remains incomplete due to limited resources; we plan to devote more resources to it.
  • Training on more datasets, such as the COCO bird dataset, will be our next step.
  • Reducing noise in the output by using SVRE (stochastic variance reduced extragradient).

MENTORS

Rohan Parab

Param Parekh

CONTRIBUTORS

Janvi Soni

Yadnyesh Patil

Mudit Jain
