IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation - ICLR 2025
This repository contains the official implementation of IterComp (ICLR 2025).
Xinchen Zhang*, Ling Yang*, Guohao Li, Yaqi Cai, Jiake Xie, Yong Tang, Yujiu Yang, Mengdi Wang, Bin Cui
Tsinghua University, Peking University, University of Oxford, USTC, LibAI Lab, Princeton University
Advanced diffusion models like RPG, Stable Diffusion 3 and FLUX have made notable strides in compositional text-to-image generation. However, these methods typically exhibit distinct strengths for compositional generation, with some excelling in handling attribute binding and others in spatial relationships. This disparity highlights the need for an approach that can leverage the complementary strengths of various models to comprehensively improve the composition capability. To this end, we introduce IterComp, a novel framework that aggregates composition-aware model preferences from multiple models and employs an iterative feedback learning approach to enhance compositional generation. Specifically, we curate a gallery of six powerful open-source diffusion models and evaluate their three key compositional metrics: attribute binding, spatial relationships, and non-spatial relationships. Based on these metrics, we develop a composition-aware model preference dataset comprising numerous image-rank pairs to train composition-aware reward models. Then, we propose an iterative feedback learning method to enhance compositionality in a closed-loop manner, enabling the progressive self-refinement of both the base diffusion model and reward models over multiple iterations. Theoretical proof demonstrates the effectiveness and extensive experiments show our significant superiority over previous SOTA methods (e.g., Omost and FLUX), particularly in multi-category object composition and complex semantic alignment. IterComp opens new research avenues in reward feedback learning for diffusion models and compositional generation.
We collect composition-aware model preferences from multiple models and employ an iterative feedback learning approach to enable the progressive self-refinement of both the base diffusion model and reward models.
[2024.10] The main code of IterComp is released.
[2024.10] Checkpoints of the base diffusion model are publicly available in our Hugging Face repo.
1024*1024 Examples
Enhance RPG with IterComp
Enhance Omost with IterComp
Our checkpoints are publicly available in our Hugging Face repo. Use the code below to try IterComp:
from diffusers import DiffusionPipeline
import torch
pipe = DiffusionPipeline.from_pretrained("comin/IterComp", torch_dtype=torch.float16, use_safetensors=True)
pipe.to("cuda")
# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()
prompt = "An astronaut riding a green horse"
image = pipe(prompt=prompt).images[0]
image.save("output.png")
git clone https://github.com/YangLing0818/IterComp
cd IterComp
conda create -n IterComp python==3.8.10
conda activate IterComp
pip install -r requirements.txt
We provide code for five models to build the model gallery (SD1.5, SDXL, SD3, FLUX-dev, and RPG). The composition-aware model preference dataset covers three key aspects of compositional generation: Attribute Binding (Color, Shape, and Texture), Spatial Relationships, and Non-spatial Relationships. To generate the results for one aspect, run the following command:
python data/model_gallery.py --compositional_metric 'attribute_binding'
--compositional_metric should be one of {'attribute_binding', 'spatial_relationship', 'non_spatial_relationship'}
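To cover all three aspects, the command has to be run once per metric. A minimal sketch of how that loop could look is given below. It only assumes the script path and flag shown above; the helper name `gallery_commands` is ours and not part of the repository.

```python
import shlex

# The three values accepted by --compositional_metric, as listed above.
METRICS = ["attribute_binding", "spatial_relationship", "non_spatial_relationship"]


def gallery_commands(metrics=METRICS):
    """Build one model_gallery.py invocation per compositional metric."""
    return [
        ["python", "data/model_gallery.py", "--compositional_metric", metric]
        for metric in metrics
    ]


for cmd in gallery_commands():
    # Print the command; to actually launch it, import subprocess and
    # call subprocess.run(cmd, check=True) here instead.
    print(shlex.join(cmd))
```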
We recommend expanding the model gallery with additional methods that excel in different aspects of compositionality, such as InstanceDiffusion and Omost. Please follow their codebases to generate images using the same settings as in our code.
We manually rank the images generated by the models in the model gallery, and the final results are saved in the following format:
{
"index": 7,
"prompt": "a blue car and a brown giraffe",
"image_sd15": "datasets/train/attribute_binding/sd15_prompt_7.png",
"image_sdxl": "datasets/train/attribute_binding/sdxl_prompt_7.png",
"image_sd3": "datasets/train/attribute_binding/sd3_prompt_7.png",
"image_flux": "datasets/train/attribute_binding/flux_prompt_7.png",
"image_rpg": "datasets/train/attribute_binding/rpg_prompt_7.png",
"image_instancediffusion": "datasets/train/attribute_binding/instancediffusion_prompt_7.png",
"rank": [4, 3, 5, 2, 1, 6]
},
The i-th element, j, in the rank list indicates that the j-th model in the model gallery is ranked in the i-th position. For example, the rank list [4, 3, 5, 2, 1, 6] means that the 4th model in the model gallery, flux, is ranked 1st, the 3rd model, sd3, is ranked 2nd, and so on.
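The rank convention can be checked with a few lines of Python. This is a minimal sketch, assuming the gallery order is the order of the `image_*` keys in the entry; the helper `models_best_to_worst` is a hypothetical name, not a function from this repository.

```python
entry = {
    "index": 7,
    "prompt": "a blue car and a brown giraffe",
    "image_sd15": "datasets/train/attribute_binding/sd15_prompt_7.png",
    "image_sdxl": "datasets/train/attribute_binding/sdxl_prompt_7.png",
    "image_sd3": "datasets/train/attribute_binding/sd3_prompt_7.png",
    "image_flux": "datasets/train/attribute_binding/flux_prompt_7.png",
    "image_rpg": "datasets/train/attribute_binding/rpg_prompt_7.png",
    "image_instancediffusion": "datasets/train/attribute_binding/instancediffusion_prompt_7.png",
    "rank": [4, 3, 5, 2, 1, 6],
}


def models_best_to_worst(entry, rank_key="rank"):
    """Return model names ordered from 1st place to last place."""
    # Assumption: gallery order == order of the "image_*" keys
    # (Python dicts preserve insertion order).
    gallery = [key[len("image_"):] for key in entry if key.startswith("image_")]
    # Rank values are 1-based indices into the gallery.
    return [gallery[j - 1] for j in entry[rank_key]]


ordered = models_best_to_worst(entry)
print(ordered)
# ['flux', 'sd3', 'rpg', 'sdxl', 'sd15', 'instancediffusion']
```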
First, pair the images two by two according to the ranking to build the dataset:
python train/make_dataset.py
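For intuition, pairing a ranked entry could look like the sketch below: every higher-ranked image is paired with every lower-ranked one. This is an illustration only. The output format of train/make_dataset.py is not shown here, so the field names `preferred` and `rejected` and the helper `preference_pairs` are assumptions.

```python
from itertools import combinations

entry = {
    "index": 7,
    "prompt": "a blue car and a brown giraffe",
    "image_sd15": "datasets/train/attribute_binding/sd15_prompt_7.png",
    "image_sdxl": "datasets/train/attribute_binding/sdxl_prompt_7.png",
    "image_sd3": "datasets/train/attribute_binding/sd3_prompt_7.png",
    "image_flux": "datasets/train/attribute_binding/flux_prompt_7.png",
    "image_rpg": "datasets/train/attribute_binding/rpg_prompt_7.png",
    "image_instancediffusion": "datasets/train/attribute_binding/instancediffusion_prompt_7.png",
    "rank": [4, 3, 5, 2, 1, 6],
}


def preference_pairs(entry):
    """Turn one ranked entry into (preferred, rejected) image pairs."""
    # Assumption: gallery order == order of the "image_*" keys.
    gallery_keys = [key for key in entry if key.startswith("image_")]
    # Image paths sorted from best to worst (rank values are 1-based).
    ranked_paths = [entry[gallery_keys[j - 1]] for j in entry["rank"]]
    # combinations() keeps input order, so the first path of each pair
    # always outranks the second.
    return [
        {"prompt": entry["prompt"], "preferred": better, "rejected": worse}
        for better, worse in combinations(ranked_paths, 2)
    ]


pairs = preference_pairs(entry)
print(len(pairs))  # 6 images give 6 * 5 / 2 = 15 pairs
```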
We provide a script for training the reward models:
bash scripts/train_reward_models.sh
Repeat this process for each of the composition-aware reward models.
The training dataset for multi-reward feedback learning is in data/itercomp_train_data.json. After setting the paths for the composition-aware reward models, you can finetune the base diffusion model as follows:
bash scripts/iterative_fl.sh
After training iteration 0, we can start a new iteration of refinement based on the optimized diffusion model.
In iteration 1, we expand the model gallery with Omost and the optimized base diffusion model using:
python data/iterative_expand_gallery.py
The updated model gallery is saved in the following format:
{
"index": 7,
"prompt": "a blue car and a brown giraffe",
"image_sd15": "datasets/train/attribute_binding/sd15_prompt_7.png",
"image_sdxl": "datasets/train/attribute_binding/sdxl_prompt_7.png",
"image_sd3": "datasets/train/attribute_binding/sd3_prompt_7.png",
"image_flux": "datasets/train/attribute_binding/flux_prompt_7.png",
"image_rpg": "datasets/train/attribute_binding/rpg_prompt_7.png",
"image_instancediffusion": "datasets/train/attribute_binding/instancediffusion_prompt_7.png",
"image_sdxl_iteration1": "datasets/train/attribute_binding/sdxl_iteration1_prompt_7.png",
"image_omost": "datasets/train/attribute_binding/omost_prompt_7.png",
"initial_rank": [4, 3, 5, 2, 1, 6],
"rank": [4, 3, 7, 5, 8, 2, 1, 6]
},
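In the example entry above, removing the two new models (7 and 8) from rank gives back initial_rank, so the relative order of the original six models is unchanged. A minimal sanity check for this is sketched below. Note that this property is only observed in the example; whether every entry in the dataset satisfies it is an assumption, and `is_consistent` is a hypothetical helper.

```python
def is_consistent(initial_rank, rank):
    """Check that `rank` extends `initial_rank` without reordering old models."""
    n_old = len(initial_rank)
    # Keep only the models that were already in the gallery.
    old_only = [j for j in rank if j <= n_old]
    # `rank` must also be a permutation of 1..len(rank).
    is_permutation = sorted(rank) == list(range(1, len(rank) + 1))
    return is_permutation and old_only == list(initial_rank)


initial_rank = [4, 3, 5, 2, 1, 6]
rank = [4, 3, 7, 5, 8, 2, 1, 6]

print(is_consistent(initial_rank, rank))  # True
# Positions of the new models: the 7th model (sdxl_iteration1) and the 8th (omost).
print(rank.index(7) + 1, rank.index(8) + 1)  # 3 5
```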
After updating the composition-aware model preference dataset, retrain the reward models and the base diffusion model following the process described above.
IterComp can serve as a powerful backbone for various compositional generation methods, such as RPG and Omost. We recommend integrating IterComp into these approaches to achieve more advanced compositional generation results. Simply update the backbone path to IterComp to apply the changes.
@article{zhang2024itercomp,
title={IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation},
author={Zhang, Xinchen and Yang, Ling and Li, Guohao and Cai, Yaqi and Xie, Jiake and Tang, Yong and Yang, Yujiu and Wang, Mengdi and Cui, Bin},
journal={arXiv preprint arXiv:2410.07171},
year={2024}
}
IterComp is a general text-to-image generation framework built upon several solid works. Thanks to ImageReward and RPG for their wonderful work and codebases!