This project is built on the GME model and is used to test image retrieval with arbitrary inputs.
# Set Environment
conda create -n gme python=3.10
conda activate gme
conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=12.1 -c pytorch -c nvidia
conda install -c pytorch -c nvidia faiss-gpu=1.9.0
pip install transformers # tested with 4.47.1
pip install gradio # tested with 5.9.1
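After installation, it can help to confirm that the installed versions match the ones listed above. The following is a minimal sketch using only the standard library; the helper name `check_versions` is hypothetical and not part of this repository. It assumes the packages expose standard distribution metadata, which may not hold for every conda build (faiss-gpu is left out for that reason).

```python
from importlib.metadata import PackageNotFoundError, version

# Versions taken from the install commands above.
TESTED = {
    "torch": "2.3.1",
    "torchvision": "0.18.1",
    "torchaudio": "2.3.1",
    "transformers": "4.47.1",
    "gradio": "5.9.1",
}


def check_versions(tested=TESTED):
    """Return {package: (installed_or_None, tested, matches)}."""
    report = {}
    for name, want in tested.items():
        try:
            have = version(name)
        except PackageNotFoundError:
            have = None  # not installed, or no metadata available
        # Ignore local build suffixes such as "+cu121" when comparing.
        matches = have is not None and have.split("+")[0] == want
        report[name] = (have, want, matches)
    return report


for name, (have, want, ok) in check_versions().items():
    status = "ok" if ok else "differs"
    print(f"{name}: installed={have} tested={want} ({status})")
```

A mismatch does not necessarily mean the project will fail; the listed versions are only the ones it was tested with.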
# Get Model
pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download Alibaba-NLP/gme-Qwen2-VL-2B-Instruct --local-dir gme-Qwen2-VL-2B-Instruct
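The same download can also be done from Python with `huggingface_hub.snapshot_download`. This is a sketch of an alternative, not something the repository ships; the function name `download_gme` is hypothetical. The mirror endpoint is set before `huggingface_hub` is imported because the library reads `HF_ENDPOINT` at import time.

```python
import os

REPO_ID = "Alibaba-NLP/gme-Qwen2-VL-2B-Instruct"


def download_gme(local_dir="gme-Qwen2-VL-2B-Instruct",
                 endpoint="https://hf-mirror.com"):
    """Download the GME checkpoint into local_dir and return its path."""
    # Set the endpoint first: it has no effect if huggingface_hub
    # was already imported elsewhere in the process.
    os.environ["HF_ENDPOINT"] = endpoint
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_gme()` should leave the model in the same directory as the CLI command above, so the resulting path can be passed as `--model_path`.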
- Prepare the retrieval database: run build_index.py to extract features and build the index.
- Run retrieval_app.py for online retrieval.
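The two steps above can be illustrated with a toy example. This sketch is not the code in build_index.py or retrieval_app.py: it replaces the GME model with hand-written vectors and replaces FAISS with a brute-force scan, and it assumes similarity is the inner product of L2-normalized embeddings (cosine similarity). All names and paths are hypothetical.

```python
import heapq
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_index(embeddings):
    """Offline step: store one normalized embedding per gallery image."""
    return [l2_normalize(v) for v in embeddings]


def search(index, image_paths, query, k=3):
    """Online step: return the k (path, score) pairs most similar to query."""
    q = l2_normalize(query)
    scores = (
        (sum(a * b for a, b in zip(q, v)), i) for i, v in enumerate(index)
    )
    top = heapq.nlargest(k, scores)
    # Row i of the index corresponds to image_paths[i].
    return [(image_paths[i], score) for score, i in top]


# Toy gallery: in the real pipeline these vectors would come from the model.
paths = ["gallery/cat.jpg", "gallery/dog.jpg", "gallery/car.jpg"]
embeddings = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]

index = build_index(embeddings)
results = search(index, paths, [0.9, 0.1, 0.0], k=2)
for path, score in results:
    print(f"{path}: {score:.3f}")
```

The point of the sketch is the pairing between index rows and the saved image paths: a search returns row numbers, and the paths file turns them back into images to display.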
Detailed usage
usage: build_index.py [-h] [--model_path MODEL_PATH] [--image_dir IMAGE_DIR] [--batch_size BATCH_SIZE] [--embeddings_output EMBEDDINGS_OUTPUT] [--index_output INDEX_OUTPUT] [--image_paths_output IMAGE_PATHS_OUTPUT]
options:
--model_path MODEL_PATH
Path to the GmeQwen2VL model.
--image_dir IMAGE_DIR
Path to the directory containing new images.
--batch_size BATCH_SIZE
Batch size for embedding extraction.
--embeddings_output EMBEDDINGS_OUTPUT
Output file for saving image embeddings.
--index_output INDEX_OUTPUT
Output file for saving FAISS index.
--image_paths_output IMAGE_PATHS_OUTPUT
Output file for saving image paths.
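The options above map naturally onto an `argparse` parser. The sketch below mirrors the documented flags and help strings only; the actual script presumably defines default values, which are not shown in the usage text and are therefore left unset here. Treating `--batch_size` as an integer is an assumption, and the example argument values are hypothetical.

```python
import argparse


def make_parser():
    """Build a parser with the flags documented for build_index.py."""
    parser = argparse.ArgumentParser(prog="build_index.py")
    parser.add_argument("--model_path",
                        help="Path to the GmeQwen2VL model.")
    parser.add_argument("--image_dir",
                        help="Path to the directory containing new images.")
    parser.add_argument("--batch_size", type=int,  # int is an assumption
                        help="Batch size for embedding extraction.")
    parser.add_argument("--embeddings_output",
                        help="Output file for saving image embeddings.")
    parser.add_argument("--index_output",
                        help="Output file for saving FAISS index.")
    parser.add_argument("--image_paths_output",
                        help="Output file for saving image paths.")
    return parser


# Hypothetical invocation: the directory name and batch size are examples.
args = make_parser().parse_args([
    "--model_path", "gme-Qwen2-VL-2B-Instruct",
    "--image_dir", "gallery",
    "--batch_size", "8",
])
print(args.model_path, args.image_dir, args.batch_size)
```

A smaller batch size is the usual first thing to try if embedding extraction runs out of GPU memory.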
usage: retrieval_app.py [-h] [--model_path MODEL_PATH] [--image_embeddings_file IMAGE_EMBEDDINGS_FILE] [--faiss_index_file FAISS_INDEX_FILE] [--image_paths_file IMAGE_PATHS_FILE]
options:
--model_path MODEL_PATH
Path to the GME model.
--image_embeddings_file IMAGE_EMBEDDINGS_FILE
Path to the image embeddings file.
--faiss_index_file FAISS_INDEX_FILE
Path to the FAISS index file.
--image_paths_file IMAGE_PATHS_FILE
Path to the file containing image paths.
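Because retrieval_app.py depends on three files produced by build_index.py, a quick existence check before launching can save a failed start. The following is a sketch under assumptions: the parser mirrors only the documented flags, and the helper `missing_inputs` is hypothetical rather than part of the repository.

```python
import argparse
import os

FLAGS = (
    "model_path",
    "image_embeddings_file",
    "faiss_index_file",
    "image_paths_file",
)


def make_parser():
    """Build a parser with the flags documented for retrieval_app.py."""
    parser = argparse.ArgumentParser(prog="retrieval_app.py")
    for name in FLAGS:
        parser.add_argument("--" + name)
    return parser


def missing_inputs(args):
    """Return the flags whose value is unset or is not an existing path."""
    missing = []
    for name in FLAGS:
        value = getattr(args, name)
        if not value or not os.path.exists(value):
            missing.append("--" + name)
    return missing


# Hypothetical invocation: only the model path is given here.
args = make_parser().parse_args(["--model_path", "gme-Qwen2-VL-2B-Instruct"])
problems = missing_inputs(args)
if problems:
    print("Missing or unset inputs:", ", ".join(problems))
```

The values passed to the three file flags should be the same paths given to `--embeddings_output`, `--index_output`, and `--image_paths_output` when the index was built.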
- gallery.zip: the images used to build the database (1,131 images).
- query.zip: sample query images (17 images).
- Below are some test results with visualizations.
- Tested on a GeForce RTX 4070 Ti SUPER (16 GB) under WSL2.
Image(+Text) -> Image
I+T.mp4
Text -> Image [Chinese input]
T-zh.mp4
Text -> Image [English input]
T-en.mp4
Text(long) -> Image
LT.mp4
This project is released under the MIT License.
@misc{GME-Search,
author = {Wei Li},
title = {Gradio app with GME for Image Search},
howpublished = {\url{https://github.com/BIGBALLON/GME-Search}},
year = {2025}
}
Please create a pull request if you find any bugs or want to contribute code. 😄