To run the experiments on the crepe dataset, you need to download the Visual Genome data from here and extract both "VG_100K.zip" and "VG_100K_2.zip". Suppose this dataset is stored in /path/to/data/. For other datasets, such as the trec-covid text retrieval dataset, we also assume that the data is stored in the /path/to/data/ folder.
The query files are stored in "prod_hard_negatives/". In this demo, we only use a subset of queries, which is stored in "prod_hard_negatives/prod_vg_hard_negs_swap_all4.csv".
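Assuming the extracted image folders and the query files all sit under /path/to/data/ (adjust the paths if your setup differs), the layout would look roughly like this:

/path/to/data/
    VG_100K/                               (extracted from VG_100K.zip)
    VG_100K_2/                             (extracted from VG_100K_2.zip)
    prod_hard_negatives/
        prod_vg_hard_negs_swap_all4.csv    (query subset used in this demo)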
python main.py --dataset_name crepe --data_path /path/to/data/ --query_count -1 --total_count -1
In this command, "--total_count" represents the number of documents used for the retrieval task: -1 means that all documents are used, while a positive number means that only a subset of the entire document set is used. For a quick demonstration, "--total_count" could be set to 500 or 1000.
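For example, a quick demo run over a 500-document subset (same flags as above, only "--total_count" changed) would be:

python main.py --dataset_name crepe --data_path /path/to/data/ --query_count -1 --total_count 500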
python main.py --dataset_name crepe --data_path /path/to/data/ --query_count -1 --total_count -1 --img_concept --query_concept
In this command, "--img_concept" enables partitioning of the images or documents, while "--query_concept" enables partitioning of the queries.
Run the code by decomposing both images and queries while also using the clustering-based indexes:
python main.py --dataset_name crepe --data_path /path/to/data/ --query_count -1 --total_count -1 --img_concept --query_concept --search_by_cluster
In this command, "--search_by_cluster" means that the clustering-based indexes are constructed for speed-ups.
Here we assume that the dataset is adapted from an image captioning dataset, which means that each image has exactly one caption serving as one query. If there are multiple captions for one image, we just use the first one. To adapt another image captioning dataset for the image retrieval task, simply replace the function "load_crepe_datasets" in main.py with a function that returns four variables: "queries", "raw_img_ls", "sub_queries_ls", and "img_idx_ls". Here "queries" is a list of queries (i.e., the image captions), "raw_img_ls" is a list of raw images (in Pillow image format), "sub_queries_ls" is a list of sub-query lists, one per query, and "img_idx_ls" is the list of image ids.
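As a rough illustration, a replacement loader for a generic captioning dataset might look like the sketch below. The file name "captions.json" and its schema are hypothetical; only the four returned variables are what main.py expects.

import json
from PIL import Image

def load_custom_caption_dataset(data_path):
    # Hypothetical loader: assumes a file "captions.json" with entries like
    # {"image_file": "123.jpg", "caption": "...", "sub_queries": ["...", "..."]}.
    queries, raw_img_ls, sub_queries_ls, img_idx_ls = [], [], [], []
    with open(f"{data_path}/captions.json") as f:
        records = json.load(f)
    for img_idx, record in enumerate(records):
        queries.append(record["caption"])                                     # one caption per image as the query
        raw_img_ls.append(Image.open(f"{data_path}/{record['image_file']}"))  # raw image in Pillow format
        sub_queries_ls.append(record["sub_queries"])                          # list of sub-queries for this query
        img_idx_ls.append(img_idx)                                            # image id
    return queries, raw_img_ls, sub_queries_ls, img_idx_ls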
Note that the retrieval performance is evaluated based on the ground-truth mappings between the queries and the documents/images, which specify the similarity between each query and each image (2 means very similar while 0 means not similar at all). Since such mappings don't exist for an image captioning dataset, we define a function called "construct_qrels" to create them, in which each image-caption pair has a similarity score of 2. If such ground-truth mappings are already given for an image retrieval dataset, you can comment out this function.
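For intuition, a minimal sketch of this mapping construction is shown below. It assumes the common nested-dict qrels format (query id to {document id: score}); the actual signature of "construct_qrels" in this repo may differ.

def build_default_qrels(img_idx_ls):
    # Hypothetical sketch: each caption (query) is mapped to its own image
    # with the maximum similarity score 2; all other pairs are treated as 0.
    qrels = {}
    for img_idx in img_idx_ls:
        qrels[str(img_idx)] = {str(img_idx): 2}
    return qrels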
We can start from the datasets listed in this git repo. Note that the split queries should be put in the folder /path/to/data/${dataset_name}. For example, for the trec-covid dataset, we need to move the file "queries_with_subs.jsonl" to /path/to/data/trec-covid/ before running the experiments.
You can change algebra_method to "one" or "two". By default, "one" is for text retrieval while "two" is for image retrieval.