- ⚡ FrameFinderLE ⚡

FrameFinderLE is an advanced image and video frame retrieval system that enhances CLIP's image-text pairing capabilities with hashtag-based refinement and sophisticated user feedback mechanisms, providing an intuitive and flexible search experience.
FrameFinderLE addresses the limitations of traditional image retrieval systems, particularly when dealing with the complexities of human memory and imprecise queries.
- CLIP's 77-token limit restricts complex or detailed queries
- Human memory and recall are often fragmented and imprecise
- Traditional systems struggle with partial or imperfect user input
FrameFinderLE overcomes these challenges by:
- Extended Descriptions: Combining longer descriptions with traditional prompts to accommodate less precise inputs.
- Hashtag Integration: Allowing users to gradually refine their search using key terms, aligning with natural recall patterns.
- Flexible Search System: Creating an intuitive interface that matches how users naturally remember events and scenes.
- Bridges computer vision, natural language processing, and human-computer interaction
- Advances cognitive computing by adapting to the fluid and imperfect nature of human recall
- Enhances retrieval experiences and contributes to more advanced human-computer interaction models
- CLIP Integration: Utilizes CLIP's powerful image-text pairing capabilities as the foundation of the retrieval system.
- Hashtag-Based Refinement: Allows users to narrow results using partial details or key terms.
- Dual Feedback Systems:
- Immediate Feedback System: Rapidly refines search results based on user likes and dislikes within the current session.
- Aggregated Feedback System: Provides long-term refinement by incorporating historical feedback and balancing exploration with exploitation.
- Similarity-Based Score Adjustment: Utilizes encoded frame representations to adjust scores based on similarities between liked/disliked items and other results.
- Relevant Lookup Feature: Enables new searches based on any result image, creating a more interactive and personalized experience.
- VideoID and Timestamp Filters: Helps users find adjacent frames when searching for specific moments in video clips.
- Multi-Modal Search: Supports text queries, hashtags, and combinations for flexible searching.
The Dynamic Hashtag Exploration in GRAFA is a graph-based retrieval mechanism that discovers and ranks keyframes based on relationships between hashtags. Here's how it works:
- Hashtag Exploration:
- Query Initialization: Starts with the provided query hashtags and their embeddings.
- Graph Traversal: The hashtag co-occurrence graph is traversed to explore neighboring hashtags, capturing broader context while maintaining focus on relevant terms.
- Scoring System:
- Hybrid Score Calculation: Combines neighbor frequency and path length to score hashtags.
- Logarithmic Path Adjustment: Prevents score inflation from repetitive paths by logarithmic scaling.
- Dynamic Exploration:
- Prioritizes hashtags with scores above the mean and adjusts exploration depth based on relevance.
- Stopping Criteria:
- Stops exploration based on reaching a defined number of keyframes, iterations, or a score threshold.
- Keyframe Ranking:
- Normalizes scores and ranks the top keyframes based on their final scores, retrieving the most relevant frames.
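The exploration and scoring steps above can be sketched as follows. This is a minimal toy illustration, not the actual GRAFA implementation; the graph representation (hashtag → neighbor co-occurrence counts), the scoring weights, and the stopping thresholds are all assumptions.

```python
import math
from collections import defaultdict

def explore_hashtags(graph, query_tags, max_iters=3, score_threshold=0.1):
    """Toy dynamic hashtag exploration.
    `graph` maps a hashtag to a dict of {neighbor: co-occurrence count}."""
    scores = defaultdict(float)
    # frontier maps each hashtag to its path length from the query tags
    frontier = {tag: 0 for tag in query_tags}
    for _ in range(max_iters):
        next_frontier = {}
        for tag, depth in frontier.items():
            for neighbor, freq in graph.get(tag, {}).items():
                # Hybrid score: neighbor frequency damped by a logarithmic
                # path-length adjustment to curb score inflation on long
                # or repetitive paths.
                scores[neighbor] += freq / math.log(2 + depth)
                next_frontier[neighbor] = depth + 1
        if not next_frontier:
            break  # nothing left to explore
        # Dynamic exploration: keep only hashtags scoring above the mean.
        mean = sum(scores.values()) / len(scores)
        frontier = {t: d for t, d in next_frontier.items() if scores[t] >= mean}
        # Stopping criteria: empty frontier or all scores below threshold.
        if not frontier or max(scores.values()) < score_threshold:
            break
    return dict(scores)
```

In the real system the resulting hashtag scores would then be normalized and propagated to the keyframes tagged with those hashtags to produce the final ranking.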
The Immediate Feedback System provides rapid refinement of search results based on user interactions in the current session.
- Feedback Processing: Converts user feedback (likes, dislikes) into binary representation and uses pre-encoded frame representations for similarity calculations.
- Score Adjustment: Adjusts scores based on feedback, increasing for similar liked items and decreasing for similar disliked items.
- Similarity-Based Refinement: Scores are weighted by similarity to feedback items.
- Real-time Updates: Scores update immediately after each feedback interaction.
- Final Refinement: The refined results are re-sorted and returned.
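A minimal sketch of this kind of similarity-weighted adjustment, assuming L2-normalized frame embeddings so a dot product gives cosine similarity. The function name, the `weight` parameter, and the data layout are illustrative assumptions, not FrameFinderLE's actual API.

```python
import numpy as np

def refine_scores(scores, embeddings, liked, disliked, weight=0.5):
    """Toy immediate-feedback refinement: boost results similar to liked
    frames, penalize results similar to disliked frames.
    `scores` maps frame_id -> relevance score; `embeddings` maps
    frame_id -> L2-normalized vector."""
    adjusted = dict(scores)
    for frame_id in adjusted:
        vec = embeddings[frame_id]
        for fid in liked:       # feedback encoded as +1
            adjusted[frame_id] += weight * float(np.dot(vec, embeddings[fid]))
        for fid in disliked:    # feedback encoded as -1
            adjusted[frame_id] -= weight * float(np.dot(vec, embeddings[fid]))
    # Re-sort so each feedback interaction is reflected immediately.
    return sorted(adjusted.items(), key=lambda kv: kv[1], reverse=True)
```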
The Aggregated Feedback System refines searches by incorporating historical feedback for long-term personalized results.
- Feedback Processing: Converts feedback into a binary form and applies a time-weighted decay factor so that recent interactions carry more influence.
- Score Adjustment: Adjusts scores based on the feedback factor, balancing exploration with exploitation.
- Time-Sensitive Refinement: Recent feedback has more influence, adapting to changing preferences.
- Final Refinement: Combines adjusted scores with original relevance for re-ranked results.
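The time-weighted decay and exploration/exploitation blend described above might look like this in outline. The exponential half-life decay, the `alpha` blending parameter, and the data structures are assumptions for illustration, not the system's actual implementation.

```python
import math
import time

def aggregate_feedback(history, half_life=86400.0):
    """Toy time-weighted aggregation: each event is (value, timestamp)
    with value +1 (like) or -1 (dislike); recent events count more
    via exponential decay with the given half-life in seconds."""
    now = time.time()
    factor = 0.0
    for value, ts in history:
        age = now - ts
        factor += value * math.exp(-math.log(2) * age / half_life)
    return factor

def rerank(scores, feedback_histories, alpha=0.3):
    """Blend the original relevance score (exploration) with the
    aggregated historical feedback factor (exploitation)."""
    adjusted = {
        fid: (1 - alpha) * s + alpha * aggregate_feedback(feedback_histories.get(fid, []))
        for fid, s in scores.items()
    }
    return sorted(adjusted.items(), key=lambda kv: kv[1], reverse=True)
```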
/FrameFinderLE/
│
├── database/
│ ├── encoded_frames.pt
│ ├── index_caption_hashtag_dict_v2.json
│ ├── key_frame_folder_reduced.zip
│ ├── merged_index_hnsw_baseline_v0.bin
│ ├── merged_index_hnsw_baseline_v2.bin
│ ├── graph_data_full.pkl
│ ├── hashtag_embeddings.pkl
│ ├── hashtag_embeddings.bin
│ ├── __init__.py
│ └── db_init.py
│
├── diagram/
│ └── FrameFinderLE_diagram.png
│
├── models/
│ ├── __init__.py
│ └── model_init.py
│
├── routers/
│ ├── __init__.py
│ ├── data_router.py
│ ├── feedback_router.py
│ ├── home_router.py
│ ├── process_query_router.py
│ ├── search_router.py
│ └── update_results_router.py
│
├── static/
│ ├── script/
│ │ ├── home_script.js
│ │ ├── popup_mess_script.js
│ │ ├── show_results_script.js
│ │ └── update_results_script.js
│ ├── style/
│ │ ├── home_style.js
│ │ ├── popup_mess.js
│ │ └── show_results_style.js
│ └── images/
│ └── key_frame_folder_reduced/
│ ├── key_frame_folder_videos-l01
│ │ ├── keyframe_L01_V001
│ │ │ ├── 0000161_6.44.webp
│ │ │ ├── 0000350_13.98.webp
│ │ │ └── ...
│ │ └── ...
│ ├── key_frame_folder_videos-l02
│ │ ├── keyframe_L02_V002
│ │ │ ├── 0000010_0.35.webp
│ │ │ ├── 0000040_1.3165.webp
│ │ │ └── ...
│ │ └── ...
│ └── ...
│
├── templates/
│ ├── data.html
│ ├── home.html
│ ├── layout.html
│ ├── results_content.html
│ ├── show_results.html
│ └── v0_search_results.html
│
├── tools/
│ ├── __init__.py
│ ├── aggregated_refining.py
│ ├── faiss_retrieval.py
│ ├── feedback_processing.py
│ ├── graph_based_image_retrieval.py
│ ├── hashtags_generating.py
│ ├── hashtags_processing.py
│ ├── immediate_refining.py
│ ├── info_extracting.py
│ ├── query_encoding.py
│ ├── results_display.py
│ ├── search_utils.py
│ └── utils.py
│
├── __init__.py
├── app_notebook.ipynb
├── app.py
├── README.md
└── requirements.txt
- Running the Application with Docker (via Docker Hub)
- Prerequisites: Install Docker on your machine.
- Steps:
  1. Pull the Docker image from Docker Hub:
     ```
     docker pull thuyhale/frame_finder_le:latest
     ```
  2. Run the Docker container:
     ```
     docker run -p 8000:8000 thuyhale/frame_finder_le:latest
     ```
  3. Open your browser and navigate to http://localhost:8000
- Running the Application without Docker
- Prerequisites:
- Install Python 3.9 or higher.
- Install pip.
- (Optional) Set up a virtual environment.
- Steps:
  1. Clone the repository:
     ```
     git clone https://github.com/ThuyHaLE/FrameFinderLE.git
     cd FrameFinderLE
     ```
  2. Install the required dependencies:
     ```
     pip install -r requirements.txt
     ```
  3. Load the database:
     ```
     pip install gdown

     # Download and unzip the keyframe images into static/images
     gdown 1-92UIqmQ5ODeZlSQ61cjFmUdLVZZ_HfV   # key frame folder (key_frame_folder_reduced.zip)
     unzip -q key_frame_folder_reduced.zip -d static/images

     # Download the database files into database/
     cd database
     # FAISS
     gdown 1-CDUlIAIYAk5L87tXlYFosbUXQQANam8   # annotation (index_caption_hashtag_dict_v2.json)
     gdown 1EvNEWTNPe8Tk20-Tn0O6BwAgURLJTHZP   # database CLIP_v0 (merged_index_hnsw_baseline_v0.bin)
     gdown 1-85d-oCWU39o9d8Ie0c5093fKTp0IpwO   # database CLIP_v2 (merged_index_hnsw_baseline_v2.bin)
     # GRAFA
     gdown 1-AotePkVml3iQONPxCZeK-gDjQFI0Asb   # graph database (graph_data_full.pkl)
     gdown 1ZRt1-qvJP2CJcWzGykWQVN9JLHBS5XFR   # hashtag embeddings (hashtag_embeddings.pkl)
     gdown 1tZyr1h8yDJO_CXuMn5ounEEdiFKD530d   # hashtag embeddings (hashtag_embeddings.bin)
     gdown 1-KQx8lD7tHJH-RpbLE9gUBA8k_VV5fsI   # encoded frames (encoded_frames.pt)
     cd ..
     ```
  4. Run the application:
     ```
     uvicorn app:app --reload
     ```
  5. Open your browser and navigate to http://localhost:8000
[Updating...]
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.