[NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models
Official Implementation for "MyVLM: Personalizing VLMs for User-Specific Queries" (ECCV 2024)
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
A live list of papers on game-playing agents and large multimodal models, accompanying the survey "A Survey on Game Playing Agents and Large Models: Methods, Applications, and Challenges".
An up-to-date curated list of research papers and resources on hallucinations in state-of-the-art large vision-language models.
[ECCV 2024] API: Attention Prompting on Image for Large Vision-Language Models
[ICASSP 2025] Open-source code for the paper "Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification" (a generic zero-shot classification sketch follows this list)
[ICLR 2024 Spotlight 🔥] [Best Paper Award SoCal NLP 2023 🏆] Jailbreak in Pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
[ICML 2024] Official code for "Candidate Pseudolabel Learning: Enhancing Vision-Language Models by Prompt Tuning with Unlabeled Data"
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
[Under review] Assessing and Learning Alignment of Unimodal Vision and Language Models
[EMNLP 2024] Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality
A curated list of research on continual learning with pretrained models.
[NeurIPS'24] SpatialEval: a benchmark to evaluate spatial reasoning abilities of MLLMs and LLMs
Awesome Vision-Language Compositionality: a comprehensive curation of research papers from the literature.
Official implementation of "Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models" (ECCV 2024)
Repo for the paper "Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities"
PicQ: Demo for MiniCPM-V 2.6 to answer questions about images using natural language (a minimal chat sketch follows this list).
TOCFL-MultiBench: A multimodal benchmark for evaluating Chinese language proficiency using text, audio, and visual data with deep learning. Features Selective Token Constraint Mechanism (STCM) for enhanced decoding stability.
Streamlit App Combining Vision, Language, and Audio AI Models
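Several entries above concern zero-shot classification with vision-language models (e.g., the ICASSP 2025 remote-sensing work). As background, here is a minimal sketch of the standard CLIP-style zero-shot recipe: embed one text prompt per candidate class, embed the image, and pick the class whose prompt is most similar. The checkpoint, class names, prompt template, and file name are illustrative assumptions, not taken from that paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; a remote-sensing paper may fine-tune or swap the backbone.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical scene classes and prompt template for illustration only.
classes = ["forest", "airport", "harbor", "farmland"]
prompts = [f"a satellite photo of a {c}" for c in classes]

image = Image.open("scene.jpg").convert("RGB")  # hypothetical input file
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into a distribution over the candidate classes.
probs = outputs.logits_per_image.softmax(dim=-1)
print(classes[probs.argmax().item()])
```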
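The PicQ entry is a demo of open-ended image question answering with MiniCPM-V 2.6. Below is a minimal sketch of that pattern following the model card's `trust_remote_code` chat interface; the image path and question are placeholders, and argument names may differ across model versions.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load MiniCPM-V 2.6 with its custom modeling code from the Hub.
model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-2_6", trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-V-2_6", trust_remote_code=True)

# One user turn containing an image followed by a natural-language question.
image = Image.open("photo.jpg").convert("RGB")  # hypothetical input file
msgs = [{"role": "user", "content": [image, "What is happening in this image?"]}]

answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```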