MedEmbed is a collection of embedding models fine-tuned specifically for medical and clinical data, aimed at enhancing performance in healthcare-related natural language processing (NLP) tasks.
Note: These models are a work in progress, successive releases will be aimed at improving the performance even better and creating a medical embedding leaderboard on MTEB. Also, the work is almost done for late-interation based retrievers based on ColBERT.
Blog Post: Click Here
Model Download Links: v0.1
Dataset Download Links: v1
Developing MedEmbed requires significant resources. If you find it valuable, consider supporting the project. Your contribution helps sustain and improve this open-source initiative.
MedEmbed provides high-quality embedding models tailored for use in medical and clinical contexts. These models are designed to capture the nuances and complexities of medical terminology and concepts, making them particularly useful for a wide range of healthcare-related NLP tasks.
- Fine-tuned embedding models focused on medical and clinical data
- Improved performance on healthcare-specific NLP tasks
- Multiple model variants to suit different use cases and computational requirements
- Extensive evaluation on medical NLP benchmarks
MedEmbed includes several model variants, each fine-tuned using different strategies:
- MedEmbed-Small-v1
- MedEmbed-Base-v1
- MedEmbed-Large-v1
Note: We have also finetuned ColBERT-v2 models, benchmarking is in progress.
Our models have been evaluated on various medical NLP benchmarks for retrieval, including:
- ArguAna
- MedicalQARetrieval
- NFCorpus
- PublicHealthQA
- TRECCOVID
-
Small Models:
- MedEmbed-Small-v1 consistently outperformed the base
BAAI/bge-small-en-v1.5
model across all benchmarks.
- MedEmbed-Small-v1 consistently outperformed the base
-
Base Models:
- MedEmbed-Base-v0 showed significant improvements over the base
BAAI/bge-base-en-v1.5 model
.
- MedEmbed-Base-v0 showed significant improvements over the base
-
Large Models:
MedEmbed-Large-v0
demonstrated superior performance compared to the baseBAAI/bge-large-en-v1.5 model
.
-
Cross-Size Comparison:
- In a comparison across different model sizes,
MedEmbed-Large-v0
showed the best overall performance. - Notably, the medical-tuned small and base models often outperformed the larger base models, indicating significant improvements from domain-specific fine-tuning.
- In a comparison across different model sizes,
Note: More comparisons will be added with other frontier models along with a table.
Our models are trained using a sophisticated synthetic data generation pipeline, leveraging the power of large language models and real-world clinical data.
-
Clinical Notes: We start with a large corpus of patient data clinical notes from PubMed Central (PMC).
-
LLM Processing: These notes are processed through the LLaMA 2 70B model to generate high-quality query-response pairs.
-
Negative Sampling: We perform negative sampling to create challenging negative examples.
-
Triplet Formation: The positive and negative examples are combined to form triplets (query, positive response, negative response).
-
Contrastive Learning: These triplets are used to train our models using contrastive learning techniques inspired by ColBERT and BERT.
This innovative approach allows us to leverage the vast knowledge encoded in large language models while grounding our training data in real-world clinical information, resulting in embedding models that are both comprehensive and medically accurate.
[To be added]
Note: This project is actively evolving. We recommend using this repository as a template for similar projects, as significant code modifications may be necessary to adapt it to your specific needs. Please check for updates regularly and be prepared to adjust your implementation accordingly.
[To be added]
If you use MedEmbed in your research, please cite our work:
@software{balachandran2024medembed,
author = {Balachandran, Abhinand},
title = {MedEmbed: Medical-Focused Embedding Models},
year = {2024},
url = {https://github.com/abhinand5/MedEmbed}
}
This project is licensed under the Apache License Version 2.0. See the LICENSE file for details.
For any queries regarding the codebase or research, please reach out to Abhinand Balachandran at abhinandb.ml@gmail.com.