MDCure is an effective and scalable approach for synthetic data generation and curation tailored for multi-document (MD) instruction-tuning. This repository provides datasets, models, and source code to reproduce MDCure.
- Datasets & Models
- MDCure Dataset Construction
- MDCure Instruction Tuning
- Evaluation
- Citation
All size variations of our full MDCure datasets are available on HuggingFace. You can download and use the data via HuggingFace datasets:
Dataset | HuggingFace Repo | Description |
---|---|---|
MDCure-12k | 🤗 HF Repo | Multi-document instruction dataset of size 12K, filtered using MDCureRM |
MDCure-36k | 🤗 HF Repo | Multi-document instruction dataset of size 36K, filtered using MDCureRM |
MDCure-72k | 🤗 HF Repo | Multi-document instruction dataset of size 72K, filtered using MDCureRM |
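For example, the data can be loaded with the `datasets` library. The repository ID below is a placeholder; substitute the actual HuggingFace repo linked in the table above:

```python
from datasets import load_dataset

# Placeholder repo ID -- replace with the HuggingFace repo linked in the table above.
dataset = load_dataset("<hf_org>/MDCure-72k")

# Inspect one instruction-context-answer example (assuming a "train" split).
print(dataset["train"][0])
```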
The table below contains example instances of the final MDCure instruction samples. Each dataset consists of multi-document instruction-answer pairs designed to simulate complex real-world information synthesis tasks. Each entry contains a comprehensive instruction, a context composed of multiple documents, and a well-formed answer.
Instruction | Context | Answer |
---|---|---|
How did the combination of weather conditions, game dynamics, and viewership trends contribute to the overall perception of the Falcons-Eagles season opener, and what implications might this have for the NFL's ratings going forward? Respond with 3-4 sentences. | In what had to be a disappointing night for both NBC and the NFL, the Falcons and Eagles played a fairly ugly game after a lengthy weather delay. That's not how the league wanted to kick off the regular season, and while it got sort of exciting at the end, the overall recipe was a bad one for ratings. That was confirmed today, as the overnights came in... | The season opener between the Atlanta Falcons and Philadelphia Eagles faced a significant weather delay that postponed the game's start until 9:05 p.m. ET, leading to a lackluster atmosphere and frustration among fans eager for the NFL to return. The match itself was characterized by a dismal performance, marked by a total of 26 penalties compared to only 30 points scored, which caused offensive excitement to dwindle, especially... |
What happened in CHAMPAIGN regarding Lovie Smith and the 2019 defense improvements? Respond with 1-2 sentences. | CHAMPAIGN - Lovie Smith knows his defense has to take a significant step forward in 2019. After defensive coordinator Hardy Nickerson left during the fall due to health concerns, Smith took over as the primary play-caller. This offseason, he didn't hire a defensive coordinator... | Lovie Smith took over as the defensive play-caller for 2019 after the previous coordinator, Hardy Nickerson, stepped down due to health concerns. Smith decided not to hire a replacement and instead took on the responsibilities himself. The defense has shown significant improvement in the spring practices. |
We release our MDCure'd models, instruction-tuned from various base models to exhibit superior performance on multi-document tasks, as well as MDCureRM, our custom reward model designed specifically for the MD setting to filter and select high-quality MD instruction data in a cost-effective way.
Model | HuggingFace Repo | Description |
---|---|---|
MDCureRM | 🤗 HF Repo | Multi-objective reward model to filter MD instruction data more cheaply and effectively than GPT-3.5-Turbo. |
MDCure-FlanT5-Base | 🤗 HF Repo | FlanT5-Base fine-tuned with MDCure-72k, optimized for MD tasks. |
MDCure-FlanT5-Large | 🤗 HF Repo | FlanT5-Large fine-tuned with MDCure-72k, optimized for MD tasks. |
MDCure-Qwen2-1.5B-Instruct | 🤗 HF Repo | Qwen2-1.5B-Instruct fine-tuned with MDCure-72k, optimized for MD tasks. |
MDCure-Qwen2-7B-Instruct | 🤗 HF Repo | Qwen2-7B-Instruct fine-tuned with MDCure-72k, optimized for MD tasks. |
MDCure-LLAMA3.1-8B-Instruct | 🤗 HF Repo | LLAMA3.1-8B-Instruct fine-tuned with MDCure-72k, optimized for MD tasks. |
MDCure-LLAMA3.1-70B-Instruct | 🤗 HF Repo | LLAMA3.1-70B-Instruct fine-tuned with MDCure-72k, optimized for MD tasks. |
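The instruction-tuned models can be used with standard `transformers` loading. This is a minimal sketch; the repo ID is a placeholder for the HuggingFace link in the table above, and the toy prompt is illustrative only:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo ID -- use the actual HuggingFace repo linked in the table above.
model_id = "<hf_org>/MDCure-Qwen2-1.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# A toy multi-document prompt: two short documents followed by an instruction.
prompt = (
    "Document 1: ...\n\nDocument 2: ...\n\n"
    "Compare the main claims of the two documents. Respond with 1-2 sentences."
)
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(inputs, max_new_tokens=128)[0], skip_special_tokens=True))
```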
The MDCure dataset is constructed through a two-phase process:
- Generation Phase: Zero-shot prompt templates are used to generate complex, cross-text instructions from related documents.
- Filtering Phase: The generated instructions are filtered by MDCureRM, a multi-objective reward model, to ensure quality and diversity.
Below is a visual representation of the MDCure pipeline:
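In addition to the pipeline figure, the end-to-end flow can be summarized in pseudocode. This is an illustrative sketch only; the generation and scoring callables are hypothetical and passed in by the caller, not part of the repository's API:

```python
def mdcure_pipeline(document_clusters, prompt_templates, generate_fn, score_fn, threshold):
    """Illustrative two-phase MDCure flow (hypothetical callables supplied by the caller)."""
    # Generation phase: prompt a generator LLM (e.g., GPT-3.5-Turbo) with zero-shot
    # templates over related documents to produce candidate instruction-answer pairs.
    candidates = [
        generate_fn(cluster, template)
        for cluster in document_clusters
        for template in prompt_templates
    ]
    # Filtering phase: keep only candidates whose MDCureRM overall score clears the threshold.
    return [sample for sample in candidates if score_fn(sample) >= threshold]
```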
- We use the NewSHead dataset as the source for sets of (related) context documents. The pre-processed and cleaned NewSHead dataset can be downloaded from here. After downloading the data, unzip the file to a new folder named `./0_source_data_preparation/newshead`.
- To extract the clusters of articles from the NewSHead dataset, create the `./0_source_data_preparation/base_articles` directory and then run the following command:

  ```bash
  python ./0_source_data_preparation/extract_source_articles.py
  ```

- In addition to the article clusters, we utilize pairs of snippets 1-3 sentences in length, selected from different documents within each cluster, to perform instruction data generation with a portion of the MDCure generation prompt templates. To generate relevant pairs of snippets for each cluster in NewSHead and save them to the `./0_source_data_preparation/base_snippet_pairs` directory, run:

  ```bash
  python ./0_source_data_preparation/save_snippet_pairs.py
  ```

- During the Generation phase of MDCure, we employ two types of prompt templates (General & Style-Specific) to encourage a generator model to produce a diverse set of instruction data. The specific prompt templates we utilize to elicit relevant multi-document instruction-answer pairs based on the provided context documents and/or text snippets can be found in `./mdcure_prompt_templates`. These are described further below.
- We use GPT-3.5-Turbo as our generator model, but other powerful instruction-following LLMs may be substituted in its place. Be sure to set your OpenAI API key for use with the code.
- Please note that, depending on the number of instructions being generated, this process can be slow, as it relies on numerous calls to the OpenAI API. To avoid issues due to interruptions of the program execution, we use a save interval of 10 instructions, so that after every 10 samples the results are saved to a distinct `.pt` file. The save interval can be adjusted as desired in `./1_mdcure_generation/generate_general.py` and `./1_mdcure_generation/generate_style_specific.py`.
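The save-interval pattern in the last bullet is simple to replicate. A minimal sketch of the idea is shown below; the generator call is a hypothetical stand-in for the actual OpenAI API wrapper, not the repository's code:

```python
import torch

SAVE_INTERVAL = 10  # number of generated samples per checkpoint file

def call_generator_api(prompt: str) -> str:
    """Placeholder for the actual OpenAI API call (hypothetical stand-in)."""
    return f"<generated instruction-answer pair for: {prompt[:40]}...>"

def generate_with_checkpoints(prompts, output_prefix):
    """Generate instructions and periodically persist partial results to .pt files."""
    buffer = []
    for i, prompt in enumerate(prompts, start=1):
        buffer.append(call_generator_api(prompt))
        if i % SAVE_INTERVAL == 0:
            # An interrupted run then loses at most SAVE_INTERVAL samples.
            torch.save(buffer, f"{output_prefix}_{i // SAVE_INTERVAL}.pt")
            buffer = []
    if buffer:  # flush any remaining samples
        torch.save(buffer, f"{output_prefix}_final.pt")

generate_with_checkpoints([f"prompt {k}" for k in range(25)], "chunk")
```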
Install the required packages:
```bash
pip install -r ./1_mdcure_generation/gen_requirements.txt
```
General templates broadly ask the generator model to produce an instruction-answer pair adhering to a certain answer length, with emphasis on creating instructions that cannot be answered if any one context document is removed; we employ various phrasings of this specification to obtain prompts A-F. Further variants are captured by prompts G-N, described below.
To generate instruction-answer pairs using general prompts A-F, use `./1_mdcure_generation/generate_general.py`. This file can be run with several combinations of arguments; for basic usage, you can run:
```bash
# Create generation prompts for training set
python ./1_mdcure_generation/generate_general.py --only_generate_prompts --num_prompts_to_generate=<desired_num_train_generations> --use_train --prompt_num=<prompt_num>

# Create generation prompts for validation set
python ./1_mdcure_generation/generate_general.py --only_generate_prompts --num_prompts_to_generate=<desired_num_valid_generations> --use_valid --prompt_num=<prompt_num>

# Use prompts to elicit candidate instructions
python ./1_mdcure_generation/generate_general.py --use_existing_prompts --given_prompt_dir=./1_mdcure_generation/general_instructions/prompt_<prompt_id>/prompts
```
The resulting instructions will be saved to a folder path of the form `./1_mdcure_generation/generations_general/prompt_<prompt_id>/instructions`, and the prompts used to obtain these instructions will be saved to `./1_mdcure_generation/generations_general/prompt_<prompt_id>/prompts`.
Style-Specific templates employ a single phrasing while incorporating varying combinations of constraints regarding instruction complexity (e.g., analysis, recall), type (e.g., paraphrasing, inference), style (e.g., imperative, interrogative), and output length (e.g., 3-4 words, 2 sentences), with pre-defined options for each specification category. This helps ensure diversity in the generated candidate data. We consider 4 characteristics, 4 types, and 3 styles of instruction, in combination with 8 possible answer lengths. Any number of context documents can be considered for generation with these prompt templates. Further details can be seen in `./mdcure_prompt_templates`.
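As a rough illustration of how the style-specific template space is spanned, the sketch below enumerates combinations of specification options. The option lists are small example subsets quoted from the description above, not the full sets used in the repository:

```python
import itertools

# Example options per specification category (illustrative subsets only).
complexities = ["analysis", "recall"]
types = ["paraphrasing", "inference"]
styles = ["imperative", "interrogative"]
lengths = ["3-4 words", "2 sentences"]

# Each combination fills one style-specific template; the full MDCure grid
# uses 4 characteristics x 4 types x 3 styles x 8 answer lengths.
for complexity, q_type, style, length in itertools.product(complexities, types, styles, lengths):
    print(f"complexity={complexity}, type={q_type}, style={style}, length={length}")
```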
To generate instruction-answer pairs using general prompts G-N or style-specific prompts, use `./1_mdcure_generation/generate_style_specific.py`. This file can be run with several combinations of arguments; for basic usage, you can run:
```bash
# For general templates G-N
python ./1_mdcure_generation/generate_style_specific.py --use_train --model=gpt-4o-mini --num_prompts_to_generate=<desired_number_of_generations> --num_docs_to_prompt_with=<desired_number_of_docs_in_context> --straightforward_prompt --prompt_num=<prompt_num>

# For style-specific templates
python ./1_mdcure_generation/generate_style_specific.py --use_train --model=gpt-4o-mini --num_prompts_to_generate=<desired_number_of_generations> --template_prompt --num_docs_to_prompt_with=<desired_number_of_docs_in_context> --char_idx=<complexity_specification_index> --type_idx=<type_specification_index> --style_idx=<style_specification_index> --length_idx=<output_length_specification_index> --cluster_start_multiplier=<idx>
```
The resulting instructions will be saved to a folder path of the form `./1_mdcure_generation/generations_style_specific/<template_description>/instructions`, and the prompts used to obtain these instructions will be saved to `./1_mdcure_generation/generations_style_specific/<template_description>/prompts`.
To extract the generated instruction-answer pairs from the raw outputs of the generator model, we use `./1_mdcure_generation/extract_general.py` and `./1_mdcure_generation/extract_style_specific.py`. Sample usages are as follows:
```bash
# For general templates A-F
python ./1_mdcure_generation/extract_general.py --input_dir=./1_mdcure_generation/generations_general/prompt_<prompt_id> --prompt_num=<prompt_num>

# For general templates G-N & style-specific templates
python ./1_mdcure_generation/extract_style_specific.py --input_dir=./1_mdcure_generation/generations_style_specific/<template_specification>
```
For all samples, we append an additional sentence or phrase indicating the expected length of the answer, in terms of the number of words or sentences to produce. For data generated using prompts G-N or style-specific prompts, the length direction is added automatically during the extraction step above. For data generated using prompts A-F, this is done with `./1_mdcure_generation/add_length_direction.py`, which can be used as follows:

```bash
python ./1_mdcure_generation/add_length_direction.py --data_json_dir=<path_to_directory_with_generated_instruction_jsons>
```
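A minimal sketch of what appending a length direction looks like is shown below. The phrasing and the crude sentence count are illustrative only; the actual directions are defined in the script above:

```python
def add_length_direction(instruction: str, answer: str) -> str:
    """Append a phrase telling the model how long its answer should be (illustrative only)."""
    n_sentences = max(1, answer.count("."))  # crude sentence count for illustration
    if n_sentences <= 1:
        direction = "Respond with 1-2 sentences."
    else:
        direction = f"Respond with {n_sentences}-{n_sentences + 1} sentences."
    return f"{instruction} {direction}"

print(add_length_direction(
    "How do the two articles differ in tone?",
    "The first is neutral. The second is critical.",
))
```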
During the Filtering phase of MDCure, we train and employ a fine-grained, MD-specific reward model, MDCureRM, to evaluate each instruction-answer pair against six criteria. These criteria, shown below, capture both the overall quality of the instruction-response pairs and their effectiveness in handling multi-document content:
Criterion | Description |
---|---|
Context Integration | How well does the instruction synthesize information from multiple documents? |
Inter-Document Relationships | Does the instruction help understand relationships between documents? |
Complexity | Does the instruction challenge critical thinking and analysis skills over multiple sources? |
Relevance | Is the instruction aligned with the document content? |
Coherence & Factuality | Is the response logical, factual, and well-supported by the documents? |
Creativity | How diverse or creative is the instruction in question type and format? |
MDCureRM is used in an RM-as-a-Judge fashion to filter the generated instruction set and yield high-quality training data more cheaply and effectively than GPT-3.5-Turbo. We walk through the MDCureRM training and inference process below.
Install the requirements with pip:
```bash
pip install -r ./2_mdcure_filtering/mdcurerm/rm_requirements.txt
```
To create fine-grained reward training data for MDCureRM, follow these steps:

- **Generate Candidate Instructions:** Use the `get_rm_candidate_instructions.py` script to generate instruction samples from models such as Mistral-7B Instruct or GPT-4o-mini. Alternatively, you can build your own generation pipeline using the instruction generation prompts. These generated instructions will serve as input for the reward model training pipeline.

  ```bash
  python ./2_mdcure_filtering/mdcurerm/get_rm_candidate_instructions.py --rubric <rubric> --num_sample 2000 --file_path <path_to_context_docs> --openai_key <openai_api_key> --mistral_key <mistral_api_key>
  ```

- **Prepare and Finalize Training Data:** Once the candidate instructions are generated, use GPT-4 or a similar model to assign ratings based on criteria such as relevance, coherence, and quality. These ratings will be used to train the fine-grained reward model.

  ```bash
  python ./2_mdcure_filtering/mdcurerm/get_rm_training_data.py
  ```
To train the MDCure Reward Model (RM), which is based on 1-layer regression, follow these steps:

- **Input Data:** The expected input data should be in JSONL format, located at `./data/parsed_ratings.jsonl`. This file contains the parsed training data required for the reward model.
- **Running the Training:** To initiate the training process, execute the `train_rm.py` script. The script uses the `parsed_ratings.jsonl` data and sets up a 1-layer regression head appended to the end of the base LLM, which is initialized with Bradley-Terry weights.
- **Example Command:**

  ```bash
  torchrun --nproc_per_node=8 ./2_mdcure_filtering/mdcurerm/train_rm.py \
      --output_dir=<model_save_path> \
      --per_device_train_batch_size=1 \
      --per_device_eval_batch_size=1 \
      --num_train_epochs=4 \
      --gradient_accumulation_steps=8 \
      --learning_rate=1e-4 \
      --report_to="wandb" \
      --optim="adamw_8bit" \
      --logging_steps=5 \
      --eval_strategy="steps" \
      --evaluation_strategy="steps" \
      --save_strategy="steps" \
      --eval_steps=125 \
      --save_steps=125 \
      --fp16=True \
      --lr_scheduler_type="linear" \
      --warmup_ratio=0.1 \
      --remove_unused_columns=False \
      --save_total_limit=10 \
      --push_to_hub=False \
      --label_names="rewards" \
      --wandb_run_name="real_1e-4" \
      --preference_data_path="./data/parsed_ratings.jsonl" \
      --base_model_name="sfairXC/FsfairX-LLaMA3-RM-v0.1"
  ```
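For intuition, the architecture this sets up is roughly a base LM with a single linear regression layer mapping the final hidden state to the six criterion scores. The sketch below is an illustrative re-implementation under that assumption, not the code in `train_rm.py`:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class MultiCriteriaRewardModel(nn.Module):
    """Base LM + 1-layer regression head predicting six criterion scores (illustrative)."""

    def __init__(self, base_model_name: str, num_criteria: int = 6):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_model_name)
        hidden_size = self.backbone.config.hidden_size
        # Single linear layer regressing the sequence representation to six scores.
        self.regression_head = nn.Linear(hidden_size, num_criteria)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # Use the hidden state of the last non-padding token (assumes right padding).
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.regression_head(pooled)  # shape: (batch, 6)
```

Training then regresses these six outputs against the GPT-4 ratings collected above (e.g., with an MSE-style objective).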
The `score_and_filter.py` script is designed to score instructions using a pretrained reward model. It evaluates instruction data against multiple scoring criteria and filters out low-quality instructions based on a user-set threshold. The model applies a weighted average of the six scoring criteria to predict instruction quality.

This script can be used to score instruction samples with the following command:

```bash
python ./2_mdcure_filtering/score_and_filter.py \
    --input_dir=<path_to_instruction_jsons> \
    --scoring_model_path=./reward_modeling/models/reward_model \
    --prompt_num=0 \
    --score_general \
    --score_style_specific
```
- `--input_dir`: Directory containing the input JSON files for training and validation, created during the Generation phase.
- `--scoring_model_path`: Path to MDCureRM.
- `--prompt_num`: Prompt ID for general template-based samples.
- `--score_general`: Flag to indicate scoring of general template-based samples.
- `--score_style_specific`: Flag to indicate scoring of style-specific template-based samples.
- **HF API Key**: Ensure that your Hugging Face (HF) API key is set at the top of the script to access the LLAMA model. This is required for model loading.
- **Scoring Mechanism**: MDCureRM generates a vector of six floating-point values (one per criterion), which are scaled to the 1-5 range. These six scores are combined into an overall score using a weighted sum, with general quality criteria (Relevance, Coherence & Factuality, Creativity) weighted at $\frac{1}{9}$ and multi-document-specific criteria (Context Integration, Inter-Document Relationships, Complexity) weighted at $\frac{2}{9}$.
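Concretely, the overall score is computed roughly as follows. This is a sketch of the weighting described above with illustrative variable names and example scores:

```python
# Six criterion scores from MDCureRM, each scaled to the 1-5 range (example values).
scores = {
    "relevance": 4.2,
    "coherence_factuality": 4.5,
    "creativity": 3.8,
    "context_integration": 4.0,
    "inter_document_relationships": 4.3,
    "complexity": 3.9,
}

GENERAL = ("relevance", "coherence_factuality", "creativity")                          # weight 1/9 each
MD_SPECIFIC = ("context_integration", "inter_document_relationships", "complexity")   # weight 2/9 each

overall = sum(scores[c] * (1 / 9) for c in GENERAL) + sum(scores[c] * (2 / 9) for c in MD_SPECIFIC)
print(round(overall, 3))  # weights sum to 1, so the overall score stays on the 1-5 scale
```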
To set up the environment, install the necessary dependencies using the following command:
```bash
pip install -r ./instruction_tuning/it_requirements.txt
```
We utilize Unsloth to train large-scale (billion-parameter) models. For detailed guidance on Unsloth, please refer to their official GitHub repository.
We use `./instruction_tuning/train.py` to fine-tune various models, such as Flan-T5 and LLAMA models, via instruction tuning. The script supports multiple model architectures, including Seq2Seq models (`T5`) and autoregressive models (`LLAMA`, `Qwen`), along with PEFT (Parameter-Efficient Fine-Tuning) techniques such as LoRA.
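For readers unfamiliar with PEFT, here is a minimal, generic LoRA setup using the `peft` library. It illustrates the technique only; it is not the configuration used in `train.py`, and the base model ID is just an example:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Any causal LM works here; the model ID is only an example.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B-Instruct")

# LoRA injects small trainable low-rank adapters into the attention projections,
# so only a tiny fraction of parameters is updated during instruction tuning.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of parameters are trainable
```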
To train a model using Unsloth (for example, the LLAMA3.1-8B-Instruct model), you can use the following command:

```bash
torchrun --nproc_per_node=1 --rdzv_endpoint=127.0.0.1:46967 ./instruction_tuning/train.py \
    --model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit" \
    --wandb_project_name=<proj_name> \
    --output_dir=./instruction_tuning/models \
    --lr=5e-5 \
    --lr_scheduler_type="linear" \
    --optimizer="adamW" \
    --warmup_ratio=0.15 \
    --weight_decay=0.01 \
    --save_total_limit=<save_limit> \
    --per_train_batch_size=4 \
    --per_eval_batch_size=4 \
    --grad_acc_steps=16 \
    --wandb_group=<group_name> \
    --num_train_steps=<train_steps> \
    --save_steps=<save_steps> \
    --eval_steps=<eval_steps> \
    --train_num_samples=<train_num_samps> \
    --valid_num_samples=<valid_num_samps> \
    --train_json_path=<train_json_path> \
    --valid_json_path=<valid_json_path> \
    --train_num_samples2=<train_num_samps2> \
    --valid_num_samples2=<valid_num_samps2> \
    --train_json_path2=<train_json_path2> \
    --valid_json_path2=<valid_json_path2> \
    --tokenization_batch_size=1 \
    --use_truncation \
    --use_unsloth \
    --use_chat_template \
    --choose_by_rank \
    --choose_by_rank2
```
- **Model-specific Arguments:**
  - `--model_name`: Specify the pretrained model to fine-tune (e.g., `google/long-t5-tglobal-base`, `unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit`).
  - `--use_unsloth`: Activate this flag when training with Unsloth models.
  - `--use_peft`: Enable PEFT (e.g., LoRA) for memory-efficient training of large models.
- **Wandb Logging:**
  - `--wandb_project_name`: Define the Weights and Biases (Wandb) project to log results to.
  - `--wandb_group`: Specify a group for the runs, helpful for tracking related experiments.
- **Training Configuration:**
  - `--lr`: Learning rate for training.
  - `--lr_scheduler_type`: Learning rate schedule (e.g., constant, linear).
  - `--optimizer`: Optimizer to use (e.g., `adamW` or `adafactor`).
  - `--grad_acc_steps`: Gradient accumulation steps.
  - `--save_steps` and `--eval_steps`: Number of steps between checkpoint saves and between validation runs, respectively.
  - `--num_train_steps`: Total number of training steps.
  - `--save_total_limit`: Maximum number of checkpoints to save.
  - `--per_train_batch_size` and `--per_eval_batch_size`: Per-device training and evaluation batch sizes.
- **Dataset-related Arguments:**
  - `--train_num_samples` and `--valid_num_samples`: Number of training and validation samples.
  - `--train_json_path` and `--valid_json_path`: Paths to the JSON files for the training and validation datasets.
  - `--choose_by_rank`: Enable sampling based on scores (e.g., for MDCure datasets).
- **Tokenization:**
  - `--tokenization_batch_size`: Batch size for tokenization. Must be 1 for MDCure datasets.
  - `--truncate_max_length`: Maximum truncation length when tokenizing long sequences for training.
We list below the evaluation benchmarks that we used in our paper. For Multi-XScience and QMDSCNN, we utilize GPT-3.5-Turbo via the OpenAI API and Gemini 1.5 Pro via the Gemini API, respectively, to obtain scores for model outputs; for all other benchmarks we utilize their built-in evaluation metrics.
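As an illustration of LLM-based scoring for the generation benchmarks, a minimal sketch using the OpenAI API is shown below. The judging prompt and 1-5 scale are hypothetical, not the exact ones used in our evaluation scripts:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_summary(reference: str, candidate: str) -> str:
    """Ask GPT-3.5-Turbo to rate a model summary against a reference (illustrative prompt)."""
    prompt = (
        "Rate the candidate summary against the reference on a 1-5 scale for faithfulness "
        "and coverage. Reply with a single number.\n\n"
        f"Reference:\n{reference}\n\nCandidate:\n{candidate}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

print(judge_summary("Reference summary text...", "Model-generated summary text..."))
```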
If you find the content of this project helpful, please cite our paper as follows:
```bibtex
@article{liu2024mdcure,
    title={MDCure: A Scalable Pipeline for Multi-Document Instruction-Following},
    author={Gabrielle Kaili-May Liu and Bowen Shi and Avi Caciularu and Idan Szpektor and Arman Cohan},
    journal={arXiv preprint arXiv:2410.23463},
    year={2024},
    url={https://arxiv.org/abs/2410.23463}
}
```