Evaluating davinci-002 on HellaSwag, with prefix caching and flash attention enabled by default:
```bash
python inference.py -m davinci-002 -d hellaswag
```
Evaluating Gemma on MMLU:
```bash
python inference.py -m gemma-7b -d mmlu -shots 5
```
This reports results on the 57 subsets of MMLU, along with the macro-average performance over its four categories.
Evaluating Phi-2 on GSM8k using self-consistency and 4-bit quantization:
```bash
python inference.py -m microsoft/phi-2 -d gsm8k -shots 8 --sample_num 100 --load_in_4bit
```
Evaluating LLaMA-2 (7B) on CMMLU and C-Eval as an instruction model using vLLM:
```bash
CUDA_VISIBLE_DEVICES=0 python inference.py -m llama-2-7b-hf -d cmmlu ceval --vllm True --model_type instruction
```
We use all CUDA devices by default. You can specify the devices with `CUDA_VISIBLE_DEVICES`.
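For example, to run the MMLU evaluation above on the first two GPUs only:
```bash
CUDA_VISIBLE_DEVICES=0,1 python inference.py -m gemma-7b -d mmlu -shots 5
```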
Define the model parameters, efficient evaluation settings, generation arguments, quantization, and additional configuration options.
We provide an enumeration (`enum`) of models corresponding to each `model_backend`. If a model is not listed within this enumeration, `--model_backend` should be specified directly.
--model_name_or_path MODEL_NAME_OR_PATH, --model MODEL_NAME_OR_PATH, -m MODEL_NAME_OR_PATH
The model name or path, e.g., davinci-002, meta-llama/Llama-2-7b-hf, ./mymodel (default: None)
--model_type {base,instruction}
The type of the model, which can be chosen from `base`
or `instruction`. (default: base)
--model_backend {anthropic,dashscope,huggingface,openai,qianfan,vllm}
The model backend
--device_map DEVICE_MAP
The device map for model and data (default: auto)
--vllm [VLLM] Whether to use vllm (default: False)
--flash_attention [FLASH_ATTENTION]
Whether to use flash attention (default: True)
--no_flash_attention Disable flash attention (default: False)
--openai_api_key OPENAI_API_KEY
The OpenAI API key (default: None)
--anthropic_api_key ANTHROPIC_API_KEY
The Anthropic API key (default: None)
--dashscope_api_key DASHSCOPE_API_KEY
The Dashscope API key (default: None)
--qianfan_access_key QIANFAN_ACCESS_KEY
The Qianfan access key (default: None)
--qianfan_secret_key QIANFAN_SECRET_KEY
The Qianfan secret key (default: None)
--tokenizer_name_or_path TOKENIZER_NAME_OR_PATH, --tokenizer TOKENIZER_NAME_OR_PATH
The tokenizer name or path, e.g., cl100k_base, meta-llama/Llama-2-7b-hf, ./mymodel
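For example, combining several of the model arguments above (the local checkpoint `./mymodel` is a placeholder):
```bash
python inference.py -m ./mymodel --tokenizer meta-llama/Llama-2-7b-hf --model_type base --device_map auto -d hellaswag
```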
Generation arguments and quantization options:
--max_tokens MAX_TOKENS
The maximum number of tokens for output generation
(default: None)
--max_length MAX_LENGTH
The maximum number of tokens of model input sequence
(default: None)
--temperature TEMPERATURE
The temperature for models (default: None)
--top_p TOP_P The model considers only the tokens comprising the top_p probability mass. (default: None)
--top_k TOP_K The model considers only the top_k tokens with the highest probability. (default: None)
--frequency_penalty FREQUENCY_PENALTY
Positive values penalize new tokens based on their
existing frequency in the generated text, vice versa.
(default: None)
--repetition_penalty REPETITION_PENALTY
Values>1 penalize new tokens based on their existing
frequency in the prompt and generated text, vice
versa. (default: None)
--presence_penalty PRESENCE_PENALTY
Positive values penalize new tokens based on whether
they appear in the generated text, vice versa.
(default: None)
--stop STOP [STOP ...]
List of strings that stop the generation when they are
generated. E.g. --stop 'stop' 'sequence' (default:
None)
--no_repeat_ngram_size NO_REPEAT_NGRAM_SIZE
All ngrams of that size can only occur once. (default:
None)
--best_of BEST_OF, --num_beams BEST_OF
The beam size for beam search (default: None)
--length_penalty LENGTH_PENALTY
Positive values encourage longer sequences, vice
versa. Used in beam search. (default: None)
--early_stopping [EARLY_STOPPING]
Controls the stopping condition for beam search, e.g., whether to stop as soon as `best_of` complete candidates are found. (default: None)
--system_prompt SYSTEM_PROMPT, -sys SYSTEM_PROMPT
The system prompt for chat-based models
--chat_template CHAT_TEMPLATE
The chat template for huggingface chat-based models
--bnb_config BNB_CONFIG
JSON string for BitsAndBytesConfig parameters.
--load_in_8bit [LOAD_IN_8BIT]
Whether to use bnb's 8-bit quantization to load the
model. (default: False)
--load_in_4bit [LOAD_IN_4BIT]
Whether to use bnb's 4-bit quantization to load the
model. (default: False)
--gptq [GPTQ] Whether the model is a gptq quantized model. (default:
False)
--vllm_gpu_memory_utilization VLLM_GPU_MEMORY_UTILIZATION
The maximum gpu memory utilization of vllm. (default:
None)
--torch_dtype {float16,bfloat16,float32}
The torch dtype for model input and output
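For example, a low-temperature sampling setup with 4-bit loading (the JSON keys below follow `transformers.BitsAndBytesConfig`; adjust them to your setup):
```bash
python inference.py -m meta-llama/Llama-2-7b-hf -d gsm8k \
    --temperature 0.2 --top_p 0.95 --max_tokens 512 \
    --bnb_config '{"load_in_4bit": true, "bnb_4bit_compute_dtype": "float16"}'
```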
Configure dataset parameters such as the dataset identifiers, batch size, example strategies, chain-of-thought (CoT) strategies, and other relevant settings.
You can evaluate multiple datasets sequentially in a single run when they require similar evaluation parameters. Both `evaluation_set` and `example_set` support the Hugging Face string API for defining dataset slices.
--dataset_names DATASET [DATASET ...], -d DATASET [DATASET ...], --dataset DATASET [DATASET ...]
Space-separated dataset names. If only one dataset is specified, it can be followed by
subset names or category names. Format: 'dataset1 dataset2', 'dataset:subset1,subset2', or
'dataset:[cat1],[cat2]', e.g., 'copa race', 'race:high', 'wmt16:en-ro,en-fr', or
'mmlu:[stem],[humanities]'. (default: None)
--batch_size BATCH_SIZE, -bsz BATCH_SIZE, -b BATCH_SIZE
The evaluation batch size. Specify an integer (e.g., '10') to use a fixed batch size for
all iterations. Alternatively, append ':auto' (e.g., '10:auto') to start with the specified
batch size and automatically adjust it in subsequent iterations to maintain constant CUDA
memory usage (default: 1)
--dataset_path DATASET_PATH
The path of dataset if loading from local. Supports
repository cloned from huggingface, dataset saved by
`save_to_disk`, or a template string e.g.
'mmlu/{split}/{subset}_{split}.csv'. (default: None)
--evaluation_set EVALUATION_SET
The set name for evaluation, supporting slice, e.g.,
validation, test, validation[:10] (default: None)
--example_set EXAMPLE_SET
The set name for demonstration, supporting slice,
e.g., train, dev, train[:10] (default: None)
--instance_format INSTANCE_FORMAT, -fmt INSTANCE_FORMAT
The format to format the `source` and `target` for
each instance (default: {source}{target})
--num_shots NUM_SHOTS, -shots NUM_SHOTS
The few-shot number for demonstration (default: 0)
--max_example_tokens MAX_EXAMPLE_TOKENS
The maximum token number of demonstration (default:
1024)
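For example, evaluating two MMLU categories with an auto-adjusted batch size and sliced evaluation and example sets (the slice sizes are illustrative):
```bash
python inference.py -m meta-llama/Llama-2-7b-hf -d 'mmlu:[stem],[humanities]' -bsz 16:auto \
    --evaluation_set "val[:100]" --example_set dev -shots 5
```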
Different types of datasets support different evaluation methods. The following table lists the supported evaluation methods and prompting methods for each dataset type.
| Dataset | Evaluation Method | Prompt |
|---|---|---|
| Generation | Generate based on the source text | |
| MultipleChoice | Calculate perplexity of the option text based on the source text | |
| | Get the probability of each option label | |
--ranking_type {ppl,prob,ppl_no_option}
The evaluation and prompting method for ranking task
(default: ppl_no_option)
--sample_num SAMPLE_NUM, --majority SAMPLE_NUM, --consistency SAMPLE_NUM
The sampling number for self-consistency (default: 1)
--kate [KATE], -kate [KATE]
Whether to use KATE as an ICL strategy (default:
False)
--globale [GLOBALE], -globale [GLOBALE]
Whether to use GlobalE as an ICL strategy (default:
False)
--ape [APE], -ape [APE]
Whether to use APE as an ICL strategy (default: False)
--cot {base,least_to_most,pal}
The CoT prompting method, e.g., 'base', 'least_to_most', 'pal'. Only available for some specific datasets. (default: None)
--perspective_api_key PERSPECTIVE_API_KEY
The Perspective API key for toxicity metrics (default:
None)
--pass_at_k PASS_AT_K
The k value for pass@k metric (default: None)
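For example, scoring a ranking task by option-label probability, or running GSM8K with least-to-most prompting and self-consistency (whether a dataset supports a given CoT method depends on its implementation):
```bash
python inference.py -m meta-llama/Llama-2-7b-hf -d hellaswag --ranking_type prob
python inference.py -m microsoft/phi-2 -d gsm8k -shots 8 --cot least_to_most --sample_num 10
```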
Specify the random seed, logging directory, evaluation results directory, and other arguments.
--seed SEED The random seed (default: 2023)
--logging_dir LOGGING_DIR
The logging directory (default: logs)
--log_level {debug,info,warning,error,critical}
Logger level to use on the main node. Possible choices
are the log levels as strings: 'debug', 'info',
'warning', 'error' and 'critical' (default: info)
--evaluation_results_dir EVALUATION_RESULTS_DIR
The directory to save evaluation results, which
includes source and target texts, generated texts, and
the references. (default: evaluation_results)
--log_results [LOG_RESULTS]
Whether to log the evaluation results. Note that the generated JSON file will be roughly the same size as the evaluation dataset itself
--no_log_results Disable logging of the evaluation results
--dry_run [DRY_RUN] Test the evaluation pipeline without actually calling
the model. (default: False)
--proxy_port PROXY_PORT
The port of the proxy (default: None)
--dataset_threading [DATASET_THREADING]
Load the dataset with threading
--no_dataset_threading
Disable threaded dataset loading
--dataloader_workers DATALOADER_WORKERS
The number of workers for dataloader
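For example, a dry run that checks the evaluation pipeline without calling the model, with verbose logging into custom directories:
```bash
python inference.py -m davinci-002 -d copa --dry_run --log_level debug \
    --logging_dir logs --evaluation_results_dir evaluation_results --seed 2023
```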
| Backend | Entrypoint | Example Model | Supported Methods |
|---|---|---|---|
| Huggingface | `AutoModelForCausalLM` | `Llama-2-7b-hf` | `generation`, `get_ppl`, `get_prob` |
| OpenAI | Chat Completion Models | `gpt-4-0125-preview`, `gpt-3.5-turbo` | `generation`, `get_prob` (adapted by generation) |
| | Completion Models (Legacy) | `davinci-002` | `generation`, `get_ppl`, `get_prob` |
| Qianfan | Chat Completion Models | `ernie-speed-8k` | `generation`, `get_prob` (adapted by generation) |
| Dashscope | Generation | `qwen-turbo` | `generation`, `get_prob` (adapted by generation) |
| Anthropic | Chat Completion Models | `claude-3-haiku-20240307` | `generation`, `get_prob` (adapted by generation) |
| vLLM | LLM | `Llama-2-7b-hf` | `generation`, `get_ppl`, `get_prob` |
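For API-based backends, the backend is inferred from the model name (via the built-in enum described above), so you usually only need to pass the matching API key. For example (the environment variable is an assumption of your shell setup):
```bash
python inference.py -m claude-3-haiku-20240307 -d copa --anthropic_api_key "$ANTHROPIC_API_KEY"
```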
By inheriting the `Model` class, you can customize support for more models. You can implement the `generation`, `get_ppl`, and `get_prob` methods to support different models. For example, you can implement the `generation` method for a new model as follows:
```python
from typing import Any, List

class NewModel(Model):

    def call_model(self, batched_inputs: List[str]) -> List[Any]:
        return ...  # call to the model, e.g., self.model.generate(...)

    def to_text(self, result: Any) -> str:
        return ...  # convert a raw result to text, e.g., result["text"]

    def generation(self, batched_inputs: List[str]) -> List[str]:
        results = self.call_model(batched_inputs)
        results = [self.to_text(result) for result in results]
        return results
```
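A `get_ppl` method can be sketched in a similar way. The snippet below is a minimal illustration for a local transformers-style model; it assumes `self.model` and `self.tokenizer` attributes and that each batched input is a `(source, target)` pair, and the exact return convention expected by LLMBox may differ:
```python
import torch
import torch.nn.functional as F

def get_ppl(self, batched_inputs):
    # NOTE: the (nll, target_length) return format below is an assumption for illustration.
    results = []
    for source, target in batched_inputs:
        src_ids = self.tokenizer(source, return_tensors="pt").input_ids
        full_ids = self.tokenizer(source + target, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = self.model(full_ids).logits
        tgt_len = full_ids.shape[1] - src_ids.shape[1]
        # logits at position i predict token i + 1, so shift by one to score the target tokens
        tgt_logits = logits[0, -tgt_len - 1:-1]
        tgt_ids = full_ids[0, -tgt_len:]
        nll = F.cross_entropy(tgt_logits, tgt_ids, reduction="sum")
        results.append((nll.item(), tgt_len))
    return results
```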
Then, you should register your model in the `load` file.
We currently support 53 commonly used datasets for LLMs. Each dataset may include multiple subsets, or be a subset of a collection.
Load datasets from the Hugging Face Hub:
```bash
python inference.py -d copa
python inference.py -d race:middle,high
python inference.py -d race:middle,high --evaluation_set "test[:10]" --example_set "train"
```
| Dataset | Subsets / Collections | Evaluation Type | CoT | Notes |
|---|---|---|---|---|
| agieval (alias of `agieval_single_choice` and `agieval_cot`) | English: `sat-en`, `sat-math`, `lsat-ar`, `lsat-lr`, `lsat-rc`, `logiqa-en`, `aqua-rat`, `math` | MultipleChoice | | |
| | `gaokao-chinese`, `gaokao-geography`, `gaokao-history`, `gaokao-biology`, `gaokao-chemistry`, `gaokao-english`, `logiqa-zh` | | | |
| | `jec-qa-kd`, `jec-qa-ca`, `math`, `gaokao-physics`, `gaokao-mathcloze`, `gaokao-mathqa` | Generation | ✅ | |
| alpaca_eval | / | Generation | | Single GPTEval |
| anli | `Round2` (default) | MultipleChoice | | |
| arc | `ARC-Easy`, `ARC-Challenge` | MultipleChoice | | Normalization |
| bbh | `boolean_expressions`, ... | Generation | ✅ | |
| boolq | super_glue | MultipleChoice | | |
| cb | super_glue | MultipleChoice | | |
| ceval | stem: `advanced_mathematics`, `college_chemistry`, ... | MultipleChoice | | |
| | social science: `business_administration`, `college_economics`, ... | | | |
| | humanities: `art_studies`, `chinese_language_and_literature`, ... | | | |
| | other: `accountant`, `basic_medicine`, ... | | | |
| cmmlu | stem: `anatomy`, `astronomy`, ... | MultipleChoice | | |
| | social science: `ancient_chinese`, `business_ethics`, ... | | | |
| | humanities: `arts`, `chinese_history`, ... | | | |
| | other: `agronomy`, `chinese_driving_rule`, ... | | | |
| cnn_dailymail | `3.0.0` (default), ... | Generation | | |
| color_objects | bigbench (reasoning_about_colored_objects) | Generation | | |
| commonsenseqa | / | MultipleChoice | | |
| copa | super_glue | MultipleChoice | | |
| coqa | / | Generation | | Download: train, dev |
| crows_pairs | / | MultipleChoice | | |
| drop | / | Generation | | |
| gaokao | Chinese: `2010-2022_Chinese_Modern_Lit`, `2010-2022_Chinese_Lang_and_Usage_MCQs` | Generation | | Metric: Exam scoring |
| | English: `2010-2022_English_Reading_Comp`, `2010-2022_English_Fill_in_Blanks`, ... | | | |
| | `2010-2022_Math_II_MCQs`, `2010-2022_Math_I_MCQs`, ... | | | |
| gsm8k | `main` (default), `socratic` | Generation | ✅ | Code exec |
| halueval | `dialogue_samples`, `qa_samples`, `summarization_samples` | Generation | | |
| hellaswag | / | MultipleChoice | | |
| humaneval | / | Generation | | Pass@K |
| ifeval | / | Generation | | |
| lambada | `default` (default), `de`, ... (source: EleutherAI/lambada_openai) | Generation | | |
| math | / | Generation | | |
| mbpp | `full` (default), `sanitized` | Generation | | Pass@K |
| mmlu | stem: `abstract_algebra`, `astronomy`, ... | MultipleChoice | | |
| | social_sciences: `econometrics`, `high_school_geography`, ... | | | |
| | humanities: `formal_logic`, `high_school_european_history`, ... | | | |
| | other: `anatomy`, `business_ethics`, ... | | | |
| mt_bench | / | Generation | | Multi-turn GPTEval |
| nq | / | Generation | | |
| openbookqa | `main` (default), `additional` | MultipleChoice | | Normalization |
| penguins_in_a_table | bigbench | MultipleChoice | | |
| piqa | / | MultipleChoice | | |
| quac | / | Generation | | |
| race | `high`, `middle` | MultipleChoice | | Normalization |
| real_toxicity_prompts | / | Generation | | Perspective Toxicity |
| rte | super_glue | MultipleChoice | | |
| siqa | / | MultipleChoice | | |
| squad, squad_v2 | / | Generation | | |
| story_cloze | `2016` (default), `2018` | MultipleChoice | | Manually download |
| tldr | / | Generation | | |
| triviaqa | `rc.wikipedia.nocontext` (default), `rc`, `rc.nocontext`, ... | Generation | | |
| truthfulqa_mc | `multiple_choice` (default), `generation` (not supported) | MultipleChoice | | |
| vicuna_bench | / | Generation | | GPTEval |
| webq | / | Generation | | |
| wic | super_glue | MultipleChoice | | |
| winogender | `main`, `gotcha` | MultipleChoice | | Group by gender |
| winograd | `wsc273` (default), `wsc285` | MultipleChoice | | |
| winogrande | `winogrande_debiased` (default), ... | MultipleChoice | | |
| wmt21, wmt19, ... | `en-ro`, `ro-en`, ... | Generation | | |
| wsc | super_glue | MultipleChoice | | |
| xsum | / | Generation | | |
By default, we load all the subsets of a dataset:
```bash
python inference.py -m model -d arc
# equivalent: arc:ARC-Easy,ARC-Challenge
```
Unless a default subset is defined:
```bash
python inference.py -m model -d cnn_dailymail
# equivalent: cnn_dailymail:3.0.0
```
If `dataset_path` is not None, the dataset will be loaded from the given local path:
```bash
# from a cloned directory of the huggingface dataset repository:
python inference.py -d copa --dataset_path /path/to/copa

# from a local (nested) directory saved by `dataset.save_to_disk`:
python inference.py -d race --dataset_path /path/to/race/middle
python inference.py -d race:middle --dataset_path /path/to/race
python inference.py -d race:middle --dataset_path /path/to/race/middle
python inference.py -d race:middle,high --dataset_path /path/to/race
```
`dataset_path` can also accept a dataset file or a directory containing these files (supports json, jsonl, csv, and txt):
```bash
# load one split from one subset only
python inference.py -d gsm8k --dataset_path /path/to/gsm.jsonl
python inference.py -d race --dataset_path /path/to/race/middle/train.json

# load test and train splits from the middle subset (a directory containing `/path/to/race/middle/train.json` and `/path/to/race/middle/test.json`)
python inference.py -d race --dataset_path /path/to/race/middle --evaluation_set "test[:10]" --example_set "train"

# load test and train splits from the middle and high subsets (a nested directory)
python inference.py -d race:middle,high --dataset_path /path/to/race --evaluation_set "test[:10]" --example_set "train"

# load test and train splits from the middle and high subsets with a filename pattern
python inference.py -d race:middle,high --evaluation_set "test[:10]" --example_set "train" --dataset_path "/pattern/of/race_{subset}_{split}.json"
python inference.py -d mmlu --evaluation_set val --example_set dev --dataset_path "/pattern/of/mmlu/{split}/{subset}_{split}.csv"
```
Also feel free to override the `load_raw_dataset` function if you want to load the dataset in a different way:
```python
from .utils import load_raw_dataset_from_file, get_raw_dataset_loader

class MyDataset(Dataset):

    def load_raw_dataset(self, dataset_path, subset_name, evaluation_set, example_set):
        self.evaluation_data = get_raw_dataset_loader(...)("test")
        self.example_data = load_raw_dataset_from_file("examples.json")
```
We provide two types of datasets: `GenerationDataset` and `MultipleChoiceDataset`. You can also customize support for a new dataset type by inheriting the `Dataset` class. For example, you can implement a new `GenerationDataset` as follows:
```python
class NewDataset(GenerationDataset):

    instruction = "Answer the following question."
    metrics = [Accuracy()]
    evaluation_set = "test"
    example_set = "dev"
    load_args = ("huggingface/path", "subset")
    extra_model_args = dict(temperature=0)
    category_subsets = {"Group": ["subset1", "subset2"]}

    def format_instance(self, instance):
        src, tgt = func(instance, self.example_data)
        return dict(source=src, target=tgt)

    def reference(self):
        return [i["answer"] for i in self.evaluation_data]
```
You can load the raw dataset by one of the following methods:

- Set `load_args`: the arguments for `datasets.load_dataset`.
- Or overwrite the `load_raw_dataset` function: set `self.evaluation_data` and `self.example_data`, as shown below.
```python
from .utils import load_raw_dataset_from_file, get_raw_dataset_loader

class MyDataset(Dataset):

    def load_raw_dataset(self, dataset_path, subset_name, evaluation_set, example_set):
        self.evaluation_data = get_raw_dataset_loader(...)("test")
        self.example_data = load_raw_dataset_from_file("examples.json")
```
Then, format the instance by implementing the `format_instance` method. The instance should be a dictionary with the following keys:

- `source` (`Union[str, List[str]]`): The source text. If this is a list, `source_idx` is required.
- `source_idx` (`int`, optional): The index of the correct source (for multiple-context ranking datasets like winogrande).
- `source_postfix` (`str`, optional): The postfix of the source text. This will be appended to the source text after the options when `ranking_with_options` is True.
- `target` (`str`, optional): The target text. Either `target` or `target_idx` should be provided.
- `target_idx` (`int`, optional): The index of the target in the options (for ranking). This will generate the `target` text in `_format_instance`.
- `options` (`List[str]`, optional): The options for ranking.
MultipleChoiceDataset:
```python
def format_instance(self, instance):
    return dict(
        source=self.source_prefix + instance["question"].strip(),
        source_postfix="\nAnswer:",
        target_idx=instance["answer"],
        options=options,
    )
```
MultipleChoiceDataset (multiple-context), e.g., winogrande:
```python
def format_instance(self, instance):
    return dict(
        source=contexts,
        source_idx=int(instance["answer"]) - 1,
        target=completion,
    )
```
GenerationDataset:
```python
def format_instance(self, instance):
    return dict(
        source=instance["question"],
        target=instance["answer"],
    )
```
See the `Dataset` class for more details.
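Putting the pieces together, a complete custom multiple-choice dataset might look like the sketch below. The dataset repository, field names, and label mapping are hypothetical; adapt them to your actual raw data:
```python
class MyQADataset(MultipleChoiceDataset):

    instruction = "Answer the following multiple-choice question."
    metrics = [Accuracy()]
    evaluation_set = "validation"
    example_set = "train"
    load_args = ("my_org/my_qa_dataset",)  # hypothetical Hugging Face repository

    def format_instance(self, instance):
        # assumed raw format: {"question": str, "choices": List[str], "label": int}
        return dict(
            source="Question: " + instance["question"].strip(),
            source_postfix="\nAnswer:",
            target_idx=instance["label"],
            options=instance["choices"],
        )

    def reference(self):
        return [instance["label"] for instance in self.evaluation_data]
```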