Evaluating davinci-002 on HellaSwag, with prefix caching and flash attention enabled by default:
```bash
python inference.py -m davinci-002 -d hellaswag
```
Evaluating Gemma on MMLU:
```bash
python inference.py -m gemma-7b -d mmlu -shots 5
```
This reports results on the 57 subsets of MMLU, along with the macro-average performance over its four categories.
Evaluating Phi-2 on GSM8k using self-consistency and 4-bit quantization:
```bash
python inference.py -m microsoft/phi-2 -d gsm8k -shots 8 --sample_num 100 --load_in_4bit
```
Evaluating LLaMA-2 (7B) on CMMLU and C-Eval as an instruction model using vLLM:
```bash
CUDA_VISIBLE_DEVICES=0 python inference.py -m llama-2-7b-hf -d cmmlu ceval --vllm True --model_type instruction
```
We use all CUDA devices by default. You can specify the devices with `CUDA_VISIBLE_DEVICES`.
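For example, to run the MMLU evaluation above on the first two GPUs only:
```bash
CUDA_VISIBLE_DEVICES=0,1 python inference.py -m gemma-7b -d mmlu -shots 5
```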
Define the model parameters, efficient evaluation settings, generation arguments, quantization, and additional configuration options.
We provide an enumeration (`enum`) of models corresponding to each `model_backend`. If a model is not listed within this enumeration, `--model_backend` should be specified directly.
--model_name_or_path MODEL_NAME_OR_PATH, --model MODEL_NAME_OR_PATH, -m MODEL_NAME_OR_PATH
The model name or path, e.g., davinci-002, meta-llama/Llama-2-7b-hf, ./mymodel (default: None)
--model_type {base,instruction}
The type of the model, which can be chosen from `base`
or `instruction`. (default: base)
--model_backend {anthropic,dashscope,huggingface,openai,qianfan,vllm}
The model backend
--device_map DEVICE_MAP
The device map for model and data (default: auto)
--vllm [VLLM] Whether to use vllm (default: False)
--flash_attention [FLASH_ATTENTION]
Whether to use flash attention (default: True)
--no_flash_attention Disable flash attention (default: False)
--openai_api_key OPENAI_API_KEY
The OpenAI API key (default: None)
--anthropic_api_key ANTHROPIC_API_KEY
The Anthropic API key (default: None)
--dashscope_api_key DASHSCOPE_API_KEY
The Dashscope API key (default: None)
--qianfan_access_key QIANFAN_ACCESS_KEY
The Qianfan access key (default: None)
--qianfan_secret_key QIANFAN_SECRET_KEY
The Qianfan secret key (default: None)
--tokenizer_name_or_path TOKENIZER_NAME_OR_PATH, --tokenizer TOKENIZER_NAME_OR_PATH
The tokenizer name or path, e.g., cl100k_base, meta-llama/Llama-2-7b-hf, ./mymodel
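For example, combining several of the model arguments above (the local checkpoint `./mymodel` is a placeholder):
```bash
python inference.py -m ./mymodel --tokenizer meta-llama/Llama-2-7b-hf --model_type base --device_map auto -d hellaswag
```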
Generation arguments and quantization options:
--max_tokens MAX_TOKENS
The maximum number of tokens for output generation
(default: None)
--max_length MAX_LENGTH
The maximum number of tokens of model input sequence
(default: None)
--temperature TEMPERATURE
The temperature for models (default: None)
--top_p TOP_P The model considers only the tokens comprising the top_p probability mass. (default: None)
--top_k TOP_K The model considers only the top_k tokens with the highest probability. (default: None)
--frequency_penalty FREQUENCY_PENALTY
Positive values penalize new tokens based on their
existing frequency in the generated text, vice versa.
(default: None)
--repetition_penalty REPETITION_PENALTY
Values>1 penalize new tokens based on their existing
frequency in the prompt and generated text, vice
versa. (default: None)
--presence_penalty PRESENCE_PENALTY
Positive values penalize new tokens based on whether
they appear in the generated text, vice versa.
(default: None)
--stop STOP [STOP ...]
List of strings that stop the generation when they are
generated. E.g. --stop 'stop' 'sequence' (default:
None)
--no_repeat_ngram_size NO_REPEAT_NGRAM_SIZE
All ngrams of that size can only occur once. (default:
None)
--best_of BEST_OF, --num_beams BEST_OF
The beam size for beam search (default: None)
--length_penalty LENGTH_PENALTY
Positive values encourage longer sequences, vice
versa. Used in beam search. (default: None)
--early_stopping [EARLY_STOPPING]
Controls the stopping condition for beam search, e.g., whether to stop as soon as `best_of` complete candidates are found. (default: None)
--system_prompt SYSTEM_PROMPT, -sys SYSTEM_PROMPT
The system prompt for chat-based models
--chat_template CHAT_TEMPLATE
The chat template for huggingface chat-based models
--bnb_config BNB_CONFIG
JSON string for BitsAndBytesConfig parameters.
--load_in_8bit [LOAD_IN_8BIT]
Whether to use bnb's 8-bit quantization to load the
model. (default: False)
--load_in_4bit [LOAD_IN_4BIT]
Whether to use bnb's 4-bit quantization to load the
model. (default: False)
--gptq [GPTQ] Whether the model is a gptq quantized model. (default:
False)
--vllm_gpu_memory_utilization VLLM_GPU_MEMORY_UTILIZATION
The maximum gpu memory utilization of vllm. (default:
None)
--torch_dtype {float16,bfloat16,float32}
The torch dtype for model input and output
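For example, a low-temperature sampling setup with 4-bit loading (the JSON keys below follow `transformers.BitsAndBytesConfig`; adjust them to your setup):
```bash
python inference.py -m meta-llama/Llama-2-7b-hf -d gsm8k \
    --temperature 0.2 --top_p 0.95 --max_tokens 512 \
    --bnb_config '{"load_in_4bit": true, "bnb_4bit_compute_dtype": "float16"}'
```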
Configure dataset parameters such as the dataset identifiers, batch size, example strategies, chain-of-thought (CoT) strategies, and other relevant settings.
You can evaluate multiple datasets sequentially in a single run when they require similar evaluation parameters. Both `evaluation_set` and `example_set` support the Hugging Face string API for defining dataset slices.
--dataset_names DATASET [DATASET ...], -d DATASET [DATASET ...], --dataset DATASET [DATASET ...]
Space-separated dataset names. If only one dataset is specified, it can be followed by
subset names or category names. Format: 'dataset1 dataset2', 'dataset:subset1,subset2', or
'dataset:[cat1],[cat2]', e.g., 'copa race', 'race:high', 'wmt16:en-ro,en-fr', or
'mmlu:[stem],[humanities]'. (default: None)
--batch_size BATCH_SIZE, -bsz BATCH_SIZE, -b BATCH_SIZE
The evaluation batch size. Specify an integer (e.g., '10') to use a fixed batch size for
all iterations. Alternatively, append ':auto' (e.g., '10:auto') to start with the specified
batch size and automatically adjust it in subsequent iterations to maintain constant CUDA
memory usage (default: 1)
--dataset_path DATASET_PATH
The path of dataset if loading from local. Supports
repository cloned from huggingface, dataset saved by
`save_to_disk`, or a template string e.g.
'mmlu/{split}/{subset}_{split}.csv'. (default: None)
--evaluation_set EVALUATION_SET
The set name for evaluation, supporting slice, e.g.,
validation, test, validation[:10] (default: None)
--example_set EXAMPLE_SET
The set name for demonstration, supporting slice,
e.g., train, dev, train[:10] (default: None)
--instance_format INSTANCE_FORMAT, -fmt INSTANCE_FORMAT
The format to format the `source` and `target` for
each instance (default: {source}{target})
--num_shots NUM_SHOTS, -shots NUM_SHOTS
The few-shot number for demonstration (default: 0)
--max_example_tokens MAX_EXAMPLE_TOKENS
The maximum token number of demonstration (default:
1024)
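For example, evaluating two MMLU categories with an auto-adjusted batch size and sliced evaluation and example sets (the slice sizes are illustrative):
```bash
python inference.py -m meta-llama/Llama-2-7b-hf -d 'mmlu:[stem],[humanities]' -bsz 16:auto \
    --evaluation_set "val[:100]" --example_set dev -shots 5
```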
Different types of datasets support different evaluation methods. The following table lists the supported evaluation methods and prompting methods for each dataset type.
| Dataset | Evaluation Method | Prompt |
|---|---|---|
| Generation | Generate based on the source text | |
| MultipleChoice | Calculate perplexity of the option text based on the source text | |
| | Get the probability of each option label | |
--ranking_type {ppl,prob,ppl_no_option}
The evaluation and prompting method for ranking task
(default: ppl_no_option)
--sample_num SAMPLE_NUM, --majority SAMPLE_NUM, --consistency SAMPLE_NUM
The sampling number for self-consistency (default: 1)
--kate [KATE], -kate [KATE]
Whether to use KATE as an ICL strategy (default:
False)
--globale [GLOBALE], -globale [GLOBALE]
Whether to use GlobalE as an ICL strategy (default:
False)
--ape [APE], -ape [APE]
Whether to use APE as an ICL strategy (default: False)
--cot {base,least_to_most,pal}
The CoT prompting method, e.g., 'base', 'least_to_most', 'pal'. Only available for some specific datasets. (default: None)
--perspective_api_key PERSPECTIVE_API_KEY
The Perspective API key for toxicity metrics (default:
None)
--pass_at_k PASS_AT_K
The k value for pass@k metric (default: None)
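For example, scoring a ranking task by option-label probability, or running GSM8K with least-to-most prompting and self-consistency (whether a dataset supports a given CoT method depends on its implementation):
```bash
python inference.py -m meta-llama/Llama-2-7b-hf -d hellaswag --ranking_type prob
python inference.py -m microsoft/phi-2 -d gsm8k -shots 8 --cot least_to_most --sample_num 10
```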
Specify the random seed, logging directory, evaluation results directory, and other arguments.
--seed SEED The random seed (default: 2023)
--logging_dir LOGGING_DIR
The logging directory (default: logs)
--log_level {debug,info,warning,error,critical}
Logger level to use on the main node. Possible choices
are the log levels as strings: 'debug', 'info',
'warning', 'error' and 'critical' (default: info)
--evaluation_results_dir EVALUATION_RESULTS_DIR
The directory to save evaluation results, which
includes source and target texts, generated texts, and
the references. (default: evaluation_results)
--log_results [LOG_RESULTS]
Whether to log the evaluation results. Note that the generated JSON file will be roughly the same size as the evaluation dataset itself
--no_log_results Disable logging of the evaluation results
--dry_run [DRY_RUN] Test the evaluation pipeline without actually calling
the model. (default: False)
--proxy_port PROXY_PORT
The port of the proxy (default: None)
--dataset_threading [DATASET_THREADING]
Load the dataset with threading
--no_dataset_threading
Disable threaded dataset loading
--dataloader_workers DATALOADER_WORKERS
The number of workers for dataloader
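For example, a dry run that checks the evaluation pipeline without calling the model, with verbose logging into custom directories:
```bash
python inference.py -m davinci-002 -d copa --dry_run --log_level debug \
    --logging_dir logs --evaluation_results_dir evaluation_results --seed 2023
```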
| Backend | Entrypoint | Example Model | Supported Methods |
|---|---|---|---|
| Huggingface | `AutoModelForCausalLM` | `Llama-2-7b-hf` | `generation`, `get_ppl`, `get_prob` |
| OpenAI | Chat Completion Models | `gpt-4-0125-preview`, `gpt-3.5-turbo` | `generation`, `get_prob` (adapted by generation) |
| | Completion Models (Legacy) | `davinci-002` | `generation`, `get_ppl`, `get_prob` |
| Qianfan | Chat Completion Models | `ernie-speed-8k` | `generation`, `get_prob` (adapted by generation) |
| Dashscope | Generation | `qwen-turbo` | `generation`, `get_prob` (adapted by generation) |
| Anthropic | Chat Completion Models | `claude-3-haiku-20240307` | `generation`, `get_prob` (adapted by generation) |
| vLLM | LLM | `Llama-2-7b-hf` | `generation`, `get_ppl`, `get_prob` |
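For API-based backends, the backend is inferred from the model name (via the built-in enum described above), so you usually only need to pass the matching API key. For example (the environment variable is an assumption of your shell setup):
```bash
python inference.py -m claude-3-haiku-20240307 -d copa --anthropic_api_key "$ANTHROPIC_API_KEY"
```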
By inheriting the `Model` class, you can customize support for more models. You can implement the `generation`, `get_ppl`, and `get_prob` methods to support different models. For example, you can implement the `generation` method for a new model as follows:
```python
from typing import Any, List

class NewModel(Model):

    def call_model(self, batched_inputs: List[str]) -> List[Any]:
        return ...  # call to the model, e.g., self.model.generate(...)

    def to_text(self, result: Any) -> str:
        return ...  # convert a raw result to text, e.g., result["text"]

    def generation(self, batched_inputs: List[str]) -> List[str]:
        results = self.call_model(batched_inputs)
        results = [self.to_text(result) for result in results]
        return results
```
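A `get_ppl` method can be sketched in a similar way. The snippet below is a minimal illustration for a local transformers-style model; it assumes `self.model` and `self.tokenizer` attributes and that each batched input is a `(source, target)` pair, and the exact return convention expected by LLMBox may differ:
```python
import torch
import torch.nn.functional as F

def get_ppl(self, batched_inputs):
    # NOTE: the (nll, target_length) return format below is an assumption for illustration.
    results = []
    for source, target in batched_inputs:
        src_ids = self.tokenizer(source, return_tensors="pt").input_ids
        full_ids = self.tokenizer(source + target, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = self.model(full_ids).logits
        tgt_len = full_ids.shape[1] - src_ids.shape[1]
        # logits at position i predict token i + 1, so shift by one to score the target tokens
        tgt_logits = logits[0, -tgt_len - 1:-1]
        tgt_ids = full_ids[0, -tgt_len:]
        nll = F.cross_entropy(tgt_logits, tgt_ids, reduction="sum")
        results.append((nll.item(), tgt_len))
    return results
```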
Then, you should register your model in the `load` file.
We currently support 53 commonly used datasets for LLMs. Each dataset may include multiple subsets, or be a subset of a collection.
Load datasets from the Hugging Face Hub:
```bash
python inference.py -d copa
python inference.py -d race:middle,high
python inference.py -d race:middle,high --evaluation_set "test[:10]" --example_set "train"
```
| Dataset | Subsets / Collections | Evaluation Type | CoT | Notes |
|---|---|---|---|---|
| agieval (alias of `agieval_single_choice` and `agieval_cot`) | English: `sat-en`, `sat-math`, `lsat-ar`, `lsat-lr`, `lsat-rc`, `logiqa-en`, `aqua-rat`, `math` | MultipleChoice | | |
| | `gaokao-chinese`, `gaokao-geography`, `gaokao-history`, `gaokao-biology`, `gaokao-chemistry`, `gaokao-english`, `logiqa-zh` | | | |
| | `jec-qa-kd`, `jec-qa-ca`, `math`, `gaokao-physics`, `gaokao-mathcloze`, `gaokao-mathqa` | Generation | ✅ | |
| alpaca_eval | / | Generation | | Single GPTEval |
| anli | `Round2` (default) | MultipleChoice | | |
| arc | `ARC-Easy`, `ARC-Challenge` | MultipleChoice | | Normalization |
| bbh | `boolean_expressions`, ... | Generation | ✅ | |
| boolq | super_glue | MultipleChoice | | |
| cb | super_glue | MultipleChoice | | |
| ceval | stem: `advanced_mathematics`, `college_chemistry`, ... | MultipleChoice | | |
| | social science: `business_administration`, `college_economics`, ... | | | |
| | humanities: `art_studies`, `chinese_language_and_literature`, ... | | | |
| | other: `accountant`, `basic_medicine`, ... | | | |
| cmmlu | stem: `anatomy`, `astronomy`, ... | MultipleChoice | | |
| | social science: `ancient_chinese`, `business_ethics`, ... | | | |
| | humanities: `arts`, `chinese_history`, ... | | | |
| | other: `agronomy`, `chinese_driving_rule`, ... | | | |
| cnn_dailymail | `3.0.0` (default), ... | Generation | | |
| color_objects | bigbench (reasoning_about_colored_objects) | Generation | | |
| commonsenseqa | / | MultipleChoice | | |
| copa | super_glue | MultipleChoice | | |
| coqa | / | Generation | | Download: train, dev |
| crows_pairs | / | MultipleChoice | | |
| drop | / | Generation | | |
| gaokao | Chinese: `2010-2022_Chinese_Modern_Lit`, `2010-2022_Chinese_Lang_and_Usage_MCQs` | Generation | | Metric: Exam scoring |
| | English: `2010-2022_English_Reading_Comp`, `2010-2022_English_Fill_in_Blanks`, ... | | | |
| | `2010-2022_Math_II_MCQs`, `2010-2022_Math_I_MCQs`, ... | | | |
| gsm8k | `main` (default), `socratic` | Generation | ✅ | Code exec |
| halueval | `dialogue_samples`, `qa_samples`, `summarization_samples` | Generation | | |
| hellaswag | / | MultipleChoice | | |
| humaneval | / | Generation | | Pass@K |
| ifeval | / | Generation | | |
| lambada | `default` (default), `de`, ... (source: EleutherAI/lambada_openai) | Generation | | |
| math | / | Generation | | |
| mbpp | `full` (default), `sanitized` | Generation | | Pass@K |
| mmlu | stem: `abstract_algebra`, `astronomy`, ... | MultipleChoice | | |
| | social_sciences: `econometrics`, `high_school_geography`, ... | | | |
| | humanities: `formal_logic`, `high_school_european_history`, ... | | | |
| | other: `anatomy`, `business_ethics`, ... | | | |
| mt_bench | / | Generation | | Multi-turn GPTEval |
| nq | / | Generation | | |
| openbookqa | `main` (default), `additional` | MultipleChoice | | Normalization |
| penguins_in_a_table | bigbench | MultipleChoice | | |
| piqa | / | MultipleChoice | | |
| quac | / | Generation | | |
| race | `high`, `middle` | MultipleChoice | | Normalization |
| real_toxicity_prompts | / | Generation | | Perspective Toxicity |
| rte | super_glue | MultipleChoice | | |
| siqa | / | MultipleChoice | | |
| squad, squad_v2 | / | Generation | | |
| story_cloze | `2016` (default), `2018` | MultipleChoice | | Manually download |
| tldr | / | Generation | | |
| triviaqa | `rc.wikipedia.nocontext` (default), `rc`, `rc.nocontext`, ... | Generation | | |
| truthfulqa_mc | `multiple_choice` (default), `generation` (not supported) | MultipleChoice | | |
| vicuna_bench | / | Generation | | GPTEval |
| webq | / | Generation | | |
| wic | super_glue | MultipleChoice | | |
| winogender | `main`, `gotcha` | MultipleChoice | | Group by gender |
| winograd | `wsc273` (default), `wsc285` | MultipleChoice | | |
| winogrande | `winogrande_debiased` (default), ... | MultipleChoice | | |
| wmt21, wmt19, ... | `en-ro`, `ro-en`, ... | Generation | | |
| wsc | super_glue | MultipleChoice | | |
| xsum | / | Generation | | |
By default, we load all the subsets of a dataset:
```bash
python inference.py -m model -d arc
# equivalent: arc:ARC-Easy,ARC-Challenge
```
Unless a default subset is defined:
```bash
python inference.py -m model -d cnn_dailymail
# equivalent: cnn_dailymail:3.0.0
```
If `dataset_path` is not None, the dataset will be loaded from the given local path:
```bash
# from a cloned directory of the huggingface dataset repository:
python inference.py -d copa --dataset_path /path/to/copa

# from a local (nested) directory saved by `dataset.save_to_disk`:
python inference.py -d race --dataset_path /path/to/race/middle
python inference.py -d race:middle --dataset_path /path/to/race
python inference.py -d race:middle --dataset_path /path/to/race/middle
python inference.py -d race:middle,high --dataset_path /path/to/race
```
`dataset_path` can also accept a dataset file or a directory containing these files (supports json, jsonl, csv, and txt):
```bash
# load one split from one subset only
python inference.py -d gsm8k --dataset_path /path/to/gsm.jsonl
python inference.py -d race --dataset_path /path/to/race/middle/train.json

# load test and train splits from the middle subset (a directory containing `/path/to/race/middle/train.json` and `/path/to/race/middle/test.json`)
python inference.py -d race --dataset_path /path/to/race/middle --evaluation_set "test[:10]" --example_set "train"

# load test and train splits from the middle and high subsets (a nested directory)
python inference.py -d race:middle,high --dataset_path /path/to/race --evaluation_set "test[:10]" --example_set "train"

# load test and train splits from the middle and high subsets with a filename pattern
python inference.py -d race:middle,high --evaluation_set "test[:10]" --example_set "train" --dataset_path "/pattern/of/race_{subset}_{split}.json"
python inference.py -d mmlu --evaluation_set val --example_set dev --dataset_path "/pattern/of/mmlu/{split}/{subset}_{split}.csv"
```
Also feel free to override the `load_raw_dataset` function if you want to load the dataset in a different way:
```python
from .utils import load_raw_dataset_from_file, get_raw_dataset_loader

class MyDataset(Dataset):

    def load_raw_dataset(self, dataset_path, subset_name, evaluation_set, example_set):
        self.evaluation_data = get_raw_dataset_loader(...)("test")
        self.example_data = load_raw_dataset_from_file("examples.json")
```
We provide two types of datasets: `GenerationDataset` and `MultipleChoiceDataset`. You can also customize support for a new dataset type by inheriting the `Dataset` class. For example, you can implement a new `GenerationDataset` as follows:
```python
class NewDataset(GenerationDataset):

    instruction = "Answer the following question."
    metrics = [Accuracy()]
    evaluation_set = "test"
    example_set = "dev"
    load_args = ("huggingface/path", "subset")
    extra_model_args = dict(temperature=0)
    category_subsets = {"Group": ["subset1", "subset2"]}

    def format_instance(self, instance):
        src, tgt = func(instance, self.example_data)
        return dict(source=src, target=tgt)

    def reference(self):
        return [i["answer"] for i in self.evaluation_data]
```
You can load the raw dataset by one of the following methods:

- Set `load_args`: the arguments for `datasets.load_dataset`.
- Or overwrite the `load_raw_dataset` function: set `self.evaluation_data` and `self.example_data`, as shown below.
```python
from .utils import load_raw_dataset_from_file, get_raw_dataset_loader

class MyDataset(Dataset):

    def load_raw_dataset(self, dataset_path, subset_name, evaluation_set, example_set):
        self.evaluation_data = get_raw_dataset_loader(...)("test")
        self.example_data = load_raw_dataset_from_file("examples.json")
```
Then, format the instance by implementing the `format_instance` method. The instance should be a dictionary with the following keys:

- `source` (`Union[str, List[str]]`): The source text. If this is a list, `source_idx` is required.
- `source_idx` (`int`, optional): The index of the correct source (for multiple-context ranking datasets like winogrande).
- `source_postfix` (`str`, optional): The postfix of the source text. This will be appended to the source text after the options when `ranking_with_options` is True.
- `target` (`str`, optional): The target text. Either `target` or `target_idx` should be provided.
- `target_idx` (`int`, optional): The index of the target in the options (for ranking). This will generate the `target` text in `_format_instance`.
- `options` (`List[str]`, optional): The options for ranking.
MultipleChoiceDataset:
```python
def format_instance(self, instance):
    return dict(
        source=self.source_prefix + instance["question"].strip(),
        source_postfix="\nAnswer:",
        target_idx=instance["answer"],
        options=options,
    )
```
MultipleChoiceDataset (multiple-context), e.g., winogrande:
```python
def format_instance(self, instance):
    return dict(
        source=contexts,
        source_idx=int(instance["answer"]) - 1,
        target=completion,
    )
```
GenerationDataset:
```python
def format_instance(self, instance):
    return dict(
        source=instance["question"],
        target=instance["answer"],
    )
```
See the `Dataset` class for more details.
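Putting the pieces together, a complete custom multiple-choice dataset might look like the sketch below. The dataset repository, field names, and label mapping are hypothetical; adapt them to your actual raw data:
```python
class MyQADataset(MultipleChoiceDataset):

    instruction = "Answer the following multiple-choice question."
    metrics = [Accuracy()]
    evaluation_set = "validation"
    example_set = "train"
    load_args = ("my_org/my_qa_dataset",)  # hypothetical Hugging Face repository

    def format_instance(self, instance):
        # assumed raw format: {"question": str, "choices": List[str], "label": int}
        return dict(
            source="Question: " + instance["question"].strip(),
            source_postfix="\nAnswer:",
            target_idx=instance["label"],
            options=instance["choices"],
        )

    def reference(self):
        return [instance["label"] for instance in self.evaluation_data]
```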