Dataset & code for our paper *DafnyBench: A Benchmark for Formal Software Verification*.

The dataset is also available for download on 🤗 Hugging Face.

DafnyBench is the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification, with 782 Dafny programs.
- Dataset: The DafnyBench dataset (782 programs) can be found in the `DafnyBench` directory, which contains the `ground_truth` set & the `hints_removed` set (with compiler hints, i.e. annotations, removed; see the short illustration after this list).
- Evaluation: Evaluate LLMs on DafnyBench by asking a model to fill in the missing hints in a test file from the `hints_removed` set and checking whether the reconstructed program is verified by Dafny. Please refer to the `eval` directory.
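To make the task concrete, here is a minimal illustrative sketch (not a file from the dataset) of what a "hint" is: a `ground_truth` program carries proof annotations such as the loop invariants below, the `hints_removed` counterpart deletes them, and the model must write them back so that Dafny can verify the program again.

```dafny
// Illustrative example only -- not taken from DafnyBench.
// A ground_truth-style program keeps the two `invariant` lines below;
// a hints_removed-style version deletes them, and without them Dafny
// cannot prove the postcondition `c == n`.
method CountUp(n: nat) returns (c: nat)
  ensures c == n
{
  c := 0;
  var i := 0;
  while i < n
    invariant 0 <= i <= n   // hint: bounds the loop counter
    invariant c == i        // hint: relates the accumulator to the counter
  {
    i := i + 1;
    c := c + 1;
  }
}
```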
- Install Dafny on your machine by following this tutorial
- Clone & `cd` into this repository
- Set up the environment by running the following lines:

  ```bash
  python -m venv stats
  source stats/bin/activate
  pip install -r requirements.txt
  cd eval
  ```
- Set up an environment variable for the root directory:

  ```bash
  export DAFNYBENCH_ROOT=
  ```

- Set up an environment variable for the path to the Dafny executable on your machine (for example, `/opt/homebrew/bin/Dafny`):

  ```bash
  export DAFNY_PATH=
  ```

- If you're evaluating an LLM through API access, set up the API key. For example:

  ```bash
  export OPENAI_API_KEY=
  ```
- You can evaluate an LLM on a single test program, for example:

  ```bash
  python fill_hints.py --model "gpt-4o" --test_file "Clover_abs_no_hints.dfy" --feedback_turn 3 --dafny_path "$DAFNY_PATH"
  ```

  or evaluate on the entire dataset:

  ```bash
  export model_to_eval='gpt-4o'
  ./run_eval.sh
  ```
- `DafnyBench` - A collection of 782 Dafny programs. Each program has a `ground_truth` version that is fully verified with Dafny & a `hints_removed` version that has hints (i.e. annotations) removed
- `eval` - Contains scripts to evaluate LLMs on DafnyBench
- `results`
  - `results_summary` - Dataframes that summarize LLMs' success on every test program
  - `reconstructed_files` - LLM outputs with hints filled back in
- `analysis` - Contains a notebook for analyzing the results
```bibtex
@article{loughridge2024dafnybench,
  title={DafnyBench: A Benchmark for Formal Software Verification},
  author={Chloe Loughridge and Qinyi Sun and Seth Ahrenbach and Federico Cassano and Chuyue Sun and Ying Sheng and Anish Mudide and Md Rakib Hossain Misu and Nada Amin and Max Tegmark},
  year={2024},
  journal={arXiv preprint arXiv:2406.08467}
}
```