Jiaqi Han*, Mingjian Jiang*, Yuxuan Song, Jure Leskovec, Stefano Ermon, Minkai Xu*^
*Equal contribution. ^Corresponding author.
Stanford University
We introduce f-PO, a framework that generalizes preference optimization through f-divergence minimization.
Please refer to the environment file for the detailed dependencies.
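A typical setup is sketched below, assuming the environment file is a conda `environment.yml`; both the file name and the environment name are assumptions, so adjust them to match the repository:

```bash
# Create and activate the conda environment from the provided environment file
conda env create -f environment.yml
conda activate fpo
```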
Please refer to HH and TLDR for the procedures of data processing, generating the preference dataset with the SFT model, and labeling with the reward model.
Afterward, refer to `run/hh_pref.sh`, `run/hh_rw.sh`, `run/tldr_pref.sh`, and `run/tldr_rw.sh` for an integrated pipeline of training, inference, and GPT-4 evaluation on HH and TLDR in the preference-model and reward-model settings.
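For example, a pipeline script can be launched directly with bash (the exact invocation below is an assumption, and any script-specific arguments are omitted):

```bash
# HH, preference-model setting: training, inference, and GPT-4 evaluation
bash run/hh_pref.sh

# TLDR, reward-model setting
bash run/tldr_rw.sh
```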
Remember to change the variables `INIT_MODEL_PATH` and `DATA_PATH` in the scripts (e.g., `run/hh_pref.sh`), as well as `YOUR_PATH` and `YOUR_API_KEY` in `src/api.py`, for the code to work properly.
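As a sketch, the script variables could be set along these lines; the paths below are placeholders, not values from this repository:

```bash
# In run/hh_pref.sh: point to the initial SFT checkpoint and the processed data
INIT_MODEL_PATH=/path/to/sft_checkpoint
DATA_PATH=/path/to/processed_hh_data
```

Likewise, `YOUR_PATH` and `YOUR_API_KEY` in `src/api.py` should be replaced with your own path and API key (presumably the OpenAI key used for the GPT-4 evaluation).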
Use the following command to launch the training:
ACCELERATE_LOG_LEVEL=info \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
accelerate launch \
--main_process_port 21893 \
--config_file accelerate_configs/deepspeed_zero3.yaml \
scripts/run_simpo.py \
training_configs/llama-3-8b-base-alphapo.yaml
To run the full set of experiments with Llama3 and Mistral-7B in both the Base and Instruct settings, change the config file in the last line to the other configs in the `training_configs` folder.
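For instance, switching to another model or setting only changes the final argument; the config filename below is hypothetical and should be replaced with one that actually exists under `training_configs`:

```bash
# Same launch command, pointing to a different (hypothetical) training config
ACCELERATE_LOG_LEVEL=info \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
accelerate launch \
  --main_process_port 21893 \
  --config_file accelerate_configs/deepspeed_zero3.yaml \
  scripts/run_simpo.py \
  training_configs/mistral-7b-instruct-alphapo.yaml
```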
For the evaluation on AlpacaEval, MT-Bench, and ArenaHard, please refer here for detailed guidance and configurations.
Please consider citing our work if you find it useful:
@article{han2024f,
title={$ f $-PO: Generalizing Preference Optimization with $ f $-divergence Minimization},
author={Han, Jiaqi and Jiang, Mingjian and Song, Yuxuan and Leskovec, Jure and Ermon, Stefano and Xu, Minkai},
journal={arXiv preprint arXiv:2410.21662},
year={2024}
}
This repo is built upon EXO and SimPO. We thank the authors for their great work and for open-sourcing their code.