
$f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization

Jiaqi Han*, Mingjian Jiang*, Yuxuan Song, Jure Leskovec, Stefano Ermon, Minkai Xu*^

*Equal contribution. ^Corresponding author.

Stanford University

License: MIT · arXiv:2410.21662

Introduction

We introduce $f$-PO, a novel framework that generalizes preference optimization by minimizing an $f$-divergence, i.e., by solving a distribution matching problem. The framework subsumes existing offline preference optimization methods such as DPO, while inspiring new formulations drawn from other members of the $f$-divergence family. Results demonstrate the efficacy of $f$-PO on a wide suite of benchmarks, including AlpacaEval, MT-Bench, ArenaHard, and more.
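
As a pointer to the underlying quantity (the exact objective and target distributions of $f$-PO are defined in the paper), recall the standard $f$-divergence between distributions $P$ and $Q$, for a convex $f$ with $f(1) = 0$:

$$D_f(P \,\|\, Q) = \mathbb{E}_{x \sim Q}\!\left[ f\!\left( \frac{P(x)}{Q(x)} \right) \right]$$

Choosing $f(t) = t \log t$ recovers the forward KL divergence, and $f(t) = -\log t$ the reverse KL.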

Overview

[Overview figure of the $f$-PO framework.]

Environment

Please refer to the environment file for detailed dependencies.
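
A minimal setup sketch, assuming the environment file is a conda environment.yml that names the environment fpo (both the file name and the environment name are assumptions; adjust to match the actual file):

# Create and activate the conda environment from the provided file.
# The file name and environment name below are assumptions.
conda env create -f environment.yml
conda activate fpo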

Experiments

Tuning Pythia-2.8B on HH and TLDR

Please refer to HH and TLDR for the procedure of data processing, generating the preference dataset with the SFT model, and labeling with the reward model.

Afterward, refer to run/hh_pref.sh, run/hh_rw.sh, run/tldr_pref.sh, and run/tldr_rw.sh for an integrated pipeline of training, inference, and GPT-4 evaluation on HH and TLDR in the preference-model and reward-model settings.

Remember to change the variables INIT_MODEL_PATH and DATA_PATH in the scripts (e.g., run/hh_pref.sh), as well as YOUR_PATH and YOUR_API_KEY in src/api.py, for the code to work properly; a sketch follows.
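
For example, a sketch of the required edits and launch; the paths below are placeholders, not values shipped with the repo:

# In run/hh_pref.sh, point these variables at your own assets (placeholders shown):
INIT_MODEL_PATH=/path/to/pythia-2.8b-sft   # SFT checkpoint to initialize training from
DATA_PATH=/path/to/hh_preference_data      # processed HH preference dataset

# After also setting YOUR_PATH and YOUR_API_KEY in src/api.py, launch the
# integrated training / inference / GPT-4 evaluation pipeline:
bash run/hh_pref.sh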

Tuning Llama3 and Mistral-7B

Use the following command to launch the training:

ACCELERATE_LOG_LEVEL=info \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
accelerate launch \
--main_process_port 21893 \
--config_file accelerate_configs/deepspeed_zero3.yaml \
scripts/run_simpo.py \
training_configs/llama-3-8b-base-alphapo.yaml

To run the full set of Llama3 and Mistral-7B experiments in both the Base and Instruct settings, replace the config file in the last line with the other configs in the training_configs folder, as in the sketch below.
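
For instance, a sketch that sweeps every provided config by reusing the launch command above (the loop itself is illustrative, not a script shipped with the repo):

# Illustrative sweep over all training configs in training_configs/.
for cfg in training_configs/*.yaml; do
  ACCELERATE_LOG_LEVEL=info \
  CUDA_VISIBLE_DEVICES=0,1,2,3 \
  accelerate launch \
    --main_process_port 21893 \
    --config_file accelerate_configs/deepspeed_zero3.yaml \
    scripts/run_simpo.py "$cfg"
done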

For evaluation on AlpacaEval, MT-Bench, and ArenaHard, refer to the evaluation guide (linked from the repository) for detailed guidance and configurations; a minimal AlpacaEval example follows.
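
As one example, scoring generations with the standalone alpaca_eval CLI looks roughly as follows; the output file name is a placeholder, and this uses the upstream tool's documented entry point rather than a script from this repo:

# Install the AlpacaEval tool and score your model's generations.
# model_outputs.json is a placeholder for your generated responses.
pip install alpaca-eval
alpaca_eval --model_outputs model_outputs.json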

Citation

Please consider citing our work if you find it useful:

@article{han2024f,
  title={$f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization},
  author={Han, Jiaqi and Jiang, Mingjian and Song, Yuxuan and Leskovec, Jure and Ermon, Stefano and Xu, Minkai},
  journal={arXiv preprint arXiv:2410.21662},
  year={2024}
}

Acknowledgment

This repo is built upon EXO and SimPO. We thank the authors for their great work and for open-sourcing their code.
