A Comparison of Alignment Fine-Tuning Methods for Large Language Models on Honesty, Helpfulness, and Harmlessness

Authors

  • Haodong Huo

Keywords

Alignment Tuning, Honesty, Harmlessness, Helpfulness, RLHF, DPO

Abstract

The alignment of large language models (LLMs) with the human values of Honesty, Helpfulness, and Harmlessness (HHH) is a critical issue for their safe deployment. However, a comparative analysis across these three dimensions remains lacking for the primary alignment approaches: Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO). This paper conducts a comparative study of the performance of these alignment methods across the HHH dimensions through a review and synthesis of the relevant research literature. The results indicate performance trade-offs among the alignment algorithms. In particular, SFT is the foundation of the alignment process but is generally outperformed by preference-based methods. RLHF excels in helpfulness and harmlessness but involves greater implementation complexity. DPO matches RLHF in helpfulness with a simpler and more efficient pipeline and has an advantage in honesty, but it is more sensitive to dataset quality and is slightly weaker in harmlessness.
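For context, the DPO objective the abstract refers to is the standard formulation from Rafailov et al. (2023), reproduced here for reference rather than taken from this paper. Here \(\pi_\theta\) is the policy being tuned, \(\pi_{\mathrm{ref}}\) the frozen reference model, \(\beta\) a temperature hyperparameter, and \((x, y_w, y_l)\) a prompt with its preferred and dispreferred responses:

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\,\pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
\]

This objective optimizes preferences directly against a reference model, which is the "simpler and more efficient pipeline" contrasted with RLHF's separate reward model and reinforcement learning loop.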

Published

2026-02-28

Section

Articles