


處女座的程序猿 · Published 2024-08-12 in Shanghai

LLMs / RLHF: "A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More" — Translation and Commentary


Translation and Commentary on "A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More"

Paper: https://arxiv.org/abs/2407.16216

Date: 2024-07-23

Authors: Zhichao Wang*, Bin Bi*, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Zixu (James) Zhu, Xiang-Bo Mao, Sitaram Asur, Na (Claire) Cheng

Affiliation: Salesforce

Summary

Background and pain points: Despite progress in self-supervised learning and instruction fine-tuning, large language models (LLMs) can still produce untruthful, toxic, or unhelpful responses that conflict with human intent, because the quality of their training data is uneven. Existing evaluation metrics such as BLEU, ROUGE, and BERTScore fail to capture human preferences over LLM outputs. LLMs therefore need to be aligned with human values to avoid generating inappropriate content.

Proposed solutions: Reinforcement Learning from Human Feedback (RLHF) adjusts the model using human feedback so that its outputs better match human expectations; it collects a human preference dataset (triplets of a prompt, a desired response, and an undesired response) and trains a reward model and an RL policy. Reinforcement Learning from AI Feedback (RLAIF) uses AI-generated feedback to reduce the cost of collecting human feedback. A minimal sketch of one such preference record follows.
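To make the (prompt, desired response, undesired response) triplet concrete, here is a minimal sketch of how one preference record and a preference dataset might be represented in code; the field names and example texts are illustrative assumptions, not a format defined in the survey.

```python
# One human-preference record: a prompt plus a preferred ("chosen") and a
# dispreferred ("rejected") response. Field names are illustrative only.
preference_record = {
    "prompt": "Explain why the sky is blue.",
    "chosen": "Sunlight scatters off air molecules, and shorter blue wavelengths scatter the most ...",
    "rejected": "The sky is blue because it reflects the ocean.",
}

# A preference dataset is just a list of such records; it is later used to train
# an explicit reward model or an implicit-reward objective such as DPO.
preference_dataset = [preference_record]
```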

Core ideas and steps

>> Reward model: Score generated responses with an explicit or implicit reward model; rewards can be assigned at the response level or the token level. Using the Bradley-Terry model, a pointwise reward function rφ(x, y) is trained on human preference data to predict, for a prompt x and response y, the probability that the response is the one humans prefer (see the formulas after this list).

>> Feedback mechanism: Collect preference feedback or binary feedback, in pairwise or listwise form, provided by humans or by AI.

>> RL policy: Reference-model-based RL with control over output length; different divergence measures, such as the KL divergence; online or offline policies. The LLM acts as the agent and the reward model as the environment, maximizing reward while minimizing the KL divergence to the reference model and avoiding the "alignment tax" (degraded performance on downstream tasks). The standard objective is written out after this list.

The survey examines different reward models (explicit vs. implicit, pointwise vs. preference-wise, etc.), feedback types (preference vs. binary, pairwise vs. listwise, etc.), RL objectives (reference-based vs. reference-free, etc.), and optimization schemes (online vs. offline, etc.).

>> Optimization: Iterative/online preference optimization versus non-iterative/offline preference optimization; keeping instruction fine-tuning and alignment as separate stages versus merging them.
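Written out explicitly, the Bradley-Terry reward-model loss and the KL-regularized RL objective referenced in the bullets above take the following standard forms. The notation (policy πθ, reference model πref, KL weight β) follows common RLHF convention rather than definitions given in this article.

```latex
% Bradley-Terry model: probability that y_w is preferred over y_l for prompt x,
% and the pointwise reward-model loss trained on preference triplets.
P(y_w \succ y_l \mid x) = \sigma\bigl(r_\phi(x, y_w) - r_\phi(x, y_l)\bigr),
\qquad
\mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
\Bigl[\log \sigma\bigl(r_\phi(x, y_w) - r_\phi(x, y_l)\bigr)\Bigr]

% KL-regularized RL objective: maximize reward while staying close to the reference model.
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\bigl[r_\phi(x, y)\bigr]
\;-\;
\beta\, \mathbb{D}_{\mathrm{KL}}\!\bigl[\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\bigr]
```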

Advantages: Human preferences are incorporated directly into model fine-tuning, improving consistency between LLMs and human intent. RLHF models such as InstructGPT outperform baselines like GPT-3 on truthfulness and harmlessness. A range of extensions to the RLHF framework has been explored, laying the groundwork for further alignment research.

>> Cost efficiency: RLAIF reduces reliance on expensive human feedback.

>> Flexibility: Multiple choices of feedback type and reward model adapt to different application scenarios.

>> Improved safety and reliability: The alignment process reduces the risk of generating inappropriate content.

Overall, this survey systematically reviews the main advances in LLM alignment techniques over the past two years, summarizes the open challenges, the proposed solutions, and their strengths and weaknesses, and provides a comprehensive overview for follow-up research in the field.

Abstract

With advancements in self-supervised learning, the availability of trillions of tokens in a pre-training corpus, instruction fine-tuning, and the development of large Transformers with billions of parameters, large language models (LLMs) are now capable of generating factual and coherent responses to human queries. However, the mixed quality of training data can lead to the generation of undesired responses, presenting a significant challenge. Over the past two years, various methods have been proposed from different perspectives to enhance LLMs, particularly in aligning them with human expectations. Despite these efforts, there has not been a comprehensive survey paper that categorizes and details these approaches. In this work, we aim to address this gap by categorizing these papers into distinct topics and providing detailed explanations of each alignment method, thereby helping readers gain a thorough understanding of the current state of the field.


1 Introduction

Over the past decades, the pretraining of LLMs through self-supervised learning [1] has seen significant advancements. These improvements have been driven by the development of larger decoder-only Transformers, the utilization of trillions of tokens, and the parallelization of computations across multiple GPUs. Following the pretraining phase, instruction tuning was employed to guide LLMs in responding to human queries. Despite these advancements, a critical issue remains unresolved: LLMs can generate undesired responses, such as providing instructions on how to commit illegal activities. To mitigate this risk, it is essential to align LLMs with human values.

Reinforcement Learning from Human Feedback (RLHF) [2, 3] has emerged as a groundbreaking technique for aligning LLMs. This approach has led to the development of powerful models such as GPT-4 [4], Claude [5], and Gemini [6]. Following the introduction of RLHF, numerous studies have explored various approaches to further align LLMs. However, there has not yet been a comprehensive review of methods for aligning LLMs with human preferences. This paper aims to fill that gap by categorically reviewing existing literature and providing detailed analyses of individual papers.


In this paper, we have structured our review into four main topics: 1. Reward Model; 2. Feedback; 3. Reinforcement Learning (RL); and 4. Optimization. Each topic was further divided into subtopics as shown in Figure 1. For the Reward Model, the subtopics were: 1. Explicit Reward Model vs. Implicit Reward Model; 2. Pointwise Reward Model vs. Preference Model; 3. Response-Level Reward vs. Token-Level Reward; and 4. Negative Preference Optimization. Regarding Feedback, the subtopics included: 1. Preference Feedback vs. Binary Feedback; 2. Pairwise Feedback vs. Listwise Feedback; and 3. Human Feedback vs. AI Feedback. In the RL section, the subtopics were: 1. Reference-Based RL vs. Reference-Free RL; 2. Length-Control RL; 3. Different Divergences in RL; and 4. On-Policy RL vs. Off-Policy RL. For Optimization, the subtopics were: 1. Online/Iterative Preference Optimization vs. Offline/Non-iterative Preference Optimization; and 2. Separating SFT and Alignment vs. Merging SFT and Alignment. Table 1 provided an analysis of all the papers reviewed in detail using these 13 evaluation metrics.


Figure 1: The 13 categorical directions for xPO to align an LLM with human preference

4 Future Directions

Based on the analysis of the reviewed papers, several research problems have been identified for further exploration.


4.1 General Tasks for Alignment Evaluation

When reviewing various papers, different tasks were used to evaluate the performance of these methods. However, some tasks, like GSM8K [65], which focused more on reasoning, might not be suitable for assessing alignment performance. In contrast, tasks like TruthfulQA [45] or those addressing toxicity should be prioritized for evaluating the toxicity of fine-tuned LLMs. There should be an effort to combine these tasks and create a unified leaderboard for alignment evaluation.
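As a rough illustration of what a unified alignment leaderboard could look like, the sketch below aggregates per-task scores (e.g., a truthfulness task and a toxicity task) into a single alignment score and ranks models by it. The task names, weights, and averaging scheme are assumptions made for illustration, not something the survey specifies.

```python
# Hypothetical aggregation of alignment-relevant benchmarks into one leaderboard score.
# Task names and weights are illustrative assumptions.
from typing import Dict

ALIGNMENT_TASKS = {
    "truthfulqa": 0.5,   # truthfulness-oriented task
    "toxicity":   0.5,   # toxicity-oriented task (higher score = less toxic)
}

def alignment_score(task_scores: Dict[str, float]) -> float:
    """Weighted average of per-task scores normalized to [0, 1]."""
    return sum(ALIGNMENT_TASKS[t] * task_scores[t] for t in ALIGNMENT_TASKS)

def leaderboard(models: Dict[str, Dict[str, float]]) -> list:
    """Rank models by their aggregated alignment score."""
    return sorted(models.items(), key=lambda kv: alignment_score(kv[1]), reverse=True)

# Example usage with made-up numbers:
print(leaderboard({
    "model_a": {"truthfulqa": 0.62, "toxicity": 0.91},
    "model_b": {"truthfulqa": 0.70, "toxicity": 0.85},
}))
```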


4.2 Apply Implicit Reward Models, Listwise Preference and Nash Learning to Larger Scale LMs

Currently, implicit reward model methods have been applied only to models with up to 70B parameters. Extending these methods to even larger models, such as those the size of GPT-4 and Claude-3, can provide insights into their effectiveness compared to RLHF/PPO. Similarly, the listwise preference model warrants further investigation. In RLHF, preference datasets were collected using listwise preference but were subsequently transformed into multiple pairs of pairwise preferences. The potential issues associated with applying listwise preference models at larger scales remain to be addressed. Lastly, Nash learning can address the inconsistency among human labelers. Incorporating a Nash learning model into larger-scale LLMs can demonstrate its ability to capture the complexity of human nature.
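To make the listwise-to-pairwise transformation mentioned above concrete, here is a minimal sketch that expands one listwise ranking (responses ordered best to worst) into the pairwise preferences that most alignment objectives consume; the record layout is an illustrative assumption.

```python
from itertools import combinations

def listwise_to_pairwise(prompt: str, ranked_responses: list) -> list:
    """Expand a listwise ranking (best first) into (prompt, chosen, rejected) pairs.

    A ranking of k responses yields k*(k-1)/2 pairwise preferences.
    """
    pairs = []
    for i, j in combinations(range(len(ranked_responses)), 2):
        pairs.append({
            "prompt": prompt,
            "chosen": ranked_responses[i],    # ranked higher
            "rejected": ranked_responses[j],  # ranked lower
        })
    return pairs

# Example: a 4-way ranking produces 6 pairwise preferences.
pairs = listwise_to_pairwise("Summarize this article.", ["A", "B", "C", "D"])
assert len(pairs) == 6
```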


4.3 Experiments on Binary Feedbacks

Both KTO and DRO utilized binary feedback mechanisms, such as "thumbs up" and "thumbs down", instead of pairwise preferences. These binary feedbacks were derived from preference datasets, where desired responses were marked as positive and undesired responses as negative. Further research is needed on realistic binary datasets. Additionally, binary datasets are easier to collect compared to pairwise preference data, making it feasible to use larger-scale binary feedback datasets for alignment. However, the noise in binary feedback may be more pronounced than in preference datasets, raising the intriguing question of how to effectively filter out noisy data.
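The way binary feedback is derived from a preference dataset, as described above, can be sketched as follows: every preferred response becomes a "thumbs up" record and every dispreferred response a "thumbs down" record, optionally followed by a confidence-based noise filter. The record layout and the filtering heuristic are illustrative assumptions, not procedures defined by KTO, DRO, or the survey.

```python
def preferences_to_binary(preference_dataset: list) -> list:
    """Turn (prompt, chosen, rejected) triplets into binary thumbs-up/down records."""
    binary = []
    for rec in preference_dataset:
        binary.append({"prompt": rec["prompt"], "response": rec["chosen"],   "label": 1})  # thumbs up
        binary.append({"prompt": rec["prompt"], "response": rec["rejected"], "label": 0})  # thumbs down
    return binary

def filter_noisy(binary_dataset: list, confidence: dict, threshold: float = 0.6) -> list:
    """Illustrative noise filter: keep only records whose annotation confidence
    (however it is estimated) meets a threshold."""
    return [r for r in binary_dataset
            if confidence.get((r["prompt"], r["response"]), 1.0) >= threshold]
```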


4.4 Experiments on Helpful AI Feedback

Current AI feedback primarily includes harmless feedback in RLAIF and feedback ranking in iterative DPO. However, in RLAIF, helpful feedback is still provided by human labelers. This approach is reasonable, as generating helpful responses is significantly more challenging than identifying harmful ones. An intriguing future direction involves using LLMs to generate helpful feedback, thereby enabling LLMs to self-improve.
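One possible shape of the self-improvement loop suggested here is sketched below: the model generates several candidates per prompt, an LLM judge scores them for helpfulness, and the best/worst pair becomes new preference data. The generate and score_helpfulness interfaces are hypothetical placeholders, not APIs from the survey or any specific library.

```python
def self_improvement_round(llm, judge, prompts: list, n_candidates: int = 4) -> list:
    """Collect AI-generated helpfulness feedback as new preference pairs.

    `llm.generate(prompt)` and `judge.score_helpfulness(prompt, response)` are
    hypothetical interfaces used only to sketch the loop.
    """
    new_pairs = []
    for prompt in prompts:
        candidates = [llm.generate(prompt) for _ in range(n_candidates)]
        ranked = sorted(candidates,
                        key=lambda r: judge.score_helpfulness(prompt, r),
                        reverse=True)
        new_pairs.append({"prompt": prompt,
                          "chosen": ranked[0],      # judged most helpful
                          "rejected": ranked[-1]})  # judged least helpful
    return new_pairs  # feed into the next iteration of preference optimization
```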


4.5 Speeding up Nash Learning

The proposed Nash learning method effectively modeled pairwise preferences and addressed inconsistencies arising from human labeling. However, it necessitated multiple iterations to converge to the optimal policy. Although the authors did not specify the time required for alignment, it was presumed to be significantly slower compared to implicit reward models such as DPO. This area warrants further research attention to speed up the Nash learning process.


4.6 Termination of Iterative/Online Learning

When applying iterative or online training, determining when to terminate the iteration is crucial. Previous research has noted that iterative learning can sometimes degrade the performance of LLMs on specific tasks, which can be a sign of overfitting. However, identifying a reasonable epoch for stopping the iteration remains an unexplored area.
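A simple stopping heuristic consistent with this observation is to track a held-out alignment metric (for example, a win rate against a fixed reference) after each iteration and stop once it degrades for several consecutive rounds. The patience value and the train/evaluate hooks below are illustrative assumptions, not a criterion proposed in the survey.

```python
def train_with_early_stopping(train_one_iteration, evaluate, max_iters: int = 20,
                              patience: int = 2):
    """Run iterative/online preference optimization until a held-out metric degrades.

    `train_one_iteration()` performs one round of online preference optimization and
    returns the updated policy; `evaluate(policy)` returns a held-out score such as a
    win rate. Both are hypothetical hooks for illustration.
    """
    best_score, best_policy, bad_rounds = float("-inf"), None, 0
    for _ in range(max_iters):
        policy = train_one_iteration()
        score = evaluate(policy)
        if score > best_score:
            best_score, best_policy, bad_rounds = score, policy, 0
        else:
            bad_rounds += 1          # possible sign of overfitting
            if bad_rounds >= patience:
                break                # terminate the iteration
    return best_policy
```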


4.7 Simplify SFT + Alignment

Current methodologies typically implemented SFT and alignment in a consecutive manner. However, this approach often resulted in catastrophic forgetting and rendered the training process laborious. The PAFT method mitigated catastrophic forgetting by fine-tuning SFT and alignment separately before merging them, albeit at the cost of increased complexity. Conversely, the ORPO technique integrated both processes simultaneously, but this led to a decline in performance. Thus, the challenge of effectively combining SFT and alignment to achieve high performance while maintaining efficiency remains unresolved.
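As a rough illustration of the "fine-tune separately, then merge" idea attributed to PAFT above, the sketch below linearly interpolates the parameters of an SFT-tuned checkpoint and an alignment-tuned checkpoint. This is a generic weight-merging sketch under simplifying assumptions, not PAFT's actual merging algorithm.

```python
import torch

def merge_models(sft_state: dict, aligned_state: dict, alpha: float = 0.5) -> dict:
    """Naive parameter-space merge of two fine-tuned checkpoints.

    A plain linear interpolation of matching tensors; real methods (e.g., PAFT)
    use more sophisticated merging, so treat this only as an illustration of the idea.
    """
    merged = {}
    for name, sft_param in sft_state.items():
        aligned_param = aligned_state[name]
        merged[name] = (1 - alpha) * sft_param + alpha * aligned_param
    return merged

# Usage sketch: load the two checkpoints' state_dicts and merge them.
# merged_state = merge_models(torch.load("sft.pt"), torch.load("aligned.pt"), alpha=0.5)
# model.load_state_dict(merged_state)
```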

