處女座的程序猿, published in Shanghai on 2025-04-21

LLMs/LRMs: Translation and Commentary on "Revisiting Prompt Optimization with Large Reasoning Models-A Case Study on Event Extraction"

Overview: This paper studies how effective prompt optimization is for Large Reasoning Models (LRMs), using event extraction as the application task. It shows that prompt optimization remains effective even for strong LRMs, and that LRMs themselves can serve as efficient and stable prompt optimizers. This provides useful theoretical and practical guidance for future work that uses LRMs for prompt optimization.

>> Background and pain points:

● LLMs are limited on complex reasoning tasks: Although Large Language Models (LLMs) perform well across a wide range of natural language processing tasks, they still fall short on tasks that require complex reasoning, such as event extraction.

● Prompt optimization is crucial for LLMs, but its effect on LRMs is unclear: Traditional prompt optimization methods are highly effective at improving LLM performance, but because LRMs come with strong built-in reasoning abilities, it has been questioned whether they still need prompt optimization; a systematic study of prompt optimization for LRMs has been lacking.

● Existing prompt optimization research mostly targets tasks where zero-shot baselines already perform well: Many studies overlook reasoning-intensive tasks such as event extraction, which remain challenging even for strong models like GPT-4.

>> Proposed solution: The paper adopts a prompt optimization framework based on Monte Carlo Tree Search (MCTS), systematically studies how prompt optimization affects LRMs on the event extraction task, and compares them with LLMs. The framework consists of the following steps:

>> Core steps

● Problem setup: Prompt optimization is framed as a discrete search problem whose goal is to find the prompt that maximizes the F1 score on the event extraction task.

● Prompt representation: Model inputs and outputs are represented as Python code; the initial prompt consists of a task instruction and an event schema defined as Python classes with human-written guidelines.

● MCTS framework: The MCTS algorithm explores the prompt space; each iteration performs the following steps (a minimal code sketch follows this list):

● Answer generation: The task model (Mtask) generates answers for the current prompt and input text.

● Error extraction: A Python interpreter identifies errors in the answers (e.g., parsing errors, undefined event classes, hallucinated spans).

● Feedback generation: The optimizer model (Mopt) analyzes the errors and generates feedback suggesting revisions to the task instruction and event guidelines.

● Prompt update: The optimizer model produces an updated prompt based on the feedback.

● Reward evaluation: The updated prompt is evaluated on the development set, with the average F1 score used as the reward.

● Feedback and update meta-prompts: Meta-prompts guide the optimizer model to produce structured feedback and to rewrite the prompt accordingly.

● Evaluation: Model performance is measured with four F1 metrics: trigger identification (TI), trigger classification (TC), argument identification (AI), and argument classification (AC).
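The post itself contains no code, so the following minimal Python sketch is our own illustration of what one expansion step of such an MCTS-style loop could look like. The callables run_task_model, extract_errors, generate_feedback, rewrite_prompt, and score_f1 are hypothetical stand-ins for the task model (Mtask), the Python-interpreter error check, the optimizer model (Mopt), and the F1 scorer; this is not the authors' implementation.

from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, List


@dataclass
class PromptNode:
    """One node in the search tree: a candidate prompt and its dev-set reward."""
    prompt: str
    reward: float = 0.0
    children: List["PromptNode"] = field(default_factory=list)


def expand_node(
    node: PromptNode,
    train_batch: list,            # (text, gold_events) pairs
    dev_set: list,                # (text, gold_events) pairs
    run_task_model: Callable,     # Mtask: (prompt, text) -> predicted events
    extract_errors: Callable,     # interpreter check: (prediction, gold, text) -> error list
    generate_feedback: Callable,  # Mopt: (prompt, errors) -> natural-language feedback
    rewrite_prompt: Callable,     # Mopt: (prompt, feedback) -> updated prompt
    score_f1: Callable,           # (prediction, gold) -> F1 (e.g., averaged over TI/TC/AI/AC)
) -> PromptNode:
    """One expansion step: generate answers, extract errors, get feedback,
    rewrite the prompt, and score the new prompt on the development set."""
    # 1. Answer generation with the current prompt.
    predictions = [run_task_model(node.prompt, text) for text, _ in train_batch]

    # 2. Error extraction: parsing errors, undefined event classes, hallucinated spans, etc.
    errors = [
        extract_errors(pred, gold, text)
        for pred, (text, gold) in zip(predictions, train_batch)
    ]

    # 3. Feedback generation and 4. prompt update via the optimizer model.
    feedback = generate_feedback(node.prompt, errors)
    new_prompt = rewrite_prompt(node.prompt, feedback)

    # 5. Reward evaluation: average F1 of the updated prompt on the dev set.
    dev_scores = [score_f1(run_task_model(new_prompt, text), gold) for text, gold in dev_set]
    child = PromptNode(prompt=new_prompt, reward=mean(dev_scores))
    node.children.append(child)
    return child

A full optimizer would wrap this expansion step in the usual MCTS selection and backpropagation over node rewards; that machinery is omitted here for brevity.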

>> Strengths:

● Systematic study: The first systematic study of prompt optimization for LRMs, with a direct comparison against LLMs.

● Challenging task: Experiments are run on event extraction, a challenging task that demands complex reasoning.

● Unified framework: A single MCTS framework is used to evaluate LRMs both as task models and as prompt optimizers.

● Multi-model comparison: Two LRMs (DeepSeek-R1 and o1) and two LLMs (GPT-4.5 and GPT-4o) are tested.

● Resource settings: Both low-resource and medium-resource scenarios are considered.

>> Conclusions and findings:

● LRMs benefit from prompt optimization: LRMs gain more from prompt optimization than LLMs do, even when the training set is very small.

● LRMs are better prompt optimizers: As optimizers, LRMs generate higher-quality prompts that tend to be shorter and more precise, and that include task-specific heuristics and exception-handling rules.

● LRMs are more efficient and stable optimizers: LRMs guide task models to peak performance faster and with less variance.

● Prompt length vs. performance: Shorter prompts do not necessarily mean lower performance, and different task models may prefer prompts of different lengths; DeepSeek-R1 achieved its best performance with the shortest prompts.

● Error analysis: Prompts optimized by LRMs reduce several kinds of errors, such as argument overprediction, hallucination, and parsing errors (see the validation sketch below).
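To make these error categories concrete, the short sketch below shows the kind of checks the framework's error-extraction step could run with a Python interpreter. The output format (event calls with keyword arguments), the toy schema classes, and the example sentence are illustrative assumptions rather than the paper's actual code.

import ast


def check_prediction(pred_code: str, schema_classes: set, source_text: str) -> list:
    """Flag parsing errors, undefined event classes, and hallucinated argument
    spans in a model prediction expressed as Python code."""
    errors = []

    # Parsing error: the model's output is not valid Python at all.
    try:
        tree = ast.parse(pred_code)
    except SyntaxError as exc:
        return [f"parsing error: {exc.msg}"]

    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Undefined event class: the predicted label is not part of the schema.
            if node.func.id not in schema_classes:
                errors.append(f"undefined event class: {node.func.id}")
            # Hallucinated span: an argument string that never appears in the source text.
            for kw in node.keywords:
                if isinstance(kw.value, ast.Constant) and isinstance(kw.value.value, str):
                    if kw.value.value not in source_text:
                        errors.append(f"hallucinated span: {kw.value.value!r}")
    return errors


# Toy example: the target span "the troops" does not occur in the sentence, so it is flagged.
print(check_prediction(
    'Attack(trigger="fired", target="the troops")',
    schema_classes={"Attack", "Transport"},
    source_text="Rebels fired on the convoy near the border.",
))  # -> ["hallucinated span: 'the troops'"]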


"Revisiting Prompt Optimization with Large Reasoning Models-A Case Study on Event Extraction": Translation and Commentary

Paper: [2504.07357] Revisiting Prompt Optimization with Large Reasoning Models-A Case Study on Event Extraction

Date: 2025-04-10

Authors: Saurabh Srivastava & Ziyu Yao, George Mason University

Abstract

Large Reasoning Models (LRMs) such as DeepSeek-R1 and OpenAI o1 have demonstrated remarkable capabilities in various reasoning tasks. Their strong capability to generate and reason over intermediate thoughts has also led to arguments that they may no longer require extensive prompt engineering or optimization to interpret human instructions and produce accurate outputs. In this work, we aim to systematically study this open question, using the structured task of event extraction for a case study. We experimented with two LRMs (DeepSeek-R1 and o1) and two general-purpose Large Language Models (LLMs) (GPT-4o and GPT-4.5), when they were used as task models or prompt optimizers. Our results show that on tasks as complicated as event extraction, LRMs as task models still benefit from prompt optimization, and that using LRMs as prompt optimizers yields more effective prompts. Finally, we provide an error analysis of common errors made by LRMs and highlight the stability and consistency of LRMs in refining task instructions and event guidelines.

1. Introduction

In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities across various natural language processing tasks. However, their proficiency in complex reasoning tasks has often been limited (Zhou et al., 2022). To address this, a new class of models, known as Large Reasoning Models (LRMs), has emerged, focusing on enhancing reasoning abilities through advanced training methodologies. One prominent example is DeepSeek-R1 (Guo et al., 2025), an open-source LRM that has achieved state-of-the-art performance on several reasoning benchmarks, including MATH-500 (Lin et al., 2025) and SWE-bench Verified (Jimenez et al., 2023). Similarly, OpenAI’s o1 (Zhong et al., 2024) has set new standards in reasoning tasks, showcasing superior performance in complex problem-solving scenarios.

The advent of these advanced reasoning models has sparked discussions (Wang et al., 2024a; OpenAI, 2025; Mantaras, 2025; Together AI, 2025; Menendez et al., 2025) about the necessity of prompt optimization—the process of refining input prompts to guide model outputs effectively (Zhou et al., 2022; Yang et al., 2024; Srivastava et al., 2024; Agarwal et al., 2024; Guo et al., 2024; Fernando et al., 2024; Li et al., 2025). Traditionally, prompt optimization has been crucial for enhancing LLM performance, with frameworks like PromptAgent (Wang et al., 2024b) and OPRO (Yang et al., 2024) automating the creation and refinement of prompts through iterative feedback and strategic planning. However, the inherent reasoning capabilities of LRMs like DeepSeek-R1 and o1 raise questions about whether such prompt optimization techniques are equally beneficial for these models. While previous studies have demonstrated the effectiveness of prompt optimization in improving LLM performance, there is a notable gap in research focusing on its impact on LRMs. Moreover, many existing prompt optimization studies focus on tasks where zero-shot baselines already perform well, whereas recent work, such as Gao et al. (2024), demonstrates that even powerful models like GPT-4 struggle with Information Extraction tasks, underscoring the need for more targeted and optimized prompting strategies. We present a discussion on related works in Appendix A.

To fill this gap, we conduct the first systematic study of prompt optimization with LRMs and compare their performance with LLMs. In particular, we experimented with these models on a challenging task, i.e., end-to-end event extraction (EE), a structured prediction task of information extraction that requires identifying and classifying event triggers and arguments within text. EE poses unique challenges: models must follow schema constraints, handle coreference, and balance precision with recall, all of which demand nuanced reasoning. We evaluated four models, two LRMs (DeepSeek-R1, o1) and two LLMs (GPT-4.5, GPT-4o) as both task models and prompt optimizers within a Monte Carlo Tree Search (MCTS) framework (Wang et al., 2024b). This setup allows us to examine both task performance and prompt optimization quality under a consistent setting. Our findings are organized around the following research questions:

1. Do LRMs benefit from prompt optimization? We find that LRMs such as DeepSeek-R1 and o1 show substantial gains from prompt optimization, outperforming their non-optimized versions as well as LLMs, even when the training set is extremely small, showing that even strong reasoning models still benefit significantly from prompt optimization.

2. How do LRMs behave under the full-scale MCTS prompt optimization? Using our MCTS-based framework, we analyze how model performance evolves across optimization depth. LRMs scale more consistently than LLMs, converging faster and with less variance. For instance, DeepSeek-R1 achieves peak performance by depth 2, while LLMs require deeper exploration and still underperform.

3. Do LRMs make better prompt optimizers? LRMs generate high-quality prompts when used as optimizers, often (especially for DeepSeek-R1) producing shorter, more precise prompts than LLMs. These prompts contain extraction rules and exception cases that mirror human annotation guidelines, leading to better downstream task performance.

4. Can LRMs act as efficient and stable optimizers in prompt optimization? When used as optimizers, LRMs guide models to peak performance more efficiently than LLMs. They help task models achieve convergence at shallower MCTS depth with lower variance across nodes, indicating both faster and greater stability.

Finally, our analyses show that LRMs generally produce more effective prompts. These optimized prompts often include task-specific heuristics and exception handling rules, which help reduce common trigger-related mistakes such as identifying multiple or implicit events, and slightly mitigate argument-level errors like coreferences and span overprediction. Among all the models in our experiments, DeepSeek-R1 produced the shortest (yet most effective) prompts. Interestingly, we observe that a longer prompt is not necessarily a more effective one, and various task models may have different preferences over various lengths of prompts. These findings align with the guidance on prompting LRMs (Mantaras, 2025; Together AI, 2025; OpenAI, 2025), which recommends using concise, focused instructions that avoid extraneous or overly complex phrasing, but in the meantime supplying the models with necessary task specifications. Our work demonstrates that, even with LRMs, prompt optimization is still valuable by automatically optimizing the prompt to be task-targeted yet concise.

最后,我們的分析表明,LRMs 通常能生成更有效的提示。這些優(yōu)化后的提示往往包含特定任務的啟發(fā)式方法和異常處理規(guī)則,有助于減少常見的觸發(fā)相關錯誤,例如識別多個或隱含事件,還能略微減輕諸如共指和跨度過度預測等論元級錯誤。在我們實驗中的所有模型中,DeepSeek-R1 生成的提示最短(但最有效)。有趣的是,我們發(fā)現(xiàn)較長的提示并不一定更有效,不同的任務模型可能對不同長度的提示有不同的偏好。這些發(fā)現(xiàn)與對 LRMs 提示的指導(Mantaras,2025;Together AI,2025;)相符。OpenAI(2025 年)建議使用簡潔、明確的指令,避免冗余或過于復雜的措辭,但同時要為模型提供必要的任務說明。我們的工作表明,即使使用語言模型,提示優(yōu)化仍然很有價值,能夠自動優(yōu)化提示,使其既針對任務又簡潔明了。

Figure 1: Summary of our main results, where LRMs and LLMs are used as either the task model (Mtask) or the optimizer (Mopt) in prompt optimization, and we observed a strong advantage of LRMs over LLMs.

Figure 2: Overview of our prompt optimization framework using language models. At each iteration, a zero-shot task LLM generates outputs, while a separate optimizer LLM analyzes the errors and updates the prompt, including task instructions and event guidelines, accordingly. This process continues over batches of training samples Dtrain, and the final optimized prompt is evaluated on the development set to determine the node reward rt.

Figure 3: A code prompt consists of a task instruction and an event schema. The event schema contains information about the labels that are represented as Python classes and event guidelines defining both the event classes and the arguments. In prompt optimization, we refine both the task instruction and event guidelines (shown for two events; others omitted due to space limits) to generate more effective prompts for the task model.
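Figure 3 itself is not reproduced in this post. As a rough illustration of the kind of code prompt it describes, the toy schema below defines event classes whose docstrings carry the event guidelines; the class names, argument roles, and guideline wording are our own assumptions, not the paper's actual schema.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    """Base class for all event types. The trigger is the word or phrase in the
    text that most directly expresses the occurrence of the event."""
    trigger: str


@dataclass
class Attack(Event):
    """An Attack event occurs when an attacker uses violence to harm a target.
    Guideline (illustrative): mark only the head word of the trigger, and do not
    annotate planned or hypothetical attacks."""
    attacker: Optional[List[str]] = None
    target: Optional[List[str]] = None
    instrument: Optional[List[str]] = None
    place: Optional[List[str]] = None


@dataclass
class Transport(Event):
    """A Transport event occurs when an artifact or person is moved from one
    place to another. Guideline (illustrative): the destination must be a location
    mentioned in the text, not one inferred from context."""
    agent: Optional[List[str]] = None
    artifact: Optional[List[str]] = None
    origin: Optional[List[str]] = None
    destination: Optional[List[str]] = None

During optimization, it is the task instruction and these docstring guidelines that the optimizer model rewrites (adding heuristics and exception cases), while the class and field structure stays fixed by the schema.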

Conclusion

We present the first systematic study of prompt optimization for LRMs, evaluating their roles as both task models and optimizers in a unified MCTS framework. On the structured task of event extraction, we find that LRMs benefit more from prompt optimization than LLMs and serve as stronger optimizers. They produce higher-quality prompts, converge faster, and generalize more reliably across models—highlighting their effectiveness in both prompt consumption and generation. Our error analysis further reveals that prompts optimized by LRMs reduce overprediction, hallucination, and parsing errors, contributing to more faithful and structured outputs.
