[Parallel-R1 Code Implementation] SFT

1. Why is SFT needed?

The Parallel-R1 paper notes that on a hard dataset such as DAPO, large models fail to generate parallel chains of thought, whereas on an easy dataset such as GSM8K they can.
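The inference drawn from this: on this kind of parallel reasoning task, instruction following is poor. That is true of large models, and small models only do worse. So special tokens have to be inserted, followed by SFT post-training.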

2. What is the basic setup of the SFT data?

  • User prompt (the system prompt is left empty, following Parallel-R1)
Solve the following problem step by step.
During the reasoning process, whenever you encounter a step that may benefit from multiple perspectives or independent reasoning, insert a parallel block at that point.
follow this format:
1. Start with <Parallel> to begin parallel reasoning
2. Generate multiple <Path> blocks, each with a different approach
3. Within each <Path>, you can use tools:
   - Call tools with: <tool_call>{{"name": "tool_name", "arguments": "..."}}</tool_call>
   - Tool responses will appear as: <tool_response>response content</tool_response>
4. After all paths, generate <Summary> to combine insights
5. Provide your final answer with the format: Final Answer: <answer>

Example:
<Parallel>
<Path>Approach 1: use Python<tool_call>{{"name": "python", "arguments": "123+456"}}</tool_call><tool_response>579</tool_response>Result: 579</Path>
<Path>Approach 2: direct calculation: 123+456=579</Path>
</Parallel>
<Summary>Both methods confirm the answer</Summary>
Final Answer: 579

Remind:
1. You can only use tools once in each path.
2. You can only use one time of parallel reasoning.
3. You can only use one time of tool call in each path.
4. You are encouraged to use tools in each path to solve the problem.

Problem: {problem}
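  • SFT response format
    Because the tool calls happen inside the parallel paths, the response cannot be returned as a multi-turn conversation the way a conventional tool-calling format would. The data format is therefore modified slightly: the response is a single turn, and <tool_call></tool_call> and <tool_response></tool_response> wrap each tool call and its result.
    An example response: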
Step 1: Find out how many seashells Mimi picked up.
Mimi picked up 2 dozen seashells. 1 dozen = 12, so 2 dozen = 24.
Step 2: Find out how many seashells Kyle found.
Kyle found twice as many shells as Mimi.<Parallel><Path>Approach 1: Use Python to calculate Kyle's seashells.<tool_call>{"name": "python", "arguments": "2 * 24"}</tool_call><tool_response>48</tool_response>Kyle found 48 seashells.</Path><Path>Approach 2: Calculate directly.
Mimi: 24 shells.
Kyle: 2 × 24 = 48 shells.</Path>
</Parallel>
<Summary>Both approaches confirm that Kyle found 48 seashells.</Summary>Step 3: Leigh grabbed one-third of the shells that Kyle found.<Parallel><Path>Approach 1: Use Python to calculate Leigh's seashells.<tool_call>{"name": "python", "arguments": "48 / 3"}</tool_call><tool_response>16.0</tool_response>Leigh had 16 seashells.</Path>
<Path>Approach 2: Calculate directly.
Kyle: 48 shells.
Leigh: 48 ÷ 3 = 16 shells.
</Path>
</Parallel>
<Summary>Both approaches confirm that Leigh had 16 seashells.</Summary>Final Answer: 16
  • Dataset
    Following Parallel-R1, GSM8K is used as the cold-start SFT dataset. The main goal is to lock in the format of the model's output.
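
As a minimal sketch of how one training record could be assembled under this setup: the doubled braces in the prompt suggest it is rendered as a Python `str.format` template, the system prompt is omitted, and the whole response is stored as a single assistant turn. The `messages` layout, the abbreviated `PROMPT_TEMPLATE`, and the name `build_sft_record` are assumptions made for illustration, not the author's actual data schema.

```python
# Abbreviated stand-in for the full user prompt (assumption: the template is
# rendered with str.format, so literal JSON braces are doubled).
PROMPT_TEMPLATE = (
    "Solve the following problem step by step.\n"
    'Call tools with: <tool_call>{{"name": "python", "arguments": "..."}}</tool_call>\n'
    "Problem: {problem}"
)


def build_sft_record(problem: str, response: str) -> dict:
    """Return a single-turn record: one user message, one assistant message."""
    return {
        "messages": [
            # No system message, mirroring the empty system prompt.
            {"role": "user", "content": PROMPT_TEMPLATE.format(problem=problem)},
            {"role": "assistant", "content": response},
        ]
    }


# Illustrative inputs only; real records would use GSM8K problems.
record = build_sft_record(
    "What is 123 + 456?",
    "<Parallel>\n<Path>Approach 1: direct calculation: 123+456=579</Path>\n"
    "</Parallel>\n<Summary>The sum is 579.</Summary>\nFinal Answer: 579",
)
print(record["messages"][0]["content"])
```

After formatting, the doubled braces collapse to single braces, so the model sees valid JSON in the tool-call example.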

3. How is the SFT data generated?

3.1 Attempt 1: build a multi-agent system that calls a real Python interface and generates data by imitating the attention mask in Parallel-R1.

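The approach is as follows:

  1. When <Parallel> is generated, stop and switch to parallel generation.
  2. Each path keeps its own independent message queue, imitating the attention mask in Parallel-R1.
  3. Stop when </tool_call> or </Path> is generated. On </tool_call>, parse the tool call, execute it in a Python sandbox, and continue reasoning. On </Path>, the current path is finished.
  4. Once all paths have finished, insert the <Summary> tag at the end and generate the summary.
  5. When </Summary> is generated, stop and go back to step 1 to continue generating.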
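
Step 3 above can be sketched as follows. This is a minimal illustration, not the author's implementation: `run_python_tool` is a hypothetical stand-in for the Python sandbox that only evaluates plain arithmetic expressions through the `ast` module, and `handle_stop` assumes the generated text still ends with the closing `</tool_call>` tag when a tool call was requested.

```python
import ast
import json
import operator
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

# Operators allowed by the toy evaluator (a stand-in for a real sandbox).
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval_node(node):
    """Recursively evaluate numbers, + - * / and unary minus; reject the rest."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval_node(node.operand)
    raise ValueError("unsupported expression")


def run_python_tool(code: str) -> str:
    """Hypothetical sandbox: evaluate one arithmetic expression, return text."""
    return str(_eval_node(ast.parse(code, mode="eval").body))


def handle_stop(path_text: str):
    """Return (updated_text, path_finished) after generation stopped."""
    if path_text.rstrip().endswith("</tool_call>"):
        # Parse the most recent tool call and append its result.
        call = json.loads(TOOL_CALL_RE.findall(path_text)[-1])
        result = run_python_tool(call["arguments"])
        return path_text + f"<tool_response>{result}</tool_response>", False
    return path_text, True  # no pending tool call: this path is done


text = '<Path>Approach 1: use Python<tool_call>{"name": "python", "arguments": "48 / 3"}</tool_call>'
print(handle_stop(text))
```

A real pipeline would replace `run_python_tool` with an isolated interpreter and a timeout; the arithmetic-only evaluator is used here so the example stays safe to run.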

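The main problems are:

  1. With the OpenAI API, unlike in Parallel-R1, when generation halts on a stop token, that stop token is not kept as the last token of the output.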
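from openai import OpenAI

client = OpenAI()
end_token = "<Parallel>"  # assumed; undefined in the original snippet

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "生成包含'<Parallel>'的文本"}],
    max_tokens=50,
    stop=[end_token],
)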

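For example, after running the code above, the model outputs only 生成包含 ("Generate text containing"), with the stop token itself missing. In a multi-agent system this makes it hard to tell whether generation stopped because the model finished its answer or because it hit a specific token.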

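  2. Using agents to simulate the model's behavior often leads to tag confusion. For example, the model may generate a parallel tag inside the summary, so the parallel structure unfolds into a tree, which makes later RL infeasible. Tuning the prompt of every agent separately would take too much engineering effort.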
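
The tag confusion described above can be detected mechanically. The sketch below is one possible check, not part of the original pipeline: `find_structure_errors` is a hypothetical helper that scans the three structural tags with a stack and reports nesting that would make the output tree-shaped, such as a `<Parallel>` opened inside a `<Summary>` or a `<Path>`.

```python
import re

TAG_RE = re.compile(r"<(/?)(Parallel|Path|Summary)>")

# Which parent each tag may appear under (None means top level).
ALLOWED_PARENT = {"Parallel": None, "Path": "Parallel", "Summary": None}


def find_structure_errors(text: str) -> list:
    """Return human-readable problems; an empty list means flat and balanced."""
    errors, stack = [], []
    for match in TAG_RE.finditer(text):
        closing, name = match.group(1) == "/", match.group(2)
        if not closing:
            parent = stack[-1] if stack else None
            if parent != ALLOWED_PARENT[name]:
                errors.append(f"<{name}> opened inside {parent or 'top level'}")
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            errors.append(f"unexpected </{name}>")
    # Anything still on the stack was opened but never closed.
    errors.extend(f"<{name}> never closed" for name in stack)
    return errors


good = "<Parallel><Path>a</Path><Path>b</Path></Parallel><Summary>s</Summary>"
bad = "<Parallel><Path>a</Path></Parallel><Summary>s<Parallel><Path>b</Path></Parallel></Summary>"
print(find_structure_errors(good))
print(find_structure_errors(bad))
```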

3.2 Attempt 2: generate in a single pass with an LLM, then filter

The prompt used for LLM generation is:

Solve the following problem step by step.
During the reasoning process, whenever you encounter a step that may benefit from multiple perspectives or independent reasoning, insert a parallel block at that point.
follow this format:
1. Start with <Parallel> to begin parallel reasoning
2. Generate multiple <Path> blocks, each with a different approach
3. Within each <Path>, you can use Python tools:
   • Call tools with: <tool_call>{{"name": "python", "arguments": "<code>"}}</tool_call>
   • Tool responses will appear as: <tool_response>response content</tool_response>
4. After all paths, generate <Summary> to combine insights
5. Provide your final answer with the format: Final Answer: <answer>

Example:
<Parallel>
<Path>Approach 1: use Python<tool_call>{{"name": "python", "arguments": "123+456"}}</tool_call><tool_response>579</tool_response>Result: 579</Path>
<Path>Approach 2: direct calculation: 123+456=579</Path>
</Parallel>
<Summary>Both methods confirm the answer</Summary>
Final Answer: 579

Remind:
1. You can only use tools once in each path.
2. You can only use one time of parallel reasoning.
3. You can only use one time of tool call in each path.
4. You are encouraged to use tools in each path to solve the problem.
5. You can mock the Python tools response

Problem: {PROBLEM}

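The model generates the whole response in a single pass. The intermediate tool-call results are mocked directly, and no real Python environment is called.
In practice, gpt-4.1 follows the instructions reasonably well. The example response shown in section 2, for instance, was produced by a single gpt-4.1 call with this prompt. This approach is adopted for now.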

  • Data filtering: to be done tomorrow
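
The filtering step is not described yet. Purely as an illustration of what such a filter could check, the sketch below keeps a response only if its `Final Answer:` line agrees with the GSM8K reference, whose answer field ends with `#### <number>`. The function `keep_sample`, the numeric comparison, and the requirement that the final answer is the last line are all assumptions, not the author's filtering rules.

```python
import re

# The final answer must be the last non-empty line of the response.
FINAL_RE = re.compile(r"Final Answer:\s*(.+?)\s*$")
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def _to_number(text: str):
    """Parse the first number in text, ignoring thousands separators."""
    match = NUMBER_RE.search(text.replace(",", ""))
    return float(match.group()) if match else None


def keep_sample(response: str, gsm8k_answer: str) -> bool:
    """Keep a response only if its final answer matches the reference."""
    final = FINAL_RE.search(response.strip())
    if final is None:
        return False  # the required "Final Answer:" line is missing
    reference = gsm8k_answer.split("####")[-1]
    predicted = _to_number(final.group(1))
    return predicted is not None and predicted == _to_number(reference)


print(keep_sample("<Summary>ok</Summary>Final Answer: 16", "48 / 3 = 16\n#### 16"))
```

Because the tool responses are mocked in this approach, a stricter filter could also re-execute each tool call and compare it with the mocked `<tool_response>`; that is left out here.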

4. How is SFT training done?

5. How well does SFT work?
