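06-30 · Latest LLM Papers at a Glance

📅 Published: 2026/7/2 15:21:13

Today's candidate pool holds 100 papers. After hard filtering and LLM scoring, 27 passed evaluation: 10 are featured as the Top-10 and the other 17 are listed in the quick scan.

Focus areas: multi-agent systems / LLM post-training (RL/SFT) / diffusion language models / inference acceleration / long context / quantitative trading

🌟 Featured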

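1. `MOPD` MOPD: Multi-Teacher On-Policy Distillation for Capability Integration in LLM Post-Training

Score 9.0 · Category cs.CL · Computation and Language · arXiv 2606.30406 · PDF

💡 MOPD uses on-policy distillation from multiple domain-specific RL teachers to distill several capabilities into a Qwen3-30B-A3B student.

LLM post-training · RL · capability distillation

Abstract: This paper addresses the difficulty of integrating multiple capabilities during LLM post-training and proposes Multi-teacher On-Policy Distillation (MOPD). The method first trains a dedicated RL teacher for each domain, then distills from those teachers on the student's own rollouts, which reduces exposure bias and provides a denser optimization signal. On Qwen3-30B-A3B, MOPD outperforms Mix-RL, Cascade RL, Off-Policy Finetune, and Param-Merge, inherits nearly all of each teacher's capability, and has already been used in the industrial-scale post-training of MiMo-V2-Flash.

Score breakdown: rel 9.5 / nov 8.0 / prac 9.0 / author 8.0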
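To make the MOPD idea of distilling on the student's own rollouts more concrete, here is a minimal sketch. It is not the paper's implementation. It assumes the per-token objective is a reverse KL between the student's and a domain teacher's next-token distributions, evaluated at positions the student itself sampled. The function names, the toy distributions, and the domain-to-teacher lookup are all hypothetical.

```python
import math

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at one token position (inputs are probability lists)."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )

def distill_loss(rollout, teachers):
    """Average per-token reverse KL over one student-sampled rollout.

    rollout:  dict with 'domain' and 'steps'; each step is the student's
              next-token distribution at a position the student visited.
    teachers: hypothetical mapping domain -> callable(step_index) that
              returns the teacher's distribution at that position.
    """
    teacher = teachers[rollout["domain"]]  # route to the teacher for this domain
    losses = [
        reverse_kl(student_probs, teacher(i))
        for i, student_probs in enumerate(rollout["steps"])
    ]
    return sum(losses) / len(losses)

# Toy example: 3-token vocabulary, two positions, a single "math" teacher.
teachers = {"math": lambda i: [0.7, 0.2, 0.1]}
rollout = {"domain": "math", "steps": [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]}
loss = distill_loss(rollout, teachers)
```

In this toy case the first position contributes zero loss because student and teacher agree, and all of the signal comes from the second position. That per-position feedback is one way to read the abstract's "denser optimization signal" compared with a single reward per rollout.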

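2. `AgentsA1` Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent

Score 8.9 · Category cs.CL · Computation and Language · arXiv 2606.30616 · PDF

💡 Agents-A1 trains a 35B MoE agent model using SFT on long trajectories of 45K tokens and multi-teacher, domain-routed distillation.

multi-agent · post-training · SFT · distillation · long-horizon agents

Abstract: This paper presents Agents-A1, a 35B MoE agentic model that approaches the performance of trillion-parameter models by scaling the agent horizon rather than the parameter count. The authors build a long-horizon knowledge-action infrastructure that generates trajectories averaging 45K tokens, and train in three stages: full-domain SFT, domain teacher training, and on-policy distillation in which multiple teachers are routed by domain. Agents-A1 leads or is competitive on long-horizon agent benchmarks such as SEAL-0, IFBench, and HiPhO, offering a practical path for extending the long-horizon capability of 35B agents.

Score breakdown: rel 9.5 / nov 8.5 / prac 8.5 / author 7.0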

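3. `TACO` TACO: Tool-Augmented Credit Optimization for Agentic Tool Use

Score 8.5 · Category cs.MA · Multiagent Systems · arXiv 2606.30251 · PDF

💡 TACO uses DAPR probe rewards and two GRPO advantage channels to assign self-supervised credit to code tool calls.

Agent · RL · GRPO · tool calling

Abstract: TACO targets multimodal agents that use code tools and addresses the difficulty of assigning precise credit when a tool call may be useful, redundant, or misleading. It adds two advantage signals to GRPO: DAPR inserts a probe token and compares the answer reward with and without the tool, so no external judge is needed; OGAR routes reward to the responsible segments according to the final outcome, which suppresses ineffective calls. The method improves the efficiency and reliability of tool use in fine-grained visual question answering.

Score breakdown: rel 9.2 / nov 8.0 / prac 8.0 / author 6.0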
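The TACO summary describes a probe signal that compares the answer reward with and without the tool, used alongside GRPO. The sketch below illustrates that comparison together with the standard group-normalized GRPO advantage. How the paper actually combines its two channels is not stated in this digest, so the additive combination and the weight `beta` are assumptions made purely for illustration.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward within its sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def probe_deltas(reward_with_tool, reward_without_tool):
    """Probe signal: change in answer reward when the tool output is present."""
    return [w - wo for w, wo in zip(reward_with_tool, reward_without_tool)]

def combined_advantages(reward_with_tool, reward_without_tool, beta=0.5):
    """Assumed combination: outcome advantage plus beta times the probe delta."""
    outcome = group_advantages(reward_with_tool)
    probe = probe_deltas(reward_with_tool, reward_without_tool)
    return [a + beta * d for a, d in zip(outcome, probe)]

# Four sampled rollouts for one question (hypothetical rewards):
# 0: tool turned a wrong answer into a right one
# 1: tool made no difference, answer wrong
# 2: tool made no difference, answer right (redundant call)
# 3: tool turned a right answer into a wrong one (misleading call)
with_tool = [1.0, 0.0, 1.0, 0.0]
without_tool = [0.0, 0.0, 1.0, 1.0]
adv = combined_advantages(with_tool, without_tool)
```

Under these assumptions the helpful call receives the largest advantage, the redundant but correct call a smaller one, and the misleading call the most negative one, which matches the useful / redundant / misleading distinction in the summary.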

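4. `MoD` Mixture of Debaters: Learn to Debate at Architectural Level in Multi-Agent Reasoning

Score 8.5 · Category cs.MA · Multiagent Systems · arXiv 2606.29425 · PDF

💡 MoD uses dual routing and momentum switching to fold multi-agent debate into a single MoE model, cutting inference overhead.

multi-agent · MoE · inference acceleration

Abstract: MoD addresses two problems of multi-agent debate frameworks: fixed architectures and the high overhead of replicating several models. It builds a self-debate mechanism into a single Mixture-of-Experts model. Its key components are dual routing, which dynamically allocates the debate and synthesis stages; momentum switching, which reduces jitter from token-level expert switching; and lightweight experts that represent the different debate roles. On multimodal benchmarks, MoD outperforms both single-model and conventional multi-agent methods, with 3.7× lower latency and 87% lower token consumption.

Score breakdown: rel 9.0 / nov 8.0 / prac 8.5 / author 7.0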
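The MoD summary says only that momentum switching reduces jitter from token-level expert switching. One plausible reading, sketched below, is an exponential moving average over router scores so that a small per-token fluctuation does not flip the chosen expert. This is an assumption for illustration, not the paper's mechanism, and `route_with_momentum` and `count_switches` are hypothetical names.

```python
def route_with_momentum(score_seq, momentum=0.8):
    """Pick one expert per token from exponentially smoothed router scores."""
    smoothed = None
    choices = []
    for scores in score_seq:
        if smoothed is None:
            smoothed = list(scores)  # first token: no history to smooth with
        else:
            smoothed = [
                momentum * prev + (1 - momentum) * cur
                for prev, cur in zip(smoothed, scores)
            ]
        # Index of the expert with the highest smoothed score.
        choices.append(max(range(len(smoothed)), key=smoothed.__getitem__))
    return choices

def count_switches(choices):
    """Number of positions where the selected expert changes."""
    return sum(1 for a, b in zip(choices, choices[1:]) if a != b)

# Router scores for two experts that flip-flop slightly from token to token.
scores = [[0.6, 0.4], [0.45, 0.55], [0.6, 0.4], [0.45, 0.55], [0.6, 0.4]]
greedy = route_with_momentum(scores, momentum=0.0)  # no smoothing
smooth = route_with_momentum(scores, momentum=0.8)
```

With no smoothing the selection alternates on every token (four switches); with `momentum=0.8` it stays on expert 0 for the whole sequence.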

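5. `FlashMorph` Morphing into Hybrid Attention Models

Score 8.2 · Category cs.CL · Computation and Language · arXiv 2606.30562 · PDF

💡 FlashMorph formulates layer selection for hybrid attention as budget-constrained subset optimization and uses learned gates to replace full-attention layers.

long context · attention mechanisms · inference acceleration · linear attention

Abstract: Hybrid attention models improve long-context efficiency by keeping a small number of full-attention layers and replacing the rest with linear attention, but the choice of layers often relies on heuristics. This paper formulates the choice as a budget-constrained subset optimization and proposes FlashMorph: add a linear-attention branch to every layer, freeze the weights, jointly learn the gates on synthetic long-context retrieval data, and use a regularizer that encourages linearization; then discretize under the budget, distill, and fine-tune. Experiments show that it finds better hybrid structures while preserving long-context recall and generalization.

Score breakdown: rel 8.5 / nov 8.0 / prac 8.0 / author 7.5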
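The "discretize under the budget" step in the FlashMorph summary can be pictured with a small sketch. It assumes each layer ends training with a scalar gate in [0, 1], where a higher value means the layer relies more on full attention, and that discretization keeps the top `budget` layers as full attention. The gate values and the top-k rule are illustrative assumptions; the paper's actual procedure may differ.

```python
def discretize_gates(gates, budget):
    """Keep the `budget` layers with the largest gate values as full attention."""
    if not 0 <= budget <= len(gates):
        raise ValueError("budget must be between 0 and the number of layers")
    # Layer indices ordered from highest to lowest gate value.
    ranked = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)
    keep = set(ranked[:budget])
    return ["full" if i in keep else "linear" for i in range(len(gates))]

# Hypothetical learned gates for a 6-layer model, with a budget of 2 full layers.
gates = [0.91, 0.12, 0.08, 0.77, 0.30, 0.05]
layout = discretize_gates(gates, budget=2)
```

Here layers 0 and 3 stay as full attention and the other four become linear attention. In the pipeline described by the summary, this discrete layout would then be distilled and fine-tuned.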

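6. `DOPD` DOPD: Dual On-policy Distillation

Score 8.4 · Category cs.AI · Artificial Intelligence · arXiv 2606.30626 · PDF

💡 DOPD routes token-level distillation between a privileged teacher and a privileged student according to advantage gap and relative probability.

post-training · knowledge distillation · on-policy

Abstract: On-policy distillation improves capability transfer by using student-sampled trajectories and token-level supervision, but introducing privileged information can cause a "privilege illusion": the student can only imitate surface behavior that stems from the information asymmetry and cannot actually reproduce the capability. DOPD proposes advantage-aware dual distillation, which dynamically assigns the supervision source, strength, and target for each token based on the advantage gap and relative probability between the privileged teacher and the privileged student. Experiments on LLMs and VLMs show that DOPD consistently outperforms vanilla OPD and other methods, and performs better on robustness, continual learning, and related measures.

Score breakdown: rel 9.0 / nov 8.0 / prac 8.0 / author 6.0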

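7. `WorldEvolver` Self-Evolving World Models for LLM Agent Planning

Score 8.0 · Category cs.AI · Artificial Intelligence · arXiv 2606.30639 · PDF

💡 WorldEvolver revises an LLM agent's world model at test time using episodic memory, semantic memory, and selective foresight.

LLM agents · world models · test-time adaptation

Abstract: WorldEvolver aims to improve the planning foresight of long-horizon LLM agents while preventing unreliable predictions from misleading decisions. With the downstream agent and all parameters frozen, the framework revises its own context at test time, combining Episodic Memory, Semantic Memory, and Selective Foresight, and improves the world model using real transitions, heuristic rules, and confidence filtering. In experiments on ALFWorld and ScienceWorld, it achieves the highest prediction accuracy across several backbones and raises the AgentBoard success rate, indicating that test-time memory revision can strengthen both prediction and planning.

Score breakdown: rel 8.5 / nov 7.5 / prac 8.0 / author 6.0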

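8. `COHORT` COHORT: Collaborative Orchestration for Hardening via Offensive Replay on Emulated Topologies

Score 8.0 · Category cs.MA · Multiagent Systems · arXiv 2606.30479 · PDF

💡 COHORT uses multi-role LLMs to generate device commands in GNS3 and validates mitigations with offensive replay.

multi-agent · LLM workflows · network security · emulation-based validation

Abstract: COHORT targets automated mitigation generation against observed attackers in enterprise networks, reducing reliance on experts and on trial and error in production. It uses a role-decomposed multi-agent LLM pipeline that proposes, deploys, and iterates on real device commands inside a high-fidelity GNS3 emulated topology running real vendor firmware. Offensive replay reproduces the original attack to compare behavior before and after mitigation, complemented by connectivity regression checks and cumulative evaluation. Across three topologies and four attacks, 46.7% of mitigations both block the attack and preserve connectivity, significantly better than the baselines.

Score breakdown: rel 8.5 / nov 7.5 / prac 8.5 / author 5.0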

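9. `MASLab` MAS-Lab: A Specification-Driven Validation Framework for Reliable Multi-Agent Systems

Score 7.9 · Category cs.MA · Multiagent Systems · arXiv 2606.30546 · PDF

💡 MAS-Lab uses a specification-driven framework to separate semantic intent, orchestration control, and reproducible experiments.

multi-agent · agent engineering · system validation

Abstract: MAS-Lab addresses a pain point in moving LLM multi-agent systems from demo prototypes to reliable production: current development often couples logic, orchestration, observation, and control, and lacks system-level validation. The framework takes a specification-driven approach to separating semantic intent from runtime mechanisms. It comprises a framework-agnostic declarative Spec, MAS-OS, which provides execution and control primitives, and Labs, which integrates observation and evaluation. Together they support reproducible experiments, explicit behavior control, and evolution across the full lifecycle.

Score breakdown: rel 8.5 / nov 7.5 / prac 8.0 / author 5.0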

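10. `PRP` Before Thinking, Learn to Decide: Proactive Routing for Efficient Visual Reasoning

Score 7.8 · Category cs.CL · Computation and Language · arXiv 2606.30217 · PDF

💡 PRP uses DRL and JRL to predict draft and target model capability, routing each query to the small or large model before visual reasoning begins.

inference acceleration · visual reasoning · model routing

Abstract: To address the high inference cost caused by long chain-of-thought in visual reasoning with large multimodal models, the paper proposes PRP, a proactive routing paradigm that decides before generation whether a query should go to the small draft model or the large target model. The method estimates the draft model's confidence with Draft Rating Learning and predicts the target model's competence with Joint Rating Learning, assigning samples at per-instance granularity and substantially speeding up multimodal inference while keeping performance as close as possible to the original.

Score breakdown: rel 8.0 / nov 7.5 / prac 8.0 / author 6.5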
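The routing decision PRP makes before generation can be sketched as a simple rule over two predicted scores. The sketch assumes the draft rating and target rating are already available as numbers in [0, 1]. The threshold, the fallback rule, and the relative costs are hypothetical values chosen for illustration, not figures from the paper.

```python
def route(draft_rating, target_rating, threshold=0.7):
    """Decide which model answers a query before any tokens are generated."""
    if draft_rating >= threshold:
        return "draft"  # the small model is predicted to be good enough
    # Assumed fallback: escalate only if the large model is predicted to do better.
    return "target" if target_rating > draft_rating else "draft"

def total_cost(decisions, draft_cost=1.0, target_cost=10.0):
    """Relative inference cost of a batch under hypothetical per-query costs."""
    return sum(draft_cost if d == "draft" else target_cost for d in decisions)

# (draft_rating, target_rating) for three queries, hypothetical values.
ratings = [(0.9, 0.95), (0.4, 0.85), (0.3, 0.2)]
decisions = [route(d, t) for d, t in ratings]
cost = total_cost(decisions)
```

Only the second query is escalated to the large model in this toy batch, giving a relative cost of 12.0 against 30.0 if every query went to the target model.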

📚 Quick scan · Other papers that passed evaluation (17)

One-line skims, ordered by score from high to low.

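cs.MA · 7.8 · ECHO: Learning Epistemically Adaptive Language Agents with Turn-Level Credit · 💡 ECHO models multi-turn information seeking as an EDP and applies turn-level policy gradient with a posterior-sensitive reward.
cs.MA · 7.7 · Experience Graphs: The Data Foundation for Self-Improving Agents · 💡 Trellis builds agent trajectories into an experience graph and reuses experience through queries, vector-graph retrieval, and materialized views.
cs.AI · 8.0 · Dynamo: Dynamic Skill-Tool Evolution for Vision-Language Agents · 💡 Dynamo lets a frozen VLM generate reasoning skills and a library of executable vision tools from successful and failed examples, improving visual reasoning accuracy.
cs.MA · 7.6 · Minority Sentinel: When to Overturn Majority Voting in Multi-Agent LLM Debates · 💡 Minority Sentinel extracts fingerprints from multi-agent debate logs and uses LightGBM to decide when to overturn the majority vote.
cs.MA · 7.6 · Persona-Trained Monte Carlo: Estimating Market-Outcome Distributions via Swarms of Persona-Conditioned Neural Policy Bots in a Limit Order Book · 💡 PTMC uses a swarm of persona-conditioned neural trading bots to sample the distribution of market outcomes in a limit order book.
cs.CL · 7.2 · Regime-Aware Peer Specialization for Robust RAG under Heterogeneous Knowledge Conflicts · 💡 RAPS-DA divides RAG knowledge conflicts into Grounding, Arbitration, and Resistance, and trains with routing across same-scale peer experts.
cs.MA · 7.1 · Budgeted Act-or-Defer Multi-Agent LLM Deliberation with Local Reliability Bounds · 💡 Uses a kNN local lower confidence bound to decide act-or-defer in multi-round LLM debate, while constraining the budget for wrong actions.
cs.CV · 7.5 · EcoVideo: Entropy-Orchestrated Video Generation Paradigm in Cloud-Edge Dynamics · 💡 EcoVideo selects keyframes by early self-attention entropy; a cloud-side DiT denoises them and the edge reconstructs the video by interpolation.
q-fin.TR · 7.6 · The Bounce Has No Direction: Sign, Magnitude, and the Microstructure of Equity Return Predictability · 💡 Uses a Fourier-Residue Identity to decompose SPY lagged autocorrelation into sign and magnitude channels.
cs.CL · 6.8 · DAIN: Dynamic Agent-Based Interaction Network for Efficient and Collaborative Multimodal Reasoning · 💡 DAIN uses a Meta-Controller to sparsely schedule specialized interaction agents and compresses communication for collaborative multimodal reasoning.
cs.AI · 6.5 · Clarus: Coordinating Autonomous Research Agents toward Web-Scale Scientific Collaboration · 💡 Clarus coordinates open-ended scientific collaboration workflows with a project-agent-resource object model and a four-layer architecture.
cs.CL · 6.4 · Efficient Retrieval-Augmented Generation via Token Co-occurrence Graphs · 💡 TIGRAG retrieves evidence for multi-hop question answering using a token co-occurrence graph, semantic expansion, and neural reranking.
cs.MA · 6.4 · Hybrid Retriever Evolution for Multimodal Document Reasoning Agents · 💡 Uses a failure-driven meta-agent to rewrite retrieval instructions so that a document QA agent progressively chooses among lexical, semantic, and multimodal retrievers.
cs.AI · 6.6 · Entity Binding Failures in Tool-Augmented Agents · 💡 Defines entity binding failure for tool-using agents and reduces wrong-entity operations with confidence gating, clarification, and provenance tracking.
cs.CV · 6.6 · Orca: The World is in Your Mind · 💡 Orca pretrains a unified world latent space on video and event annotations using Next-State-Prediction.
cs.CL · 6.0 · CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph · 💡 Organizes web corpora with an ontological corpus graph comprising a quality layer, a lightweight ontology layer, and a cross-domain alignment layer.
cs.AI · 6.3 · BayesEvolve: Explicit Belief States for Autonomous Scientific Discovery · 💡 BayesEvolve maintains an uncertainty-aware belief state for the discovery agent and guides black-box optimization experiments with an annealed uncertainty bonus.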

Data source: arxiv.org · Scores and summaries are generated automatically by an LLM and are intended for initial screening only
