Papers - 2026-05-19 • Xingjian Wang

3D LLM

MMSkills: Towards Multimodal Skills for General Visual Agents

Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. We formalize this requirement as multimodal procedural knowledge and address three practical challenges: (I) what a multimodal skill package should contain; (II) where such packages can be derived from public interaction experience; and (III) how agents can consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots. We introduce MMSkills, a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making. Each MMSkill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. To construct these packages, we develop an agentic trajectory-to-skill Generator that transforms public non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. To use them, we introduce a branch-loaded multimodal skill agent: selected state cards and keyframes are inspected in a temporary branch, aligned with the live environment, and distilled into structured guidance for the main agent. Experiments across GUI and game-based visual-agent benchmarks show that MMSkills consistently improve both frontier and smaller multimodal agents, suggesting that external multimodal procedural knowledge complements model-internal priors.
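The package described above (a textual procedure coupled with state cards and multi-view keyframes) can be pictured as a small data structure. The following is only an illustrative sketch: the class names, field names, and the `select_evidence` helper are hypothetical and are not taken from the paper, which does not specify its schema in this abstract. The cap on returned keyframes is one simple way to reflect the stated goal of avoiding excessive image context.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StateCard:
    """Hypothetical state card: describes one recognizable runtime state."""
    state_name: str          # short label for the state (assumed field)
    visual_cues: list[str]   # what the agent should look for on screen
    next_action_hint: str    # guidance on what to do once this state is seen


@dataclass
class MMSkill:
    """Hypothetical package pairing a textual procedure with visual evidence."""
    name: str
    procedure: list[str]                                   # ordered textual steps
    state_cards: list[StateCard] = field(default_factory=list)
    # Maps a state name to image paths showing that state from several views.
    keyframes: dict[str, list[str]] = field(default_factory=dict)

    def select_evidence(self, state_name: str, max_keyframes: int = 2):
        """Return the matching state card (or None) and at most max_keyframes images."""
        card = next((c for c in self.state_cards if c.state_name == state_name), None)
        frames = self.keyframes.get(state_name, [])[:max_keyframes]
        return card, frames


# Usage with made-up example data.
skill = MMSkill(
    name="export_pdf",
    procedure=["Open the File menu", "Choose Export", "Confirm the dialog"],
    state_cards=[
        StateCard("export_dialog_open", ["dialog titled Export"], "Click Save"),
    ],
    keyframes={"export_dialog_open": ["view_a.png", "view_b.png", "view_c.png"]},
)
card, frames = skill.select_evidence("export_dialog_open")
```

In this sketch the agent would consult only the card and capped keyframes for the state it believes it is in, rather than loading every reference screenshot in the package.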

Embodied Agent

PhysBrain 1.0 Technical Report

ArXiv hallucinated translation

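This work proposes PhysBrain 1.0, which attempts to convert large-scale first-person human video into structured physical-commonsense supervision and then transfer it to robot learning. Methodologically, a data engine extracts scene elements, spatial dynamics, action execution, and depth-related relations, and organizes this information into question-answer supervision to train the PhysBrain VLM. A capability-preserving, language-sensitive adaptation design then transfers the physical priors to a VLA policy. Experiments cover multimodal question-answering and embodied-control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa. The method reaches state-of-the-art results overall, with particularly strong out-of-domain generalization on SimplerEnv.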

DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo

ArXiv hallucinated translation

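This work proposes DexJoCo, a benchmark and toolkit for task-oriented dexterous manipulation, designed to systematically evaluate multi-fingered robot hands in MuJoCo. The authors design 11 function-oriented tasks covering tool use, bimanual coordination, long-horizon execution, and reasoning, and build a low-cost data collection system with which they collect 1.1K trajectories. The benchmark also supports settings such as visual randomization and dynamics randomization to test policy robustness. Experiments evaluate modern models under several settings, including multi-task training and action-head adaptation. The results show that current policies still have clear limitations in dexterous manipulation and expose several key difficulties.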

Agent Training and Evaluation

Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution

ArXiv hallucinated translation

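This paper proposes Solvita, which aims to improve the sustained problem-solving ability of large language models in competitive programming. Methodologically, it organizes problem solving into a closed loop of four specialized, cooperating agents (Planner, Solver, Oracle, and Hacker) and equips each agent with a trainable graph-structured knowledge network that converts feedback such as pass/fail outcomes, test certification quality, and adversarial vulnerabilities into reinforcement-learning updates. Without updating the underlying model's parameters, the system distills past solving and debugging experience into transferable knowledge. Experiments on CodeContests, APPS, AetherCode, and real Codeforces rounds validate its effectiveness: it outperforms existing multi-agent pipelines overall and substantially exceeds single-pass inference baselines.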