Xingjian Wang
Papers - 2026-06-19

Multimodal Agent

Native Active Perception as Reasoning for Omni-Modal Understanding

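This paper reframes long-video understanding as a POMDP-based active-perception agent process and proposes OmniAgent. The model acquires audio-visual cues on demand through an iterative Observation-Thought-Action loop and compresses what it finds into a persistent text memory, which decouples reasoning complexity from raw video length. Methodologically, the authors design Agentic Supervised Fine-Tuning, which uses best-of-N trajectory synthesis and two-stage quality control to cold-start the active-perception capability, and then apply Agentic Reinforcement Learning with the TAURA reward-assignment strategy to optimize the key discovery steps. Experiments cover ten benchmarks: OmniAgent reaches state of the art among open-source models, and its performance keeps improving as the number of reasoning turns grows, which validates test-time scaling for active perception.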
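To make the Observation-Thought-Action loop concrete, here is a minimal Python sketch. It is not OmniAgent's implementation: `run_agent`, `TextMemory`, the `perceive` tool, and the `policy` callable are hypothetical names introduced only for illustration, and the toy "video" is a lookup table. The point it shows is that the policy only ever reads short text notes, so the context it reasons over grows with the number of turns rather than with the video length.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class TextMemory:
    """Persistent text memory: stores short notes, never raw frames or audio."""
    notes: List[str] = field(default_factory=list)

    def add(self, note: str) -> None:
        self.notes.append(note)


def run_agent(
    question: str,
    perceive: Callable[[float, float], str],  # hypothetical tool: describe clip [start, end)
    policy: Callable[[str, List[str]], Tuple[str, object]],  # hypothetical: (action, argument)
    max_turns: int = 8,
) -> Tuple[str, TextMemory]:
    memory = TextMemory()
    for _ in range(max_turns):
        # Thought -> Action: the policy sees only the question and the text notes.
        action, arg = policy(question, memory.notes)
        if action == "answer":
            return str(arg), memory
        # Observation: fetch evidence on demand for the requested time span.
        start, end = arg
        observation = perceive(start, end)
        memory.add(f"[{start:.0f}-{end:.0f}s] {observation}")
    return "unknown", memory


# Toy stand-ins (assumptions, not real models): one event at second 20.
def toy_perceive(start: float, end: float) -> str:
    events = {20: "a dog barks"}
    return events.get(int(start), "nothing notable")


def toy_policy(question: str, notes: List[str]) -> Tuple[str, object]:
    for note in notes:
        if "dog barks" in note:
            return "answer", note
    next_start = 10.0 * len(notes)  # scan the next unseen 10-second window
    return "look", (next_start, next_start + 10.0)


answer, memory = run_agent("When does the dog bark?", toy_perceive, toy_policy)
print(answer)
```

Raising `max_turns` lets the toy agent scan further before it gives up, which is a crude analogue of the turn-count scaling the summary describes.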

3D/Space Reasoning

Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

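This work targets the difficulty that spatial vision-language models have with multi-step spatial reasoning and proposes SR-REAL, a dual-path reinforcement learning framework. It supports two reasoning paths: language-only reasoning (LOR), which performs step-by-step deduction in language, and detect-then-reason (DTR), which first extracts 3D geometric cues such as center points or boxes through region tokens and then performs explicit geometric inference. Methodologically, the authors first use cold-start supervised fine-tuning to build chain-of-thought supervision for both paths and to establish a region-to-3D interface, then optimize further with reinforcement learning driven by accuracy and format rewards. Experiments show that a single RL-trained model can serve both reasoning paths: DTR localizes more accurately on region-related tasks, while LOR improves general spatial reasoning. Overall, the method significantly outperforms existing spatial VLMs on multiple spatial benchmarks and transfers across datasets and domains.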
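The summary mentions reinforcement learning with accuracy and format rewards without giving their form. The sketch below shows one common way such a reward is assembled. Everything specific in it is an assumption for illustration: the `<think>`/`<answer>` tag template, exact-match scoring, and the `format_weight` value are not taken from the paper.

```python
import re

# Assumed response template: reasoning in <think>, final answer in <answer>.
_TEMPLATE = re.compile(r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL)


def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if _TEMPLATE.match(response.strip()) else 0.0


def accuracy_reward(response: str, gold: str) -> float:
    """1.0 if the extracted answer matches the gold answer (case-insensitive)."""
    match = _TEMPLATE.match(response.strip())
    if match is None:
        return 0.0
    predicted = match.group(2).strip().lower()
    return 1.0 if predicted == gold.strip().lower() else 0.0


def total_reward(response: str, gold: str, format_weight: float = 0.5) -> float:
    """Weighted sum of accuracy and format terms; the weight is an assumption."""
    return accuracy_reward(response, gold) + format_weight * format_reward(response)


good = "<think>The chair is to the left of the table.</think><answer>left</answer>"
wrong = "<think>The chair is behind the table.</think><answer>behind</answer>"
print(total_reward(good, "left"), total_reward(wrong, "left"), total_reward("left", "left"))
```

Note the design choice that an unformatted response earns nothing even when its text equals the gold answer, since the answer cannot be extracted reliably without the template.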

3D LLM

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.

Guava: An Effective and Universal Harness for Embodied Manipulation

Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tool use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what makes an effective harness for embodied manipulation, and to what extent such a harness can unlock embodied capabilities in a wide range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use developed through systematic exploration of the design space of agent workflows, action spaces, and observation spaces. Our study identifies three key ingredients for effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To understand whether these design principles are universal even to small models, we develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected entirely in simulation. Experimental results in both simulation and real-world environments show performance comparable to frontier proprietary models while exhibiting strong generalization to unseen objects, novel instructions, and long-horizon tasks. Results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.

Agent Training and Evaluation

From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning

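This paper proposes a framework in which the current policy model acts as an "environment engineer": an LLM automatically designs the next-stage reinforcement learning training environment from failure trajectories and context. The authors also build MAPF-FrozenLake as a controllable testbed for systematically studying the environment-configuration redesign problem. Methodologically, the framework combines structured summaries of policy behavior, failure cases, and environment statistics so that the model can generate new environment configurations, and it analyzes which kinds of context help most. Experiments show that the Qwen3-4B-based method achieves the strongest overall performance on the benchmark, surpassing larger closed-source models and fixed-environment training baselines. The results also show that successful environment updates need to make full use of failure evidence while preserving the configurations that already work.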

EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

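This work proposes EfficientRollout to accelerate rollout generation in reinforcement learning, with the goal of reducing inference latency without sacrificing final model quality. It uses a quantized self-speculative drafter derived from the target model, so the drafter always stays coupled to the evolving policy, and combines it with a system-aware policy for switching speculative decoding on and off and with acceptance-rate-based adaptation of the draft length. The authors stress that the RL rollout bottleneck lies not only in the serial nature of autoregressive sampling but also in the shift from compute-bound to bandwidth-bound operation once the batch shrinks late in a rollout, so whether to enable speculative decoding has to be decided dynamically from the runtime state. Experiments show that, compared with an accelerated AR rollout baseline, EfficientRollout reduces rollout latency by up to 19.6% and end-to-end latency by up to 12.7% while preserving final model quality.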
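The two runtime decisions described above, whether to speculate at all and how many tokens to draft, can be sketched as a small controller. This is an illustrative Python sketch, not EfficientRollout's actual policy: the class name `DraftController`, the batch-size threshold, and the acceptance-rate thresholds are all assumptions chosen for readability.

```python
from dataclasses import dataclass


@dataclass
class DraftController:
    draft_len: int = 4          # tokens proposed per speculation step
    min_len: int = 1
    max_len: int = 8
    raise_above: float = 0.8    # assumed: lengthen drafts when acceptance is high
    lower_below: float = 0.5    # assumed: shorten drafts when acceptance is low
    max_spec_batch: int = 16    # assumed: larger batches are treated as compute-bound

    def use_speculation(self, active_batch: int) -> bool:
        """Speculate only when the shrunken batch leaves the system bandwidth-bound."""
        return active_batch <= self.max_spec_batch

    def update(self, accepted: int, proposed: int) -> int:
        """Adapt the draft length from the last step's acceptance rate."""
        rate = accepted / proposed if proposed else 0.0
        if rate >= self.raise_above:
            self.draft_len = min(self.max_len, self.draft_len + 1)
        elif rate < self.lower_below:
            self.draft_len = max(self.min_len, self.draft_len - 1)
        return self.draft_len


controller = DraftController()
early = controller.use_speculation(active_batch=64)  # early rollout, large batch
late = controller.use_speculation(active_batch=4)    # late rollout, few sequences left
print(early, late, controller.update(accepted=4, proposed=4))
```

A real system would derive the switch from measured throughput rather than a fixed batch threshold; the fixed value here only keeps the sketch self-contained.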

Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems

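This paper studies "value diversity" in multicultural multi-agent systems and argues that it should be treated as a new system-level evaluation dimension, distinct from single-agent value alignment. Building on the World Values Survey, the authors construct system configurations spanning 19 cultures and 18 backbone models, and measure diversity by comparing how culture-conditioned agents differ in their answers to a shared values questionnaire. Experiments find that value diversity is almost uncorrelated with conventional alignment metrics, which indicates that the two capture complementary properties, and that current LLM multi-agent systems are significantly less value-diverse than human societies. Further analysis shows that mixing different backbones narrows the gap but cannot eliminate it, and that social interaction pushes agents to converge toward consensus, which in turn reduces the coverage of collective decision-making.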
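The summary says diversity is measured from differences between agents' answers to a shared questionnaire but does not give the formula. As a hedged illustration, the Python sketch below uses one simple choice: the mean pairwise absolute difference, normalized by the answer scale so the score lies in [0, 1]. The function name `value_diversity` and the metric itself are assumptions, not the paper's definition.

```python
from itertools import combinations
from typing import Dict, List


def value_diversity(answers: Dict[str, List[int]], scale_min: int, scale_max: int) -> float:
    """Mean normalized pairwise disagreement across agents (assumed metric).

    answers maps an agent name to its responses on the same ordered questions.
    Returns 0.0 when all agents agree and 1.0 when every pair sits at opposite
    ends of the scale on every question.
    """
    span = scale_max - scale_min
    pairs = list(combinations(answers.values(), 2))
    if not pairs or span == 0:
        return 0.0
    total = 0.0
    for a, b in pairs:
        total += sum(abs(x - y) for x, y in zip(a, b)) / (len(a) * span)
    return total / len(pairs)


# Toy responses on a 1-4 scale; the agent names are placeholders.
consensus = {"agent_a": [2, 3], "agent_b": [2, 3], "agent_c": [2, 3]}
polarized = {"agent_a": [1, 4], "agent_b": [4, 1]}
print(value_diversity(consensus, 1, 4), value_diversity(polarized, 1, 4))
```

Under this metric, agents that converge toward consensus after interacting would drive the score toward 0, which matches the direction of the effect the summary reports.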

Multimodal World Model

Kairos: A Native World Model Stack for Physical AI

World models are transitioning from passive visual generators to foundational, operational infrastructure for Physical AI: they must natively acquire world knowledge from heterogeneous experience, maintain persistent states over long horizons, and execute efficiently within real deployment constraints. We introduce Kairos, a native world model stack designed around these requirements. (1) Kairos learns the world by pioneering a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum, which organizes open-world videos, human behavioral data, and robot interactions into a progressive developmental pathway. (2) Kairos maintains the world by unifying world understanding, generation, and prediction within a Native Unified Architecture equipped with Hybrid Linear Temporal Attention, where sliding-window attention captures local dynamics, dilated sliding windows capture mid-range dependencies, and gated linear attention maintains persistent global memory. We establish formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation, mathematically guaranteeing state propagation across extended horizons. (3) Kairos runs the world by incorporating a Deployment-Aware System Co-Design to support low-latency rollout generation on server and consumer-grade hardware for real-world observation-action-feedback loops. Experiments on embodied world-model, long-horizon, and action-policy benchmarks show that Kairos achieves top-level performance while offering a strong efficiency-capability trade-off. Together, these results position Kairos as a cohesive operational foundation for future self-evolving physical intelligence.
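The three temporal components named in item (2) can be illustrated with toy code. The sketch below is generic and does not reproduce the Kairos architecture: it assumes causal attention, uses hypothetical function names, and reduces gated linear attention to a scalar recurrence. It shows which past positions each window type lets a query see, and how a gated running state carries information without a growing context.

```python
from typing import List, Sequence


def sliding_window_mask(n: int, window: int) -> List[List[bool]]:
    """Causal local mask: query i attends to keys i-window+1 .. i."""
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]


def dilated_window_mask(n: int, window: int, dilation: int) -> List[List[bool]]:
    """Causal dilated mask: query i attends to keys i, i-d, i-2d, ... (window keys)."""
    return [
        [i >= j and (i - j) % dilation == 0 and (i - j) // dilation < window
         for j in range(n)]
        for i in range(n)
    ]


def attended(mask: List[List[bool]], query: int) -> List[int]:
    """Key positions visible to one query position."""
    return [j for j, visible in enumerate(mask[query]) if visible]


def gated_linear_state(keys: Sequence[float], values: Sequence[float],
                       gates: Sequence[float]) -> float:
    """Scalar toy of a gated linear memory: s_t = g_t * s_{t-1} + k_t * v_t."""
    state = 0.0
    for k, v, g in zip(keys, values, gates):
        state = g * state + k * v
    return state


local = sliding_window_mask(8, window=3)
dilated = dilated_window_mask(8, window=3, dilation=3)
print(attended(local, 7), attended(dilated, 7))
```

With the same budget of three keys, the dilated window reaches six steps back while the local window reaches two, which is the sense in which the two cover local and mid-range dependencies; the recurrent state has constant size regardless of sequence length.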

The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL

Score- and flow-matching models often rely on preference-based reinforcement learning for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is intended to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure $\ell_2$ regression error on the velocity or score field under training-time marginals, a proxy poorly aligned with the visual and semantic properties that determine sample quality at inference. Given a reward aligned with these properties, RL sidesteps the mismatch by evaluating the model on its own samples and following the reward landscape directly. The challenge is to obtain such a reward without relying on human preferences, which are expensive and conflate data realism with annotator inclinations. We propose Discriminator-Guided RL (DRL). DRL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL. The pretrained space restricts the discriminator to perceptually meaningful directions, and the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution. Across SiT, JiT, REPA, and RAE, DRL reduces guidance-free FID (e.g., $9.38 \to 2.62$ on SiT) and semantic-space FD (e.g., $88.2 \to 19.3$ on DINOv3 for SiT), with consistent gains across all backbones, and improves human-preference rewards without training on them. It also yields a better Pareto frontier between preference reward and image fidelity under subsequent preference-based post-training, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.
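The claim that the discriminator logit is the optimal reward for targeting the data distribution follows from two standard results, sketched here. The notation is introduced for this note and is not taken from the paper: $p_0$ is the base-model distribution, $D^*$ the Bayes-optimal discriminator, and $\beta$ the KL coefficient. The sketch works in sample space and ignores the restriction to a pretrained representation space.

$$
D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_0(x)}
\quad\Longrightarrow\quad
r(x) = \log\frac{D^*(x)}{1 - D^*(x)} = \log\frac{p_{\text{data}}(x)}{p_0(x)}
$$

$$
\max_{p}\; \mathbb{E}_{x \sim p}\big[r(x)\big] - \beta\,\mathrm{KL}\big(p \,\|\, p_0\big)
\quad\Longrightarrow\quad
p^*(x) \propto p_0(x)\,\exp\!\big(r(x)/\beta\big)
$$

Substituting the logit reward with $\beta = 1$ gives $p^*(x) \propto p_0(x) \cdot p_{\text{data}}(x)/p_0(x) = p_{\text{data}}(x)$, so under these idealized assumptions the KL-regularized optimum is the data distribution itself. In practice the discriminator only approximates $D^*$, so the logit is an estimate of the log-likelihood ratio.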

MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model

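This paper targets video generation on social platforms and proposes MaineCoon, a real-time audio-visual autoregressive world model. The authors position it as the first prototype of a real-time audio-visual social world model for social-interaction scenarios, with a focus on long horizons, low latency, and stable interaction. Methodologically, the model uses training techniques including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced on-policy distillation, paired with an agentic streaming inference framework that supports long-duration generation. Experiments show that the model achieves real-time streaming generation at up to 47.5 FPS on a single GPU and supports interactive generation lasting thousands of seconds or longer while effectively mitigating drift.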

Papers - 2026-06-19
https://themaoqiu.github.io/blog/papers-2026-06-19
Author 猫柒-
Published at June 19, 2026