Xingjian Wang
Papers - 2026-04-11

Grounding-driven Visual Reasoning#

Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

A prevailing narrative in LLM post-training holds that supervised finetuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades before recovering and improving with extended training (a dip-and-recovery pattern), so short-training checkpoints can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking) even from a toy arithmetic game, while weaker ones imitate surface verbosity. This generalization is asymmetric, however: reasoning improves while safety degrades, reframing the question from whether reasoning SFT generalizes to under what conditions and at what cost.

Thinking with Images#

OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

OpenVLThinkerV2 uses a new G²RPO reinforcement learning objective to give open-domain multimodal reasoning models a training signal with cross-task gradient fairness, combined with two task-level shaping mechanisms that balance fine-grained perception against multi-step reasoning. G²RPO pushes the advantage distribution of any task toward a standard normal distribution, mitigating reward-topology differences and long-tail effects while providing symmetric updates for positive and negative rewards; on top of this, response-length shaping encourages longer reasoning chains for complex queries, and entropy shaping constrains exploration to prevent entropy collapse or explosion. Building on the improved training stability, the model matches open-source and commercial frontier models across 18 multi-domain benchmarks. Experimental results show that OpenVLThinkerV2 significantly outperforms other strong baselines on visual tasks, validating the method's effectiveness at balancing perception and reasoning.
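The abstract does not spell out the objective, but the described effect of G²RPO (each task's advantage distribution pushed toward a standard normal, with symmetric positive/negative updates) is what per-task standardization of rewards would produce. A minimal sketch, assuming the normalization reduces to z-scoring within each task group; the function and grouping are illustrative, not the paper's implementation:

```python
import numpy as np

def g2rpo_style_advantages(rewards, task_ids, eps=1e-6):
    """Hypothetical sketch: standardize advantages within each task group
    so every task contributes roughly N(0, 1) advantages, regardless of
    its reward scale or long-tail shape."""
    rewards = np.asarray(rewards, dtype=np.float64)
    task_ids = np.asarray(task_ids)
    advantages = np.zeros_like(rewards)
    for task in np.unique(task_ids):
        mask = task_ids == task
        group = rewards[mask]
        # z-score within the task: symmetric updates for rewards above
        # and below the task mean, unit variance across tasks
        advantages[mask] = (group - group.mean()) / (group.std() + eps)
    return advantages

# Example: two tasks with very different reward scales yield gradients
# of comparable magnitude
adv = g2rpo_style_advantages(
    rewards=[0.0, 1.0, 1.0, 10.0, 90.0, 50.0],
    task_ids=["vqa", "vqa", "vqa", "count", "count", "count"],
)
```

Because each group is standardized, a task with rewards in [0, 100] contributes updates of the same magnitude as a binary-reward task, which is the cross-task gradient fairness the paper targets.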

Multimodal Agent#

MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

MolmoWeb releases the open dataset MolmoWebMix together with a corresponding executable open visual web agent, with the goal of making web-task agents open and reproducible. MolmoWebMix combines 100K synthetic trajectories, more than 30K human demonstrations, atomic web-skill trajectories, and GUI perception data, providing rich instruction-interface alignment signals for training. The MolmoWeb agent is an instruction-conditioned vision-language action policy that predicts the next browser action directly from the task instruction and a screenshot, with no HTML or API access. On benchmarks such as WebVoyager, Online-Mind2Web, and DeepShop, the 4B/8B models lead open-source models including Fara-7B, UI-Tars-1.5-7B, and Holo1-7B; MolmoWeb-8B even surpasses GPT-4o-driven SoM, and with parallel rollouts plus best-of-N it reaches 94.7% and 60.5% pass@4 on WebVoyager and Online-Mind2Web, respectively. The authors will also release the models, training data, code, and a unified evaluation toolkit to promote reproducibility.
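For context on the pass@4 numbers, a generic parallel-rollout best-of-N harness looks roughly like the sketch below; `run_episode` and `judge` are hypothetical stand-ins, not MolmoWeb's actual API:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def best_of_n(task: str,
              run_episode: Callable[[str, int], dict],
              judge: Callable[[dict], float],
              n: int = 4) -> dict:
    """Hypothetical best-of-N harness: launch n independent rollouts of
    the same web task in parallel (different seeds), score each
    trajectory with a judge, and keep the highest-scoring one. pass@n is
    then the fraction of tasks whose kept trajectory is judged successful."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        trajectories = list(pool.map(lambda seed: run_episode(task, seed), range(n)))
    return max(trajectories, key=judge)
```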

3D LLM#

SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

Large language model (LLM) agents such as OpenClaw rely on reusable skills to perform complex tasks, yet these skills remain largely static after deployment. As a result, similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across users, preventing the system from improving with experience. While interactions from different users provide complementary signals about when a skill works or fails, existing systems lack a mechanism to convert such heterogeneous experiences into reliable skill updates. To address these issues, we present SkillClaw, a framework for collective skill evolution in multi-user agent ecosystems, which treats cross-user and over-time interactions as the primary signal for improving skills. SkillClaw continuously aggregates trajectories generated during use and processes them with an autonomous evolver, which identifies recurring behavioral patterns and translates them into updates to the skill set by refining existing skills or extending them with new capabilities. The resulting skills are maintained in a shared repository and synchronized across users, allowing improvements discovered in one context to propagate system-wide while requiring no additional effort from users. By integrating multi-user experience into ongoing skill updates, SkillClaw enables cross-user knowledge transfer and cumulative capability improvement, and experiments on WildClawBench show that, with limited interaction and feedback, it significantly improves the performance of Qwen3-Max in real-world agent scenarios.
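A minimal sketch of the aggregate-evolve-sync loop as described, with all names (`Skill`, `evolver`, the update schema) hypothetical rather than taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    body: str      # reusable instructions / tool-usage pattern
    version: int = 1

def evolve_skills(repo: dict[str, Skill], trajectories: list[dict], evolver) -> dict[str, Skill]:
    """Hypothetical sketch of the described loop: pool trajectories from
    many users, let an LLM-based evolver mine recurring behavioral
    patterns, and turn them into skill refinements or new skills."""
    updates = evolver(skills=list(repo.values()), trajectories=trajectories)
    for upd in updates:
        if upd["name"] in repo:                      # refine an existing skill
            skill = repo[upd["name"]]
            skill.body, skill.version = upd["body"], skill.version + 1
        else:                                        # extend with a new capability
            repo[upd["name"]] = Skill(upd["name"], upd["body"])
    return repo                                      # synchronized to all users
```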

HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents

We introduce HY-Embodied-0.5, a family of foundation models specifically designed for real-world embodied agents. To bridge the gap between general Vision-Language Models (VLMs) and the demands of embodied agents, our models are developed to enhance the core capabilities required by embodied intelligence: spatial and temporal visual perception, alongside advanced embodied reasoning for prediction, interaction, and planning. The HY-Embodied-0.5 suite comprises two primary variants: an efficient model with 2B activated parameters designed for edge deployment, and a powerful model with 32B activated parameters targeted for complex reasoning. To support the fine-grained visual perception essential for embodied tasks, we adopt a Mixture-of-Transformers (MoT) architecture to enable modality-specific computing. By incorporating latent tokens, this design effectively enhances the perceptual representation of the models. To improve reasoning capabilities, we introduce an iterative, self-evolving post-training paradigm. Furthermore, we employ on-policy distillation to transfer the advanced capabilities of the large model to the smaller variant, thereby maximizing the performance potential of the compact model. Extensive evaluations across 22 benchmarks, spanning visual perception, spatial reasoning, and embodied understanding, demonstrate the effectiveness of our approach. Our MoT-2B model outperforms similarly sized state-of-the-art models on 16 benchmarks, while the 32B variant achieves performance comparable to frontier models such as Gemini 3.0 Pro. In downstream robot control experiments, we leverage our robust VLM foundation to train an effective Vision-Language-Action (VLA) model, achieving compelling results in real-world physical evaluations. Code and models are open-sourced at https://github.com/Tencent-Hunyuan/HY-Embodied.
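On-policy distillation is named but not detailed in the abstract; a common instantiation (not necessarily HY-Embodied-0.5's) minimizes a reverse KL between student and teacher distributions on tokens the student itself sampled. A sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: per-token reverse KL(student || teacher)
    computed on a sequence the *student* generated (on-policy), so the
    teacher corrects the student exactly where it actually goes. Both
    tensors are [batch, seq, vocab] logits for that same sequence; only
    student_logits should carry gradients."""
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # KL(student || teacher) = sum_v p_student * (log p_student - log p_teacher)
    return F.kl_div(teacher_logp, student_logp, log_target=True,
                    reduction="batchmean")
```

On the student's own samples, reverse KL is mode-seeking: it penalizes the student for putting probability where the teacher puts little, without forcing the compact model to cover every teacher mode.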

Agent Training and Evaluation#

ClawBench: Can AI Agents Complete Everyday Online Tasks?

ClawBench introduces an evaluation suite of 153 everyday online tasks spanning 144 real platforms, measuring how reliable agents are in realistic online workflows. Through a lightweight middle layer that intercepts final submission requests, the framework lets agents operate directly on production websites without causing real side effects, preserving and evaluating challenges such as multi-step workflows, extracting information from documents, and filling in complex forms. The evaluation covers 15 task categories including purchasing, booking, and job applications, emphasizes multimodal understanding and writing abilities that go beyond fixed rubrics, and systematically tests 7 frontier models. Results show that even Claude Sonnet 4.6 reaches a final completion rate of only 33.3%, indicating that current models remain limited on real interactive tasks. With its dynamic-webpage complexity and task diversity, ClawBench brings us closer to trustworthy general-purpose assistants.
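The interception idea can be pictured as a proxy predicate that forwards ordinary navigation but captures the final state-changing request; the sketch below is a guess at the mechanism, and the `awaiting_submission` heuristic is purely illustrative:

```python
# Hypothetical sketch of the described interception layer: ordinary
# browsing reaches the production site, but the final side-effecting
# submission is recorded for grading instead of being forwarded.
SUBMIT_METHODS = {"POST", "PUT", "DELETE", "PATCH"}

def handle_request(method: str, url: str, body: bytes, task_state: dict):
    if method in SUBMIT_METHODS and task_state.get("awaiting_submission"):
        # Capture the would-be side effect for evaluation; never forward it.
        task_state["captured_submission"] = {"method": method, "url": url, "body": body}
        return {"status": 200, "body": b"ok"}   # fake success back to the agent
    return None  # None => forward the request to the real site unchanged
```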

KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

KnowU-Bench designs an evaluation suite that runs in a reproducible Android emulator, covering 42 general GUI tasks, 86 personalized tasks, and 64 proactive-intervention tasks, with attention to preference reasoning and proactivity. The benchmark hides the user profile and exposes only behavior logs, forcing the agent to interact over multiple turns with an LLM-driven user simulator in order to proactively ask about preferences, handle authorization, and show restraint after a refusal. The evaluation protocol combines rule-based verification with LLM scoring, assessing the agent along the full chain from GUI execution and authorization negotiation to staying silent after refusal. Experiments find that even frontier models such as Claude Sonnet 4.6 score below 50% on ambiguous instructions that require preference inference or intervention calibration, showing that the bottleneck is not navigation ability but eliciting preferences and intervening appropriately. The benchmark reveals a key gap between proficient interface operation and a trusted personalized assistant.
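The rule-plus-LLM protocol can be summarized as hard rules gating the verifiable behaviors and an LLM judge grading the soft ones; a minimal sketch with hypothetical `rules` and `llm_judge` callables:

```python
def score_episode(episode: dict, rules: list, llm_judge) -> float:
    """Hypothetical sketch of a combined protocol: hard rules gate the
    verifiable parts (GUI execution, authorization handling, silence
    after refusal), and an LLM judge grades the soft parts such as
    preference fit."""
    if not all(rule(episode) for rule in rules):   # any hard-rule violation fails
        return 0.0
    return llm_judge(episode)                      # e.g. a 0..1 preference-fit score
```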

Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

Act Wisely proposes the HDPO framework, which recasts tool efficiency from a scalar reward coupled with accuracy into a conditional optimization problem: an accuracy channel is preserved to maximize task correctness, while tool calls are constrained through conditional advantage estimation computed only over accurate trajectories. This design guides the agent to first master the task and then refine its self-reliance, resolving the paradox between tool overuse and ineffective tool penalties. Experimental results show that the resulting model, Metis, cuts tool-call volume by an order of magnitude while maintaining or even improving reasoning accuracy. The method alleviates the latency and noise caused by blind tool calls, moving toward more cognitively self-aware multimodal agents.
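One plausible reading of the conditional scheme (not the paper's exact estimator): normalize a correctness advantage over all rollouts, then estimate a tool-efficiency advantage only within the correct subset, so efficiency pressure never competes with learning to solve the task:

```python
import numpy as np

def conditional_advantages(correct, tool_calls, lam=0.1, eps=1e-6):
    """Hypothetical sketch: an accuracy channel rewards correctness for
    all trajectories, while a tool-efficiency term is estimated *only
    over correct trajectories*, so the agent is never pushed to skip
    tools before it can solve the task."""
    correct = np.asarray(correct, dtype=np.float64)       # 1.0 if correct
    tool_calls = np.asarray(tool_calls, dtype=np.float64)
    acc_adv = (correct - correct.mean()) / (correct.std() + eps)
    eff_adv = np.zeros_like(acc_adv)
    mask = correct.astype(bool)
    if mask.sum() > 1:                                    # need >1 correct rollout
        calls = tool_calls[mask]
        # fewer calls than the correct-group mean => positive efficiency advantage
        eff_adv[mask] = -(calls - calls.mean()) / (calls.std() + eps)
    return acc_adv + lam * eff_adv
```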

Multimodal World Model#

When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on the 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while maintaining temporal consistency. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.
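The "identify" step hinges on turning attention into a countable layout; one simple proxy, assuming a discriminative cross-attention map for the counted noun has already been selected, is connected-component counting (NUMINA's actual layout derivation is more involved):

```python
import numpy as np
from scipy import ndimage

def count_instances(attn_map: np.ndarray, thresh: float = 0.5) -> int:
    """Hypothetical sketch of the 'identify' step: binarize a selected
    cross-attention map for the counted noun and count connected
    components as a proxy for object instances in the latent layout."""
    binary = attn_map > thresh * attn_map.max()
    _, num_components = ndimage.label(binary)
    return num_components
```

Comparing this count against the numeral parsed from the prompt flags a prompt-layout inconsistency, after which the layout can be conservatively refined and used to modulate cross-attention during regeneration.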

MegaStyle: Constructing Diverse and Scalable Style Dataset via Consistent Text-to-Image Style Mapping

In this paper, we introduce MegaStyle, a novel and scalable data curation pipeline that constructs an intra-style-consistent, inter-style-diverse, high-quality style dataset. We achieve this by leveraging the consistent text-to-image style mapping capability of current large generative models, which can generate images in the same style from a given style description. Building on this foundation, we curate a diverse and balanced prompt gallery with 170K style prompts and 400K content prompts, and generate a large-scale style dataset, MegaStyle-1.4M, via content-style prompt combinations. With MegaStyle-1.4M, we propose style-supervised contrastive learning to fine-tune a style encoder, MegaStyle-Encoder, for extracting expressive, style-specific representations, and we also train a FLUX-based style transfer model, MegaStyle-FLUX. Extensive experiments demonstrate the importance of intra-style consistency, inter-style diversity, and high quality in a style dataset, as well as the effectiveness of the proposed MegaStyle-1.4M. Moreover, when trained on MegaStyle-1.4M, MegaStyle-Encoder and MegaStyle-FLUX provide reliable style similarity measurement and generalizable style transfer, making a significant contribution to the style transfer community. More results are available at our project website https://jeoyal.github.io/MegaStyle/.
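Style-supervised contrastive learning here presumably treats images generated from the same style prompt as positives; a SupCon-style sketch under that assumption (not MegaStyle's released training code):

```python
import torch
import torch.nn.functional as F

def style_supcon_loss(embeddings: torch.Tensor,
                      style_ids: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical sketch: images sharing a style id are positives, all
    other images in the batch are negatives, pulling intra-style
    embeddings together and pushing inter-style ones apart."""
    z = F.normalize(embeddings, dim=-1)
    sim = (z @ z.T) / temperature
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))       # drop self-pairs
    pos = (style_ids.unsqueeze(0) == style_ids.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # mean log-likelihood of each anchor's positives (SupCon "L_out" form)
    per_anchor = log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return -per_anchor.mean()
```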
