The Moving Eye:
Enhancing VLA Spatial Generalization via
Hybrid Dynamic Data Collection

Jincheng Tang1,*, Yilong Zhu2,*, Zhengyuan Xie1, Jiang-Jiang Liu1, Jiaxing Zhang1
1China Merchants Lion Rock AI Lab   2The Hong Kong University of Science and Technology
*Equal contribution
Submitted to IROS 2026

Paper and arXiv links will be activated upon release.

The Moving Eye in action. A dual-arm rig where one arm manipulates while the other sweeps the environmental camera through the decoupling zone — the data recipe behind every result below.


Hierarchical Spatial Zoning of the camera workspace. Fixed is a single point; Multi-Fixed samples discrete poses inside a bounded region; Moving sweeps continuous trajectories through that same region — the decoupling zone that breaks spurious camera–robot–object correlations.
Our key insight: a small amount of low-cost moving data is enough to robustify a policy trained on multi-fixed data.

TL;DR

 Three Shortcuts

VLA policies don't fail randomly — they latch onto three implicit couplings: Camera–Base, Camera–Object, and Object–Position. More fixed viewpoints alone won't break them.

 Hybrid & Hierarchical

A real-world dual-arm rig collects Multi-Fixed + Moving data and explicitly varies object configurations. Mixing at Moving : Multi-Fixed = 1:3 (the "Golden Ratio" for Gr00t) lifts the success rate to 89.0%.

 Transfer & Universality

Low-cost auxiliary pen data transfers to a multi-object policy (43% → 83% on Moving Test at 2400 episodes). The recipe lifts ACT, Diffusion, Pi0, and Gr00t alike.

Abstract

Vision-Language-Action (VLA) models have shown remarkable promise in generalized robotic manipulation. However, their spatial generalization remains fragile — a slight perturbation in camera pose or object configuration often leads to catastrophic failure. We argue that simply increasing the number of viewpoints is insufficient. Models fall into the trap of Shortcut Learning, exploiting spurious camera–robot–object regularities rather than learning true spatial relationships.

We propose a data-centric solution. Using a dual-arm setup — one arm manipulates, the other carries a mobile environmental camera — we systematically evaluate three data distribution patterns: Fixed, Multi-Fixed, and Moving Views. A hybrid strategy that mixes continuous camera motion with diverse static viewpoints substantially reduces spurious correlations while maintaining training stability.

Our experiments show that this strategy enables VLAs to generalize to unseen camera poses and object configurations where adding more static viewpoints alone fails. Crucially, susceptibility to shortcut learning is universal across architectures: ACT, Diffusion Policy, and VLA models including Pi0 and Gr00t all benefit significantly from the mixed-data strategy.

The Three Shortcuts We Diagnose

We treat shortcut learning as the central failure mode of fixed-view VLA training, and categorize the spurious regularities the policy latches onto into three implicit couplings:

1. Camera ↔ Base

The policy memorizes how the robot looks against a static background, so any change in camera pose breaks recognition.

Diagnosed in Exp.1: 85% ID → 43% OOD.

2. Camera ↔ Object

Objects are recognized from a privileged angle. Novel viewpoints — common in VR/AR, ego-centric, handheld, or mobile manipulation — collapse performance.

Implicit baseline in Exp.1 / Exp.3 / Exp.4.

3. Object ↔ Position

Even with many camera viewpoints, the policy memorizes the fixed inter-object geometry (e.g., the pen relative to its holder). Shift the receptacle and it fails.

Diagnosed in Exp.2: 95% ID → 71.9% OOD.

Method — Hierarchical Data Decoupling

Our recipe combines a real-world dynamic-camera rig with a principled mixing strategy. The manipulation arm is a So-101 with a wrist camera; the environmental camera is carried by a separate Airbot arm, which can execute the full hierarchy — Fixed, Multi-Fixed, or Moving — on demand. Two ideas come together:

Pipeline. (1) Dual-arm collection. (2) Hybrid Mixer at Moving:Multi-Fixed = 1:k produces Dtrain. (3) VLA policy π(a | o, l) is trained. (4) Deployment under unseen camera poses and object configurations.

Hierarchical Viewpoint Sampling

We collect across all three configurations. Multi-Fixed provides convergence stability; Moving acts as a continuous-motion regularizer that breaks the camera–base and camera–object couplings. The two are mixed at Moving : Multi-Fixed = 1 : k, with k = 3 the empirical "Golden Ratio" for Gr00t.
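As a concrete illustration, the mixing step can be sketched in a few lines of Python. This is a minimal sketch, not our released tooling: the function name, the episode-list representation, and the subsampling choice are assumptions made only to show the 1 : k composition.

import random

def mix_datasets(moving_eps, multi_fixed_eps, k=3, seed=0):
    # Build one training pool at Moving : Multi-Fixed = 1 : k.
    # moving_eps / multi_fixed_eps are lists of episode records (e.g. file paths).
    rng = random.Random(seed)
    n_moving = min(len(moving_eps), len(multi_fixed_eps) // k)  # 1 part Moving per k parts Multi-Fixed
    mixed = list(multi_fixed_eps) + rng.sample(list(moving_eps), n_moving)
    rng.shuffle(mixed)
    return mixed

Exp.3 fixes k = 3 for Gr00t; Exp.4b suggests the best k is architecture-dependent (1:1 for Diffusion Policy, 1:11 for Pi0).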

Multi-dimensional Diversity Injection

Three orthogonal axes are randomized during collection: (i) Viewpoint — random poses / trajectories of the environmental camera; (ii) Object configuration — target–receptacle relative positions varied explicitly to break Object–Position coupling; (iii) Base–Camera decoupling — achieved by camera motion (the base is not moved).
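A sketch of how one collection episode could be randomized along these three axes is given below. The numeric bounds, the waypoint count, and the field names are placeholder assumptions, not measured values from our rig; the intent is only to make the three axes explicit.

import numpy as np

# Placeholder bounds (metres) for the decoupling zone and the receptacle shift.
CAM_LO, CAM_HI = np.array([0.2, -0.3, 0.3]), np.array([0.5, 0.3, 0.6])
HOLDER_LO, HOLDER_HI = np.array([-0.10, -0.10]), np.array([0.10, 0.10])

def sample_episode_config(rng, moving=False, n_waypoints=5):
    # Axis (i): viewpoint — one static pose (Multi-Fixed) or a waypoint sequence
    # the environmental-camera arm sweeps through (Moving), inside the same region.
    n = n_waypoints if moving else 1
    camera_positions = rng.uniform(CAM_LO, CAM_HI, size=(n, 3))
    # Axis (ii): object configuration — shift the receptacle relative to the target.
    holder_offset_xy = rng.uniform(HOLDER_LO, HOLDER_HI)
    # Axis (iii): base–camera decoupling comes from the camera motion above;
    # the robot base itself is never moved.
    return {"camera_positions": camera_positions, "holder_offset_xy": holder_offset_xy}

cfg = sample_episode_config(np.random.default_rng(0), moving=True)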

Dual-arm setup with axis frames. So-101 (manipulation + wrist camera) and Airbot (environmental camera).
Real-world realization of the three viewpoint configurations; the environmental camera moves within a bounded region.

Experiments

Two tasks — Pen Pick-and-Place (Exp.1–3, 4b) and Multi-Object Pick-and-Place (Exp.4a) — on a So-101 + Airbot platform. Success rates are reported over 400 and 100 evaluation episodes, respectively.
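For reference, evaluation amounts to repeated rollouts under a chosen camera condition. The environment interface below is purely hypothetical (reset/step signatures and the camera_mode flag are assumptions); it is shown only to make the ID-vs-OOD protocol explicit.

def success_rate(policy, env, n_episodes, camera_mode="fixed", max_steps=300):
    # camera_mode: "fixed" reproduces a training view (ID); "moving" sweeps the
    # environmental camera through unseen poses (OOD).
    successes = 0
    for ep in range(n_episodes):
        obs = env.reset(camera_mode=camera_mode, seed=ep)
        for _ in range(max_steps):
            action = policy.act(obs)              # pi(a | o, l); language goal set by env
            obs, done, success = env.step(action)
            if done:
                successes += int(success)
                break
    return successes / n_episodes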

Exp. 1

Camera ↔ Base Coupling

Train on a single Fixed View, then test on the same view (ID) vs. a moving camera (OOD). The baseline collapses from 85% → 43%. Our mixed-data policy holds at 86% (ID) / 83% (OOD) — closing the generalization gap.

Exp.1: shortcut learning under camera-base coupling
Exp.2: object-position coupling

Exp. 2

Object ↔ Position Coupling

A controlled probe: train on Multi-Fixed data with the pen-holder pinned to one location, then shift it by one diameter at test time. The Multi-Fixed baseline collapses 95% → 71.9%; our 1:3 mixture stays robust at 91.9% (ID) / 90.6% (OOD). Diverse camera views alone are not enough — object geometry must also be decoupled.

Exp. 3

The Golden Ratio of Composition

A counter-intuitive sweep. Pure Moving (1:0) gives only 54.8% — the variance hurts convergence. Pure Multi-Fixed (0:1) is a strong baseline at 80.5%. The mixture peaks at 1:3 with 89.0%. Multi-Fixed provides stability; Moving provides regularization — you need both.

Exp.3: success rate vs Moving:Multi-Fixed ratio
Exp.4a: cross-task transfer and sample efficiency

Exp. 4 (a)

Cross-Task Transfer & Sample Efficiency

We mix 50% Fixed-View multi-object data with 50% auxiliary pen data (collected at the Golden Ratio). The auxiliary data is cheap — one object, simple scene — yet at 2400 episodes it lifts Moving-Test success from 43% → 83%. The mechanism is skill decoupling: multi-object data teaches what to grasp; auxiliary moving-pen data teaches how to perceive.
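In the same hypothetical terms as the mixer sketch in the Method section (it reuses that mix_datasets helper; all names are illustrative), the Exp.4a training pool could be assembled as:

import random

def build_transfer_pool(multi_object_fixed, pen_moving, pen_multi_fixed, seed=0):
    # 50% target-task Fixed-View multi-object data + 50% auxiliary pen data,
    # where the pen data is itself composed at the Golden Ratio (Moving : Multi-Fixed = 1:3).
    rng = random.Random(seed)
    aux = mix_datasets(pen_moving, pen_multi_fixed, k=3, seed=seed)
    n = min(len(multi_object_fixed), len(aux))
    pool = rng.sample(list(multi_object_fixed), n) + rng.sample(aux, n)
    rng.shuffle(pool)
    return pool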

Exp. 4 (b)

Universality Across Architectures

The same recipe lifts every architecture we test on Pen Pick-and-Place: ACT +8.1, Diffusion +26.8, Pi0 +13.8, Gr00t +8.5. The optimal ratio shifts (Diffusion peaks at 1:1, Pi0 at 1:11) but the qualitative gain is universal — shortcut susceptibility is a property of the data distribution, not the model family.

Exp.4b: same data strategy improves ACT, Diffusion, Pi0, Gr00t

Headline Numbers

Setting | Baseline | Ours (Mixed) | Δ
Exp.1 — Moving-Test (OOD), Fixed train | 43.0 | 83.0 | +40.0
Exp.2 — Shifted-Holder (OOD) | 71.9 | 90.6 | +18.7
Exp.3 — Gr00t @ Golden Ratio (1:3) | 80.5 (0:1) | 89.0 | +8.5
Exp.4a — Multi-Object Moving-Test (2400 ep.) | 43.0 | 83.0 | +40.0
Exp.4b — Diffusion Policy (best mix) | 33.8 | 60.6 | +26.8

All numbers are success rates (%).

BibTeX

@inproceedings{tang2026movingeye,
  title     = {The Moving Eye: Enhancing VLA Spatial Generalization via
               Hybrid Dynamic Data Collection},
  author    = {Tang, Jincheng and Zhu, Yilong and Xie, Zhengyuan and
               Liu, Jiang-Jiang and Zhang, Jiaxing},
  booktitle = {IEEE/RSJ International Conference on Intelligent Robots
               and Systems (IROS)},
  year      = {2026},
  note      = {Under review}
}