Yilong Zhu

Hello, I'm Yilong Zhu

I received my PhD from the Hong Kong University of Science and Technology (HKUST), where I was a member of the Aerial Robotics Group under the supervision of Prof. Shaojie Shen. My research centers on navigation systems for autonomous platforms, with an emphasis on robust localization, mapping, and cross-view perception.

Currently, I am a VLA Research Intern at China Merchants Lion Rock AI Lab, working on Vision-Language-Action models for robotic systems.

Prior to pursuing my PhD, I served as Algorithm Leader in the Mapping and Localization Group at Unity Drive Innovation, where I led the development of multi-sensor localization systems integrating LiDAR, UWB, and inertial measurements.

My work has been published in premier robotics journals including T-RO and IJRR. I hold multiple patents in localization technologies and serve as a reviewer for ICRA, IROS, RA-L, T-RO, and IJRR.

Email · GitHub · Google Scholar

News

[2025/12] Successfully defended my PhD thesis: "Robot Navigation from Explicit Geometry to Implicit Models". Grateful to my committee: Prof. Kun Xu (Chair), Prof. Shaojie Shen (Supervisor), Prof. Wei Zhang, Prof. Zili Meng, Prof. Yang Gao, and external examiner Prof. Dimitrios Kanoulas (UCL, UK).
PhD Thesis Defense Photo
[2025/12] Released my iOS app RTK Helper, a professional high-precision positioning tool on the App Store.
[2025/8] Attended Meta Research Summit in Seattle. Honored to meet Prof. Shuran Song, Prof. Danfei Xu, Prof. Xiaolong Wang, and many other outstanding researchers. Special thanks to my friends Jianhao Jiao and Yifu Tao for their support. Hoping Aria accelerates the arrival of AGI!
Meta Research Summit in Seattle
[2025/6] Invited to serve as Session Co-Chair for "Autonomous Vehicles 3" at IROS 2025. Looking forward to facilitating discussions in Hangzhou, China!
[2025/6] Paper accepted to IROS 2025 on Visual Localization using Novel Satellite Imagery. See you in Hangzhou, China!
[2025/2] Paper accepted to IEEE Transactions on Instrumentation and Measurement on Global Optimal Solutions to Scaled Quadratic Pose Estimation Problems.
[2025/2] Paper accepted to IEEE Transactions on Instrumentation and Measurement on Globally Optimal Estimation of Accelerometer-Magnetometer Misalignment.
[2024/12] Paper accepted to IEEE Robotics and Automation Letters on Efficient Camera Exposure Control for Visual Odometry via Deep Reinforcement Learning.

Research Interests

I am actively seeking industry positions or postdoctoral opportunities starting Spring 2026. Please reach out!

My research bridges classical robotics and learning-based methods, focusing on robust, generalizable navigation and manipulation systems. I am interested in augmenting physically grounded models with high-level semantic reasoning and generative priors, as well as improving the spatial generalization of Vision-Language-Action (VLA) models for embodied AI.

Egocentric Perception & Physical Intuition

Do embodied AI models possess physical intuition from egocentric experience? We build a probing benchmark that captures fine-grained hand-object interactions across deformable, rigid, and fluid-filled objects, exposing where current vision-language models succeed and fail at predicting forces, compliance, and tactile outcomes from first-person video alone. [Project Page]

Spatial Generalization for Vision-Language-Action Models

Enhancing VLA spatial generalization via hybrid dynamic data collection. By mixing multiple fixed and moving camera viewpoints, we break shortcut learning (the coupling between camera pose and object position) and achieve robust manipulation under unseen camera poses and object configurations. The strategy generalizes across VLA architectures, including ACT, Diffusion Policy, Pi0, and GR00T. [Project Page]

Simultaneous Localization and Mapping (SLAM)

Developing tightly-coupled optimization frameworks using LiDAR, IMU, and UWB to achieve drift-free, real-time state estimation.

SLAM Visualization
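As an illustrative sketch of the range-based component of such a fusion pipeline (not the actual system), the snippet below recovers a 2D position from UWB anchor ranges with a small Gauss-Newton solver; the anchor layout and measurements are made-up values, and a full pipeline would additionally fold in LiDAR and IMU factors.

```python
import math

# Hypothetical 2D UWB anchor layout and simulated ranges (illustrative only).
anchors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
true_pos = (3.0, 4.0)
ranges = [math.dist(true_pos, a) for a in anchors]

def gauss_newton(x, y, iters=10):
    """Minimize the sum of squared range residuals r_i = ||p - a_i|| - d_i."""
    for _ in range(iters):
        # Accumulate the 2x2 normal equations J^T J dp = -J^T r.
        a00 = a01 = a11 = 0.0
        b0 = b1 = 0.0
        for (ax, ay), d in zip(anchors, ranges):
            rho = math.hypot(x - ax, y - ay)
            r = rho - d
            jx, jy = (x - ax) / rho, (y - ay) / rho  # row of the Jacobian
            a00 += jx * jx; a01 += jx * jy; a11 += jy * jy
            b0 += jx * r; b1 += jy * r
        det = a00 * a11 - a01 * a01
        # Solve the 2x2 system by Cramer's rule.
        dx = (-b0 * a11 + b1 * a01) / det
        dy = (-b1 * a00 + b0 * a01) / det
        x, y = x + dx, y + dy
    return x, y

print(gauss_newton(1.0, 1.0))  # converges to (3.0, 4.0)
```

In a tightly coupled system the same nonlinear least-squares machinery runs over a sliding window of poses, with IMU preintegration and LiDAR scan-matching residuals stacked alongside the range terms.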

Bird's Eye View (BEV) Localization

Exploring BEV-based geometric and semantic alignment between ego-view and satellite-view images to enable cross-view localization.

BEV Visualization

Robust LiDAR-Inertial Localization (RLIL)

A system integrating motion distortion correction, IMU bias estimation, and Kd-tree-accelerated scan matching. RLIL achieves centimeter-level accuracy in GNSS-denied environments.

RLIL Overview

Dynamic-Aware Localization

Constructing static TSDF maps with dynamic object removal and using scan-to-map deviations to filter dynamic points in real-time.

Dynamic Localization
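The scan-to-map deviation test can be sketched as follows; the point sets, 2D geometry, threshold value, and function name `filter_dynamic` are illustrative assumptions, not the actual implementation.

```python
import math

# Hypothetical static map (2D points for brevity) and an incoming scan;
# the last scan point plays the role of a moving object.
static_map = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
scan = [(0.1, 0.05), (1.9, -0.02), (1.5, 2.0)]

def filter_dynamic(scan, static_map, thresh=0.25):
    """Split a scan by scan-to-map deviation: points whose nearest
    static-map point lies farther than `thresh` are treated as dynamic."""
    static_pts, dynamic_pts = [], []
    for p in scan:
        deviation = min(math.dist(p, q) for q in static_map)
        (static_pts if deviation < thresh else dynamic_pts).append(p)
    return static_pts, dynamic_pts
```

A real pipeline would query a TSDF or Kd-tree rather than brute-force nearest neighbors, but the accept/reject logic is the same.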

Generative Models for Cross-view Understanding

Applying diffusion models to generate semantically aligned BEV representations from monocular images.

Generative Models

DiffLoc: Semantic-Guided BEV Generation

Integrates inverse perspective mapping (IPM), Navier-Stokes inpainting, and CLIP-based semantic descriptions to synthesize BEV representations for matching against satellite imagery.

DiffLoc
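A minimal sketch of the IPM step, assuming a pinhole camera and a flat ground plane; the frame convention, parameter values, and function name `ipm_pixel_to_ground` are my own for illustration, not from the paper.

```python
import math

def ipm_pixel_to_ground(u, v, f, cx, cy, cam_height, pitch):
    """Inverse perspective mapping under a flat-ground assumption:
    cast a pixel ray and intersect it with the ground plane.
    Frame convention (illustrative): x right, y down, z forward; the camera
    sits cam_height metres above the ground, pitched down by `pitch` rad."""
    # Pixel ray in the camera frame (pinhole model).
    dx, dy, dz = (u - cx) / f, (v - cy) / f, 1.0
    # Rotate the ray about the x-axis by the camera pitch.
    c, s = math.cos(pitch), math.sin(pitch)
    ry = c * dy + s * dz
    rz = -s * dy + c * dz
    if ry <= 0:
        return None  # ray points at or above the horizon: no ground hit
    t = cam_height / ry  # scale at which the ray meets the ground plane
    return dx * t, rz * t  # lateral offset X, forward distance Z
```

For example, with focal length 500 px, principal point (320, 240), a camera 1.5 m above the ground, and zero pitch, the pixel (320, 490) maps to a ground point 3 m straight ahead. Remapping every below-horizon pixel this way yields the raw BEV image that the inpainting and diffusion stages then refine.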

Vision-Language Navigation (VLN) & Long-Horizon Planning

Integrating fast reactive modules with slow semantic planners that leverage satellite imagery and VLMs, enabling long-horizon, cross-modal Vision-Language Navigation and early-warning instructions.

Notes & Blog

Study notes on Vision-Language-Action (VLA) model architectures, hosted at vla.yilong-zhu.com.

VLA Model Architecture Notes (8 entries)
[VLA] DreamZero (WAM) — World Action Model for zero-shot task generalization. Combines video frame prediction with robot action sequence generation via a causal diffusion transformer (Causal WAN DiT), perception encoders, video VAE compression, and multi-robot support.
[VLA] Pi0 — Physical Intelligence's flagship VLA model. Uses PaliGemma (SigLIP + Gemma 2B) as the vision-language backbone with a flow-matching action expert for continuous control.
[VLA] Pi0.5 — Updated Pi0 variant with refined architecture for improved sample efficiency and generalization.
[VLA] Pi0-FAST — Discretizes continuous actions into tokens and uses PaliGemma for autoregressive next-token action prediction, trading expressiveness for inference speed.
[VLA] ACT — Transformer encoder-decoder policy with action chunking. A strong imitation learning baseline that predicts short horizons of actions per forward pass.
[VLA] SmolVLA — Lightweight VLA model built on SmolVLM2-500M-Video-Instruct, targeting deployable scale without sacrificing multi-modal grounding.
[VLA] WALL-OSS / WALL-X — Cross-embodied VLA using Qwen2.5-VL as the backbone with a mixture-of-experts action head for handling diverse robot morphologies.
[VLA] X-VLA — Diffusion-based VLA with Florence2 as the vision backbone. Generates action sequences through iterative denoising.