About Me
Yilong Zhu (朱亦隆)
I am a Ph.D. candidate in the Aerial Robotics Group at the Hong Kong University of Science and Technology (HKUST), under the supervision of Prof. Shaojie Shen. My research centers on navigation systems for autonomous platforms, with emphasis on robust localization, mapping, and cross-view perception.
Currently, I am a VLA Research Intern at China Merchants Lion Rock AI Lab, working on Vision-Language-Action models for robotic systems.
Prior to pursuing my Ph.D., I served as Algorithm Leader in the Mapping and Localization Group at Unity Drive Innovation, where I led the development and deployment of multi-sensor localization systems integrating LiDAR, UWB, and inertial measurements for autonomous vehicles.
My work has been published in premier robotics journals, including IEEE Transactions on Robotics (T-RO) and The International Journal of Robotics Research (IJRR). I hold multiple patents in localization technologies and regularly serve as a reviewer for ICRA, IROS, and T-RO.
For collaboration opportunities or research discussions, please feel free to contact me or visit my GitHub.
News
[2025/6] Invited to serve as Session Co-Chair for "Autonomous Vehicles 3" at IROS 2025. Looking forward to facilitating discussions in Hangzhou, China!
[2025/6] Paper accepted to IROS 2025 on Visual Localization using Novel Satellite Imagery. See you in Hangzhou, China!
[2025/2] Paper accepted to IEEE Transactions on Instrumentation and Measurement on Global Optimal Solutions to Scaled Quadratic Pose Estimation Problems. Congratulations to Bohuan (HKUST Ph.D.)!
[2025/2] Paper accepted to IEEE Transactions on Instrumentation and Measurement on Globally Optimal Estimation of Accelerometer-Magnetometer Misalignment. Congratulations to Xiangcheng (HKUST Ph.D.)!
[2024/12] Paper accepted to IEEE Robotics and Automation Letters on Efficient Camera Exposure Control for Visual Odometry via Deep Reinforcement Learning. Congratulations to Shuyang (HKUST Ph.D.)!
Research Interests
My research bridges classical robotics and learning-based methods, focusing on the development of robust, generalizable navigation systems for autonomous vehicles and mobile robots. I am particularly interested in augmenting physically grounded models—such as optimization-based SLAM and multi-sensor fusion frameworks—with high-level semantic reasoning and generative priors derived from modern deep learning architectures.
Simultaneous Localization and Mapping (SLAM)
I develop tightly-coupled optimization frameworks using LiDAR, IMU, and UWB to achieve drift-free, real-time state estimation.
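As a toy illustration of what "tightly coupled" means here (not the actual system), the sketch below jointly minimizes UWB range residuals and an IMU-propagated prior over a single 2D pose, instead of fusing separately filtered outputs; the anchor layout, measurements, and noise weights are made-up placeholders.

```python
# Minimal sketch of tightly coupled fusion: one 2D pose (x, y, yaw) is
# estimated by jointly minimizing UWB range residuals and an IMU-propagated
# prior. All numbers are illustrative placeholders, not values from any
# deployed system.
import numpy as np
from scipy.optimize import least_squares

anchors = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])   # known UWB anchor positions (m)
ranges = np.array([5.2, 6.1, 7.4])                           # measured UWB ranges (m)
imu_prior = np.array([3.0, 4.0, 0.1])                        # pose predicted by IMU integration
sigma_uwb, sigma_prior = 0.1, 0.5                            # assumed noise levels

def residuals(pose):
    x, y, yaw = pose
    # UWB term: predicted anchor distances vs. measured ranges.
    pred = np.linalg.norm(anchors - np.array([x, y]), axis=1)
    r_uwb = (pred - ranges) / sigma_uwb
    # IMU term: deviation from the propagated prior; note that yaw is
    # constrained only by this prior, since ranges carry no heading.
    r_imu = (pose - imu_prior) / sigma_prior
    return np.concatenate([r_uwb, r_imu])

sol = least_squares(residuals, x0=imu_prior)   # solve the joint problem
print("fused pose:", sol.x)
```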
Bird's Eye View (BEV) Representation for Localization
I explore BEV-based geometric and semantic alignment between ego-view and satellite-view images to enable cross-view localization.
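As a minimal sketch of the geometric-alignment step, template matching can locate an ego-derived BEV patch inside a satellite tile. It assumes plain grayscale images stand in for learned BEV/satellite features and that the two views already share scale and orientation; the file names are illustrative.

```python
# Toy sketch of cross-view matching: slide an ego-derived BEV patch over a
# satellite tile and take the best normalized-correlation response as the
# position hypothesis. Real systems match learned semantic features; plain
# grayscale images stand in for those features here (illustrative only).
import cv2

satellite = cv2.imread("satellite_tile.png", cv2.IMREAD_GRAYSCALE)   # assumed local file
bev = cv2.imread("ego_bev.png", cv2.IMREAD_GRAYSCALE)                # assumed local file

response = cv2.matchTemplate(satellite, bev, cv2.TM_CCOEFF_NORMED)
_, score, _, top_left = cv2.minMaxLoc(response)

# Convert the best match to the BEV patch center in satellite pixel coordinates.
u = top_left[0] + bev.shape[1] // 2
v = top_left[1] + bev.shape[0] // 2
print(f"match score {score:.2f} at satellite pixel ({u}, {v})")
```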
Tightly-Coupled LiDAR-Inertial Localization for GNSS-Denied Environments
I develop localization frameworks that tightly fuse LiDAR and inertial measurements to achieve robust, accurate navigation in GNSS-denied or GNSS-challenged environments.
In particular, I proposed RLIL (Robust LiDAR-Inertial Localization), a system that integrates motion distortion correction, IMU bias estimation, and a k-d-tree-accelerated scan matcher initialized with IMU priors. The framework enhances robustness through a local map tracking module and inertial constraints, ensuring real-time performance across diverse scenarios, including campus environments, crowded urban streets, and featureless open spaces.
RLIL achieves centimeter-level accuracy and significantly outperforms existing open-source baselines in both structured and dynamic environments.
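A heavily simplified sketch of the matcher's core idea follows: the IMU-predicted pose initializes the alignment, a k-d tree accelerates correspondence search in the local map, and one SVD (Kabsch) step refines the pose. This is a generic point-to-point ICP step for illustration, not RLIL's actual implementation.

```python
# Illustrative core of an IMU-initialized, k-d-tree-accelerated scan matcher:
# the scan is first transformed with the IMU-predicted pose, correspondences
# are found in the local map via a k-d tree, and one SVD (Kabsch) step refines
# the pose. The real pipeline adds distortion correction, bias estimation,
# and local-map tracking on top of this idea; this is only a sketch.
import numpy as np
from scipy.spatial import cKDTree

def icp_step(scan, local_map, R_init, t_init):
    """One point-to-point alignment step, initialized with the IMU prior."""
    tree = cKDTree(local_map)                     # accelerate nearest-neighbor lookup
    scan_pred = scan @ R_init.T + t_init          # predict scan points in the map frame
    _, idx = tree.query(scan_pred)                # closest map point per scan point
    targets = local_map[idx]

    # Kabsch/SVD solution for the rigid transform between matched point sets.
    src_c, dst_c = scan_pred.mean(0), targets.mean(0)
    H = (scan_pred - src_c).T @ (targets - dst_c)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R_corr = Vt.T @ D @ U.T
    t_corr = dst_c - R_corr @ src_c

    # Compose the correction with the IMU-predicted pose.
    return R_corr @ R_init, R_corr @ t_init + t_corr
```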
Dynamic-Aware Localization and Mapping in Changing Environments
I investigate localization and dynamic object detection in complex, time-varying environments through map consistency analysis and multi-stage processing pipelines.
In my recent work, I propose a system that first constructs a clean, static truncated signed distance function (TSDF) map from data collected with high-precision GNSS/INS positioning, with dynamic objects removed during map construction. During online operation, LiDAR scans are registered against this dynamic-free reference map, and scan-to-map deviations are used to identify and filter dynamic points in real time.
This approach enables autonomous vehicles and UAVs to maintain robust localization while simultaneously detecting dynamic agents such as pedestrians and vehicles.
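The sketch below illustrates only the online filtering step, under a simplifying assumption: a k-d tree over the static map points stands in for the TSDF's signed-distance lookup, and the 0.3 m deviation threshold is a placeholder.

```python
# Sketch of the online dynamic-point filter: after registering a LiDAR scan
# against the static reference map, points that deviate from the map beyond a
# threshold are labeled dynamic. The actual map is a TSDF; here a k-d tree
# over static map points stands in for the signed-distance lookup.
import numpy as np
from scipy.spatial import cKDTree

def split_dynamic(scan_in_map, static_map_points, deviation_thresh=0.3):
    """Return (static_points, dynamic_points) from a registered scan."""
    tree = cKDTree(static_map_points)
    dist, _ = tree.query(scan_in_map)          # distance to nearest static surface point
    dynamic_mask = dist > deviation_thresh     # large deviation => likely a moving object
    return scan_in_map[~dynamic_mask], scan_in_map[dynamic_mask]
```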
Generative Models for Cross-view Understanding
I apply diffusion models to generate semantically aligned BEV representations from monocular images, facilitating viewpoint-invariant localization.
Semantic-Guided BEV Generation for Cross-View Localization
I explore how structured semantic priors can guide generative models to bridge extreme viewpoint gaps in cross-view localization tasks.
In particular, I proposed DiffLoc, a framework that integrates Inverse Perspective Mapping (IPM), Navier-Stokes-based inpainting, and CLIP-based Scene Description (CLSD) to condition a latent diffusion model for synthesizing bird's-eye-view (BEV) representations. The generated BEV images are both geometrically consistent and semantically rich, enabling accurate matching against satellite imagery.
This approach demonstrates robust performance in challenging urban environments and offers a vision-based alternative to GNSS for localization.
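The sketch below covers only the two classical preprocessing steps, the IPM warp and Navier-Stokes inpainting, using OpenCV. The ground-plane correspondences and file names are placeholders, and the CLIP-based conditioning and diffusion-based BEV synthesis are omitted.

```python
# Sketch of two preprocessing steps: an inverse-perspective-mapping (IPM) warp
# of the front camera image to a coarse ground-plane view, and Navier-Stokes
# inpainting of regions the warp leaves empty. The pixel correspondences and
# mask below are illustrative placeholders.
import cv2
import numpy as np

frame = cv2.imread("front_camera.png")           # assumed local image

# Hypothetical correspondences: a trapezoid on the road mapped to a rectangle.
src = np.float32([[420, 500], [860, 500], [1180, 720], [100, 720]])
dst = np.float32([[0, 0], [400, 0], [400, 400], [0, 400]])
H = cv2.getPerspectiveTransform(src, dst)
bev = cv2.warpPerspective(frame, H, (400, 400))

# Fill holes left by the warp (zero pixels) with Navier-Stokes inpainting.
mask = (bev.sum(axis=2) == 0).astype(np.uint8) * 255
bev_filled = cv2.inpaint(bev, mask, 5, cv2.INPAINT_NS)
cv2.imwrite("bev_condition.png", bev_filled)
```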
Long-Horizon and Cross-Modal Navigation via Fast-Slow System Integration
In autonomous driving, onboard perception is typically limited to approximately 200 meters. While navigation APIs (e.g., Google Maps, Amap) provide coarse routing instructions such as "turn left in 300m," current systems rely heavily on short-range perception and localization, which can lead to lane miscounting or missed intersections.
I explore the integration of fast reactive modules (for perception and control) with slow semantic planners that leverage satellite imagery, vision-language models (VLMs), and onboard camera streams to enable long-horizon, cross-modal navigation. This architecture allows the system to generate semantically informed, early-warning instructions well before approaching complex urban intersections.
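A schematic sketch of this fast-slow split follows: a low-rate thread refreshes a long-horizon maneuver hint from a VLM queried with a satellite crop and the routing instruction, while the high-rate loop reads the latest hint each control cycle. The callables query_vlm, get_satellite_crop, and get_route_instruction, as well as the update rates, are hypothetical placeholders rather than the actual system's interfaces.

```python
# Architectural sketch of the fast-slow split: a slow "semantic planner" thread
# refreshes a long-horizon maneuver hint (e.g., "keep right, exit in 300 m")
# at roughly 0.5 Hz, while the fast loop reads the latest hint every control
# cycle. The injected callables and rates are hypothetical placeholders.
import threading
import time

class ManeuverHint:
    """Thread-safe container for the latest slow-planner output."""
    def __init__(self):
        self._lock = threading.Lock()
        self._text = "follow current lane"

    def update(self, text):
        with self._lock:
            self._text = text

    def read(self):
        with self._lock:
            return self._text

def slow_planner(hint, query_vlm, get_satellite_crop, get_route_instruction):
    # Low-rate loop: fuse satellite imagery and routing text via a VLM.
    while True:
        prompt = f"Route says: {get_route_instruction()}. Describe the upcoming maneuver."
        hint.update(query_vlm(get_satellite_crop(), prompt))
        time.sleep(2.0)                      # ~0.5 Hz semantic updates

def fast_loop(hint, perceive, control, rate_hz=20.0):
    # High-rate loop: reactive perception/control, conditioned on the hint.
    while True:
        control(perceive(), hint.read())     # hint biases lane choice well before the turn
        time.sleep(1.0 / rate_hz)
```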