Hi there! I’m a Ph.D. student in the Automated Driving and Human-Machine System Lab (AutoMan) at
Nanyang Technological University (NTU), where I’m advised by Prof. Chen Lv.
I’m also an AGS scholar in the Robotics and Autonomous Systems department, co-supervised by
Dr. Wei-Yun Yau at the Institute for Infocomm Research (I²R), A*STAR.
Before starting my Ph.D., I completed my B.Eng. (Hons) in Mechanical Engineering at NTU,
where I specialized in Robotics and Mechatronics.
My research interests span robot learning, vision-language navigation, reinforcement learning, and foundation models for embodied agents.
I focus on building generalizable policies that transfer across tasks, environments, and robot platforms.
2026.06 ImagiNav has been accepted to IROS 2026 and SysNav has been accepted to IEEE RA-L!
2026.06 We are organizing a competition at IROS 2026! You are welcome to join the CMU VLN Challenge!
2025.11 We won the Best Paper Award (First Prize) for our paper "COVLM-RL" at IEEE ITSC 2025!
2025.10 Our team ReasonX won 2nd place in the CMU Vision-Language Autonomy Challenge and presented our work at IROS 2025!
2025.04 I will be visiting Carnegie Mellon University as a visiting scholar.
Cool Demos
Scalable Autonomy Stack
A full autonomy stack supporting both slow and fast walking/running modes across different robotic platforms.
The system integrates real-time mapping, path planning, terrain analysis, and
collision avoidance to enable smooth goal-directed navigation in real-world environments
while maintaining stable forward motion during turns.
This demo provides a practical platform for deploying and
evaluating high-level vision-language navigation (VLN) policies on legged robots. More
details can be found here.
Go2 - Fast Mode
G1 - Fast Mode
Object Navigation — IntentNav
IntentNav learns human-like ObjectNav policies from human demonstrations via spatial-visual imitation learning. The trained VLM policy transfers zero-shot across wheeled, quadruped (Go2), and humanoid (G1) robots without any additional fine-tuning, demonstrating strong embodiment-agnostic generalization.
Wheeled Robot
Go2 (Quadruped)
G1 (Humanoid)
Vision-Language Navigation — Goal2Pixel
Goal2Pixel reformulates VLN-CE as navigable pixel grounding, predicting a visible pixel in the image plane that is back-projected into a 3D waypoint for navigation. This pixel-based interface enables efficient long-horizon navigation with far fewer VLM inference calls than action-prediction baselines.
IntentNav is a spatial-visual imitation framework that learns human-like ObjectNav policies from human demonstrations. It introduces Frontier-based Human-Intent Labeling to infer high-level search intent, constructs a spatial-visual candidate space combining BEV memory and egocentric visual memory, and trains a VLM policy with an Intent-Aligned Objective. IntentNav achieves state-of-the-art performance on MP3D, HM3D-v1 and HM3D-v2 benchmarks, and transfers zero-shot to wheeled, quadruped, and humanoid robots without VLM fine-tuning.
Goal2Pixel reformulates VLN-CE as navigable pixel grounding, using the image plane as a unified spatial interface between VLM reasoning and robot motion. The model predicts a visible navigable pixel back-projected into a 3D waypoint, uses visibility-aware keyframe memory for long-horizon navigation, and introduces semantic embeddings with coordinate-aware auxiliary losses. It achieves 54.1% SR and 52.5% SPL on R2R-CE Val-Unseen with only 7.75 VLM calls per episode — 6× fewer than action prediction baselines.
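As a rough illustration of the back-projection step described above, here is a minimal sketch of lifting a predicted pixel into a 3D waypoint. It assumes a standard pinhole camera model with a known metric depth at that pixel. The function names, intrinsics, and pose values are hypothetical and chosen for illustration only; this is not Goal2Pixel's actual implementation.

```python
def backproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth into a 3D point in the camera frame.

    Assumes a pinhole model: fx, fy are focal lengths in pixels and
    (cx, cy) is the principal point. All values here are illustrative.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(point_cam, T_world_cam):
    """Apply a 4x4 homogeneous camera-to-world pose (nested lists) to a point."""
    x, y, z = point_cam
    return tuple(
        row[0] * x + row[1] * y + row[2] * z + row[3]
        for row in T_world_cam[:3]
    )


# Hypothetical intrinsics and a pose that only translates by 1 m along world x.
waypoint_cam = backproject_pixel(u=400, v=300, depth=2.0,
                                 fx=500.0, fy=500.0, cx=320.0, cy=240.0)
T_world_cam = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
waypoint_world = camera_to_world(waypoint_cam, T_world_cam)
```

In this sketch a single predicted pixel yields a waypoint several steps away, which is one intuition for why a pixel-based interface may need fewer model calls than predicting low-level actions one at a time.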
Roken is a unified diffusion transformer that directly generates coordinated multi-robot trajectories satisfying both individual safety and global connectivity constraints. Each robot is represented as a discrete token that interacts via self-attention and cross-attends to map tokens. Auxiliary tasks based on Bayes' theorem — local occupancy reconstruction and long-horizon waypoint prediction — provide multi-scale supervision. Roken handles single-robot planning, multi-robot generation, and conditional generation in a single feed-forward model, and demonstrates strong scalability and generalization to unseen environments.
Enabling robots to navigate open-world environments via natural language is critical for general-purpose autonomy. ImagiNav introduces a modular hierarchy that decouples visual planning from robot actuation by combining instruction decomposition, generative future-video imagination, and inverse dynamics-based trajectory extraction. With a scalable in-the-wild video data pipeline, the method enables strong zero-shot transfer to robot navigation without robot demonstrations.
SysNav formulates real-world ObjectNav as a system-level problem and introduces a three-level architecture that decouples semantic reasoning, navigation planning, and motion control for robust cross-embodiment deployment. The system is validated on wheeled, quadruped, and humanoid robots across 190 real-world experiments, showing substantial gains in success rate and efficiency, while also achieving state-of-the-art performance on four simulation benchmarks.
COVLM-RL integrates Critical Object reasoning with VLM-guided RL to generate semantic driving
priors and align them with low-level control.
It improves training stability and interpretability,
and boosts CARLA success rates by 30% in trained and 50% in unseen environments.
VLMLight is a vision-language-based traffic signal control (TSC) framework that leverages a safety-aware LLM meta-controller to dynamically switch between a fast RL policy and a structured reasoning branch. It introduces the first image-based traffic simulator with multi-view intersection perception, enabling real-time decision-making for both routine and critical scenarios. Experiments demonstrate up to 65% improvement in emergency vehicle response over RL-only systems.
A vision-language model (VLM)-driven framework that integrates structured chain-of-thought reasoning and
closed-loop feedback to enable zero-shot generalization in object navigation tasks.
We propose a novel transformer-based multi-agent reinforcement learning framework that enables generalizable and cooperative behavior among heterogeneous robot teams across diverse task settings.
We introduce a hierarchical multi-agent reinforcement learning framework that models both inter-vehicle interactions and traffic-level dynamics to achieve robust and cooperative control for autonomous vehicles in dense, heterogeneous traffic scenarios.
We propose a context-aware driver attention estimation framework that fuses gaze tracking, saliency detection, and semantic scene understanding across multiple hierarchical levels to improve prediction accuracy in real-world driving scenarios.
Academic Services
Journal Reviewer
IEEE Transactions on Intelligent Vehicles (T-IV), 2024
IEEE Transactions on Vehicular Technology (T-VT), 2023
IEEE Robotics and Automation Letters (RA-L), 2023-2026
Conference Reviewer
IEEE International Conference on Robotics and Automation (ICRA), 2024-2025
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023-2026
IEEE Intelligent Transportation Systems Conference (ITSC), 2024-2025