Publications
Publications by category, in reverse chronological order.
2025
- arXiv Preprint
Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-Language Navigation
Meng Wei, Chenyang Wan, Jiaqi Peng, Xiqian Yu, Yuqiang Yang, Delin Feng, Wenzhe Cai, Chenming Zhu, Tai Wang, Jiangmiao Pang†, and Xihui Liu†
arXiv preprint arXiv:2512.08186, 2025

While recent large vision-language models (VLMs) have improved generalization in vision-language navigation (VLN), existing methods typically rely on end-to-end pipelines that map vision-language inputs directly to short-horizon discrete actions. Such designs often produce fragmented motions, incur high latency, and struggle with real-world challenges like dynamic obstacle avoidance. We propose DualVLN, the first dual-system VLN foundation model that synergistically integrates high-level reasoning with low-level action execution. System 2, a VLM-based global planner, “grounds slowly” by predicting mid-term waypoint goals via image-grounded reasoning. System 1, a lightweight Diffusion Transformer policy with multi-modal conditioning, “moves fast” by leveraging both explicit pixel goals and latent features from System 2 to generate smooth and accurate trajectories. This dual-system design enables robust real-time control and adaptive local decision-making in complex, dynamic environments. By decoupling training, the VLM retains its generalization ability, while System 1 achieves interpretable and effective local navigation. DualVLN outperforms prior methods across all VLN benchmarks, and real-world experiments demonstrate robust long-horizon planning and real-time adaptability in dynamic environments.
@article{wei2025dualvln,
  title      = {Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-Language Navigation},
  author     = {Wei, Meng and Wan, Chenyang and Peng, Jiaqi and Yu, Xiqian and Yang, Yuqiang and Feng, Delin and Cai, Wenzhe and Zhu, Chenming and Wang, Tai and Pang, Jiangmiao and Liu, Xihui},
  journal    = {arXiv preprint arXiv:2512.08186},
  year       = {2025},
  dimensions = {true},
}
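To make the dual-frequency design above concrete, here is a minimal sketch of such a control loop: a slow planner replans a pixel goal and a latent plan every few ticks, while a fast policy acts on every frame using the latest plan. All names (SlowPlanner, FastPolicy, get_observation), the replanning interval, and the stubbed outputs are hypothetical illustrations of the idea, not the DualVLN implementation.

```python
# Hedged sketch of a "ground slow, move fast" control loop.
# Everything here is a placeholder: a real System 2 is a VLM doing
# image-grounded reasoning and a real System 1 is a diffusion policy.
import random


class SlowPlanner:
    """System 2 stand-in: emits a mid-term pixel goal and a latent feature."""

    def plan(self, rgb_frame, instruction):
        pixel_goal = (random.randint(0, 639), random.randint(0, 479))  # (u, v) in the image
        latent = [random.random() for _ in range(8)]                   # toy latent plan
        return pixel_goal, latent


class FastPolicy:
    """System 1 stand-in: produces an action conditioned on goal + latent."""

    def act(self, rgb_frame, pixel_goal, latent):
        return {"vx": 0.5, "wz": 0.0, "goal": pixel_goal}              # toy velocity command


def get_observation(step):
    # Hypothetical sensor stub standing in for the robot's RGB stream.
    return f"frame_{step}"


if __name__ == "__main__":
    planner, policy = SlowPlanner(), FastPolicy()
    instruction = "go past the sofa and stop at the kitchen door"
    pixel_goal, latent = None, None
    for step in range(30):                     # fast loop, e.g. ~30 Hz
        frame = get_observation(step)
        if step % 10 == 0:                     # slow loop: System 2 replans ~every 10 ticks
            pixel_goal, latent = planner.plan(frame, instruction)
        action = policy.act(frame, pixel_goal, latent)  # System 1 runs every tick
        print(step, action)
```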
- Technical Report
InternVLA-N1: An Open Dual-System Vision-Language Navigation Foundation Model with Learned Latent Plans
Intern Robotics
Technical Report, 2025

We introduce InternVLA-N1, the first open dual-system vision-language navigation foundation model. Unlike previous navigation foundation models that can only take short-term actions from a limited discrete space, InternVLA-N1 decouples the task into pixel-goal planning with System 2 and agile execution with System 1. A two-stage curriculum training paradigm is devised for this framework: first, the two systems are pretrained with explicit pixel goals as supervision or conditioning; subsequently, we freeze System 2 and finetune the newly added latent plans together with System 1 in an asynchronous end-to-end manner. This paradigm, which relies on latent plans as the intermediate representation, removes the ambiguity of pixel-goal planning and opens new possibilities for extending pretraining with video prediction. To enable scalable training, we develop an efficient navigation data generation pipeline and introduce InternData-N1, the largest navigation dataset to date. InternData-N1 comprises over 50 million egocentric images collected from more than 3,000 scenes, amounting to 4,839 kilometers of robot navigation experience. We evaluate InternVLA-N1 across 6 challenging navigation benchmarks, where it consistently achieves state-of-the-art performance, with improvements ranging from 3% to 28%. In particular, it demonstrates synergistic integration of long-horizon planning (>150 m) and real-time decision-making (>30 Hz), and it generalizes zero-shot across diverse embodiments (wheeled, quadruped, humanoid) and in-the-wild environments. All code, models, and datasets are publicly available.
@article{wang2025internvla,
  title      = {InternVLA-N1: An Open Dual-System Vision-Language Navigation Foundation Model with Learned Latent Plans},
  author     = {Robotics, Intern},
  journal    = {Technical Report},
  year       = {2025},
  dimensions = {true},
}
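The two-stage curriculum described in the abstract can be sketched under heavy simplifying assumptions: toy MLPs stand in for the VLM (System 2) and the diffusion policy (System 1), pixel-goal and trajectory supervision become simple regression losses, and the asynchronous aspect of Stage 2 is omitted. Module names, shapes, and losses are hypothetical, not the released InternVLA-N1 code.

```python
# Hedged sketch of a two-stage curriculum: Stage 1 pretrains with explicit
# pixel goals; Stage 2 freezes System 2 and tunes a latent-plan head with System 1.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy modules with made-up sizes (placeholders for a VLM and a diffusion policy).
backbone    = nn.Sequential(nn.Linear(512, 256), nn.ReLU())   # System 2 feature extractor
goal_head   = nn.Linear(256, 2)     # explicit pixel goal (u, v)
latent_head = nn.Linear(256, 32)    # learned latent plan (added in Stage 2)
goal_enc    = nn.Linear(2, 32)      # Stage 1: condition System 1 on the explicit goal
system1     = nn.Sequential(nn.Linear(512 + 32, 128), nn.ReLU(), nn.Linear(128, 6))  # short trajectory


def fake_batch(n=16):
    # Random tensors standing in for egocentric observations, pixel-goal labels,
    # and short trajectory labels from a navigation dataset.
    return torch.randn(n, 512), torch.rand(n, 2), torch.randn(n, 6)


# --- Stage 1: pretrain both systems with explicit pixel goals ---
opt1 = torch.optim.Adam(
    list(backbone.parameters()) + list(goal_head.parameters())
    + list(goal_enc.parameters()) + list(system1.parameters()), lr=1e-3)
for _ in range(50):
    obs, goal, traj = fake_batch()
    feats = backbone(obs)
    loss = F.mse_loss(goal_head(feats), goal)                 # System 2: predict the pixel goal
    cond = goal_enc(goal)                                     # System 1: condition on the explicit goal
    loss = loss + F.mse_loss(system1(torch.cat([obs, cond], -1)), traj)
    opt1.zero_grad(); loss.backward(); opt1.step()

# --- Stage 2: freeze System 2, tune the latent-plan head together with System 1 ---
for p in list(backbone.parameters()) + list(goal_head.parameters()):
    p.requires_grad_(False)
opt2 = torch.optim.Adam(list(latent_head.parameters()) + list(system1.parameters()), lr=1e-3)
for _ in range(50):
    obs, _, traj = fake_batch()
    with torch.no_grad():
        feats = backbone(obs)                                 # frozen System 2 features
    cond = latent_head(feats)                                 # latent plan replaces the pixel goal
    loss = F.mse_loss(system1(torch.cat([obs, cond], -1)), traj)
    opt2.zero_grad(); loss.backward(); opt2.step()
```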
- arXiv Preprint
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling
Meng Wei*, Chenyang Wan*, Xiqian Yu*, Tai Wang*, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, Xihui Liu†, and Jiangmiao Pang†
arXiv preprint arXiv:2507.05240, 2025

Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate low-latency actions grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current Video-LLM-based VLN methods often face trade-offs among fine-grained visual understanding, long-term context modeling, and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language, and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding window of active dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves coherent multi-turn dialogue through efficient KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, ensuring robustness and efficiency in real-world deployment.
@article{wei2025streamvln,
  title      = {StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling},
  author     = {Wei, Meng and Wan, Chenyang and Yu, Xiqian and Wang, Tai and Yang, Yuqiang and Mao, Xiaohan and Zhu, Chenming and Cai, Wenzhe and Wang, Hanqing and Chen, Yilun and Liu, Xihui and Pang, Jiangmiao},
  journal    = {arXiv preprint arXiv:2507.05240},
  year       = {2025},
  dimensions = {true},
}
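As a minimal sketch of the slow-fast context idea: a bounded sliding window holds recent dialogue turns verbatim, and visual tokens evicted from the window are compressed into a slowly updated memory. The class name SlowFastContext and the uniform-stride pruning rule are placeholders (the paper uses a 3D-aware pruning strategy), and KV cache reuse is not modeled here.

```python
# Hedged sketch of slow-fast context modeling with a bounded context size.
from collections import deque


class SlowFastContext:
    def __init__(self, window_turns=8, keep_ratio=0.25):
        self.window = deque(maxlen=window_turns)   # fast-streaming dialogue context
        self.memory = []                           # slow-updating compressed memory
        self.keep_ratio = keep_ratio

    def add_turn(self, visual_tokens, text_tokens):
        if len(self.window) == self.window.maxlen:
            # Oldest turn is about to leave the window: compress its visual tokens.
            old_visual, _ = self.window[0]
            stride = max(1, int(1 / self.keep_ratio))
            self.memory.extend(old_visual[::stride])   # placeholder for 3D-aware pruning
        self.window.append((visual_tokens, text_tokens))

    def context(self):
        # Bounded context = compressed memory + the active sliding window.
        flat = list(self.memory)
        for visual, text in self.window:
            flat.extend(visual)
            flat.extend(text)
        return flat


if __name__ == "__main__":
    ctx = SlowFastContext(window_turns=3, keep_ratio=0.5)
    for t in range(6):
        ctx.add_turn([f"v{t}_{i}" for i in range(4)], [f"w{t}"])
        print(t, len(ctx.context()))   # context size stays bounded as turns accumulate
```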