Chenyang Wan

Shanghai AI Lab & Zhejiang University


701 Yunjin Rd, Xuhui Dist

Shanghai, China

Hi there, I’m Chenyang (Bryce) Wan, a first-year PhD student at Zhejiang University, affiliated with InternRobotics at Shanghai AI Lab, under the joint supervision of Jiangmiao Pang and Dahua Lin. Previously, I received my Bachelor’s degree from the College of Control Science and Engineering at Zhejiang University.

My research develops vision-language navigation and exploration systems for embodied AI, aiming at spatiotemporal intelligence in autonomous agents: agents that not only navigate complex environments in real time, but also reason about temporal dynamics and spatial relationships over extended periods. The goal is to build adaptive systems capable of persistent environmental understanding and predictive decision-making in dynamic settings.

News

Dec 09, 2025 [Preprint] We released a preprint of our work DualVLN, a dual-system foundation model for Vision-Language Navigation. It integrates a slow system for robust reasoning and pixel-goal generation with a fast system for immediate trajectory planning, enabling reliable navigation in complex environments.
Jul 07, 2025 [Preprint] We released a preprint of our work StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language, and action inputs.
Jun 30, 2025 [New Start] I graduated with honors from Zhejiang University, earning a B.Eng. degree with Outstanding Graduate distinction.
Nov 15, 2024 [Award] Awarded the First-Class Scholarship at Zhejiang University.
Jul 02, 2024 [New Start] Started an internship at Shanghai AI Lab, advancing research in Embodied Intelligence.

Publications

  1. arXiv Preprint
    Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-Language Navigation
    arXiv preprint arXiv:2512.08186, 2025
  2. Technical Report
    InternVLA-N1: An Open Dual-System Vision-Language Navigation Foundation Model with Learned Latent Plans
    Technical Report, 2025
  3. arXiv Preprint
    StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling
    arXiv preprint arXiv:2507.05240, 2025