Yuxuan Wang

I am a research engineer at the NLP Lab of the Beijing Institute for General Artificial Intelligence (BIGAI), led by Zilong Zheng and Song-Chun Zhu. Before that, I obtained my Master's degree from Peking University under the supervision of Dongyan Zhao, and I completed a summer internship at Johns Hopkins University, where I was mentored by Zhuowan Li and Alan L. Yuille. I also collaborate with Cihang Xie at the University of California, Santa Cruz. My research focuses on video-language learning and multimodal agents, and I am especially drawn to work that offers novel insights and practical applications.

I am looking for a PhD position. Please feel free to reach out!
Scholar | GitHub | CV

Selected Publications (* = equal contribution)

Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge
Yuxuan Wang, Yueqian Wang, Pengfei Wu, Jianxin Liang, Dongyan Zhao, Zilong Zheng
EMNLP 2024 | PDF | Code & Demo | Cite
ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning
Yuxuan Wang, Alan Yuille, Zhuowan Li, Zilong Zheng
COLM 2024 | PDF | Code | Cite
STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering
Yueqian Wang, Yuxuan Wang, Kai Chen, Dongyan Zhao
AAAI 2024 | PDF | Code | Cite
VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions
Yuxuan Wang, Zilong Zheng, Xueliang Zhao, Jinpeng Li, Yueqian Wang, Dongyan Zhao
ACL 2023 | PDF | Code | Homepage | Cite
Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training
Yuxuan Wang, Jianghui Wang, Dongyan Zhao, Zilong Zheng
ACL 2023 Findings | PDF | Code | Cite
Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation
Xueliang Zhao*, Yuxuan Wang*, Chongyang Tao, Chenshuo Wang, Dongyan Zhao
EMNLP 2022 Findings | PDF | Code | Cite

Preprints (* = equal contribution)

VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges
Yuxuan Wang, Cihang Xie, Yang Liu, Zilong Zheng
PDF | Code | Homepage | Cite
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models
Yuxuan Wang, Yueqian Wang, Dongyan Zhao, Cihang Xie, Zilong Zheng
PDF | Code | Homepage | Cite
MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning
Jianghui Wang*, Yuxuan Wang*, Dongyan Zhao, Zilong Zheng
PDF | Code | Homepage | Cite
Understanding Multimodal Hallucination with Parameter-Free Representation Alignment
Yueqian Wang, Jianxin Liang, Yuxuan Wang, Huishuai Zhang, Dongyan Zhao
PDF | Code | Cite
HawkEye: Training Video-Text LLMs for Grounding Text in Videos
Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, Dongyan Zhao
PDF | Code | Cite
Teaching Text-to-Image Models to Communicate
Xiaowen Sun*, Jiazhan Feng*, Yuxuan Wang, Yuxuan Lai, Dongyan Zhao
PDF | Cite

Open-Source Projects

Multimodal Needle In A Video Haystack
Pressure-testing large video-language models (LVLMs): multimodal retrieval from LVLMs at various video lengths to measure accuracy.
Streaming Grounded SAM 2
Grounded SAM 2 for streaming video tracking using natural language queries.

Open-Source Learning Hub

Colorful Multimodal Research
A collection of recent advances driven by large language models (LLMs), spanning vision, audio, agents, robotics, and fundamental sciences such as mathematics.
Language Modeling Research Hub
A comprehensive resource for researchers and enthusiasts studying language models (LMs), with a particular focus on large language models (LLMs).
Multimodal Memory Research
A reading list on memory-augmented multimodal research, covering multimodal context modeling, memory in vision and robotics, and external memory/knowledge-augmented MLLMs.