Yuxuan Wang

I am a research engineer at the NLP Lab of the Beijing Institute for General Artificial Intelligence (BIGAI), led by Zilong Zheng and Song-Chun Zhu. Before that, I obtained my Master's degree from Peking University under the supervision of Dongyan Zhao, and I completed a summer internship at Johns Hopkins University, where I was mentored by Zhuowan Li and Alan L. Yuille. I also collaborate with Cihang Xie at the University of California, Santa Cruz. My research focuses on video-language learning and multimodal agents, and I am especially drawn to work that offers novel insights and practical applications.

I am looking for a PhD position. Please feel free to reach out!
Scholar | GitHub | CV

Selected Publications (* = equal contribution)

Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge
Yuxuan Wang, Yueqian Wang, Pengfei Wu, Jianxin Liang, Dongyan Zhao, Zilong Zheng
EMNLP 2024 | PDF | Code & Demo | Cite
ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning
Yuxuan Wang, Alan Yuille, Zhuowan Li, Zilong Zheng
COLM 2024 | PDF | Code | Cite
STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering
Yueqian Wang, Yuxuan Wang, Kai Chen, Dongyan Zhao
AAAI 2024 | PDF | Code | Cite
VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions
Yuxuan Wang, Zilong Zheng, Xueliang Zhao, Jinpeng Li, Yueqian Wang, Dongyan Zhao
ACL 2023 | PDF | Code | Homepage | Cite
Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training
Yuxuan Wang, Jianghui Wang, Dongyan Zhao, Zilong Zheng
ACL 2023 Findings | PDF | Code | Cite
Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation
Xueliang Zhao*, Yuxuan Wang*, Chongyang Tao, Chenshuo Wang, Dongyan Zhao
EMNLP 2022 Findings | PDF | Code | Cite

Preprints (* = equal contribution)

VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges
Yuxuan Wang, Cihang Xie, Yang Liu, Zilong Zheng
PDF | Code | Homepage | Cite
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models
Yuxuan Wang, Yueqian Wang, Dongyan Zhao, Cihang Xie, Zilong Zheng
PDF | Code | Homepage | Cite
MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning
Jianghui Wang*, Yuxuan Wang*, Dongyan Zhao, Zilong Zheng
PDF | Code | Homepage | Cite
Understanding Multimodal Hallucination with Parameter-Free Representation Alignment
Yueqian Wang, Jianxin Liang, Yuxuan Wang, Huishuai Zhang, Dongyan Zhao
PDF | Code | Cite
HawkEye: Training Video-Text LLMs for Grounding Text in Videos
Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, Dongyan Zhao
PDF | Code | Cite
Teaching Text-to-Image Models to Communicate
Xiaowen Sun*, Jiazhan Feng*, Yuxuan Wang, Yuxuan Lai, Dongyan Zhao
PDF | Cite

Open-Source Projects

Multimodal Needle In A Video Haystack
Pressure-testing large video-language models (LVLMs): multimodal retrieval from LVLMs at various video lengths to measure accuracy.
Streaming Grounded SAM 2
Grounded SAM 2 for streaming video tracking using natural language queries.

Open-Source Learning Hub

Colorful Multimodal Research
A collection of recent advances driven by large language models (LLMs), spanning vision, audio, agents, robotics, and fundamental sciences such as mathematics.
Language Modeling Research Hub
A comprehensive resource for researchers and enthusiasts studying language models (LMs), with a particular focus on large language models (LLMs).
Multimodal Memory Research
A reading list on memory-augmented multimodal research, covering multimodal context modeling, memory in vision and robotics, and external memory/knowledge-augmented MLLMs.