Wentong Li 李文通

College of Artificial Intelligence

Nanjing University of Aeronautics and Astronautics (NUAA)

No.29 Jiangjun Road, Nanjing, China

Office: 1205, No.1 Building

About Me

I am an Associate Professor in the College of Artificial Intelligence at Nanjing University of Aeronautics and Astronautics. In August 2025, I was a visiting researcher at the Department of Computing, The Hong Kong Polytechnic University, where I collaborated with my Ph.D. advisor, Prof. Lei Zhang (IEEE Fellow). Previously, I completed my Ph.D. at the College of Computer Science and Technology, Zhejiang University, in June 2024, supervised by Prof. Jianke Zhu and Prof. Lei Zhang. My recent research interests are Visual/Scene Understanding, Embodied AI, and Multimodal Large Language Models, particularly in:

Fine-grained object-level spatial-temporal understanding: PixelRefer(arXiv2025), VideoRefer(CVPR2025), Osprey(CVPR2024)
Embodied understanding, reasoning, planning and action: Inst3D-LMM(CVPR2025), EOC-Bench(NeurIPS2025)
Efficient and effective VLMs/MLLMs: TokenPacker(IJCV2025)
Visual detection & segmentation: Box2Mask(T-PAMI2024), Point2Mask(ICCV2023), APro(NeurIPS2023), H2RBox(ICLR2023), Oriented RepPoints(CVPR2022)

I am also interested in autonomous driving tasks (HD-Map, 3D-Occupancy, etc.) and 3D reconstruction tasks.

I am looking for self-motivated Master's students, research interns/assistants, and Ph.D. students (co-supervised). Please email me if you are interested.

News

[2025.12]: We released a survey, Forging Spatial Intelligence, a roadmap of multi-modal data pre-training for autonomous systems.
[2025.11]: One paper on object-level generation of camouflage images was accepted by AAAI 2026.
[2025.11]: Our PixelRefer was covered by PaperWeekly and 机器之心.
[2025.10]: We released PixelRefer, a new unified pixel-level MLLM framework for fine-grained regional understanding.
[2025.10]: Gave a talk at PRCV 2025. [Slides]
[2025.9]: Two papers were accepted by NeurIPS 2025.
[2025.8]: Received funding from NSFC 🎉.
[2025.8]: Invited to serve as Area Chair for ICLR 2026.
[2025.8]: Visited The Hong Kong Polytechnic University and gave a talk. [Slides]
[2025.6]: We released EOC-Bench, an object-centric embodied cognition benchmark in dynamic egocentric scenarios.
[2025.5]: One paper was accepted by IJCV (TokenPacker, 57 citations at the time of acceptance).
[2025.4]: Our VideoRefer and VideoRefer-Bench have been discussed and adopted by NVIDIA & UC Berkeley in their DAM work.
[2025.2]: Five papers were accepted by CVPR 2025 (one Highlight).
[2025.2]: We released the VideoRefer-700K dataset on HuggingFace. Please see the VideoRefer Suite for the details.
[2024.12]: Received the Outstanding Doctoral Dissertation Award of ZJU (浙江大学优秀博士学位论文).
[2024.6]: Obtained my Ph.D. degree from ZJU.

Publications

(*: equal contribution, #: corresponding author, +: project leader)

Preprints

Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems

Song Wang, Lingdong Kong, Xiaolu Liu, Hao Shi, Wentong Li, Jianke Zhu, Steven C. H. Hoi

arXiv, 2025

Paper | Code

PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity

Yuqian Yuan, Wenqiao Zhang, Xin Li, Shihao Wang, Kehan Li, Wentong Li#, Jun Xiao, Lei Zhang, Beng Chin Ooi

arXiv, 2025

Project Page | Paper | Code | HuggingFace | PaperWeekly | 机器之心

Selected Publications

MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence

Yue Feng, Jinwei Hu, Qijia Lu, Jiawei Niu, Li Tan, Shuo Yuan, Ziyi Yan, Yizhen Jia, Qingzhi He, Shiping Ge, Ethan Q. Chen, Wentong Li#, Limin Wang, Jie Qin

NeurIPS (DB Track), 2025

Paper | Code | Data

EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?

Yuqian Yuan*, Ronghao Dang*, Long Li*, Wentong Li*, Diao Jiao, Xin Li, Deli Zhao, Fan Wang, Wenqiao Zhang, Jun Xiao, Yueting Zhuang

NeurIPS (DB Track), 2025

Paper | Project Page | Code | HuggingFace | LeaderBoard | Chinese Write-up

TokenPacker: Efficient Visual Projector for Multimodal LLM

Wentong Li*, Yuqian Yuan*, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, Lei Zhang

IJCV, 2025

Paper | arXiv | Code | HuggingFace Model | Chinese Write-up | Daily Papers

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, Jianke Zhu, Lidong Bing

CVPR, 2025

Paper | Code | HuggingFace Model | Dataset | VideoRefer-Bench | Chinese Write-up | Video Walkthrough

Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning

Hanxun Yu*, Wentong Li*, Song Wang, Junbo Chen, Jianke Zhu

CVPR, 2025 (Highlight, 2.9%)

Paper | Code

Osprey: Pixel Understanding with Visual Instruction Tuning

Yuqian Yuan*, Wentong Li*+, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, Jianke Zhu

CVPR, 2024 (Project Leader)

Paper | Code | Online Demo | Video Demo | Chinese Write-up | Video Walkthrough

Box2Mask: Box-supervised Instance Segmentation via Level-set Evolution

Wentong Li, Wenyu Liu, Jianke Zhu, Miaomiao Cui, Risheng Yu, Xiansheng Hua, Lei Zhang

T-PAMI, 2024

Paper | Code(BoxInstSeg) | Code(MMDet)

Full Publications

Text-guided Controllable Diffusion for Realistic Camouflage Images Generation
Yuhang Qian, Haiyan Chen, Wentong Li, Ningzhong Liu, Jie Qin
AAAI, 2026.

MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence
Yue Feng, Jinwei Hu, Qijia Lu, Jiawei Niu, Li Tan, Shuo Yuan, Ziyi Yan, Yizhen Jia, Qingzhi He, Shiping Ge, Ethan Q. Chen, Wentong Li#, Limin Wang, Jie Qin
NeurIPS, 2025.

EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?
Yuqian Yuan*, Ronghao Dang*, Long Li*, Wentong Li*, Diao Jiao, Xin Li, Deli Zhao, Fan Wang, Wenqiao Zhang, Jun Xiao, Yueting Zhuang
NeurIPS, 2025.

OrderChain: A General Prompting Paradigm to Improve Ordinal Understanding Ability of MLLM
Jinhong Wang, Shuo Tong, Dongqi Tang, Weiqiang Wang, Wentong Li, Hongxia Xu, Danny Chen, Jintai Chen, Jian Wu
ICCV, 2025.

TokenPacker: Efficient Visual Projector for Multimodal LLM
Wentong Li*, Yuqian Yuan*, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, Lei Zhang
IJCV, 2025.

Reliable and Calibrated Semantic Occupancy Prediction by Hybrid Uncertainty Learning
Song Wang, Zhongdao Wang, Jiawei Yu, Wentong Li, Bailan Feng, Junbo Chen, Jianke Zhu
IJCAI, 2025.

Large Models are Good Annotators for Zero-Shot Learning
Qingzhi He, Yizhen Jia, Wentong Li, Shengcai Liao, Rong Quan, Tong Cui, Jie Qin
SIGIR, 2025.

Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning
Hanxun Yu*, Wentong Li*, Song Wang, Junbo Chen, Jianke Zhu
CVPR, 2025.

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, Jianke Zhu, Lidong Bing
CVPR, 2025.

PointLoRA: Low-Rank Adaptation with Token Selection for Point Cloud Learning
Song Wang, Xiaolu Liu, Lingdong Kong, Jianyun Xu, Chunyong Hu, Gongfan Fang, Wentong Li, Jianke Zhu, Xinchao Wang
CVPR, 2025.

Uncertainty-Instructed Structure Injection for Generalizable HD Map Construction
Xiaolu Liu, Ruizi Yang, Song Wang, Wentong Li, Junbo Chen, Jianke Zhu
CVPR, 2025.

Scalable Autoregressive Monocular Depth Estimation
Jinhong Wang, Jian Liu, Dongqi Tang, Weiqiang Wang, Wentong Li, Danny Chen, Jintai Chen, Jian Wu
CVPR, 2025.

Label-efficient Semantic Scene Completion with Scribble Annotations
Song Wang, Jiawei Yu, Wentong Li, Hao Shi, Kailun Yang, Junbo Chen, Jianke Zhu
IJCAI, 2025.

Osprey: Pixel Understanding with Visual Instruction Tuning
Yuqian Yuan*, Wentong Li*, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, Jianke Zhu
CVPR, 2024.

Not All Voxels Are Equal: Hardness-aware Semantic Scene Completion with Self-distillation
Song Wang, Jiawei Yu, Wentong Li, Wenyu Liu, Xiaolu Liu, Junbo Chen, Jianke Zhu
CVPR, 2024.

MGMap: Mask-Guided Learning for Online Vectorized HD Map Construction
Xiaolu Liu, Song Wang, Wentong Li, Ruizi Yang, Junbo Chen, Jianke Zhu
CVPR, 2024.

Box2Mask: Box-supervised Instance Segmentation via Level-set Evolution
Wentong Li, Wenyu Liu, Jianke Zhu, Miaomiao Cui, Risheng Yu, Xiansheng Hua, Lei Zhang
T-PAMI, 2024.

Fine-Grained Multi-View Hand Reconstruction Using Inverse Rendering
Qijun Gan, Wentong Li, Jinwei Ren, Jianke Zhu
AAAI, 2024.

Label-efficient Segmentation via Affinity Propagation
Wentong Li*, Yuqian Yuan*, Song Wang, Wenyu Liu, Dongqi Tang, Jian Liu, Jianke Zhu, Lei Zhang
NeurIPS, 2023.

Point2Mask: Point-supervised Panoptic Segmentation via Optimal Transport
Wentong Li, Yuqian Yuan, Song Wang, Jianke Zhu, Jianshu Li, Jian Liu, Lei Zhang
ICCV, 2023.

Improving Nighttime Driving-scene Segmentation via Dual Image-adaptive Learnable Filters
Wenyu Liu, Wentong Li, Jianke Zhu, Miaomiao Cui, Xuansong Xie, Lei Zhang
T-CSVT, 2023.

LiDAR2Map: In Defense of LiDAR-Based Semantic Map Construction Using Online Camera Distillation
Song Wang, Wentong Li, Wenyu Liu, Xiaolu Liu, Jianke Zhu
CVPR, 2023.

H2RBox: Horizontal Box Annotation is All You Need for Oriented Object Detection
Xue Yang, Gefan Zhang, Wentong Li, Xuehui Wang, Yue Zhou, Junchi Yan
ICLR, 2023.

Box-supervised Instance Segmentation with Level Set Evolution
Wentong Li, Wenyu Liu, Jianke Zhu, Miaomiao Cui, Xian-Sheng Hua, Lei Zhang
ECCV, 2022.

Translational Symmetry-Aware Facade Parsing for 3-D Building Reconstruction
Hantang Liu, Wentong Li, Jianke Zhu
IEEE MultiMedia, 2022.

Oriented RepPoints for Aerial Object Detection
Wentong Li, Yijie Chen, Kaixuan Hu, Jianke Zhu
CVPR, 2022.

Research Experiences

The Hong Kong Polytechnic University | Hong Kong SAR | July 2025 - Oct. 2025
Supervisor: Prof. Lei Zhang, Collaborator: Shihao Wang
Visiting Scholar
Ant Group | Hangzhou | Dec. 2022 - Sep. 2024
Collaborator: Jianshu Li, Dongqi Tang, Jian Liu
Research Intern (Ph.D.)
Alibaba DAMO Academy | Hangzhou | July 2020 - Oct. 2020
Supervisor: Prof. Lei Zhang
Research Intern (Ph.D.)
Institute of Automation, CAS | Beijing | July 2018 - June 2019
Supervisor: Prof. Peng Wang, Prof. Wanyi Li
Research Assistant (Master)

Honors

Outstanding Doctoral Dissertation Award of Zhejiang University, 2024
Excellent Doctoral Graduates of Zhejiang Province, China (Top 1%), 2024
Excellent Doctoral Graduates of Zhejiang University, 2024
Tencent Scholarship, 2023
Five-A Postgraduate Student, 2023
Outstanding Postgraduate Student, 2020-2023
Longhu Scholarship, 2022
First-class Academic Scholarship, 2018-2023
National Scholarship, 2016

Academic Services

Area Chair:
ICLR2026
Conference Reviewer:
AAAI2025, ICLR2025, CVPR2025, ICML2025, ICCV2025, NeurIPS2025, ACM MM2025
CVPR2024, ICLR2024, ICML2024, ECCV2024, ACM MM2024, NeurIPS2024
CVPR2023, ICCV2023, NeurIPS2023, ACM MM2023
Journal Reviewer:
Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
International Journal of Computer Vision (IJCV)
Transactions on Image Processing (TIP)
Transactions on Circuits and Systems for Video Technology (TCSVT)
Transactions on Multimedia (TMM)
Transactions on Geoscience and Remote Sensing (TGRS)
Pattern Recognition (PR)
ACM Computing Surveys
ISPRS Journal of Photogrammetry and Remote Sensing (P&RS)
Neurocomputing

Tech. Talks

Efficient Visual Understanding and Interaction with VLMs, PolyU, Hong Kong, slides, 2025/08.
Fine-grained Image Understanding with VLMs, ECNU, Visual Perception+X (VPX) Group, 2024/09.
Osprey: Pixel Understanding with Visual Instruction Tuning, Video, slides, AI TIME, 2024/01.
Point-supervised Image Segmentation, Ant Group, Machine Intelligence Group, 2023/09.

Teaching

Intro. to AI: A Foundational Course, NUAA, Fall 2025.
Foundations and Frontiers of Multimodal Large Models, NUAA, Spring 2025.
Image Processing and Analysis, Police Brain of Zhejiang Province, Teaching Assistant, Fall 2022.
FDS2021: Foundation of Data Structure, Zhejiang University, Teaching Assistant, Fall 2021.