Universal Video Temporal Grounding with Generative Multi-modal Large Language Models
Zeqian Li, Shangzhe Di, Zhonghua Zhai, Weilin Huang, Yanfeng Wang, Weidi Xie
In submission, 2025
paper / project page / code
Towards universal video grounding with superior accuracy, generalizability, and robustness.
Learning Streaming Video Representation via Multitask Training
Yibin Yan*, Jilan Xu*, Shangzhe Di, Yikun Liu, Yudi Shi, Qirui Chen, Zeqian Li, Yifei Huang, Weidi Xie
In submission, 2025
paper / project page / code
Learns streaming video representations at multiple granularities through multitask training on retrieval, action recognition, temporal grounding, and segmentation.
Multi-Sentence Grounding for Long-term Instructional Video
Zeqian Li*, Qirui Chen*, Tengda Han, Ya Zhang, Yanfeng Wang, Weidi Xie
ECCV, 2024
paper / project page / code
An automatic, scalable pipeline that denoises a large-scale instructional video dataset and constructs a high-quality video-text dataset with supervision from multiple descriptive steps.