Multi-Sentence Grounding for Long-term Instructional Video

Zeqian Li^1*

Qirui Chen^1*

Tengda Han³

Ya Zhang^1,2

Yanfeng Wang^1,2

Weidi Xie^1,2,3

¹Coop. Medianet Innovation Center, Shanghai Jiao Tong University

²Shanghai AI Laboratory, China

³Visual Geometry Group, University of Oxford

Code [GitHub]

Paper [arXiv]

Cite [BibTeX]

Abstract

In this paper, we aim to establish an automatic, scalable pipeline for denoising the large-scale instructional dataset and construct a high-quality video-text dataset, named HowToStep. We make the following contributions: (i) we improve the quality of textual narration by upgrading ASR systems, i.e., to reduce errors from speech recognition, and later transform noisy ASR transcripts into descriptive steps by prompting a large language model; (ii) we propose a Transformer-based architecture with all texts as queries, iteratively attending to the visual features, to temporally align the descriptive texts to corresponding video segments. (iii) to measure the quality of our curated datasets, we train models for the task of multi-sentence grounding on it, i.e., given a long-term video, and associated multiple sentences, our goal is to determine the corresponding timestamps for all sentences in the video, as a result, our model demonstrates superior performance on multi-sentence grounding tasks, surpassing existing state-of-the-art methods by a significant margin on three public benchmarks, namely, 9.0% on HT-Step, 5.1% on HTM-Align and 1.9% on CrossTask. All codes, models, and the resulting dataset will be publicly released to the research community.

Architecture

Schematic visualization of the proposed Transformer-based video-text alignment network termed NaSVA. The visual features are treated as key-value pairs while textual features are treated as queries, to predict the alignment score matrix between video frames and texts.

ASR Transcripts → Aligned Descriptive Steps

As presented in the following figure, the entire procedure of ASR transcripts transformation can be divided into three parts: (i) we leverage LLM to summarize narrations from the ASR transcript into descriptive steps; (ii) we use the similarity between the original transcript and generated steps to roughly determine the start/end timestamp for each step as pseudo-label; (iii) we train the NaSVA model on the generated steps with pseudo-label, and then use the trained model to refine the time range for each generated step (i.e., self-training). We name the final dataset as HowToStep.

Results

We compare our best model with existing state-of-the-art approaches on three public benchmarks for multi-sentence grounding tasks. As shown in the table, on the challenging HT-Step task, that aims to ground unordered procedural steps in videos, our model achieves 46.4% R@1, leading to an absolute improvement of 9.0%, over the existing state-of-the-art (37.4%) achieved by VINA. On HTM-Align, which aligns narrations in the video, our method exceeds the SOTA model by 5.1%. On CrossTask, where we need to align video frames and task-specific steps without finetuning, our method outperforms existing the state-of-the-art approach by 1.9%, demonstrating the effectiveness of the proposed pipeline for downstream grounding tasks.

Acknowledgements

Based on a template by Phillip Isola and Richard Zhang.