Video Language Model Paper Reading Notes

date

Feb 22, 2024

summary

Video Understanding with Large Language Models: A Survey

Development

Conventional Methods
Neural Video Models
Self-supervised Video Pretraining
Large Language Models for Video Understanding

Main Tasks

Recognition and Anticipation
Captioning and Summarization
Grounding and Retrieval
Question Answering

The integration of LLMs into video understanding is currently spearheaded by four principal strategies:

LLM-based Video Agents
Vid-LLM Pretraining
Vid-LLM Instruction Tuning
Hybrid Methods

Vision Integration with LLMs

Frame-Based Encoders
Temporal Encoders

Models

LLM-based Video Agents

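The LLM either orchestrates other models that process the multimodal data, or itself processes converted representations of the visual, audio, and textual information.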

Video ChatCaptioner

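ChatGPT: selects frames and asks questions about them

BLIP-2: answers the questions based on the frames

ChatGPT: summarizes the dialogue into a video caption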
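
The three roles above can be sketched as a simple loop. This is only a rough illustration of the control flow, not the actual Video ChatCaptioner code: the function `caption_video` and the three stub callables are hypothetical stand-ins for the ChatGPT questioner, the BLIP-2 answerer, and the ChatGPT summarizer, and no real model is called.

```python
from typing import Callable, List, Tuple


def caption_video(
    num_frames: int,
    ask: Callable[[List[str]], Tuple[int, str]],
    answer: Callable[[int, str], str],
    summarize: Callable[[List[str]], str],
    rounds: int = 3,
) -> str:
    """Run a question/answer dialogue over frames, then summarize it."""
    dialogue: List[str] = []
    for _ in range(rounds):
        # Questioner (ChatGPT in the notes): pick a frame and ask about it.
        frame_id, question = ask(dialogue)
        if not 0 <= frame_id < num_frames:
            raise ValueError(f"frame {frame_id} is out of range")
        # Answerer (BLIP-2 in the notes): answer using only that frame.
        reply = answer(frame_id, question)
        dialogue.append(f"Q (frame {frame_id}): {question}")
        dialogue.append(f"A: {reply}")
    # Summarizer (ChatGPT in the notes): turn the dialogue into one caption.
    return summarize(dialogue)


# Hypothetical stubs so the sketch runs without any model.
def stub_ask(dialogue: List[str]) -> Tuple[int, str]:
    turn = len(dialogue) // 2  # one Q line and one A line per round
    return turn, f"What is happening in frame {turn}?"


def stub_answer(frame_id: int, question: str) -> str:
    return f"placeholder description of frame {frame_id}"


def stub_summarize(dialogue: List[str]) -> str:
    answers = [line[3:] for line in dialogue if line.startswith("A: ")]
    return "; ".join(answers)


caption = caption_video(3, stub_ask, stub_answer, stub_summarize, rounds=3)
print(caption)
```

In a real system each stub would be replaced by a call to the corresponding model, and the questioner would see the dialogue history so that it can ask follow-up questions.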

Vid-LLM Pretraining

Vid-LLM Instruction Tuning

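Different types of adapters are used:

Connective adapters that directly link the LLM and the visual module in order to align the modalities. These can be a linear projection layer, an MLP, cross-attention, a Q-Former, or a combination of these
Adapters inserted inside the LLM, which help the LLM generalize better to vision tasks
Hybrid: a combination of both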
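
As a minimal sketch of the simplest connective adapters (a linear projection and a two-layer MLP), the code below maps visual tokens into the LLM embedding dimension and prepends them to the text embeddings. It uses plain Python lists instead of a tensor library; all dimensions, weights, and names (`mlp_adapter`, `w1`, `w2`, ...) are made up for illustration and are not taken from any specific model.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Compute y = W x + b, with one weight row per output dimension."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def mlp_adapter(tokens: List[Vector], w1: Matrix, b1: Vector,
                w2: Matrix, b2: Vector) -> List[Vector]:
    """Project each visual token into the LLM embedding space (2-layer MLP)."""
    return [linear(relu(linear(t, w1, b1)), w2, b2) for t in tokens]


# Toy setup (assumed): visual dim 2, hidden dim 2, LLM embedding dim 3.
w1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
w2 = [[1.0, 1.0], [1.0, -1.0], [0.0, 2.0]]
b2 = [0.0, 0.0, 1.0]

visual_tokens = [[1.0, 2.0], [-1.0, 3.0]]   # output of a frozen visual encoder
text_embeddings = [[0.5, 0.5, 0.5]]         # embeddings of the text prompt

projected = mlp_adapter(visual_tokens, w1, b1, w2, b2)
llm_input = projected + text_embeddings     # visual tokens go before the text
print(projected)
```

In practice only the adapter weights (and sometimes the LLM) are trained, while the visual encoder is often kept frozen; a Q-Former or cross-attention adapter replaces the per-token MLP with a module that also compresses the number of visual tokens.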

Hybrid Methods

Combine fine-tuning with video agents

Datasets

Recognition and Anticipation

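Tasks: video classification, action detection, action recognition, short-term and long-term action localization, etc.

Concepts: actions, temporal order

Metrics:

Single-label: Top-k Accuracy
Multi-label: Mean Average Precision (mAP)
Order-sensitive: Edit Distance (ED)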
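
The three metrics above can be sketched in a few lines of plain Python. These are simplified textbook versions (non-interpolated average precision, Levenshtein distance over action sequences), assumed here for illustration; individual benchmarks may define details such as tie handling or normalization differently, and the toy scores and labels are made up.

```python
from typing import List, Sequence, Tuple


def top_k_accuracy(scores: List[List[float]], labels: List[int], k: int = 1) -> float:
    """Fraction of samples whose true class is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


def average_precision(scores: List[float], relevant: List[int]) -> float:
    """Mean of precision@rank over the ranks where a relevant item appears."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if relevant[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_class: List[Tuple[List[float], List[int]]]) -> float:
    """Average the per-class AP values (multi-label setting)."""
    return sum(average_precision(s, r) for s, r in per_class) / len(per_class)


def edit_distance(pred: Sequence, target: Sequence) -> int:
    """Levenshtein distance between two sequences (e.g. of action labels)."""
    prev = list(range(len(target) + 1))
    for i, x in enumerate(pred, start=1):
        curr = [i]
        for j, y in enumerate(target, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
print(top_k_accuracy(scores, labels, k=1), top_k_accuracy(scores, labels, k=2))

class_a = ([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
class_b = ([0.2, 0.9, 0.4, 0.1], [0, 1, 0, 0])
print(mean_average_precision([class_a, class_b]))

print(edit_distance(["open", "pour", "stir"], ["open", "stir", "pour"]))
```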

Captioning and Description

Generate textual descriptions and summaries of videos

Audio is as important as video

Tasks: video captioning, video summarization, etc.

Metrics

Grounding and Retrieval

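Identify and localize specific moments or events in a video based on a description

Tasks: video retrieval (aligning video content with text descriptions), temporal grounding (returning a time interval for a text description), spatio-temporal grounding (localizing in both time and space)

Metrics

Retrieval is evaluated much like classification, e.g. with recall and mean average precision (mAP)

Spatio-temporal grounding: intersection over union (IoU), mean IoU (mIoU)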
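
A small sketch of these metrics, assuming the common formulations: Recall@K for text-to-video retrieval with one ground-truth video per query, and IoU between a predicted and a ground-truth time interval (the spatial case uses box areas instead of interval lengths). Function names and the toy data are illustrative only.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def recall_at_k(rankings: List[List[str]], targets: List[str], k: int) -> float:
    """Fraction of queries whose ground-truth video is in the top-k results."""
    hits = sum(target in ranking[:k] for ranking, target in zip(rankings, targets))
    return hits / len(targets)


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection over union of two time intervals."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: List[Interval], gts: List[Interval]) -> float:
    """Average IoU over all (prediction, ground truth) pairs."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)


# Two text queries, each with a ranked list of retrieved video ids.
rankings = [["v3", "v1", "v2"], ["v2", "v3", "v1"]]
targets = ["v1", "v1"]
print([recall_at_k(rankings, targets, k) for k in (1, 2, 3)])

preds = [(2.0, 6.0), (0.0, 5.0)]
gts = [(4.0, 8.0), (0.0, 5.0)]
print(temporal_iou(preds[0], gts[0]), mean_iou(preds, gts))
```

Grounding benchmarks often also report recall at an IoU threshold, counting a prediction as correct when its IoU with the ground truth exceeds a fixed value.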

Question Answering

Multiple-choice and open-ended question answering

Metrics

Classification (multiple-choice): accuracy

Open-ended: BLEU, METEOR, ROUGE, CIDEr, and WUPS
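
To make the two evaluation styles concrete, here is a rough sketch: exact-match accuracy for multiple-choice answers, and a simplified ROUGE-L style score for open-ended answers based on the longest common subsequence (LCS). The ROUGE-L variant below assumes lowercased whitespace tokenization and an equally weighted F-score, which is a simplification; published results should come from the reference implementations of each metric.

```python
from typing import List


def accuracy(predictions: List[str], answers: List[str]) -> float:
    """Exact-match accuracy, e.g. over multiple-choice option letters."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


print(accuracy(["A", "C", "B", "D"], ["A", "B", "B", "D"]))
print(rouge_l_f1("a man is cooking pasta", "a man cooks pasta in a kitchen"))
```

In the second example the LCS is "a man pasta" (3 tokens), giving precision 3/5 and recall 3/7.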

Video Instruction Tuning Datasets