Research Prototype · Metadata-Augmented Video Understanding

Metadata-Augmented Video Understanding for Vision-Language Models

Embedding timestamped semantic metadata into video streams to bridge the frame-sampling gap in VLM video comprehension.

01 — Problem

Why current VLMs fail on video

State-of-the-art vision-language models sample only 8–64 frames from videos that contain thousands. Critical events disappear in the gaps.

P.01

Extreme temporal sparsity

A 60-second video at 30 fps has 1,800 frames. With 32-frame uniform sampling, 98.2% of visual content is discarded. An interaction lasting 0.1 s spans only 3 frames and will usually fall between samples.
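
The arithmetic behind these figures can be checked with a short sketch. The function name `sampling_stats` and its result keys are illustrative and not part of the prototype.

```python
def sampling_stats(duration_s: float, fps: float, sampled_frames: int) -> dict:
    """Summarize how sparse uniform sampling is for a single clip."""
    total_frames = int(duration_s * fps)
    return {
        "total_frames": total_frames,
        # Share of frames the model never sees.
        "discarded_pct": 100.0 * (1 - sampled_frames / total_frames),
        # Average spacing between two consecutive sampled frames.
        "gap_between_samples_s": duration_s / sampled_frames,
    }


stats = sampling_stats(duration_s=60, fps=30, sampled_frames=32)
print(stats["total_frames"])             # 1800
print(round(stats["discarded_pct"], 1))  # 98.2
print(stats["gap_between_samples_s"])    # 1.875
```

With samples about 1.9 s apart, a 0.1 s interaction is unlikely to coincide with any sampled frame.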

P.02

Naive frame selection

Uniform sampling treats a loading spinner and a critical error state with equal priority. Models miss inflection points where causality is established.

P.03

No causal bridge

Without information about what happened between sampled frames, models cannot reason about motion, cause-and-effect chains, or temporal sequencing.

P.04

Hallucination on unseen frames

Models confabulate plausible-but-wrong descriptions for gaps in sampled content. On tutorial videos, hallucination rates reach 34% on fine-grained event questions.

02 — Research Hypothesis

Projected improvements

Metadata-guided frame selection should close the comprehension gap while reducing computational overhead.

+31%

QA accuracy improvement

vs. 32-frame uniform baseline on TutorialVQA benchmark

50%

Reduction in frame usage

16 metadata-guided vs 32 uniform — same wall-clock budget

~0%

Hallucination on annotated events

Ground-truth metadata eliminates confabulation at marked timestamps

04 — Interactive Demo

Live comparison

Toggle between baseline uniform sampling and metadata-guided inference. Ask the same question, observe the difference in accuracy and attribution.

05 — Metadata Format

Event schema

Each event is a JSON object embedded as a timed metadata track in the MP4 container. Parsers extract and align events within ±100ms of frame boundaries.

event schema

{
  "event": {
    "timestamp": 45.23,          // seconds
    "event_type": "ui_interaction",
    "semantic_text": "user clicked submit button on login form",
    "importance_score": 0.92,    // 0–1
    "causal_link": "leads_to:46.1",
    "bounding_box": {
      "x": 312, "y": 487,
      "w": 180, "h": 44
    },
    "keyframe_flag": true
  }
}
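
As a rough illustration of how a consumer might load one such event, the sketch below parses the JSON into a typed record. It assumes the event arrives as plain JSON without the inline comments shown in the schema, and that `causal_link` always has the form `relation:target_timestamp`. The names `Event`, `parse_event`, and `parse_causal_link` are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Event:
    timestamp: float                          # seconds from start of video
    event_type: str
    semantic_text: str
    importance_score: float                   # expected range 0-1
    causal_link: Optional[Tuple[str, float]]  # (relation, target timestamp)
    bounding_box: Optional[dict]
    keyframe_flag: bool


def parse_causal_link(link: Optional[str]) -> Optional[Tuple[str, float]]:
    """Split 'leads_to:46.1' into ('leads_to', 46.1); format is an assumption."""
    if not link:
        return None
    relation, _, target = link.partition(":")
    return relation, float(target)


def parse_event(raw: str) -> Event:
    body = json.loads(raw)["event"]
    score = float(body["importance_score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"importance_score out of range: {score}")
    return Event(
        timestamp=float(body["timestamp"]),
        event_type=body["event_type"],
        semantic_text=body["semantic_text"],
        importance_score=score,
        causal_link=parse_causal_link(body.get("causal_link")),
        bounding_box=body.get("bounding_box"),
        keyframe_flag=bool(body.get("keyframe_flag", False)),
    )


raw = """
{"event": {"timestamp": 45.23, "event_type": "ui_interaction",
 "semantic_text": "user clicked submit button on login form",
 "importance_score": 0.92, "causal_link": "leads_to:46.1",
 "bounding_box": {"x": 312, "y": 487, "w": 180, "h": 44},
 "keyframe_flag": true}}
"""
event = parse_event(raw)
print(event.causal_link)  # ('leads_to', 46.1)
```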

Event types

ui_interaction, state_change, error_event, critical_action, navigation, form_submit, network_request, visual_change

Importance scoring heuristics

Score ≥ 0.9 : state-changing events (form submit, error, navigation)

Score 0.6 to <0.9 : intermediate interactions (click, hover, scroll)

Score 0.2 to <0.6 : idle / background changes

Score < 0.2 : filtered out during sampling
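
One way these tiers could drive frame selection is sketched below. The tier names, the `select_events` helper, and the top-k-by-score policy are assumptions for illustration; the sketch also assumes each range includes its lower bound, since the listed ranges share their boundary values.

```python
def importance_tier(score: float) -> str:
    """Map an importance score to a tier (lower bounds treated as inclusive)."""
    if score >= 0.9:
        return "state_changing"
    if score >= 0.6:
        return "intermediate"
    if score >= 0.2:
        return "idle"
    return "filtered"


def select_events(events: list, budget: int = 16, min_score: float = 0.2) -> list:
    """Drop low-score events, keep the `budget` highest, return in time order."""
    kept = [e for e in events if e["importance_score"] >= min_score]
    top = sorted(kept, key=lambda e: e["importance_score"], reverse=True)[:budget]
    return sorted(top, key=lambda e: e["timestamp"])


events = [
    {"timestamp": 3.0, "importance_score": 0.15},
    {"timestamp": 12.4, "importance_score": 0.65},
    {"timestamp": 45.23, "importance_score": 0.92},
    {"timestamp": 30.0, "importance_score": 0.40},
]
chosen = select_events(events, budget=2)
print([e["timestamp"] for e in chosen])  # [12.4, 45.23]
print(importance_tier(0.92))             # state_changing
```

Returning the selection in timestamp order keeps the frames in the sequence the model would see them in the video.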

Sync mechanism

Each event timestamp maps to a frame index using floor(t × fps). A ±3 frame window (±100ms at 30fps) is searched; the sharpest frame by Laplacian variance is selected as the canonical keyframe.
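
The mapping and sharpness search described above can be sketched with NumPy alone. This is a minimal version under stated assumptions: frames are already decoded into 2-D grayscale arrays, the Laplacian is approximated with a 4-neighbour stencil, and ties go to the earliest frame. The function names are hypothetical.

```python
import math

import numpy as np


def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of a 4-neighbour Laplacian over interior pixels (higher = sharper)."""
    g = gray.astype(np.float64)
    lap = (
        g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
        - 4.0 * g[1:-1, 1:-1]
    )
    return float(lap.var())


def canonical_keyframe(frames, timestamp: float, fps: float = 30.0, window: int = 3) -> int:
    """Return the index of the sharpest frame within +/- `window` of floor(t * fps)."""
    center = math.floor(timestamp * fps)  # e.g. 45.23 s at 30 fps maps to frame 1356
    lo = max(0, center - window)
    hi = min(len(frames) - 1, center + window)
    if lo > hi:
        raise ValueError("timestamp lies outside the decoded frames")
    return max(range(lo, hi + 1), key=lambda i: laplacian_variance(frames[i]))


# Synthetic check: 20 flat (blurry) frames, with one high-contrast frame at index 13.
frames = [np.full((8, 8), 128, dtype=np.uint8) for _ in range(20)]
frames[13] = ((np.indices((8, 8)).sum(axis=0) % 2) * 255).astype(np.uint8)
print(canonical_keyframe(frames, timestamp=0.41))  # 13
```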

06 — Evaluation

Baseline vs. metadata-guided

Measured across 5 video types, 3 question categories, using LLaVA-1.6-34B. Numbers are averaged over 3 runs with fixed random seeds.

Metric	Baseline (32 frames, uniform)	Metadata-guided (16 frames)	Delta
QA Accuracy (event)	54.2%	71.4%	+17.2pp
QA Accuracy (causal)	41.8%	79.3%	+37.5pp
Hallucination rate	34.1%	2.8%	−31.3pp
Frames consumed	32	16	−50%
Inference latency (avg)	8.4s	4.9s	−41.7%
Timestamp attribution accuracy	N/A	97.3%	new capability
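
The Delta column mixes two kinds of change: percentage-point differences for the accuracy and hallucination rows, and relative change for frames and latency. The sketch below reproduces both from the table values; the helper names are illustrative.

```python
def pp_delta(baseline_pct: float, new_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return round(new_pct - baseline_pct, 1)


def relative_change_pct(baseline: float, new: float) -> float:
    """Change relative to the baseline value, in percent."""
    return round(100.0 * (new - baseline) / baseline, 1)


print(pp_delta(54.2, 71.4))           # 17.2  (QA accuracy, event)
print(pp_delta(41.8, 79.3))           # 37.5  (QA accuracy, causal)
print(pp_delta(34.1, 2.8))            # -31.3 (hallucination rate)
print(relative_change_pct(32, 16))    # -50.0 (frames consumed)
print(relative_change_pct(8.4, 4.9))  # -41.7 (inference latency)
```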

Accuracy by video type (metadata-guided)

Video type	Accuracy
UI walkthrough	86%
Tutorial (linear)	81%
Action sequence	73%
Debugging session	79%
Form completion	88%

Accuracy by video type (baseline uniform)