Research Prototype · Metadata-Augmented Video Understanding
Embedding timestamped semantic metadata into video streams to bridge the frame-sampling gap in VLM video comprehension.
01 — Problem
State-of-the-art vision-language models sample only 8–64 frames from thousands. Critical events disappear in the gaps.
P.01
A 60-second video at 30 fps has 1,800 frames. With 32-frame uniform sampling, 98.2% of visual content is discarded. Samples land roughly 1.9 s apart, so an interaction lasting 0.1 s (3 frames) usually falls entirely between them.
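The figures above follow from simple arithmetic, which the short sketch below reproduces. The function name `sampling_coverage` and its return keys are illustrative and not part of the prototype.

```python
def sampling_coverage(duration_s: float, fps: float, sampled_frames: int) -> dict:
    """Return coverage statistics for uniform frame sampling."""
    total_frames = int(duration_s * fps)
    # Fraction of frames the model never sees.
    discarded_fraction = 1 - sampled_frames / total_frames
    # Average spacing between consecutive sampled frames.
    gap_frames = total_frames / sampled_frames
    return {
        "total_frames": total_frames,
        "discarded_pct": round(100 * discarded_fraction, 1),
        "gap_seconds": round(gap_frames / fps, 3),
    }


stats = sampling_coverage(duration_s=60, fps=30, sampled_frames=32)
print(stats)
```

For the 60-second example this gives 1,800 total frames, 98.2% discarded, and a gap of 1.875 s between samples, which is why a 0.1 s interaction is so easy to miss.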
P.02
Uniform sampling treats a loading spinner and a critical error state with equal priority. Models miss inflection points where causality is established.
P.03
Without information about what happened between sampled frames, models cannot reason about motion, cause-and-effect chains, or temporal sequencing.
P.04
Models confabulate plausible-but-wrong descriptions for gaps in sampled content. On tutorial videos, hallucination rates reach 34% on fine-grained event questions.
02 — Research Hypothesis
Metadata-guided frame selection should close the comprehension gap while reducing computational overhead.
03 — Architecture
Four modular components connect recording-time annotation to inference-time frame selection.
04 — Interactive Demo
Toggle between baseline uniform sampling and metadata-guided inference. Ask the same question in both modes and observe the difference in accuracy and attribution.
05 — Metadata Format
Each event is a JSON object embedded as a timed metadata track in the MP4 container. Parsers extract and align events within ±100ms of frame boundaries.
"event": {
  "timestamp": 45.23,            // seconds
  "event_type": "ui_interaction",
  "semantic_text": "user clicked submit button on login form",
  "importance_score": 0.92,      // 0–1
  "causal_link": "leads_to:46.1",
  "bounding_box": { "x": 312, "y": 487, "w": 180, "h": 44 },
  "keyframe_flag": true
}
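A minimal sketch of how a parser might load one such event is shown below. It makes several assumptions that the page does not confirm: the event is wrapped in an outer JSON object, the `//` annotations have been stripped (they are not valid JSON), and `causal_link` always takes the form `leads_to:<seconds>`. The `Event` class and `parse_event` function are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    timestamp: float                # seconds from the start of the video
    event_type: str
    semantic_text: str
    importance_score: float         # expected range 0-1
    causal_target: Optional[float]  # seconds, parsed from "leads_to:<t>" (assumed form)
    bounding_box: dict
    keyframe_flag: bool


def parse_event(raw: str) -> Event:
    """Parse one event object and validate its importance score."""
    body = json.loads(raw)["event"]
    score = float(body["importance_score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"importance_score out of range: {score}")
    link = body.get("causal_link")
    target = None
    if link and link.startswith("leads_to:"):
        target = float(link.split(":", 1)[1])
    return Event(
        timestamp=float(body["timestamp"]),
        event_type=body["event_type"],
        semantic_text=body["semantic_text"],
        importance_score=score,
        causal_target=target,
        bounding_box=body["bounding_box"],
        keyframe_flag=bool(body["keyframe_flag"]),
    )


SAMPLE = """{"event": {"timestamp": 45.23, "event_type": "ui_interaction",
"semantic_text": "user clicked submit button on login form",
"importance_score": 0.92, "causal_link": "leads_to:46.1",
"bounding_box": {"x": 312, "y": 487, "w": 180, "h": 44},
"keyframe_flag": true}}"""

event = parse_event(SAMPLE)
print(event.timestamp, event.causal_target)
```

Splitting `causal_link` into a numeric target makes it straightforward to follow a cause-and-effect chain by looking up the event at the linked timestamp.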
Event types
Importance scoring heuristics
Score 0.9+ : state-changing events (form submit, error, navigation)
Score 0.6–0.9 : intermediate interactions (click, hover, scroll)
Score 0.2–0.6 : idle / background changes
Score <0.2 : filtered out during sampling
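The tiers above can be sketched as a simple threshold function plus a filter. The tier labels, the handling of boundary values (a score of exactly 0.9, 0.6, or 0.2 is assigned to the higher tier), and the top-k selection policy are all assumptions made for illustration; the page does not specify how the prototype picks frames within its budget.

```python
from typing import Dict, List


def importance_tier(score: float) -> str:
    """Map an importance score to a tier (labels are hypothetical)."""
    if score >= 0.9:
        return "state_change"
    if score >= 0.6:
        return "interaction"
    if score >= 0.2:
        return "background"
    return "filtered"


def select_events(events: List[Dict], budget: int = 16) -> List[Dict]:
    """Drop events scoring below 0.2, keep the top `budget`, return in time order."""
    kept = [e for e in events if e["importance_score"] >= 0.2]
    top = sorted(kept, key=lambda e: e["importance_score"], reverse=True)[:budget]
    return sorted(top, key=lambda e: e["timestamp"])


events = [
    {"timestamp": 45.23, "importance_score": 0.92},
    {"timestamp": 44.80, "importance_score": 0.65},
    {"timestamp": 44.10, "importance_score": 0.15},
    {"timestamp": 46.10, "importance_score": 0.95},
]
chosen = select_events(events, budget=2)
print([e["timestamp"] for e in chosen])  # [45.23, 46.1]
```

The default budget of 16 mirrors the frame count used for the metadata-guided configuration in the evaluation; returning events in time order keeps the selected frames in temporal sequence.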
Sync mechanism
Each event timestamp t (in seconds) maps to a frame index using floor(t × fps). A ±3-frame window (±100ms at 30fps) around that index is searched, and the frame with the highest Laplacian variance, a common sharpness measure, is selected as the canonical keyframe.
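The sync step can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the prototype's implementation: frames are represented as 2D lists of grayscale values, the Laplacian uses a 4-neighbour kernel over interior pixels only, and the names `laplacian_variance` and `select_keyframe` are hypothetical. A real pipeline would likely use an image library instead.

```python
import math
from typing import List

Frame = List[List[float]]  # 2D grayscale grid, at least 3x3


def laplacian_variance(frame: Frame) -> float:
    """Variance of the 4-neighbour Laplacian over interior pixels."""
    h, w = len(frame), len(frame[0])
    vals = [
        frame[y - 1][x] + frame[y + 1][x] + frame[y][x - 1] + frame[y][x + 1]
        - 4 * frame[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)


def select_keyframe(frames: List[Frame], timestamp: float,
                    fps: float = 30.0, window: int = 3) -> int:
    """Return the index of the sharpest frame within +/- `window` of floor(t * fps)."""
    center = math.floor(timestamp * fps)
    lo = max(0, center - window)
    hi = min(len(frames) - 1, center + window)
    if lo > hi:
        raise ValueError("timestamp lies outside the video")
    return max(range(lo, hi + 1), key=lambda i: laplacian_variance(frames[i]))


# Toy data: nine flat (blurry) frames and one high-contrast checkerboard at index 4.
flat = [[0.0] * 4 for _ in range(4)]
sharp = [[255.0 * ((x + y) % 2) for x in range(4)] for y in range(4)]
frames = [flat] * 4 + [sharp] + [flat] * 5

best = select_keyframe(frames, timestamp=0.12, fps=30.0)
print(best)  # 4
```

Here the timestamp 0.12 s maps to frame 3, the window covers frames 0 to 6, and the checkerboard frame at index 4 wins because it is the only one with non-zero Laplacian variance.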
06 — Evaluation
Measured across 5 video types and 3 question categories using LLaVA-1.6-34B. Numbers are averaged over 3 runs with fixed random seeds.
| Metric | Baseline (32 frames, uniform) | Metadata-guided (16 frames) | Delta |
|---|---|---|---|
| QA Accuracy (event) | | | +17.2pp |
| QA Accuracy (causal) | | | +37.5pp |
| Hallucination rate | | | −31.3pp |
| Frames consumed | 32 | 16 | −50% |
| Inference latency (avg) | 8.4s | 4.9s | −41.7% |
| Timestamp attribution accuracy | N/A | 97.3% | new capability |