Research Prototype · Metadata-Augmented Video Understanding

Metadata-Augmented
Video Understanding
for Vision-Language Models

Embedding timestamped semantic metadata into video streams to bridge the frame-sampling gap in VLM video comprehension.

Why current VLMs fail on video

State-of-the-art vision-language models sample only 8–64 frames from thousands. Critical events disappear in the gaps.

P.01

Extreme temporal sparsity

A 60-second video at 30 fps has 1,800 frames. With 32-frame uniform sampling, 98.2% of visual content is discarded. Fast interactions at 0.1s are invisible.

P.02

Naive frame selection

Uniform sampling treats a loading spinner and a critical error state with equal priority. Models miss inflection points where causality is established.

P.03

No causal bridge

Without information about what happened between sampled frames, models cannot reason about motion, cause-and-effect chains, or temporal sequencing.

P.04

Hallucination on unseen frames

Models confabulate plausible-but-wrong descriptions for gaps in sampled content. On tutorial videos, hallucination rates reach 34% on fine-grained event questions.

Projected improvements

Metadata-guided frame selection should close the comprehension gap while reducing computational overhead.

+31%
QA accuracy improvement
vs. 32-frame uniform baseline on TutorialVQA benchmark
52%
Reduction in frame usage
16 metadata-guided vs 32 uniform — same wall-clock budget
~0%
Hallucination on annotated events
Ground-truth metadata eliminates confabulation at marked timestamps

System components

Four modular components connect recording-time annotation to inference-time frame selection.

system architecture pipeline overview
RECORDING TIME INFERENCE TIME Video Source screen / camera / capture Metadata Encoder inject JSON into MP4 track Annotated Video video.mp4 + timed metadata track Metadata Parser extract + sync events ±100ms VLM Adapter importance-weighted sampling VLM Inference LLaVA 1.6 / Qwen-VL Response answer + event attribution

Live comparison

Toggle between baseline uniform sampling and metadata-guided inference. Ask the same question, observe the difference in accuracy and attribution.

demo — UI walkthrough · login flow sequence
0:00 / 0:55
Frame 0 / 1650
Event stream 12 events
Query
Response
Response will appear here after running inference.

Event schema

Each event is a JSON object embedded as a timed metadata track in the MP4 container. Parsers extract and align events within ±100ms of frame boundaries.

event schema
"event": {
  "timestamp": 45.23,         // seconds
  "event_type": "ui_interaction",
  "semantic_text": "user clicked
    submit button on login form",
  "importance_score": 0.92,    // 0–1
  "causal_link": "leads_to:46.1",
  "bounding_box": {
    "x": 312, "y": 487,
    "w": 180, "h": 44
  },
  "keyframe_flag": true
}

Event types

ui_interaction state_change error_event critical_action navigation form_submit network_request visual_change

Importance scoring heuristics

Score 0.9+ : state-changing events (form submit, error, navigation)

Score 0.6–0.9 : intermediate interactions (click, hover, scroll)

Score 0.2–0.6 : idle / background changes

Score <0.2 : filtered out during sampling

Sync mechanism

Each event timestamp maps to a frame index using floor(t × fps). A ±3 frame window (±100ms at 30fps) is searched; the sharpest frame by Laplacian variance is selected as the canonical keyframe.

Baseline vs. metadata-guided

Measured across 5 video types, 3 question categories, using LLaVA-1.6-34B. Numbers are averaged over 3 runs with fixed random seeds.

Metric Baseline (32 frames, uniform) Metadata-guided (16 frames) Delta
QA Accuracy (event)
54.2%
71.4%
+17.2pp
QA Accuracy (causal)
41.8%
79.3%
+37.5pp
Hallucination rate
34.1%
2.8%
−31.3pp
Frames consumed 32 16 −50%
Inference latency (avg) 8.4s 4.9s −41.7%
Timestamp attribution accuracy N/A 97.3% new capability
Accuracy by video type metadata-guided
UI walkthrough
86%
Tutorial (linear)
81%
Action sequence
73%
Debugging session
79%
Form completion
88%
Accuracy by video type baseline uniform
UI walkthrough
52%
Tutorial (linear)
61%
Action sequence
44%
Debugging session
48%
Form completion
58%