Documentation Index

Fetch the complete documentation index at: https://docs.mavera.io/llms.txt

Use this file to discover all available pages before exploring further.
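
For example, a quick way to pull the index for inspection (a minimal sketch using requests):

import requests

index = requests.get("https://docs.mavera.io/llms.txt", timeout=30).text
print(index[:500])  # preview the available pages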

Scenario

Before sending video content to Mavera’s analysis pipeline, use GPT-4.1’s vision capability to describe keyframes. Extract frames from a video (one per 10 seconds), send each to GPT-4.1 vision for a description, then aggregate those descriptions and pass them to Mavera for marketing analysis — scene composition, brand alignment, emotional tone, and recommendations. Flow: Extract keyframes → OpenAI GPT-4.1 vision (describe frames) → aggregate → Mavera POST /mave/chat → Marketing analysis

Code

import os, requests, time, base64, glob
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
MV = os.environ["MAVERA_API_KEY"]
MV_BASE = "https://app.mavera.io/api/v1"
MV_H = {"Authorization": f"Bearer {MV}", "Content-Type": "application/json"}

VIDEO_FILE = "product-demo.mp4"
FRAME_DIR = "frames"

# 1. Extract keyframes (one every 10 seconds)
os.makedirs(FRAME_DIR, exist_ok=True)
os.system(f'ffmpeg -i "{VIDEO_FILE}" -vf "fps=1/10" -q:v 2 {FRAME_DIR}/frame_%04d.jpg -y 2>/dev/null')
frames = sorted(glob.glob(f"{FRAME_DIR}/frame_*.jpg"))
print(f"Extracted {len(frames)} keyframes from {VIDEO_FILE}")

# 2. Describe each frame with GPT-4.1 vision (cap at 20 frames)
selected = frames[:20]
descriptions = []
for i, frame_path in enumerate(selected):
    with open(frame_path, "rb") as img:
        b64 = base64.b64encode(img.read()).decode()

    resp = client.responses.create(
        model="gpt-4.1",
        input=[{"role": "user", "content": [
            {"type": "input_text", "text": "Describe this video frame in 2-3 sentences. "
                "Focus on: subjects, actions, setting, colors, text overlays, branding, mood."},
            {"type": "input_image", "image_url": f"data:image/jpeg;base64,{b64}", "detail": "low"},
        ]}],
        max_output_tokens=200,
    )
    desc = resp.output_text
    descriptions.append(f"[{i * 10}s] {desc}")  # tag with the approximate timestamp
    print(f"Frame {i+1}/{len(selected)}: {desc[:80]}...")
    time.sleep(0.5)  # light pacing between vision calls

# 3. Aggregate and send to Mavera
frame_corpus = "\n\n".join(descriptions)
time.sleep(1)  # brief pause before the aggregate request

resp = requests.post(f"{MV_BASE}/mave/chat", headers=MV_H, json={
    "message": f"Video marketing analyst. Analyze these {len(descriptions)} frame descriptions.\n\n"
        f"FRAME DESCRIPTIONS:\n{frame_corpus[:10000]}\n\n"
        "Produce:\n"
        "1. **VISUAL NARRATIVE ARC** — Story across frames\n"
        "2. **BRAND CONSISTENCY** — Colors, logo, typography\n"
        "3. **EMOTIONAL PROGRESSION** — Mood shifts\n"
        "4. **KEY SCENES** — Most impactful frames\n"
        "5. **CONTENT GAPS** — Missing elements\n"
        "6. **RECOMMENDATIONS** — Improvements for conversion\n"
}, timeout=120)
resp.raise_for_status()
analysis = resp.json()

print(f"\n{'='*60}\nVIDEO MARKETING ANALYSIS\n{'='*60}")
print(analysis.get("content", "")[:4000])

Example Output

Extracted 18 keyframes, described 18 frames (4,271 chars)

VISUAL NARRATIVE ARC
Problem → solution → proof. Frames 1-4 set context, 5-10 show
the product, 11-16 results, 17-18 CTA.

BRAND CONSISTENCY
- Blue (#2563EB) and white consistent in 16/18 frames
- Logo visible in 11/18 frames

KEY SCENES
- Frame 5 [40s]: Dashboard reveal with live data
- Frame 12 [110s]: "340% pipeline growth" metric overlay

CONTENT GAPS
- No social proof overlay
- Missing pricing anchor before CTA

Notes

Each image consumes tokens based on detail level. Use detail: "low" (fixed 85 tokens) for keyframe descriptions. Switch to "high" only for frames requiring fine-grained text extraction.
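
As a rough budget check, a back-of-envelope sketch; the 85-token figure for low detail is from the note above, while PROMPT_TOKENS is an illustrative guess, not a measured value:

IMAGE_TOKENS_LOW = 85   # fixed cost per image at detail="low"
PROMPT_TOKENS = 40      # assumed size of the per-frame instruction
frames_sent = 20

total_input = frames_sent * (IMAGE_TOKENS_LOW + PROMPT_TOKENS)
print(f"~{total_input} input tokens for {frames_sent} low-detail frames")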
Install with brew install ffmpeg (macOS), apt install ffmpeg (Ubuntu), or choco install ffmpeg (Windows). Alternatively, use opencv-python to extract frames programmatically.
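
A minimal opencv-python sketch of the same extraction (assumes pip install opencv-python; writes one JPEG roughly every 10 seconds):

import cv2, os

os.makedirs("frames", exist_ok=True)
cap = cv2.VideoCapture("product-demo.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 30   # fall back if FPS metadata is missing
step = int(fps * 10)                    # one frame every ~10 seconds

idx = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % step == 0:
        cv2.imwrite(f"frames/frame_{saved:04d}.jpg", frame)
        saved += 1
    idx += 1
cap.release()
print(f"Saved {saved} keyframes")

This decodes every frame, which is fine for short clips; for long videos, seeking with cap.set(cv2.CAP_PROP_POS_MSEC, ...) before each read is faster.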
A 10-minute video at 1 frame/10s yields 60 frames. Cap at 20 and increase the interval for longer videos. Adjust the ffmpeg filter to fps=1/30 for 30-minute+ content.
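
Instead of hard-coding the interval, one option is to derive it from the video's duration using ffprobe (ships with ffmpeg) and the frame cap, a minimal sketch:

import subprocess

MAX_FRAMES = 20  # cap used in the code above

duration = float(subprocess.check_output([
    "ffprobe", "-v", "error", "-show_entries", "format=duration",
    "-of", "default=noprint_wrappers=1:nokey=1", "product-demo.mp4",
]))
interval = max(10, int(duration / MAX_FRAMES))  # never sample finer than 10s
print(f'{duration:.0f}s video -> use -vf "fps=1/{interval}"')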