Specification · Version 0.1 Draft

Video Prompt Markup Language

VPromptML — an XML-like markup language for structuring AI video prompts.

Storyboard-readable. Parser-ready. Built for AI video generation.

Overview

VPromptML is an XML-like markup language for structuring AI video prompts: timed lyrics, scene prompts, opening-frame prompts, camera movement, characters, continuity, emotional arcs, and generation instructions.

It is not a video rendering language. It is a pre-production prompt specification language that describes what should be generated before any assets exist — what the AI should create, what the opening frame should look like, how the shot should move, and how one scene continues from another.

Positioning

VPromptML is designed for:

AI music video generation
Lyric-synchronized scene prompting
Cinematic storyboard prompting
Opening-frame generation
Video-motion prompt generation
Shot continuity & character consistency
Metaphor-driven visual storytelling and emotional arc planning

Traditional video markup formats describe existing media assets — clips, images, audio, subtitles, overlays, layers, timelines. VPromptML instead describes intent before generation.

File extensions

Use .vprompt.xml for public, readable, XML-compatible documents. Use .vpml only as a compact internal extension.

Example: gana.vprompt.xml

Root element

Every VPromptML document must have exactly one root element: <vprompt>.

<vprompt version="0.1" type="musicVideo" lang="lt"
         aspectRatio="16:9" resolution="1920x1080">
  ...
</vprompt>

Required attribute: version — the specification version.

Optional attributes: type (musicVideo, shortFilm, ad, trailer, lyricVideo, storyboard), lang, aspectRatio, fps, resolution.
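The root-element rules above can be checked mechanically. The following is a minimal sketch in Python using the standard library; the helper name check_root is hypothetical, and treating an unlisted type value as a problem is an assumption, since this specification does not say whether the list of type values is closed.

```python
import xml.etree.ElementTree as ET

# Type values copied from the list of optional attributes above.
# Assumption: any other value is reported, although the spec does not
# state that the list is closed.
KNOWN_TYPES = {"musicVideo", "shortFilm", "ad", "trailer", "lyricVideo", "storyboard"}


def check_root(xml_text: str) -> list[str]:
    """Return a list of problems found on the document root element."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != "vprompt":
        problems.append(f"root element is <{root.tag}>, expected <vprompt>")
    if "version" not in root.attrib:
        problems.append("missing required attribute: version")
    doc_type = root.get("type")
    if doc_type is not None and doc_type not in KNOWN_TYPES:
        problems.append(f"unknown type: {doc_type}")
    return problems


print(check_root('<vprompt version="0.1" type="musicVideo" lang="lt" />'))
print(check_root('<vprompt type="podcast" />'))
```

Because an XML parser rejects input with more than one top-level element, the "exactly one root" rule is enforced by parsing itself.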

Document structure

The recommended top-level structure is:

<vprompt version="0.1" type="musicVideo" lang="lt" aspectRatio="16:9">

  <meta> ... </meta>

  <project> ... </project>

  <globalPrompt> ... </globalPrompt>

  <characters> ... </characters>

  <scenes> ... </scenes>

</vprompt>

Required: <globalPrompt> and <scenes>. Recommended: <meta>, <project>, and <characters>.

project

Defines the creative project in human-readable form. Allowed children: name, type, summary, theme, mainMetaphor, emotionalArc, targetAudience, creativeNotes.

<project>
  <name>Gana</name>
  <type>cinematic music video</type>
  <summary>
    A mature tired man crosses a frozen lake filled with frozen men.
    He almost freezes too, but at the word "Gana" becomes an iron human.
  </summary>
  <theme>Refusal to remain obedient, silent, and emotionally frozen.</theme>
  <mainMetaphor>A frozen lake as emotional numbness and obedience.</mainMetaphor>
  <emotionalArc>
    Numbness -&gt; pressure -&gt; inner cracking -&gt; refusal -&gt; iron self-possession.
  </emotionalArc>
</project>

globalPrompt

Defines global creative instructions applied to all scenes unless locally overridden. Recommended children: storyWorld, visualEvolution, style, rules, negativePrompt, renderingNotes.

<globalPrompt>
  <storyWorld>
    Cinematic dark emotional music video. Main metaphor: a vast frozen
    lake at winter dawn, filled with frozen human figures.
  </storyWorld>

  <visualEvolution>
    First numbness and waiting; then ice pressure and inner cracking;
    then the decisive transformation.
  </visualEvolution>

  <style>
    16:9 cinematic frame, full-body shots preferred. Cold blue-gray dawn,
    black ice, white frost, fog, weak sun on horizon.
  </style>

  <rules>
    <rule>No random scenery.</rule>
    <rule>Every scene must advance the story.</rule>
  </rules>

  <negativePrompt>
    No modern cars. No smiling extras. No unrelated fantasy creatures.
  </negativePrompt>
</globalPrompt>

characters

Defines recurring characters, symbolic figures, and identity continuity. Each <character> requires a unique id and may declare a role. Recommended children: name, appearance, costume, acting, symbolicMeaning, continuityRules.

<characters>
  <character id="singer" role="main">
    <name>Mature tired male singer</name>
    <appearance>Dark coat, worn face, deep eyes, restrained acting.</appearance>
    <symbolicMeaning>A man almost frozen by silence and obedience.</symbolicMeaning>
  </character>

  <character id="frozenMen" role="symbolicCollective">
    <name>Frozen men</name>
    <appearance>Male human figures frozen into ice statues.</appearance>
    <symbolicMeaning>Monuments of obedience, silence, and emotional death.</symbolicMeaning>
  </character>
</characters>
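The requirement that each character has a unique id can be verified with a short script. This is a sketch only; the function name check_character_ids and the wording of the messages are illustrative assumptions, not part of the specification.

```python
import xml.etree.ElementTree as ET


def check_character_ids(characters_xml: str) -> list[str]:
    """Report <character> elements with a missing or repeated id."""
    root = ET.fromstring(characters_xml)
    problems = []
    seen = set()
    for index, character in enumerate(root.iter("character"), start=1):
        cid = character.get("id")
        if not cid:
            problems.append(f"character #{index} has no id")
        elif cid in seen:
            problems.append(f"duplicate character id: {cid}")
        else:
            seen.add(cid)
    return problems


sample = """
<characters>
  <character id="singer" role="main" />
  <character id="frozenMen" role="symbolicCollective" />
</characters>
"""
print(check_character_ids(sample))
```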

scenes & scene

<scenes> contains one or more <scene> elements in chronological order, each with a unique id.

<scenes>
  <scene id="s001" start="00:00" end="00:07" audio="Gana v1 20260613.mp3">
    ...
  </scene>

  <scene id="s002" start="00:07" end="00:15" audio="Gana v1 20260613.mp3">
    ...
  </scene>
</scenes>

A <scene> defines one timed AI video generation segment. Required attributes: id, start, end. Optional attributes: audio, duration, type (intro, verse, chorus, bridge, instrumental, climax, outro, transition).

<scene id="s001" start="00:00" end="00:07"
       audio="Gana v1 20260613.mp3" type="intro">

  <lyric type="lyric">Ilgai tylėjo, galvą nuleidęs.</lyric>

  <intent>Establish numbness and the frozen world.</intent>

  <charactersPresent>
    <ref character="singer" />
    <ref character="frozenMen" />
  </charactersPresent>

  <emotion start="numbness" end="recognition">
    He begins to understand he is becoming one of the frozen men.
  </emotion>

  <openingFrame continuity="new" shot="wide" frame="fullBody">
    Wide cinematic still frame, winter dawn over a vast frozen lake,
    black ice, cold fog, weak blue-gray sunrise. A tired man stands
    full-body, head lowered, among countless frozen figures.
  </openingFrame>

  <camera movement="pushIn" shot="wide">Slow push from behind.</camera>

  <motionPrompt>
    The camera slowly pushes toward the man from behind. His coat moves
    in cold wind. The frozen men remain completely still around him.
  </motionPrompt>

  <negativePrompt>No modern buildings. No cheerful lighting.</negativePrompt>
</scene>

Scene elements

Required scene children are <lyric>, <openingFrame>, and <motionPrompt>. The rest are recommended or optional.

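<lyric>

Holds the lyric text that plays during the scene. The examples in this specification use three values for the type attribute: lyric, instrumental, and vocalization. The minimal valid document omits type.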

<lyric type="lyric">Gyvenimui leidau spręst už save.</lyric>

<lyric type="instrumental">[No speech detected]</lyric>

<lyric type="vocalization">Ooh-ooh-ooh</lyric>

<intent>

Defines the dramatic purpose of the scene. Useful for human review, prompt generation, and validation — not necessarily sent directly to a generator.

<openingFrame>

Defines the first visual frame of the generated shot. The name is video-native and replaces the older "opening image". Required attribute: continuity (new | extend). Optional attributes: shot (wide, medium, closeup, aerial, lowAngle, highAngle, extremeCloseup), frame (fullBody, portrait, landscape, symbolic, detail), and cameraAngle.

The first scene must use continuity="new". Use continuity="extend" to continue the previous scene's location, setup, or symbolic environment without contradicting it.

<openingFrame continuity="extend" shot="lowAngle" frame="fullBody">
  Same frozen lake, same cold dawn, continuous from previous shot.
  Low-angle frame near the man's boots as he begins walking.
</openingFrame>
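The continuity rules for <openingFrame> lend themselves to an automated check. A possible sketch follows; check_continuity is a hypothetical helper, and it assumes scenes appear in document order under a single parent element.

```python
import xml.etree.ElementTree as ET

ALLOWED_CONTINUITY = {"new", "extend"}


def check_continuity(scenes_xml: str) -> list[str]:
    """Check the continuity attribute on each scene's <openingFrame>."""
    root = ET.fromstring(scenes_xml)
    problems = []
    for index, scene in enumerate(root.iter("scene")):
        sid = scene.get("id")
        frame = scene.find("openingFrame")
        if frame is None:
            problems.append(f"{sid}: missing <openingFrame>")
            continue
        mode = frame.get("continuity")
        if mode not in ALLOWED_CONTINUITY:
            problems.append(f"{sid}: continuity must be new or extend, got {mode!r}")
        elif index == 0 and mode != "new":
            # The first scene has nothing to extend.
            problems.append(f'{sid}: first scene must use continuity="new"')
    return problems


good = """
<scenes>
  <scene id="s001"><openingFrame continuity="new">Frozen lake.</openingFrame></scene>
  <scene id="s002"><openingFrame continuity="extend">Same lake.</openingFrame></scene>
</scenes>
"""
print(check_continuity(good))
```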

<motionPrompt>

Describes how the shot develops from the opening frame: camera movement, subject movement and acting, environmental motion, symbolic transformation, and the end state of the shot.

<charactersPresent>

Lists characters visible in the scene via <ref character="id" />. Each ref should match a defined character; undefined references should trigger a validation warning.
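The reference check described above might be implemented along these lines. The function name undefined_refs and the shape of its return value are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def undefined_refs(doc_xml: str) -> list[tuple[str, str]]:
    """Return (scene id, character id) pairs whose ref has no matching <character>."""
    root = ET.fromstring(doc_xml)
    defined = {c.get("id") for c in root.iter("character")}
    warnings = []
    for scene in root.iter("scene"):
        for ref in scene.iter("ref"):
            target = ref.get("character")
            if target not in defined:
                warnings.append((scene.get("id"), target))
    return warnings


DOC_WITH_REFS = """
<vprompt version="0.1">
  <characters>
    <character id="singer" role="main" />
  </characters>
  <scenes>
    <scene id="s001" start="00:00" end="00:07">
      <charactersPresent>
        <ref character="singer" />
        <ref character="frozenMen" />
      </charactersPresent>
    </scene>
  </scenes>
</vprompt>
"""
# frozenMen is referenced but never defined, so it is reported.
print(undefined_refs(DOC_WITH_REFS))
```

Returning the pairs as warnings rather than raising an exception follows the wording above, which asks for a validation warning and not a hard failure.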

<emotion>

Defines emotional state or transition. Optional start and end attributes capture the emotional arc within a single scene.

<camera>

Technical and cinematic camera direction, kept separate from the motion prompt. Optional attributes: movement (pushIn, pullBack, track, dolly, crane, aerial, static, handheld) and shot.

<continuity>

Optional element that explains continuity in more detail than the continuity attribute on <openingFrame> allows. It carries a mode of new | extend.

<negativePrompt>

Defines forbidden content, either globally (inside <globalPrompt>) or per-scene.

<export>

Optionally links scene instructions to generated assets or downstream rendering formats via a target attribute and one or more <asset /> children.

<export target="vsml">
  <asset type="openingFrame" src="./generated/s001.png" />
  <asset type="video" src="./generated/s001.mp4" />
</export>

Timestamps

Preferred format is MM:SS (e.g. start="01:52"); the extended HH:MM:SS form is also allowed. Rules:

start must be earlier than end
scenes should normally be chronological
overlaps should be flagged unless explicitly allowed
gaps should be flagged unless intentional
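The timestamp rules above can be sketched as follows. The names parse_timestamp and check_timing are hypothetical, scenes are passed as plain (id, start, end) tuples for brevity, and the sketch assumes whole-second timestamps as in the examples.

```python
def parse_timestamp(value: str) -> int:
    """Convert an MM:SS or HH:MM:SS timestamp to whole seconds."""
    parts = value.split(":")
    if len(parts) not in (2, 3) or not all(p.isdigit() for p in parts):
        raise ValueError(f"invalid timestamp: {value!r}")
    seconds = 0
    for part in parts:
        seconds = seconds * 60 + int(part)
    return seconds


def check_timing(scenes: list[tuple[str, str, str]]) -> list[str]:
    """Return warnings for (id, start, end) tuples given in document order."""
    warnings = []
    previous_end = None
    for sid, start, end in scenes:
        s, e = parse_timestamp(start), parse_timestamp(end)
        if s >= e:
            warnings.append(f"{sid}: start must be earlier than end")
        if previous_end is not None:
            if s < previous_end:
                warnings.append(f"{sid}: overlaps previous scene by {previous_end - s}s")
            elif s > previous_end:
                warnings.append(f"{sid}: gap of {s - previous_end}s after previous scene")
        previous_end = e
    return warnings


print(parse_timestamp("01:52"))  # 112
print(check_timing([("a", "00:00", "00:10"), ("b", "00:12", "00:20")]))
```

The sketch does not reject out-of-range fields such as 00:75, and it has no way to mark a gap or overlap as intentional; both would need conventions that this specification does not yet define.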

Naming & text rules

Element and attribute names use camelCase where applicable (aspectRatio, openingFrame, motionPrompt). Avoid snake_case, PascalCase, and kebab-case. Short single-word names (id, type, role, lang, start, end, audio, fps) are allowed as-is.

Text inside elements may be natural language spanning multiple lines. Escape literal <, >, and &, and preserve lyric line breaks where timing or phrasing matters. Self-closing tags are allowed for references and empty technical metadata (e.g. <ref character="singer" />).
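When element text is produced by a tool rather than typed by hand, the escaping rule above can be applied automatically. The sketch below uses Python's xml.sax.saxutils; the helper lyric_element is a hypothetical name introduced only for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def lyric_element(text: str, lyric_type: str = "lyric") -> str:
    """Build a <lyric> element with escaped text and a safely quoted attribute."""
    return f"<lyric type={quoteattr(lyric_type)}>{escape(text)}</lyric>"


markup = lyric_element("Rock & roll <live>")
print(markup)

# Parsing the result recovers the original, unescaped text.
print(ET.fromstring(markup).text)
```

escape handles &, <, and > in element text, while quoteattr adds the surrounding quotes and escapes whatever the attribute value needs.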

Strict scene template

Recommended production scene template:

<scene id="" start="" end="" audio="" type="">
  <lyric type=""></lyric>

  <intent></intent>

  <charactersPresent>
    <ref character="" />
  </charactersPresent>

  <emotion start="" end=""></emotion>

  <openingFrame continuity="" shot="" frame=""></openingFrame>

  <camera movement="" shot=""></camera>

  <motionPrompt></motionPrompt>

  <negativePrompt></negativePrompt>

  <export target="">
    <asset type="" src="" />
  </export>
</scene>

Canonical example

<vprompt version="0.1" type="musicVideo" lang="lt"
         aspectRatio="16:9" resolution="1920x1080">
  <meta>
    <title>Gana</title>
    <author>Ilja Laurs</author>
    <audio default="Gana v1 20260613.mp3" />
    <targetFormat aspectRatio="16:9" resolution="1920x1080" fps="30" />
  </meta>

  <project>
    <name>Gana</name>
    <type>cinematic music video</type>
    <emotionalArc>Numbness -&gt; refusal -&gt; iron self-possession.</emotionalArc>
  </project>

  <globalPrompt>
    <storyWorld>
      A vast frozen lake at winter dawn, filled with men turned to ice.
    </storyWorld>
  </globalPrompt>

  <characters>
    <character id="singer" role="main">
      <name>Mature tired male singer</name>
    </character>
  </characters>

  <scenes>
    <scene id="s001" start="00:00" end="00:07" audio="Gana v1 20260613.mp3" type="intro">
      <lyric type="lyric">Ilgai tylėjo, galvą nuleidęs.</lyric>
      <openingFrame continuity="new" shot="wide" frame="fullBody">
        Wide still frame, winter dawn over a vast frozen lake.
      </openingFrame>
      <motionPrompt>
        The camera slowly pushes toward the man from behind.
      </motionPrompt>
    </scene>
  </scenes>
</vprompt>

Minimal valid document

<vprompt version="0.1" type="musicVideo" lang="lt" aspectRatio="16:9">
  <globalPrompt>
    <storyWorld>
      Cinematic emotional music video. One man walks across a frozen
      lake and slowly transforms from numbness into strength.
    </storyWorld>
  </globalPrompt>

  <scenes>
    <scene id="s001" start="00:00" end="00:07" audio="song.mp3">
      <lyric>He lowers his head.</lyric>
      <openingFrame continuity="new" shot="wide" frame="fullBody">
        Wide still frame of a man standing alone on a frozen lake at dawn.
      </openingFrame>
      <motionPrompt>
        The camera slowly pushes toward him as cold fog moves across the ice.
      </motionPrompt>
    </scene>
  </scenes>
</vprompt>

JSON equivalence

Every VPromptML document should be convertible into JSON, so it can stay readable like XML while being transformable into structured data for prompt engines, validators, and exporters.

<scene id="s001" start="00:00" end="00:07" audio="song.mp3">
  <lyric>He lowers his head.</lyric>
  <openingFrame continuity="new">Wide frozen lake.</openingFrame>
  <motionPrompt>Camera pushes forward.</motionPrompt>
</scene>

Equivalent JSON

{
  "id": "s001",
  "start": "00:00",
  "end": "00:07",
  "audio": "song.mp3",
  "lyric": { "text": "He lowers his head." },
  "openingFrame": { "continuity": "new", "text": "Wide frozen lake." },
  "motionPrompt": { "text": "Camera pushes forward." }
}
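One way to produce this mapping is sketched below. It is an illustration of the convention shown in the example (scene attributes become top-level keys, each child becomes an object holding its attributes plus a text key), not a normative conversion; the function name scene_to_dict is an assumption.

```python
import json
import xml.etree.ElementTree as ET


def scene_to_dict(scene: ET.Element) -> dict:
    """Convert one <scene> element to a JSON-ready dictionary."""
    result = dict(scene.attrib)
    for child in scene:
        entry = dict(child.attrib)
        # Collapse runs of whitespace so indented multi-line text becomes one line.
        text = " ".join((child.text or "").split())
        if text:
            entry["text"] = text
        result[child.tag] = entry
    return result


SCENE_XML = """
<scene id="s001" start="00:00" end="00:07" audio="song.mp3">
  <lyric>He lowers his head.</lyric>
  <openingFrame continuity="new">Wide frozen lake.</openingFrame>
  <motionPrompt>Camera pushes forward.</motionPrompt>
</scene>
"""
print(json.dumps(scene_to_dict(ET.fromstring(SCENE_XML)), indent=2))
```

This sketch has known limits: it flattens lyric line breaks, ignores nested children such as the <ref /> elements inside <charactersPresent>, and keeps only the last child when a tag repeats. A production converter would need explicit rules for each of those cases.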

Validation rules

Document: exactly one <vprompt> root with a version; <globalPrompt> and <scenes> must exist; <scenes> must contain at least one scene.

Scenes: every scene needs the attributes id, start, and end, and must contain <lyric>, <openingFrame>, and <motionPrompt>. Every <openingFrame> needs a continuity attribute; the <openingFrame> of the first scene must use continuity="new"; start must precede end.

References & timing: every <ref character=""> should match a defined character; overlaps and unintentional gaps should trigger warnings.

Prompt quality: opening frames describe a still first frame, motion prompts describe motion from that frame, character identity stays consistent, and the global metaphor remains visible across the sequence.
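The structural rules in this section (document and scene level) can be combined into a single pass. The following is a minimal sketch, not a reference validator: the function name validate and the message wording are assumptions, and the prompt-quality rules are left out because they need human or model judgment.

```python
import xml.etree.ElementTree as ET

REQUIRED_SCENE_ATTRS = ("id", "start", "end")
REQUIRED_SCENE_CHILDREN = ("lyric", "openingFrame", "motionPrompt")


def validate(doc_xml: str) -> list[str]:
    """Return structural errors for a VPromptML document; empty means none found."""
    root = ET.fromstring(doc_xml)
    errors = []
    if root.tag != "vprompt" or "version" not in root.attrib:
        errors.append("root must be <vprompt> with a version attribute")
    if root.find("globalPrompt") is None:
        errors.append("missing <globalPrompt>")
    scenes = root.find("scenes")
    scene_list = [] if scenes is None else scenes.findall("scene")
    if scenes is None:
        errors.append("missing <scenes>")
    elif not scene_list:
        errors.append("<scenes> must contain at least one <scene>")
    for index, scene in enumerate(scene_list, start=1):
        label = scene.get("id") or f"scene #{index}"
        for attr in REQUIRED_SCENE_ATTRS:
            if attr not in scene.attrib:
                errors.append(f"{label}: missing attribute {attr}")
        for tag in REQUIRED_SCENE_CHILDREN:
            if scene.find(tag) is None:
                errors.append(f"{label}: missing <{tag}>")
    return errors


BROKEN_DOC = """
<vprompt version="0.1">
  <scenes>
    <scene id="s001" start="00:00"><lyric>He lowers his head.</lyric></scene>
  </scenes>
</vprompt>
"""
for message in validate(BROKEN_DOC):
    print(message)
```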

Rendering pipeline

Recommended workflow from markup to final video:

VPromptML
  -> scene prompt extraction
  -> opening-frame generation
  -> video motion generation
  -> generated PNG / MP4 assets
  -> video composition timeline
  -> final rendered video
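The first step of this workflow, scene prompt extraction, could look roughly like the sketch below. Everything about the output shape is an assumption: the key names imagePrompt and videoPrompt, the choice to prepend <storyWorld> to each opening frame, and the choice to concatenate global and per-scene negative prompts are illustrative, since the specification does not define how global instructions are merged or overridden.

```python
import xml.etree.ElementTree as ET


def _text(parent, tag: str) -> str:
    """Return the whitespace-normalised text of a direct child, or ''."""
    node = None if parent is None else parent.find(tag)
    if node is None:
        return ""
    return " ".join((node.text or "").split())


def extract_prompts(doc_xml: str) -> list[dict]:
    """Build one generation job per scene (hypothetical output format)."""
    root = ET.fromstring(doc_xml)
    global_prompt = root.find("globalPrompt")
    world = _text(global_prompt, "storyWorld")
    global_negative = _text(global_prompt, "negativePrompt")
    jobs = []
    for scene in root.iter("scene"):
        scene_negative = _text(scene, "negativePrompt")
        jobs.append({
            "scene": scene.get("id"),
            "imagePrompt": " ".join(filter(None, [world, _text(scene, "openingFrame")])),
            "videoPrompt": _text(scene, "motionPrompt"),
            "negativePrompt": " ".join(filter(None, [global_negative, scene_negative])),
        })
    return jobs


PIPELINE_DOC = """
<vprompt version="0.1">
  <globalPrompt>
    <storyWorld>Frozen lake at dawn.</storyWorld>
    <negativePrompt>No modern cars.</negativePrompt>
  </globalPrompt>
  <scenes>
    <scene id="s001" start="00:00" end="00:07">
      <lyric>He lowers his head.</lyric>
      <openingFrame continuity="new">Wide still frame.</openingFrame>
      <motionPrompt>Camera pushes forward.</motionPrompt>
      <negativePrompt>No cheerful lighting.</negativePrompt>
    </scene>
  </scenes>
</vprompt>
"""
print(extract_prompts(PIPELINE_DOC))
```

Each resulting job pairs a still-image prompt with a motion prompt, which mirrors the opening-frame and video-motion generation steps listed above.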

Design principles

Prompt-first, not asset-first — describe what to generate before assets exist.
One root document — every file begins with <vprompt> and ends with </vprompt>.
Scene objects — each scene has timing, lyric, opening frame, and motion prompt.
Image before motion — openingFrame defines the first frame, motionPrompt what happens next.
Continuity is explicit — each scene declares new setup or extension of the last.
Story drives visuals — built for metaphor, emotion, and transformation.
Human-readable, parser-ready — a human can write it, a machine can validate it.

End of VPromptML 0.1 specification.