Technical Deep Dive

How NovelVids Works

A comprehensive look at how NovelVids transforms raw novel text into production-ready short drama videos through its five-step AI pipeline — from entity extraction to final video synthesis.

What NovelVids Does

NovelVids is an AI-driven production platform that automatically converts novel text into style-consistent short drama videos. Given a novel as input, the system handles the entire journey — from understanding the narrative and characters to generating the final playable video.

This is not a demo or proof-of-concept. NovelVids is built as an industrial-grade, production-ready platform with a rigorous four-layer backend architecture, comprehensive test coverage, and a modern React 19 frontend.

The core value proposition is simple: paste in a novel, get back a short drama. Everything in between is handled by AI, with human oversight at every stage through an intuitive five-step workflow interface.

Key Insight: NovelVids maintains visual consistency across all generated video frames by establishing character and scene identity early in the pipeline and propagating it through every subsequent step via a unique @mention binding system.

The Complete Pipeline

Every chapter processed by NovelVids flows through a structured pipeline: the uploaded novel is split into chapters, and each chapter then passes through five processing steps. Each stage feeds its output directly into the next, creating a deterministic, traceable production chain.

Upload Novel — input text & split into chapters
    ↓
Extract Entities — characters, scenes, props
    ↓
Generate Assets — reference images for each entity
    ↓
Storyboard — cinematic shot-by-shot script
    ↓
Generate Videos — AI video from each shot
    ↓
Merge & Export — FFmpeg concat to final video

The pipeline is designed for human-in-the-loop control. Users can review, edit, or regenerate the output at each step before proceeding. The system tracks progress through a workflow status state machine that enforces forward-only progression.

draft → chapters_extracted → characters_extracted → storyboard_ready → generating → completed
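Under illustrative names (the project's actual enum and helpers may differ), a forward-only status machine like the one above can be sketched as:

```python
from enum import IntEnum

class WorkflowStatus(IntEnum):
    """Chapter workflow statuses in pipeline order (hypothetical names
    mirroring the status chain above)."""
    DRAFT = 0
    CHAPTERS_EXTRACTED = 1
    CHARACTERS_EXTRACTED = 2
    STORYBOARD_READY = 3
    GENERATING = 4
    COMPLETED = 5

def advance(current: WorkflowStatus, target: WorkflowStatus) -> WorkflowStatus:
    # Forward-only: a chapter may never move back to an earlier status.
    if target <= current:
        raise ValueError(f"cannot move {current.name} -> {target.name}")
    return target
```

Encoding the statuses as ordered integers makes the forward-only rule a single comparison.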

Five Steps in Detail

01 Entity Extraction

When a user selects a chapter and triggers extraction, the system performs deep narrative analysis using a large language model. It automatically identifies and extracts three types of entities:

  • Characters (Person) — name, aliases, physical appearance, clothing, personality traits, and which chapters they appear in
  • Scenes — location name, spatial layout, lighting conditions, atmosphere, and time of day
  • Props (Items) — object name, appearance, material, and narrative significance

The extraction runs three extractors concurrently (person, scene, item), each using OpenAI's structured output feature to guarantee consistent JSON responses. Results are incrementally merged into the asset database — not overwritten — so cross-chapter character information accumulates naturally.

Chapter Text → LLM (Structured Output) → PersonList + SceneList + ItemList → Incremental merge into Asset DB

Design Decision: Incremental merge means if Chapter 3 reveals new details about a character introduced in Chapter 1, the character's record is enriched rather than duplicated. Aliases are also merged to handle different name forms across chapters.
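The incremental-merge idea can be sketched in a few lines; the field names and record shapes here are assumptions for illustration, not the project's actual schema:

```python
def merge_person(existing: dict, incoming: dict) -> dict:
    """Merge a newly extracted character record into an existing one.

    Fields are enriched rather than overwritten, and aliases/chapters
    accumulate across extractions. (Field names are illustrative.)
    """
    merged = dict(existing)
    # Union of aliases handles different name forms across chapters.
    merged["aliases"] = sorted(set(existing.get("aliases", [])) |
                               set(incoming.get("aliases", [])))
    merged["chapters"] = sorted(set(existing.get("chapters", [])) |
                                set(incoming.get("chapters", [])))
    # Textual fields: append genuinely new details instead of replacing.
    for field in ("appearance", "clothing", "personality"):
        old, new = existing.get(field, ""), incoming.get(field, "")
        if new and new not in old:
            merged[field] = f"{old} {new}".strip()
    return merged
```

So a detail revealed in Chapter 3 enriches the record created in Chapter 1 instead of spawning a duplicate.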

02 Visual Asset Generation

Once entities are extracted, the system generates reference images for each asset. These images establish the visual identity that all downstream video generation will follow.

The image generation uses the Bytedance Seedream model (doubao-seedream-4-5-251128) and creates different types of reference images based on asset type:

Characters

Multi-angle three-view reference sheets showing the character's face and body from front, side, and three-quarter angles. Ensures consistent identity across frames.

Scenes

Panoramic environment reference images capturing the spatial layout, lighting conditions, and atmospheric details of each location.

Props

Detailed product-style renders of significant objects, showing material textures and physical characteristics from multiple angles.

Manual Upload

Users can also upload their own reference images for any asset, giving full creative control over the visual direction.

Generated images are downloaded to the local media directory at {MEDIA_PATH}/assets/{id}.png and served via the static file endpoint.

03 Storyboard Generation

This is where the chapter narrative becomes a cinematic shot script. The system uses an LLM with a specialized system prompt that positions it as an "elite Director of Photography (DP)" to generate film-quality storyboard descriptions.

Each storyboard shot (scene) contains far more than a simple text description. It includes:

  • Shot description — what happens in the frame, character actions, and dialogue
  • Action timeline — precise moment-by-moment breakdown within the 4s or 8s duration
  • Film stock / texture — grain, color science, and visual aesthetics
  • Camera parameters — lens focal length, aperture, sensor format
  • Lighting design — key/fill/rim light setup, color temperature, ratios
  • Color grading — LUT reference, shadow/highlight tones, saturation
  • Camera movement — dolly, crane, pan/tilt with specific speeds and timing
  • Sound design — ambient audio, foley, music notes for the video generation

Critical Design: The storyboard generator uses @{entity_name} syntax to reference characters and objects. It does NOT re-describe their appearance in the shot prompt. Instead, the asset resolver handles visual identity injection at video generation time. This ensures character consistency across all shots.

04 Video Generation

Each storyboard shot is sent to a video generation model to produce the actual video clip. NovelVids supports four different video generation platforms through a unified factory pattern:

Vidu Q2

By Shengshu Technology. Excels at character-driven generation with strong identity preservation via reference-to-video mode.

Sora 2

By OpenAI. Advanced scene composition and extended sequences with compatible API interfaces.

Seedance

By ByteDance. Supports automatic t2v/i2v switching — uses image-to-video when reference images are available, text-to-video otherwise.

Veo 3

By Google. Photorealistic output with high visual fidelity, ideal for scenes requiring cinematic production quality.

Before submitting to the video platform, the Asset Resolver (asset_resolver.py) scans each shot prompt for @{name} mentions, locates the matching asset's reference image, converts it to base64, and injects it into the API request. This is how visual consistency is maintained.

Video generation is asynchronous — the system submits the task, receives a job ID, and the frontend polls for status updates until the video is ready. Completed videos are automatically downloaded to local storage.
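The submit-then-poll pattern can be sketched as a small helper; the status values and the query interface here are illustrative, not any specific platform's API:

```python
import asyncio

async def wait_for_video(query, task_id: str,
                         interval: float = 5.0, timeout: float = 600.0) -> str:
    """Poll a generator's query coroutine until the clip is ready.

    `query` is any callable mirroring the generators' query(task_id)
    interface; the "running"/"completed"/"failed" states are assumed.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        status = await query(task_id)
        if status["state"] == "completed":
            return status["video_url"]
        if status["state"] == "failed":
            raise RuntimeError(f"generation failed: {status.get('error')}")
        if loop.time() >= deadline:
            raise TimeoutError(f"task {task_id} exceeded {timeout}s")
        await asyncio.sleep(interval)
```

The same loop shape applies whether the poller lives in the backend or, as in NovelVids, in the frontend hitting a status endpoint.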

05 Video Merge & Export

The final step assembles all individual shot videos for a chapter into a single continuous short drama episode using FFmpeg's concat filter.

The merge process intelligently handles mixed media — some video clips may have audio tracks while others are silent. The system normalizes these differences before concatenation to ensure smooth playback.

Shot 1.mp4 + Shot 2.mp4 + Shot 3.mp4 + ... → FFmpeg Concat Filter → chapter_{id}_merged.mp4

The merged video is saved to {MEDIA_PATH}/videos/merged/chapter_{id}_merged.mp4 and is immediately available for preview and download through the web interface.
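As a rough sketch of the merge step, the code below builds an FFmpeg concat-filter invocation in Python; flags and paths are illustrative, and a real implementation would first give silent clips a dummy audio track (e.g. via anullsrc) before concatenation:

```python
import subprocess

def build_concat_cmd(clips: list[str], output: str) -> list[str]:
    """Build an ffmpeg command that joins clips with the concat filter,
    re-encoding so clips with differing parameters still merge cleanly.
    Assumes every clip already has both a video and an audio stream."""
    cmd = ["ffmpeg", "-y"]
    for clip in clips:
        cmd += ["-i", clip]
    n = len(clips)
    # e.g. "[0:v][0:a][1:v][1:a]concat=n=2:v=1:a=1[v][a]"
    streams = "".join(f"[{i}:v][{i}:a]" for i in range(n))
    cmd += [
        "-filter_complex", f"{streams}concat=n={n}:v=1:a=1[v][a]",
        "-map", "[v]", "-map", "[a]",
        output,
    ]
    return cmd

def merge(clips: list[str], output: str) -> None:
    # Runs FFmpeg as a subprocess, raising on a non-zero exit code.
    subprocess.run(build_concat_cmd(clips, output), check=True)
```

Unlike the concat demuxer, the concat filter re-encodes, which is what allows clips with mismatched parameters to join smoothly.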

The @Mention Entity Binding System

One of NovelVids' most important design innovations is its entity binding system, which ensures consistent character appearance across all generated video frames. Here's how it works:

How @Mentions Work

During storyboard generation, the AI is explicitly instructed to never re-describe a character's physical appearance in the shot prompt. Instead, it uses a special @{entity_name} syntax as a placeholder:

Example Storyboard Prompt:

"@{Zhang Wei} walks slowly across the moonlit garden toward @{Ancient Pavilion}, his expression reflecting deep contemplation. He reaches out to touch the weathered stone pillar. @{Jade Pendant} hanging from his belt catches the light as he moves..."

At video generation time, the Asset Resolver processes each prompt:

  • Scans the prompt text for all @{...} patterns
  • Looks up each referenced entity name in the asset database
  • Retrieves the entity's reference image from local storage
  • Converts the image to base64 and attaches it to the video generation API request

This separation of concerns means the narrative description stays clean while the visual identity is injected programmatically. The video generation model receives both the text prompt and the reference images, producing output that maintains the established visual identity.
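A minimal sketch of this resolution step, assuming a simple name-to-image-path index (the actual asset_resolver.py API is likely richer):

```python
import base64
import re

# Matches @{entity_name} placeholders in a shot prompt.
MENTION = re.compile(r"@\{([^}]+)\}")

def resolve_assets(prompt: str, asset_index: dict[str, str]) -> list[str]:
    """Return base64-encoded reference images for every @{name} mention.

    `asset_index` maps entity name -> reference image path; the names
    and shapes here are assumptions for illustration.
    """
    images = []
    for name in MENTION.findall(prompt):
        path = asset_index.get(name)
        if path is None:
            continue  # unknown mention: the prompt text passes through as-is
        with open(path, "rb") as f:
            images.append(base64.b64encode(f.read()).decode("ascii"))
    return images
```

The resolved images travel alongside the untouched prompt text, which is exactly the separation of concerns described above.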

AI Task Scheduling System

All AI operations in NovelVids (extraction, image generation, storyboard, video) are managed through a centralized asynchronous task system built on ai_task_executor.py.

Controller → submit() → AiTask record in DB → BackgroundTask: run() → cleanup_stale_tasks() → handler.execute() → status: completed/failed

Key features of the task system:

Concurrency Control

Semaphore-based throttling prevents overwhelming external AI APIs. Multiple tasks of the same type are queued, not rejected.

Timeout Management

Each task type has a configurable timeout (600s-900s). Tasks exceeding the limit are automatically marked as failed.

Stale Task Cleanup

Before each new submission, the system sweeps for "zombie" tasks stuck in running state beyond their timeout window.

Frontend Polling

The React frontend polls GET /api/task/{id} for real-time status updates, showing progress indicators until completion.

Task Type         Timeout   Description
extraction        600s      Entity extraction from chapter text
reference_image   600s      Reference image generation for assets
storyboard        900s      Cinematic storyboard generation
video             600s      Video clip generation per shot
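Combining the semaphore throttling, timeouts, and status handling above, a stripped-down executor might look like this (the concurrency limit and return shapes are assumptions, not the actual ai_task_executor.py code):

```python
import asyncio

# Timeouts per task type, taken from the table above.
TIMEOUTS = {"extraction": 600, "reference_image": 600, "storyboard": 900, "video": 600}
# Per-type semaphores queue excess tasks instead of rejecting them
# (a cap of 2 concurrent tasks per type is assumed here).
_semaphores = {t: asyncio.Semaphore(2) for t in TIMEOUTS}

async def run_task(task_type: str, handler) -> dict:
    """Queue behind the per-type semaphore, then run the handler under
    its timeout; a timeout marks the task failed rather than crashing."""
    async with _semaphores[task_type]:
        try:
            result = await asyncio.wait_for(handler(), timeout=TIMEOUTS[task_type])
            return {"status": "completed", "result": result}
        except asyncio.TimeoutError:
            return {"status": "failed", "error": f"exceeded {TIMEOUTS[task_type]}s"}
```

In the real system the status dict would be persisted to the AiTask record that the frontend polls.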

System Architecture

NovelVids follows a strict four-layer backend architecture with clear separation of concerns. Each layer has a single responsibility: the API layer delegates to controllers, which in turn coordinate the model and service layers beneath them.

React 19 Frontend (TypeScript + Tailwind CSS) — 5-step workflow UI, video library, model configuration
API Layer (api/) — RESTful routing, request parsing, response formatting, BackgroundTask registration
Controller Layer (controllers/) — business logic orchestration, status management, task coordination
Model Layer (models/) — data persistence via Tortoise ORM (SQLite dev / PostgreSQL prod)
Service Layer (services/) — external AI API calls: LLM extraction, image generation, video synthesis

The API layer handles HTTP concerns — routing, parameter validation, and response formatting. It delegates all business logic to controllers.

The Controller layer orchestrates workflows, manages state transitions, and coordinates between models and services. It's the brain of the application.

The Model layer defines the data schema using Tortoise ORM and provides async CRUD operations. It supports SQLite for development and PostgreSQL for production.

The Service layer encapsulates all external API interactions. Each AI capability (extraction, image generation, storyboard, video) has its own handler with a consistent execute() interface.

Hot-Swap Configuration: All AI model configurations (API keys, base URLs, model names) are stored in the database, not in environment files. Users can switch between providers or update credentials through the web UI without restarting the server. Each task type supports multiple configurations but only one can be active at a time.
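The one-active-per-task-type rule can be sketched with plain dicts standing in for database rows (in NovelVids this would be an ORM update against AiModelConfig):

```python
def activate(configs: list[dict], task_type: str, name: str) -> None:
    """Mark one config active and deactivate every other config of the
    same task type, enforcing at most one active config per type.
    (Row shape is illustrative, not the actual AiModelConfig schema.)"""
    for cfg in configs:
        if cfg["task_type"] == task_type:
            cfg["is_active"] = (cfg["name"] == name)
```

Configs for other task types are untouched, so switching the video provider never disturbs the storyboard model.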

Video Generator Factory Pattern

The video generation service uses the factory pattern to support multiple providers through a unified interface:

services/video/__init__.py:

    _GENERATORS = {
        VideoModelTypeEnum.viduq2: ViduGenerator,
        VideoModelTypeEnum.sora2: SoraGenerator,
        VideoModelTypeEnum.seedance: SeedanceGenerator,
        VideoModelTypeEnum.veo3: VeoGenerator,
    }

    def get_generator(model_type, config) -> BaseVideoGenerator:
        return _GENERATORS[model_type](config)

Every generator implements two methods: submit() to start generation and query() to check status. This makes adding a new video model as simple as implementing a new class that extends BaseVideoGenerator.
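A sketch of what that shared base class might look like (the real BaseVideoGenerator signatures may differ):

```python
from abc import ABC, abstractmethod

class BaseVideoGenerator(ABC):
    """Common interface every video provider implements (sketch)."""

    def __init__(self, config: dict):
        # api_key, base_url, model name, etc. from the active AiModelConfig
        self.config = config

    @abstractmethod
    async def submit(self, prompt: str, reference_images: list[str]) -> str:
        """Start generation; return the provider's job ID."""

    @abstractmethod
    async def query(self, job_id: str) -> dict:
        """Return current job status (and the video URL once completed)."""
```

Adding a provider is then a new subclass plus one entry in the _GENERATORS registry.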

Data Model & Relationships

NovelVids uses Tortoise ORM; its core models form a clear hierarchical relationship:

Novel — contains many Chapters (1:N). Stores title, author, synopsis, cover image.
Chapter — contains many Scenes, i.e. storyboard shots (1:N). Tracks workflow status.
Scene — generates many Videos (1:N) and links to many Assets (M:N) via the scene_assets junction table. Stores prompt text, shot parameters, and prompt_params JSON.
Asset — characters, scenes, and props. Stores name, type, description, aliases, reference image path.
Video — generated video clips. Stores status, model type, file path, external job metadata.
AiTask — task tracking with UUID primary key. Stores type, status, request_params, and response_data as JSON.
AiModelConfig — AI provider configs, with a unique constraint on task type + name. Only one active per task type.

Relationship Diagram:

    Novel ──1:N──> Chapter ──1:N──> Scene ──1:N──> Video
      │                               │
      └────1:N────> Asset <────M:N────┘
                           (scene_assets)

Tech Stack

Backend

Technology      Role
Python 3.12+    Runtime environment
FastAPI         High-performance async web framework
Uvicorn         ASGI server
Tortoise ORM    Async ORM supporting SQLite and PostgreSQL
Pydantic        Data validation and serialization
OpenAI SDK      Unified interface for LLM and image API calls
httpx           Async HTTP client for video platform APIs
FFmpeg          Video merging via subprocess
uv              Package management

Frontend

Technology       Role
React 19         UI framework
TypeScript       Type safety
Vite             Build tooling & dev server
Tailwind CSS     Utility-first styling
shadcn/ui        Component library (Radix UI primitives)
React Router 7   Client-side routing (HashRouter)
Sonner           Toast notification system
Lucide React     Icon library

External AI Services

Service                 Purpose                                     Auth
OpenAI-compatible LLM   Entity extraction & storyboard generation   API Key
Bytedance Seedream      Reference image generation                  API Key
Vidu Q2 API             Video generation                            Token auth
Sora 2 API              Video generation                            Bearer auth
Seedance API            Video generation (auto t2v/i2v)             Bearer auth
Veo 3 API               Video generation                            Bearer auth

No GPU Required: NovelVids does not require any local GPU. All AI-intensive tasks (LLM inference, image generation, video synthesis) are offloaded to cloud-based API services. The server itself only needs standard compute for orchestration, file I/O, and FFmpeg processing.