Technical Deep Dive

How NovelVids Works

A comprehensive look at how NovelVids transforms raw novel text into production-ready short drama videos through its five-step AI pipeline — from entity extraction to final video synthesis.

What NovelVids Does

NovelVids is an AI-driven production platform that automatically converts novel text into style-consistent short drama videos. Given a novel as input, the system handles the entire journey — from understanding the narrative and characters to generating the final playable video.

This is not a demo or proof-of-concept. NovelVids is built as an industrial-grade, production-ready platform with a rigorous four-layer backend architecture, comprehensive test coverage, and a modern React 19 frontend.

The core value proposition is simple: paste in a novel, get back a short drama. Everything in between is handled by AI, with human oversight at every stage through an intuitive five-step workflow interface.

Key Insight: NovelVids maintains visual consistency across all generated video frames by establishing character and scene identity early in the pipeline and propagating it through every subsequent step via a unique @mention binding system.

The Complete Pipeline

Every chapter processed by NovelVids flows through a structured pipeline: the uploaded novel is split into chapters, and each chapter then passes through five processing steps. Each stage feeds its output directly into the next, creating a deterministic, traceable production chain.

Upload Novel — input text & split into chapters
    ↓
Extract Entities — characters, scenes, props
    ↓
Generate Assets — reference images for each entity
    ↓
Storyboard — cinematic shot-by-shot script
    ↓
Generate Videos — AI video from each shot
    ↓
Merge & Export — FFmpeg concat to final video

The pipeline is designed for human-in-the-loop control. Users can review, edit, or regenerate the output at each step before proceeding. The system tracks progress through a workflow status state machine that enforces forward-only progression.

draft → chapters_extracted → characters_extracted → storyboard_ready → generating → completed
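Under illustrative names (the project's actual enum and helpers may differ), a forward-only status machine like the one above can be sketched as:

```python
from enum import IntEnum

class WorkflowStatus(IntEnum):
    """Chapter workflow statuses in pipeline order (hypothetical names
    mirroring the status chain above)."""
    DRAFT = 0
    CHAPTERS_EXTRACTED = 1
    CHARACTERS_EXTRACTED = 2
    STORYBOARD_READY = 3
    GENERATING = 4
    COMPLETED = 5

def advance(current: WorkflowStatus, target: WorkflowStatus) -> WorkflowStatus:
    # Forward-only: a chapter may never move back to an earlier status.
    if target <= current:
        raise ValueError(f"cannot move {current.name} -> {target.name}")
    return target
```

Encoding the statuses as ordered integers makes the forward-only rule a single comparison.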

Five Steps in Detail

01 Entity Extraction

When a user selects a chapter and triggers extraction, the system performs deep narrative analysis using a large language model. It automatically identifies and extracts three types of entities:

  • Characters (Person) — name, aliases, physical appearance, clothing, personality traits, and which chapters they appear in
  • Scenes — location name, spatial layout, lighting conditions, atmosphere, and time of day
  • Props (Items) — object name, appearance, material, and narrative significance

The extraction runs three extractors concurrently (person, scene, item), each using OpenAI's structured output feature to guarantee consistent JSON responses. Results are incrementally merged into the asset database — not overwritten — so cross-chapter character information accumulates naturally.

Chapter Text → LLM (Structured Output) → PersonList + SceneList + ItemList → Incremental merge into Asset DB

Design Decision: Incremental merge means if Chapter 3 reveals new details about a character introduced in Chapter 1, the character's record is enriched rather than duplicated. Aliases are also merged to handle different name forms across chapters.
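The incremental-merge idea can be sketched in a few lines; the field names and record shapes here are assumptions for illustration, not the project's actual schema:

```python
def merge_person(existing: dict, incoming: dict) -> dict:
    """Merge a newly extracted character record into an existing one.

    Fields are enriched rather than overwritten, and aliases/chapters
    accumulate across extractions. (Field names are illustrative.)
    """
    merged = dict(existing)
    # Union of aliases handles different name forms across chapters.
    merged["aliases"] = sorted(set(existing.get("aliases", [])) |
                               set(incoming.get("aliases", [])))
    merged["chapters"] = sorted(set(existing.get("chapters", [])) |
                                set(incoming.get("chapters", [])))
    # Textual fields: append genuinely new details instead of replacing.
    for field in ("appearance", "clothing", "personality"):
        old, new = existing.get(field, ""), incoming.get(field, "")
        if new and new not in old:
            merged[field] = f"{old} {new}".strip()
    return merged
```

So a detail revealed in Chapter 3 enriches the record created in Chapter 1 instead of spawning a duplicate.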

02 Visual Asset Generation

Once entities are extracted, the system generates reference images for each asset. These images establish the visual identity that all downstream video generation will follow.

The image generation uses the Bytedance Seedream model (doubao-seedream-4-5-251128) and creates different types of reference images based on asset type:

Characters

Multi-angle three-view reference sheets showing the character's face and body from front, side, and three-quarter angles. Ensures consistent identity across frames.

Scenes

Panoramic environment reference images capturing the spatial layout, lighting conditions, and atmospheric details of each location.

Props

Detailed product-style renders of significant objects, showing material textures and physical characteristics from multiple angles.

Manual Upload

Users can also upload their own reference images for any asset, giving full creative control over the visual direction.

Generated images are downloaded to the local media directory at {MEDIA_PATH}/assets/{id}.png and served via the static file endpoint.

03 Storyboard Generation

This is where the chapter narrative becomes a cinematic shot script. The system uses an LLM with a specialized system prompt that positions it as an "elite Director of Photography (DP)" to generate film-quality storyboard descriptions.

Each storyboard shot (scene) contains far more than a simple text description. It includes:

  • Shot description — what happens in the frame, character actions, and dialogue
  • Action timeline — precise moment-by-moment breakdown within the 4s or 8s duration
  • Film stock / texture — grain, color science, and visual aesthetics
  • Camera parameters — lens focal length, aperture, sensor format
  • Lighting design — key/fill/rim light setup, color temperature, ratios
  • Color grading — LUT reference, shadow/highlight tones, saturation
  • Camera movement — dolly, crane, pan/tilt with specific speeds and timing
  • Sound design — ambient audio, foley, music notes for the video generation

Critical Design: The storyboard generator uses @{entity_name} syntax to reference characters and objects. It does NOT re-describe their appearance in the shot prompt. Instead, the asset resolver handles visual identity injection at video generation time. This ensures character consistency across all shots.

04 Video Generation

Each storyboard shot is sent to a video generation model to produce the actual video clip. NovelVids supports four different video generation platforms through a unified factory pattern:

Vidu Q2

By Shengshu Technology. Excels at character-driven generation with strong identity preservation via reference-to-video mode.

Sora 2

By OpenAI. Advanced scene composition and extended sequences with compatible API interfaces.

Seedance

By ByteDance. Supports automatic t2v/i2v switching — uses image-to-video when reference images are available, text-to-video otherwise.

Veo 3

By Google. Photorealistic output with high visual fidelity, ideal for scenes requiring cinematic production quality.

Before submitting to the video platform, the Asset Resolver (asset_resolver.py) scans each shot prompt for @{name} mentions, locates the matching asset's reference image, converts it to base64, and injects it into the API request. This is how visual consistency is maintained.

Video generation is asynchronous — the system submits the task, receives a job ID, and the frontend polls for status updates until the video is ready. Completed videos are automatically downloaded to local storage.
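The submit-then-poll pattern can be sketched as a small helper; the status values and the query interface here are illustrative, not any specific platform's API:

```python
import asyncio

async def wait_for_video(query, task_id: str,
                         interval: float = 5.0, timeout: float = 600.0) -> str:
    """Poll a generator's query coroutine until the clip is ready.

    `query` is any callable mirroring the generators' query(task_id)
    interface; the "running"/"completed"/"failed" states are assumed.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        status = await query(task_id)
        if status["state"] == "completed":
            return status["video_url"]
        if status["state"] == "failed":
            raise RuntimeError(f"generation failed: {status.get('error')}")
        if loop.time() >= deadline:
            raise TimeoutError(f"task {task_id} exceeded {timeout}s")
        await asyncio.sleep(interval)
```

The same loop shape applies whether the poller lives in the backend or, as in NovelVids, in the frontend hitting a status endpoint.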

05 Video Merge & Export

The final step assembles all individual shot videos for a chapter into a single continuous short drama episode using FFmpeg's concat filter.

The merge process intelligently handles mixed media — some video clips may have audio tracks while others are silent. The system normalizes these differences before concatenation to ensure smooth playback.

Shot 1.mp4 + Shot 2.mp4 + Shot 3.mp4 + ... → FFmpeg Concat Filter → chapter_{id}_merged.mp4

The merged video is saved to {MEDIA_PATH}/videos/merged/chapter_{id}_merged.mp4 and is immediately available for preview and download through the web interface.
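As a rough sketch of the merge step, the code below builds an FFmpeg concat-filter invocation in Python; flags and paths are illustrative, and a real implementation would first give silent clips a dummy audio track (e.g. via anullsrc) before concatenation:

```python
import subprocess

def build_concat_cmd(clips: list[str], output: str) -> list[str]:
    """Build an ffmpeg command that joins clips with the concat filter,
    re-encoding so clips with differing parameters still merge cleanly.
    Assumes every clip already has both a video and an audio stream."""
    cmd = ["ffmpeg", "-y"]
    for clip in clips:
        cmd += ["-i", clip]
    n = len(clips)
    # e.g. "[0:v][0:a][1:v][1:a]concat=n=2:v=1:a=1[v][a]"
    streams = "".join(f"[{i}:v][{i}:a]" for i in range(n))
    cmd += [
        "-filter_complex", f"{streams}concat=n={n}:v=1:a=1[v][a]",
        "-map", "[v]", "-map", "[a]",
        output,
    ]
    return cmd

def merge(clips: list[str], output: str) -> None:
    # Runs FFmpeg as a subprocess, raising on a non-zero exit code.
    subprocess.run(build_concat_cmd(clips, output), check=True)
```

Unlike the concat demuxer, the concat filter re-encodes, which is what allows clips with mismatched parameters to join smoothly.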

The @Mention Entity Binding System

One of NovelVids' most important design innovations is its entity binding system, which ensures consistent character appearance across all generated video frames. Here's how it works:

How @Mentions Work

During storyboard generation, the AI is explicitly instructed to never re-describe a character's physical appearance in the shot prompt. Instead, it uses a special @{entity_name} syntax as a placeholder:

Example Storyboard Prompt:

"@{Zhang Wei} walks slowly across the moonlit garden toward @{Ancient Pavilion}, his expression reflecting deep contemplation. He reaches out to touch the weathered stone pillar. @{Jade Pendant} hanging from his belt catches the light as he moves..."

At video generation time, the Asset Resolver processes each prompt:

  • Scans the prompt text for all @{...} patterns
  • Looks up each referenced entity name in the asset database
  • Retrieves the entity's reference image from local storage
  • Converts the image to base64 and attaches it to the video generation API request

This separation of concerns means the narrative description stays clean while the visual identity is injected programmatically. The video generation model receives both the text prompt and the reference images, producing output that maintains the established visual identity.
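A minimal sketch of this resolution step, assuming a simple name-to-image-path index (the actual asset_resolver.py API is likely richer):

```python
import base64
import re

# Matches @{entity_name} placeholders in a shot prompt.
MENTION = re.compile(r"@\{([^}]+)\}")

def resolve_assets(prompt: str, asset_index: dict[str, str]) -> list[str]:
    """Return base64-encoded reference images for every @{name} mention.

    `asset_index` maps entity name -> reference image path; the names
    and shapes here are assumptions for illustration.
    """
    images = []
    for name in MENTION.findall(prompt):
        path = asset_index.get(name)
        if path is None:
            continue  # unknown mention: the prompt text passes through as-is
        with open(path, "rb") as f:
            images.append(base64.b64encode(f.read()).decode("ascii"))
    return images
```

The resolved images travel alongside the untouched prompt text, which is exactly the separation of concerns described above.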

AI Task Scheduling System

All AI operations in NovelVids (extraction, image generation, storyboard, video) are managed through a centralized asynchronous task system built on ai_task_executor.py.

Controller → submit() → AiTask record in DB → BackgroundTask: run() → cleanup_stale_tasks() → handler.execute() → status: completed/failed

Key features of the task system:

Concurrency Control

Semaphore-based throttling prevents overwhelming external AI APIs. Multiple tasks of the same type are queued, not rejected.

Timeout Management

Each task type has a configurable timeout (600s-900s). Tasks exceeding the limit are automatically marked as failed.

Stale Task Cleanup

Before each new submission, the system sweeps for "zombie" tasks stuck in running state beyond their timeout window.

Frontend Polling

The React frontend polls GET /api/task/{id} for real-time status updates, showing progress indicators until completion.

Task Type         Timeout   Description
extraction        600s      Entity extraction from chapter text
reference_image   600s      Reference image generation for assets
storyboard        900s      Cinematic storyboard generation
video             600s      Video clip generation per shot
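Combining the semaphore throttling, timeouts, and status handling above, a stripped-down executor might look like this (the concurrency limit and return shapes are assumptions, not the actual ai_task_executor.py code):

```python
import asyncio

# Timeouts per task type, taken from the table above.
TIMEOUTS = {"extraction": 600, "reference_image": 600, "storyboard": 900, "video": 600}
# Per-type semaphores queue excess tasks instead of rejecting them
# (a cap of 2 concurrent tasks per type is assumed here).
_semaphores = {t: asyncio.Semaphore(2) for t in TIMEOUTS}

async def run_task(task_type: str, handler) -> dict:
    """Queue behind the per-type semaphore, then run the handler under
    its timeout; a timeout marks the task failed rather than crashing."""
    async with _semaphores[task_type]:
        try:
            result = await asyncio.wait_for(handler(), timeout=TIMEOUTS[task_type])
            return {"status": "completed", "result": result}
        except asyncio.TimeoutError:
            return {"status": "failed", "error": f"exceeded {TIMEOUTS[task_type]}s"}
```

In the real system the status dict would be persisted to the AiTask record that the frontend polls.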

System Architecture

NovelVids follows a strict four-layer backend architecture with clear separation of concerns. Each layer has a single responsibility: the API layer delegates to controllers, which in turn coordinate the model and service layers beneath them.

React 19 Frontend (TypeScript + Tailwind CSS) — 5-step workflow UI, video library, model configuration
API Layer (api/) — RESTful routing, request parsing, response formatting, BackgroundTask registration
Controller Layer (controllers/) — business logic orchestration, status management, task coordination
Model Layer (models/) — data persistence via Tortoise ORM (SQLite dev / PostgreSQL prod)
Service Layer (services/) — external AI API calls: LLM extraction, image generation, video synthesis

The API layer handles HTTP concerns — routing, parameter validation, and response formatting. It delegates all business logic to controllers.

The Controller layer orchestrates workflows, manages state transitions, and coordinates between models and services. It's the brain of the application.

The Model layer defines the data schema using Tortoise ORM and provides async CRUD operations. It supports SQLite for development and PostgreSQL for production.

The Service layer encapsulates all external API interactions. Each AI capability (extraction, image generation, storyboard, video) has its own handler with a consistent execute() interface.

Hot-Swap Configuration: All AI model configurations (API keys, base URLs, model names) are stored in the database, not in environment files. Users can switch between providers or update credentials through the web UI without restarting the server. Each task type supports multiple configurations but only one can be active at a time.
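The one-active-per-task-type rule can be sketched with plain dicts standing in for database rows (in NovelVids this would be an ORM update against AiModelConfig):

```python
def activate(configs: list[dict], task_type: str, name: str) -> None:
    """Mark one config active and deactivate every other config of the
    same task type, enforcing at most one active config per type.
    (Row shape is illustrative, not the actual AiModelConfig schema.)"""
    for cfg in configs:
        if cfg["task_type"] == task_type:
            cfg["is_active"] = (cfg["name"] == name)
```

Configs for other task types are untouched, so switching the video provider never disturbs the storyboard model.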

Video Generator Factory Pattern

The video generation service uses the factory pattern to support multiple providers through a unified interface:

services/video/__init__.py:

    _GENERATORS = {
        VideoModelTypeEnum.viduq2: ViduGenerator,
        VideoModelTypeEnum.sora2: SoraGenerator,
        VideoModelTypeEnum.seedance: SeedanceGenerator,
        VideoModelTypeEnum.veo3: VeoGenerator,
    }

    def get_generator(model_type, config) -> BaseVideoGenerator:
        return _GENERATORS[model_type](config)

Every generator implements two methods: submit() to start generation and query() to check status. This makes adding a new video model as simple as implementing a new class that extends BaseVideoGenerator.
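A sketch of what that shared base class might look like (the real BaseVideoGenerator signatures may differ):

```python
from abc import ABC, abstractmethod

class BaseVideoGenerator(ABC):
    """Common interface every video provider implements (sketch)."""

    def __init__(self, config: dict):
        # api_key, base_url, model name, etc. from the active AiModelConfig
        self.config = config

    @abstractmethod
    async def submit(self, prompt: str, reference_images: list[str]) -> str:
        """Start generation; return the provider's job ID."""

    @abstractmethod
    async def query(self, job_id: str) -> dict:
        """Return current job status (and the video URL once completed)."""
```

Adding a provider is then a new subclass plus one entry in the _GENERATORS registry.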

Data Model & Relationships

NovelVids uses Tortoise ORM; its core models form a clear hierarchical relationship:

Novel — contains many Chapters (1:N). Stores title, author, synopsis, cover image.
Chapter — contains many Scenes, i.e. storyboard shots (1:N). Tracks workflow status.
Scene — generates many Videos (1:N) and links to many Assets (M:N) via the scene_assets junction table. Stores prompt text, shot parameters, and prompt_params JSON.
Asset — characters, scenes, and props. Stores name, type, description, aliases, reference image path.
Video — generated video clips. Stores status, model type, file path, external job metadata.
AiTask — task tracking with UUID primary key. Stores type, status, request_params, and response_data as JSON.
AiModelConfig — AI provider configs, with a unique constraint on task type + name. Only one active per task type.

Relationship Diagram:

    Novel ──1:N──> Chapter ──1:N──> Scene ──1:N──> Video
      │                               │
      └────1:N────> Asset <────M:N────┘
                           (scene_assets)

Tech Stack

Backend

Technology      Role
Python 3.12+    Runtime environment
FastAPI         High-performance async web framework
Uvicorn         ASGI server
Tortoise ORM    Async ORM supporting SQLite and PostgreSQL
Pydantic        Data validation and serialization
OpenAI SDK      Unified interface for LLM and image API calls
httpx           Async HTTP client for video platform APIs
FFmpeg          Video merging via subprocess
uv              Package management

Frontend

Technology       Role
React 19         UI framework
TypeScript       Type safety
Vite             Build tooling & dev server
Tailwind CSS     Utility-first styling
shadcn/ui        Component library (Radix UI primitives)
React Router 7   Client-side routing (HashRouter)
Sonner           Toast notification system
Lucide React     Icon library

External AI Services

Service                 Purpose                                     Auth
OpenAI-compatible LLM   Entity extraction & storyboard generation   API Key
Bytedance Seedream      Reference image generation                  API Key
Vidu Q2 API             Video generation                            Token auth
Sora 2 API              Video generation                            Bearer auth
Seedance API            Video generation (auto t2v/i2v)             Bearer auth
Veo 3 API               Video generation                            Bearer auth

No GPU Required: NovelVids does not require any local GPU. All AI-intensive tasks (LLM inference, image generation, video synthesis) are offloaded to cloud-based API services. The server itself only needs standard compute for orchestration, file I/O, and FFmpeg processing.