A comprehensive look at how NovelVids transforms raw novel text into production-ready short drama videos through its five-step AI pipeline — from entity extraction to final video synthesis.
NovelVids is an AI-driven production platform that automatically converts novel text into style-consistent short drama videos. Given a novel as input, the system handles the entire journey — from understanding the narrative and characters to generating the final playable video.
This is not a demo or proof-of-concept. NovelVids is built as an industrial-grade, production-ready platform with a rigorous four-layer backend architecture, comprehensive test coverage, and a modern React 19 frontend.
The core value proposition is simple: paste in a novel, get back a short drama. Everything in between is handled by AI, with human oversight at every stage through an intuitive five-step workflow interface.
Key Insight: NovelVids maintains visual consistency across all generated video frames by establishing character and scene identity early in the pipeline and propagating it through every subsequent step via a unique @mention binding system.
Every chapter processed by NovelVids flows through a structured five-step pipeline. Each stage feeds its output directly into the next, creating a deterministic, traceable production chain.
Input: novel text, split into chapters, then:

1. **Entity Extraction**: characters, scenes, props
2. **Reference Images**: reference images for each entity
3. **Storyboard**: cinematic shot-by-shot script
4. **Video Generation**: AI video from each shot
5. **Merge**: FFmpeg concat to the final video
The pipeline is designed for human-in-the-loop control. Users can review, edit, or regenerate the output at each step before proceeding. The system tracks progress through a workflow status state machine that enforces forward-only progression.
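The forward-only rule can be sketched as a tiny state machine. The status names below are illustrative assumptions, not NovelVids' actual enum values:

```python
from enum import IntEnum

class WorkflowStatus(IntEnum):
    # Hypothetical statuses, ordered so progression is monotonic.
    CREATED = 0
    EXTRACTED = 1
    IMAGES_READY = 2
    STORYBOARDED = 3
    VIDEOS_READY = 4
    MERGED = 5

def advance(current: WorkflowStatus, target: WorkflowStatus) -> WorkflowStatus:
    """Allow only single forward steps; reject skips and regressions."""
    if target != current + 1:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Encoding the statuses as ordered integers makes the forward-only check a one-line comparison rather than a transition table.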
When a user selects a chapter and triggers extraction, the system performs deep narrative analysis using a large language model, automatically identifying and extracting three types of entities: characters, scenes, and items.
The extraction runs three extractors concurrently (person, scene, item), each using OpenAI's structured output feature to guarantee consistent JSON responses. Results are incrementally merged into the asset database — not overwritten — so cross-chapter character information accumulates naturally.
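The concurrent fan-out can be sketched with `asyncio.gather`; the `extract` coroutine below is a stand-in for the real LLM call and its structured-output schema:

```python
import asyncio

# Hypothetical extractor coroutine; in NovelVids each entity type would
# call the LLM with a JSON-schema response format for that type.
async def extract(entity_type: str, chapter_text: str) -> list[dict]:
    await asyncio.sleep(0)  # stand-in for the actual LLM call
    return [{"type": entity_type, "name": f"sample-{entity_type}"}]

async def extract_all(chapter_text: str) -> dict[str, list[dict]]:
    # Person, scene, and item extraction run concurrently, not sequentially.
    person, scene, item = await asyncio.gather(
        extract("person", chapter_text),
        extract("scene", chapter_text),
        extract("item", chapter_text),
    )
    return {"person": person, "scene": scene, "item": item}
```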
Design Decision: Incremental merge means if Chapter 3 reveals new details about a character introduced in Chapter 1, the character's record is enriched rather than duplicated. Aliases are also merged to handle different name forms across chapters.
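A minimal sketch of what incremental merge implies, assuming a simple dict-based record with `name`, `description`, and `aliases` fields (the real schema is the platform's asset model):

```python
def merge_asset(existing: dict, incoming: dict) -> dict:
    """Enrich an existing asset record instead of overwriting it.

    New non-empty fields fill gaps in the existing record, and alias
    lists are unioned so different name forms accumulate across chapters.
    """
    merged = dict(existing)
    for key, value in incoming.items():
        if key == "aliases":
            merged["aliases"] = sorted(set(existing.get("aliases", [])) | set(value))
        elif value and not merged.get(key):
            merged[key] = value
    return merged
```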
Once entities are extracted, the system generates reference images for each asset. These images establish the visual identity that all downstream video generation will follow.
The image generation uses the Bytedance Seedream model (doubao-seedream-4-5-251128) and creates different types of reference images based on asset type:
- **Characters**: multi-angle three-view reference sheets showing the character's face and body from front, side, and three-quarter angles, ensuring consistent identity across frames.
- **Scenes**: panoramic environment reference images capturing the spatial layout, lighting conditions, and atmospheric details of each location.
- **Items**: detailed product-style renders of significant objects, showing material textures and physical characteristics from multiple angles.
Users can also upload their own reference images for any asset, giving full creative control over the visual direction.
Generated images are downloaded to the local media directory at {MEDIA_PATH}/assets/{id}.png and served via the static file endpoint.
This is where the chapter narrative becomes a cinematic shot script. The system uses an LLM with a specialized system prompt that positions it as an "elite Director of Photography (DP)" to generate film-quality storyboard descriptions.
Each storyboard shot (scene) contains far more than a simple text description: the shot prompt is accompanied by structured cinematography details.
Critical Design: The storyboard generator uses @{entity_name} syntax to reference characters and objects. It does NOT re-describe their appearance in the shot prompt. Instead, the asset resolver handles visual identity injection at video generation time. This ensures character consistency across all shots.
Each storyboard shot is sent to a video generation model to produce the actual video clip. NovelVids supports four different video generation platforms through a unified factory pattern:
- **Vidu Q2** (Shengshu Technology): excels at character-driven generation with strong identity preservation via reference-to-video mode.
- **Sora 2** (OpenAI): advanced scene composition and extended sequences with compatible API interfaces.
- **Seedance** (ByteDance): supports automatic t2v/i2v switching, using image-to-video when reference images are available and text-to-video otherwise.
- **Veo 3** (Google): photorealistic output with high visual fidelity, ideal for scenes requiring cinematic production quality.
Before submitting to the video platform, the Asset Resolver (asset_resolver.py) scans each shot prompt for @{name} mentions, locates the matching asset's reference image, converts it to base64, and injects it into the API request. This is how visual consistency is maintained.
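The resolver's core loop might look like this. This is a sketch of the idea, not the actual `asset_resolver.py`: here `reference_images` maps asset names to raw image bytes, whereas the real resolver loads files from the media directory and attaches them to the provider request.

```python
import base64
import re

# Matches @{entity name} mentions in a shot prompt.
MENTION = re.compile(r"@\{([^}]+)\}")

def resolve_references(prompt: str, reference_images: dict[str, bytes]) -> list[str]:
    """Collect base64-encoded reference images for each @{name} mention."""
    encoded = []
    for name in MENTION.findall(prompt):
        data = reference_images.get(name)
        if data is not None:
            encoded.append(base64.b64encode(data).decode("ascii"))
    return encoded
```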
Video generation is asynchronous — the system submits the task, receives a job ID, and the frontend polls for status updates until the video is ready. Completed videos are automatically downloaded to local storage.
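The submit-then-poll lifecycle can be generalized over any provider's query function; the status values below are illustrative, not a specific provider's API:

```python
import asyncio

async def wait_for_video(query, job_id: str,
                         interval: float = 0.01, max_polls: int = 100) -> str:
    """Poll a video job until it completes.

    `query` is any async callable returning a status payload such as
    {"status": "running"} or {"status": "succeeded", "url": ...}.
    """
    for _ in range(max_polls):
        result = await query(job_id)
        if result["status"] == "succeeded":
            return result["url"]
        if result["status"] == "failed":
            raise RuntimeError(f"video job {job_id} failed")
        await asyncio.sleep(interval)
    raise TimeoutError(f"video job {job_id} did not finish")
```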
The final step assembles all individual shot videos for a chapter into a single continuous short drama episode using FFmpeg's concat filter.
The merge process intelligently handles mixed media — some video clips may have audio tracks while others are silent. The system normalizes these differences before concatenation to ensure smooth playback.
The merged video is saved to {MEDIA_PATH}/videos/merged/chapter_{id}_merged.mp4 and is immediately available for preview and download through the web interface.
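For concreteness, here is one way to assemble the FFmpeg invocation with the concat filter. This is a simplified sketch that assumes every clip already carries an audio track; as described above, the real merge first normalizes clips that lack one.

```python
def build_concat_command(clips: list[str], output: str) -> list[str]:
    """Build an ffmpeg argv that concatenates clips with the concat filter."""
    n = len(clips)
    inputs = []
    for clip in clips:
        inputs += ["-i", clip]
    # Interleave each input's video and audio streams for the filter.
    streams = "".join(f"[{i}:v][{i}:a]" for i in range(n))
    filter_complex = f"{streams}concat=n={n}:v=1:a=1[v][a]"
    return ["ffmpeg", *inputs, "-filter_complex", filter_complex,
            "-map", "[v]", "-map", "[a]", output]
```

The resulting argv is what the merge step would hand to `subprocess`.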
One of NovelVids' most important design innovations is its entity binding system, which ensures consistent character appearance across all generated video frames. Here's how it works:
During storyboard generation, the AI is explicitly instructed to never re-describe a character's physical appearance in the shot prompt. Instead, it uses a special @{entity_name} syntax as a placeholder:
```text
@{Zhang Wei} walks slowly across the moonlit garden toward @{Ancient Pavilion},
his expression reflecting deep contemplation. He reaches out to touch the
weathered stone pillar. @{Jade Pendant} hanging from his belt catches
the light as he moves...
```
At video generation time, the Asset Resolver processes each prompt:
It scans for @{...} patterns, looks up each matching asset, and attaches that asset's reference image. This separation of concerns means the narrative description stays clean while the visual identity is injected programmatically. The video generation model receives both the text prompt and the reference images, producing output that maintains the established visual identity.
All AI operations in NovelVids (extraction, image generation, storyboard, video) are managed through a centralized asynchronous task system built on ai_task_executor.py.
Key features of the task system:
Semaphore-based throttling prevents overwhelming external AI APIs. Multiple tasks of the same type are queued, not rejected.
Each task type has a configurable timeout (600s-900s). Tasks exceeding the limit are automatically marked as failed.
Before each new submission, the system sweeps for "zombie" tasks stuck in running state beyond their timeout window.
The React frontend polls GET /api/task/{id} for real-time status updates, showing progress indicators until completion.
| Task Type | Timeout | Description |
|---|---|---|
| extraction | 600s | Entity extraction from chapter text |
| reference_image | 600s | Reference image generation for assets |
| storyboard | 900s | Cinematic storyboard generation |
| video | 600s | Video clip generation per shot |
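The throttling, timeout, and zombie-sweep behaviors above can be sketched together as follows. This is a minimal illustration with hypothetical concurrency limits and an in-memory task list; the real logic lives in `ai_task_executor.py` and the database.

```python
import asyncio

# Hypothetical per-type concurrency limits; the timeouts mirror the table above.
SEMAPHORES = {"extraction": asyncio.Semaphore(2), "video": asyncio.Semaphore(4)}
TIMEOUTS = {"extraction": 600, "reference_image": 600, "storyboard": 900, "video": 600}

def sweep_zombies(tasks: list[dict], now: float) -> list[dict]:
    """Mark tasks stuck in 'running' beyond their timeout as failed."""
    for task in tasks:
        if (task["status"] == "running"
                and now - task["started_at"] > TIMEOUTS[task["type"]]):
            task["status"] = "failed"
            task["error"] = "zombie: exceeded timeout window"
    return tasks

async def run_task(task_type: str, coro_factory):
    """Queue behind the type's semaphore, then enforce its timeout."""
    async with SEMAPHORES[task_type]:
        try:
            result = await asyncio.wait_for(coro_factory(), TIMEOUTS[task_type])
            return {"status": "succeeded", "result": result}
        except asyncio.TimeoutError:
            return {"status": "failed", "error": "timeout"}
```

Because a semaphore queues waiters rather than rejecting them, a burst of same-type tasks simply lines up until a slot frees.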
NovelVids follows a strict four-layer backend architecture with clear separation of concerns. Each layer has a single responsibility and communicates only with its adjacent layers.
The API layer handles HTTP concerns — routing, parameter validation, and response formatting. It delegates all business logic to controllers.
The Controller layer orchestrates workflows, manages state transitions, and coordinates between models and services. It's the brain of the application.
The Model layer defines the data schema using Tortoise ORM and provides async CRUD operations. It supports SQLite for development and PostgreSQL for production.
The Service layer encapsulates all external API interactions. Each AI capability (extraction, image generation, storyboard, video) has its own handler with a consistent execute() interface.
Hot-Swap Configuration: All AI model configurations (API keys, base URLs, model names) are stored in the database, not in environment files. Users can switch between providers or update credentials through the web UI without restarting the server. Each task type supports multiple configurations but only one can be active at a time.
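As an illustration of the single-active-configuration rule, here is a minimal sketch with an in-memory list standing in for the database table; the field names are assumptions, not the real schema:

```python
# In-memory stand-in for the AI-config table (hypothetical rows).
CONFIGS = [
    {"task_type": "video", "model": "viduq2", "api_key": "k1", "active": False},
    {"task_type": "video", "model": "seedance", "api_key": "k2", "active": True},
]

def get_active_config(task_type: str) -> dict:
    """Return the single active configuration for a task type.

    Because config lives in the database, activating a different row
    takes effect on the next task without restarting the server.
    """
    matches = [c for c in CONFIGS if c["task_type"] == task_type and c["active"]]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one active config for {task_type}")
    return matches[0]
```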
The video generation service uses the factory pattern to support multiple providers through a unified interface:
```python
# Registry mapping each supported model type to its generator class.
_GENERATORS = {
    VideoModelTypeEnum.viduq2: ViduGenerator,
    VideoModelTypeEnum.sora2: SoraGenerator,
    VideoModelTypeEnum.seedance: SeedanceGenerator,
    VideoModelTypeEnum.veo3: VeoGenerator,
}

def get_generator(model_type, config) -> BaseVideoGenerator:
    return _GENERATORS[model_type](config)
```
Every generator implements two methods: submit() to start generation and query() to check status. This makes adding a new video model as simple as implementing a new class that extends BaseVideoGenerator.
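Following that contract, adding a provider is just a new subclass. The sketch below assumes the method signatures from the description above; the real base class in the codebase may differ in detail.

```python
import asyncio
from abc import ABC, abstractmethod

class BaseVideoGenerator(ABC):
    """Sketch of the two-method generator interface described above."""

    def __init__(self, config: dict):
        self.config = config

    @abstractmethod
    async def submit(self, prompt: str, reference_images: list[str]) -> str:
        """Start generation; return a provider job ID."""

    @abstractmethod
    async def query(self, job_id: str) -> dict:
        """Return the provider's status payload for a job."""

class FakeGenerator(BaseVideoGenerator):
    """A new provider only needs these two methods (stubbed here)."""

    async def submit(self, prompt, reference_images):
        return "job-001"

    async def query(self, job_id):
        return {"status": "succeeded", "url": f"{job_id}.mp4"}

# Usage: drive the interface exactly as the task executor would.
gen = FakeGenerator({"api_key": "demo"})
job_id = asyncio.run(gen.submit("a shot prompt", []))
```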
NovelVids uses Tortoise ORM with six core models that form a clear hierarchical relationship:
```text
Novel ──1:N──> Chapter ──1:N──> Scene ──1:N──> Video
  │                               │
  └──1:N──> Asset <──M:N──────────┘
                   (scene_assets)
```
| Technology | Role |
|---|---|
| Python 3.12+ | Runtime environment |
| FastAPI | High-performance async web framework |
| Uvicorn | ASGI server |
| Tortoise ORM | Async ORM supporting SQLite and PostgreSQL |
| Pydantic | Data validation and serialization |
| OpenAI SDK | Unified interface for LLM and image API calls |
| httpx | Async HTTP client for video platform APIs |
| FFmpeg | Video merging via subprocess |
| uv | Package management |
| Technology | Role |
|---|---|
| React 19 | UI framework |
| TypeScript | Type safety |
| Vite | Build tooling & dev server |
| Tailwind CSS | Utility-first styling |
| shadcn/ui | Component library (Radix UI primitives) |
| React Router 7 | Client-side routing (HashRouter) |
| Sonner | Toast notification system |
| Lucide React | Icon library |
| Service | Purpose | Auth |
|---|---|---|
| OpenAI-compatible LLM | Entity extraction & storyboard generation | API Key |
| Bytedance Seedream | Reference image generation | API Key |
| Vidu Q2 API | Video generation | Token auth |
| Sora 2 API | Video generation | Bearer auth |
| Seedance API | Video generation (auto t2v/i2v) | Bearer auth |
| Veo 3 API | Video generation | Bearer auth |
No GPU Required: NovelVids does not require any local GPU. All AI-intensive tasks (LLM inference, image generation, video synthesis) are offloaded to cloud-based API services. The server itself only needs standard compute for orchestration, file I/O, and FFmpeg processing.