HunyuanCustom is a multimodal, conditional, and controllable video generation model focused on subject consistency. It accepts text, image, audio, and video inputs for flexible, user-defined video creation.
HunyuanCustom is designed to address the two core challenges of customized video generation: maintaining subject identity and supporting diverse input modalities. Building on the HunyuanVideo framework, it introduces a LLaVA-based image-text fusion module for richer multimodal understanding, and an image ID enhancement module that leverages temporal modeling to keep subjects consistent across frames. For audio- and video-driven scenarios, HunyuanCustom employs dedicated condition-injection networks (AudioNet and a video-driven injection module), enabling precise, disentangled control over each modality. Extensive experiments show that HunyuanCustom not only excels in single- and multi-subject video generation, but also achieves state-of-the-art performance in realism, identity preservation, and flexible scenario adaptation.
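To make the identity-enhancement idea concrete, here is a minimal sketch of one common pattern: the reference image's latent is concatenated along the temporal axis with the video latents, so temporal attention can propagate identity features to every frame. The function name and tensor shapes are assumptions for illustration, not HunyuanCustom's actual code.

```python
import torch

def inject_identity(video_latents: torch.Tensor,
                    image_latent: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: identity enhancement via temporal concatenation.

    video_latents: (B, C, T, H, W) noisy video latents being denoised
    image_latent:  (B, C, 1, H, W) VAE latent of the reference image
    """
    # Prepend the reference latent as an extra "frame"; the transformer's
    # temporal attention can then carry identity cues across all T frames.
    return torch.cat([image_latent, video_latents], dim=2)
```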
Key Features
Multimodal Input
Supports text, image, audio, and video as input conditions for highly flexible and controllable video generation.
Identity Consistency
Advanced temporal modeling and multimodal fusion ensure subject identity consistency across all frames.
LLaVA-based Fusion
Integrates LLaVA-based image-text fusion for enhanced multimodal understanding and generation; a minimal sketch of this pattern follows the feature list.
AudioNet & Video Injection
The AudioNet module and a video-driven injection module enable robust audio- and video-conditioned generation.
State-of-the-Art Performance
Outperforms state-of-the-art open- and closed-source methods in ID consistency, realism, and text-video alignment across multiple benchmarks.
Robustness & Versatility
Demonstrates robustness across single- and multi-subject scenarios and downstream tasks.
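To illustrate the LLaVA-based fusion mentioned above, the sketch below projects vision-encoder features into the language model's embedding space and processes them jointly with the prompt, producing one fused conditioning sequence. Module names, dimensions, and the `llm` interface are assumptions, not the repository's API.

```python
import torch
import torch.nn as nn

class ImageTextFusion(nn.Module):
    """Sketch of LLaVA-style fusion: ground the text prompt in the
    reference image by running both through a single language model."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Adapter that maps vision-encoder features into the LLM token space.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_feats: torch.Tensor,
                text_embeds: torch.Tensor, llm) -> torch.Tensor:
        # image_feats: (B, P, vision_dim) patch features from a vision encoder
        # text_embeds: (B, L, llm_dim) embedded prompt tokens
        tokens = torch.cat([self.proj(image_feats), text_embeds], dim=1)
        # The LLM's hidden states become the fused multimodal condition
        # for the video generator (hypothetical HF-style interface).
        return llm(inputs_embeds=tokens).last_hidden_state
```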
Use Cases
Single-Subject Video Generation
Generate videos featuring a specific subject with high identity consistency, supporting flexible user-defined scenarios.
Multi-Subject Video Generation
Customize videos with multiple subjects, each maintaining its unique identity and appearance throughout the video.
Audio-Driven Generation
Drive video generation with audio input, enabling lip-sync and motion that match the provided sound.
Video-Driven Generation
Leverage reference videos to control motion and style for highly realistic, controllable outputs; both driving paths are sketched below.
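As a schematic of the two driving modalities, the sketch below shows audio injected through cross-attention from video tokens to per-frame audio features, and a reference video injected by adding its projected latents to the latents being denoised. All shapes and names are assumed for illustration; the actual AudioNet and video-injection modules may differ.

```python
import torch
import torch.nn as nn

class AudioInjection(nn.Module):
    """Sketch: audio-conditioned generation via cross-attention."""

    def __init__(self, dim: int = 1024, audio_dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=audio_dim,
                                          vdim=audio_dim, batch_first=True)

    def forward(self, video_tokens: torch.Tensor,
                audio_feats: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N, dim) flattened spatio-temporal video tokens
        # audio_feats:  (B, M, audio_dim) per-frame speech embeddings
        out, _ = self.attn(video_tokens, audio_feats, audio_feats)
        return video_tokens + out  # residual add keeps other conditions intact

def inject_video_condition(noisy_latents: torch.Tensor,
                           cond_latents: torch.Tensor,
                           proj: nn.Module) -> torch.Tensor:
    """Sketch: video-driven control by adding projected, frame-aligned
    latents of the reference video to the latents being denoised."""
    return noisy_latents + proj(cond_latents)
```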
FAQ
Here are some of the most frequently asked questions about HunyuanCustom.
What is HunyuanCustom?
HunyuanCustom is a multimodal, conditional, and controllable video generation model that supports text, image, audio, and video as input conditions. It is designed to ensure subject identity consistency and enable flexible, user-defined video generation scenarios.
What are its key innovations?
HunyuanCustom introduces LLaVA-based image-text fusion, an image ID enhancement module, AudioNet for audio-driven generation, and a video-driven injection module for robust multimodal control and identity preservation.
Which input modalities does it support?
It supports text, image, audio, and video as input conditions, allowing for highly flexible and customizable video generation.
How does it keep subjects consistent across frames?
By leveraging advanced temporal modeling and multimodal fusion, HunyuanCustom maintains subject identity consistency across all video frames.
How does it compare with existing methods?
Extensive experiments show that HunyuanCustom outperforms state-of-the-art open- and closed-source methods in ID consistency, realism, and text-video alignment.
What are typical applications?
HunyuanCustom is suitable for personalized video creation, content generation, entertainment, education, and any scenario requiring controllable, subject-consistent video synthesis.
Where can I find the code, paper, and demo?
You can find the code and more resources on GitHub, read the paper on arXiv, or try the model online via the official demo links.
Experience the next generation of controllable, multimodal video generation!
Try HunyuanCustom