HunyuanVideo-Foley

Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation

Sizhe Shan 1,2,* Qiulin Li 1,3,* Yutao Cui 1 Miles Yang 1 Yuehai Wang 2 Qun Yang 3 Jin Zhou 1,† Zhao Zhong 1
1 Tencent Hunyuan     2 Zhejiang University     3 Nanjing University of Aeronautics and Astronautics

*Equal contribution, †Corresponding author

Abstract

Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in Video-to-Audio (V2A) generation, including multimodal data scarcity, modality imbalance and limited audio quality in existing V2A methods, we propose HunyuanVideo-Foley, an end-to-end Text-Video-to-Audio (TV2A) framework that synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context. Our approach incorporates three core innovations: (1) a scalable data pipeline curating 100k-hour multimodal datasets via automated annotation; (2) a novel multimodal diffusion transformer resolving modal competition through dual-stream temporal fusion and cross-modal semantic injection; (3) representation alignment (REPA) using self-supervised audio features to guide latent diffusion training, efficiently improving generation stability and audio quality. Comprehensive evaluations demonstrate that HunyuanVideo-Foley achieves new state-of-the-art performance across audio fidelity, visual alignment and distribution matching.

Data Pipeline

Data pipeline for filtering video-audio data. The workflow illustrates the processing steps from the raw video database to the filtered video-audio database.

Method Overview

Overall architecture of HunyuanVideo-Foley framework

Overview of the HunyuanVideo-Foley model architecture. The proposed model integrates encoded text (CLAP), visual (SigLIP-2), and audio (DAC-VAE) inputs through a hybrid framework in which multimodal transformer blocks are followed by unimodal transformer blocks. The hybrid transformer blocks are modulated and gated with synchronization features and timestep embeddings. A pre-trained ATST-Frame model is used to compute the REPA loss against latent representations from a unimodal transformer block. The generated audio latents are decoded into audio waveforms by the DAC-VAE decoder.
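The REPA term mentioned in the caption can be illustrated with a minimal sketch. It assumes the loss is the negative mean cosine similarity between projected hidden states from a unimodal transformer block and frame-level features from the frozen ATST-Frame model, which is a common choice for representation-alignment objectives; this page does not state the exact form. The names `project`, `cosine`, and `repa_loss`, the single linear projection `weight`, and the toy vectors are all hypothetical, and both sequences are assumed to have already been resampled to the same number of frames.

```python
import math


def project(hidden_vec, weight):
    """Map one hidden vector into the target feature space (linear, no bias)."""
    return [sum(w * x for w, x in zip(row, hidden_vec)) for row in weight]


def cosine(a, b, eps=1e-8):
    """Cosine similarity of two equal-length vectors, guarded against zero norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def repa_loss(hidden, target, weight):
    """Negative mean per-frame cosine similarity (assumed form of the loss).

    hidden: list of per-frame hidden vectors from the diffusion transformer.
    target: list of per-frame features from the frozen audio encoder.
    weight: projection matrix with one row per target dimension.
    """
    if len(hidden) != len(target):
        raise ValueError("hidden and target must have the same number of frames")
    sims = [cosine(project(h, weight), f) for h, f in zip(hidden, target)]
    return -sum(sims) / len(sims)


# Toy example: 2 frames, hidden size 2, target size 3.
weight = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
hidden = [[1.0, 0.0], [0.0, 2.0]]
aligned_target = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
orthogonal_target = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

loss_aligned = repa_loss(hidden, aligned_target, weight)
loss_orthogonal = repa_loss(hidden, orthogonal_target, weight)
```

In this toy case the loss reaches its minimum of -1 when every projected frame points in the same direction as its target feature, and it is 0 when they are orthogonal. In training, a term of this kind would be added to the diffusion objective with a weighting factor, and only the transformer and the projection would receive gradients.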

Experimental Results

Radar chart comparison of different methods
Radar chart of the video-to-audio evaluation. It shows results on three evaluation sets (Kling-Audio-Eval, VGGSound-Test, and MovieGen-Audio-Bench) and indicates that HunyuanVideo-Foley performs strongly across the evaluated metrics.

Objective Evaluation Results on Kling-Audio-Eval

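| Method | FD_PaNNs | FD_PaSST | KL↓ | IS↑ | PQ↑ | PC↓ | CE↑ | CU↑ | IB↑ | DeSync↓ | CLAP↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FoleyCrafter | 22.30 | 322.63 | 2.47 | 7.08 | 6.05 | 2.91 | 3.28 | 5.44 | 0.22 | 1.23 | 0.22 |
| V-AURA | 33.15 | 474.56 | 3.24 | 5.80 | 5.69 | 3.98 | 3.13 | 4.83 | 0.25 | 0.86 | 0.13 |
| Frieren | 16.86 | 293.57 | 2.95 | 7.32 | 5.72 | 2.55 | 2.88 | 5.10 | 0.21 | 0.86 | 0.16 |
| MMAudio | 9.01 | 205.85 | 2.17 | 9.59 | 5.94 | 2.91 | 3.30 | 5.39 | 0.30 | 0.56 | 0.27 |
| ThinkSound | 9.92 | 228.68 | 2.39 | 6.86 | 5.78 | 3.23 | 3.12 | 5.11 | 0.22 | 0.67 | 0.22 |
| HunyuanVideo-Foley (ours) | 6.07 | 202.12 | 1.89 | 8.30 | 6.12 | 2.76 | 3.22 | 5.53 | 0.38 | 0.54 | 0.24 |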
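The arrows in the table header mark whether a higher or a lower value is better, so the best method has to be picked per metric with the right direction. The snippet below is a small illustrative sketch that does this for three columns of the Kling-Audio-Eval table above (KL, IS, DeSync); the values are copied from that table, while the names `RESULTS`, `HIGHER_IS_BETTER`, and `best_method` are hypothetical helpers introduced only for this example.

```python
# Three columns copied from the Kling-Audio-Eval table above.
RESULTS = {
    "FoleyCrafter": {"KL": 2.47, "IS": 7.08, "DeSync": 1.23},
    "V-AURA": {"KL": 3.24, "IS": 5.80, "DeSync": 0.86},
    "Frieren": {"KL": 2.95, "IS": 7.32, "DeSync": 0.86},
    "MMAudio": {"KL": 2.17, "IS": 9.59, "DeSync": 0.56},
    "ThinkSound": {"KL": 2.39, "IS": 6.86, "DeSync": 0.67},
    "HunyuanVideo-Foley": {"KL": 1.89, "IS": 8.30, "DeSync": 0.54},
}

# Direction of each metric, following the arrows in the table header.
HIGHER_IS_BETTER = {"KL": False, "IS": True, "DeSync": False}


def best_method(results, metric, higher_is_better):
    """Return the method with the best score on one metric."""
    pick = max if higher_is_better else min
    return pick(results, key=lambda name: results[name][metric])


# Map each metric to its best-scoring method.
best_per_metric = {
    metric: best_method(RESULTS, metric, direction)
    for metric, direction in HIGHER_IS_BETTER.items()
}

for metric, name in best_per_metric.items():
    print(f"{metric}: {name} ({RESULTS[name][metric]:.2f})")
```

On these three columns, HunyuanVideo-Foley has the lowest KL and DeSync, while MMAudio has the highest IS. The same helper applies to the other columns once their directions are added to `HIGHER_IS_BETTER`.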

Objective Evaluation Results on VGGSound-Test

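| Method | FD_PaNNs | FD_PaSST | KL↓ | IS↑ | PQ↑ | PC↓ | CE↑ | CU↑ | IB↑ | DeSync↓ | CLAP↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FoleyCrafter | 20.65 | 171.43 | 2.26 | 14.58 | 6.33 | 2.87 | 3.60 | 5.74 | 0.26 | 1.22 | 0.19 |
| V-AURA | 18.91 | 291.72 | 2.40 | 8.58 | 5.70 | 4.19 | 3.49 | 4.87 | 0.27 | 0.72 | 0.12 |
| Frieren | 11.69 | 83.17 | 2.75 | 12.23 | 5.87 | 2.99 | 3.54 | 5.32 | 0.23 | 0.85 | 0.11 |
| MMAudio | 7.42 | 116.92 | 1.77 | 21.00 | 6.18 | 3.17 | 4.03 | 5.61 | 0.33 | 0.47 | 0.25 |
| ThinkSound | 8.46 | 67.18 | 1.90 | 11.11 | 5.98 | 3.61 | 3.81 | 5.33 | 0.24 | 0.57 | 0.16 |
| HunyuanVideo-Foley (ours) | 11.34 | 145.22 | 2.14 | 16.14 | 6.40 | 2.78 | 3.99 | 5.79 | 0.36 | 0.53 | 0.24 |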

Objective and Subjective Evaluation Results on MovieGen-Audio-Bench

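| Method | PQ↑ | PC↓ | CE↑ | CU↑ | IB↑ | DeSync↓ | CLAP↑ | MOS-Q↑ | MOS-S↑ | MOS-T↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| FoleyCrafter | 6.27 | 2.72 | 3.34 | 5.68 | 0.17 | 1.29 | 0.14 | 3.36±0.78 | 3.54±0.88 | 3.46±0.95 |
| V-AURA | 5.82 | 4.30 | 3.63 | 5.11 | 0.23 | 1.38 | 0.14 | 2.55±0.97 | 2.60±1.20 | 2.70±1.37 |
| Frieren | 5.71 | 2.81 | 3.47 | 5.31 | 0.18 | 1.39 | 0.16 | 2.92±0.95 | 2.76±1.20 | 2.94±1.26 |
| MMAudio | 6.17 | 2.84 | 3.59 | 5.62 | 0.27 | 0.80 | 0.35 | 3.58±0.84 | 3.63±1.00 | 3.47±1.03 |
| ThinkSound | 6.04 | 3.73 | 3.81 | 5.59 | 0.18 | 0.91 | 0.20 | 3.20±0.97 | 3.01±1.04 | 3.02±1.08 |
| HunyuanVideo-Foley (ours) | 6.59 | 2.74 | 3.88 | 6.13 | 0.35 | 0.74 | 0.33 | 4.14±0.68 | 4.12±0.77 | 4.15±0.75 |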

Our experimental results show that HunyuanVideo-Foley performs strongly across all three evaluation sets. It obtains the best PQ and IB scores on every set, the lowest DeSync on Kling-Audio-Eval and MovieGen-Audio-Bench, and the highest MOS-Q, MOS-S, and MOS-T on MovieGen-Audio-Bench. On VGGSound-Test, MMAudio and ThinkSound score better on the distribution-matching metrics (FD, KL).

Results and Comparisons

Our HunyuanVideo-Foley framework demonstrates superior performance compared to existing methods. Below are video-audio generation comparisons across different methods:

Sample 1

Prompt: With a faint sound as their hands parted, the two embraced, a soft ‘mm’ escaping between them.
HunyuanVideo-Foley (Ours)
MMAudio
FoleyCrafter
ThinkSound

Sample 2

Prompt: Schools of colorful tropical fish dart through the coral reefs, their flapping fins creating a gentle gurgling sound with the water currents.
HunyuanVideo-Foley (Ours)
MMAudio
FoleyCrafter
ThinkSound

Sample 3

Prompt: The sound of the number 3's bouncing footsteps is as light and clear as glass marbles hitting the ground. Each step carries a magical sound.
HunyuanVideo-Foley (Ours)
MMAudio
FoleyCrafter
ThinkSound

Sample 4

Prompt: The crackling of the fire, the whooshing of the flames, and the occasional crisp popping of charred leaves filled the forest.
HunyuanVideo-Foley (Ours)
MMAudio
FoleyCrafter
ThinkSound

Sample 5

Prompt: distant thunder rumbles and crackles, and music plays in the background which is a dramatic and intense orchestral piece with an inspiring melody and powerful percussion, creating an atmosphere of danger and uncertainty.
HunyuanVideo-Foley (Ours)
MMAudio
FoleyCrafter
ThinkSound

Comprehensive Model Comparison

We provide an extensive comparison of our HunyuanVideo-Foley model against five state-of-the-art methods across 28 different video samples. Each comparison includes the original prompt used for generation, allowing for detailed analysis of how different models interpret and synthesize audio for the same visual content.

Video ID: 006

Prompt: gentle licking the fur, high-quality

HunyuanVideo-Foley
FoleyCrafter
MMAudio
ThinkSound
V-AURA
Frieren

Video ID: 007

Prompt: dog's tongue lapping against the bowl.

HunyuanVideo-Foley
FoleyCrafter
MMAudio
ThinkSound
V-AURA
Frieren

Video ID: 021

Prompt: gritting teeth and heavy breaths are heard in the deserted alley, as a person walks.

HunyuanVideo-Foley
FoleyCrafter
MMAudio
ThinkSound
V-AURA
Frieren

Video ID: 025

Prompt: thunderous footsteps shake the ground, and the dinosaur's massive roar echoes through the valley.

HunyuanVideo-Foley
FoleyCrafter
MMAudio
ThinkSound
V-AURA
Frieren

Citation

    @misc{shan2025hunyuanvideofoleymultimodaldiffusionrepresentation,
      title={HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation}, 
      author={Sizhe Shan and Qiulin Li and Yutao Cui and Miles Yang and Yuehai Wang and Qun Yang and Jin Zhou and Zhao Zhong},
      year={2025},
      eprint={2508.16930},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2508.16930}, 
    }