Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in Video-to-Audio (V2A) generation, including multimodal data scarcity, modality imbalance, and the limited audio quality of existing V2A methods, we propose HunyuanVideo-Foley, an end-to-end Text-Video-to-Audio (TV2A) framework that synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context. Our approach incorporates three core innovations: (1) a scalable data pipeline curating a 100k-hour multimodal dataset via automated annotation; (2) a novel multimodal diffusion transformer that resolves modality competition through dual-stream temporal fusion and cross-modal semantic injection; (3) representation alignment (REPA) using self-supervised audio features to guide latent diffusion training, efficiently improving generation stability and audio quality. Comprehensive evaluations demonstrate that HunyuanVideo-Foley achieves new state-of-the-art performance across audio fidelity, visual alignment, and distribution matching.
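To make innovation (3) concrete, below is a minimal PyTorch sketch of how a REPA term can be added to a latent diffusion training step. It assumes a rectified-flow/velocity-prediction objective, and the `model`, `projector`, and `atst_encoder` callables and the `lambda_repa` weight are hypothetical stand-ins, not the released training code.

```python
import torch
import torch.nn.functional as F

def training_step(model, projector, atst_encoder, audio_latent, audio_wave,
                  video_feat, text_feat, lambda_repa=0.5):
    """One denoising step with a REPA regularizer (audio_latent: (B, T, C))."""
    b = audio_latent.size(0)
    t = torch.rand(b, device=audio_latent.device)            # diffusion timestep
    noise = torch.randn_like(audio_latent)
    t_ = t.view(b, 1, 1)
    noisy = (1 - t_) * audio_latent + t_ * noise             # linear interpolation path

    # The transformer predicts the velocity and also returns an intermediate
    # hidden state from a unimodal block for representation alignment.
    pred, hidden = model(noisy, t, video_feat, text_feat, return_hidden=True)
    diffusion_loss = F.mse_loss(pred, noise - audio_latent)  # flow-matching target

    # REPA: align projected hidden states with frozen self-supervised
    # ATST-Frame features extracted from the ground-truth waveform.
    with torch.no_grad():
        target = atst_encoder(audio_wave)                    # (B, T', D)
    aligned = projector(hidden)                              # map to ATST feature dim
    # Assumes the time axes of both feature sequences have been matched
    # (e.g., by interpolation) before computing the similarity.
    repa_loss = 1.0 - F.cosine_similarity(aligned, target, dim=-1).mean()

    return diffusion_loss + lambda_repa * repa_loss
```

The key design choice is that the alignment target comes from a frozen self-supervised encoder, so the diffusion backbone is nudged toward semantically structured audio representations without any extra labels.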
Overview of the HunyuanVideo-Foley model architecture. The proposed model integrates encoded text (CLAP), visual (SigLIP-2), and audio (DAC-VAE) inputs through a hybrid framework of multimodal transformer blocks followed by unimodal transformer blocks. The hybrid transformer blocks are modulated and gated with synchronization features and timestep embeddings. A pre-trained ATST-Frame encoder is used to compute the REPA loss with latent representations from a unimodal transformer block. The generated audio latents are decoded into audio waveforms by the DAC-VAE decoder.
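Read schematically, the caption above corresponds to a stack like the following sketch. All names, block counts, and dimensions are illustrative assumptions (this is not the released code); the adaLN-style scale/shift/gate driven by the sum of timestep and synchronization embeddings is one common way to realize the "modulated and gated" conditioning described.

```python
import torch
import torch.nn as nn

class AdaLNMod(nn.Module):
    """Scale/shift/gate computed from the timestep + sync conditioning vector."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.proj = nn.Linear(dim, 3 * dim)

    def forward(self, x, cond):                     # x: (B, L, D), cond: (B, D)
        scale, shift, gate = self.proj(cond).unsqueeze(1).chunk(3, dim=-1)
        return self.norm(x) * (1 + scale) + shift, gate

class MultimodalBlock(nn.Module):
    """Dual-stream fusion: joint attention over audio+video tokens,
    with text tokens injected via cross-attention."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.mod_a, self.mod_v = AdaLNMod(dim), AdaLNMod(dim)
        self.joint_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, a, v, txt, cond):
        a_n, g_a = self.mod_a(a, cond)
        v_n, g_v = self.mod_v(v, cond)
        x = torch.cat([a_n, v_n], dim=1)            # shared temporal sequence
        fused, _ = self.joint_attn(x, x, x)
        a = a + g_a * fused[:, : a.size(1)]
        v = v + g_v * fused[:, a.size(1) :]
        inj, _ = self.text_attn(a, txt, txt)        # cross-modal semantic injection
        return a + inj, v

class UnimodalBlock(nn.Module):
    """Audio-only refinement with the same modulation/gating."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.mod = AdaLNMod(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, a, cond):
        a_n, g = self.mod(a, cond)
        out, _ = self.attn(a_n, a_n, a_n)
        return a + g * out

# Toy forward pass (all shapes illustrative):
dim = 512
a, v = torch.randn(2, 100, dim), torch.randn(2, 100, dim)  # DAC-VAE / SigLIP-2 tokens
txt, cond = torch.randn(2, 16, dim), torch.randn(2, dim)   # CLAP text / t+sync embedding
a, v = MultimodalBlock(dim)(a, v, txt, cond)
a = UnimodalBlock(dim)(a, cond)   # an intermediate state like this feeds the REPA loss
```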
Objective evaluation results on the first benchmark test set:

Method | FD-PANNs↓ | FD-PaSST↓ | KL↓ | IS↑ | PQ↑ | PC↓ | CE↑ | CU↑ | IB↑ | DeSync↓ | CLAP↑ |
---|---|---|---|---|---|---|---|---|---|---|---|
FoleyCrafter | 22.30 | 322.63 | 2.47 | 7.08 | 6.05 | 2.91 | 3.28 | 5.44 | 0.22 | 1.23 | 0.22 |
V-AURA | 33.15 | 474.56 | 3.24 | 5.80 | 5.69 | 3.98 | 3.13 | 4.83 | 0.25 | 0.86 | 0.13 |
Frieren | 16.86 | 293.57 | 2.95 | 7.32 | 5.72 | 2.55 | 2.88 | 5.10 | 0.21 | 0.86 | 0.16 |
MMAudio | 9.01 | 205.85 | 2.17 | 9.59 | 5.94 | 2.91 | 3.30 | 5.39 | 0.30 | 0.56 | 0.27 |
ThinkSound | 9.92 | 228.68 | 2.39 | 6.86 | 5.78 | 3.23 | 3.12 | 5.11 | 0.22 | 0.67 | 0.22 |
HunyuanVideo-Foley (ours) | 6.07 | 202.12 | 1.89 | 8.30 | 6.12 | 2.76 | 3.22 | 5.53 | 0.38 | 0.54 | 0.24 |
Objective evaluation results on a second benchmark test set:

Method | FD-PANNs↓ | FD-PaSST↓ | KL↓ | IS↑ | PQ↑ | PC↓ | CE↑ | CU↑ | IB↑ | DeSync↓ | CLAP↑ |
---|---|---|---|---|---|---|---|---|---|---|---|
FoleyCrafter | 20.65 | 171.43 | 2.26 | 14.58 | 6.33 | 2.87 | 3.60 | 5.74 | 0.26 | 1.22 | 0.19 |
V-AURA | 18.91 | 291.72 | 2.40 | 8.58 | 5.70 | 4.19 | 3.49 | 4.87 | 0.27 | 0.72 | 0.12 |
Frieren | 11.69 | 83.17 | 2.75 | 12.23 | 5.87 | 2.99 | 3.54 | 5.32 | 0.23 | 0.85 | 0.11 |
MMAudio | 7.42 | 116.92 | 1.77 | 21.00 | 6.18 | 3.17 | 4.03 | 5.61 | 0.33 | 0.47 | 0.25 |
ThinkSound | 8.46 | 67.18 | 1.90 | 11.11 | 5.98 | 3.61 | 3.81 | 5.33 | 0.24 | 0.57 | 0.16 |
HunyuanVideo-Foley (ours) | 11.34 | 145.22 | 2.14 | 16.14 | 6.40 | 2.78 | 3.99 | 5.79 | 0.36 | 0.53 | 0.24 |
Subjective (MOS) and objective evaluation results on a third benchmark:

Method | PQ↑ | PC↓ | CE↑ | CU↑ | IB↑ | DeSync↓ | CLAP↑ | MOS-Q↑ | MOS-S↑ | MOS-T↑ |
---|---|---|---|---|---|---|---|---|---|---|
FoleyCrafter | 6.27 | 2.72 | 3.34 | 5.68 | 0.17 | 1.29 | 0.14 | 3.36±0.78 | 3.54±0.88 | 3.46±0.95 |
V-AURA | 5.82 | 4.30 | 3.63 | 5.11 | 0.23 | 1.38 | 0.14 | 2.55±0.97 | 2.60±1.20 | 2.70±1.37 |
Frieren | 5.71 | 2.81 | 3.47 | 5.31 | 0.18 | 1.39 | 0.16 | 2.92±0.95 | 2.76±1.20 | 2.94±1.26 |
MMAudio | 6.17 | 2.84 | 3.59 | 5.62 | 0.27 | 0.80 | 0.35 | 3.58±0.84 | 3.63±1.00 | 3.47±1.03 |
ThinkSound | 6.04 | 3.73 | 3.81 | 5.59 | 0.18 | 0.91 | 0.20 | 3.20±0.97 | 3.01±1.04 | 3.02±1.08 |
HunyuanVideo-Foley (ours) | 6.59 | 2.74 | 3.88 | 6.13 | 0.35 | 0.74 | 0.33 | 4.14±0.68 | 4.12±0.77 | 4.15±0.75 |
Our experimental results show that HunyuanVideo-Foley achieves the best overall performance across the three evaluation sets, leading on most metrics for audio quality, temporal alignment, and cross-modal consistency.
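For reference, the FD columns in the tables above are Fréchet Distances computed in a pretrained audio-embedding space (PANNs and PaSST features, respectively); lower is better. A minimal sketch of the standard computation, assuming embeddings for generated and reference audio have already been extracted:

```python
import numpy as np
from scipy import linalg

def frechet_distance(emb_gen: np.ndarray, emb_ref: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to (N, D) embedding sets."""
    mu_g, mu_r = emb_gen.mean(axis=0), emb_ref.mean(axis=0)
    cov_g = np.cov(emb_gen, rowvar=False)
    cov_r = np.cov(emb_ref, rowvar=False)
    covmean = linalg.sqrtm(cov_g @ cov_r)           # matrix square root
    if np.iscomplexobj(covmean):                    # discard tiny imaginary noise
        covmean = covmean.real
    diff = mu_g - mu_r
    return float(diff @ diff + np.trace(cov_g + cov_r - 2.0 * covmean))
```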
Below are video-audio generation comparisons between HunyuanVideo-Foley and existing methods:
We provide an extensive comparison of our HunyuanVideo-Foley model against five state-of-the-art methods across 28 different video samples. Each comparison includes the original prompt used for generation, allowing for detailed analysis of how different models interpret and synthesize audio for the same visual content.
Prompt: gentle licking the fur, high-quality
Prompt: dog's tongue lapping against the bowl.
Prompt: gritting teeth and heavy breaths are heard in the deserted alley, as a person walks.
Prompt: thunderous footsteps shake the ground, and the dinosaur's massive roar echoes through the valley.
@misc{shan2025hunyuanvideofoleymultimodaldiffusionrepresentation,
  title={HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation},
  author={Sizhe Shan and Qiulin Li and Yutao Cui and Miles Yang and Yuehai Wang and Qun Yang and Jin Zhou and Zhao Zhong},
  year={2025},
  eprint={2508.16930},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2508.16930},
}