JoVA

Unified Multimodal Learning for Joint Video-Audio Generation

Xiaohu Huang1,* Hao Zhou2,* Qiangpeng Yang2 Shilei Wen2 Kai Han1,✉
1The University of Hong Kong    2ByteDance
* Equal Contribution    ✉ Corresponding Author

Abstract

We present JoVA, a unified framework for joint video-audio generation. Unlike previous methods, which rely on explicit fusion modules or lack speech capability, JoVA employs joint self-attention across video and audio tokens, enabling direct cross-modal interaction within a streamlined architecture. To address the challenge of lip-speech synchronization, we introduce a simple yet effective mouth-area-specific loss based on facial keypoint detection. Extensive experiments demonstrate that JoVA outperforms state-of-the-art unified and audio-driven methods in lip-sync accuracy, speech quality, and overall generation fidelity.

Demo Video

Figure 1: Demonstration of JoVA generating synchronized video and audio from text prompts.

Method

Method Framework Diagram

JoVA processes video and audio tokens directly within a single transformer using joint self-attention, eliminating redundant alignment modules. To ensure high-fidelity synchronization, we employ a mouth-area-specific loss that strengthens supervision on the critical mouth region during training without compromising architectural simplicity.
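To make the two ideas concrete, here is a minimal PyTorch sketch of joint self-attention: video and audio tokens are concatenated into one sequence, so a single attention call lets every token attend across both modalities. The class name, tensor shapes, and use of `nn.MultiheadAttention` are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class JointSelfAttention(nn.Module):
    """Sketch: one attention call over concatenated video + audio tokens,
    with no separate cross-modal fusion module."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # video_tokens: (B, Nv, D); audio_tokens: (B, Na, D)
        x = torch.cat([video_tokens, audio_tokens], dim=1)  # (B, Nv + Na, D)
        out, _ = self.attn(x, x, x)  # every token attends across both modalities
        n_video = video_tokens.shape[1]
        return out[:, :n_video], out[:, n_video:]  # split back into two streams
```

The mouth-area-specific loss can be sketched the same way: a per-pixel error that is up-weighted inside a mouth mask derived from facial keypoints. The squared-error form and the `lambda_mouth` factor below are hypothetical choices for illustration only.

```python
def mouth_weighted_loss(pred: torch.Tensor, target: torch.Tensor,
                        mouth_mask: torch.Tensor, lambda_mouth: float = 2.0):
    """pred/target: (B, C, H, W); mouth_mask: (B, 1, H, W), 1 inside the
    mouth region obtained from a facial keypoint detector."""
    per_pixel = (pred - target).pow(2)        # ordinary MSE everywhere
    weight = 1.0 + lambda_mouth * mouth_mask  # extra supervision on the mouth
    return (weight * per_pixel).mean()
```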

Data Construction

Data Construction Pipeline

Figure 2: Overview of the data construction pipeline, including Text-to-Audio, Text-to-Video-Audio, and Text-to-Avatar-Speech subsets.
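As a toy illustration of how the three subsets could be pooled into one training corpus, the record below tags each sample with its subset; all field and subset names are hypothetical, not the paper's schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainingSample:
    subset: str        # "text-to-audio", "text-to-video-audio", or "text-to-avatar-speech"
    text_prompt: str   # caption / transcript used as the text condition
    audio_path: str    # target waveform, present in all three subsets
    video_path: Optional[str] = None  # absent for the audio-only subset
```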

Performance Comparison

Method           Type           Lip-Sync (LSE-C ↑)   TTS (WER ↓)   Audio Quality (PQ ↑)   Video Motion (MS ↑)

Audio-Driven Generation
FantasyTalking   Audio-Driven   3.10                 -             -                      0.22
Wan-S2V          Audio-Driven   6.43                 -             -                      0.82

Joint Video-Audio Generation
Universe-1       Joint Gen      1.62                 0.37          4.39                   0.43
OVI              Joint Gen      6.41                 0.23          5.77                   0.94
JoVA (Ours)      Joint Gen      6.64                 0.18          6.45                   0.98

Table 1: Quantitative comparison on standard benchmarks (↑ higher is better, ↓ lower is better). JoVA achieves the best results across all four metrics.
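As a point of reference for the TTS column: word error rate compares an ASR transcript of the generated speech against the target text. The snippet below is an illustration only, using the open-source `jiwer` package and made-up strings; the paper's exact evaluation pipeline is not described here.

```python
import jiwer

reference = "the quick brown fox"           # target transcript (made-up example)
hypothesis = "the quick brown fox jumps"    # ASR output on the generated speech

# WER = (substitutions + deletions + insertions) / reference word count = 1 / 4
print(jiwer.wer(reference, hypothesis))     # 0.25
```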

Generated Examples

Note: these examples will be continuously updated.

Qualitative results showing diversity in style, geometry, and audio-visual synchronization.
(Click a card to play or pause its video; hover to view the full prompt.)

Citation

If you find our work useful for your research, please consider citing:

@article{huang2025JoVA,
  title={JoVA: Unified Multimodal Learning for Joint Video-Audio Generation},
  author={Huang, Xiaohu and Zhou, Hao and Yang, Qiangpeng and Wen, Shilei and Han, Kai},
  journal={arXiv preprint},
  year={2025}
}