Unified Multimodal Learning for Joint Video-Audio Generation
We present JoVA, a unified framework for joint video-audio generation. Unlike previous methods that rely on explicit fusion modules or lack speech capability, JoVA employs joint self-attention across video and audio tokens, enabling direct cross-modal interaction within a streamlined architecture. To address the challenge of lip-speech synchronization, we introduce a simple yet effective mouth-area-specific loss based on facial keypoint detection. Extensive experiments demonstrate that JoVA outperforms state-of-the-art unified and audio-driven methods in lip-sync accuracy, speech quality, and overall generation fidelity.
Figure 1: Demonstration of JoVA generating synchronized video and audio from text prompts.
JoVA processes video and audio tokens directly within a single transformer using Joint Self-Attention, eliminating redundant alignment modules. To ensure high-fidelity synchronization, we employ a mouth-area-specific loss that strengthens supervision on the critical mouth region during training without compromising architectural simplicity.
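To make the design concrete, below is a minimal PyTorch sketch of the two ideas described above: joint self-attention over the concatenated video and audio token sequence, and a keypoint-derived mouth-region weighting applied to the training loss. This is an illustrative sketch, not the released implementation; the names (`JointSelfAttentionBlock`, `mouth_weighted_loss`, `lambda_mouth`) are our own, and details such as the loss target (pixels vs. latents) and the exact weighting scheme are assumptions.

```python
import torch
import torch.nn as nn

class JointSelfAttentionBlock(nn.Module):
    """Sketch of joint self-attention: video and audio tokens are
    concatenated into one sequence, so a single attention layer models
    cross-modal interaction directly, with no separate fusion module."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # video_tokens: (B, Nv, dim); audio_tokens: (B, Na, dim)
        x = torch.cat([video_tokens, audio_tokens], dim=1)  # (B, Nv+Na, dim)
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        x = x + out  # residual connection
        nv = video_tokens.shape[1]
        return x[:, :nv], x[:, nv:]  # split back into per-modality streams


def mouth_weighted_loss(pred, target, mouth_mask, lambda_mouth=2.0):
    """Sketch of a mouth-area-specific loss: a binary mask derived from
    facial keypoints up-weights the error inside the mouth region.

    pred, target: (B, C, H, W) predicted vs. ground-truth frames (or latents)
    mouth_mask:   (B, 1, H, W) 1 inside the keypoint-derived mouth area, else 0
    lambda_mouth: extra weight on the mouth region (hypothetical value)
    """
    err = (pred - target) ** 2
    weight = 1.0 + lambda_mouth * mouth_mask
    return (weight * err).mean()
```

Concatenating the modalities before attention means every video token can attend to every audio token (and vice versa) in the same layer, which is what removes the need for an explicit alignment module.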
Figure 2: Overview of the data construction pipeline, including Text-to-Audio, Text-to-Video-Audio, and Text-to-Avatar-Speech subsets.
| Method | Type | Lip-Sync (LSE-C) ↑ | TTS (WER) ↓ | Audio Quality (PQ) ↑ | Video Motion (MS) ↑ |
|---|---|---|---|---|---|
| *Audio-Driven Generation* | | | | | |
| FantasyTalking | Audio-Driven | 3.10 | - | - | 0.22 |
| Wan-S2V | Audio-Driven | 6.43 | - | - | 0.82 |
| *Joint Video-Audio Generation* | | | | | |
| Universe-1 | Joint Gen | 1.62 | 0.37 | 4.39 | 0.43 |
| OVI | Joint Gen | 6.41 | 0.23 | 5.77 | 0.94 |
| JoVA (Ours) | Joint Gen | **6.64** | **0.18** | **6.45** | **0.98** |
Table 1: Quantitative comparison with audio-driven and joint generation baselines. Arrows indicate whether higher (↑) or lower (↓) is better. JoVA achieves the best scores on all four metrics, covering lip-sync accuracy, speech intelligibility, audio quality, and video motion.
Note: these examples will be continuously updated.
Qualitative results showing diversity in style, geometry, and audio-visual synchronization.
If you find our work useful for your research, please consider citing: