LTX-2: The First Open-Source Efficient Joint Audio-Visual Foundation Model
Introduction
For years, text-to-video diffusion models have produced visually impressive results. Motion quality has improved. Temporal consistency has improved. Resolution has improved. Yet one core element has been missing: sound.
Most text-to-video systems generate silent clips. Audio, if present, is added later using a separate model or manual editing. This approach breaks realism. Sound is not an afterthought. It carries emotion, physical meaning, and narrative context. A falling object implies impact. A speaking character implies voice, tone, and timing. When sound is generated separately, these connections are often lost.
Current industry pipelines rely on decoupled generation. Video comes first. Audio is filled in later. These sequential methods fail to model the deep, bidirectional relationship between sight and sound. They cannot naturally capture how a physical action produces a specific sound, or how speech timing affects facial motion.
LTX-2 is designed to solve this problem directly. It is the first open-source Efficient Joint Audio-Visual Foundation Model that generates video and audio together, in a single unified process. Instead of treating sound as an accessory, LTX-2 models the true joint distribution of audio and video.
Efficiency and openness matter. Closed systems limit research, creativity, and trust. LTX-2 provides full model weights, code, and inference tools. This allows researchers, engineers, and creators to inspect, adapt, and extend the system without restrictions.
What Is LTX-2?
Overview of LTX-2
Official Paper link
Official GitHub link
LTX-2 is an open-source generative system that models the text-conditioned joint distribution of video and audio signals. It is developed by Lightricks and built as a 19-billion-parameter foundation model.
The defining feature of LTX-2 is how it treats audio and video. They are not separate tasks. They are interdependent streams generated together. This design enables tight temporal alignment between motion, speech, music, and environmental sounds.
Because the model operates jointly, it achieves sub-frame synchronization. This allows realistic lip-syncing, physically grounded sound effects, and ambient audio that matches the visual environment. The model produces speech, background noise, and foley elements as part of the same generative process.
LTX-2 supports multilingual prompts and generates coherent audiovisual tracks without post-processing. This makes it suitable for storytelling, research, and creative prototyping.
Architecture Overview
Architecture of LTX-2
LTX-2 is built from three major architectural components: modality-specific encoders, a refined text embedding pipeline, and an asymmetric dual-stream Diffusion Transformer.
1. Modality-Specific Encoders
Audio and video differ fundamentally in structure. Forcing them into a single encoder would reduce efficiency and quality. LTX-2 instead uses separate Variational Autoencoders (VAEs) for each modality.
Video encoder: A spatiotemporal causal VAE compresses raw video into latent tokens that preserve motion and spatial structure.
Audio encoder: A dedicated causal audio VAE processes mel spectrograms into a one-dimensional latent representation. This VAE natively supports stereo audio by handling two-channel spectrograms.
These encoders allow each modality to retain its natural structure before joint modeling begins.
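To make the split concrete, here is a minimal PyTorch sketch of two modality-specific encoders: a 3D-convolutional video encoder and a 1D-convolutional encoder over stereo mel spectrograms. The layer choices, compression factors, and channel counts are illustrative placeholders (and the variational heads are omitted), not the actual LTX-2 VAEs.

```python
# Minimal sketch of modality-specific encoders (illustrative only, not the LTX-2 code).
# Assumptions: compression factors, channel counts, and layer choices are placeholders.
import torch
import torch.nn as nn

class ToyVideoEncoder(nn.Module):
    """Compresses (B, 3, T, H, W) video into spatiotemporal latent tokens."""
    def __init__(self, latent_dim: int = 16):
        super().__init__()
        # Strided 3D convolutions downsample time and space together.
        self.net = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(2, 4, 4), padding=1),
            nn.SiLU(),
            nn.Conv3d(64, latent_dim, kernel_size=3, stride=(2, 2, 2), padding=1),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        return self.net(video)  # (B, latent_dim, T', H', W')

class ToyAudioEncoder(nn.Module):
    """Compresses stereo mel spectrograms (B, 2, mel_bins, frames) into a 1D latent sequence."""
    def __init__(self, mel_bins: int = 80, latent_dim: int = 32):
        super().__init__()
        # Collapse the mel axis, then downsample along time with strided 1D convolutions.
        self.net = nn.Sequential(
            nn.Conv1d(2 * mel_bins, 128, kernel_size=5, stride=2, padding=2),
            nn.SiLU(),
            nn.Conv1d(128, latent_dim, kernel_size=5, stride=2, padding=2),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        b, c, m, t = mel.shape
        return self.net(mel.reshape(b, c * m, t))  # (B, latent_dim, T_audio')

video_latents = ToyVideoEncoder()(torch.randn(1, 3, 16, 128, 128))
audio_latents = ToyAudioEncoder()(torch.randn(1, 2, 80, 256))
print(video_latents.shape, audio_latents.shape)
```

The point of the sketch is the separation itself: each modality keeps a latent shape suited to its structure (spatiotemporal for video, one-dimensional for audio) before the joint transformer ever sees it.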
2. Asymmetric Dual-Stream Diffusion Transformer
The core of LTX-2 is a dual-stream transformer backbone that processes audio and video in parallel.
A 14B-parameter video stream models complex spatial layouts and motion dynamics.
A 5B-parameter audio stream focuses on the temporal evolution of sound.
The streams are connected through bidirectional cross-attention layers. This allows information to flow in both directions. Visual events can influence audio generation, such as a collision producing a sharp sound. Audio can influence visual motion, such as speech driving lip articulation.
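The sketch below shows one such bidirectional exchange using standard multi-head attention, with video tokens attending to audio tokens and vice versa. The dimensions and the residual wiring are assumptions for illustration, not the LTX-2 configuration.

```python
# Minimal sketch of bidirectional cross-attention between a video stream and an
# audio stream (illustrative only; sizes are placeholders, not LTX-2's).
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    def __init__(self, video_dim: int = 1024, audio_dim: int = 512, heads: int = 8):
        super().__init__()
        # Video tokens attend to audio tokens, and vice versa.
        self.video_from_audio = nn.MultiheadAttention(
            embed_dim=video_dim, num_heads=heads, kdim=audio_dim, vdim=audio_dim, batch_first=True
        )
        self.audio_from_video = nn.MultiheadAttention(
            embed_dim=audio_dim, num_heads=heads, kdim=video_dim, vdim=video_dim, batch_first=True
        )

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # Visual events can shape the audio latents, and audio timing can shape the visuals.
        v_update, _ = self.video_from_audio(video_tokens, audio_tokens, audio_tokens)
        a_update, _ = self.audio_from_video(audio_tokens, video_tokens, video_tokens)
        return video_tokens + v_update, audio_tokens + a_update

video_tokens = torch.randn(1, 2048, 1024)  # flattened spatiotemporal video latents
audio_tokens = torch.randn(1, 256, 512)    # 1D audio latents
v, a = BidirectionalCrossAttention()(video_tokens, audio_tokens)
print(v.shape, a.shape)
```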
3. Text Understanding and Thinking Tokens
Text conditioning is handled by a large language backbone based on Gemma-3-12B. Instead of using only the final layer of the language model, LTX-2 extracts features from all decoder layers. This captures both low-level phonetic structure and high-level semantic intent.
The model also introduces thinking tokens. These are additional tokens appended to the text sequence. They act as an internal workspace where information is aggregated before diffusion begins. This improves adherence to complex prompts and makes speech generation more expressive and context-aware.
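The toy module below illustrates both ideas: per-layer features from a small transformer stack (standing in for Gemma-3-12B's decoder layers) are mixed with learned weights instead of keeping only the final layer, and learnable thinking tokens are appended to the conditioning sequence. The weighted-sum aggregation and all sizes are assumptions made for this sketch, not the documented LTX-2 recipe.

```python
# Toy illustration of (a) pooling features from every layer of a text backbone
# and (b) appending learnable "thinking" tokens to the conditioning sequence.
# The tiny transformer stands in for Gemma-3-12B; the weighted-sum aggregation
# is an assumption for illustration only.
import torch
import torch.nn as nn

class ToyTextConditioner(nn.Module):
    def __init__(self, vocab: int = 1000, dim: int = 256, layers: int = 4, n_thinking: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(layers)
        )
        # One learned mixing weight per layer, so low-level (phonetic) and
        # high-level (semantic) features can both contribute.
        self.layer_weights = nn.Parameter(torch.zeros(layers))
        # Learnable thinking tokens appended to the text sequence as an
        # internal workspace for the diffusion model to attend to.
        self.thinking_tokens = nn.Parameter(torch.randn(1, n_thinking, dim) * 0.02)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.embed(token_ids)
        per_layer = []
        for layer in self.layers:
            h = layer(h)
            per_layer.append(h)
        stacked = torch.stack(per_layer)                    # (layers, B, T, dim)
        weights = torch.softmax(self.layer_weights, dim=0)  # mix all layers, not just the last
        text_features = (weights[:, None, None, None] * stacked).sum(dim=0)
        thinking = self.thinking_tokens.expand(text_features.size(0), -1, -1)
        return torch.cat([text_features, thinking], dim=1)  # (B, T + n_thinking, dim)

conditioning = ToyTextConditioner()(torch.randint(0, 1000, (1, 32)))
print(conditioning.shape)  # torch.Size([1, 40, 256])
```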
Training & Efficiency Design
LTX-2’s efficiency comes from multiple design choices:
1. Asymmetric Stream Sizes
The video stream is larger than the audio stream. Video data tends to be more complex and high-dimensional than audio signals. Giving the video stream more capacity improves visual fidelity without over-allocating compute to audio.
2. Diffusion Over Latents
Instead of generating raw pixel and waveform outputs directly, LTX-2 refines latent representations. Latent diffusion models operate in compressed spaces, which makes denoising steps much faster and more efficient.
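A quick back-of-the-envelope comparison shows why this matters. The resolution and compression factors below are hypothetical placeholders, not LTX-2's published numbers.

```python
# Back-of-the-envelope comparison of denoising in pixel space vs. a compressed
# latent space. All values here are illustrative placeholders.
frames, height, width, channels = 120, 1024, 1536, 3
pixel_elements = frames * height * width * channels

# Assume a spatiotemporal VAE that downsamples 8x in space and 4x in time
# (hypothetical factors) into a 16-channel latent.
lat_t, lat_h, lat_w, lat_c = frames // 4, height // 8, width // 8, 16
latent_elements = lat_t * lat_h * lat_w * lat_c

print(f"pixel elements:  {pixel_elements:,}")
print(f"latent elements: {latent_elements:,}")
print(f"reduction:       {pixel_elements / latent_elements:.0f}x")
```

Every denoising step operates on the smaller latent tensor, so the savings multiply across the dozens of steps in a sampling run.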
3. Cross-Modality Attention
By coupling audio and video streams through attention, the model avoids separate generation pipelines. This saves compute and ensures coherence without additional post-processing.
4. Quantized and Distilled Variants
The open-source release includes quantized versions (e.g., fp8) and distilled models. These are smaller and faster variants optimized for lower compute usage or real-time preview workflows.
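As a rough illustration of what an fp8 checkpoint buys, the sketch below stores a weight matrix at one byte per parameter with a per-tensor scale and upcasts it at compute time. It shows the general idea only and is not the release's actual quantization code; it assumes PyTorch 2.1+ for torch.float8_e4m3fn.

```python
# Sketch of per-tensor FP8 weight storage with upcast at compute time.
# Illustrates the general idea behind fp8 checkpoints; NOT the actual LTX-2
# quantization code. Requires PyTorch >= 2.1 for torch.float8_e4m3fn.
import torch

def quantize_fp8(weight: torch.Tensor):
    # Scale so the largest magnitude maps near the fp8 e4m3 maximum (~448).
    scale = weight.abs().max().clamp(min=1e-12) / 448.0
    return (weight / scale).to(torch.float8_e4m3fn), scale

def dequantize_fp8(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return w_fp8.to(torch.bfloat16) * scale

weight = torch.randn(4096, 4096)          # a full-precision linear weight
w_fp8, scale = quantize_fp8(weight)       # stored at ~1 byte per parameter
restored = dequantize_fp8(w_fp8, scale)   # upcast on the fly for the matmul

print(w_fp8.element_size(), "byte(s) per weight in storage")
print("max abs error:", (weight - restored.float()).abs().max().item())
```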
Comparison with Other Models
Traditional text-to-video models like Make-A-Video or Imagen Video generate visually rich clips but output silent video. A character may appear to speak or objects may collide, yet there is no voice, impact sound, or ambient audio tied to the visual events.
Audio-only generation models such as AudioLM or MusicLM create realistic speech and music but have no awareness of visual motion. Sounds are generated without knowing what appears on screen, leading to audio that cannot react to actions or timing in a video.
Sequential audio-video pipelines first generate video and then add sound using a separate model. This approach is common in practice, but speech often fails to align perfectly with lip movement, and sound effects feel disconnected because the models never observe each other during generation.
LTX-2 differs by generating audio and video together in a single diffusion process, allowing visual actions to directly influence sound and audio cues to shape motion timing.
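Schematically, the difference looks like the loop below: a single denoiser updates both latent streams at every sampling step, so each modality conditions the other throughout generation. The Euler-style update and the denoiser interface are illustrative assumptions, not the LTX-2 sampler.

```python
# Schematic sketch of joint audio-video sampling. Both latent streams are
# refined by the same denoiser at every step. The update rule and the
# `denoiser` interface are illustrative assumptions, not the LTX-2 sampler.
import torch

def joint_sample(denoiser, text_cond, steps: int = 30):
    video_latents = torch.randn(1, 2048, 1024)  # flattened video latent tokens
    audio_latents = torch.randn(1, 256, 512)    # 1D audio latent tokens
    for i in range(steps):
        t = 1.0 - i / steps
        # One model sees both streams (plus the text conditioning) and predicts
        # an update for each, keeping them synchronized step by step.
        video_vel, audio_vel = denoiser(video_latents, audio_latents, text_cond, t)
        video_latents = video_latents - video_vel / steps
        audio_latents = audio_latents - audio_vel / steps
    return video_latents, audio_latents

# A sequential pipeline would instead finish the video latents completely before
# an audio model ever saw them, losing the step-by-step coupling above.
dummy = lambda v, a, c, t: (torch.zeros_like(v), torch.zeros_like(a))
v, a = joint_sample(dummy, text_cond=None)
print(v.shape, a.shape)
```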
Who Should Use LTX-2?
AI Engineers
LTX-2’s open weights and code make it suitable for research and custom adaptation. You can fine-tune variants for specialized domains or integrate them into larger systems.
Researchers
Academics studying multimodal learning can inspect training methods, experiment with cross-modal attention mechanisms, or compare LTX-2 with other foundation models.
Creative Technologists
Filmmakers, animators, and audiovisual artists can build prototypes, concept visuals, or generative tools for production workflows.
Open-Source Contributors
Since the entire stack is public, contributors can improve efficiency, add new generation modes, and expand datasets. LTX-2 offers a foundation for community-driven innovation.
Limitations
Clip Duration: The current open-source release supports clips of up to roughly 20 seconds at high quality. Longer sequences pose challenges for temporal coherence.
Compute Requirements: Although efficient compared to proprietary systems, high-resolution generation still needs strong GPUs with adequate VRAM.
Output Fidelity Variance: For complex scene compositions, generated audio and visuals may show minor coherence or sync issues typical of current diffusion-based video systems.
Conclusion
LTX-2 represents a major advance in generative AI. It is the first widely accessible model that generates synchronized audio and video in a single pass, released with open-source code and weights. Its design balances efficiency and quality, making cinematic-grade generation possible even for local experiments and research.
The model unifies audio and visual generation through a dual-stream architecture with cross-modal mechanisms. It allows rich control over the creative process and serves as a stable foundation for future multimodal learning research.
As tools like LTX-2 evolve, the boundary between AI research and real-world creative workflows will continue to blur. Opening these powerful models to the global community expands who can innovate and what forms creative AI systems can take.
Frequently Asked Questions
What makes LTX-2 different from existing text-to-video models?
LTX-2 generates audio and video together in a single diffusion process, ensuring tight temporal synchronization instead of adding sound later.
Is LTX-2 fully open-source?
Yes. LTX-2 provides open weights, source code, and inference tools, allowing inspection, research use, and custom adaptation.
What are the main limitations of LTX-2?
The model currently supports clips up to about 20 seconds and requires strong GPUs for high-resolution generation.