Explore state-of-the-art text-to-audio, audio-to-audio, and audio inpainting techniques powered by diffusion models and large language models.
1 Paper Overview
Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation
Jinlong Xue, Yayue Deng, Yingming Gao, Ya Li
Beijing University of Posts and Telecommunications, Beijing, China
Paper on arXiv | Code on GitHub | Hugging Face
1.1 Abstract
Recent advancements in diffusion models and large language models (LLMs) have significantly propelled the field of AIGC. Text-to-Audio (TTA), a burgeoning AIGC application designed to generate audio from natural language prompts, is attracting increasing attention. However, existing TTA studies often struggle with generation quality and text-audio alignment, especially for complex textual inputs. Drawing inspiration from state-of-the-art Text-to-Image (T2I) diffusion models, we introduce Auffusion—a TTA system that adapts T2I model frameworks for audio generation by leveraging inherent generative strengths and precise cross-modal alignment. Objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches, even when using limited data and computational resources. Comprehensive ablation studies and innovative cross-attention map visualizations further showcase its superior text-audio alignment, benefiting related tasks such as audio style transfer, inpainting, and other manipulations.
1.2 Note
- Auffusion generates text-conditional sound effects, human speech, and music.
- The latent diffusion model (LDM) is initialized from a pretrained Stable Diffusion checkpoint, conditions on text through cross-attention, and is trained on a single A6000 GPU.
- Its strong text-audio alignment enables text-guided audio style transfer, inpainting, and attention-based reweighting/replacement manipulations.
1.3 Figure 1: Overview of Auffusion Architecture
The training and inference process involves back-and-forth transformations between four feature spaces: audio, spectrogram, pixel, and latent space. Note that the U-Net is initialized with a pretrained text-to-image LDM.
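As a rough illustration of these four feature spaces, here is a minimal sketch that converts a waveform into a log-mel spectrogram, maps it to an image-like tensor, and encodes it with a pretrained image VAE from the diffusers library. The mel settings, normalization, and stand-in VAE checkpoint are assumptions for illustration only and may differ from Auffusion's actual preprocessing.

```python
# Sketch of the audio -> spectrogram -> pixel -> latent chain.
# All settings below (mel parameters, normalization, VAE checkpoint) are
# illustrative assumptions, not Auffusion's exact configuration.
import torch
import torchaudio
from diffusers import AutoencoderKL

waveform, sr = torchaudio.load("input.wav")                      # audio space
waveform = waveform.mean(dim=0, keepdim=True)                    # force mono
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, hop_length=256, n_mels=80
)(waveform)                                                      # spectrogram space

# Squash the log-mel values into an image-like [-1, 1] range and repeat them
# across 3 channels so a pretrained text-to-image VAE can consume them.
log_mel = torch.log(mel.clamp(min=1e-5))
pixels = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min())
pixels = pixels.unsqueeze(0).repeat(1, 3, 1, 1) * 2 - 1          # pixel space, [1, 3, 80, T]
pixels = pixels[..., : pixels.shape[-1] // 8 * 8]                # VAE wants sizes divisible by 8

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse") # stand-in image VAE
with torch.no_grad():
    latents = vae.encode(pixels).latent_dist.sample()            # latent space

# At inference the chain runs in reverse: denoised latents are decoded back to a
# spectrogram image, and a vocoder (e.g. HiFi-GAN) reconstructs the waveform.
print(latents.shape)
```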
2 Table of Contents
- Text-to-Audio Generation
- TTA Generation with ChatGPT Text Prompts
- Multi-Event Comparison
- Cross-Attention Map Comparison
- Text-Guided Audio Style Transfer
- Audio Inpainting
- Attention-based Replacement
- Attention-based Reweighting
- Other Comments
- Future Enhancements
- FAQ
3 Text-to-Audio Generation
3.1 Short Samples:
- Two gunshots followed by birds chirping / A dog is barking / People cheering in a stadium while rolling thunder and lightning strikes
3.2 Acoustic Environment Control:
- A man is speaking in a huge room / A man is speaking in a small room / A man is speaking in a studio
3.3 Material Control:
- Chopping tomatoes on a wooden table / Chopping meat on a wooden table / Chopping potatoes on a metal table
3.4 Pitch Control:
- Sine wave with low pitch / Sine wave with medium pitch / Sine wave with high pitch
3.5 Temporal Order Control:
- A racing car is passing by and disappearing / Two gunshots followed by birds flying away while chirping / Wooden table tapping sound followed by water pouring sound
3.6 Label-to-Audio Generation:
- Siren / Thunder / Oink
- Explosion / Applause / Fart
- Chainsaw / Fireworks / Chicken, rooster
- Unconditional Generation: “Null”
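Prompts like the ones above can be rendered with a short inference script. The sketch below is only a guess at the usage pattern: the AuffusionPipeline import path, the checkpoint id, the sampling arguments, and the output fields are all assumptions, so please refer to the GitHub repository for the actual API.

```python
# Hypothetical inference sketch; the names below are assumptions, not the
# confirmed Auffusion API (see the GitHub repo for real usage).
import torch
import soundfile as sf
from auffusion_pipeline import AuffusionPipeline  # assumed module/class name

pipe = AuffusionPipeline.from_pretrained("auffusion/auffusion")  # assumed checkpoint id
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

prompt = "Two gunshots followed by birds chirping"
result = pipe(prompt=prompt, num_inference_steps=100, guidance_scale=7.5)

# Assumes the pipeline returns decoded waveforms and that audio is 16 kHz.
sf.write("two_gunshots_birds.wav", result.audios[0], samplerate=16000)
```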
4 TTA Generation with ChatGPT Text Prompts
- Birds singing sweetly in a blooming garden
- A kitten mewing for attention
- Magical fairies laughter echoing through an enchanted forest
- Soft whispers of a bedtime story being told
- A monkey laughs before getting hit on the head by a large atomic bomb
- A pencil scribbling on a notepad
- The splashing of water in a pond
- Coins clinking in a piggy bank
- A kid is whistling in a studio
- A distant church bell chiming noon
- A car’s horn honking in traffic
- Angry kids breaking glass in frustration
- An old-fashioned typewriter clacking
- A girl screaming at the most demented and vile sight
- A train whistle blowing in the distance
5 Multi-Event Comparison
Each text description below is paired with the ground-truth audio and with samples generated by AudioGen, AudioLDM, AudioLDM2, Tango, and Auffusion.
- A bell chiming as a clock ticks and a man talks through a television speaker in the background followed by a muffled bell chiming
- Buzzing and humming of a motor with a man speaking
- A series of machine gunfire and two gunshots firing as a jet aircraft flies by followed by soft music playing
- Woman speaks, girl speaks, clapping, croaking noise interrupts, followed by laughter
- A man talking as paper crinkles followed by plastic creaking then a toilet flushing
- Rain falls as people talk and laugh in the background
- People walk heavily, pause, slide their feet, walk, stop, and begin walking again
6 Cross-Attention Map Comparison
Cross-attention maps are compared across the following systems: Auffusion-no-pretrain, Auffusion-w-clip, Auffusion-w-clap, Auffusion-w-flant5, and Tango.
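For intuition about what these maps show, the sketch below computes a generic scaled-dot-product cross-attention map between flattened latent spectrogram features (queries) and text-token embeddings (keys), then reshapes one token's attention weights into a time-frequency heat map. The tensor shapes are made up for illustration; this is not the visualization code used in the paper.

```python
# Generic cross-attention map sketch (shapes are illustrative assumptions).
import torch

batch, n_heads, d_head = 1, 8, 64
freq_bins, time_frames, n_tokens = 10, 64, 12   # latent spectrogram grid and prompt length

# Queries come from the flattened latent spectrogram; keys from the text encoder.
q = torch.randn(batch, n_heads, freq_bins * time_frames, d_head)
k = torch.randn(batch, n_heads, n_tokens, d_head)

attn = torch.softmax(q @ k.transpose(-2, -1) / d_head**0.5, dim=-1)  # [B, H, F*T, tokens]

# Average over heads and select one token: the result is a heat map over the
# time-frequency grid, which is what the cross-attention visualizations display.
token_idx = 3
heatmap = attn.mean(dim=1)[0, :, token_idx].reshape(freq_bins, time_frames)
print(heatmap.shape)
```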
7 Text-Guided Audio Style Transfer
Examples:
- From cat screaming to car racing.
- From bird chirping to ambulance siren.
- From baby crying to cat meowing.
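Under the hood, text-guided style transfer with latent diffusion models is commonly done SDEdit-style: the source clip's latent is partially noised and then denoised under the target text prompt. The sketch below shows only the partial-noising step with diffusers' DDPMScheduler; the latent shape, strength value, and scheduler choice are assumptions rather than Auffusion's exact procedure.

```python
# SDEdit-style partial noising sketch (illustrative assumptions throughout).
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(50)

# Fraction of the diffusion trajectory re-run under the target prompt:
# higher strength = more noise added = the output follows the new text more.
strength = 0.6
start_index = int(len(scheduler.timesteps) * (1 - strength))
t = scheduler.timesteps[start_index]

source_latents = torch.randn(1, 4, 10, 64)   # stand-in for the encoded source audio
noise = torch.randn_like(source_latents)
noisy_latents = scheduler.add_noise(source_latents, noise, t)

# The remaining timesteps scheduler.timesteps[start_index:] would then be denoised
# by the U-Net conditioned on the *target* prompt (e.g. "car racing"), keeping the
# source's temporal structure while changing its timbre.
print(noisy_latents.shape)
```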
Other Comments
- We will share our code on GitHub, open-sourcing the audio generation model training and evaluation pipelines to make comparisons easier.
- We are resolving the data-related copyright issues; the pretrained models will be released once this is done.
Future Enhancements
- Publish demo website and arXiv link.
- Publish Auffusion and Auffusion-Full checkpoints.
- Add text-guided style transfer.
- Add audio-to-audio generation.
- Add audio inpainting.
- Add attention-based word swap and reweight control (prompt2prompt-based).
- Add audio super-resolution.
- Build a Gradio web application integrating audio-to-audio, inpainting, style transfer, and super-resolution.
- Add data preprocessing and training code.
Acknowledgement
This website was created based on the AudioLDM GitHub project page.
FAQ
- What is Auffusion?
Auffusion is a state-of-the-art text-to-audio generation model that leverages diffusion models and large language models to create high-quality audio from textual prompts.
- How does Text-to-Audio generation work?
The system transforms textual descriptions into audio by mapping text embeddings into audio feature spaces using a latent diffusion model, ensuring high fidelity and precise alignment.
- What are the core features of Auffusion?
Auffusion supports text-to-audio generation, audio-to-audio transformation, audio inpainting, and text-guided audio style transfer.
- What role does diffusion play in this model?
Diffusion models gradually transform random noise into coherent audio by following the reverse diffusion process guided by textual inputs; a minimal sketch of this denoising loop is shown after the FAQ.
- Is the model open-source?
Yes, the code and model checkpoints are intended to be open-sourced, allowing the research community to access and build upon the project.
- What hardware is required to run Auffusion?
The model was trained on a single A6000 GPU; however, performance may vary depending on your hardware and specific setup.
- How can I try generating audio with Auffusion?
You can run the provided inference code or use the Colab notebooks to generate audio samples from your own text prompts.
- What is audio inpainting?
Audio inpainting is the process of filling in missing parts of an audio clip while ensuring seamless transitions and maintaining the overall sound integrity.
- Can I use the model for commercial purposes?
Usage rights depend on the model’s license; please review the repository license and accompanying documentation for commercial usage guidelines.
- How can I contribute to the Auffusion project?
You can contribute by reporting issues, suggesting improvements, or submitting pull requests via the project’s GitHub repository.
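As referenced in the FAQ answer on diffusion, here is a minimal sketch of a classifier-free-guided reverse diffusion loop as used by latent diffusion models in general. The U-Net and text embeddings are random stand-ins so the snippet runs on its own; it is not Auffusion's actual sampler.

```python
# Minimal reverse-diffusion sketch with classifier-free guidance
# (generic latent diffusion; the noise predictor and embeddings are stand-ins).
import torch
from diffusers import DDIMScheduler

def fake_unet(latents, t, text_emb):
    """Stand-in noise predictor; a real U-Net would actually use text_emb."""
    return torch.randn_like(latents)

scheduler = DDIMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(50)

latents = torch.randn(1, 4, 10, 64)    # start from pure noise in latent space
text_emb = torch.randn(1, 77, 768)     # stand-in text-encoder output
null_emb = torch.zeros_like(text_emb)  # unconditional ("null") embedding
guidance_scale = 7.5

for t in scheduler.timesteps:
    noise_cond = fake_unet(latents, t, text_emb)
    noise_uncond = fake_unet(latents, t, null_emb)
    # Classifier-free guidance pushes the sample toward the text condition.
    noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# Decoding `latents` with the VAE and a vocoder would yield the final audio.
print(latents.shape)
```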