
Explore state-of-the-art Text-to-Audio, audio-to-audio, and Audio InPainting techniques powered by diffusion and large language models.

2 Paper Overview

Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation

Jinlong Xue, Yayue Deng, Yingming Gao, Ya Li
Beijing University of Posts and Telecommunications, Beijing, China

Paper on ArXiv | Code on GitHub | Hugging Face

2.1 Abstract

Recent advancements in diffusion models and large language models (LLMs) have significantly propelled the field of AIGC. Text-to-Audio (TTA), a burgeoning AIGC application designed to generate audio from natural language prompts, is attracting increasing attention. However, existing TTA studies often struggle with generation quality and text-audio alignment, especially for complex textual inputs. Drawing inspiration from state-of-the-art Text-to-Image (T2I) diffusion models, we introduce Auffusion—a TTA system that adapts T2I model frameworks for audio generation by leveraging inherent generative strengths and precise cross-modal alignment. Objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches, even when using limited data and computational resources. Comprehensive ablation studies and innovative cross-attention map visualizations further showcase its superior text-audio alignment, benefiting related tasks such as audio style transfer, inpainting, and other manipulations.

2.2 Note

  • Auffusion generates text-conditional sound effects, human speech, and music.
  • The latent diffusion model (LDM) is trained on a single A6000 GPU and is built on Stable Diffusion, with text conditioning applied through cross-attention (a minimal sketch follows this list).
  • Its strong text-audio alignment enables text-guided audio style transfer, inpainting, and attention-based reweighting/replacement manipulations.
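
The cross-attention conditioning mentioned above follows the standard Stable Diffusion design: U-Net features act as queries while the text-encoder states supply keys and values. The minimal PyTorch sketch below illustrates the idea only; the class and dimension names are ours, not Auffusion's actual implementation.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: latent (spectrogram) tokens attend to text tokens."""
    def __init__(self, dim: int, text_dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)       # queries from U-Net features
        self.to_k = nn.Linear(text_dim, dim, bias=False)  # keys from text embeddings
        self.to_v = nn.Linear(text_dim, dim, bias=False)  # values from text embeddings
        self.scale = dim ** -0.5

    def forward(self, x, text_states):
        # x: (batch, latent_tokens, dim); text_states: (batch, text_tokens, text_dim)
        q, k, v = self.to_q(x), self.to_k(text_states), self.to_v(text_states)
        attn = torch.softmax(q @ k.transpose(-1, -2) * self.scale, dim=-1)
        return attn @ v  # each latent token becomes a text-weighted mixture of values
```

These attention weights are the quantities displayed in the cross-attention map comparison further down the page.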

2.3 Figure 1: Overview of Auffusion Architecture

The training and inference process involves back-and-forth transformations between four feature spaces: audio, spectrogram, pixel, and latent space. Note that the U-Net is initialized with a pretrained text-to-image LDM.
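
As a rough illustration of these four spaces, the sketch below walks the forward path with torchaudio; the mel settings are illustrative, and the VAE, U-Net, and vocoder calls are left as commented placeholders rather than Auffusion's actual components.

```python
import torch
import torchaudio

# Audio -> spectrogram: log-mel spectrogram (parameter values are illustrative).
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=256, n_mels=64)
waveform = torch.randn(1, 16000 * 10)             # stand-in for a 10 s audio clip
spec = torch.log(to_mel(waveform) + 1e-5)         # shape: (1, 64, frames)

# Spectrogram -> pixel space: rescale to an image-like tensor so a pretrained
# text-to-image VAE and U-Net can operate on it.
pixel = (spec - spec.min()) / (spec.max() - spec.min())   # normalize to [0, 1]
pixel = pixel.unsqueeze(1)                                # (1, 1, 64, frames) "image"

# Pixel -> latent space: a pretrained VAE encoder compresses the image, and the
# text-conditioned U-Net performs diffusion in this latent space.
# latent = vae.encode(pixel).latent_dist.sample()         # placeholder call

# Inference reverses the path: latent -> pixel (VAE decode) -> spectrogram
# (undo the normalization) -> audio (a neural vocoder, e.g. HiFi-GAN).
```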

4 Text-to-Audio Generation

4.1 Short Samples:

  • Two gunshots followed by birds chirping / A dog is barking / People cheering in a stadium while rolling thunder and lightning strikes

4.2 Acoustic Environment Control:

  • A man is speaking in a huge room / A man is speaking in a small room / A man is speaking in a studio

4.3 Material Control:

  • Chopping tomatoes on a wooden table / Chopping meat on a wooden table / Chopping potatoes on a metal table

4.4 Pitch Control:

  • Sine wave with low pitch / Sine wave with medium pitch / Sine wave with high pitch

4.5 Temporal Order Control:

  • A racing car is passing by and disappearing / Two gunshots followed by birds flying away while chirping / Wooden table tapping sound followed by water pouring sound

4.6 Label-to-Audio Generation:

  • Siren / Thunder / Oink
  • Explosion / Applause / Fart
  • Chainsaw / Fireworks / Chicken, rooster
  • Unconditional Generation: “Null”

5 TTA Generation with ChatGPT Text Prompts

  • Birds singing sweetly in a blooming garden
  • A kitten mewing for attention
  • Magical fairies laughter echoing through an enchanted forest
  • Soft whispers of a bedtime story being told
  • A monkey laughs before getting hit on the head by a large atomic bomb
  • A pencil scribbling on a notepad
  • The splashing of water in a pond
  • Coins clinking in a piggy bank
  • A kid is whistling in a studio
  • A distant church bell chiming noon
  • A car’s horn honking in traffic
  • Angry kids breaking glass in frustration
  • An old-fashioned typewriter clacking
  • A girl screaming at the most demented and vile sight
  • A train whistle blowing in the distance

6 Multi-Event Comparison

Each text description below is compared across Ground-Truth, AudioGen, AudioLDM, AudioLDM2, Tango, and Auffusion samples:

  • A bell chiming as a clock ticks and a man talks through a television speaker in the background followed by a muffled bell chiming
  • Buzzing and humming of a motor with a man speaking
  • A series of machine gunfire and two gunshots firing as a jet aircraft flies by followed by soft music playing
  • Woman speaks, girl speaks, clapping, croaking noise interrupts, followed by laughter
  • A man talking as paper crinkles followed by plastic creaking then a toilet flushing
  • Rain falls as people talk and laugh in the background
  • People walk heavily, pause, slide their feet, walk, stop, and begin walking again

7 Cross-Attention Map Comparison

Comparisons include:
Auffusion-no-pretrain / Auffusion-w-clip / Auffusion-w-clap / Auffusion-w-flant5 / Tango.

8 Text-Guided Audio Style Transfer

Examples:

  • From cat screaming to car racing.
  • From bird chirping to ambulance siren.
  • From baby crying to cat meowing.
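
A common way to realize this kind of transfer with a latent diffusion model is an SDEdit/img2img-style procedure: encode the source clip's spectrogram, add a controlled amount of noise, then denoise under the target text prompt. The sketch below is a generic illustration of that idea written against diffusers-style components (`vae`, `unet`, `scheduler`, and `text_emb` are assumed to be already loaded); it is not necessarily the exact procedure used by Auffusion.

```python
import torch

@torch.no_grad()
def style_transfer(source_pixel, text_emb, vae, unet, scheduler,
                   strength=0.6, num_steps=50):
    """Noise the source latent part-way, then denoise it under the target prompt."""
    scheduler.set_timesteps(num_steps)
    keep = int(num_steps * strength)                  # how many denoising steps to run
    timesteps = scheduler.timesteps[num_steps - keep:]

    latent = vae.encode(source_pixel).latent_dist.sample()
    noise = torch.randn_like(latent)
    latent = scheduler.add_noise(latent, noise, timesteps[:1])   # jump to a mid-noise level

    for t in timesteps:                               # denoise toward the target prompt
        noise_pred = unet(latent, t, encoder_hidden_states=text_emb).sample
        latent = scheduler.step(noise_pred, t, latent).prev_sample

    return vae.decode(latent).sample                  # back to the spectrogram "image"
```

A higher `strength` discards more of the source structure (e.g. the cat scream) and gives the target prompt (e.g. car racing) more influence.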

Other Comments

  1. We will share our code on GitHub, open-sourcing the audio generation model training and evaluation pipeline for easier comparison.
  2. We are resolving data-related copyright issues; the pretrained models will be released once this is complete.

Future Enhancements

  • Publish demo website and arXiv link.
  • Publish Auffusion and Auffusion-Full checkpoints.
  • Add text-guided style transfer.
  • Add audio-to-audio generation.
  • Add audio inpainting.
  • Add attention-based word swap and reweight control (prompt2prompt-based).
  • Add audio super-resolution.
  • Build a Gradio web application integrating audio-to-audio, inpainting, style transfer, and super-resolution.
  • Add data preprocessing and training code.

Acknowledgement

This website is built upon the work at AudioLDM GitHub.

FAQ

  1. What is Auffusion?
    Auffusion is a state-of-the-art text-to-audio generation model that leverages diffusion models and large language models to create high-quality audio from textual prompts.
  2. How does Text-to-Audio generation work?
    The system transforms textual descriptions into audio by mapping text embeddings into audio feature spaces using a latent diffusion model, ensuring high fidelity and precise alignment.
  3. What are the core features of Auffusion?
    Auffusion supports Text-to-Audio generation, audio-to-audio transformation, audio inpainting, and text-guided audio style transfer.
  4. What role does diffusion play in this model?
    Diffusion models gradually transform random noise into coherent audio by running the reverse diffusion process, guided by the textual input (a minimal sampling-loop sketch appears after this FAQ).
  5. Is the model open-source?
    Yes, the code and model checkpoints are intended to be open-sourced, allowing the research community to access and build upon the project.
  6. What hardware is required to run Auffusion?
    The model has been trained on a single A6000 GPU; however, performance may vary depending on your hardware and specific setup.
  7. How can I try generating audio with Auffusion?
    You can run the provided inference code or use the Colab notebooks to generate audio samples from your own text prompts.
  8. What is Audio InPainting?
    Audio InPainting is the process of filling in missing parts of an audio clip, ensuring seamless transitions and maintaining the overall sound integrity.
  9. Can I use the model for commercial purposes?
    Usage rights depend on the model’s license; please review the repository license and accompanying documentation for commercial usage guidelines.
  10. How can I contribute to the Auffusion project?
    You can contribute by reporting issues, suggesting improvements, or submitting pull requests via the project’s GitHub repository.
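
As a companion to questions 2 and 4, the sketch below shows what a reverse diffusion loop with classifier-free guidance looks like when written against diffusers-style components; the function and argument names are ours, not Auffusion's public API.

```python
import torch

@torch.no_grad()
def sample_latent(unet, scheduler, text_emb, uncond_emb,
                  latent_shape=(1, 4, 32, 128), guidance_scale=7.5, num_steps=50):
    """Reverse diffusion with classifier-free guidance (generic sketch)."""
    latent = torch.randn(latent_shape)            # start from pure Gaussian noise
    scheduler.set_timesteps(num_steps)

    for t in scheduler.timesteps:
        # Predict the noise with and without the text condition.
        eps_text = unet(latent, t, encoder_hidden_states=text_emb).sample
        eps_uncond = unet(latent, t, encoder_hidden_states=uncond_emb).sample
        # Classifier-free guidance: push the prediction toward the text condition.
        eps = eps_uncond + guidance_scale * (eps_text - eps_uncond)
        # One denoising step back toward a clean latent.
        latent = scheduler.step(eps, t, latent).prev_sample

    return latent  # decode with the VAE, then vocode the spectrogram to audio
```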