Generative music with language+diffusion models
September 16, 2022 — December 6, 2023
Placeholder.
A special class of generative AI for music. For alternatives, see nn music.
Here we consider specifically diffusion models, much as in diffusion image synthesis, but applied to audio.
(N. Chen et al. 2020; Goel et al. 2022; Hernandez-Olivan, Hernandez-Olivan, and Beltran 2022; Kreuk, Taigman, et al. 2022; Kreuk, Synnaeve, et al. 2022; Lee and Han 2021; Pascual et al. 2022; von Platen et al. 2022)
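To make the idea concrete, here is a minimal, generic sketch of the DDPM forward (noising) process and training objective applied directly to raw waveforms. The noise schedule, the toy convolutional "denoiser", and all tensor shapes are illustrative placeholders, not the setup of any of the papers cited above.

```python
import torch

# Standard DDPM-style noise schedule (values are illustrative).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, noise):
    """Sample x_t ~ q(x_t | x_0) for a batch of waveforms x0 of shape (batch, channels, samples)."""
    a = alphas_cumprod[t].sqrt().view(-1, 1, 1)
    s = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1)
    return a * x0 + s * noise

# A stand-in denoiser; real systems use 1-D U-Nets or transformers over spectrogram/latent frames.
denoiser = torch.nn.Conv1d(1, 1, kernel_size=9, padding=4)

x0 = torch.randn(8, 1, 16384)        # placeholder batch of short mono clips
t = torch.randint(0, T, (8,))        # random diffusion timesteps
noise = torch.randn_like(x0)
x_t = q_sample(x0, t, noise)         # noised waveforms
loss = torch.nn.functional.mse_loss(denoiser(x_t), noise)  # train the model to predict the added noise
loss.backward()
```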
1 Text-to-music
Not really my jam, but very interesting.
CLAP seems to be the dominant way of connecting text labels to audio, both as a conditioning signal and as an evaluation metric.
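As a concrete example of what CLAP gives you, the sketch below scores how well a few candidate captions match an audio clip via the shared text-audio embedding space. It assumes the transformers CLAP integration and the laion/clap-htsat-unfused checkpoint id; the waveform is placeholder noise, so substitute a real clip resampled to 48 kHz.

```python
import numpy as np
import torch
from transformers import ClapModel, ClapProcessor

# Checkpoint id is an assumption; any CLAP checkpoint works the same way.
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

texts = ["a solo piano ballad", "aggressive drum and bass", "field recording of rain"]
# Placeholder: 5 seconds of noise at 48 kHz; load a real clip with librosa/torchaudio instead.
waveform = np.random.randn(48000 * 5).astype("float32")

inputs = processor(text=texts, audios=[waveform], sampling_rate=48000,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Softmax over captions: how well each text label matches the clip.
probs = outputs.logits_per_audio.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```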
AudioLDM: Text-to-Audio Generation with Latent Diffusion Models - Speech Research
> Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called “language of audio” (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate new state-of-the-art or competitive performance to previous approaches.
- MusicLDM extends this with some interesting music-specific tricks, such as tempo-aware controls
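For a sense of how this family of models is typically driven in practice, here is a minimal text-to-audio sketch using the diffusers AudioLDM2Pipeline. The cvssp/audioldm2 checkpoint id, the prompt, and the sampler settings are assumptions (cvssp/audioldm2-music is the music-specialised variant), so treat this as a sketch rather than a recipe.

```python
import torch
import scipy.io.wavfile
from diffusers import AudioLDM2Pipeline

# Assumed checkpoint id; swap in cvssp/audioldm2-music for the music-focused weights.
pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16).to("cuda")

prompt = "gentle ambient pad with slow tape-saturated arpeggios"
audio = pipe(
    prompt,
    negative_prompt="low quality, distorted",
    num_inference_steps=200,
    audio_length_in_s=10.0,
).audios[0]

# AudioLDM 2 decodes to 16 kHz mono waveforms.
scipy.io.wavfile.write("audioldm2.wav", rate=16000, data=audio.astype("float32"))
```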
MusicGen: Simple and Controllable Music Generation
> We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, while being conditioned on textual description or melodic features, allowing better controls over the generated output. We conduct extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light over the importance of each of the components comprising MusicGen. Music samples can be found on the supplemental materials. Code and models are available on our repo github.com/facebookresearch/audiocraft.
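A minimal generation sketch via the transformers integration, assuming the facebook/musicgen-small checkpoint; the prompt and sampling parameters are placeholders. The audiocraft repo linked above also ships its own higher-level API.

```python
import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(
    text=["lo-fi breakbeat with warm Rhodes chords and vinyl crackle"],
    padding=True,
    return_tensors="pt",
)
# Roughly 50 audio tokens per second, so 256 new tokens is about 5 seconds of music.
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3.0, max_new_tokens=256)

rate = model.config.audio_encoder.sampling_rate  # 32 kHz for the MusicGen checkpoints
scipy.io.wavfile.write("musicgen.wav", rate=rate, data=audio_values[0, 0].numpy())
```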
2 Tooling
HuggingFace's diffusers library looks like a de facto standard; a minimal unconditional waveform example with it is sketched after this list
archinetai/audio-diffusion-pytorch: Audio generation using diffusion models, in PyTorch.
diffusion_models/diffusion_03_waveform.ipynb at main · acids-ircam/diffusion_models
Apple acquires song-shifting startup AI Music, here’s what it could mean for users
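As promised above, a minimal unconditional waveform-generation sketch with diffusers, using DanceDiffusionPipeline. The harmonai/maestro-150k checkpoint id and the generation settings are assumptions; any Dance Diffusion checkpoint should slot in the same way.

```python
import scipy.io.wavfile
from diffusers import DanceDiffusionPipeline

# Assumed checkpoint id for one of Harmonai's Dance Diffusion models.
pipe = DanceDiffusionPipeline.from_pretrained("harmonai/maestro-150k")

output = pipe(audio_length_in_s=4.0, num_inference_steps=100)
audio = output.audios[0]                 # numpy array, shape (channels, samples)
rate = pipe.unet.config.sample_rate      # sampling rate stored in the model config

# scipy expects (samples, channels), hence the transpose.
scipy.io.wavfile.write("dance_diffusion.wav", rate=rate, data=audio.T)
```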