This week Google released new sound clips from its MusicLM AI project, and the results are impressive. Using a text prompt, much like ChatGPT or DALL·E 2, the model generates music based on whatever you write. While this sort of technology has been in development for some time, the clips released this week sound closer to real-life recordings than anything we’ve heard from AI before. You can listen to audio examples alongside their text prompts on the Google Research GitHub page.
Google MusicLM
MusicLM is a music language model from Google’s AI division that uses machine learning to generate new music and analyze existing pieces. It can evaluate music along attributes such as harmony, rhythm, melody, style, and form, and can generate new pieces from those attributes and user input. The aim is to assist musicians and composers in their creative process and to offer new ways of generating and analyzing music.
Before you start clicking around to find it, be aware that it’s not yet available to the public. From what I can gather from news coverage, there are copyright barriers (and, I imagine, the looming threat of litigation) that may keep it out of public hands in the short term.
MusicLM: Generating Music From Text
Abstract
We introduce MusicLM, a model generating high-fidelity music from text descriptions such as “a calming violin melody backed by a distorted guitar riff”. MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.
Read more and listen here.
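While the model itself isn’t public, the MusicCaps dataset mentioned in the abstract is. For anyone curious what those 5.5k music-text pairs actually look like, here is a minimal sketch of how you might browse the captions with the Hugging Face `datasets` library. The dataset identifier "google/MusicCaps" and the field names are assumptions on my part, not details from Google’s announcement, so check the official release page for the canonical source.

```python
# Minimal sketch: inspecting the MusicCaps captions with Hugging Face `datasets`.
# The identifier "google/MusicCaps" is an assumed mirror name, not confirmed by
# the article; the dataset pairs short YouTube clips with expert-written text.
from datasets import load_dataset

musiccaps = load_dataset("google/MusicCaps", split="train")  # roughly 5.5k rows

# Print the available fields and one example row to see how a caption is structured.
print(musiccaps.column_names)
print(musiccaps[0])
```

Note that the audio itself isn’t bundled; the rows reference the source clips, so the captions are the immediately usable part for anyone exploring text-to-music ideas.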