While this might be a technical breakthrough, none of the examples sounded any good. Every aspect of the provided sounds is bad. The music sounds muffled and badly mixed. The generated beat doesn't groove or have anything interesting in it. The barking saxophone just sounded bad. The voices sounded somewhat convincing.
In general, I think AI-generated audio makes it much more noticeable how utterly bad everything AI generates is. I already absolutely hate the two AI voices that appear in so many YouTube videos; most of the time they are reason enough for me to close the video immediately.
While I agree with you, this is the release of a research paper [1] and some accompanying demos on GitHub [2]. This is not a finished product fine-tuned for high-quality output.
[1] https://d1qx31qr3h6wln.cloudfront.net/publications/FUGATTO.p...
[2] https://fugatto.github.io/
With apologies for the X link, here is an example from Suno which felt very musical to me: https://x.com/sunomusic/status/1857501332560818342
Here's another example on the Suno website: https://suno.com/song/fc991b95-e4e9-4c8f-87e8-e5e4560755e7
There are things I like from Suno, but, having used it to make quite a lot, I also get vibes of something subtly wrong that I can't put my finger on, which I assume is somewhere between the audio version of bad kerning and Cronenberg fingers. Too many examples of vocoder/autotune in the training set, perhaps?
That said, I mostly prefer AI over "real" human-made recordings (pop, classical, metal, bardcore, whatever) because I tend to learn the patterns too fast to enjoy, or really even tolerate, any recording more than about 3 times* — I assume I'd like live jazz for longer, but have only been to one place that ever had it so I don't know if it breaks that pattern.
* sole exception: TV theme tunes, though the point of them isn't to listen to them
Suno is by far the best generated AI music I've heard. That said, it is hot garbage.
I've listened to both songs on my Bose QC Ultra headphones, which are far from perfect headphones. But even on them, the female voice has unbearable resonances in the higher frequencies. The male voice sounds mostly OK, but it also has something that sounds like compression artifacts (MP3-style compression, not loudness compression). All the instruments in these songs have these problems: they sound somewhat like the real thing, but really badly recorded. The mixing isn't any good either.
It is still very impressive that AI can generate this. But if I recorded my band and someone delivered a mix like that, I would fire them immediately. Heck, I would be furious that they fucked up so badly, and I would try to get my money back.
So the two links you provided just confirm what I said.
I use Suno like a producer in a music studio hires musicians to bring ideas to life. I wish more features in Suno would empower music producers. I sample pieces, re-mix doodles, get ideas to continue my tracks... I can see the future, and as an amateur, it's just liberating and a lot of fun.
Really interesting. I haven't listened to their output on high-quality speakers or anything like that. Do poorly made human recordings have this problem, or is it currently a signal of AI generation?
I don't find any problems whatsoever in that audio, but I'm not an avid music listener, so my intuitive guess is that the same underlying issue as in image generation is happening: AI makes technically horrible, rage-inducing filler that lacks high-level semantic structure, but average people have neither the words nor the experience to assess and describe what's going on.
> I don't find any problems whatsoever in that audio
I think this is why there is no real, powerful protest against all this generated stuff. Only the people who care are able to articulate what's wrong with it. To me, all AI-generated content sounds horrible. To almost everyone else, it sounds OK. So we will see and hear more of this generated stuff. We are in the middle of the enshittification of all consumable media.
People were happy with the included wired iPhone earbuds for years, even though they were terrible.
Listening on a laptop speaker, Suno sounds fine. Listening on my wireless earbuds it is... OK. I'm too lazy these days to pull out any of my high-quality wired headphones, and if somebody who used to care about sound quality enough to buy multiple pairs of HQ headphones can't be arsed, then the general public really is going to think everything is just fine.
I think there are a lot of different reasons all going on simultaneously.
Most human musicians have very little power; that's been going away for a long time, even since "canned music" "robots" pushed live bands out of cinemas a century ago: https://www.smithsonianmag.com/history/musicians-wage-war-ag...
Most popular music already feels, and to an extent is, fake. Not only because mere recording allows repeated takes until it's inhumanly "perfect".
When I played an MP3 of Britney Spears to my mum around the turn of the century, she thought it was a robot singing because of the autotune.
The Monkees were famously an attempt at a manufactured band whose members happened not to feel like playing that game and did it for real; Gorillaz is even more obviously manufactured. Parasocial relationships are inherently different from "real" relationships, but the performers have to pretend it's personal when they address a crowd or a camera.
Axis of Awesome demonstrated the similarity of most modern hits with their "4 Chords": https://youtu.be/oOlDewpCfZQ?feature=shared
Those with the power were, and possibly still are, the record labels. But the AIs are trained on the works of small musicians who can't afford the copyright cases or the political influence, and whose works are not under the umbrella of the labels that do have those resources but lack the right or the short-term motivation to intercede on their behalf. So the big labels themselves may lose the consumer market to free AI output, while professionals dismiss both the AI output and the labels' output as just different kinds of slop (or whatever the current insult du jour for AI is).
Agreed. To me, AI-generated images look horrible, and AI-generated audio is still somewhat gut-twisting but less painful. AI-generated code works for HTML/CSS+JS, but not that great for other stacks. AI-generated e-commerce reviews... on par with human reviews?
I'm starting to think that what AI might be replacing is the high end of consumption, not the low end of generation. Art has follower works that are often less historically significant than the genre-pioneering ones. Doesn't that seem like what AI is doing?
I find almost all popular music made in the past 20+ years quite terrible. This is not worse. For people who enjoy this chewing-gum stuff, which seems to be most of the population of Earth, this is fine. And as such, this will be all popular music in the future: upload your voice, pick a style, generate 13 songs, go on tour to make money.
> go on tour
Why go on tour if you can send an AI singer instead, and if you can't sing as well as it anyway?
They want to believe it is human; however, when the robots get good enough... Though that's maybe further away.
This might be because of dataset quality, since most of the high-quality content is in commercial music and sample libraries.
This. And the world isn't ready for that, including copyright laws, which must be radically changed in a way that doesn't harm innovation. Suno v4 has become a complete disaster for some genres, and that could be due to the lawsuit forcing them to retrain the model on non-copyrighted works, which in my opinion is pure bollocks. Imagine forcing an artist to unlearn what they listened to in their young years, which helped forge their personal style. Sorry, but I'm pessimistic. If we don't change how copyright works, pretty much every development in the field will be ruined by greedy copyright holders and their lawyers as soon as it shows any capability to produce decent music that barely resembles something else.
Shouldn't the author be able to decide whether his work may be used for generative AI?
> Imagine forcing an artist to unlearn
Mathematical models cannot learn. What happens in fact is that the owner of the generative AI takes a bunch of copyrighted works that took a lot of effort and money to produce (instruments, mics, and other equipment are super expensive), puts them into a computer, and sells whatever the computer has calculated from those recordings. Do you see any learning or any creativity here?
There were cases when Suno (or Udio) reproduced producer tags almost verbatim (though in lower quality), for example. This shows that the model was not simply calculating probabilities of patterns of pitches, durations, etc., but was storing the copyrighted content almost unmodified.
Also, personally I have no interest in a service that generates a song for you, because it takes away all the fun. Something that helps find mistakes in composed music and supports learning would be much more useful.
>Sorry, but I'm pessimistic. If we don't change how copyright works, pretty much every development in the field will be ruined by greedy copyright holders and their lawyers
Sorry, but I'm pessimistic. If we don't change how AI regulation works, pretty much every creative field will be ruined by greedy tech companies and their planet-burning plagiarism devices.
They say the models are under 3B parameters. For voice generation alone, that sounds pretty good, no?
I would love to see a model focusing on making virtual instruments.
There are sample-based virtual instruments, but they miss some subtleties, and there are physics-based ones where some subtleties are preserved but which generally sound worse, because actually modelling real hardware that evolved over centuries is really difficult.
Even hardware-based instruments like electric guitars, violins, and cellos generally sound distinct from, and less interesting than, their acoustic counterparts. Electric guitar players use various amplifier tricks to make up for that, and the amplifier has now become a big separate instrument in its own right. But I think the point stands.
I concur. So much focus on grandiose ideas when there is so much low-hanging fruit around.
The description is amazing, but the demo video feels underwhelming.
Available music generation models sound much more musical and have much better diction on vocals.
This might be due to the quality of the dataset, because Nvidia seems not to be using copyrighted commercial recordings (if I read their paper properly). It is difficult to compete with those who have used a larger, higher-quality dataset without permission.
If you use it for work, AI might be OK, but generating a guitar or piano track is zero fun compared to playing a real instrument (even if the AI track sounds better). I think we should not forget this part either.
But what about an AI guitar that automatically frets the strings properly if you don't press them hard enough? Or an AI piano which shifts the keyboard when it sees that you are about to hit the wrong key?
Many instruments require a lot of practice before you can produce an acceptable sound. Can AI help with this?
Not only do the instruments require practice to sound good (I've been playing electric bass for three years and am just beginning to sound better than bad), but a huge part of the process is learning to listen to the instrument and make adjustments. The beauty is that you can immediately hear the result of each adjustment. If it sounds better, you keep it. Otherwise you keep moving until you get closer to what you're looking for. With a prompt-based AI tool, it is not possible to make low-latency adjustments. Even if you could, how would you articulate the subtle adjustment to the LLM?
My sense is that, contrary to the marketing, AI tools will be most useful to people who already have musical skill, and will actively subvert musical development in most people who rely on them too early in their process.
Most audio and music AI has the wrong incentives and is moving in a different direction from what professionals need. Almost all publicized innovations in the sphere are complex one-stop-shop solutions that aim to completely replace as many members of the creative process as possible.
It's a corporate dream: a thing that spews barely passable, generic mush that's totally aligned with the demands of the decision makers, but has zero opinion, zero ambition, zero professional pride, and no need whatsoever to uphold ethical and aesthetic standards or its own reputation.
Instead of tools for creatives, we have systems that generate complete tracks from Tinder chat logs. On the other hand, there's still no publicly available audio style transfer with even remotely usable quality (that thing from Google is abysmal).
All I want for starters is something that turns the slightly distorted, over-reverberated, imperfectly intonated flute recording a client sends me into a clean, workable track. I'm not even asking for it to turn it into a koto or a marimba or whatever you think is a cool demonstration case!
Sorry for the rant, but it's all very frustrating and alarming.
Barely passable, indeed. And to think the MBAs are indeed going to fire staff and downsize contractors because of this. More money for them: is that the incentive here?
Is it being trained on noticeably compressed audio, or is it just outputting highly compressed audio? Can someone explain what the benefit of either would be, outside of specifically asking for the sound of audio compression artifacts? As others have pointed out, existing generative music services already output much higher-fidelity audio.
Does anyone know what melody this is at 2:07 in the video?
https://youtu.be/qj1Sp8He6e4?t=2m7s
Am I the only one who feels weird about the image they chose to illustrate the article? I'm not a professional in that field, but I would probably feel offended if coding assistants were presented with a monkey in front of a computer.
Calling a musician a "cool cat" has been a slang term of high praise for jazz musicians since at least the 50s.
It's low-quality AI slop. That's already offensively disrespectful to the readers on its own.
>I would probably feel offended if coding assistants were presented with a monkey in front of a computer
As a professional monkey in front of a computer, I feel offended.
Nah. Cats are cool. S/he has a smart cool look. Most people would like it. (Not everyone.)
Disclaimer: I prefer dogs; they are friendlier and even part of the family. But I have to admit that cats look cooler.
Uh? Apart from the fact that the symbolism of a monkey and a cat is entirely different, I imagined it was because gato/gatto means cat in Spanish/Italian.
Same in Greek! I was pleasantly surprised to see a cat; I think this was a nice touch.
Another day, another model made by engineers who think their technical prowess requires absolutely no understanding of the subtlety of human creativity.
Will they be releasing weights?
Please don't put headphones over a cat's head and especially don't play any loud music!
Mm.
The headphones aren't on the ears — a surprisingly common error even in pre-AI human-made cartoons.
But they are in close proximity.