Nvidia Fugatto: "World's Most Flexible Sound Machine"

(blogs.nvidia.com)

63 points | by microsoftedging 15 hours ago

42 comments

  • ahofmann 11 hours ago

    While this might be a technical breakthrough, none of the examples sounded any good. Every aspect of the provided sounds is bad. The music sounds muffled and badly mixed. The generated beat isn't a beat that grooves, or has anything interesting in it. The barking saxophone just sounded bad. The voices sounded only somewhat convincing.

    In general, I think AI-generated audio makes it much more noticeable how utterly bad everything AI generates is. I already absolutely hate the two AI voices that appear in a lot of YouTube videos; they're a reason for me to close the video immediately most of the time.

    • leopoldj 5 hours ago

      While I agree with you, this is the release of a research paper [1] and some accompanying demos on GitHub [2]. It is not a finished product fine-tuned for high-quality output.

      [1] https://d1qx31qr3h6wln.cloudfront.net/publications/FUGATTO.p...

      [2] https://fugatto.github.io/

    • RobinL 10 hours ago

      With apologies for the X link, here is an example from Suno which felt very musical to me: https://x.com/sunomusic/status/1857501332560818342

      Here's another example on the Suno website: https://suno.com/song/fc991b95-e4e9-4c8f-87e8-e5e4560755e7

      • ben_w 9 hours ago

        There are things I like from Suno, but, having used it to make quite a lot, I also get vibes of something subtly wrong that I can't put my finger on, which I assume is somewhere between the audio version of bad kerning and Cronenberg fingers. Too many examples of vocoder/autotune in the training set, perhaps?

        That said, I mostly prefer AI over "real" human-made recordings (pop, classical, metal, bardcore, whatever) because I tend to learn the patterns too fast to enjoy, or really even tolerate, any recording more than about 3 times* — I assume I'd like live jazz for longer, but have only been to one place that ever had it so I don't know if it breaks that pattern.

        * sole exception: TV theme tunes, though the point of them isn't to listen to them

      • ahofmann 9 hours ago

        Suno is by far the best generated AI music I've heard. That said, it is hot garbage.

        I've listened to both songs on my Bose QC Ultra headphones, which are far from perfect. But even on them, the female voice has unbearable resonances in the higher frequencies. The male voice sounds mostly ok, but also has something that sounds like compression artifacts (as in mp3 compression, not loudness compression). All the instruments in these songs have these problems. They sound somewhat like the real thing, but really badly recorded. Also, the mixing isn't any good.

        It is still very impressive that AI can generate this. But if I recorded my band and someone delivered such a mix, I would fire them immediately. Heck, I would be furious that they fucked up so badly and would try to get my money back.

        So the two links you provided just confirm what I said.

        • CraftingLinks 2 hours ago

          I use Suno the way a producer in a music studio hires musicians to bring ideas to life. I wish more features in Suno would empower music producers. I sample pieces, re-mix doodles, get ideas for continuing my tracks... I can see the future, and as an amateur, it's just liberating and a lot of fun.

        • snapcaster 8 hours ago

          Really interesting. I haven't listened to their output with high-quality speakers or anything like that. Do poorly made human recordings have this problem, or is this currently a signal of AI generation?

      • numpad0 10 hours ago

        I don't find any problems whatsoever in that audio, but I'm not an avid music listener, so out of intuition I'm guessing the same underlying issue as in image generation is happening here: AI makes technically horrible, rage-inducing filler that lacks high-level semantic structure, but average people have no words or experience to assess and describe what's going on.

        • ahofmann 9 hours ago

          > I don't find any problems whatsoever in that audio

          I think this is why there is no real, powerful protest against all this generated stuff. Only the people who care are able to articulate what's wrong with it. To me, all AI-generated content sounds horrible. To almost everyone else, it sounds ok. So we will see and hear more of this generated stuff. We are in the middle of the enshittification of all consumable media.

          • com2kid 2 hours ago

            People were happy with the included wired iPhone earbuds for years, even though they were terrible.

            Listening on a laptop speaker, Suno sounds fine. Listening on my wireless earbuds, it is... ok. I'm too lazy these days to pull out any of my high-quality wired headphones, and if somebody who used to care about sound quality enough to buy multiple HQ headphones can't be arsed, then the general public really is going to think everything is just fine.

          • ben_w 8 hours ago

            I think there are a lot of different reasons all going on simultaneously.

            Most human musicians have very little power; that's been going away for a long time, ever since "canned music" "robots" pushed live bands out of cinemas a century ago: https://www.smithsonianmag.com/history/musicians-wage-war-ag...

            Most popular music already feels, and to an extent is, fake. Not only because mere recording allows repeated takes until it's inhumanly "perfect".

            When I played an MP3 of Britney Spears to my mum around the turn of the century, she thought it was a robot singing because of the autotune.

            The Monkees were famously an attempt at a manufactured band whose members just happened not to feel like playing that game and did it for real; Gorillaz is even more obviously manufactured. Parasocial relationships are inherently different from "real" relationships, but the performers have to pretend it's personal when they address a crowd or a camera.

            Axis of Awesome demonstrated the similarity of most modern hits with their "4 Chords": https://youtu.be/oOlDewpCfZQ?feature=shared

            Those with the power were, and possibly still are, the record labels. But suppose the AI is trained on the works of small musicians who can't afford the copyright cases or the political influence, and whose works are not under the umbrella of the labels who do have those resources (but neither the right nor the short-term motivation to intercede on their behalf). Then the big labels themselves may lose the consumer market to free AI output, while professionals will dismiss both the AI output and the labels' output as "just different kinds of slop, but both slop" (or whatever the current insult du jour is for AI).

          • numpad0 8 hours ago

            Agreed. To me, AI-generated images look horrible, and AI-generated audio is still somewhat gut-twisting, but less painful. AI-generated code works for HTML/CSS+JS, but isn't that great for other things. AI-generated e-commerce reviews ... on par with human reviews?

            I'm starting to think that what AI might be replacing is the high end of consumption, not the low end of generation. Art has followers whose work is often less historically significant than the genre-pioneering pieces. Doesn't that seem like what AI is doing?

          • anonzzzies 9 hours ago

            I find almost all popular music made in the past 20+ years quite terrible. This is not worse. For people who enjoy this chewing-gum stuff, which seems to be most of the population of Earth, this is fine. And as such, this will be all popular music in the future: upload your voice, pick a style, generate 13 songs, go on tour to make money.

            • codedokode 7 hours ago

              > go on tour

              Why go on tour if you can send an AI singer instead, and if you cannot sing as well as it anyway?

              • anonzzzies 2 hours ago

                They want to believe it is human; however, when the robots get good enough... Though maybe that's further away.

    • codedokode 8 hours ago

      This might be because of dataset quality, since most high-quality content is in commercial music and sample libraries.

      • squarefoot 7 hours ago

        This. And the world isn't ready for that, including copyright laws, which must be radically changed in a way that doesn't harm innovation. Suno v4 has become a complete disaster for some genres, and that could be due to the lawsuit forcing them to retrain the model on non-copyrighted works, which in my opinion is pure bollocks. Imagine forcing an artist to unlearn what they listened to in their youth, the music that forged their personal style. Sorry, but I'm pessimistic. If we don't change how copyright works, pretty much every development in the field will be ruined by greedy copyright holders and their lawyers as soon as it shows any capability to produce decent music that barely resembles something else.

        • codedokode 6 hours ago

          Shouldn't the author be able to decide whether their work may be used for generative AI?

          > Imagine forcing an artist to unlearn

          Mathematical models cannot learn. What happens in fact is that the owner of the generative AI takes a bunch of copyrighted works that took a lot of effort and money to produce (instruments, mics, and other equipment are super expensive), puts them into a computer, and sells whatever the computer has calculated from those recordings. Do you see any learning or any creativity here?

          There were cases when Suno (or Udio) reproduced producer tags almost verbatim (but in lower quality), for example. This shows that the model was not simply calculating probabilities of patterns of pitches, durations, etc., but was storing the copyrighted content almost unmodified.

          Also, I personally have no interest in a service that generates a song for you, because it takes away all the fun. Something that helps find mistakes in composed music and aids learning would be much more useful.

        • Arainach 5 hours ago

          >Sorry, but I'm pessimistic. If we don't change how copyright works, pretty much every development in the field will be ruined by greedy copyright holders and their lawyers

          Sorry, but I'm pessimistic. If we don't change how AI regulation works, pretty much every creative field will be ruined by greedy tech companies and their planet-burning plagiarism devices.

  • olup 21 minutes ago

    They say the models are under 3B parameters. For voice generation alone, it sounds pretty good, no?

  • olau 10 hours ago

    I would love to see a model focusing on making virtual instruments.

    There are sample-based virtual instruments, but they miss some subtleties, and there are physics-based ones where some subtleties are preserved but which generally sound worse, because actually modelling real hardware that has evolved over centuries is really difficult.

    Even hardware-based instruments like electric guitars/violins/cellos etc. generally sound distinct from, and less interesting than, their acoustic counterparts. Electric guitar players seem to use various amplifier tricks to make up for that, and that has become a big separate instrument in its own right. But I think the point stands.

    • ulbu 8 hours ago

      I concur. So much focus on grandiose ideas when there is so much low-hanging fruit around.

  • SonOfLilit 13 hours ago

    The description is amazing, but the demo video feels underwhelming.

    Available music generation models sound much more musical and have much better diction on vocals.

    • codedokode 7 hours ago

      This might be due to the quality of the dataset, because Nvidia seems not to be using copyrighted commercial recordings (if I read their paper correctly). It is difficult to compete with those who have used a larger, higher-quality dataset without permission.

  • codedokode 8 hours ago

    If you use it for work, AI might be ok, but generating a guitar or piano track is zero fun compared to playing a real instrument (even if the AI track sounds better). I think we should not forget this part either.

    But what about an AI guitar that automatically frets the strings properly if you don't press them hard enough? Or an AI piano which shifts the keyboard when it sees that you are about to hit the wrong key?

    Many instruments require a lot of practice before you can produce an acceptable sound. Can AI help with this?

    • norir 4 hours ago

      Not only do the instruments require practice to sound good (I've been playing electric bass for three years and am just beginning to sound better than bad), but a huge part of the process is learning to listen to the instrument and make adjustments. The beauty is that you can immediately hear the result of the adjustment. If it sounds better, you keep it. Otherwise you move until you get closer to what you're looking for. With a prompt-based AI tool, it is not possible to make low-latency adjustments. Even if you could, how would you articulate the subtle adjustment to the LLM?

      My sense is that, contrary to the marketing, AI tools will be most useful to people who already have musical skill, and will actively subvert musical development in most people who rely on them too early in their process.

  • ZoomZoomZoom 9 hours ago

    Most audio and music AI has the wrong incentives and is moving in a different direction from what professionals need. Almost all publicized innovations in the sphere are complex one-stop-shop solutions that aim to completely replace as many members of the creative process as possible.

    It's a corporate dream: a thing that spews barely passable, generic mush totally aligned with the demands of the decision makers, but with zero opinion, zero ambition, zero professional pride, and no need whatsoever to uphold ethical or aesthetic standards or its own reputation.

    Instead of tools for the creatives we have systems that generate complete tracks from tinder chat logs. On the other hand, there's still no publicly available audio style transfer with even remotely usable quality (that thing from Google is abysmal).

    All I want, for starters, is something that turns the slightly distorted, over-reverberated, imperfectly intonated flute recording a client sends me into a clean, workable track. I'm not even asking for it to turn the flute into a koto or marimba or whatever you think makes a cool demonstration case!

    Sorry for the rant, but it's all very frustrating and alarming.

    • tgv 8 hours ago

      Barely passable, indeed. And then imagine the MBAs really are going to fire staff and downsize contractors because of this. More money for them: is that the incentive here?

  • ecocentrik 8 hours ago

    Is it being trained on noticeably compressed audio, or is it just outputting highly compressed audio? Can someone explain what the benefit of either would be, outside of specifically asking for the sound of audio compression artifacts? As others have pointed out, existing generative music services already output much higher-fidelity audio.

  • SushiHippie 12 hours ago

    Does anyone know what melody this is at 2:07 in the video?

    https://youtu.be/qj1Sp8He6e4?t=2m7s

  • pil0u 12 hours ago

    Am I the only one who feels weird about the image they chose to illustrate the article? I'm not a professional in that field, but I would probably feel offended if coding assistants were presented with a monkey in front of a computer.

    • dagw 9 hours ago

      Calling a musician a "cool cat" has been a slang term of high praise for jazz musicians since at least the 50s.

    • gloflo 12 hours ago

      It's low-quality AI slop. That's already offensively disrespectful to the readers on its own.

    • Cumpiler69 10 hours ago

      >I would probably feel offended if coding assistants were presented with a monkey in front of a computer

      As a professional monkey in front of a computer, I feel offended.

    • gus_massa 9 hours ago

      Nah. Cats are cool. S/he has a smart, cool look. Most people would like it. (Not everyone.)

      Disclaimer: I prefer dogs; they are friendlier and even part of the family, but I have to admit that cats look cooler.

    • throw310822 12 hours ago

      Huh? Apart from the fact that the symbolism of a monkey and a cat is entirely different, I imagined it was because gato/gatto means cat in Spanish/Italian.

      • Klaster_1 11 hours ago

        Same in Greek! I was pleasantly surprised to see a cat; I think it was a nice touch.

  • camillomiller 12 hours ago

    Another day, another model made by engineers who think their technical prowess requires absolutely no understanding of the subtlety of human creativity.

  • popalchemist 14 hours ago

    Will they be releasing weights?

  • varispeed 8 hours ago

    Please don't put headphones over a cat's head and especially don't play any loud music!

    • ben_w 5 hours ago

      Mm.

      The headphones aren't on the ears — a surprisingly common error even in pre-AI human-made cartoons.

      • varispeed an hour ago

        But they are in close proximity.