Captain Suave
Linking this again.
TL;DR - 70 years of AI research has demonstrated that trying to make AI 'smarter' via clever tricks and cognitive hacks produced effectively zero results. People are under the mistaken impression that this new batch of AI is new tech, but deep learning is actually a return to a very old approach that just wasn't feasible at the time. "The Bitter Lesson" basically says that the only way to make anything resembling 'smart AI' is to throw a ton of data and compute at the problem--aka there are no shortcuts to smart. The reason the latest round of models works as well as it does is that the labs collected basically all the coherent data they could for the domains they cared about, then shoved it through a tremendous amount of compute to distill it.
To build models twice as smart, they'll likely need ~20x the data and ~20x the compute. The compute is 'easy' so long as we keep building chip fabs and avoid a war over Taiwan. But it's very unlikely that 20x the high-quality data exists, unless humans keep producing new, high-quality data.
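For what it's worth, the ~20x figure falls out of the usual power-law scaling picture, if you take "twice as smart" to mean halving the loss. Here's a toy calculation; the exponent alpha is a hypothetical round number in the ballpark of published scaling-law fits, not a measured value:

```python
# Toy illustration of the power-law intuition behind the "~20x" estimate.
# Assume loss falls as a power law in compute: loss ~ C**(-alpha).
alpha = 0.23  # hypothetical exponent, roughly scaling-law-fit territory

# To cut loss in half, compute must grow by 2**(1/alpha).
factor = 2 ** (1 / alpha)
print(f"Compute multiplier to halve loss: {factor:.1f}x")
```

With alpha around 0.23 that multiplier comes out near 20x, which is where numbers like this tend to come from. Nudge alpha and the multiplier swings a lot, so treat the 20x as an order-of-magnitude guess, not a forecast.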
Do you see the trap here? As the internet fills up with mediocre AI slop, it becomes harder to build future, smarter models with this method, and there's a high probability of a doom loop of mediocrity as the models choke on their own output. And 70 years of research has shown that none of the shortcut methods work.
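The doom loop can be sketched with a toy distribution. Each "generation" trains on the previous model's most-likely outputs, modeled here as mildly sharpening the distribution and renormalizing; the starting distribution and the sharpening factor beta are arbitrary illustrative choices, not anything measured:

```python
# Toy sketch of distributional collapse: each generation trains on the
# previous model's outputs, modeled as raising the distribution to a
# power beta > 1 (sharpening) and renormalizing.
probs = [0.4, 0.3, 0.2, 0.1]  # hypothetical word distribution
beta = 1.5                     # illustrative sharpening per generation

for gen in range(10):
    raw = [p ** beta for p in probs]
    total = sum(raw)
    probs = [p / total for p in raw]

# After ten generations, nearly all probability mass sits on the mode:
# diversity is gone even though every step looked like a small change.
print([round(p, 6) for p in probs])
```

The exponents compound (1.5^10 is about 58), so a gentle per-generation bias snowballs into total collapse. Real model-on-model training is messier than this, but the compounding is the point.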
While the math is old, the engineering is new, because we haven't had the computing power or the data scale to execute at this level until very recently. As for whether it represents any kind of actual intelligence or pegs to some metric of "smartness", I don't think that's relevant to whether the output voice quality will be good enough for video games. I'm not talking about progress toward true general AI; I'm saying that these more limited tools will continue to improve in output quality.
The reason these models use vast quantities of data is that most of it SUCKS, because it comes from the random cesspool of the internet. The signal-to-noise ratio is laughable. Curating a purpose-built, high-quality dataset that is still large enough for training on a narrow task like voice replication would take a lot of human time, and that's why no one has done it. But someone will do it eventually, and the resulting model will be better.
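The curation pass itself is conceptually trivial; the expensive part is the quality signal. Purely as a hypothetical sketch, where `score_quality` is a throwaway heuristic standing in for the real (human or automated) review nobody has wanted to pay for:

```python
# Hypothetical curation step: keep only samples whose quality score
# clears a threshold. score_quality and the threshold are placeholders,
# not a real quality model.
def score_quality(sample: str) -> float:
    # Stand-in heuristic: longer, punctuation-terminated lines score higher.
    length_score = min(len(sample) / 50, 1.0)
    ends_clean = 1.0 if sample.endswith(('.', '!', '?')) else 0.5
    return length_score * ends_clean

def curate(samples: list[str], threshold: float = 0.7) -> list[str]:
    return [s for s in samples if score_quality(s) >= threshold]

raw = [
    "lol",
    "A clean, well-recorded sentence spoken at a steady pace.",
    "random forum noise asdkjh",
]
print(curate(raw))  # only the clean sentence survives the filter
```

Swap the heuristic for an actual quality rater and you have the shape of the pipeline; the catch, as above, is that building a rater good enough to matter is where all the human time goes.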