Ramblings of an old Doc
AI models fed AI-generated data quickly spew nonsense
Published on July 27, 2024 By DrJBHL In Artificial Intelligence

"The increasingly distorted images produced by an artificial-intelligence model that is trained on data generated by a previous version of the model. Credit: M. Boháček & H. Farid/arXiv (CC BY 4.0)" - Nature

Unfortunately, or fortunately, researchers have found that feeding AI-generated output back into large language models (LLMs) rapidly leads to the generation of nonsense.

This is rather important: tons of AI-generated data are being produced, and unless we find and apply some sort of watermarking or branding to mark that data as AI-generated, results may become unusable, or worse, unrecognizable as unusable.

The problem turns out to be especially acute because human-generated data is becoming scarcer relative to the growing supply of AI-generated data.

"The message is, we have to be very careful about what ends up in our training data," says co-author Zakhar Shumaylov, an AI researcher at the University of Cambridge, UK. Otherwise, "things will always, provably, go wrong," he says.

"The team used a mathematical analysis to show that the problem of model collapse is likely to be universal, affecting all sizes of language model that use uncurated data, as well as simple image generators and other types of AI... Even before complete collapse, learning from AI-derived texts caused models to forget the information mentioned least frequently in their data sets as their outputs became more homogeneous... [This] is a concern when it comes to making AI models that represent all groups fairly, because low-probability events often relate to marginalized groups, says study co-author Ilia Shumailov, who worked on the project while at the University of Oxford, UK.

"How much synthetic data is used in training matters. When Shumailov and his team fine-tuned each model on 10% real data, alongside synthetic data, collapse occurred more slowly." - Nature
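The collapse dynamic described above can be illustrated with a toy experiment (my own sketch, not the researchers' code — the function name, parameters, and Gaussian setup here are all illustrative assumptions): each generation fits a simple Gaussian model to samples drawn from the previous generation's fit. Rare tail values are under-sampled every round, so the fitted spread drifts toward zero and the "model" forgets low-probability events. Mixing in a fraction of real data, like the 10% the study mentions, slows the collapse.

```python
# Toy sketch of model collapse: fit a Gaussian to samples from the
# previous generation's Gaussian, over and over. All names/parameters
# are illustrative, not from the study.
import random
import statistics

def collapse(generations=2000, n_samples=20, real_fraction=0.0, seed=0):
    """Return the fitted standard deviation after each generation."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the "real" data distribution
    history = [sigma]
    n_real = int(n_samples * real_fraction)
    for _ in range(generations):
        # Optionally anchor training with a slice of real data...
        data = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        # ...the rest is synthetic output of the previous generation.
        data += [rng.gauss(mu, sigma) for _ in range(n_samples - n_real)]
        mu = statistics.fmean(data)        # refit on the mixed sample
        sigma = statistics.pstdev(data, mu)
        history.append(sigma)
    return history

pure_synthetic = collapse(real_fraction=0.0)
with_real_data = collapse(real_fraction=0.1)
print(f"all-synthetic final sigma: {pure_synthetic[-1]:.2e}")
print(f"10% real data final sigma: {with_real_data[-1]:.2e}")
```

Run it and the all-synthetic spread shrinks toward zero, while the run anchored with 10% real data holds a much larger spread — a crude analogue of the slower collapse the study reports.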

So, it's time to get a handle on all this before we start seeing the phenomenon in the wild.

"You are what you eat" has meaning in preventive medicine and, it appears, in AI as well.

This was only a brief summary; I didn't discuss the experiments the researchers conducted. The original article linked above is well worth reading.

Have a great weekend!


Photo credit: Boháček, M. & Farid, H. Preprint at arXiv https://doi.org/10.48550/arXiv.2311.12202 (2023).

Comments
on Jul 27, 2024

I read the article, and this is very concerning. This seems to be similar to AI hallucinations with text, only with images. There are still so many unknowns with AI, and we have pushed it so quickly into so many parts of our lives. These phenomena have the potential to destroy the use of the internet for research. We won't be able to tell fact from fiction (or distorted fact). I agree with you that this needs to be resolved quickly, before we destroy our ability to trust anything on the internet.

on Jul 27, 2024

pelaird

I read the article, and this is very concerning. This seems to be similar to AI hallucinations with text, only with images. There are still so many unknowns with AI, and we have pushed it so quickly into so many parts of our lives. These phenomena have the potential to destroy the use of the internet for research. We won't be able to tell fact from fiction (or distorted fact). I agree with you that this needs to be resolved quickly, before we destroy our ability to trust anything on the internet.

Precisely. The only solution I see as practical is watermarking; however, it's very clear that intelligence agencies and corporations using AI-generated data will be poisoning the very well from which plans against those entities could be drawn. We will not be able to tell real, human-generated data from "deep fake" images and "deep fake" data. I think it's plausible that this is already going on.

on Jul 27, 2024

Yup. They've run me through enough until I look pretty much like column 3.

on Jul 27, 2024

We're on the fast track to idiocracy!

on Jul 27, 2024

pelaird

We're on the fast track to idiocracy!

Are you sure we aren't very close to it, already?

on Jul 27, 2024

Interesting, thanks Doc.  For some reason it brings to mind the phrase "echo chamber".

on Jul 27, 2024

DaveRI

Interesting, thanks Doc.  For some reason it brings to mind the phrase "echo chamber".

A pleasure, Dave.

on Jul 27, 2024

As long as we are talking AI, I am wondering if it is sustainable. In my opinion, AI has become the new corporate status symbol. Large corporations are scrambling to be the biggest and baddest AI provider in the marketplace. With the costs involved to build and maintain AI (including the power requirements), will it ever be profitable? The largest players are struggling to figure out how to monetize this new service. So far their attempts seem to be failing. The big question is how long can they continue to subsidize AI before it becomes profitable?

I'm interested in any opinions others have about this subject.

on Jul 27, 2024

pelaird

As long as we are talking AI, I am wondering if it is sustainable. In my opinion, AI has become the new corporate status symbol. Large corporations are scrambling to be the biggest and baddest AI provider in the marketplace.

I'd only suggest they beware what they wish for, lest they receive it.   

Nvidia certainly hasn't lost a penny, and the big players aren't going to lose betting on human laziness. The end result will most likely be tragic, as privacy will completely disappear; competition among AI-augmented search engines will cost the current big players (Google) their monopolies, and the results may even end up being better. I don't think the "art"/image-creation tools will lose either, as they are addicting more and more folks who'll pay their price. Adobe? Hard to imagine them losing money, but high-quality alternative software already exists, and competition won't make things more expensive. Phone and computer companies will sell AI-enabled machines, and folks will buy, especially as AI increases efficiency and productivity.

Hard to predict where things will go without asking AI. 

on Jul 28, 2024

I just saw an article on TechSpot with the subtitle "Who's Profiting from AI (besides Nvidia)?" This should probably include TSMC.

The article is about the projected costs of their Blackwell server cabinets. "Nvidia's GB200 NVL36 server rack system will cost $1.8 million, and the NVL72 will be $3 million."

...wishing I had bought stock in Nvidia a couple of years ago!


on Jul 28, 2024

pelaird

...wishing I had bought stock in Nvidia a couple of years ago!

Me too!

on Jul 28, 2024

Double plus ungood doublethink.....