"The increasingly distorted images produced by an artificial-intelligence model that is trained on data generated by a previous version of the model. Credit: M. Boháček & H. Farid/arXiv (CC BY 4.0)" - Nature (ext link)
Unfortunately, or fortunately, researchers have found that training AI large language models (LLMs) on AI-generated output rapidly leads to the generation of nonsense.
This matters because a great deal of AI-generated data is now being produced, and unless we find and apply some way of watermarking or labeling that data as AI-generated, results may become unusable or, worse, unrecognizable as unusable.
The problem is especially acute because human-generated data is becoming scarcer relative to the growing volume of AI-generated data.
"“The message is, we have to be very careful about what ends up in our training data,” says co-author Zakhar Shumaylov, an AI researcher at the University of Cambridge, UK. Otherwise, “things will always, provably, go wrong”, he says. The team used a mathematical analysis to show that the problem of model collapse is likely to be universal, affecting all sizes of language model that use uncurated data, as well as simple image generators and other types of AI...even before complete collapse, learning from AI-derived texts caused models to forget the information mentioned least frequently in their data sets as their outputs became more homogeneous...(this) is a concern when it comes to making AI models that represent all groups fairly, because low-probability events often relate to marginalized groups, says study co-author Ilia Shumailov, who worked on the project while at the University of Oxford, UK. How much synthetic data is used in training matters. When Shumailov and his team fine-tuned each model on 10% real data, alongside synthetic data, collapse occurred more slowly." - Nature
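To get a feel for why rare information disappears first, here is a toy sketch in Python. It is not the researchers' experiment: the "model" is just a table of token frequencies, and the names (zipf_distribution, train_generations, real_fraction) and all the numbers are my own assumptions for illustration. Each generation is fitted only to samples drawn from the previous one, optionally mixed with a fraction of fresh "real" data, and we count how many distinct tokens the model still knows about.

```python
import random
from collections import Counter


def zipf_distribution(vocab_size):
    """Toy 'real data' distribution: token i gets weight 1/(i+1), so most tokens are rare."""
    weights = [1.0 / (i + 1) for i in range(vocab_size)]
    total = sum(weights)
    return {token: w / total for token, w in enumerate(weights)}


def sample(dist, n, rng):
    """Draw n tokens from a {token: probability} table."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=n)


def train_generations(real_dist, generations, samples_per_gen, real_fraction, rng):
    """Refit a frequency table each generation.

    Returns the number of distinct tokens the model can still produce,
    starting with the original distribution and then after each generation.
    """
    model = dict(real_dist)
    support_sizes = [len(model)]
    n_real = round(samples_per_gen * real_fraction)
    n_synthetic = samples_per_gen - n_real
    for _ in range(generations):
        # Training data: mostly output of the previous model, plus some real data.
        data = sample(model, n_synthetic, rng) + sample(real_dist, n_real, rng)
        counts = Counter(data)
        # A token that was never sampled gets no entry, so the next model cannot emit it.
        model = {token: count / len(data) for token, count in counts.items()}
        support_sizes.append(len(model))
    return support_sizes


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
real = zipf_distribution(200)
synthetic_only = train_generations(real, 30, 500, 0.0, rng)
with_real_mix = train_generations(real, 30, 500, 0.1, rng)
print("distinct tokens left, synthetic only:", synthetic_only[-1])
print("distinct tokens left, 10% real data: ", with_real_mix[-1])
```

In the synthetic-only run, a token that drops out of one generation's sample can never come back, so the count of distinct tokens can only shrink or stay flat. Mixing in real data gives lost tokens a chance to reappear, which is at least in the spirit of the 10% result quoted above, though a frequency table is a very long way from a language model.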
So, it's time to get a handle on all this before we start seeing the phenomenon in the wild.
"You are what you eat" has meaning in preventive medicine and, it appears, in AI as well.
This was only a brief summary and I didn't talk about the experiments the researchers conducted. It's well worth reading the original article linked above.
Have a great weekend!
Photo credit reference: Bohacek, M. & Farid, H. Preprint at arXiv https://doi.org/10.48550/arXiv.2311.12202 (2023).