Experts warn AI is running out of training data
Artificial intelligence systems like ChatGPT may soon run out of data for training. A recent paper from the Epoch AI research group predicts this could happen as early as 2026. If it does, it could severely hamper AI development worldwide and limit the capabilities of AI tools.
You’ve probably noticed that more of our devices and programs now integrate artificial intelligence. That is why many companies and corporations have declared AI will inevitably transform lives worldwide. However, we must address the issues that could limit its capabilities; otherwise, these tools may harm rather than help us.
This article will discuss why scientists predict AI data will run out in roughly two years. Later, I will explain how artificial intelligence works to illustrate the gravity of this issue.
Why would we run out of AI data?
The Conversation first reported on this issue on November 7, 2023. It said that training high-quality, accurate, and powerful AI algorithms requires enormous amounts of data.
For example, the news website said ChatGPT trained on 570 gigabytes of text data, or roughly 300 billion words. Programs trained on insufficient data would likely produce low-quality, inaccurate outputs.
The quality of training data also matters. Subpar data, such as social media posts, isn’t enough to create advanced AI like ChatGPT. Meanwhile, OpenAI, Anthropic, and other tech firms are developing ever more sophisticated AI programs.
That means they’re consuming more data than ever before, which may exhaust the supply by 2026. Researchers also say we may run out of low-quality language data between 2030 and 2050.
We might run out of low-quality image data between 2030 and 2060. That is bad news for AI image generators like DALL-E and Stable Diffusion.
Artificial intelligence may add $15.7 trillion to the global economy by 2030. However, exhausting usable AI data could delay development. Nevertheless, The Conversation says the situation may not be as bad as it seems.
We do not know how tech firms will develop future AI models. Perhaps they will design them in ways that address the risk of data shortages.
For example, AI developers may improve algorithms to extract more value from existing data. Sharon Zhou, CEO of Lamini, a startup that helps developers build large language models, suggested that ChatGPT may have changed its underlying AI system.
Specifically, she said OpenAI might be using an approach called a “Mixture of Experts,” or MoE. In this design, smaller expert models each specialize in a subject area, and the system may merge results from two or more experts for complex requests.
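To make the idea concrete, here is a minimal sketch of how a Mixture-of-Experts router works. Everything here is invented for illustration (the keyword-based gate, the expert names, and the string outputs); it is not OpenAI’s actual implementation, which routes between large neural networks rather than keyword rules.

```python
# Toy Mixture-of-Experts (MoE) sketch: a gate picks experts, then merges their outputs.
# All names and logic are illustrative, not a real production system.
from typing import Callable, Dict, List

# Hypothetical "expert" models, each specializing in one subject area.
EXPERTS: Dict[str, Callable[[str], str]] = {
    "biology": lambda prompt: f"[biology expert] answer to: {prompt}",
    "math": lambda prompt: f"[math expert] answer to: {prompt}",
}

def gate(prompt: str) -> List[str]:
    """Toy gating function: choose which experts should handle the prompt."""
    keywords = {"biology": ["animal", "cell"], "math": ["sum", "integral"]}
    chosen = [name for name, words in keywords.items()
              if any(w in prompt.lower() for w in words)]
    return chosen or ["math"]  # fall back to a default expert

def moe_answer(prompt: str) -> str:
    """Route the prompt to the selected experts and merge their results."""
    return " | ".join(EXPERTS[name](prompt) for name in gate(prompt))

# A complex request that touches two subjects activates both experts.
print(moe_answer("What is the sum of an animal population model?"))
```

The key point is that only the relevant experts run for each request, which is one way developers could squeeze more capability out of the same training data.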
How do AI systems work?
Understanding how modern artificial intelligence models work helps illustrate why training data matters so much. ChatGPT and similar tools rely on algorithms and embeddings.
Algorithms are rules computers follow to execute tasks. Meanwhile, Microsoft defines embeddings as “a special format of data representation that can be easily utilized by machine learning models and algorithms. The embedding is an information-dense representation of the semantic meaning of a piece of text.”
ChatGPT is arguably the most famous AI chatbot at the time of writing, so I will use it to explain embeddings and large language models. The latter contain vast numbers of words classified into many categories.
For example, an LLM may contain the words “penguin” and “polar bear.” Both would belong under a “snow animals” group, but the former is a “bird,” and the latter is a “mammal.”
Enter those words into ChatGPT, and the embeddings guide how its algorithms form results. Here are their most common functions:
- Search: Embeddings rank queries by relevance.
- Clustering: Embeddings group text strings by similarity.
- Recommendations: OpenAI embeddings recommend related text strings.
- Anomaly detection: Embeddings identify words with minimal relatedness.
- Diversity measurement: Embeddings analyze how similarities spread among multiple words.
- Classification: OpenAI embeddings classify text strings by their most similar label.
These features can make AI bots seem cold and robotic, but recent findings suggest they can show more emotional awareness than people. Zohar Elyoseph and his colleagues had human volunteers and ChatGPT describe emotional scenarios, then graded the responses with the Levels of Emotional Awareness Scale.
Humans scored Z-scores of 2.84 and 4.26 in the two consecutive trials. On the other hand, ChatGPT earned a 9.7, significantly higher than the volunteers’ scores.
Conclusion
Researchers discovered we may run out of high-quality data for AI training by 2026. As a result, future AI development may significantly slow down.
Fortunately, the researchers say AI developers will likely adapt to this growing issue by adjusting their methods. For example, they may create new algorithms that use existing data more efficiently.
Learn more about this AI data study on its arXiv webpage. Moreover, learn more about the latest digital tips and trends at Inquirer Tech.