Artificial intelligence

Why computer-generated data is used to train AI models

Artificial intelligence companies are exploring a new path to obtain the massive amounts of data needed to develop powerful generative models: creating information from scratch.

Microsoft, OpenAI and Cohere are among the groups testing the use of so-called synthetic data, computer-generated information used to train their AI systems known as large language models (LLMs), as they reach the limits of the human-created data that can further improve the cutting-edge technology.

The launch of Microsoft-backed OpenAI’s ChatGPT last November has led to a flood of publicly deployed products this year from companies such as Google and Anthropic that can produce plausible text, images or code in response to simple prompts.

The technology, known as generative AI, has sparked renewed interest from investors and consumers, with the world’s biggest tech companies, including Google, Microsoft and Meta, racing to dominate the space.

Currently, the LLMs that power chatbots such as OpenAI’s ChatGPT and Google’s Bard are mostly trained by scraping the internet. The data used to train these systems includes digitized books, news articles, blogs, search queries, Twitter and Reddit posts, YouTube videos, and Flickr images, among other content.

Humans are then used to provide feedback and fill information gaps in a process known as reinforcement learning from human feedback (RLHF).

But as generative AI software becomes more sophisticated, even deep-pocketed AI companies are running out of easily accessible, high-quality data to train on. Meanwhile, they are under fire from regulators, artists and media organizations around the world over the volume and provenance of the personal data consumed by the technology.

At an event in London in May, OpenAI chief executive Sam Altman was asked whether he was worried about regulatory probes into ChatGPT’s potential privacy violations. Altman brushed it off, saying he was “pretty confident that soon all the data will be synthetic data.”

Generic data from the web is no longer good enough to improve the performance of AI models, developers say.

“If you could get all the data you need on the web, that would be fantastic,” said Aidan Gomez, chief executive of $2 billion start-up Cohere. “In reality, the web is so noisy and messy that it’s not really representative of the data you want. The web just doesn’t do everything we need.”

Currently, the most advanced models, such as OpenAI’s GPT-4, approach human-level performance in areas such as writing and coding, and are able to pass benchmarks such as the US bar exam.

To significantly improve their performance and be able to tackle scientific, medical or business challenges, AI models will require unique and sophisticated data sets. These will either need to be created by world experts such as scientists, doctors, authors, actors or engineers, or acquired as proprietary data from large corporations such as pharmaceutical companies, banks and retailers. However, “man-made data . . . is extremely expensive,” Gomez said.

The new trend of using synthetic data avoids this costly requirement. Instead, companies can use AI models to produce more complex text, code, or information related to healthcare or financial fraud. This synthetic data is then used to train advanced LLMs so that they become increasingly capable.

According to Gomez, Cohere as well as several of its competitors already use synthetic data that is then refined and modified by humans. “[Synthetic data] is already huge. . . even if it’s not widely publicized,” he said.

For example, to train a model on advanced math, Cohere can use two AI models talking to each other, one acting as a math tutor and the other as a student.

“They are having a conversation about trigonometry . . . and everything is synthetic,” Gomez said. “Everything is just imagined by the model. And then the human watches that conversation and comes in and corrects it if the model said something wrong. It is the status quo today.”

Two recent studies from Microsoft Research have shown that synthetic data can be used to train models that are smaller and simpler than state-of-the-art software such as OpenAI’s GPT-4 or Google’s PaLM-2.

One paper described a synthetic data set of short stories generated by GPT-4, which contained only words that a typical four-year-old child could understand. This data set, called TinyStories, was then used to train a simple LLM capable of producing fluent and grammatically correct stories. The other paper showed that AI could be trained on synthetic Python code in the form of textbooks and exercises, and the researchers said the resulting model performed relatively well on coding tasks.

Start-ups such as Scale AI and Gretel have sprung up to provide synthetic data as a service. Gretel, set up by former US intelligence analysts from the National Security Agency and the CIA, is working with companies such as Google, HSBC, Riot Games and Illumina to augment their existing data with synthetic versions that can help train better AI models.

The key feature of synthetic data, according to Gretel chief executive Ali Golshan, is that it preserves the privacy of all individuals in a data set while maintaining its statistical integrity.

Well-designed synthetic data can also eliminate biases and imbalances in existing data, he added. “Hedge funds can look at black swan events and, for example, create a hundred variants to see if our models crack,” Golshan said. For banks, where fraud typically accounts for less than a hundredth of a percent of total data, Gretel’s software can generate “thousands of fraud edge scenarios and train [AI] models with it.”

Critics point out that not all synthetic data will be carefully curated to reflect or improve on real-world data. As AI-generated text and images begin to fill the internet, it is likely that AI companies crawling the web for training data will inevitably end up using raw data produced by primitive versions of their own models, a phenomenon known as “dog-fooding”.

Research from universities such as Oxford and Cambridge has recently warned that training AI models on their own raw outputs, which may contain falsehoods or fabrications, could corrupt and degrade the technology over time, causing “irreversible flaws”.

Golshan agrees that training on poor synthetic data could hamper progress. “Content on the web is increasingly AI-generated, and I think that will lead to degradation over time [because] LLMs produce regurgitated knowledge, without any new perspectives,” he said.

Despite these risks, AI researchers such as Cohere’s Gomez have said synthetic data has the potential to accelerate the path to super-intelligent AI systems.

“What you really want are models that can learn on their own. You want them to be able to . . . ask their own questions, discover new truths and create their own knowledge,” he said. “It’s the dream.”
