What is a “corpus”? And why is everyone in AI suddenly talking about it? Here’s what you need to know

 

By Michael Grothaus

Thanks to ChatGPT and similar platforms, the rise of artificial intelligence has been one of the most headline-grabbing subjects of 2023. Not a day goes by without a new article coming out about some way AI tech spells either doom or salvation for the creative fields, your job, or humanity.

And if you’ve been reading these articles, you might have noticed one particular word being thrown around by tech executives recently: “corpus.” Reddit’s CEO has mentioned it; so has Wikipedia’s founder Jimmy Wales; and so has Microsoft founder Bill Gates.

Here’s what it means, and why it’s critical to understanding how artificial intelligence platforms like ChatGPT and Midjourney operate.

What is an AI corpus?

Those who studied Latin in school will immediately know that corpus means “body.” (The modern word for a dead body—“corpse”—is derived from corpus.) Others might recognize the word corpus because of its use in a legal mechanism still in place today: habeas corpus. This phrase literally means “you should have the body” and it ensures that anyone arrested has the right to appear before a judge (thus, the judge “has the body” of the person arrested) to determine if that arrest is lawful.

But when used in the artificial intelligence realm, the term “corpus” doesn’t refer to a physical body at all. Instead, it refers to the metaphorical “body,” or collection, of data that was used to train the AI. This corpus is the material the AI reviews to become intelligent in whatever it was designed for.

Every AI’s corpus will be different, because it is humans who decide what kind of data they want to train an AI on. And the corpus the humans decide to train the AI on will depend on what they want the AI to be proficient in.

Types of corpora

There is no limit to the types of corpora (the plural of corpus) that can exist. What makes up an AI’s corpus simply depends upon what the human creator of the AI intends for it to do.

Take Midjourney, for example. Midjourney is a popular generative art platform for creating images with AI. Since Midjourney lets a user create images using nothing but text prompts, its AI needed to be trained on both a series of images and associated text descriptions. For example, in order for Midjourney to generate an image of a waterfall, its corpus must have included images of waterfalls and the accompanying text that labeled a wall of falling water as a “waterfall.”

Then there are AI platforms such as ChatGPT, a type of AI known as a large language model, or LLM. Robust LLMs have the ability to have conversational text-based chats with a person, provided their corpus is large and rich enough. And depending on what its corpus contains, an LLM can also answer complex questions or even generate original creative works, like short stories or the code to create a space shooter game. Its abilities simply depend on the data contained in the corpus that was used to train the AI.

In ChatGPT’s case, I wanted to know what made up its corpus, so I just asked it. “[The ChatGPT corpus] consists of a wide range of text from the internet, including websites, books, articles, and other publicly available sources,” it replied. Not content with the rather vague answer, I asked ChatGPT to elaborate on the types of data in its corpus. This time ChatGPT was more detailed:

• Websites: Text from websites across different domains and topics.

• Books: Text from a wide range of books covering various genres and subjects.

• Articles: Text from news articles, magazine features, and blog posts.

• Research Papers: Text from scientific papers and publications.

• Conversational Data: Text from dialogues, conversations, and interactions.

• Social Media: Text from platforms like Twitter, Reddit, and online forums.

• Wikipedia: Text from Wikipedia articles spanning numerous topics.


Notice one big omission from ChatGPT’s corpus: images. That’s because ChatGPT is a text-based AI generator. It can’t generate images because its corpus never contained any to train on.

The data funneled into Midjourney and ChatGPT are just two examples of what can make up a corpus. But a corpus can be made of any kind of data. For example, if you wanted to make an AI that could create music, you would simply include audio songs in its corpus. Or if you wanted an AI that could write a novel in the sparse style of Hemingway, you would use a corpus containing only Hemingway’s written works.
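To make the idea concrete, here is a toy sketch in Python. It is not how ChatGPT or Midjourney actually work; it is a deliberately tiny stand-in, assuming a made-up two-sentence corpus and a hypothetical helper called train_bigram_model, that simply records which word follows which. The point it illustrates is the one above: the "model" can only draw on what its corpus contained.

```python
from collections import defaultdict


def train_bigram_model(corpus):
    """Map each word to the set of words that follow it somewhere in the corpus.

    Hypothetical helper for illustration only; real language models are
    vastly more complex than this lookup table.
    """
    followers = defaultdict(set)
    for document in corpus:
        words = document.lower().split()
        # Pair every word with the word that comes right after it.
        for current_word, next_word in zip(words, words[1:]):
            followers[current_word].add(next_word)
    return dict(followers)


# A made-up, two-document corpus (an assumption for this sketch).
corpus = [
    "the waterfall is tall",
    "the river is wide",
]

model = train_bigram_model(corpus)

# Words the model has seen after "the" in its corpus.
print(sorted(model["the"]))

# "ocean" never appeared in the corpus, so the model has nothing to say about it.
print(model.get("ocean"))
```

Swap in a different corpus and the same code "knows" different things, which is the whole reason the contents of a corpus matter so much.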

The legality of corpora

If you don’t have a corpus to feed an AI, the AI cannot learn. And the larger your corpus is, the more proficient, or intelligent, the AI can become. But the actual data that makes up an AI’s corpus opens up a whole new can of worms when it comes to copyright and intellectual property law.

 

Have the owners of AI that trained on a corpus of copyrighted material violated the law? For example, if I create an AI that can generate Banksy-like artwork, and I trained the AI on a corpus of Banksy’s works, have I violated Banksy’s copyrights or intellectual property? My AI doesn’t reproduce his artwork, just his style, so is it still a violation of copyright or intellectual property? Or, say I create an AI with a corpus containing Rihanna’s songs. The AI can then generate completely new, original songs, but with Rihanna’s voice, or something close to it. Is that legal?

Universal Music Group already answered with a hard “no” after AI-generated songs by Drake and The Weeknd made the rounds on streaming services earlier this year. But creators who use AI tools might say otherwise. Ultimately, whether it’s in regard to AI-generated audio, visual, or text-based media, it’s a question that is likely to tie up courts around the world for years to come as generative AI programs like ChatGPT and Midjourney become more commonplace.

At the same time, governments are already planning legislation that would place regulations on generative AI models. The European Union, for example, is proposing a law that would require the owner of an AI to divulge whether the AI’s corpus contained copyrighted material. That transparency would make it easier for copyright holders to identify which corpora their work has been used in—and thus seek compensation.

In the United States, the Congressional Research Service recently advised Congress that it may wish to “adopt a wait-and-see approach” before updating copyright legislation, suggesting that it monitor how the courts react in the years ahead to AI-generated copyright cases.

AI corpora as a revenue stream

Of course, some content creators will choose to embrace the revenue-generating opportunities that AI stands to offer—those that have large enough bodies of work, anyway. Let’s say a living painter did want to make some extra cash. She could simply package her collection of works in a corpus and sell access to it to generative AI companies. Authors could sell a corpus of their novels; magazine publishers could sell a corpus of their back-issues; and singers could sell a corpus of their vocals—or demand a part of the cut earned by any AI-generated work their corpus fueled, as Grimes has already proposed.

Heck, if Elon Musk wanted a new revenue stream for his flailing Twitter, he might consider packaging all the tweets on the platform into a corpus to sell to AI startups. Meta’s Facebook would also find a new revenue stream in this (provided Twitter and Meta can claim ownership of users’ posts, that is). Indeed, Reddit’s corpus of users’ posts has been used to help train ChatGPT, and in a recent interview with The New York Times, Reddit CEO Steve Huffman said he knew the value of that corpus. “The Reddit corpus of data is really valuable. But we don’t need to give all of that value to some of the largest companies in the world for free.”

In this sense, as more companies expand into the AI space, robust, pre-packaged corpora may become as important in the tech world as pick axes were to the miners of the gold rush, and a whole new cottage industry of corpora sellers may appear.

If that’s the case, in the months and years ahead, “corpus” is set to become a regular part of the vernacular when we talk about, and debate, AI.

Fast Company
