JoinIn.ai

Why AI Depends on Data, and Where to Get It

November 28, 2025

Data sets like MultiLIGHT and MetaMind are invaluable for teaching an LLM how conversations work


In a recent episode of The Weekly Show with Jon Stewart1, 2018 Turing Award winner and "Godfather of AI" Geoffrey Hinton walked his host through an extremely accessible explanation of how neural networks learn, starting with the classic example of determining whether a picture contains a bird. Their discussion focused on examining the pixels, extracting features, and moving up layers of abstraction, but buried in there is an implicit assumption: you have a bunch of pictures of birds.
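That pixels-to-abstractions pipeline can be sketched as a tiny feed-forward network. Everything below is illustrative: the layer sizes are arbitrary, the weights are random stand-ins for what training would actually learn, and the comments name the kinds of features each level might come to represent.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy 8x8 grayscale "image", flattened to 64 pixel values in [0, 1].
pixels = rng.random(64)

def layer(x, n_out, rng):
    """One fully connected layer with a ReLU: combines lower-level
    features into higher-level ones. The random weights are placeholders
    for values a real network would learn from labeled bird pictures."""
    w = rng.normal(scale=0.1, size=(n_out, x.size))
    b = np.zeros(n_out)
    return np.maximum(0.0, w @ x + b)

# Each layer operates on the previous layer's output, so the
# representation grows more abstract as we move up.
edges = layer(pixels, 32, rng)   # e.g. local edges and blobs
parts = layer(edges, 16, rng)    # e.g. beak-like or wing-like parts
score = layer(parts, 1, rng)     # a single "bird-ness" activation

# Squash to a probability; with trained weights this would be the
# network's confidence that the picture contains a bird.
prob_bird = 1.0 / (1.0 + np.exp(-score[0]))
print(f"P(bird) = {prob_bird:.3f}")
```

With random weights the output is meaningless, of course; the whole point of training data is to push those weights toward values where `prob_bird` actually tracks birds.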

That's reasonable to assume, but what if the thing you are after isn't so easy to pin down? Whether a picture contains a bird is easy to agree on, but what does it mean for text prediction to be "correct"? The better metric is probably "acceptable", and acceptability is ultimately in the eye of the user. The best chance of achieving it is to follow what the training text collectively says.

The Challenge of Conversation Data

Companies like OpenAI and Anthropic have the entirety of the public internet to scrape, so the sheer quantity of content covers their text prediction needs. But training an AI to understand conversations requires a corpus of structured dialog, which isn't nearly as readily available. Our first idea for such a source was closed-captioning transcriptions of TV dialog, and while that's not nothing, the gap between what screenwriters create and how people actually interact with one another seemed risky.
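To make "structured dialog" concrete, here is one hypothetical shape such a record might take for a multi-party conversation, and how it could be flattened into (context, next utterance) pairs for training. The field names and the sample conversation are our own invention for illustration, not any real data set's schema.

```python
# A hypothetical multi-party dialog record; field names are illustrative.
dialog = {
    "setting": "a village marketplace",
    "turns": [
        {"speaker": "merchant", "text": "Fresh apples, two for a coin!"},
        {"speaker": "traveler", "text": "Any chance you take foreign coin?"},
        {"speaker": "guard", "text": "Move along, you're blocking the stall."},
    ],
}

def to_training_pairs(dialog):
    """Turn a structured dialog into (context, next_utterance) pairs --
    the shape a next-utterance prediction model would train on."""
    pairs = []
    turns = dialog["turns"]
    for i in range(1, len(turns)):
        context = [f'{t["speaker"]}: {t["text"]}' for t in turns[:i]]
        target = f'{turns[i]["speaker"]}: {turns[i]["text"]}'
        pairs.append((context, target))
    return pairs

for context, target in to_training_pairs(dialog):
    print(len(context), "turns of context ->", target)
```

The structure matters as much as the words: knowing who spoke, in what order, and in what setting is exactly what a pile of scraped web text doesn't give you.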

That's where efforts like MultiLIGHT2 from Meta come in: the culmination of thousands of hours of manual labor to create conversations involving multiple parties. With that, we can compare an AI's predictions against what humans who were assigned roles in an arbitrary scenario actually said.
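One rough way to make that comparison is word-overlap F1 between a model's predicted utterance and the human's actual one. This is a standard but admittedly crude metric, sketched here for illustration; it's not a claim about how MultiLIGHT itself is evaluated, and the sample utterances are made up.

```python
from collections import Counter

def unigram_f1(prediction: str, reference: str) -> float:
    """Word-overlap F1 between a predicted utterance and a human
    reference -- a common rough proxy for dialog prediction quality."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # shared word counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

human_said = "I think we should head to the tavern before dark"
model_said = "we should head to the tavern soon"
print(f"F1 = {unigram_f1(model_said, human_said):.2f}")
```

Surface overlap obviously misses a lot (a paraphrase scores poorly, a parroted reply scores well), which is part of why "acceptable" is a slipperier target than "bird or not bird".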

Without such an invaluable data set, we would have a much higher hill to climb before an AI could train itself to understand typical human conversations. Maybe we would have had to go the route of AlphaZero3, which threw away the mountains of chess game databases and played thousands of games against itself to decide what "good" looked like. But mastering the game of kings and being a good conversationalist aren't quite apples-to-apples, and so we owe a debt of gratitude to those who compiled the data our AI depends on.