Why AI Depends on Data – Especially the Data We Don't Have Yet
Datasets like MultiLIGHT are invaluable for teaching models how conversations actually work. But compared to written text, we're still living in a data desert.

Most of today's frontier models sit on top of a mountain of written language: web pages, books, Wikipedia, PDFs, code, forum posts, and more. For images and text, that works surprisingly well. If you want to learn what a spoon looks like, there are millions of labeled spoon photos.
On The Weekly Show with Jon Stewart, Geoffrey Hinton makes this point about vision models: the whole approach assumes a giant bucket of labeled examples. Show the model millions of birds and it learns "birdness."
That assumption quietly breaks the moment you move from simple image labeling to live, messy, multi-person conversation.
"Conversation is the natural way humans think together." – Margaret Wheatley
The Data Desert for Conversation
We don't just have a data problem. We have a data desert. Models are overfed on written text and starving for rich, labeled conversation.

Even when we say "conversational AI," most of the training data is still:
- Clean text chat
- Forum threads and support logs
- Email-like, turn-based exchanges
- One-on-one phone conversations
Very little of it looks like the reality of group talk:
- Real-time, overlapping multi-speaker speech
- Accurate speaker and addressee labels
- Emotions, intentions, decisions, and follow-up actions
- All synchronized to audio (and often video)
We're asking models to join our meetings and "read the room," but we've mostly taught them to read blog posts.
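To make the gap concrete, here is a minimal sketch in Python of what one richly labeled utterance might carry. The schema and field names are hypothetical, not drawn from any existing corpus; the point is how much structure has to sit on top of the raw words.

```python
# Hypothetical schema for one richly labeled utterance -- illustrative only,
# not drawn from any specific corpus.
from dataclasses import dataclass, field

@dataclass
class LabeledUtterance:
    speaker: str                          # who is speaking
    addressees: list[str]                 # who they are talking to
    text: str                             # the words themselves
    start_s: float                        # start time in the audio, in seconds
    end_s: float                          # end time in the audio, in seconds
    emotion: str | None = None            # e.g. "frustrated", "enthusiastic"
    dialog_act: str | None = None         # e.g. "question", "agreement", "decision"
    follow_up_action: str | None = None   # e.g. "send the revised draft by Friday"
    overlaps_with: list[str] = field(default_factory=list)  # simultaneous speakers
```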
The First Fragile Datasets
Because of this imbalance, early multi-party and meeting datasets are far more important than their size implies. They're like concept cars: expensive and limited, but they show what the future could look like.
A non-exhaustive set of building blocks includes:
- MultiLIGHT – multi-character role-play dialogue that starts to resemble real multi-party interaction.
- Persona-aware multi-party corpora – conversations that track who is speaking to whom, with rich social and persona metadata.
- Multi-session dialogue datasets – modeling what a main participant knows about each partner over time.
- Meeting corpora – real meetings with audio, video, roles, topics, decisions, and actions annotated.
- Emotion-labeled dialogue – multi-party conversations with emotion labels on each utterance.
- Task and agent benchmarks – embodied and multi-agent setups where language drives coordination and action.
Even if you combine everything we have today, you still get orders of magnitude less conversational data than written language. These datasets are precious – and painfully rare.
Where Conversational Data Will Actually Come From
We can't hand-label our way to millions of hours of pristine, multi-speaker conversation. Growth has to come from new sources and new tricks.
In broad strokes, there are three main sources of trainable conversation-like data:
- Human-recorded & hand-labeled audio – slow, expensive, but extremely high-signal.
- Unlabeled raw audio – calls, meetings, games; plentiful but chaotic and hard to use directly.
- Synthetic + auto-labeled audio – scripted or programmatically generated conversations, rendered with modern TTS and richly labeled from the start.
The future bulk of high-value conversational training data will likely come from the third category: synthetic and auto-labeled pipelines, bootstrapped by the models we already have. Just as self-driving cars now learn more from simulated miles than real roads, conversational AI may soon train primarily on synthetic dialogue.
[Chart: Growth of Conversational Data Sources Over Time – a conceptual projection of data availability from 2020–2030.]
How Do We Terraform the Desert?
1. Auto-label the messy real world
We can take real-world multi-speaker audio and layer structure on top using existing models:
- Speaker diarization – who is speaking when.
- Addressee detection – who they are talking to.
- Emotion and prosody – how they are saying it.
- Dialog acts – questions, answers, agreements, hedges, decisions, and so on.
These labels are noisy, but at scale they become powerful pretraining material.
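As a rough sketch of what that layering could look like in code: here, pyannote.audio handles diarization (its pretrained pipelines may require a Hugging Face access token), while the addressee, emotion, and dialog-act classifiers are hypothetical stubs standing in for whatever models you have on hand.

```python
# Sketch: layering structure on top of raw multi-speaker audio.
# pyannote.audio handles diarization; the three classify_* functions are
# hypothetical stubs -- in practice, each would be its own model.
from pyannote.audio import Pipeline

diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")

def classify_addressee(audio_path, turn):
    return "unknown"    # stub: who is being spoken to

def classify_emotion(audio_path, turn):
    return "neutral"    # stub: prosody/emotion over this time span

def classify_dialog_act(audio_path, turn):
    return "statement"  # stub: question, agreement, hedge, decision, ...

def auto_label(audio_path: str) -> list[dict]:
    """Return one noisy-but-structured record per detected speech turn."""
    records = []
    for turn, _, speaker in diarizer(audio_path).itertracks(yield_label=True):
        records.append({
            "speaker": speaker,                                   # who is speaking
            "start_s": turn.start, "end_s": turn.end,             # when
            "addressee": classify_addressee(audio_path, turn),    # to whom
            "emotion": classify_emotion(audio_path, turn),        # how
            "dialog_act": classify_dialog_act(audio_path, turn),  # why
        })
    return records
```

The appeal of this shape is that each layer can be swapped for a better model later without touching the rest of the pipeline.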
2. Generate structured conversations
We can also work from the top down and let models help write their own curriculum:
- Multi-agent systems simulate teams arguing, aligning, and planning.
- Program-guided generators define roles, goals, and scene constraints; language models fill in plausible dialogue.
- Theory-of-mind setups specify what each participant believes and wants, then generate conversations that reflect those internal states.
Because we control the generator, we know who each speaker is, what they know, and what each utterance is trying to accomplish.
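Here is a minimal sketch of a program-guided generator, assuming a hypothetical complete() wrapper around whatever language model you use; the scene specification format is illustrative, not a standard.

```python
# Sketch: program-guided conversation generation.
# `complete` is a hypothetical wrapper around whatever LLM API you use.
SCENE = {
    "setting": "weekly product standup, four participants, ten minutes",
    "roles": {
        "Ana": {"goal": "push the launch back a week", "knows": "QA found a blocker"},
        "Ben": {"goal": "keep the launch date fixed", "knows": "ads are already booked"},
        "Cam": {"goal": "mediate and reach a decision", "knows": "both constraints"},
        "Dee": {"goal": "leave with a clear action item", "knows": "nothing yet"},
    },
    "must_happen": ["an interruption", "a decision", "one follow-up action"],
}

def build_prompt(scene: dict) -> str:
    lines = [f"Write a multi-party conversation. Setting: {scene['setting']}."]
    for name, role in scene["roles"].items():
        lines.append(f"{name}: goal = {role['goal']}; privately knows = {role['knows']}.")
    lines.append("Required events: " + ", ".join(scene["must_happen"]) + ".")
    lines.append("Format every line as 'Speaker -> Addressee: utterance [dialog act]'.")
    return "\n".join(lines)

def complete(prompt: str) -> str:
    raise NotImplementedError("swap in your LLM provider's client here")

# script = complete(build_prompt(SCENE))  # returns a fully labeled dialogue script
```

A useful side effect: every constraint in the scene spec doubles as a label, so bad generations are cheap to filter by checking the output against the spec.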
3. Turn scripts into audio
Modern TTS can turn these scripts into believable multi-speaker audio:
- Distinct voices and speaking styles per participant.
- Natural interruptions, overlaps, and backchannels.
- Tunable pacing and emotional tone.
Now every token and every timestamp has a known speaker, intent, and often world state behind it. That makes these synthetic datasets incredibly trainable.
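As a final sketch, assuming a hypothetical synthesize() TTS call, pydub can place each rendered clip at its scripted timestamp on a shared timeline, which is what makes overlaps and interruptions possible.

```python
# Sketch: render a labeled script into overlapping multi-speaker audio.
# `synthesize` is a hypothetical TTS call; pydub places clips on a shared timeline.
from pydub import AudioSegment

VOICES = {"Ana": "voice_a", "Ben": "voice_b", "Cam": "voice_c", "Dee": "voice_d"}

def synthesize(text: str, voice: str) -> str:
    raise NotImplementedError("swap in your TTS provider; return a WAV file path")

def render(script: list[dict], total_ms: int = 60_000) -> AudioSegment:
    """script: [{"speaker": ..., "text": ..., "start_ms": ...}, ...]"""
    timeline = AudioSegment.silent(duration=total_ms)
    for utt in script:
        clip = AudioSegment.from_wav(synthesize(utt["text"], VOICES[utt["speaker"]]))
        # Overlay rather than append, so two speakers can talk at once.
        timeline = timeline.overlay(clip, position=utt["start_ms"])
    return timeline
```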
Where Joinin.ai Fits
This is exactly the gap Joinin.ai is designed to fill: moving from "summarize my meeting after it's over" to "join the conversation while it's happening."
The Data Desert Won't Solve Itself
There's an uncomfortable truth underneath all of this: we're trying to build deeply conversational AI on top of data that mostly isn't conversation.
The data desert won't solve itself. Someone has to design systems that work inside it today, squeezing structure out of limited, noisy data, while also irrigating it for tomorrow by creating, curating, and labeling new, high-quality conversational datasets.
