Why AI Depends on Data – Especially the Data We Don't Have Yet
Datasets like MultiLIGHT are invaluable for teaching models how conversations actually work. But compared to written text, we're still living in a data desert.

Most of today's frontier models sit on top of a mountain of written language: web pages, books, Wikipedia, PDFs, code, forum posts, and more. For images and text, that works surprisingly well. If you want to learn what a spoon looks like, there are millions of labeled spoon photos.
On The Weekly Show with Jon Stewart, Geoffrey Hinton makes this point about vision models: the whole approach assumes a giant bucket of labeled examples. Show the model millions of birds and it learns "birdness."
That assumption quietly breaks the moment you move from simple image labeling to live, messy, multi-person conversation.
"Conversation is the natural way humans think together." – Margaret Wheatley
The Data Desert for Conversation
We don't just have a data problem. We have a data desert. Models are overfed on written text and starving for rich, labeled conversation.

Even when we say "conversational AI," most of the training data is still:
- Clean text chat
- Forum threads and support logs
- Email-like, turn-based exchanges
- One-on-one phone conversations
Very little of it looks like the reality of group talk:
- Real-time, overlapping multi-speaker speech
- Accurate speaker and addressee labels
- Emotions, intentions, decisions, and follow-up actions
- All synchronized to audio (and often video)
We're asking models to join our meetings and "read the room," but we've mostly taught them to read blog posts.
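To make the gap concrete, here is a minimal sketch in Python of what one richly labeled utterance might carry. The schema and field names are hypothetical, not drawn from any existing corpus; the point is how much structure has to sit on top of the raw words.

```python
# Hypothetical schema for one richly labeled utterance -- illustrative only,
# not drawn from any specific corpus.
from dataclasses import dataclass, field

@dataclass
class LabeledUtterance:
    speaker: str                          # who is speaking
    addressees: list[str]                 # who they are talking to
    text: str                             # the words themselves
    start_s: float                        # start time in the audio, in seconds
    end_s: float                          # end time in the audio, in seconds
    emotion: str | None = None            # e.g. "frustrated", "enthusiastic"
    dialog_act: str | None = None         # e.g. "question", "agreement", "decision"
    follow_up_action: str | None = None   # e.g. "send the revised draft by Friday"
    overlaps_with: list[str] = field(default_factory=list)  # simultaneous speakers
```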
The First Fragile Datasets
Because of this imbalance, early multi-party and meeting datasets are far more important than their size implies. They're like concept cars: expensive and limited, but they show what the future could look like.
A non-exhaustive set of building blocks includes:
- MultiLIGHT – multi-character role-play dialogue that starts to resemble real multi-party interaction.
- Persona-aware multi-party corpora – conversations that track who is speaking to whom, with rich social and persona metadata.
- Multi-session dialogue datasets – modeling what a main participant knows about each partner over time.
- Meeting corpora – real meetings with audio, video, roles, topics, decisions, and actions annotated.
- Emotion-labeled dialogue – multi-party conversations with emotion labels on each utterance.
- Task and agent benchmarks – embodied and multi-agent setups where language drives coordination and action.
Even if you combine everything we have today, you still get orders of magnitude less conversational data than written language. These datasets are precious – and painfully rare.
Where Conversational Data Will Actually Come From
We can't hand-label our way to millions of hours of pristine, multi-speaker conversation. Growth has to come from new sources and new tricks.
In broad strokes, there are three main sources of trainable conversation-like data:
- Human-recorded & hand-labeled audio – slow, expensive, but extremely high-signal.
- Unlabeled raw audio – calls, meetings, games; plentiful but chaotic and hard to use directly.
- Synthetic + auto-labeled audio – scripted or programmatically generated conversations, rendered with modern TTS and richly labeled from the start.
The future bulk of high-value conversational training data will likely come from the third category: synthetic and auto-labeled pipelines, bootstrapped by the models we already have. Just as self-driving cars now learn more from simulated miles than real roads, conversational AI may soon train primarily on synthetic dialogue.
[Chart: Growth of Conversational Data Sources Over Time – a conceptual projection of data availability from 2020–2030.]
How Do We Terraform the Desert?
1. Auto-label the messy real world
We can take real-world multi-speaker audio and layer structure on top using existing models:
- Speaker diarization – who is speaking when.
- Addressee detection – who they are talking to.
- Emotion and prosody – how they are saying it.
- Dialog acts – questions, answers, agreements, hedges, decisions, and so on.
These labels are noisy, but at scale they become powerful pretraining material.
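As a rough sketch of what that layering could look like in code: here, pyannote.audio handles diarization (its pretrained pipelines may require a Hugging Face access token), while the addressee, emotion, and dialog-act classifiers are hypothetical stubs standing in for whatever models you have on hand.

```python
# Sketch: layering structure on top of raw multi-speaker audio.
# pyannote.audio handles diarization; the three classify_* functions are
# hypothetical stubs -- in practice, each would be its own model.
from pyannote.audio import Pipeline

diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")

def classify_addressee(audio_path, turn):
    return "unknown"    # stub: who is being spoken to

def classify_emotion(audio_path, turn):
    return "neutral"    # stub: prosody/emotion over this time span

def classify_dialog_act(audio_path, turn):
    return "statement"  # stub: question, agreement, hedge, decision, ...

def auto_label(audio_path: str) -> list[dict]:
    """Return one noisy-but-structured record per detected speech turn."""
    records = []
    for turn, _, speaker in diarizer(audio_path).itertracks(yield_label=True):
        records.append({
            "speaker": speaker,                                   # who is speaking
            "start_s": turn.start, "end_s": turn.end,             # when
            "addressee": classify_addressee(audio_path, turn),    # to whom
            "emotion": classify_emotion(audio_path, turn),        # how
            "dialog_act": classify_dialog_act(audio_path, turn),  # why
        })
    return records
```

The appeal of this shape is that each layer can be swapped for a better model later without touching the rest of the pipeline.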
2. Generate structured conversations
We can also work from the top down and let models help write their own curriculum:
- Multi-agent systems simulate teams arguing, aligning, and planning.
- Program-guided generators define roles, goals, and scene constraints; language models fill in plausible dialogue.
- Theory-of-mind setups specify what each participant believes and wants, then generate conversations that reflect those internal states.
Because we control the generator, we know who each speaker is, what they know, and what each utterance is trying to accomplish.
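Here is a minimal sketch of a program-guided generator, assuming a hypothetical complete() wrapper around whatever language model you use; the scene specification format is illustrative, not a standard.

```python
# Sketch: program-guided conversation generation.
# `complete` is a hypothetical wrapper around whatever LLM API you use.
SCENE = {
    "setting": "weekly product standup, four participants, ten minutes",
    "roles": {
        "Ana": {"goal": "push the launch back a week", "knows": "QA found a blocker"},
        "Ben": {"goal": "keep the launch date fixed", "knows": "ads are already booked"},
        "Cam": {"goal": "mediate and reach a decision", "knows": "both constraints"},
        "Dee": {"goal": "leave with a clear action item", "knows": "nothing yet"},
    },
    "must_happen": ["an interruption", "a decision", "one follow-up action"],
}

def build_prompt(scene: dict) -> str:
    lines = [f"Write a multi-party conversation. Setting: {scene['setting']}."]
    for name, role in scene["roles"].items():
        lines.append(f"{name}: goal = {role['goal']}; privately knows = {role['knows']}.")
    lines.append("Required events: " + ", ".join(scene["must_happen"]) + ".")
    lines.append("Format every line as 'Speaker -> Addressee: utterance [dialog act]'.")
    return "\n".join(lines)

def complete(prompt: str) -> str:
    raise NotImplementedError("swap in your LLM provider's client here")

# script = complete(build_prompt(SCENE))  # returns a fully labeled dialogue script
```

A useful side effect: every constraint in the scene spec doubles as a label, so bad generations are cheap to filter by checking the output against the spec.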
3. Turn scripts into audio
Modern TTS can turn these scripts into believable multi-speaker audio:
- Distinct voices and speaking styles per participant.
- Natural interruptions, overlaps, and backchannels.
- Tunable pacing and emotional tone.
Now every token and every timestamp has a known speaker, intent, and often world state behind it. That makes these synthetic datasets incredibly trainable.
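As a final sketch, assuming a hypothetical synthesize() TTS call, pydub can place each rendered clip at its scripted timestamp on a shared timeline, which is what makes overlaps and interruptions possible.

```python
# Sketch: render a labeled script into overlapping multi-speaker audio.
# `synthesize` is a hypothetical TTS call; pydub places clips on a shared timeline.
from pydub import AudioSegment

VOICES = {"Ana": "voice_a", "Ben": "voice_b", "Cam": "voice_c", "Dee": "voice_d"}

def synthesize(text: str, voice: str) -> str:
    raise NotImplementedError("swap in your TTS provider; return a WAV file path")

def render(script: list[dict], total_ms: int = 60_000) -> AudioSegment:
    """script: [{"speaker": ..., "text": ..., "start_ms": ...}, ...]"""
    timeline = AudioSegment.silent(duration=total_ms)
    for utt in script:
        clip = AudioSegment.from_wav(synthesize(utt["text"], VOICES[utt["speaker"]]))
        # Overlay rather than append, so two speakers can talk at once.
        timeline = timeline.overlay(clip, position=utt["start_ms"])
    return timeline
```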
Where Joinin.ai Fits
This is exactly the gap Joinin.ai is designed to fill: moving from "summarize my meeting after it's over" to "join the conversation while it's happening."
The Data Desert Won't Solve Itself
There's an uncomfortable truth underneath all of this: we're trying to build deeply conversational AI on top of data that mostly isn't conversation.
The data desert won't solve itself. Someone has to design systems that work inside it today, squeezing structure out of limited, noisy data, while also irrigating it for tomorrow by creating, curating, and labeling new, high-quality conversational datasets.
