Two Ways to Make AIs Work Together

Last week, Andrej Karpathy released LLM Council.

If you don't know Karpathy: he built the AI that taught Tesla cars to drive themselves. Before that, he was a founding member of OpenAI. When he builds something, people pay attention.

His idea: ask multiple AI models the same question. Let them see each other's responses. Have them rank who did best. Then a "Chairman" AI synthesizes the final answer.

The project got thousands of stars in days. But what caught my attention was what he said about it:

"Models consistently praise GPT 5.1 as the best and most insightful model, and consistently select Claude as the worst... But I'm not 100% convinced this aligns with my own qualitative assessment."

The models ranked each other—but Karpathy's gut disagreed.

He added:

"The construction of LLM ensembles seems under-explored. There's probably a whole design space of the data flow of your LLM council."

That "design space" comment stuck with me. I've been building TachiBot, a multi-model tool for Claude Code, and it takes a completely different approach. I wanted to understand why.

Two Architectures for Multi-Model AI

When you ask multiple AIs the same question, what do you do with the answers?

LLM Council's approach: Peer Ranking

All models answer the question
Each model reviews the others: "Response A does well in... does poorly by..."
Models rank each other from best to worst
A Chairman synthesizes the best elements

TachiBot's approach: Peer Challenge

Multiple models analyze the question
Each model challenges the others: "Hidden assumptions... edge cases that break this..."
Models find flaws and counter-evidence
A consensus emerges from the debate

The difference is subtle but important:

Ranking asks: "Which answer is best?"
Challenging asks: "What's wrong with these answers?"

Both systems use critique—but LLM Council's critique serves the ranking, while TachiBot's critique is the process.

Think of it this way: LLM Council is a boardroom meeting—polite, balanced, reaching consensus. TachiBot is a courtroom cross-examination—adversarial, probing, breaking the witness, with Perplexity and Grok fact-checking claims against live web sources.

But which approach finds what you actually need?

What This Looks Like in Practice

I asked both systems: "Should a startup build with microservices or a monolith?"

LLM Council's Output

The Chairman reviewed the rankings and synthesized:

"Monolith until proven guilty."

Clean. Memorable. The peer rankings noted which responses were comprehensive vs. terse, and the synthesis picked the best elements from each.

TachiBot's Output

The challenge step from Grok tore into the consensus:

"Cherry-picking and source homogeneity... This creates an echo chamber."

"Overemphasis on pro-monolith sources, downplaying counterarguments."

"Hidden assumption: Long-term project survival. Reality: 70-80% of startups fail within 2-5 years. Most codebases are abandoned before debt bites."

Then it listed edge cases: hyper-scale startups, distributed teams, compliance-heavy domains, AI/ML workloads.

The final consensus:

"Start with a modular monolith—one deployable app with clean internal boundaries—because it maximizes early speed while preserving escape hatches to spin out services when scale demands it."

Challenge workflow: 'Monolith! It's the obvious choice' → 'BUT WAIT! Hidden assumption: 70% of startups fail before debt matters!' with EDGE CASE, FLAW, COUNTER-EVIDENCE warnings → Final: 'Modular monolith with escape hatches' - 'Better answer!'

Same Question, Different Discoveries

Both systems reached similar conclusions (monolith-first). But the journey was different.

LLM Council surfaced:

Which models were most comprehensive
Which answers were too terse or too verbose
A well-synthesized soundbite

TachiBot surfaced:

Hidden assumptions in the consensus
Edge cases where the advice fails
Counter-evidence the initial answers missed

Neither is "better." They're different tools for different purposes.

So when should you use which?

When Each Approach Shines

Ranking works well when:

You need to filter out bad answers (hallucinations, errors)
You want a polished, synthesized response
The question has a "best" answer worth finding

Challenging works well when:

The question involves tradeoffs with no clear winner
You need to stress-test assumptions
You want to find edge cases before committing

Karpathy noticed something similar. When his models ranked each other, they consistently picked GPT 5.1 as best—but he found it "too wordy." The ranking reflected majority preference, not necessarily what he needed.

Here's the uncomfortable question: What if the majority is wrong?

Decision framework: 'Which approach?' - USE RANKING (filter bad answers, polished synthesis, clear best exists) vs USE CHALLENGING (tradeoffs involved, stress-test assumptions, find edge cases). Different tools for different jobs!

The Under-Explored Design Space

Karpathy called this "under-explored," and he's right.

Most multi-model systems default to consensus: average the answers, vote on the best, synthesize into one. The boardroom reaches agreement. That works for filtering noise.

But what if the minority opinion is the one you need? What if the consensus is missing a crucial edge case? Sometimes you need the courtroom—someone willing to break the witness. Ranking won't surface that. Challenging might.

The design space includes:

Ranking (LLM Council): Who answered best?
Challenging (TachiBot's pingpong): What's wrong with these answers?
Specialization: Different models for different tasks (research, analysis, synthesis)
Debate: Models argue opposing positions
Verification: One model checks another's work

We're just scratching the surface.

4-panel comic: AI RANKING (robots worshipping GPT: 'OUR GLORIOUS LEADER!' 'MARRY ME, GPT!') vs HUMAN JUDGMENT (developer asleep on papers: 'Zzz... too much text'). LLM COUNCIL FOUND (dark temple worship scene) vs TACHIBOT FOUND (detective evidence board: 'Hidden assumptions', 'Counter-evidence', 'THE EVIDENCE IS CLEAR!').

Try It Yourself

LLM Council: github.com/karpathy/llm-council

Run your own questions through it. Watch how models rank each other. Notice when you agree with their ranking—and when you don't.

TachiBot: tachibot.com

Full disclosure: I built this. But the pingpong workflow that challenges answers is genuinely different from ranking. Try both—the boardroom and the courtroom—and see which surfaces insights you'd have missed.

Thanks to Karpathy for calling this space "under-explored." He's right—and that's what makes it interesting.