Why two teams converged on the same AI research architecture

In January, three researchers I'd never met published a system that does my job's most technical work almost exactly the way I built it. [Katz, A., Coloyan Fleming, G., & Main, J. B. (2026). Thematic analysis with open-source generative AI and machine learning: a new method for inductive qualitative codebook development. Humanities and Social Sciences Communications, 13, 209.] Their system is called GATOS. Mine is called Insights Agent. Different models, different organizations, different fields: they work in engineering education research; I build research tooling in industry. Neither team knew about the other.

And yet, stage for stage, we built nearly the same thing.

That convergence is the most useful data point I have about how AI transformation actually works in knowledge fields — more useful, I think, than either system on its own. Two teams with no contact don't land on the same architecture by luck. They land there because something upstream of the engineering forced the shape. This essay is about what that something is.

What they built

GATOS runs thematic analysis — the workhorse method of qualitative research — as a pipeline of small, explicit steps. First it condenses each raw response into its key ideas. Then it embeds and clusters those ideas by semantic similarity. Then it generates a codebook inductively, deciding for each cluster whether an existing code covers it or a new one is needed — and it makes that decision against explicit criteria: parsimony, level of abstraction, non-redundancy. Only then, at the end, does it organize codes into themes.
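To make the shape of that sequence concrete, here is a minimal toy sketch in Python. It is not the GATOS implementation: the function names are hypothetical, sentence splitting stands in for LLM condensation, a bag-of-words vector stands in for a real embedding, and a single similarity threshold stands in for the codebook criteria (it approximates only non-redundancy, not parsimony or level of abstraction). The final theming step is left out.

```python
from collections import Counter
from math import sqrt

def condense(response: str) -> list[str]:
    """Stand-in for the condensation step: one 'idea' per sentence."""
    return [s.strip().lower() for s in response.split(".") if s.strip()]

def embed(idea: str) -> Counter:
    """Stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(idea.split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(ideas: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Greedy clustering: an idea joins the first cluster whose seed is similar enough."""
    clusters: list[list[str]] = []
    for idea in ideas:
        for group in clusters:
            if cosine(embed(idea), embed(group[0])) >= threshold:
                group.append(idea)
                break
        else:
            clusters.append([idea])
    return clusters

def update_codebook(codebook: dict[str, list[str]], group: list[str],
                    threshold: float = 0.5) -> None:
    """Reuse an existing code if it already covers the cluster; otherwise add a new code."""
    for code, members in codebook.items():
        if cosine(embed(code), embed(group[0])) >= threshold:
            members.extend(group)
            return
    codebook[group[0]] = list(group)  # assumption: the seed idea doubles as the code label

def run_batch(codebook: dict[str, list[str]], responses: list[str]) -> None:
    """Ideas before clusters, clusters before codes; the codebook persists across batches."""
    ideas = [idea for r in responses for idea in condense(r)]
    for group in cluster(ideas):
        update_codebook(codebook, group)

codebook: dict[str, list[str]] = {}
run_batch(codebook, ["Workload is too heavy. Advisors are supportive."])
run_batch(codebook, ["The workload is heavy. Labs need better equipment."])
print(list(codebook))
```

In this toy run, "the workload is heavy" from the second batch is filed under the existing workload code instead of creating a new one, while the equipment idea becomes a new code. That per-cluster decision, made against an explicit criterion, is the part the sketch is meant to show.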

If you've trained in qualitative methods, that sequence should sound familiar. It's the discipline of inductive coding, made executable: ideas before codes, codes before themes, and a codebook that has to justify every entry. The criteria GATOS applies to its own candidate codes are the same ones a methods textbook gives a human coder.

Their validation is worth describing precisely, because the numbers are the credible kind: specific and imperfect. The team generated three synthetic datasets (854, 823, and 1,110 responses) with known themes deliberately planted in them, then measured how much of the planted structure GATOS recovered. Across the three datasets, 54 of 60, 52 of 64, and 60 of 63 sub-themes came back as good matches, with most of the remainder partial rather than absent. Fewer than 5% of the planted sub-themes lacked a clear match; under 2% had no close match at all. The validation is on synthetic data, and the authors say so plainly; whether the results hold on messy human-participant data is the open question they flag themselves. That candor is part of why I trust the work.
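For readers who want the good-match counts as rates, the arithmetic is short: roughly 90%, 81%, and 95% per dataset. The pooled figure (about 89%) is my own aggregation of the three counts quoted above, not a number taken from the paper, and it counts only good matches, so partial matches are excluded.

```python
# (good matches, planted sub-themes) per synthetic dataset, as quoted above
results = [(54, 60), (52, 64), (60, 63)]

# Per-dataset share of planted sub-themes recovered as good matches
rates = [good / planted for good, planted in results]

# Pooled share across all three datasets (an aggregation assumed here, not from the paper)
pooled = sum(good for good, _ in results) / sum(planted for _, planted in results)

for (good, planted), rate in zip(results, rates):
    print(f"{good}/{planted} = {rate:.3f}")
print(f"pooled = {pooled:.3f}")
```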

My own pipeline's benchmarks are internal rather than published, so treat the comparison accordingly: in our evaluations against human analysts, Insights Agent's theme output matched at 95–98%. [A correction, since this essay exists partly because citations deserve care: the LinkedIn version of this piece compressed the two systems' numbers into one range. The 95–98% figure is my pipeline's internal benchmark; GATOS's published per-dataset rates are the ones above. Different evals, different data: directionally similar, not interchangeable.]

The convergence is not an engineering coincidence

Here is the part that matters. The places where GATOS and Insights Agent agree are not the places where engineering taste usually shows up — model choice, prompt style, orchestration framework. We differ on all of those. We converged on something else: the decomposition. Which steps exist. What order they run in. What each step is allowed to do. Where a structured criterion replaces an open-ended generation.

Both systems refuse to ask a model the tempting question — "read these transcripts and tell me the themes" — and instead ask a sequence of small, checkable ones. Both put codes before themes, because that's where the inferential chain stays auditable. Both constrain the codebook with the same discipline human methodologists use, because an unconstrained codebook bloats into uselessness in either species of analyst.

None of that comes from knowing models. All of it comes from knowing the method, from having spent years inside thematic analysis, watching where it actually produces insight and where it's mechanical transcription of judgment already made. The decomposition is the expertise. [This is the same observation the reflexive thematic analysis tradition makes about human teams: Braun and Clarke's method is less a recipe than a discipline about which moves are interpretive. Two teams who internalized that discipline built it into software independently.] Two teams who knew the method converged; the convergence is what deep domain knowledge looks like when it's expressed as architecture.

Where we diverged is just as telling. GATOS runs its pipeline largely end to end, with humans evaluating the output; Insights Agent places researchers at review gates inside the run. I think the gates matter — I've written a whole essay about why — but notice what kind of disagreement that is. The mechanical skeleton was settled the same way by both teams. The open design question, the one where reasonable builders still differ, is the position of the human. That's where the field's real argument is, and you can only have that argument once the decomposition is right.
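The difference between those two placements of the human is small in code and large in practice. The sketch below is hypothetical and reflects neither system's actual code: `run`, `Stage`, `Gate`, and the simulated `reviewer` are illustrative names, and the reviewer function stands in for a person editing intermediate output. The same stage list runs either end to end or with a review gate after every stage.

```python
from typing import Callable, Optional

Stage = Callable[[list[str]], list[str]]
Gate = Callable[[str, list[str]], list[str]]  # (stage name, stage output) -> reviewed output

def run(stages: list[tuple[str, Stage]], data: list[str],
        gate: Optional[Gate] = None) -> list[str]:
    """Run stages in order; with a gate, a reviewer sees each intermediate output."""
    for name, stage in stages:
        data = stage(data)
        if gate is not None:
            data = gate(name, data)  # the reviewer may edit, drop, or approve items
    return data

# Two toy stages standing in for real pipeline steps
stages: list[tuple[str, Stage]] = [
    ("condense", lambda items: [item.lower() for item in items]),
    ("dedupe", lambda items: sorted(set(items))),
]

def reviewer(stage_name: str, items: list[str]) -> list[str]:
    """Simulated human judgment: discard non-answers at every gate."""
    return [item for item in items if item != "n/a"]

data = ["Good labs", "N/A", "good labs"]
end_to_end = run(stages, data)              # humans evaluate only the final output
gated = run(stages, data, gate=reviewer)    # humans intervene inside the run
print(end_to_end, gated)
```

The stage list is identical in both calls, which is the point: the skeleton is shared, and the open question is only where the gate sits.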

The 40% that won't make it

Gartner predicts that more than 40% of agentic AI projects will be canceled by the end of 2027 because of escalating costs, unclear business value, or inadequate risk controls. [Gartner press release, June 2025, based on a poll of 3,400+ organizations. The same release estimates that of the thousands of vendors claiming agentic AI, only about 130 are building the real thing; the rest are "agent washing" existing products.] The prediction made the rounds as a verdict on the technology. I read it as a verdict on a playbook.

The default AI transformation playbook is: hire engineers, point them at a workflow, automate it. That playbook works when the workflow is genuinely mechanical, because a spec can capture it. It fails for knowledge work, and it fails in a specific way: the spec gets written by someone who can describe the workflow's steps but not its judgments. The engineers build the steps. The judgments — which step is interpretation, which output is wrong in a way that matters, which 5% of cases the whole exercise exists for — never make it into the system, because nobody in the room had them.

You can see the difference in where the domain expert sits. In the failing version, they appear at the end, as "validation" — asked to bless output from a system whose design they never touched. By then the decomposition is frozen and the expert's knowledge has nowhere to go except a thumbs up or down on someone else's architecture. In the surviving version, the expert is at the center from the first whiteboard: deciding the decomposition, owning the evaluation set, defining what failure looks like before the first line of code. GATOS was built that way. So was Insights Agent. That's the convergence underneath the convergence.

The highest-leverage AI hire isn't always another engineer. Sometimes it's the person who spent a decade doing the work the AI is supposed to transform — not because they're better at building, but because they're the only one who knows what must not be automated.

What this means if you're leading a research team

The practical version of this essay fits in three questions about any AI project touching your team's work:

Who decided the decomposition — the breakdown of the workflow into automatable and interpretive steps? If the answer is "the engineering team, from a process doc," the project is running the failing playbook regardless of how good the models are.

Who owns the evaluation? Recovery rates like the ones in the GATOS paper exist because the builders knew what ground truth looked like in their field and constructed a test for it. If nobody on your project can say what a good output is with that precision, the system's quality is unknowable — to everyone, including its builders.

And who can stop it? Not in the dramatic kill-switch sense — in the mundane sense of being empowered to say "this theme is plausible-sounding and wrong" and have that judgment shape the next iteration rather than disappear into a feedback form.

Teams that can answer all three with names of people who know the work — those are the ones I'd bet survive Gartner's cut. The architecture converges when the people who know the work design it. The projects collapse when they don't. Two systems built an ocean apart are the cleanest evidence I have.

If your team is sorting out which of these seats are filled, that's exactly the work I do.

Expanded from a piece first published on LinkedIn, March 2026.