Why Our AI Diagnostician Outperforms Doctors

by Archynetys Health Desk

The Gist

  • AI beats doctors at diagnosis. Microsoft’s MAI-DxO system solved 85.5% of complex cases, a fourfold improvement over human doctors.
  • Orchestration is the secret weapon. The system uses multiple specialized AI agents debating and collaborating in real time to improve accuracy and efficiency.
  • Doctors still matter — just differently. Suleyman says AI can enhance education, reduce costs and anxiety, while humans retain the role of empathetic guide and judgment provider.

As AI models get commoditized, the value will be added in that final layer of orchestration, Microsoft AI CEO Mustafa Suleyman says.

Microsoft earlier this month announced it built an AI diagnostician that outperforms human doctors on complex cases.

The system, called MAI-DxO, uses two bots to sort through a patient’s medical history and solves 85.5% of patient cases when paired with OpenAI’s o3 model. The results are a major leap above the 20% average accuracy that human doctors achieved on the same cases, although the humans were restricted from searching the web or speaking with colleagues.

In an in-depth conversation shortly after Microsoft announced the results, Microsoft AI CEO Mustafa Suleyman shared how the AI diagnostician was able to 4X the performance of human doctors, what it means for the future of medicine, and whether this is a positive trend for society.

You can read our full conversation below, edited lightly for length and clarity.

AI-Driven Search Transforms Healthcare Queries

Alex Kantrowitz: Hi Mustafa, good to see you again. First off, Copilot and Bing now field 50 million medical queries per day. Is that good?

Mustafa Suleyman: It’s incredible, because we’re already making access to information super cheap and concise with just search engines. And now with Copilot, answers are much more conversational. You can tone them down so they suit your specific level of knowledge and expertise, and as a result, more and more people are asking Copilot and Bing health-related questions.

The queries range from a cancer issue that someone’s dealing with, to a death in the family, to a mental health issue, to just having a skin rash. And so the variety is huge, but obviously we’ve got a really important objective here to try and improve the quality of our consumer health products.

Do the health questions that come into chatbots look different from search?

Copilot’s answers tend to be more succinct and responsive to the style and tone of the individual person asking the question, and that tends to encourage people to ask a second follow-up question. So it turns it into more of a dialog or a consultation that you might end up having with your doctor. So they are quite different from normal search queries.

Inside Microsoft’s Two-Bot Diagnostic System

Speaking of dialogs, let’s discuss Microsoft’s new AI diagnostician system. It’s actually two bots, where one bot acts as a gatekeeper to all of a patient’s medical information, and the other asks questions about that history and makes a diagnosis. You’ve found the system performs better than humans in diagnosing disease.

That’s exactly right. We essentially wanted to simulate what it would be like for an AI to act as a diagnostician, to ask the patient a series of questions, to draw out their case history, go through a whole bunch of tests that they may have had — pathology and radiology — and then iteratively examine the information that it’s getting in order to improve the accuracy and reliability of its prediction about what your diagnosis actually is.

We actually use the New England Journal of Medicine case histories, hundreds of these past cases. One of these cases comes out every single week, and it’s like an ultimate crossword for doctors. They don’t see the answer until the following week. And it’s a big guessing game to go back through five to seven pages of very detailed history, and then try to figure out what the diagnosis actually turns out to be.
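The question-answer loop Suleyman describes can be sketched in a few lines of code. This is a toy illustration only, assuming nothing about Microsoft’s actual implementation: the class names, the `answer`/`run` methods, and the sample case data are all invented to show the shape of a gatekeeper bot that reveals information only when asked, paired with a diagnostician bot that accumulates findings.

```python
# Hypothetical sketch of the two-bot loop: a "gatekeeper" holds the case
# record and reveals only what is explicitly requested, while a
# "diagnostician" iterates question -> answer -> refined hypothesis.
# All names and the toy case data are illustrative, not Microsoft's API.

class Gatekeeper:
    """Holds the full patient record; answers only direct queries."""
    def __init__(self, record):
        self.record = record

    def answer(self, query):
        # Reveal a field only if the diagnostician asks for it by name.
        return self.record.get(query, "not available")

class Diagnostician:
    """Requests information one piece at a time and updates a hypothesis."""
    def __init__(self, queries):
        self.queries = queries      # the information it plans to request
        self.findings = {}

    def run(self, gatekeeper):
        for q in self.queries:
            self.findings[q] = gatekeeper.answer(q)
        # Stand-in for the model's reasoning step over accumulated findings.
        if self.findings.get("biopsy") == "granuloma":
            return "sarcoidosis"
        return "undetermined"

record = {"history": "chronic cough", "imaging": "hilar lymphadenopathy",
          "biopsy": "granuloma"}
bot = Diagnostician(["history", "imaging", "biopsy"])
print(bot.run(Gatekeeper(record)))   # sarcoidosis
```

The point of the structure is that the diagnostician never sees the whole record at once; each piece of evidence must be requested, which is what makes the resulting transcript auditable.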

The Orchestration Layer Is Where Value Emerges

I thought one of the benefits of generative AI is it can take in a lot of information and then come to answers — often in one shot. What’s the benefit of having multiple bots sort through it?

The big breakthrough of the last six months or so in AI is these thinking or reasoning models that can query other agents, or find other information sources at inference time, to improve the quality of their response. Rather than just giving the first best answer, the model goes and consults a range of different sources, and that improves the quality of the information it finally gets to. So we see that this orchestrator, which under the hood uses four different models from the major providers, can actually improve the accuracy of each of the individual models, and of all of them together, by a very significant degree, about 10% or so. So it’s a big step forward. And I think that as the AI models get commoditized, really, all the value will be added in that final layer of orchestration and product integration, and that’s what we’re seeing with this diagnostic orchestrator.
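The orchestration idea can be made concrete with a minimal sketch. The model callables and the majority-vote reconciliation below are assumptions for illustration; the article confirms only that the orchestrator consults four underlying models, not how it combines their answers.

```python
# Illustrative orchestration layer: the same case is sent to several
# underlying models, and the orchestrator reconciles their answers.
# Here reconciliation is a simple majority vote; the real system's
# aggregation logic is not public. The model functions are stand-ins.
from collections import Counter

def model_a(case): return "lyme disease"
def model_b(case): return "lyme disease"
def model_c(case): return "sarcoidosis"
def model_d(case): return "lyme disease"

def orchestrate(case, models):
    """Query every model, then return the most common diagnosis."""
    votes = Counter(m(case) for m in models)
    diagnosis, _ = votes.most_common(1)[0]
    return diagnosis

print(orchestrate("case #123", [model_a, model_b, model_c, model_d]))
# lyme disease
```

Even this crude ensemble shows why a layer above the models can beat any one of them: a single model’s error is outvoted as long as the others agree.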

MAI-DxO Beats Human Doctors by 4X in Accuracy

So it’s a 10% increase in accurately diagnosing on top of the standard LLMs?

Yes. And in fact, we actually benchmark that against human performance. So we had a whole bunch of expert physicians play this simulated diagnosis environment game, and they, on average, get about one in five right, so about 20%. Whereas our orchestrator gets about 85% accuracy, so it’s four times more accurate. In my career, I’ve never seen such a big gulf between human-level performance and the AI system’s performance.

Many years ago, I worked on lots of diagnoses for radiology and head and neck cancer and mammography, and the goal was just to take a single radiology exam and predict, does it have cancer? And that was the most we could do. Whereas now it’s actually producing a very detailed diagnosis, and doing that sequentially through this interactive dialog mechanism. And so that massively improves the accuracy.

Doctors Can Learn From AI’s Diagnostic Thinking

What if you have the same thing happen in medicine as is happening with beginner-level code, where people learn to code using copilots, but when something breaks, it becomes harder for them to figure out what’s going on? If you’re a doctor, if you outsource some of your thinking to these bots, is that a problem?

So this isn’t just giving a black box answer. That’s why the sequential diagnosis part is so important, because you can watch the AI in real time, ask questions of the case history, get an answer, shape a new question, get an answer, present a new question, then ask for a different type of testing, get those results, interpret it, then give an answer.

The dialogic nature means that a human doctor can follow along and actually learn in a very transparent way. It’s almost like having an interpretability mechanism inside the black box of the LLM, because you can see its thinking process in real time. And in fact, you don’t just see the chain of thought, which is the inner monologue.

We’ve actually created five different types of agents that all have a debate, and we call this chain of debate. They negotiate with one another. They try to prioritize certain different aspects, like cost or efficiency. And the coordination of those different skill sets among the agents is actually what makes this so effective.
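One way to picture a chain of debate is as agents with different priorities scoring a proposed next step. Everything below is invented for illustration: the agent roles beyond cost and efficiency, the scoring weights, and the candidate tests are assumptions, since the article only names the priorities, not the mechanism.

```python
# Toy "chain of debate": agents with different priorities each score a
# candidate next step (e.g. which test to order), and the step with the
# best combined score wins. Roles and weights are invented; the article
# names cost and efficiency as real priorities the agents negotiate over.

AGENTS = {
    "accuracy":   lambda step: step["diagnostic_value"],
    "cost":       lambda step: -step["price"] / 100,   # cheaper is better
    "efficiency": lambda step: -step["hours"],         # faster is better
}

def debate(candidate_steps):
    """Each agent scores every candidate; the highest total score wins."""
    def total(step):
        return sum(score(step) for score in AGENTS.values())
    return max(candidate_steps, key=total)

steps = [
    {"name": "MRI",         "diagnostic_value": 8, "price": 1200, "hours": 24},
    {"name": "blood panel", "diagnostic_value": 6, "price": 80,   "hours": 4},
]
print(debate(steps)["name"])   # blood panel
```

In this toy version the cheap, fast blood panel beats the slightly more informative MRI once cost and turnaround time get a vote, which is the kind of trade-off the cost-optimization behavior described later depends on.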

How MAI-DxO Works and Why It Matters

A summary of Microsoft AI CEO Mustafa Suleyman’s key points about the diagnostic system.

  • Dual-bot architecture: one AI agent retrieves patient data while another interrogates the case and makes the diagnosis, simulating a clinical dialogue for better accuracy and transparency.
  • Agent orchestration: five specialized agents debate and collaborate during diagnosis, boosting accuracy about 10% beyond standalone LLMs and enabling more nuanced decision-making.
  • Performance vs. humans: 85.5% diagnostic accuracy compared with 20% for expert doctors, a fourfold improvement over average physician performance.
  • Educational value: doctors can follow the AI’s logic and learn from rare-case detection, enhancing medical training and clinical exposure.
  • Cost optimization: the AI selects the minimum necessary tests to reach an accurate diagnosis, reducing test burden, patient anxiety, and overall care costs.
  • Future use: still in the research phase, with potential for hospital integration and wide deployment across medical platforms and queries.

AI Can Detect Rare Diseases Doctors May Never See

Even if a doctor can watch this take place, it turns their role in diagnosis from active to somewhat passive. Is there some benefit in having the doctor work in that active phase vs. watching bots have a conversation?

I think that’s totally true. I just still think this is going to be an amazing education tool for doctors to actually learn about the breadth of cases they never would have encountered. For example, we actually ran the orchestrator last week on the most recent case study in the New England Journal of Medicine, and it correctly diagnosed a case that had only ever been seen 1,500 times in all of medical literature. It was such an obscure, long-tail disease that very few doctors are ever going to get the chance to see it. And so the ability to accurately detect these kinds of conditions in the wild, in production, I think, will massively outweigh the risk of doctors not being able to exercise their skills in the way that you describe.
