Why Our AI Diagnostician Outperforms Doctors

by Archynetys Health Desk

The Gist

  • AI beats doctors at diagnosis. Microsoft’s MAI-DxO system solved 85.5% of complex cases, a fourfold improvement over human doctors.
  • Orchestration is the secret weapon. The system uses multiple specialized AI agents debating and collaborating in real time to improve accuracy and efficiency.
  • Doctors still matter — just differently. Suleyman says AI can enhance education, reduce costs and anxiety, while humans retain the role of empathetic guide and judgment provider.

As AI models get commoditized, the value will be added in that final layer of orchestration, Microsoft AI CEO Mustafa Suleyman says.

Microsoft earlier this month announced it built an AI diagnostician that outperforms human doctors on complex cases.

The system, called MAI-DxO, uses two bots to sort through a patient’s medical history and solves 85.5% of patient cases when paired with OpenAI’s o3 model. The results are a major leap above the 20% average accuracy that human doctors achieved on the same cases, although the humans were restricted from searching the web or speaking with colleagues.

In an in-depth conversation shortly after Microsoft announced the results, Microsoft AI CEO Mustafa Suleyman shared how the AI diagnostician was able to 4X the performance of human doctors, what it means for the future of medicine, and whether this is a positive trend for society.

You can read our full conversation below, edited lightly for length and clarity.

AI-Driven Search Transforms Healthcare Queries

Alex Kantrowitz: Hi Mustafa, good to see you again. First off, Copilot and Bing now field 50 million medical queries per day. Is that good?

Mustafa Suleyman: It’s incredible, because we’re already making access to information super cheap and concise with just search engines. And now with Copilot, answers are much more conversational. You can tone them down so they suit your specific level of knowledge and expertise, and as a result, more and more people are asking Copilot and Bing health-related questions.

The queries range from a cancer issue that someone’s dealing with, to a death in the family, to a mental health issue, to just having a skin rash. And so the variety is huge, but obviously we’ve got a really important objective here to try and improve the quality of our consumer health products.

Do the health questions that come into chatbots look different from search?

Copilot’s answers tend to be more succinct and responsive to the style and tone of the individual person asking the question, and that tends to encourage people to ask a second follow-up question. So it turns it into more of a dialog or a consultation that you might end up having with your doctor. So they are quite different from normal search queries.

Inside Microsoft’s Two-Bot Diagnostic System

Speaking of dialogs, let’s discuss Microsoft’s new AI diagnostician system. It’s actually two bots, where one bot acts as a gatekeeper to all of a patient’s medical information, and the other asks questions about that history and makes a diagnosis. You’ve found the system performs better than humans in diagnosing disease.

That’s exactly right. We essentially wanted to simulate what it would be like for an AI to act as a diagnostician, to ask the patient a series of questions, to draw out their case history, go through a whole bunch of tests that they may have had — pathology and radiology — and then iteratively examine the information that it’s getting in order to improve the accuracy and reliability of its prediction about what your diagnosis actually is.

We actually use the New England Journal of Medicine case histories, hundreds of these past cases. One of these cases comes out every single week, and it’s like an ultimate crossword for doctors. They don’t see the answer until the following week. And it’s a big guessing game to go back through five to seven pages of very detailed history, and then try to figure out what the diagnosis actually turns out to be.
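The question-answer loop Suleyman describes can be sketched in a few lines of code. This is a toy illustration only, assuming nothing about Microsoft’s actual implementation: the class names, the `answer`/`run` methods, and the sample case data are all invented to show the shape of a gatekeeper bot that reveals information only when asked, paired with a diagnostician bot that accumulates findings.

```python
# Hypothetical sketch of the two-bot loop: a "gatekeeper" holds the case
# record and reveals only what is explicitly requested, while a
# "diagnostician" iterates question -> answer -> refined hypothesis.
# All names and the toy case data are illustrative, not Microsoft's API.

class Gatekeeper:
    """Holds the full patient record; answers only direct queries."""
    def __init__(self, record):
        self.record = record

    def answer(self, query):
        # Reveal a field only if the diagnostician asks for it by name.
        return self.record.get(query, "not available")

class Diagnostician:
    """Requests information one piece at a time and updates a hypothesis."""
    def __init__(self, queries):
        self.queries = queries      # the information it plans to request
        self.findings = {}

    def run(self, gatekeeper):
        for q in self.queries:
            self.findings[q] = gatekeeper.answer(q)
        # Stand-in for the model's reasoning step over accumulated findings.
        if self.findings.get("biopsy") == "granuloma":
            return "sarcoidosis"
        return "undetermined"

record = {"history": "chronic cough", "imaging": "hilar lymphadenopathy",
          "biopsy": "granuloma"}
bot = Diagnostician(["history", "imaging", "biopsy"])
print(bot.run(Gatekeeper(record)))   # sarcoidosis
```

The point of the structure is that the diagnostician never sees the whole record at once; each piece of evidence must be requested, which is what makes the resulting transcript auditable.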

The Orchestration Layer Is Where Value Emerges

I thought one of the benefits of generative AI is it can take in a lot of information and then come to answers — often in one shot. What’s the benefit of having multiple bots sort through it?

The big breakthrough of the last six months or so in AI is these thinking or reasoning models that can query other agents, or find other information sources at inference time, to improve the quality of their response. Rather than just giving the first best answer, the model goes and consults a range of different sources, and that improves the quality of the information it finally gets to. So we see that this orchestrator, which under the hood uses four different models from the major providers, can actually improve the accuracy of each of the individual models, and of all of them together, by a very significant degree, about 10% or so. So it’s a big step forward. And I think that as the AI models get commoditized, really, all the value will be added in that final layer of orchestration and product integration, and that’s what we’re seeing with this diagnostic orchestrator.
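The orchestration idea can be made concrete with a minimal sketch. The model callables and the majority-vote reconciliation below are assumptions for illustration; the article confirms only that the orchestrator consults four underlying models, not how it combines their answers.

```python
# Illustrative orchestration layer: the same case is sent to several
# underlying models, and the orchestrator reconciles their answers.
# Here reconciliation is a simple majority vote; the real system's
# aggregation logic is not public. The model functions are stand-ins.
from collections import Counter

def model_a(case): return "lyme disease"
def model_b(case): return "lyme disease"
def model_c(case): return "sarcoidosis"
def model_d(case): return "lyme disease"

def orchestrate(case, models):
    """Query every model, then return the most common diagnosis."""
    votes = Counter(m(case) for m in models)
    diagnosis, _ = votes.most_common(1)[0]
    return diagnosis

print(orchestrate("case #123", [model_a, model_b, model_c, model_d]))
# lyme disease
```

Even this crude ensemble shows why a layer above the models can beat any one of them: a single model’s error is outvoted as long as the others agree.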

MAI-DxO Beats Human Doctors by 4X in Accuracy

So it’s a 10% increase in accurately diagnosing on top of the standard LLMs?

Yes. And in fact, we actually benchmark that against human performance. So we had a whole bunch of expert physicians play this simulated diagnosis environment game, and they, on average, get about one in five right, so about 20%. Whereas our orchestrator gets about 85% accuracy, so it’s four times more accurate. In my career, I’ve never seen such a big gulf between human-level performance and the AI system’s performance.

Many years ago, I worked on lots of diagnoses for radiology and head and neck cancer and mammography, and the goal was just to take a single radiology exam and predict, does it have cancer? And that was the most we could do. Whereas now it’s actually producing a very detailed diagnosis, and doing that sequentially through this interactive dialog mechanism. And so that massively improves the accuracy.

Doctors Can Learn From AI’s Diagnostic Thinking

What if you have the same thing happen in medicine as is happening with beginner-level code, where people learn to code using copilots, but when something breaks, it becomes harder for them to figure out what’s going on? If you’re a doctor, if you outsource some of your thinking to these bots, is that a problem?

So this isn’t just giving a black box answer. That’s why the sequential diagnosis part is so important, because you can watch the AI in real time, ask questions of the case history, get an answer, shape a new question, get an answer, present a new question, then ask for a different type of testing, get those results, interpret it, then give an answer.

The dialogic nature means that a human doctor can follow along and actually learn in a very transparent way. It’s almost like having an interpretability mechanism inside the black box of the LLM, because you can see its thinking process in real time. And in fact, you don’t just see the chain of thought, which is the inner monologue.

We’ve actually created five different types of agents that all have a debate, and we call this chain of debate. They negotiate with one another. They try to prioritize certain different aspects, like cost or efficiency. And the coordination of those different skill sets among the agents is actually what makes this so effective.
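One way to picture a chain of debate is as agents with different priorities scoring a proposed next step. Everything below is invented for illustration: the agent roles beyond cost and efficiency, the scoring weights, and the candidate tests are assumptions, since the article only names the priorities, not the mechanism.

```python
# Toy "chain of debate": agents with different priorities each score a
# candidate next step (e.g. which test to order), and the step with the
# best combined score wins. Roles and weights are invented; the article
# names cost and efficiency as real priorities the agents negotiate over.

AGENTS = {
    "accuracy":   lambda step: step["diagnostic_value"],
    "cost":       lambda step: -step["price"] / 100,   # cheaper is better
    "efficiency": lambda step: -step["hours"],         # faster is better
}

def debate(candidate_steps):
    """Each agent scores every candidate; the highest total score wins."""
    def total(step):
        return sum(score(step) for score in AGENTS.values())
    return max(candidate_steps, key=total)

steps = [
    {"name": "MRI",         "diagnostic_value": 8, "price": 1200, "hours": 24},
    {"name": "blood panel", "diagnostic_value": 6, "price": 80,   "hours": 4},
]
print(debate(steps)["name"])   # blood panel
```

In this toy version the cheap, fast blood panel beats the slightly more informative MRI once cost and turnaround time get a vote, which is the kind of trade-off the cost-optimization behavior described later depends on.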

How MAI-DxO Works and Why It Matters

A summary of Microsoft AI CEO Mustafa Suleyman’s key points about the diagnostic system.

  • Dual-bot architecture: one AI agent retrieves patient data while another interrogates the case and makes the diagnosis, simulating a clinical dialogue for better accuracy and transparency.
  • Agent orchestration: five specialized agents debate and collaborate during diagnosis, boosting accuracy about 10% beyond standalone LLMs and enabling more nuanced decision-making.
  • Performance vs. humans: 85.5% diagnostic accuracy compared with 20% for expert doctors, a fourfold improvement over average physician performance.
  • Educational value: doctors can follow the AI’s logic and learn from rare-case detection, enhancing medical training and clinical exposure.
  • Cost optimization: the AI selects the minimum necessary tests to reach an accurate diagnosis, reducing test burden, patient anxiety, and overall care costs.
  • Future use: still in the research phase, with potential for hospital integration and wide deployment across medical platforms and queries.

AI Can Detect Rare Diseases Doctors May Never See

Even if a doctor can watch this take place, it turns their role in diagnosis from active to somewhat passive. Is there some benefit in having the doctor work in that active phase vs. watching bots have a conversation?

I think that’s totally true. I just still think this is going to be an amazing education tool for doctors to actually learn about the breadth of cases they never would have encountered. For example, we actually ran the orchestrator last week on the most recent case study in the New England Journal of Medicine, and it correctly diagnosed a case that had only ever been seen 1,500 times in all of medical literature. It was such an obscure, long-tail disease that very few doctors are ever going to get the chance to see it. And so the ability to accurately detect these kinds of conditions in the wild, in production, I think, will massively outweigh the risk of doctors not being able to exercise their skills in the way that you describe.
