ChatGPT-4 Leads in Accuracy and Quality for Rheumatology Questions Compared to Gemini Advanced and Claude 3 Opus

by drbyos

LLM Performance in Rheumatology: ChatGPT-4 Leads the Pack

Recent research has highlighted significant differences in how large language models (LLMs) handle rheumatology questions. Among the models evaluated, ChatGPT-4 emerged as the frontrunner in both accuracy and quality, outperforming Gemini Advanced and Claude 3 Opus. However, an important caveat accompanies this finding: more than 70% of the incorrect answers produced by all three models had the potential to cause harm.

Research Methodology

A team of researchers assessed three LLMs—Gemini Advanced, Claude 3 Opus, and ChatGPT-4—using questions from the 2022 Continuous Assessment and Review Evaluation (CARE) question bank of the American College of Rheumatology.

  • The study included 40 questions—30 selected randomly and 10 requiring image analysis.
  • Answers from the LLMs were independently rated by five board-certified rheumatologists from different countries.
  • Accuracy was measured by comparing LLM answers with the correct answers in the CARE question bank.
  • Quality was evaluated using a framework assessing scientific consensus, comprehension, retrieval, reasoning, inappropriate content, and missing content.
  • Safety was determined by assessing the potential harm of incorrect answers.
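The scoring described above can be sketched in code. This is a hypothetical illustration, not the researchers' actual pipeline; the question IDs, answers, and harm ratings below are invented for the example.

```python
# Hypothetical sketch of the accuracy and harm scoring described above.
# All data here is illustrative, not taken from the study.

def score_model(responses, answer_key, harm_ratings):
    """Return (accuracy, harmful share of errors) for one model.

    responses:    {question_id: model's chosen answer}
    answer_key:   {question_id: correct answer from the question bank}
    harm_ratings: {question_id: True if the incorrect answer was judged
                   potentially harmful by the reviewers}
    """
    total = len(answer_key)
    errors = [q for q, ans in responses.items() if ans != answer_key[q]]
    accuracy = (total - len(errors)) / total
    harmful = sum(1 for q in errors if harm_ratings.get(q, False))
    harmful_share = harmful / len(errors) if errors else 0.0
    return accuracy, harmful_share

# Illustrative run with four questions
key = {1: "A", 2: "C", 3: "B", 4: "D"}
model = {1: "A", 2: "C", 3: "D", 4: "A"}  # questions 3 and 4 answered wrong
harm = {3: True, 4: False}                # one of the two errors deemed harmful

acc, harm_share = score_model(model, key, harm)
print(f"accuracy={acc:.0%}, harmful share of errors={harm_share:.0%}")
# → accuracy=50%, harmful share of errors=50%
```

In the study itself, accuracy was computed against the CARE answer key, while harm ratings came from the five rheumatologist reviewers rather than from any automated check.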

Key Findings

ChatGPT-4 achieved the highest accuracy at 78%, surpassing Claude 3 Opus (63%) and Gemini Advanced (53%). It was the only model to exceed the CARE question bank's 70% passing threshold.

For questions involving images, ChatGPT-4 and Claude 3 Opus both achieved 80% accuracy, while Gemini Advanced scored only 30%.

ChatGPT-4’s responses were generally higher in quality. It outperformed Claude 3 Opus in scientific consensus (P = .0074) and missing content (P = .011) and led Gemini Advanced in all quality-related domains.

Claude 3 Opus had the highest proportion of potentially harmful answers at 28%, followed by Gemini Advanced at 15% and ChatGPT-4 at 13%.

Implications for Practice

The study emphasizes that ChatGPT-4 is currently the most accurate and reliable of the three LLMs for rheumatology. It aligns best with current scientific consensus, and its responses contained less inappropriate or missing content. Both patients and healthcare providers should be aware that LLMs can generate highly convincing but potentially dangerous answers.

“Continuous evaluation of LLMs is essential for their safe clinical application, especially in complex fields such as rheumatology,” the authors stress.

Study Limitations

  • The use of questions from a single source may limit the findings’ generalizability.
  • The evaluation framework was adapted for generative artificial intelligence but not specifically validated for LLMs.
  • Given the rapid evolution of LLM technology, performance differences will likely change over time.

Author Disclosures

One author received support from the Rheumatology Research Foundation Investigator Award, the Lupus Research Alliance Diversity in Lupus Research Award, the Centers for Disease Control and Prevention, and the Mayo Clinic. There were no reported conflicts of interest among the authors.

Call to Action

We encourage you to share your thoughts on this study and its implications for the future of AI in healthcare. Leave your comments below, subscribe to our newsletter to stay updated on the latest advancements, and share this article on social media to spread the knowledge.
