Transforming Cataract Care with AI: Five Large Language Models Evaluated on Cataract-Related Questions
A new evaluation conducted at the Eye and ENT Hospital of Fudan University tested five popular large language models (LLMs) on how well they handle cataract-related queries. The study benchmarked the AI models against human responses using seven criteria and explored how performance varied across different clinical question types.
What was tested
– Models evaluated: ChatGPT-4, ChatGPT-4o, Gemini, Copilot, and the open-source Llama 3.5.
– Comparator: human-generated responses served as the benchmark.
– Seven metrics used: accuracy, completeness, conciseness, harmlessness, readability, stability, and self-correction capability (a minimal scoring sketch follows this list).
– Additional comparisons were made across question subgroups categorized by clinical topic type.
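To make the seven-criterion benchmarking concrete, below is a minimal sketch of how per-question rater scores could be rolled up into the mean ± standard deviation figures reported in the findings. The rubric scales, sample scores, and helper names are illustrative assumptions, not details taken from the study.

```python
# Illustrative sketch only: the study's actual rubric, scales, and rater workflow
# are not reproduced here; the metric names match the evaluation, but the data
# shapes, score ranges, and sample values below are assumed for demonstration.
from statistics import mean, stdev

METRICS = ["accuracy", "completeness", "conciseness", "harmlessness",
           "readability", "stability", "self_correction"]

# Hypothetical per-question rater scores for one model on three of the metrics.
rater_scores = {
    "accuracy":     [7, 6, 7, 7, 6],
    "completeness": [5, 4, 5, 4, 5],
    "harmlessness": [4, 4, 4, 4, 3],
}

def summarize(scores):
    """Report each metric as (mean, standard deviation), the format used in the results."""
    return {m: (round(mean(v), 2), round(stdev(v), 2)) for m, v in scores.items()}

print(summarize(rater_scores))
# {'accuracy': (6.6, 0.55), 'completeness': (4.6, 0.55), 'harmlessness': (3.8, 0.45)}
```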
Key findings
– Accuracy, completeness, and harmlessness: ChatGPT-4o stood out, achieving the best results in these areas (accuracy 6.70 ± 0.63; completeness 4.63 ± 0.63; harmlessness 3.97 ± 0.17).
– Conciseness: Gemini delivered the highest conciseness score (4.00 ± 0.14).
– Readability: ChatGPT-4o produced the hardest-to-read text, scoring lowest at 26.02 ± 10.78 (higher scores indicate easier reading). Copilot scored 40.26 ± 14.58, still below the human benchmark of 51.54 ± 13.71 (see the readability sketch after this list).
– Stability and self-correction: Copilot produced the most reproducible responses across repeated queries, and all models demonstrated robust self-correction when prompted.
– Overall performance relative to humans: Across subgroups of questions, all LLMs performed comparably to or better than human responses.
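The readability figures above behave like a Flesch Reading Ease-style score, where higher values mean easier text. Assuming an instrument of that kind (the study's exact readability tooling is not specified here), the sketch below shows how such a score is computed and why jargon-heavy clinical phrasing drags it down; the sample sentences and the crude syllable counter are illustrative only.

```python
# Minimal sketch of a Flesch Reading Ease-style readability score (higher = easier).
# This is an assumption about the kind of metric behind the numbers above,
# not the study's actual tooling.
import re

def count_syllables(word: str) -> int:
    """Crude syllable estimate: count runs of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Flesch Reading Ease: 206.835 - 1.015*(words per sentence) - 84.6*(syllables per word)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

patient_friendly = "A cataract makes the lens of your eye cloudy. Surgery can fix it."
technical = ("Phacoemulsification with intraocular lens implantation remains the "
             "definitive intervention for visually significant cataract.")

print(round(flesch_reading_ease(patient_friendly), 1))  # higher score: easier to read
print(round(flesch_reading_ease(technical), 1))         # lower score: harder to read
```

The gap between the two sample sentences mirrors the gap reported above between human responses and the more technically worded model outputs.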
What this means for cataract care
– The study suggests LLMs, especially ChatGPT-4o, can provide accurate and comprehensive guidance on common cataract-related issues, potentially supporting patient education and clinical decision-making.
– The variability in readability highlights a need to tailor AI-provided information to patient literacy, ensuring explanations are accessible without sacrificing accuracy.
– While AI shows promise, clinicians and patients should continue to apply critical judgment and verify AI outputs against established medical guidance.
Implications and value-added commentary
– Practical use: AI chat assistants could be integrated into patient education workflows, helping to answer routine questions, summarize risks and treatment options, and triage inquiries before a patient-clinician encounter.
– Safety and oversight: The results underscore the importance of including AI tools as adjuncts rather than replacements for professional medical advice, with ongoing monitoring and updates as models evolve.
– Accessibility and customization: There is potential to optimize AI explanations for different literacy levels and languages, improving comprehension for a diverse patient population (a prompt-tailoring sketch follows this list).
– Future directions: Further real-world testing and long-term evaluation are needed to assess how AI recommendations affect patient outcomes, decision satisfaction, and clinical workflow efficiency.
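One concrete way to act on the accessibility point above is to ask the model explicitly for a target reading level. The sketch below only assembles the prompt messages; the function name, wording, and default reading level are hypothetical, and the messages would still need to be sent to whichever chat model a clinic adopts, with outputs reviewed by staff.

```python
# Hypothetical prompt template for tailoring cataract explanations to patient literacy.
# Nothing here calls a specific LLM API; the returned messages would be passed to
# whatever chat model is in use, and outputs should be reviewed by clinicians.
def build_patient_education_prompt(question: str, reading_level: str = "6th grade"):
    """Return chat messages requesting an accessible, safety-conscious answer."""
    system = (
        "You are assisting with patient education about cataracts. "
        f"Answer at a {reading_level} reading level, avoid jargon, "
        "and remind the patient to confirm decisions with their ophthalmologist."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

messages = build_patient_education_prompt("Will I need glasses after cataract surgery?")
for m in messages:
    print(f"{m['role']}: {m['content']}")
```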
Bottom line
Five leading LLMs show meaningful potential to address cataract-related questions, with ChatGPT-4o emerging as the strongest overall performer in accuracy, completeness, and harmlessness. While these findings are encouraging, careful integration with clinical practice and attention to readability are essential to maximize benefits and minimize risks.
Summary of takeaways
– ChatGPT-4o leads in key quality metrics for cataract information.
– Gemini is the most concise; AI-generated explanations tend to be harder to read than human writing.
– Copilot is the most stable across repeated queries and, like the other models, self-corrects reliably, though its readability still falls short of human responses.
– Across question types, AI responses are broadly on par with or better than human responses, signaling potential for AI-assisted patient education in ophthalmology.
– Ongoing evaluation, safety safeguards, and patient-centered presentation are critical as these tools move toward clinical use.