Evaluating Clinical Data Accuracy in an Emergency Room Waiting Area Chatbot Using LLM-as-a-Judge
Requirements
- Required: Successful participation in the course "Human-Computer Interaction"
- Desirable: Successful participation in the seminar on "Interactive Intelligent Systems" and the lecture on "Wissenschaftliches Arbeiten in der Informatik" (Scientific Working Methods in Computer Science)
Contents
Chatbots deployed in emergency room waiting areas can collect preliminary clinical data from patients — such as symptoms, pain levels, and medical history — prior to triage. Ensuring the accuracy and completeness of this collected data is critical, as errors or omissions could directly affect the quality of downstream care decisions. While human expert review remains the gold standard for evaluating such content, it is not scalable in real-time deployment settings. This thesis investigates the use of LLM-based evaluators as a scalable mechanism to assess the accuracy of data collected by an emergency room waiting area chatbot, and examines how well these evaluators align with the judgment of human experts through a systematic meta-evaluation.
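To ground what "accuracy and completeness" refer to, the minimal sketch below shows one plausible shape for a chatbot-collected intake record. The class name, fields, and values are illustrative assumptions, not a deployed schema; the actual record format would be defined during the thesis.

```python
# Hypothetical shape of a single chatbot-collected intake record.
from dataclasses import dataclass, field

@dataclass
class IntakeRecord:
    chief_complaint: str                  # main symptom in the patient's words
    pain_level: int                       # self-reported, 0-10 scale
    symptom_onset: str                    # e.g. "about two hours ago"
    medical_history: list[str] = field(default_factory=list)
    current_medications: list[str] = field(default_factory=list)

record = IntakeRecord(
    chief_complaint="chest pressure radiating to the left arm",
    pain_level=7,
    symptom_onset="about two hours ago",
    medical_history=["hypertension"],
    current_medications=["lisinopril"],
)
```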
Procedure
The thesis follows a mixed-methods design. First, a set of evaluation metrics for data accuracy and completeness is defined, drawing on rubric-based evaluation frameworks from the literature. An LLM evaluator is then developed and prompted to assess chatbot-collected patient data against these metrics. A meta-evaluation compares the LLM evaluator's scores with ratings provided by a small panel of human experts on the same set of chatbot interactions. Statistical analysis quantifies agreement and locates score discrepancies, while qualitative analysis of the experts' written reasoning surfaces evaluation criteria the LLM evaluator fails to capture. The findings are distilled into concrete recommendations for improving the evaluator design.
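As an illustration of the first two steps, the sketch below defines a toy rubric and prompts an LLM to score one interaction against it. It assumes the OpenAI Python client; the criterion names, prompt wording, and model choice are placeholders, and a real evaluator would use clinically validated rubrics.

```python
# Minimal sketch of a rubric-based LLM evaluator. Criteria and prompt wording
# are illustrative placeholders, not validated clinical rubrics.
import json
from openai import OpenAI

RUBRIC = {
    "symptom_accuracy": "Recorded symptoms match what the patient reported.",
    "pain_level_accuracy": "The recorded pain level matches the patient's statement.",
    "history_completeness": "All medical history mentioned by the patient is captured.",
}

client = OpenAI()

def judge(transcript: str, extracted_record: dict) -> dict:
    """Ask an LLM to score one chatbot interaction on each rubric criterion (1-5)."""
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in RUBRIC.items())
    prompt = (
        "You are evaluating data collected by an emergency room intake chatbot.\n"
        f"Conversation transcript:\n{transcript}\n\n"
        f"Extracted record:\n{json.dumps(extracted_record, indent=2)}\n\n"
        "Score each criterion from 1 (poor) to 5 (excellent) and return JSON "
        f"mapping criterion name to score:\n{criteria}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```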
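For the quantitative part of the meta-evaluation, agreement between LLM and expert scores can be summarized with standard statistics. A minimal sketch follows, assuming paired integer scores on a shared 1-5 scale, aligned by interaction; the score arrays are invented for illustration, and the thesis would select and justify its own agreement measures.

```python
# Sketch of the quantitative meta-evaluation: agreement between LLM evaluator
# scores and human expert ratings on the same set of interactions.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

llm_scores    = [5, 4, 2, 5, 3, 4, 1, 5]  # hypothetical LLM evaluator scores
expert_scores = [5, 3, 2, 4, 3, 5, 2, 5]  # hypothetical expert panel medians

# Quadratic-weighted kappa penalizes large disagreements more than small ones,
# which suits ordinal rubric scores.
kappa = cohen_kappa_score(llm_scores, expert_scores, weights="quadratic")

# Rank correlation checks whether the LLM orders interactions like the experts
# do, even if its absolute scores are systematically shifted.
rho, p_value = spearmanr(llm_scores, expert_scores)

print(f"Weighted kappa: {kappa:.2f}, Spearman rho: {rho:.2f} (p={p_value:.3f})")
```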
References
- Arora et al. (2025). HealthBench: Evaluating Large Language Models Towards Improved Human Health. arXiv:2505.08775.
- Gu et al. (2024). A Survey on LLM-as-a-Judge. The Innovation.
- Genovese et al. (2026). Artificial Authority: The Promise and Perils of LLM Judges in Healthcare. Bioengineering, 13(1), 108.
- Szymanski et al. (2025). Limitations of the LLM-as-a-Judge Approach for Evaluating LLM Outputs in Expert Knowledge Tasks. Proceedings of the 30th International Conference on Intelligent User Interfaces.
- Pan et al. (2024). Human-Centered Design Recommendations for LLM-as-a-Judge. Proceedings of the 1st Human-Centered Large Language Modeling Workshop.
- Diekmann et al. (2025). LLMs as Medical Safety Judges: Evaluating Alignment with Human Annotation in Patient-Facing QA. Proceedings of the 24th Workshop on Biomedical Language Processing.
