Assessing Emotionally-Driven Language in an Emergency Room Waiting Area Chatbot Using LLM-based Evaluation
Requirements
- Required: Successful participation in the course "Human-Computer Interaction"
- Desirable: Successful participation in the seminar on "Interactive Intelligent Systems" and the lecture on "Wissenschaftliches Arbeiten in der Informatik"
Contents
Patients in emergency room waiting areas are often in distress, and the language used by a chatbot in this context must balance empathy, clarity, and appropriateness to the situation. Emotionally-driven language that is poorly calibrated, whether too clinical and cold or inappropriately reassuring given the severity of symptoms, can negatively affect patient experience and trust. This thesis develops and evaluates an LLM-based evaluator designed to assess the appropriateness of emotionally-driven language produced by an emergency room waiting area chatbot, comparing its judgments against those of human raters to identify alignment gaps and design improvements. The inherently sociotechnical nature of this process is also examined, as deploying such an evaluator risks formalizing norms around emotional communication that were previously managed implicitly by human practitioners.
Procedure
The thesis begins with a review of existing evaluation metrics for conversational tone, empathy, and naturalness in AI-generated dialogue. A set of evaluation criteria specific to the emergency room waiting context is then defined in collaboration with relevant stakeholders — for instance, distinguishing appropriate reassurance from inappropriate minimization of symptoms — and used to construct an LLM evaluator prompt following rubric-based design principles. A sample of chatbot conversations is rated both by the LLM evaluator and by a group of human raters, including both laypeople and domain experts. Quantitative comparison of scores and qualitative thematic analysis of rater reasoning are used to identify where the LLM evaluator diverges from human judgment. Findings inform iterative improvements to the evaluator design and surface broader implications of automating emotional language assessment in high-stakes settings.
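The quantitative comparison step above can be sketched with a standard chance-corrected agreement measure such as Cohen's kappa. The following minimal Python example uses entirely hypothetical rubric scores (a 1 to 5 appropriateness scale and the specific ratings are illustrative assumptions, not data from the thesis):

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa: agreement between two raters,
    corrected for the agreement expected by chance."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    labels = set(ratings_a) | set(ratings_b)
    # Observed agreement: fraction of items on which both raters agree.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum((count_a[l] / n) * (count_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical rubric scores (1 = clearly inappropriate emotional tone,
# 5 = well calibrated) for ten chatbot turns, rated by the LLM evaluator
# and by one human rater.
llm_scores = [4, 3, 5, 2, 4, 4, 1, 3, 5, 2]
human_scores = [4, 3, 4, 2, 5, 4, 2, 3, 5, 2]

print(f"kappa = {cohens_kappa(llm_scores, human_scores):.3f}")
```

For ordinal rubric scores, a weighted kappa or a rank correlation (e.g., Spearman's rho) would additionally credit near-misses; per-item disagreements flagged here would then feed into the qualitative thematic analysis of rater reasoning.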
References
- Deshpande, K., Sirdeshmukh, V., Mols, J. B., Jin, L., Hernandez-Cardona, E.-Y., Lee, D., Kritz, J., Primack, W. E., Yue, S., & Xing, C. (2025). MultiChallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier LLMs. In Findings of the Association for Computational Linguistics: ACL 2025 (pp. 18632–18702). Association for Computational Linguistics.
- Genovese, A., Hegstrom, L., Prabha, S., Gomez-Cabello, C. A., Haider, S. A., Collaco, B., Wood, N. G., & Forte, A. J. (2026). Artificial authority: The promise and perils of LLM judges in healthcare. Bioengineering, 13(1), 108. https://doi.org/10.3390/bioengineering13010108
- Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., Wang, S., Zhang, K., Lin, Z., Zhang, B., Ni, L., Gao, W., Wang, Y., & Guo, J. (2026). A survey on LLM-as-a-judge. The Innovation, 101253. https://doi.org/10.1016/j.xinn.2025.101253
- Lin, B. Y., Deng, Y., Chandu, K., Brahman, F., Ravichander, A., Pyatkin, V., . . . Choi, Y. (2024). WildBench: Benchmarking LLMs with challenging tasks from real users in the wild. arXiv. https://doi.org/10.48550/arXiv.2406.04770
- Pan, Q., Ashktorab, Z., Desmond, M., Santillán Cooper, M., Johnson, J., Nair, R., Daly, E., & Geyer, W. (2024). Human-centered design recommendations for LLM-as-a-judge. In Proceedings of the 1st Human-Centered Large Language Modeling Workshop (pp. 16–29). Association for Computational Linguistics.
- Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). Judging LLM-as-a-judge with MT-bench and Chatbot Arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems (Article 2020, pp. 46595–46623). Curran Associates Inc.
