Development and evaluation of an agentic LLM based RAG framework for evidence-based patient education

Mowafa Househ; AlHasan AlSammarraie; Ali Al-Saifi; Hassan Kamhia; Mohamed Aboagla

doi:10.1136/bmjhci-2025-101570

BMJ Health & Care Informatics (Jul 2025)

Affiliations

Mowafa Househ: Hamad Bin Khalifa University, Doha, Qatar
AlHasan AlSammarraie: Hamad Bin Khalifa University College of Science and Engineering, Doha, Qatar
Ali Al-Saifi: Applab, Doha, Qatar
Hassan Kamhia: Wareed Medical Content Foundation, Riyadh, Saudi Arabia
Mohamed Aboagla: Department of Medical Oncology - National Cancer Center and Cancer Research, Hamad Medical Corporation, Doha, Qatar

DOI: https://doi.org/10.1136/bmjhci-2025-101570
Journal volume & issue: Vol. 32, no. 1

Abstract

Objectives To develop and evaluate an agentic retrieval-augmented generation (ARAG) framework using open-source large language models (LLMs) for generating evidence-based Arabic patient education materials (PEMs), and to assess the LLMs' capabilities as validation agents tasked with blocking harmful content.

Methods We selected 12 LLMs and applied four experimental setups (base, base+prompt engineering, ARAG, and ARAG+prompt engineering). PEM generation quality was assessed via a two-stage evaluation (automated LLM, then expert review) using five metrics (accuracy, readability, comprehensiveness, appropriateness and safety) against ground truth. Validation agent (VA) performance was evaluated separately using a harmful/safe PEM dataset, measuring blocking accuracy.

Results ARAG-enabled setups yielded the best generation performance for 10/12 LLMs. Arabic-focused models occupied the top 9 ranks. The expert evaluation ranking mirrored the automated ranking. AceGPT-v2-32B with ARAG and prompt engineering (setup 4) was confirmed as the highest-performing configuration. VA accuracy correlated strongly with model size; only models with ≥27B parameters achieved >0.80 accuracy. Fanar-7B performed well in generation but poorly as a VA.

Discussion Arabic-centred models demonstrated advantages for the Arabic PEM generation task. ARAG enhanced generation quality, although context limits impacted large-context models. The validation task highlighted model size as critical for reliable performance.

Conclusion ARAG noticeably improves Arabic PEM generation, particularly with Arabic-centred models like AceGPT-v2-32B. Larger models appear necessary for reliable harmful content validation. Automated evaluation showed potential for ranking systems, aligning with expert judgement for top performers.
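
The abstract reports VA performance as blocking accuracy on a harmful/safe PEM dataset but does not define the metric. A minimal sketch follows, assuming blocking accuracy is the share of PEMs for which the agent's block/allow decision matches the harmful/safe label. The function name `blocking_accuracy` and the sample labels are hypothetical illustrations, not the authors' code or data.

```python
from typing import Sequence


def blocking_accuracy(is_harmful: Sequence[bool], was_blocked: Sequence[bool]) -> float:
    """Fraction of PEMs where the block/allow decision matches the label.

    Assumption (not stated in the source): a decision is correct when a
    harmful PEM is blocked or a safe PEM is allowed through.
    """
    if len(is_harmful) != len(was_blocked):
        raise ValueError("labels and decisions must have the same length")
    if len(is_harmful) == 0:
        raise ValueError("at least one PEM is required")
    correct = sum(label == decision for label, decision in zip(is_harmful, was_blocked))
    return correct / len(is_harmful)


# Illustrative data only (not from the study): five PEMs, one harmful PEM missed.
labels = [True, True, False, False, True]      # True = harmful
decisions = [True, False, False, False, True]  # True = blocked
score = blocking_accuracy(labels, decisions)
print(f"{score:.2f}")  # 4 of 5 decisions match, so this prints 0.80
```

Under this assumed definition, a missed harmful PEM and a wrongly blocked safe PEM count equally as errors; an evaluation that weighs the two differently would need separate rates for each.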

Published in BMJ Health & Care Informatics

ISSN: 2632-1009 (Online)
Publisher: BMJ Publishing Group
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: https://informatics.bmj.com/