BMJ Health & Care Informatics (Jul 2025)

Development and evaluation of an agentic LLM based RAG framework for evidence-based patient education

  • Mowafa Househ,
  • AlHasan AlSammarraie,
  • Ali Al-Saifi,
  • Hassan Kamhia,
  • Mohamed Aboagla

DOI
https://doi.org/10.1136/bmjhci-2025-101570
Journal volume & issue
Vol. 32, no. 1

Abstract

Read online

Objectives To develop and evaluate an agentic retrieval augmented generation (ARAG) framework using open-source large language models (LLMs) for generating evidence-based Arabic patient education materials (PEMs) and assess the LLMs capabilities as validation agents tasked with blocking harmful content.Methods We selected 12 LLMs and applied four experimental setups (base, base+prompt engineering, ARAG, and ARAG+prompt engineering). PEM generation quality was assessed via two-stage evaluation (automated LLM, then expert review) using 5 metrics (accuracy, readability, comprehensiveness, appropriateness and safety) against ground truth. Validation agent (VA) performance was evaluated separately using a harmful/safe PEM dataset, measuring blocking accuracy.Results ARAG-enabled setups yielded the best generation performance for 10/12 LLMs. Arabic-focused models occupied the top 9 ranks. Expert evaluation ranking mirrored the automated ranking. AceGPT-v2-32B with ARAG and prompt engineering (setup 4) was confirmed highest-performing. VA accuracy correlated strongly with model size; only models ≥27B parameters achieved >0.80 accuracy. Fanar-7B performed well in generation but poorly as a VA.Discussion Arabic-centred models demonstrated advantages for the Arabic PEM generation task. ARAG enhanced generation quality, although context limits impacted large-context models. The validation task highlighted model size as critical for reliable performance.Conclusion ARAG noticeably improves Arabic PEM generation, particularly with Arabic-centred models like AceGPT-v2-32B. Larger models appear necessary for reliable harmful content validation. Automated evaluation showed potential for ranking systems, aligning with expert judgement for top performers.