Leveraging Retrieval-Augmented Language Models for Early Diagnosis in Resource-Constrained Healthcare

Publisher

The British University in Dubai (BUiD)

Abstract

Large language models (LLMs) face notable challenges in sensitive domains such as healthcare, particularly where data is limited, highly confidential, and subject to strict regulatory frameworks. These challenges are especially pronounced in rare disease diagnosis, where current approaches often rely on proprietary decoder-based models that are prone to hallucination, that is, the generation of inaccurate or misleading outputs. In addition, the substantial computational demands of LLMs limit their feasibility in resource-constrained or low-income settings. To address these challenges, this research proposes a framework that maximizes the diagnostic utility of small, early-collected clinical datasets while leveraging open-source pre-trained medical language models. The framework introduces Retrieval-Augmented Encoding (RAE), a technique that improves the diagnostic performance of affordable language models with classification heads: it retrieves similar clinical notes to enrich the encoding of the input and to support inference in diagnostic tasks. The framework also employs Retrieval-Augmented Generation (RAG) to expand the training dataset through paraphrasing for fine-tuning. A case study on appendicitis diagnosis was conducted using 2,400 unstructured abdominal disease notes, examining whether early-stage clinical notes such as the History of Present Illness (HPI) are sufficient for diagnosis. Results show that the proposed framework achieved diagnostic accuracy and precision exceeding 93.3% with the HPI notes alone, highlighting their potential for early and efficient diagnosis without reliance on additional unstructured notes such as the Physical Examination. This research shows how locally deployable language models can be used effectively in resource-constrained healthcare environments to support early and accurate diagnoses, particularly for rare diseases and critical clinical decisions.
Keywords: large language models, small dataset, appendicitis, diagnosis, retrieval-augmented generation, data augmentation, clinical notes, history of present illness, BERT models, rare disease, healthcare informatics
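
The abstract describes Retrieval-Augmented Encoding only at a high level, so the following is a minimal illustrative sketch of the general idea and not the author's implementation. It assumes a toy bag-of-words vector in place of a pre-trained medical encoder, cosine similarity for retrieval, and concatenation of the query vector with the mean of its nearest neighbours as the "enriched" encoding. All function names and the three sample notes are hypothetical and invented for illustration; none come from the study's dataset.

```python
import math
from collections import Counter


def embed(text, vocab):
    """Toy bag-of-words vector; an assumed stand-in for a pre-trained encoder."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_vec, store_vecs, k=2):
    """Return indices of the k stored notes most similar to the query (k >= 1)."""
    ranked = sorted(
        range(len(store_vecs)),
        key=lambda i: cosine(query_vec, store_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def augmented_encoding(query_vec, store_vecs, k=2):
    """Concatenate the query vector with the mean vector of its k nearest notes.

    The concatenation scheme is an assumption made for this sketch; the
    resulting features would be passed to a classification head.
    """
    idx = retrieve(query_vec, store_vecs, k)
    dim = len(query_vec)
    mean = [sum(store_vecs[i][d] for i in idx) / len(idx) for d in range(dim)]
    return query_vec + mean, idx


# Hypothetical, made-up notes used only to exercise the functions.
notes = [
    "right lower quadrant pain with nausea and fever",
    "epigastric burning pain after meals",
    "migrating pain to right lower quadrant with vomiting",
]
vocab = sorted({w for n in notes for w in n.lower().split()})
store = [embed(n, vocab) for n in notes]

query = embed("pain in right lower quadrant and fever", vocab)
features, neighbours = augmented_encoding(query, store, k=2)
```

Here `neighbours` holds the indices of the retrieved notes and `features` is twice the length of the vocabulary. In a realistic setting the bag-of-words step would be replaced by embeddings from a locally deployed pre-trained model, and the retrieval store would hold the labelled training notes.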
