RAG Pipeline for Clinical Information Extraction from Nurse-Patient Transcripts

ai-technology · 2026-05-18

A modular retrieval-augmented generation (RAG) pipeline has been introduced for schema-constrained extraction of clinical information from nurse-patient dialogues. It addresses the MEDIQA-SYNUR task: converting unstructured narrative into structured observations that conform to a predefined schema with value-type constraints. The pipeline uses the training set as an exemplar corpus for retrieval and combines schema-constrained prompting (with either the full schema or a pruned candidate schema), deterministic schema-based postprocessing, and a second-pass audit. It is tested with two LLM backbones, Llama-4-Scout-17B-16E-Instruct and GPT-5.2, along with corresponding embedding models for retrieval. The goal is to reduce the documentation workload on clinicians, which prior studies indicate takes up a large share of the workday and cuts into time for direct patient care.
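
To make the idea of deterministic schema-based postprocessing concrete, here is a minimal Python sketch. It is not the authors' implementation: the schema entries, the value-type names (NUMERIC, SINGLE_SELECT, MULTI_SELECT, STRING), and the functions `normalize` and `postprocess` are all assumptions made for illustration. The sketch drops observations whose name is not in the schema and coerces or rejects values that violate the declared value type.

```python
from typing import Any, Dict, List, Optional

# Hypothetical schema: names, value types, and options are illustrative only
# and are not taken from the actual MEDIQA-SYNUR schema.
SCHEMA: Dict[str, Dict[str, Any]] = {
    "heart_rate": {"value_type": "NUMERIC"},
    "pain_level": {
        "value_type": "SINGLE_SELECT",
        "options": ["none", "mild", "moderate", "severe"],
    },
    "symptoms": {
        "value_type": "MULTI_SELECT",
        "options": ["nausea", "dizziness", "fatigue"],
    },
    "notes": {"value_type": "STRING"},
}


def normalize(obs: Dict[str, Any],
              schema: Dict[str, Dict[str, Any]] = SCHEMA) -> Optional[Dict[str, Any]]:
    """Return a schema-conformant observation, or None if it cannot be repaired."""
    spec = schema.get(obs.get("name"))
    if spec is None:
        return None  # name is not part of the schema
    vtype = spec["value_type"]
    value = obs.get("value")

    if vtype == "NUMERIC":
        try:
            value = float(value)
        except (TypeError, ValueError):
            return None
    elif vtype == "SINGLE_SELECT":
        # Case-insensitive match against the allowed options.
        lookup = {opt.lower(): opt for opt in spec["options"]}
        value = lookup.get(str(value).strip().lower())
        if value is None:
            return None
    elif vtype == "MULTI_SELECT":
        lookup = {opt.lower(): opt for opt in spec["options"]}
        items = value if isinstance(value, list) else [value]
        keys = [str(item).strip().lower() for item in items]
        value = [lookup[k] for k in keys if k in lookup]  # keep valid options only
        if not value:
            return None
    else:  # STRING
        value = str(value).strip() if value is not None else ""
        if not value:
            return None

    return {"name": obs["name"], "value_type": vtype, "value": value}


def postprocess(raw: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Apply normalize() to every raw model output and drop rejected items."""
    cleaned = []
    for obs in raw:
        fixed = normalize(obs)
        if fixed is not None:
            cleaned.append(fixed)
    return cleaned
```

Because every rule here is a fixed lookup or type coercion, the same raw model output always yields the same cleaned result, which is the property the word "deterministic" points at.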

Key facts

The pipeline is designed for MEDIQA-SYNUR, focusing on observation extraction from nurse-patient transcripts.
It uses retrieval-augmented generation with schema-constrained prompting.
Two LLM backbones are tested: Llama-4-Scout-17B-16E-Instruct and GPT-5.2.
The system normalizes narratives into a predefined schema with value-type constraints.
A second-pass audit is included for quality control.
The training set serves as an exemplar corpus for RAG.
Prior studies show clinicians spend large portions of their workday on documentation.
The pipeline aims to reduce documentation burden and increase direct patient care time.
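
The retrieval and prompting steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the published system: a bag-of-words cosine similarity stands in for the embedding models, the pruning rule (keep only schema entries that appear in the retrieved exemplars) is one possible heuristic and is not confirmed by the source, and the names `retrieve`, `pruned_schema`, and `build_prompt` are hypothetical.

```python
import json
import math
import re
from collections import Counter
from typing import Any, Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, corpus: List[Dict[str, Any]], k: int = 2) -> List[Dict[str, Any]]:
    """Return the k training exemplars most similar to the query transcript.

    Bag-of-words similarity is an assumption standing in for an embedding model.
    """
    q = Counter(tokenize(query))
    ranked = sorted(
        corpus,
        key=lambda ex: cosine(q, Counter(tokenize(ex["transcript"]))),
        reverse=True,
    )
    return ranked[:k]


def pruned_schema(exemplars: List[Dict[str, Any]],
                  schema: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only schema entries seen in the retrieved exemplars (assumed heuristic)."""
    seen = {obs["name"] for ex in exemplars for obs in ex["observations"]}
    return {name: spec for name, spec in schema.items() if name in seen}


def build_prompt(query: str, exemplars: List[Dict[str, Any]],
                 candidate_schema: Dict[str, Any]) -> str:
    """Assemble a few-shot prompt constrained to the candidate schema."""
    parts = [
        "Extract observations using only the schema below.",
        "Schema: " + json.dumps(candidate_schema, sort_keys=True),
    ]
    for ex in exemplars:
        parts.append("Transcript: " + ex["transcript"])
        parts.append("Observations: " + json.dumps(ex["observations"]))
    parts.append("Transcript: " + query)
    parts.append("Observations:")
    return "\n".join(parts)
```

Passing the full schema instead of the output of `pruned_schema` gives the full-schema prompting variant; the resulting prompt string would then be sent to whichever LLM backbone is in use.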
