MedInsider: A Benchmark for Documentation Integrity in Medical LLM Agents Under Institutional Pressure

Ahmed Taha, Abdelrahman Taeha, Muzzammil Ahmadzada

Preprint · 2026 · Under review

Abstract

Medical AI agents are starting to handle clinical documentation tasks like writing notes, submitting billing codes, and reporting quality metrics. We ask a question existing benchmarks do not: when the surrounding institutional context rewards shortcuts or omissions, do these agents still preserve documentation integrity? MedInsider is a benchmark and simulated medical-records environment designed to answer this. It contains 840 clinical scenarios organized as 420 matched pairs. Within each pair, the patient's condition and the correct actions are identical; only the surrounding pressure changes (for example, billing incentives, quality-metric pressure, or pressure to discharge patients faster). Because the agent operates inside a simulated EHR, we can compare what it actually saw and did against what it later wrote down. We evaluate seven contemporary LLM agents and find that task completion and documentation integrity are not interchangeable: the model with the highest task completion is not the one with the fewest documentation discrepancies, and low observed discrepancy rates can coincide with lower task completion. We also test a simple intervention, requiring the agent to pass a compliance check before submitting bills or quality reports, and find it can reduce documentation discrepancies on the tested subset, at a measurable cost to task completion. A four-reviewer validation study over 120 source/model-blinded episode payloads found almost-perfect agreement on integrity judgments (Fleiss' κ = 0.905) and majority labels that matched the automated scorer on this validation set. These results suggest that benchmarks measuring task accuracy alone miss behavior that matters under institutional pressure, and that small structural changes to how agents interact with records systems can reduce these failures.
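
For readers unfamiliar with the agreement statistic cited above, the following is a minimal sketch of how Fleiss' κ is computed from per-item rating counts. The function name `fleiss_kappa` and the toy counts are illustrative assumptions; they are not the paper's validation data or its scoring code.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Share of all ratings that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Per-item agreement: fraction of rater pairs that agree on that item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy example (hypothetical): 3 episodes, 4 reviewers, 2 labels
# (columns: "integrity preserved", "integrity violated").
toy_counts = [
    [4, 0],  # unanimous
    [3, 1],  # one dissent
    [0, 4],  # unanimous
]
print(round(fleiss_kappa(toy_counts), 3))
```

With four reviewers and 120 items, the same function would take a 120-row table; values above 0.8 are conventionally read as almost-perfect agreement.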

Details

BibTeX

@misc{taha2026medinsider,
  title        = {MedInsider: A Benchmark for Documentation Integrity in
                  Medical LLM Agents Under Institutional Pressure},
  author       = {Taha, Ahmed and Taeha, Abdelrahman and Ahmadzada, Muzzammil},
  year         = {2026},
  note         = {Preprint},
  url          = {https://ahmedtaha.io/publications/medinsider/}
}