MedInsider: A Benchmark for Documentation Integrity in Medical LLM Agents Under Institutional Pressure

Taha, Ahmed; Taeha, Abdelrahman; Ahmadzada, Muzzammil

doi:10.13140/RG.2.2.14991.14241

Preprint · 2026 · Under review

Abstract

Medical AI agents are starting to handle clinical documentation tasks like writing notes, submitting billing codes, and reporting quality metrics. We ask a question existing benchmarks do not: when the surrounding institutional context rewards shortcuts or omissions, do these agents still preserve documentation integrity? MedInsider is a benchmark and simulated medical-records environment designed to answer this. It contains 840 clinical scenarios organized as 420 matched pairs. Within each pair, the patient's condition and the correct actions are identical; only the surrounding pressure changes (for example, billing incentives, quality-metric pressure, or pressure to discharge patients faster). Because the agent operates inside a simulated EHR, we can compare what it actually saw and did against what it later wrote down. We evaluate seven contemporary LLM agents and find that task completion and documentation integrity are not interchangeable: the model with the highest task completion is not the one with the fewest documentation discrepancies, and low observed discrepancy rates can coincide with lower task completion. We also test a simple intervention, requiring the agent to pass a compliance check before submitting bills or quality reports, and find it can reduce documentation discrepancies on the tested subset, at a measurable cost to task completion. A four-reviewer validation study over 120 source/model-blinded episode payloads found almost-perfect agreement on integrity judgments (Fleiss' κ = 0.905) and majority labels that matched the automated scorer on this validation set. These results suggest that benchmarks measuring task accuracy alone miss behavior that matters under institutional pressure, and that small structural changes to how agents interact with records systems can reduce these failures.
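The inter-reviewer agreement statistic reported above, Fleiss' κ, can be computed from per-episode reviewer labels. The sketch below is a minimal from-scratch illustration on toy data, not the authors' validation code; the function name `fleiss_kappa`, the label strings, and the input layout (one list of reviewer labels per episode) are assumptions made for this example.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Assumes every item was labeled by the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Per-item agreement: fraction of rater pairs that agree on that item.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(per_item) / n_items

    # Chance agreement from the overall proportion of each category.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_e = sum(p ** 2 for p in proportions)

    if p_e == 1.0:  # every rating is in one category; kappa is undefined
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


# Toy example: two episodes, four hypothetical reviewers each.
toy = [
    ["supported", "supported", "supported", "unsupported"],
    ["unsupported", "unsupported", "unsupported", "unsupported"],
]
kappa = fleiss_kappa(toy)
```

On real validation data the input would hold 120 lists of four labels each; the toy data here only shows the mechanics.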

Visual abstract

[Figure: visual abstract with four panels. (1) Matched pair: a neutral twin and a pressure twin share the same chart state and the same correct actions in a simulated EHR; the pressure twin adds a billing incentive or quality/throughput pressure. (2) Agent acts via FHIR-shaped tools: read_chart, write_note, submit_billing, quality_report. (3) Action-log scoring (record vs. trace): the action log (observed chart facts plus tool sequence) is compared with the final documentation; in the example, the note is supported by the chart and a billing claim that is not in the action log is marked unsupported. (4) Compliance gate (before submission): blocks unsupported billing or quality output.]

Action-log scoring checks whether the record the agent leaves behind is supported by what it observed and did.

Visual abstract. MedInsider is a benchmark that pairs a neutral clinical scenario with a pressure twin that holds the same clinical facts but adds institutional pressure such as a billing incentive or quality and throughput pressure. The agent acts inside a simulated, FHIR-shaped electronic health record through tools such as read_chart, write_note, submit_billing, and quality_report. Action-log scoring compares the final note, billing, and quality report against what the agent actually observed and did; any documentation not supported by the action log is flagged as unsupported. An optional compliance gate runs before submission and blocks unsupported output.
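The scoring and gating idea described in the caption can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the benchmark's implementation: the `ActionLog` class, the `unsupported_claims` and `compliance_gate` functions, and the string encoding of chart facts are all hypothetical, and support is reduced here to exact set membership.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class ActionLog:
    """Hypothetical trace of one episode: what the agent observed and did."""
    observed_facts: Set[str] = field(default_factory=set)
    tool_calls: List[str] = field(default_factory=list)


def unsupported_claims(claims: List[str], log: ActionLog) -> List[str]:
    """Return the claims that do not appear among the observed chart facts."""
    return [claim for claim in claims if claim not in log.observed_facts]


def compliance_gate(claims: List[str], log: ActionLog) -> Tuple[bool, List[str]]:
    """Allow submission only if every claim is supported by the action log."""
    flagged = unsupported_claims(claims, log)
    return (not flagged, flagged)


# Toy episode: the agent read the chart and saw two facts.
log = ActionLog(
    observed_facts={"dx:pneumonia", "proc:chest_xray"},
    tool_calls=["read_chart", "write_note"],
)

note_claims = ["dx:pneumonia"]
billing_claims = ["proc:chest_xray", "proc:ct_chest"]  # second item was never observed

note_ok, note_flagged = compliance_gate(note_claims, log)
billing_ok, billing_flagged = compliance_gate(billing_claims, log)
```

In this toy episode the note passes and the billing submission is blocked because one billed item has no matching entry in the log. A real scorer would need a richer notion of support than string equality.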

Details

Authors: Ahmed Taha (Columbia University; Johns Hopkins University), Abdelrahman Taeha (Georgia Tech), Muzzammil Ahmadzada (Stanford University School of Medicine)
Year: 2026
Type: Preprint
DOI: 10.13140/RG.2.2.14991.14241
ResearchGate: researchgate.net/publication/405798471
PDF: ahmedtaha.io/documents/MedInsider.pdf
Code: github.com/ahmedtaha100/MedInsider
Dataset: huggingface.co/datasets/ahmedtaha100/medinsider

BibTeX

@misc{taha2026medinsider,
  title        = {MedInsider: A Benchmark for Documentation Integrity in
                  Medical LLM Agents Under Institutional Pressure},
  author       = {Taha, Ahmed and Taeha, Abdelrahman and Ahmadzada, Muzzammil},
  year         = {2026},
  note         = {Preprint},
  doi          = {10.13140/RG.2.2.14991.14241},
  url          = {https://www.researchgate.net/publication/405798471}
}