Microsoft says LLMs degrade documents during long workflows

What does that mean for the IT pros building agents?

May 19, 2026

If you want the job done right—and that job is to repeatedly edit a document—you might have to do it yourself.

Microsoft Research found that LLMs have a way of changing and corrupting a doc during long workflows.

Using a benchmark they’re calling “Delegate-52,” Microsoft’s team tested 19 models across 52 domains, including coding, accounting, and music notation. The tests involved up to 20 interactions, including repeatedly editing, transforming, and recreating a file. And bad news: even top frontier systems, like Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4, corrupted data about one-quarter of the time, on average.

The test. In its research, Microsoft provided an example of an accounting document from fictional nonprofit “Hack Club.” The researchers split the test ledger into separate files by expense category, then merged them back into one file, while regularly throwing in “distractor” topical documents to simulate a work environment. They computed reconstruction scores after 10 round trips of editing and returning.
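
For readers who want to see the shape of that split-and-merge round trip, here is a minimal Python sketch. It is not Microsoft's benchmark code: the ledger rows, the function names, and the scoring rule (the share of rows that survive unchanged) are all assumptions made for illustration, and the splitting and merging here is done by ordinary code rather than by a model.

```python
from collections import Counter, defaultdict

# Hypothetical ledger rows: (date, category, amount_in_cents).
# Illustrative data only, not taken from the research.
LEDGER = [
    ("2026-01-03", "travel", 12000),
    ("2026-01-05", "food", 4550),
    ("2026-01-09", "travel", 8025),
    ("2026-01-12", "software", 9900),
]


def split_by_category(rows):
    """Group ledger rows into one list per expense category."""
    files = defaultdict(list)
    for row in rows:
        files[row[1]].append(row)
    return dict(files)


def merge(files):
    """Recombine the per-category lists into one ledger sorted by date."""
    return sorted(row for rows in files.values() for row in rows)


def reconstruction_score(original, rebuilt):
    """Assumed metric: rows shared by both versions, divided by the longer length."""
    if not original and not rebuilt:
        return 1.0
    shared = sum((Counter(original) & Counter(rebuilt)).values())
    return shared / max(len(original), len(rebuilt))


# A lossless round trip, then a copy with one amount altered to mimic corruption.
clean = merge(split_by_category(LEDGER))
corrupted = [("2026-01-03", "travel", 21000)] + clean[1:]

print(reconstruction_score(LEDGER, clean))      # 1.0
print(reconstruction_score(LEDGER, corrupted))  # 0.75
```

A deterministic script always scores 1.0 on the clean path. The point of a check like this is to have a baseline to compare against when a model performs the split and merge instead.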

“Every model sees its performance degrade over the course of interaction, with average degradations of 50% across tested models by the end of simulation,” the researchers wrote.

The weaker models’ degradation was generally due to deleted content, while frontier models’ degradation was more attributable to corrupted content, the report revealed. Also, models perform better in programmatic, repetition-filled domains like Python, compared to natural-language domains with rich, unrepeated vocabulary.

Long horizon. The paper studied a specific interaction pattern called long-horizon delegated work, where AI systems repeatedly modify document artifacts over many steps with limited human or agentic verification between interactions, Philippe Laban, senior researcher in the AI Interaction and Learning (AIIL) group at Microsoft Research, shared with IT Brew in an email.

“The corruption we measure captures errors that emerge through chained transformations across many sequential edits and is designed to stress-test failure modes in this iterative process,” Laban wrote. “Importantly, they do not reflect business outcomes or user satisfaction.”

Get to workflow. As organizations experiment with agentic workflows, it’s not too difficult to imagine a document moving from one AI agent to another, transforming along the way.

For example, Travis Rehl, CTO at cloud consultancy and AWS partner Innovative Solutions, spoke to us in March 2026 about a “scoping” tool that allows sales teams to lay out customer requirements based on previous successful scopes. The tool, DarcyIQ, generates a first pass of a services contract, which someone then reviews and changes.

Shanti Greene, head of data science and AI innovation at enterprise AI solutions company AnswerRocket, imagines a multi-agent research scenario. An “orchestrator” might take a user’s query and fire it off to a web-search agent, a read agent, a summarize agent, and a writer agent. There are potential pitfalls in that approach.

“Something that I’ve noticed is that with context windows, when you have a full context window, the model will tend to pick [text] at the beginning of it or at the end of it, and the middle gets overlooked. And if you do that at scale, and you go agent to agent, and it keeps happening, at some point, I think you just lose a chunk of the document,” Greene said.

To ensure document integrity through an agentic workflow, Greene recommends IT pros configure their agents or orchestration platforms to create a plaintext or markdown file of their outputs at each stage; this could contain, for example, what they did and the instructions for the next agent. This allows traceability. “The next agent can pick up fresh data, and you can go check that,” he said.
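
One way to picture that per-stage file is the Python sketch below. The function name, the file-naming scheme, and the markdown headings are hypothetical, and the SHA-256 fingerprint and word count are additions of this sketch rather than something Greene described. It assumes each agent hands back its output as a plain string.

```python
import hashlib
import tempfile
from pathlib import Path


def write_stage_note(run_dir, stage, agent, summary, next_instructions, output_text):
    """Write one markdown note for a pipeline stage and return its path."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    # Fingerprint the stage output so a reviewer can confirm later what was handed off.
    digest = hashlib.sha256(output_text.encode("utf-8")).hexdigest()
    note = "\n".join([
        f"# Stage {stage}: {agent}",
        "",
        "## What this agent did",
        summary,
        "",
        "## Instructions for the next agent",
        next_instructions,
        "",
        "## Output fingerprint",
        f"sha256: {digest}",
        f"words: {len(output_text.split())}",
        "",
    ])
    path = run_dir / f"stage_{stage:02d}_{agent}.md"
    path.write_text(note, encoding="utf-8")
    return path


if __name__ == "__main__":
    # Demo run in a throwaway directory; a real pipeline would keep the notes.
    with tempfile.TemporaryDirectory() as tmp:
        note_path = write_stage_note(
            tmp, 1, "summarize",
            "Condensed three source articles into one brief.",
            "Draft a 200-word summary from the brief.",
            "alpha beta gamma",
        )
        print(note_path.read_text(encoding="utf-8"))
```

Because each note is a plain file, a person or a downstream agent can open any stage and compare what was promised with what was actually produced.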

He also recommended word-count checks for documents that aren’t expected to change.
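
A word-count check can be only a few lines long. The Python sketch below is one possible version. The function names and the tolerance parameter are assumptions, and it counts words by splitting on whitespace, which is a simplification.

```python
def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def word_count_drift(before, after):
    """Return the relative change in word count between two versions."""
    base = word_count(before)
    if base == 0:
        return 0.0 if word_count(after) == 0 else float("inf")
    return abs(word_count(after) - base) / base


def passes_word_count_check(before, after, tolerance=0.0):
    """True when the word count changed by no more than the allowed fraction."""
    return word_count_drift(before, after) <= tolerance


original = "one two three four"
edited = "one two four"
print(word_count_drift(original, edited))         # 0.25
print(passes_word_count_check(original, edited))  # False
```

A check like this flags deleted text, the failure the research associated with weaker models. It will not notice corrupted content that keeps the same number of words, so it works best alongside the other reviews described here.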

For software development, Jess Lampe, global lead technologist of digital transformation company Launch Consulting Group, recommends tried-and-true software checks like unit tests, linting, and human review. For natural-language documents, such as marketing materials that need to match brand guidelines, Lampe recommends having subject matter experts review the text at critical stages of the process.

That goes for Rehl’s scoping tool, as well. Even the best models get it wrong, Rehl told us in March. “You have to have critical thinking as to what is directionally correct versus what is correct,” he said.

About the author

Billy Hurley

Billy Hurley has been a reporter with IT Brew since 2022. He writes stories about cybersecurity threats, AI developments, and IT strategies.
