If you want the job done right, and that job is to repeatedly edit a document, you might have to do it yourself. Microsoft Research found that LLMs tend to alter and corrupt a document over long workflows. Using a benchmark they’re calling “Delegate-52,” Microsoft’s team tested 19 models across 52 domains, including coding, accounting, and music notation. The tests involved up to 20 interactions, including repeatedly editing, transforming, and recreating a file. And bad news: even top frontier systems, like Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4, corrupted data about one-quarter of the time, on average.

The test. In its research, Microsoft provided an example of an accounting document from the fictional nonprofit “Hack Club.” The researchers split the test ledger into separate files by expense category, then merged them back into one file, while regularly throwing in “distractor” topical documents to simulate a work environment. They computed reconstruction scores after 10 round trips of editing and returning.

How to keep document integrity intact during agentic workflows.—BH
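For readers who want a feel for the split-and-merge round trip described in the test, here is a minimal Python sketch. It is not Microsoft's benchmark code: the ledger rows, the function names (`split_by_category`, `merge`, `reconstruction_score`, `lossy_merge`), and the use of a line-level similarity ratio as the "reconstruction score" are all assumptions made for illustration. The lossy step stands in for a model that quietly drops a row.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical ledger rows in "category,date,amount" form (made-up data).
ledger = [
    "travel,2024-01-05,120.00",
    "food,2024-01-06,35.50",
    "travel,2024-01-09,80.25",
    "software,2024-01-10,15.00",
]


def split_by_category(lines):
    """Group ledger lines by their first comma-separated field."""
    groups = defaultdict(list)
    for line in lines:
        category = line.split(",", 1)[0]
        groups[category].append(line)
    return dict(groups)


def merge(groups):
    """Flatten the per-category groups back into one sorted list of lines."""
    return sorted(line for lines in groups.values() for line in lines)


def lossy_merge(groups):
    """Stand-in for a faulty editor: drops the last row of the 'travel' file."""
    damaged = {cat: list(rows) for cat, rows in groups.items()}
    if damaged.get("travel"):
        damaged["travel"].pop()
    return merge(damaged)


def reconstruction_score(original, rebuilt):
    """Assumed metric: line-level similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, original, rebuilt).ratio()


reference = sorted(ledger)

# Ten lossless round trips: the document should come back unchanged.
current = reference
for _ in range(10):
    current = merge(split_by_category(current))
clean_score = reconstruction_score(reference, current)

# A single lossy round trip: one row goes missing and the score drops.
damaged_score = reconstruction_score(
    reference, lossy_merge(split_by_category(reference))
)

print(f"lossless: {clean_score:.3f}, lossy: {damaged_score:.3f}")
```

A deterministic script like this never loses a row, so its score stays at 1.0 across round trips. The benchmark's finding is that models asked to do the same job do not hold that line.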