A study from Microsoft Research has surfaced a risk most organizations have not yet built into their workflows: using a large language model to edit documents over time means accepting a silent, cumulative degradation of content. The damage from AI rewriting is not a dramatic, visible collapse; it accumulates quietly, and nobody catches it.
What the Research Shows About AI Rewriting
The study used the DELEGATE-52 benchmark, an evaluation protocol covering 52 professional domains (accounting, code, musical notation, genealogy, subtitles, blueprints, recipes, and many others) and 310 real work environments. Nineteen models were given the same document editing tasks under identical, controlled conditions.
The setup was straightforward. Researchers submitted a document to an LLM, asked it to make an edit and then undo that edit. A reliable model would return the original document. They repeated this 20 times in a row.
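The round-trip protocol described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual harness: `edit_fn` is a hypothetical callable standing in for an LLM call, the two demo editors are toy stand-ins, and `difflib`'s similarity ratio is an assumed proxy for whatever scoring the study used.

```python
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical interface: (document, instruction) -> edited document.
EditFn = Callable[[str, str], str]


def round_trip_drift(document: str, edit_fn: EditFn, do: str, undo: str,
                     rounds: int = 20) -> list[float]:
    """Apply an edit and its inverse `rounds` times.

    Returns the similarity to the original (1.0 = identical) after each round.
    """
    current = document
    scores = []
    for _ in range(rounds):
        current = edit_fn(current, do)
        current = edit_fn(current, undo)
        matcher = SequenceMatcher(None, document, current, autojunk=False)
        scores.append(matcher.ratio())
    return scores


def lossless_editor(doc: str, instruction: str) -> str:
    """Toy editor that adds or removes a footer without touching anything else."""
    if instruction == "add footer":
        return doc + "\nFOOTER"
    if instruction == "remove footer":
        return doc.removesuffix("\nFOOTER")
    return doc


def make_lossy_editor(fail_on_call: int) -> EditFn:
    """Toy editor that silently drops the first line on one specific call."""
    calls = 0

    def editor(doc: str, instruction: str) -> str:
        nonlocal calls
        calls += 1
        out = lossless_editor(doc, instruction)
        if calls == fail_on_call:
            out = "\n".join(out.split("\n")[1:])
        return out

    return editor


doc = "alpha\nbeta\ngamma"
clean_scores = round_trip_drift(doc, lossless_editor, "add footer", "remove footer")
lossy_scores = round_trip_drift(doc, make_lossy_editor(5), "add footer", "remove footer")
```

With the lossy stand-in, the first two rounds return the original intact and the third does not; nothing in later rounds restores the lost line.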
The results were unambiguous.
Even the best models on the market degraded an average of 25% of a document’s content after 20 interactions. The average across all models reached 50% degradation.
Python was the only domain, out of 52, where most models managed to maintain acceptable reliability. Everywhere else, errors accumulated.
AI Rewriting: Rare, Severe, and Invisible
What makes this degradation particularly hard to manage is its structure.
LLMs do not erode documents gradually, through small successive losses at each iteration. Degradation strikes through rare, abrupt accidents: most iterations leave the document intact, but a single isolated iteration can cause a 10 to 30% quality drop in one pass. Those accidents, though infrequent, account for nearly 80% of the total damage.
You can work through many sessions with no visible problem, then discover a substantially altered document after a corruption episode you never caught in the moment.
The best-performing models do not make fewer critical errors. They make them later. That does not make them more reliable; it simply extends the window during which you trust them before the document breaks.
One more finding worth noting: adding an agentic layer on top of the model, with file access and code execution, does not help. In the DELEGATE-52 experiments, models equipped with tools degraded documents roughly 6% more than the same models operating without them. Agentic frameworks, often presented as the fix for LLM limitations, only displace the problem. The flaw sits at the model level, not in the orchestration layer, and no architecture resolves it at the surface.
What This Means for Organizations
This risk is not theoretical. It maps directly onto what we observe in organizations that have put AI into production.
Long workflows are the most exposed. The more interactions there are, the more errors accumulate. A document reworked across multiple sessions, by multiple people, with frequent back-and-forth with an LLM, is far more vulnerable than one edited a single time.
High-precision domains carry the most risk. Structured, repetitive formats hold up better. Complex natural language content, such as contractual documents, regulatory reports, and business documentation, is the most sensitive.
Document size matters too. Each additional 1,000 tokens amplifies degradation in a non-linear way over time. A 10,000-token document edited 20 times loses an average of 40% of its content, compared to less than 9% for a 2,000-token document under the same conditions.
And irrelevant context files make things worse. When documents that are not relevant to the task are present in the LLM’s context, which is common in RAG or multi-file environments, degradation accelerates. The effect is marginal in the short term and becomes significant over extended workflows.
What This Changes About AI Governance
What DELEGATE-52 documents rigorously is something we observe regularly on the ground: the AI risk cannot be managed by picking a better model. It has to be managed in the workflow.
An LLM can produce satisfactory results in the short term and progressively degrade a document base over time, without anyone detecting it, because each individual interaction looks fine.
A few reflexes worth reconsidering.
Delegating a long task to an LLM and only checking the final output is essentially betting that every one of the twenty steps went well. The Microsoft research shows that this is a bad bet: a single iteration is enough to derail a document. Reviewing at multiple points in the process beats staking everything on the final version.
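The arithmetic behind that bet is easy to make concrete. The sketch below assumes each step fails independently with the same probability; the per-step rates are illustrative values chosen for the example, not figures from the study.

```python
def prob_all_steps_clean(per_step_failure: float, steps: int = 20) -> float:
    """Probability that no step corrupts the document, assuming independent steps."""
    return (1.0 - per_step_failure) ** steps


# Illustrative per-step failure rates (assumptions, not measured values).
for rate in (0.01, 0.02, 0.05):
    chance = prob_all_steps_clean(rate)
    print(f"per-step failure {rate:.0%}: P(all 20 steps clean) = {chance:.2f}")
```

Under these assumed rates, even a 5% chance of an accident per step leaves only about a 36% chance that all twenty steps were clean, which is why a single final check carries so much weight.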
Failing to keep the original document is another common mistake. When an LLM introduces an error, it is rarely dramatic: a number that shifts, a sentence that disappears, a nuance that gets lost along the way. Without the starting version at hand, these small alterations go unnoticed. It is a basic precaution most organizations overlook.
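Keeping the starting version makes these small alterations mechanically detectable. A minimal sketch using Python's standard `difflib` module follows; the contract text is invented for the example.

```python
import difflib


def show_changes(original: str, current: str) -> list[str]:
    """Return unified-diff lines between the preserved original and the current text."""
    return list(difflib.unified_diff(
        original.splitlines(),
        current.splitlines(),
        fromfile="original",
        tofile="current",
        lineterm="",
    ))


# Invented example: one number has shifted and one sentence has disappeared.
original = "Payment due: 30 days\nPenalty: 1.5% per month\nGoverning law: France"
current = "Payment due: 30 days\nPenalty: 1.8% per month"

for line in show_changes(original, current):
    print(line)
```

Lines prefixed with `-` exist only in the original and lines prefixed with `+` only in the current version, so a shifted figure or a dropped clause shows up even when the document still reads fluently.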
There is also a subtler assumption worth questioning. Most people operate as though a document has not changed as long as they personally have not touched it. That was true with Word. It is not true with an LLM. A document can go through several seemingly harmless back-and-forth exchanges and still contain serious errors. Until that assumption is broken, the risk stays invisible.
Practical Safeguards Against AI Rewriting
The answer is not to abandon LLMs for document editing. It is to keep the degradation risk in mind and deploy LLMs with appropriate guardrails.
- Systematic versioning. Any edit delegated to an LLM must be tracked. The original document must remain accessible and comparable. This is not a heavy technical requirement; it is a workflow discipline.
- Human supervision proportional to risk. For low-stakes content, a light review is enough. For contractual, regulatory, or strategic documents, structured validation by a subject-matter expert remains essential. An LLM cannot reliably evaluate the quality of its own output.
- Use-case classification by exposure level. An internal meeting summary does not require the same precautions as an analytical note for an investment committee or a piece of regulatory documentation. Mapping these levels is the prerequisite for any coherent usage policy.
- Hard limits on workflow length between checkpoints. Degradation accelerates with the number of interactions. Setting thresholds beyond which a human review is triggered is not conservatism; it is operational risk management.
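Versioning and checkpoint thresholds from the list above can be combined in a small tracker. This is a sketch under stated assumptions: the class name `EditSession`, both threshold values, and the use of a `difflib` similarity ratio to spot abrupt changes are illustrative choices, not recommendations from the study.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class EditSession:
    """Keeps every version of a document and signals when a human should review."""

    original: str
    max_edits_between_reviews: int = 5   # assumed threshold
    min_step_similarity: float = 0.9     # assumed threshold
    versions: list[str] = field(default_factory=list)
    edits_since_review: int = 0

    def __post_init__(self) -> None:
        # The original is always version 0 and is never overwritten.
        self.versions.append(self.original)

    def record_edit(self, new_text: str) -> bool:
        """Store the new version; return True when a review should be triggered."""
        previous = self.versions[-1]
        self.versions.append(new_text)
        self.edits_since_review += 1
        step_similarity = SequenceMatcher(
            None, previous, new_text, autojunk=False
        ).ratio()
        too_many_edits = self.edits_since_review >= self.max_edits_between_reviews
        abrupt_change = step_similarity < self.min_step_similarity
        return too_many_edits or abrupt_change

    def mark_reviewed(self) -> None:
        """Reset the edit counter after a human has checked the latest version."""
        self.edits_since_review = 0


text = "The quick brown fox jumps over the lazy dog. " * 2
session = EditSession(text, max_edits_between_reviews=3)
flags = [session.record_edit(text) for _ in range(3)]
session.mark_reviewed()
abrupt_flag = session.record_edit(text[:30])
```

A large single-step change is not always an error, since a legitimate edit can rewrite a lot at once. The flag only decides when a person looks, and the reviewer compares against the preserved versions rather than trusting the similarity score.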
The Bottom Line
DELEGATE-52 quantifies a risk that many had sensed without being able to measure it.
LLMs are powerful tools for document editing, but their reliability is asymmetric over time: sound in the short term, degraded over the medium and long term. The errors introduced by AI rewriting are quiet: no alarm, no error message, just content that ends up further and further from its original version.
None of this argues against using AI in the enterprise. What it challenges is the assumption that plugging an LLM into an existing workflow is enough to capture the benefits automatically.
The difference between a deployment that delivers real value and one that ends up damaging the documents it was supposed to produce has nothing to do with the model chosen. It comes down to everything built around it: when someone reads through the output, which versions are preserved, where a human perspective is reintroduced into the process. That architectural work, which requires a real understanding of the technology as much as knowledge of the business processes it supports, is something no LLM vendor will do on behalf of the organization deploying it.
This article is based on the work of Philippe Laban, Tobias Schnabel, and Jennifer Neville (Microsoft Research), “LLMs Corrupt Your Documents When You Delegate,” April 2026.