Cloudficient Blog | Cloudficient

How Often Do Linked Documents Change After Send?

Written by Peter Kozak | Apr 22, 2026 10:49:22 AM

On April 2, 2026, Craig Ball published "A Dog and Its Tail: Don't Let Version Uncertainty Cloud Linked Attachment Production". The essay made two arguments. First - the "dog" - that collecting and searching linked attachments is a threshold discovery obligation, "full stop." Second - the "tail" - that getting the precise as-sent version is aspirational, important, and solvable over time.

This post is a first step, not a final answer. It reports Cloudficient's first-run measurement of the empirical question Craig asked between the dog and the tail - his intuition, offered as belief rather than evidence and paired with a call for the industry to measure it, that fewer than 10 to 20% of linked attachments are meaningfully modified after being shared. The measurement covers one tenant, uses a platform-versioning measure, and is reported on our own methodology and footing. The sections that follow are careful about the difference between what it shows and what it does not yet settle.

What We Measured

During normal email journaling, the preservation system captures three pieces of information for each linked document transmitted in an outbound or inbound email: the Microsoft version identifier of the file at the moment the link was sent, the sharing URL that was transmitted, and the send timestamp. The scope of this measurement is email-linked documents only. Teams messages, channel posts, and other non-email sharing events are out of scope for this run, which aligns the measurement with the traditional "linked attachment" framing at the center of Craig's essay.

Months (or years) later, we resolve the same URL through Microsoft Graph and retrieve the current `DriveItem` state - `lastModifiedDateTime`, the versions collection, and lifecycle indicators. No content is read. We compare the version identifier preserved at send time to the current version identifier on the file.
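For readers who want to reproduce the lookup, Microsoft Graph resolves a sharing URL through its documented `/shares/{shareId}` endpoint, where the share ID is an unpadded base64url encoding of the URL prefixed with `u!`. The sketch below builds that request; token acquisition, the actual HTTP call, and the `$select` field list are left out or illustrative, not a description of our production pipeline:

```python
import base64


def encode_share_url(sharing_url: str) -> str:
    """Encode a sharing URL into the unpadded base64url 'u!' share ID
    that Microsoft Graph's /shares endpoint expects."""
    b64 = base64.b64encode(sharing_url.encode("utf-8")).decode("ascii")
    return "u!" + b64.rstrip("=").replace("/", "_").replace("+", "-")


def drive_item_request(sharing_url: str) -> str:
    """Build the Graph v1.0 URL that returns the current DriveItem state
    (metadata only - no file content is read)."""
    share_id = encode_share_url(sharing_url)
    return (
        "https://graph.microsoft.com/v1.0/shares/"
        f"{share_id}/driveItem?$select=id,lastModifiedDateTime"
    )
```

An authenticated GET against the returned URL (and against the item's `/versions` collection) completes the lookup; retry and error handling are omitted here.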

From the same capture, three related measurements fall out:

Prevalence - did the version identifier change at all between send and query? (A binary per-link answer.)

Velocity - how much elapsed time sits between the preserved input-version timestamp and the current latest-version timestamp? (Computed as the difference between the `InputVersionDate` captured at send and the `LatestVersionDate` returned by Graph. This is an elapsed-time measurement, not a count of distinct post-send modification events.)

Multiplicity - how many platform version increments accumulated between send and query? (Computed as the arithmetic difference between the preserved version identifier and the current version identifier returned by Graph.)

All three are derived from Microsoft's own metadata. No file contents are read at either end.
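As a concrete illustration, the three derived measurements can be sketched from the captured fields. The record shape and field names below are illustrative assumptions for this post, not the preservation system's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class LinkCapture:
    """Send-time metadata preserved during journaling, plus current Graph state."""
    send_version: float            # version identifier at send time
    current_version: float         # version identifier returned by Graph today
    input_version_date: datetime   # timestamp of the as-sent version
    latest_version_date: datetime  # timestamp of the current latest version


def prevalence(c: LinkCapture) -> bool:
    """Did the version identifier change at all between send and query?"""
    return c.current_version != c.send_version


def velocity_days(c: LinkCapture) -> float:
    """Elapsed time between the preserved input-version timestamp and the
    current latest-version timestamp (not a count of edit events)."""
    return (c.latest_version_date - c.input_version_date).total_seconds() / 86400


def multiplicity(c: LinkCapture) -> float:
    """Arithmetic version-identifier difference (not a count of distinct edits)."""
    return c.current_version - c.send_version
```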

The sections that follow report each in turn.

The tenant for this first run is a single large enterprise M365 customer. Over approximately one year of continuous email journaling with send-time version capture enabled, the preservation system has accumulated send-time metadata for several million email-linked document events. From that population, 10,000 links were randomly selected for comparison against Microsoft Graph's current state.

Random selection matters: the rates reported here are estimates of the tenant-wide rate across the full multi-million-event population, not descriptions of a hand-picked subset. At this sample size - drawn from a population large enough that finite-population correction is negligible - the 95% confidence interval around any headline percentage is approximately ±1 percentage point. The headline numbers below are therefore robust to sampling variance at the tenant level. They are not yet robust to tenant-to-tenant variance - that is a different question, addressed below.
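The ±1 percentage point figure follows from the standard normal approximation for a sample proportion; a quick check, assuming simple random sampling:

```python
import math


def ci_halfwidth_pp(p: float, n: int, z: float = 1.96) -> float:
    """95% confidence-interval half-width for a proportion, in percentage
    points, using the normal approximation z * sqrt(p(1-p)/n)."""
    return z * math.sqrt(p * (1 - p) / n) * 100


# Worst case (p = 0.5) at n = 10,000: about +/- 1 percentage point.
worst = ci_halfwidth_pp(0.5, 10_000)
```

Any other headline percentage gives a narrower interval, so ±1 point is the conservative bound.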

The Headline Numbers

Of the 10,000 links, 9,795 resolved successfully against Microsoft Graph; the remaining 205 - about 2.1% - did not resolve at all, whether through deletion, access revocation, or other platform lifecycle events. That 2.1% is a distinct preservation-gap outcome and is reported separately, rather than folded into the modification rate below. For 6,098 of the resolved rows - non-`.aspx` file-link rows where the Graph lookup succeeded and the preservation system had captured a send-time version identifier - we could compute the headline comparison. (An additional 12 resolved rows are `.aspx` / FormServer navigation links; they are excluded from the headline rate but reappear in a wider multiplicity subset further below.)

Within that 6,098-link population, 75.3% of files no longer carried the same Microsoft version identifier at query time that they carried at send time. By Microsoft's own version accounting, the file the recipient would retrieve today often is not in the same version state it was in when the link arrived in their inbox.

Restricting to the Microsoft Office document types that dominate modern discovery - Word, PowerPoint, and Excel - the number rises:

| File type | Links measured | Modified after send |
| --- | --- | --- |
| Word (`.docx`) | 2,236 | 83.1% |
| PowerPoint (`.pptx`) | 1,815 | 83.3% |
| Excel (`.xlsx`) | 1,408 | 78.4% |
| Office subtotal | 5,459 | 81.9% |

Important measurement note. In this first run, "modified" means the Microsoft version identifier at query time differed from the version identifier preserved at send time. That is a platform-versioning measure. It is not yet a hash-level or semantic comparison of the file contents, and it may therefore include version increments associated with autosave, co-authoring, or other platform activity that would not necessarily amount to a reader-visible substantive edit.

This headline should not be read as "81.9% of all sampled links were substantively altered." It is the share of Office-document rows within the clean denominator for which the current Microsoft version identifier no longer matched the send-time version identifier. A hash-level follow-up on the subset where the as-sent content was preserved is the appropriate next refinement, and is planned as a separate measurement.

For the working documents of modern knowledge work - the collaborative spreadsheets, briefing decks, and working drafts that actually carry legal and business deliberation - four out of five linked copies in this tenant no longer matched their send-time version identifier.

How Soon Does the Version Change?

A second question is velocity. For the 4,592 headline rows whose version identifier changed after send, the elapsed time between the preserved input-version timestamp and the current latest-version timestamp is distributed like this:

| Elapsed time | Share of modifications |
| --- | --- |
| Within 1 day | 10.0% |
| Within 1 week | 30.5% (cumulative) |
| Within 1 month | 53.6% (cumulative) |
| Within 3 months | 70.3% (cumulative) |
| Beyond 1 year | 4.7% |

More than half of the modified links had reached their current latest-version timestamp within 30 days of send. Whatever the mix of substantive editing, autosave, and co-authoring activity beneath that signal, the elapsed-time distribution shows that version state does not stabilize after send. In this dataset, 30.5% of the rows that eventually changed had reached their current latest-version timestamp within one week of send; across the full 6,098-row headline-comparable population, that is 22.9%.
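The conversion from the changed-row share to the population-wide share is simple arithmetic, sketched below with the figures from the text; the small residual against the quoted 22.9% reflects rounding of the 30.5% input:

```python
def population_share(conditional_share: float, changed_rows: int,
                     population_rows: int) -> float:
    """Re-express a share of the changed rows as a share of the full
    headline-comparable population."""
    return conditional_share * changed_rows / population_rows


# 30.5% of the 4,592 changed rows reached their latest-version timestamp
# within one week of send; over the 6,098-row population that is ~22.9-23.0%.
week_share_pct = population_share(0.305, 4592, 6098) * 100
```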

How Many Version Changes per File?

A related cut of the same data speaks to how active that drift is - not just whether the version identifier changed, but by how much it changed in numeric terms. The preserved send-time version identifier and the current latest version identifier are both numeric, and their arithmetic difference is itself a measurement. Because SharePoint version numbering varies by library (major-only vs. major/minor with periodic promotion, producing fractional values for minor versions), the difference is a numeric version-difference bucket rather than a literal count of distinct post-send events.

Distribution across the 6,110 rows where both the send-time and current version identifiers parsed numerically (12 rows above the 6,098 headline denominator because this subset includes the `.aspx`/FormServer navigation links excluded from the headline rate):

| Post-send numeric version difference | Rows | Share of 6,110 |
| --- | --- | --- |
| Negative / non-monotonic | 15 | 0.2% |
| 0 (unchanged) | 1,512 | 24.7% |
| Greater than 0 but less than +1 | 44 | 0.7% |
| +1 | 405 | 6.6% |
| +2–3 | 492 | 8.1% |
| +4–9 | 812 | 13.3% |
| +10–24 | 907 | 14.8% |
| +25–99 | 1,204 | 19.7% |
| +100 or more | 719 | 11.8% |

Within the 4,583 rows with a positive post-send numeric version difference, the median was 17. The mean was 52, pulled by a long right tail: p90 was 146, p99 was 457, and the maximum reached 1,704. In the full 6,110-row multiplicity subset, 719 rows - 11.8% - had a post-send version difference of 100 or more.

As with the prevalence figure, these numbers reflect the arithmetic difference between send-time and current version identifiers, not a count of distinct substantive edits.

Even with those caveats, the file-type contrast at the intensity layer is striking:

| File type | Rows with positive version difference | Median positive version difference |
| --- | --- | --- |
| Word (`.docx`) | 1,853 | 16 |
| PowerPoint (`.pptx`) | 1,508 | 19 |
| Excel (`.xlsx`) | 1,100 | ~20 |
| Excel with macros (`.xlsm`) | 23 | 28 |
| PDF | 27 | 1 |
| ICS (calendar) | 49 | 1 |

When PDFs change, the median positive version difference is one. When Office documents change, the median positive version difference is 16 to 20-plus, and in the long tail, many hundreds. That file-type pattern - visible at both the prevalence and intensity layers of the data - is the subject of the next section.

Why the Signal Is Not Pure Platform Noise

The strongest internal sanity check on these numbers is the file-type contrast, which is now visible in two independent cuts of the same data.

At the prevalence layer, PDFs in the same sample changed in only 8.4% of cases, saved-email (`.msg`) files in 2%, video (`.mp4`) never, and calendar invites (`.ics`) in 92.5% (as one would expect of calendar items that reschedule). At the *intensity* layer, PDFs that did change showed a median positive version difference of 1; Office documents that changed showed a median positive version difference of 16 to 20-plus.

If the Office results were driven primarily by uniform platform noise - autosave, metadata churn, co-authoring versioning - we would expect PDFs and saved emails, which live in the same tenant and traverse the same platform, to show more similar drift in both cuts. They do not. What that contrast does not establish is how much of the Office result corresponds to substantive content change as opposed to other version-generating platform activity. A hash-level follow-up is the next refinement, though a substantive-content comparison layer on top of it is what would most directly answer that question. Both are scoped as companion studies.

What the Data Does Not Yet Say

This is one tenant. One run. One channel - email. A few important things are not yet in this measurement:

Retention aging. The comparison here is "is the send-time version still the current version?" The next measurement - already specified - will also report "is the send-time version *still retrievable from the platform's version history at all*?" That is the statistic that speaks directly to whether the send-time file is recoverable by any practitioner today, not just by a preservation system that captured the version identifier early.

Cross-tenant comparison. Additional tenants, drawn from other industry segments, are the next measurement. A third open question - does modification rate correlate with organizational or sector culture? - cannot be answered from a single tenant. We expect it does, and the point of running additional tenants is to test that expectation rather than assume it.

Non-email channels. Teams-native sharing follows different retention and versioning dynamics than email, and a clean comparison requires a measurement run designed for it. That is planned as a separate study - not a footnote to this one.

Materiality. The measurement does not judge whether any specific modification was legally consequential. That is the job of human review in a specific matter. What the measurement does establish is that the presumption of a stable linked file is not supported by the observed behavior of the platform.

What We're Doing Next

The measurement is reproducible by any practitioner with Microsoft Graph read access and preserved send-time metadata - the mechanics are laid out in *What We Measured* above. We expect to publish the full measurement specification alongside subsequent measurements - the hash-level refinement, additional tenants, and retention aging - so that others can measure too. We'd rather the number be tested than trusted.

For the standards consequence of this measurement - specifically what it suggests for the Reconstruction-Grade eDiscovery Standard's transparency-and-collection floor - a companion post is published alongside this one on the RGR standard site: What One Tenant's Measurement Suggests for the RGR Staircase. This post stays with the measurement itself.

What This Tenant Is Doing About It

The customer did not start measuring linked-document change rates out of idle curiosity. Send-time preservation was the first operational step in a broader program - over the course of 2026, they are rolling out Cloudficient's full context-aware eDiscovery stack, targeting RGR-MAX, the highest conformance level of the Reconstruction-Grade eDiscovery Standard.

Send-time preservation sits at the foundation of that roadmap because it is the one capability that, once missed, cannot be retrofitted backward. A link whose as-sent version was not captured during journaling cannot have that state reconstructed later - the platform's own version history may have aged the as-sent version out by then, and the retention-aging study is scoped specifically to measure how often it has. Capturing the send-time state during journaling, and only during journaling, is the single discipline that turns the per-link question - was the file I am producing the file that was sent? - from unanswerable into answerable with evidence.

For the customer in this post, that foundation is now carrying a year of preserved metadata across several million email-linked document events. The measurement described above is the first operational use of that capture layer. Further layers - deterministic content preservation, relationship export integrity, reconstruction-grade reproducibility - build on top of it as the context-aware stack is deployed over the remainder of 2026.

What You Can Do

If your current preservation workflow cannot tell you, per link, whether the version you are producing matches the version that was sent, you are not alone - and the gap is structural, not behavioral. Most email archives journal inbound and outbound messages; very few integrate deeply enough with Microsoft 365 to detect linked files inside those messages, resolve them at journaling time, and preserve the as-sent version of each linked file alongside the email. Without that integration in the archive layer, the question - did the recipient see a different file than what I am producing? - cannot be answered retrospectively, because there is no send-time state to compare against.

That integration is what Cloudficient's Expireon provides, and it is what made the data in this post possible. If your matters or your pre-production diligence have surfaced the need to document, per link, whether the versions you are producing match the versions that were sent, the path to answering that question starts at the archive layer. Happy to talk through what that looks like in your environment.

Cloudficient builds context-aware eDiscovery for collaborative platforms. Cloudficient is publishing this first measurement on its own footing, while following the transparency and reproducibility posture we believe the field should expect. The Reconstruction-Grade eDiscovery Standard is the open, vendor-neutral conformance framework we published so the industry has a shared way to evaluate preservation capabilities.