Before/after — what a clean drive looks like for AI integration
The other guides in this set are conceptual. This one is concrete. Side by side: what changes when you bring an unstructured estate under one standard — filenames, folder trees, metadata, and the retrieval-accuracy numbers that justify the work.
The audit we run before we touch anything
Day zero is measurement. Five numbers, captured before any rename, that anchor everything that follows.
- Total file count by type. Documents, spreadsheets, slides, PDFs, images, video. The mix tells you which document types need bespoke naming rules.
- Duplicate ratio. Same content hash, different filename. We expect 12% to 25% in an untouched estate.
- Weak-signal filename rate. Files whose names begin with "Untitled", "Copy of", "Screenshot", "Scan_", "Document (", or similar. We expect 8% to 18%.
- Orphaned-owner count. Files whose owner field references a departed employee. Often surprisingly high (5% to 15%) and a leading indicator of access-control debt.
- Maximum folder depth. The deepest folder path in the estate. Anything above seven levels is a flag.
These five numbers are also the after-cleanup scorecard. Re-measure on day ninety; the deltas are how you justify the work to whoever signed off on it.
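As a rough illustration, four of the five numbers can be computed against a local copy or sync of the drive with the standard library alone. This is a minimal sketch, not the tooling used in the engagements: the function name `audit_tree` is hypothetical, "duplicate" is assumed to mean a file whose SHA-256 hash matches one already seen, and the orphaned-owner count is left out because owner data lives in the platform's own API rather than on disk.

```python
import hashlib
from collections import Counter
from pathlib import Path

# Prefixes taken from the weak-signal list above; extend for your estate.
WEAK_PREFIXES = ("Untitled", "Copy of", "Screenshot", "Scan_", "Document (")


def audit_tree(root: Path) -> dict:
    """Compute four of the five day-zero numbers for a local copy of a drive."""
    files = [p for p in root.rglob("*") if p.is_file()]
    by_type = Counter(p.suffix.lower() or "(none)" for p in files)

    # Assumed definition: a duplicate is any file whose content hash
    # has already been seen. read_bytes() loads the whole file, which is
    # fine for a sketch; hash in chunks for very large files.
    seen = set()
    duplicates = 0
    for p in files:
        digest = hashlib.sha256(p.read_bytes()).hexdigest()
        if digest in seen:
            duplicates += 1
        seen.add(digest)

    weak = sum(1 for p in files if p.name.startswith(WEAK_PREFIXES))

    # Depth = number of folders between the root and the file.
    max_depth = max((len(p.relative_to(root).parts) - 1 for p in files), default=0)

    total = len(files)
    return {
        "total_files": total,
        "by_type": dict(by_type),
        "duplicate_ratio": duplicates / total if total else 0.0,
        "weak_signal_rate": weak / total if total else 0.0,
        "max_folder_depth": max_depth,
    }
```

Running the same function on day zero and day ninety gives two dictionaries whose differences are the scorecard deltas.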
Before — what a typical 50,000-file SMB drive looks like
Composite from real engagements, sanitised. The pattern repeats with surprising consistency across industries — the proper nouns change, the shape doesn't.
Filenames
- Untitled-3.docx
- final_FINAL_v3 (1).docx
- Copy of proposal FINAL use this one.docx
- Screenshot 2024-08-15 at 14.07.42.png
- Document (4).pdf
- Scan_0042.pdf
- Q2 numbers latest sarah edits.xlsx
- asdf.docx
Folder tree (excerpt)
/Shared/
/Shared/Sarah's Stuff/
/Shared/Sarah's Stuff/Old/
/Shared/Sarah's Stuff/Old/Old Old/
/Shared/Sarah's Stuff/Old/Old Old/2019 stuff/
/Shared/MISC/
/Shared/MISC/please sort/
/Shared/Q4 2024 FINAL/
/Shared/Q4 2024 FINAL/REALLY FINAL/
Metadata
- Owner: departed-employee@example.com, on 14% of files.
- Modified date: matches the date of the last platform sync job, not the date of the last meaningful edit, on roughly a third of files inherited from older platforms.
- Sensitivity labels: blank.
- Drive Labels / custom fields: blank.
After — the same drive, post-convention
Filenames
- 2024-03-12_proposal_acme-corp_v2.docx
- 2024-06-30_q2-financials_consolidated.xlsx
- 2024-08-15_screenshot_pricing-page_landing.png
- 2024-09-01_invoice_acme-corp_INV-1042.pdf
- 2024-11-04_meeting-notes_board_quarterly-review.docx
- 2025-01-15_contract_acme-corp_msa_signed.pdf
Same content as before, addressable by date, document type, and counterparty. The assistant can filter by any of the three before it reads a single byte of content.
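To make "filter before reading a byte" concrete, here is a minimal sketch of a parser for names in this shape. The field layout (date, document type, subject or counterparty, optional qualifier) is inferred from the example filenames above and is an assumption, as are the names `PATTERN`, `ParsedName`, and `parse_name`; a real convention would define its own fields.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumed layout: YYYY-MM-DD_doc-type_subject[_qualifier].ext
PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})_(?P<doc_type>[^_]+)_(?P<subject>[^_.]+)"
    r"(?:_(?P<qualifier>[^.]+))?\.(?P<ext>[A-Za-z0-9]+)$"
)


@dataclass(frozen=True)
class ParsedName:
    content_date: date
    doc_type: str
    subject: str
    qualifier: Optional[str]
    ext: str


def parse_name(filename: str) -> Optional[ParsedName]:
    """Return the structured fields, or None if the name is off-convention."""
    m = PATTERN.match(filename)
    if m is None:
        return None
    try:
        content_date = date.fromisoformat(m["date"])
    except ValueError:  # digits in the right shape but not a real date
        return None
    return ParsedName(content_date, m["doc_type"], m["subject"],
                      m["qualifier"], m["ext"].lower())


names = [
    "2024-03-12_proposal_acme-corp_v2.docx",
    "2025-01-15_contract_acme-corp_msa_signed.pdf",
    "Untitled-3.docx",
]
# Filter on metadata carried in the name alone; no file is opened.
acme_contracts = [
    n for n in names
    if (p := parse_name(n)) and p.doc_type == "contract" and p.subject == "acme-corp"
]
```

Names that return `None` are exactly the files still outside the convention, so the same function doubles as a compliance check.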
Folder tree (excerpt)
/Shared/
/Shared/Finance/
/Shared/Finance/2024/Q2/
/Shared/Finance/2024/Q2/Expenses/
/Shared/Sales/
/Shared/Sales/2024/Acme-Corp/
/Shared/Sales/2024/Acme-Corp/Contracts/
/Shared/_archive/2019-2022/
Metadata
- Owner: reassigned to current team owner; no orphaned owners remain.
- Modified date: preserved where meaningful; content date carried in filename so retrieval doesn't depend on it.
- Sensitivity labels: applied at folder level — Internal, Confidential, Public.
- Drive Labels / custom fields: retention class on every folder.
Side-by-side comparison
The five audit numbers, plus two label-coverage rows, before and after. These are the metrics that move; everything else is downstream of them.
| Metric | Before | After |
|---|---|---|
| Total files | 52,400 | 43,100 (canonical) + 9,300 archived |
| Duplicate ratio | 18% | < 3% |
| Weak-signal filename rate | 12% | < 1% |
| Orphaned-owner count | 7,300 | 0 |
| Maximum folder depth | 11 | 5 |
| Files with sensitivity label | 0% | 100% (inherited from folder) |
| Files with retention class | 0% | 100% (inherited from folder) |
The retrieval test — same ten queries, before and after
The numbers above are inputs. The retrieval test is the output — and the only number a business stakeholder actually cares about.
We run ten representative queries through the AI assistant before any cleanup, then re-run the identical queries after. The queries don't change; only the estate does. From a recent SMB engagement:
| Query type | Before | After |
|---|---|---|
| Find the latest Acme MSA | Returned stale v1 from 2023 | Returned signed 2025 MSA |
| Summarise Q2 2024 expenses | Returned three conflicting numbers | Returned the consolidated Q2 file |
| What did the board approve in March? | No file found | Returned March board minutes |
| Pull the pricing page screenshot | Returned an unrelated UI screenshot | Returned the pricing-page screenshot |
| Which vendors are under NDA? | Listed two of seven | Listed all seven |
| ... | ... | ... |
| Aggregate | 4 / 10 correct | 9 / 10 correct |
The one remaining failure in this engagement was a question whose answer was genuinely not in the drive — a case where the assistant's honest response should have been "I don't have that", and which a post-cleanup tuning step addressed by enabling abstention.
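The before/after scoring can be sketched as a small harness that runs a fixed query set through whatever retrieval function is under test. Everything here is illustrative: `Case`, `score`, and the two stand-in retrievers are hypothetical names, the toy data is not from the engagement, and exact match on a single expected file is a simplification (list-style queries such as the NDA one need a set comparison instead).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Case:
    query: str
    expected_file: str  # the file a human reviewer agreed is the right answer


def score(cases: Sequence[Case],
          retrieve: Callable[[str], Optional[str]]) -> Tuple[int, int]:
    """Return (correct, total) for one run of the fixed query set."""
    correct = sum(1 for c in cases if retrieve(c.query) == c.expected_file)
    return correct, len(cases)


# Toy stand-ins for the assistant, for illustration only.
CASES = [
    Case("Find the latest Acme MSA", "2025-01-15_contract_acme-corp_msa_signed.pdf"),
    Case("Summarise Q2 2024 expenses", "2024-06-30_q2-financials_consolidated.xlsx"),
]
before = {"Find the latest Acme MSA": "Copy of proposal FINAL use this one.docx"}.get
after = {c.query: c.expected_file for c in CASES}.get

print("before: %d / %d" % score(CASES, before))
print("after:  %d / %d" % score(CASES, after))
```

Keeping the cases in one frozen list is the point of the design: the queries and expected answers stay identical across runs, so any change in the score is attributable to the estate.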
What it costs — time, people, and dollars
Honest ranges from the engagements we've seen. Your estate may sit at either end.
| Approach | People | Time | Dollar cost |
|---|---|---|---|
| DIY, solo IT lead, 50k-file drive | 1 engineer, 0.5 FTE | 8 to 12 weeks | ~½ quarter of IT time |
| DIY, team rollout, multi-team drive | 1 IT lead + 1 records manager + per-team SME time | 2 quarters | Significant; biggest line item is policy meetings |
| Assisted, 50k-file drive (e.g. PLUMdata) | 1 engineer, 0.2 FTE | 2 to 4 weeks | Per-rename pricing on the apply step; scan and preview free |
| Assisted, whole-company rollout | 1 IT lead + per-team SME time, no records manager required | 1 quarter | Per-rename pricing; policy time materially lower than DIY |
The assisted path's saving isn't the rename throughput — it's the convention-design and edge-case decision time, which is where DIY projects stall.
What this work can't fix
Setting expectations is part of the work. Three things cleanup does not solve:
- Bad source documents stay bad. If the underlying contract is unsigned or the underlying model has the wrong formula, renaming the file makes it findable but not correct. The AI assistant will confidently return the wrong answer faster.
- Edge-case hallucinations don't go to zero. A clean estate dramatically reduces retrieval-driven hallucination — the assistant returns fewer wrong files. It does not eliminate the residual cases where the model invents content not present in any source. Those require abstention prompting and human review.
- Permission decisions still need humans. Cleanup makes existing permissions visible and auditable. It doesn't decide whether a given file should be shared with a given person — that's a policy decision, and a tool can only enforce a policy once someone has written it down.
For the diagnostic that tells you which failure mode is dominant in your estate before you commit to cleanup, see why your AI assistant can't search your drive properly. For the systematic readiness checklist, see the AI data readiness guide.
Frequently asked questions
Can we do this without renaming, just with metadata?
Partially. Adding Drive Labels or sensitivity labels closes some of the gap, because metadata filtering can use them. But filenames are also what humans see, what shared links display, and what most third-party search interfaces index. A metadata-only approach improves retrieval for the AI assistant while leaving human navigation as messy as before — and the assistant's accuracy gains are usually 30% to 50% smaller than they would be with naming included.
Won't renaming break our shared links?
On Google Drive, OneDrive, and SharePoint, file IDs are stable across renames — links continue to resolve. The user-visible link text doesn't auto-update, but the link still opens the renamed file. Where this does matter is in references inside documents ("see attached: final_v2.docx") and in third-party integrations that resolve by name. Both are handled the same way: a supervised approval workflow with a redirect map, and a full undo if a downstream system breaks.
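One way to picture the redirect map and undo is the sketch below, which works on a local folder as a stand-in for the platform. It is an assumption-laden illustration, not a description of any product's workflow: `apply_renames` and `undo_renames` are hypothetical names, the approved plan is assumed to be a simple old-path to new-path mapping, and on a cloud drive the rename itself would go through the platform's API instead of `Path.rename`.

```python
import json
from pathlib import Path


def apply_renames(root: Path, plan: dict, log_path: Path) -> dict:
    """Apply approved old->new renames and persist the redirect map."""
    redirect = {}
    log_path.write_text(json.dumps(redirect))
    for old, new in plan.items():
        src, dst = root / old, root / new
        if not src.exists() or dst.exists():
            continue  # never overwrite; leave the conflict for human review
        dst.parent.mkdir(parents=True, exist_ok=True)
        src.rename(dst)
        redirect[old] = new
        # Rewrite the log after every rename so a crash leaves a usable map.
        log_path.write_text(json.dumps(redirect, indent=2))
    return redirect


def undo_renames(root: Path, log_path: Path) -> int:
    """Reverse every rename recorded in the redirect map; return the count."""
    redirect = json.loads(log_path.read_text())
    restored = 0
    for old, new in reversed(list(redirect.items())):
        src, dst = root / new, root / old
        if src.exists() and not dst.exists():
            dst.parent.mkdir(parents=True, exist_ok=True)
            src.rename(dst)
            restored += 1
    return restored
```

The persisted map serves both purposes named above: integrations that resolve by name can look up the new name from the old one, and the undo step replays the map in reverse.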
Do we need to do this for archived content too?
Only if you want it in the AI assistant's scope. If the archive is genuinely cold — not searched, not referenced — leaving it alone is fine. Most teams find a middle case: a fraction of the archive contains material that's still occasionally referenced (closed-account contracts, historical financial models). That fraction is worth the cleanup; the rest can stay as it is until the next migration forces the question.
How do we keep the drive clean after?
Three light controls. First, an upload-time naming suggestion — most platforms either support this natively or via a Drive Label requirement. Second, a quarterly duplicate-and-staleness scan, with results sent to the team owner rather than to IT. Third, an onboarding step for new joiners that takes ten minutes and demonstrates the convention. Without these, the estate drifts back to mixed-state within twelve to eighteen months.
How long does this actually take?
For a single team's 50,000-file drive with one engineer leading the work: two to four weeks on the assisted path, or eight to twelve weeks DIY, depending on how many bespoke decisions the team's content forces. A whole-company rollout takes a quarter or two, because the limiting factor is policy decisions (retention schedules, classification, access tiers), not rename throughput. The naming work itself is the smallest line item.
What can't be fixed with this work?
Bad source documents stay bad. If the underlying contract is unsigned, the underlying model has the wrong formula, or the underlying brief is contradictory, no amount of renaming makes the AI assistant return a correct answer. AI hallucination on truly edge-case questions doesn't go to zero. Permission decisions still need a human — the cleanup makes existing permissions visible and auditable, but it doesn't decide whether a given file should be shared with a given person.