Guide · Data management

Before/after — what a clean drive looks like for AI integration

The other guides in this set are conceptual. This one is concrete. Side by side: what changes when you bring an unstructured estate under one standard — filenames, folder trees, metadata, and the retrieval-accuracy numbers that justify the work.


The audit we run before we touch anything

Day zero is measurement. Five numbers, captured before any rename, that anchor everything that follows.

  • Total file count by type. Documents, spreadsheets, slides, PDFs, images, video. The mix tells you which document types need bespoke naming rules.
  • Duplicate ratio. Same content hash, different filename. We expect 12% to 25% in an untouched estate.
  • Weak-signal filename rate. Files whose names begin with Untitled, Copy of, Screenshot, Scan_, Document (, or similar. We expect 8% to 18%.
  • Orphaned-owner count. Files whose owner field references a departed employee. Often surprisingly high — 5% to 15% — and a leading indicator of access-control debt.
  • Maximum folder depth. The deepest folder path in the estate. Anything above seven levels is a flag.

These five numbers are also the after-cleanup scorecard. Re-measure on day ninety; the deltas are how you justify the work to whoever signed off on it.
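As a rough illustration, the five numbers can be computed over a local export of the drive with a short Python sketch. Everything named here is an assumption rather than a product API: `audit`, the `owners` mapping, and the `departed` set are hypothetical, and the duplicate ratio is taken to mean extra copies of a content hash divided by total files.

```python
import hashlib
from collections import Counter
from pathlib import Path

# Prefixes taken from the weak-signal list above; extend for your estate.
WEAK_PREFIXES = ("Untitled", "Copy of", "Screenshot", "Scan_", "Document (")


def audit(root, owners=None, departed=frozenset()):
    """Return the five day-zero numbers for the tree under `root`.

    `owners` maps file path -> owner email and `departed` is a set of
    departed-employee emails. Both are hypothetical inputs, because a local
    filesystem does not carry the platform's owner field.
    """
    root = Path(root)
    files = [p for p in root.rglob("*") if p.is_file()]
    by_type = Counter(p.suffix.lower() or "(none)" for p in files)

    # Duplicates: every file beyond the first copy of each content hash.
    # read_bytes() loads whole files; hash in chunks for large media.
    hashes = Counter(hashlib.sha256(p.read_bytes()).hexdigest() for p in files)
    duplicates = sum(n - 1 for n in hashes.values())

    weak = sum(p.name.startswith(WEAK_PREFIXES) for p in files)
    orphaned = sum((owners or {}).get(p) in departed for p in files)
    # Depth counts the folders between `root` and the file.
    depth = max((len(p.relative_to(root).parts) - 1 for p in files), default=0)

    total = len(files) or 1  # avoid dividing by zero on an empty tree
    return {
        "by_type": dict(by_type),
        "duplicate_ratio": duplicates / total,
        "weak_name_rate": weak / total,
        "orphaned_owner_count": orphaned,
        "max_folder_depth": depth,
    }
```

Running the same function on day zero and day ninety gives two dictionaries whose differences are the scorecard deltas.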

Before — what a typical 50,000-file SMB drive looks like

Composite from real engagements, sanitised. The pattern repeats with surprising consistency across industries — the proper nouns change, the shape doesn't.

Filenames

  • Untitled-3.docx
  • final_FINAL_v3 (1).docx
  • Copy of proposal FINAL use this one.docx
  • Screenshot 2024-08-15 at 14.07.42.png
  • Document (4).pdf
  • Scan_0042.pdf
  • Q2 numbers latest sarah edits.xlsx
  • asdf.docx

Folder tree (excerpt)

/Shared/
/Shared/Sarah's Stuff/
/Shared/Sarah's Stuff/Old/
/Shared/Sarah's Stuff/Old/Old Old/
/Shared/Sarah's Stuff/Old/Old Old/2019 stuff/
/Shared/MISC/
/Shared/MISC/please sort/
/Shared/Q4 2024 FINAL/
/Shared/Q4 2024 FINAL/REALLY FINAL/

Metadata

  • Owner: departed-employee@example.com, on 14% of files.
  • Modified date: matches the date of the last platform sync job, not the date of the last meaningful edit, on roughly a third of files inherited from older platforms.
  • Sensitivity labels: blank.
  • Drive Labels / custom fields: blank.

After — the same drive, post-convention

Filenames

  • 2024-03-12_proposal_acme-corp_v2.docx
  • 2024-06-30_q2-financials_consolidated.xlsx
  • 2024-08-15_screenshot_pricing-page_landing.png
  • 2024-09-01_invoice_acme-corp_INV-1042.pdf
  • 2024-11-04_meeting-notes_board_quarterly-review.docx
  • 2025-01-15_contract_acme-corp_msa_signed.pdf

Same content as before, addressable by date, document type, and counterparty. The assistant can filter by any of the three before it reads a single byte of content.
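To make the "filter before reading a byte" point concrete, here is a minimal Python sketch of a parser for these names. The field layout (ISO date, then document type, then one or more descriptors such as counterparty or version) is inferred from the examples above and is an assumption; `ParsedName` and `parse_name` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import PurePath


@dataclass(frozen=True)
class ParsedName:
    content_date: date
    doc_type: str
    descriptors: tuple  # e.g. counterparty, version, status
    extension: str


def parse_name(filename):
    """Parse `YYYY-MM-DD_type_descriptor[_descriptor...].ext`, else None."""
    path = PurePath(filename)
    parts = path.stem.split("_")
    if len(parts) < 3:
        return None
    try:
        content_date = date.fromisoformat(parts[0])
    except ValueError:
        return None  # first field is not an ISO date, so not conforming
    return ParsedName(
        content_date,
        parts[1],
        tuple(parts[2:]),
        path.suffix.lstrip(".").lower(),
    )
```

A retrieval layer could then narrow candidates with plain comparisons, for example keeping only entries where `doc_type == "contract"` and `"acme-corp" in descriptors`, and treat a `None` result as a file that still needs renaming.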

Folder tree (excerpt)

/Shared/
/Shared/Finance/
/Shared/Finance/2024/Q2/
/Shared/Finance/2024/Q2/Expenses/
/Shared/Sales/
/Shared/Sales/2024/Acme-Corp/
/Shared/Sales/2024/Acme-Corp/Contracts/
/Shared/_archive/2019-2022/

Metadata

  • Owner: reassigned to current team owner; no orphaned owners remain.
  • Modified date: preserved where meaningful; content date carried in filename so retrieval doesn't depend on it.
  • Sensitivity labels: applied at folder level — Internal, Confidential, Public.
  • Drive Labels / custom fields: retention class on every folder.
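Folder-level labelling implies that each file takes the label of its nearest labelled ancestor. A minimal Python sketch of that lookup follows; the folder paths come from the tree above, but the specific label assignments in `FOLDER_LABELS` are illustrative assumptions, and real platforms expose inheritance through their own APIs.

```python
from pathlib import PurePosixPath

# Hypothetical assignments, for illustration only.
FOLDER_LABELS = {
    "/Shared": "Internal",
    "/Shared/Finance": "Confidential",
    "/Shared/Sales/2024/Acme-Corp/Contracts": "Confidential",
}


def effective_label(file_path, folder_labels=FOLDER_LABELS):
    """Return the label of the nearest labelled ancestor folder, or None."""
    # .parents walks from the containing folder up to the root.
    for folder in PurePosixPath(file_path).parents:
        label = folder_labels.get(str(folder))
        if label is not None:
            return label
    return None
```

The same walk works for retention class: swap in a second mapping and every file resolves to a value as long as some ancestor carries one.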

Side-by-side comparison

The audit numbers, before and after, plus two label-coverage rows. The five audit metrics are the ones that move; everything else is downstream of them.

Metric | Before | After
Total files | 52,400 | 43,100 (canonical) + 9,300 archived
Duplicate ratio | 18% | < 3%
Weak-signal filename rate | 12% | < 1%
Orphaned-owner count | 7,300 | 0
Maximum folder depth | 11 | 5
Files with sensitivity label | 0% | 100% (inherited from folder)
Files with retention class | 0% | 100% (inherited from folder)

The retrieval test — same ten queries, before and after

The numbers above are inputs. The retrieval test is the output — and the only number a business stakeholder actually cares about.

We run ten representative queries through the AI assistant before any cleanup, then re-run the identical queries after. The queries don't change; only the estate does. From a recent SMB engagement:

Query type | Before | After
Find the latest Acme MSA | Returned stale v1 from 2023 | Returned signed 2025 MSA
Summarise Q2 2024 expenses | Returned three conflicting numbers | Returned the consolidated Q2 file
What did the board approve in March? | No file found | Returned March board minutes
Pull the pricing page screenshot | Returned an unrelated UI screenshot | Returned the pricing-page screenshot
Which vendors are under NDA? | Listed two of seven | Listed all seven
... | ... | ...
Aggregate | 4 / 10 correct | 9 / 10 correct

The one remaining failure in this engagement was a question whose answer was genuinely not in the drive — a case where the assistant's honest response should have been "I don't have that", and which a post-cleanup tuning step addressed by enabling abstention.
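The before/after test can be kept honest with a small fixed harness. The Python sketch below assumes a hypothetical `retrieve` callable that wraps the assistant and returns a filename, or `None` when it abstains; it is one possible scoring scheme, not the procedure used in the engagement.

```python
def score(queries, retrieve):
    """Count queries whose result matches the expected file.

    `queries` is a list of (question, expected_filename) pairs. Use None as
    the expected value when the right answer is to abstain.
    `retrieve` is a hypothetical wrapper around the assistant.
    """
    results = [(q, expected, retrieve(q)) for q, expected in queries]
    failures = [r for r in results if r[1] != r[2]]
    correct = len(results) - len(failures)
    return correct, len(results), failures
```

Keeping the query list in version control and running `score` unchanged before and after cleanup means the only variable is the estate. The returned `failures` list shows which questions to inspect by hand.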

What it costs — time, people, and dollars

Honest ranges from the engagements we've seen. Your estate may sit at either end.

Approach | People | Time | Dollar cost
DIY, solo IT lead, 50k-file drive | 1 engineer, 0.5 FTE | 8 to 12 weeks | ~½ quarter of IT time
DIY, team rollout, multi-team drive | 1 IT lead + 1 records manager + per-team SME time | 2 quarters | Significant; biggest line item is policy meetings
Assisted, 50k-file drive (e.g. PLUMdata) | 1 engineer, 0.2 FTE | 2 to 4 weeks | Per-rename pricing on the apply step; scan and preview free
Assisted, whole-company rollout | 1 IT lead + per-team SME time, no records manager required | 1 quarter | Per-rename pricing; policy time materially lower than DIY

The assisted path's saving isn't the rename throughput — it's the convention-design and edge-case decision time, which is where DIY projects stall.

What this work can't fix

Setting expectations is part of the work. Three things cleanup does not solve:

  • Bad source documents stay bad. If the underlying contract is unsigned or the underlying model has the wrong formula, renaming the file makes it findable but not correct. The AI assistant will confidently return the wrong answer faster.
  • Edge-case hallucinations don't go to zero. A clean estate dramatically reduces retrieval-driven hallucination — the assistant returns fewer wrong files. It does not eliminate the residual cases where the model invents content not present in any source. Those require abstention prompting and human review.
  • Permission decisions still need humans. Cleanup makes existing permissions visible and auditable. It doesn't decide whether a given file should be shared with a given person — that's a policy decision, and a tool can only enforce a policy once someone has written it down.

For the diagnostic that tells you which failure mode is dominant in your estate before you commit to cleanup, see why your AI assistant can't search your drive properly. For the systematic readiness checklist, see the AI data readiness guide.

Frequently asked questions

Can we do this without renaming, just with metadata?

Partially. Adding Drive Labels or sensitivity labels closes some of the gap, because metadata filtering can use them. But filenames are also what humans see, what shared links display, and what most third-party search interfaces index. A metadata-only approach improves retrieval for the AI assistant while leaving human navigation as messy as before — and the assistant's accuracy gains are usually 30% to 50% smaller than they would be with naming included.

Won't renaming break our shared links?

On Google Drive, OneDrive, and SharePoint, file IDs are stable across renames — links continue to resolve. The user-visible link text doesn't auto-update, but the link still opens the renamed file. Where this does matter is in references inside documents ("see attached: final_v2.docx") and in third-party integrations that resolve by name. Both are handled the same way: a supervised approval workflow with a redirect map, and a full undo if a downstream system breaks.
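A redirect map with undo can be as simple as a persisted old-to-new mapping. The Python sketch below works on a local filesystem only and is an assumption about the shape of such a workflow: `apply_renames`, `undo_renames`, and the JSON log are hypothetical, and a cloud drive would go through the platform's own API instead.

```python
import json
from pathlib import Path


def apply_renames(plan, log_path):
    """Apply {old_path: new_path} and save what was done as a redirect map."""
    done = {}
    try:
        for old, new in plan.items():
            if Path(new).exists():
                raise FileExistsError(new)  # never overwrite silently
            Path(old).rename(new)
            done[old] = new
    finally:
        # Persist whatever was applied, so a partial run can still be undone.
        Path(log_path).write_text(json.dumps(done, indent=2))
    return done


def undo_renames(log_path):
    """Reverse every rename recorded in the redirect map, last one first."""
    done = json.loads(Path(log_path).read_text())
    for old, new in reversed(list(done.items())):
        Path(new).rename(old)
```

The saved map doubles as the lookup table for name-based integrations: anything that still asks for the old name can be pointed at the new one until it is updated.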

Do we need to do this for archived content too?

Only if you want it in the AI assistant's scope. If the archive is genuinely cold — not searched, not referenced — leaving it alone is fine. Most teams find a middle case: a fraction of the archive contains material that's still occasionally referenced (closed-account contracts, historical financial models). That fraction is worth the cleanup; the rest can stay as it is until the next migration forces the question.

How do we keep the drive clean after?

Three light controls. First, an upload-time naming suggestion; most platforms support this either natively or via a Drive Label requirement. Second, a quarterly duplicate-and-staleness scan, with results sent to the team owner rather than to IT. Third, a ten-minute onboarding step for new joiners that demonstrates the convention. Without these, the estate drifts back to a mixed state within twelve to eighteen months.
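The upload-time suggestion could be generated along these lines. This is a minimal Python sketch, assuming the date_type_descriptor layout used in the "after" examples; `suggest_name` is a hypothetical helper, not a platform feature.

```python
import re
from datetime import date


def suggest_name(content_date, doc_type, *descriptors, extension):
    """Build `YYYY-MM-DD_type_descriptor....ext` from free-text fields."""

    def slug(text):
        # Lowercase, then collapse each non-alphanumeric run to one hyphen.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    fields = [slug(f) for f in (doc_type, *descriptors)]
    fields = [f for f in fields if f]  # drop fields that slug to nothing
    ext = extension.lstrip(".").lower()
    return f"{content_date.isoformat()}_{'_'.join(fields)}.{ext}"


print(suggest_name(date(2024, 3, 12), "Proposal", "Acme Corp", "v2",
                   extension=".docx"))
# prints 2024-03-12_proposal_acme-corp_v2.docx
```

Note that this sketch lowercases everything, so an identifier such as INV-1042 would come out as inv-1042. A team that wants to preserve reference numbers verbatim would need to exempt that field.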

How long does this actually take?

For a single team's 50,000-file drive with one engineer leading the work: two to four weeks on the assisted path, or eight to twelve weeks DIY at half time, depending on how many bespoke decisions the team's content forces. A whole-company rollout takes a quarter or two, because the limiting factor is policy decisions (retention schedules, classification, access tiers), not the rename throughput. The naming work itself is the smallest line item.

What can't be fixed with this work?

Bad source documents stay bad. If the underlying contract is unsigned, the underlying model has the wrong formula, or the underlying brief is contradictory, no amount of renaming makes the AI assistant return a correct answer. AI hallucination on truly edge-case questions doesn't go to zero. Permission decisions still need a human — the cleanup makes existing permissions visible and auditable, but it doesn't decide whether a given file should be shared with a given person.

Start with your own Drive

Free to scan, free to preview, private by design. The first five numbers — duplicate ratio, weak-name rate, orphan count, depth, file mix — come back in minutes.

Begin a free scan →