Guide · Data management

Before/after — what a clean drive looks like for AI integration

The other guides in this set are conceptual. This one is concrete. Side by side: what changes when you bring an unstructured estate under one standard — filenames, folder trees, metadata, and the retrieval-accuracy numbers that justify the work.

Published 18 June 2026 · Last updated 18 June 2026 · 9 min read

The audit we run before we touch anything

Day zero is measurement. Five numbers, captured before any rename, that anchor everything that follows.

Total file count by type. Documents, spreadsheets, slides, PDFs, images, video. The mix tells you which document types need bespoke naming rules.
Duplicate ratio. Same content hash, different filename. We expect 12% to 25% in an untouched estate.
Weak-signal filename rate. Files whose names begin with Untitled, Copy of, Screenshot, Scan_, Document (, or similar. We expect 8% to 18%.
Orphaned-owner count. Files whose owner field references a departed employee. Often surprisingly high — 5% to 15% — and a leading indicator of access-control debt.
Maximum folder depth. The deepest folder path in the estate. Anything above seven levels is a flag.

These five numbers are also the after-cleanup scorecard. Re-measure on day ninety; the deltas are how you justify the work to whoever signed off on it.
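As a rough illustration, the five numbers can be computed from a flat export of the drive's file listing. This is a minimal sketch, not a description of any particular tool: the FileRecord shape, the audit function, and the active_owners set are all assumptions made for the example, and it treats every copy beyond the first with the same content hash as a duplicate.

```python
from collections import Counter
from dataclasses import dataclass
from pathlib import PurePosixPath

# Prefixes from the weak-signal list above; extend for your own estate.
WEAK_PREFIXES = ("Untitled", "Copy of", "Screenshot", "Scan_", "Document (")


@dataclass
class FileRecord:  # hypothetical record shape, one per file in the export
    path: str          # full path, e.g. "/Shared/MISC/Scan_0042.pdf"
    content_hash: str  # any stable digest of the file bytes
    owner: str         # owner address as reported by the platform


def audit(files, active_owners):
    """Return the five day-zero numbers for a list of FileRecord."""
    total = len(files)
    names = [f.path.rsplit("/", 1)[-1] for f in files]
    # File count by type, keyed on the lower-cased extension.
    by_type = Counter(PurePosixPath(n).suffix.lower() for n in names)
    # Assumption: each copy beyond the first with the same hash is a duplicate.
    hash_counts = Counter(f.content_hash for f in files)
    duplicates = sum(n - 1 for n in hash_counts.values())
    weak = sum(1 for n in names if n.startswith(WEAK_PREFIXES))
    orphaned = sum(1 for f in files if f.owner not in active_owners)
    # Folder depth is the number of folders above the file itself.
    depths = (len(f.path.strip("/").split("/")) - 1 for f in files)
    return {
        "total": total,
        "count_by_type": by_type,
        "duplicate_ratio": duplicates / total if total else 0.0,
        "weak_signal_rate": weak / total if total else 0.0,
        "orphaned_owner_count": orphaned,
        "max_folder_depth": max(depths, default=0),
    }
```

Running the same function on day zero and again on day ninety gives the deltas directly, as long as the export format stays the same between runs.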

Before — what a typical 50,000-file SMB drive looks like

Composite from real engagements, sanitised. The pattern repeats with surprising consistency across industries — the proper nouns change, the shape doesn't.

Filenames

Untitled-3.docx
final_FINAL_v3 (1).docx
Copy of proposal FINAL use this one.docx
Screenshot 2024-08-15 at 14.07.42.png
Document (4).pdf
Scan_0042.pdf
Q2 numbers latest sarah edits.xlsx
asdf.docx

Folder tree (excerpt)

/Shared/

/Shared/Sarah's Stuff/

/Shared/Sarah's Stuff/Old/

/Shared/Sarah's Stuff/Old/Old Old/

/Shared/Sarah's Stuff/Old/Old Old/2019 stuff/

/Shared/MISC/

/Shared/MISC/please sort/

/Shared/Q4 2024 FINAL/

/Shared/Q4 2024 FINAL/REALLY FINAL/

Metadata

Owner: departed-employee@example.com, on 14% of files.
Modified date: matches the date of the last platform sync job, not the date of the last meaningful edit, on roughly a third of files inherited from older platforms.
Sensitivity labels: blank.
Drive Labels / custom fields: blank.

After — the same drive, post-convention

Filenames

2024-03-12_proposal_acme-corp_v2.docx
2024-06-30_q2-financials_consolidated.xlsx
2024-08-15_screenshot_pricing-page_landing.png
2024-09-01_invoice_acme-corp_INV-1042.pdf
2024-11-04_meeting-notes_board_quarterly-review.docx
2025-01-15_contract_acme-corp_msa_signed.pdf

Same content as before, addressable by date, document type, and counterparty. The assistant can filter by any of the three before it reads a single byte of content.
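To make the filtering concrete, here is a small sketch of how a filename in this convention could be split into fields without opening the file. The pattern is inferred from the example names above, so treat it as an assumption; the field names (doc_type, subject, qualifier) and both helper functions are hypothetical.

```python
import re
from datetime import date

# Assumed layout: YYYY-MM-DD_doc-type_subject[_qualifier].ext
NAME_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})"
    r"_(?P<doc_type>[^_]+)"
    r"_(?P<subject>[^_.]+)"
    r"(?:_(?P<qualifier>[^.]+))?"
    r"\.(?P<ext>\w+)$"
)


def parse_name(filename):
    """Return the filename's fields as a dict, or None if it does not conform."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    try:
        fields["date"] = date.fromisoformat(fields["date"])
    except ValueError:
        return None  # looks like a date but is not a real calendar date
    return fields


def filter_names(filenames, doc_type=None, subject=None):
    """Keep conforming filenames that match the requested fields."""
    kept = []
    for name in filenames:
        fields = parse_name(name)
        if fields is None:
            continue
        if doc_type is not None and fields["doc_type"] != doc_type:
            continue
        if subject is not None and fields["subject"] != subject:
            continue
        kept.append(name)
    return kept
```

Names that return None from parse_name are also a cheap way to list files that have not yet been brought under the convention.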

Folder tree (excerpt)

/Shared/

/Shared/Finance/

/Shared/Finance/2024/Q2/

/Shared/Finance/2024/Q2/Expenses/

/Shared/Sales/

/Shared/Sales/2024/Acme-Corp/

/Shared/Sales/2024/Acme-Corp/Contracts/

/Shared/_archive/2019-2022/

Metadata

Owner: reassigned to current team owner; no orphaned owners remain.
Modified date: preserved where meaningful; content date carried in filename so retrieval doesn't depend on it.
Sensitivity labels: applied at folder level — Internal, Confidential, Public.
Drive Labels / custom fields: retention class on every folder.

Side-by-side comparison

The audit numbers, before and after: the five day-zero metrics plus two label-coverage figures. These are the numbers that move; everything else is downstream of them.

Metric	Before	After
Total files	52,400	43,100 (canonical) + 9,300 archived
Duplicate ratio	18%	< 3%
Weak-signal filename rate	12%	< 1%
Orphaned-owner count	7,300	0
Maximum folder depth	11	5
Files with sensitivity label	0%	100% (inherited from folder)
Files with retention class	0%	100% (inherited from folder)

The retrieval test — same ten queries, before and after

The numbers above are inputs. The retrieval test is the output — and the only number a business stakeholder actually cares about.

We run ten representative queries through the AI assistant before any cleanup, then re-run the identical queries after. The queries don't change; only the estate does. From a recent SMB engagement:

Query	Before	After
Find the latest Acme MSA	Returned stale v1 from 2023	Returned signed 2025 MSA
Summarise Q2 2024 expenses	Returned three conflicting numbers	Returned the consolidated Q2 file
What did the board approve in March?	No file found	Returned March board minutes
Pull the pricing page screenshot	Returned an unrelated UI screenshot	Returned the pricing-page screenshot
Which vendors are under NDA?	Listed two of seven	Listed all seven
...	...	...
Aggregate	4 / 10 correct	9 / 10 correct

The one remaining failure in this engagement was a question whose answer was genuinely not in the drive. The assistant's honest response should have been "I don't have that"; a post-cleanup tuning step that enabled abstention addressed it.
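The before/after comparison can be kept honest with a small scoring harness that runs a fixed query set through whatever retrieval call the assistant exposes. The sketch below assumes a hypothetical retrieve callable that returns a filename or None; the score function and its tuple layout are illustrative only. An expected answer of None marks a question where abstaining is the correct result.

```python
def score(queries, retrieve):
    """Run a fixed query set and count correct answers.

    queries:  list of (question, expected_filename_or_None) pairs
    retrieve: callable taking a question and returning a filename or None
              (hypothetical stand-in for the assistant's retrieval call)
    """
    rows = []
    for question, expected in queries:
        got = retrieve(question)
        rows.append((question, expected, got, got == expected))
    correct = sum(1 for row in rows if row[3])
    return correct, len(rows), rows


def failures(rows):
    """Return only the rows that need human review."""
    return [row for row in rows if not row[3]]
```

Keeping the query list in version control, unchanged between the two runs, is what makes the aggregate comparable.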

What it costs — time, people, and dollars

Honest ranges from the engagements we've seen. Your estate may sit at either end.

Approach	People	Time	Dollar cost
DIY, solo IT lead, 50k-file drive	1 engineer, 0.5 FTE	8 to 12 weeks	~½ quarter of IT time
DIY, team rollout, multi-team drive	1 IT lead + 1 records manager + per-team SME time	2 quarters	Significant; biggest line item is policy meetings
Assisted, 50k-file drive (e.g. PLUMdata)	1 engineer, 0.2 FTE	2 to 4 weeks	Per-rename pricing on the apply step; scan and preview free
Assisted, whole-company rollout	1 IT lead + per-team SME time, no records manager required	1 quarter	Per-rename pricing; policy time materially lower than DIY

The assisted path's saving isn't the rename throughput — it's the convention-design and edge-case decision time, which is where DIY projects stall.

What this work can't fix

Setting expectations is part of the work. Three things cleanup does not solve:

Bad source documents stay bad. If the underlying contract is unsigned or the underlying model has the wrong formula, renaming the file makes it findable but not correct. The AI assistant will confidently return the wrong answer faster.
Edge-case hallucinations don't go to zero. A clean estate dramatically reduces retrieval-driven hallucination — the assistant returns fewer wrong files. It does not eliminate the residual cases where the model invents content not present in any source. Those require abstention prompting and human review.
Permission decisions still need humans. Cleanup makes existing permissions visible and auditable. It doesn't decide whether a given file should be shared with a given person — that's a policy decision, and a tool can only enforce a policy once someone has written it down.

For the diagnostic that tells you which failure mode is dominant in your estate before you commit to cleanup, see why your AI assistant can't search your drive properly. For the systematic readiness checklist, see the AI data readiness guide.

Frequently asked questions

Can we do this without renaming, just with metadata?

Partially. Adding Drive Labels or sensitivity labels closes some of the gap, because metadata filtering can use them. But filenames are also what humans see, what shared links display, and what most third-party search interfaces index. A metadata-only approach improves retrieval for the AI assistant while leaving human navigation as messy as before — and the assistant's accuracy gains are usually 30% to 50% smaller than they would be with naming included.

Won't renaming break our shared links?

On Google Drive, OneDrive, and SharePoint, file IDs are stable across renames — links continue to resolve. The user-visible link text doesn't auto-update, but the link still opens the renamed file. Where this does matter is in references inside documents ("see attached: final_v2.docx") and in third-party integrations that resolve by name. Both are handled the same way: a supervised approval workflow with a redirect map, and a full undo if a downstream system breaks.
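One way to picture the redirect map and the undo is as two dictionaries built from the approved rename list. This is a sketch under stated assumptions, not a description of any platform API: plan_renames and resolve are hypothetical helpers, names are assumed to be full paths so that old names are unique, and the file IDs stand in for whatever stable identifier the platform provides.

```python
def plan_renames(renames):
    """Build a redirect map and an undo map from approved renames.

    renames: iterable of (file_id, old_name, new_name) tuples.
    Returns (redirect, undo):
      redirect maps old name -> new name, for lookups that still use old names
      undo maps file id -> old name, enough to reverse every rename
    """
    redirect = {}
    undo = {}
    targets = set()
    for file_id, old_name, new_name in renames:
        # Refuse the plan if two files would end up with the same name.
        if new_name in targets:
            raise ValueError(f"two files would both become {new_name!r}")
        targets.add(new_name)
        redirect[old_name] = new_name
        undo[file_id] = old_name
    return redirect, undo


def resolve(name, redirect):
    """Translate an old name to its new one; unknown names pass through."""
    return redirect.get(name, name)
```

An integration that resolves by name can call resolve before its lookup, and a full undo is a loop over the undo map that restores each file's old name by ID.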

Do we need to do this for archived content too?

Only if you want it in the AI assistant's scope. If the archive is genuinely cold — not searched, not referenced — leaving it alone is fine. Most teams find a middle case: a fraction of the archive contains material that's still occasionally referenced (closed-account contracts, historical financial models). That fraction is worth the cleanup; the rest can stay as it is until the next migration forces the question.

How do we keep the drive clean after?

Three light controls. First, an upload-time naming suggestion, which most platforms support either natively or via a Drive Label requirement. Second, a quarterly duplicate-and-staleness scan, with results sent to the team owner rather than to IT. Third, an onboarding step for new joiners that takes ten minutes and demonstrates the convention. Without these, the estate drifts back to a mixed state within twelve to eighteen months.

How long does this actually take?

For a single team's 50,000-file drive with one engineer leading the work, expect two to four weeks of focused effort on the assisted path and eight to twelve weeks DIY, depending on how many bespoke decisions the team's content forces. A whole-company rollout takes a quarter or two, because the limiting factor is policy decisions (retention schedules, classification, access tiers), not the rename throughput. The naming work itself is the smallest line item.

What can't be fixed with this work?

Bad source documents stay bad. If the underlying contract is unsigned, the underlying model has the wrong formula, or the underlying brief is contradictory, no amount of renaming makes the AI assistant return a correct answer. AI hallucination on truly edge-case questions doesn't go to zero. Permission decisions still need a human — the cleanup makes existing permissions visible and auditable, but it doesn't decide whether a given file should be shared with a given person.
