Guide · AI & data management

Preparing your data for AI agents — the 2026 readiness checklist

Before you deploy AI agents on your Drive, your data has to be findable, well-named, deduplicated, and access-controlled. Here are the five prerequisites that decide whether your agent actually works — and the practical checklist to get there.

11 min read

What does "AI-ready data" mean?

AI-ready data is content that an AI agent can find, identify, retrieve, and trust without a human in the loop. That definition is simple, but every word in it does work.

  • Find — the file shows up in a search the agent performs on the user's behalf. If your filenames are DSC_0042.jpg and final v2 FINAL use this one.docx, the agent cannot find anything by topic.
  • Identify — the file's name, path, and metadata communicate what the file is. The agent shouldn't have to open ten files to figure out which one holds the Q2 expenses.
  • Retrieve — the agent can read the file content. That means no permissions surprises, no broken links, no duplicate files where one is current and four are stale.
  • Trust — the agent (and the user reviewing the agent's answer) can tell whether the file is authoritative. Two near-identical files with different figures will produce two different answers, and the user won't know which to act on.

Almost every team that struggles to deploy an AI agent on their own data is failing at one of those four words. The good news: each one maps to a concrete, fixable problem.

Why data readiness decides whether your agent works

In 2025 a clear pattern emerged across enterprise AI agent pilots: the model is rarely the bottleneck. The bottleneck is the retrieval layer — and the retrieval layer is only as good as the source data.

The same frontier-class model performs far better on a clean corpus than on a messy one. Both retrieval-augmented generation (RAG) and agentic workflows depend on metadata filtering: the agent picks candidate documents by name, path, date, and tag before it reads the contents. If your files don't carry that information in their names and paths, the candidate set is effectively arbitrary.
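
As a rough illustration of metadata filtering, the sketch below narrows a corpus by path, date, and label before any file content is read. The FileMeta shape and the candidates function are hypothetical names made up for this example, not any platform's API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FileMeta:
    """Metadata an agent can see without opening the file (hypothetical shape)."""
    name: str
    path: str
    modified: date
    labels: frozenset = frozenset()


def candidates(files, path_prefix, modified_after, required_label=None):
    """Narrow the corpus by path, date, and label before any content is read."""
    return [
        f for f in files
        if f.path.startswith(path_prefix)
        and f.modified >= modified_after
        and (required_label is None or required_label in f.labels)
    ]
```

A file sitting in an uninformative folder, or carrying no label, never enters the candidate list, however relevant its contents are.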

The result is the canonical failure mode of enterprise AI: the agent confidently returns the wrong document, the user loses trust, and the pilot stalls. The fix is upstream — at the data layer — not in the prompt or the model.

The five prerequisites

1. Consistent file naming

Every file in the estate follows one naming convention. That convention specifies, at minimum:

  • Date format — ISO 8601 (2026-06-02) or a single consistent alternative.
  • Separator — space, hyphen, or underscore. One choice, applied everywhere.
  • Capitalisation — Title Case, lowercase, or UPPERCASE. One choice.
  • Field order — subject first, document type, then date; or date first, then subject. One choice.
  • Required fields — for invoices: vendor, invoice number, date. For contracts: counterparty, agreement type, signing date. Domain-specific, but written down.

The choice matters less than the consistency. An agent can learn either Q2 Expense Report 2024.xlsx or 2024-06-30_expense-report_q2.xlsx. It cannot learn both in the same Drive.
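
A written convention can usually be expressed as a pattern and checked mechanically. The sketch below assumes one possible convention (ISO date, then document type, then subject, separated by underscores, all lowercase); the pattern and the function names are illustrative only, and your own standard will differ.

```python
import re
from typing import Optional

# Hypothetical convention: YYYY-MM-DD_document-type_subject.ext
# Lowercase, hyphens inside a field, underscores between fields.
# The pattern checks the shape of the date, not whether it is a real calendar date.
NAME_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})"
    r"_(?P<doc_type>[a-z0-9]+(?:-[a-z0-9]+)*)"
    r"_(?P<subject>[a-z0-9]+(?:-[a-z0-9]+)*)"
    r"\.(?P<ext>[a-z0-9]+)$"
)


def parse_name(filename: str) -> Optional[dict]:
    """Return the named fields if the filename follows the convention, else None."""
    match = NAME_PATTERN.match(filename)
    return match.groupdict() if match else None


def violations(filenames: list) -> list:
    """List the filenames that do not follow the convention."""
    return [name for name in filenames if parse_name(name) is None]
```

Under this assumed pattern, 2024-06-30_expense-report_q2.xlsx parses into four fields, while Q2 Expense Report 2024.xlsx and DSC_0042.jpg are reported as violations.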

2. Predictable folder structure

A folder path is metadata. /Finance/2024/Q2/Expenses/ tells the agent four facts about every file inside. A folder called /MISC OLD STUFF/ tells it nothing.

Hierarchies three to five levels deep, ordered by stable categories (function → year → quarter → topic), work best for retrieval. Avoid grouping by person: people leave; the folder lives forever.
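
To make "a folder path is metadata" concrete, here is a minimal sketch that reads the four facts out of a path. It assumes the function → year → quarter → topic ordering used above; LEVELS and path_metadata are hypothetical names for this example.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# Assumed level order, mirroring function → year → quarter → topic.
LEVELS = ("function", "year", "quarter", "topic")


def path_metadata(folder: str) -> Optional[dict]:
    """Map a folder path onto the expected levels, or return None if it does not fit."""
    parts = [p for p in PurePosixPath(folder).parts if p != "/"]
    if len(parts) < len(LEVELS):
        return None
    fields = dict(zip(LEVELS, parts))  # deeper levels beyond the fourth are ignored
    if not re.fullmatch(r"\d{4}", fields["year"]):
        return None
    if not re.fullmatch(r"Q[1-4]", fields["quarter"]):
        return None
    return fields
```

With this sketch, /Finance/2024/Q2/Expenses/ yields four usable fields, and /MISC OLD STUFF/ yields None.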

3. Working metadata

Drive platforms expose metadata fields the agent can read: owner, created date, modified date, MIME type, labels, drive location. Make sure they're populated and meaningful.

  • Use Drive Labels (Google) or Sensitivity Labels (Microsoft 365) for classification — sensitive, internal, public.
  • Don't rely on the file's modified date to mean "the date the content describes". They diverge constantly.
  • If your DMS supports custom metadata, populate it. Empty fields are worse than no fields because they create false negatives in retrieval.
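
One way to spot unpopulated metadata is to check each file record against a list of required fields. The sketch below assumes metadata has already been exported as plain dictionaries; the field names in REQUIRED_FIELDS are placeholders, not any platform's schema.

```python
# Hypothetical required fields; substitute whatever your platform exposes.
REQUIRED_FIELDS = ("owner", "label", "created")


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent, None, or blank in a record."""
    return [
        field for field in REQUIRED_FIELDS
        if not str(record.get(field) or "").strip()
    ]
```

Running this across an export gives a per-file list of gaps to fill before the agent relies on those fields for filtering.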

4. Deduplicated versions

Version sprawl is the single most common failure mode in enterprise Drives. The fix is two-part:

  • One canonical file per logical document. Everything else goes into a _archive sibling folder or the DMS's version history.
  • Version numbers in the canonical file's name only when versions are externally meaningful (e.g. signed contracts). Internal drafts should not be discoverable copies.

An agent retrieving five near-identical files can produce five slightly different answers, and a user who no longer trusts the agent.
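
A first pass at finding copies can be automated by hashing file contents. The sketch below is a minimal illustration that assumes contents are already loaded as bytes, keyed by path; duplicate_groups is a hypothetical helper. It only catches byte-for-byte duplicates, so near-identical files with different figures still need a human decision about which one is canonical.

```python
import hashlib
from collections import defaultdict


def duplicate_groups(files: dict) -> list:
    """Group paths whose contents are byte-for-byte identical.

    `files` maps a path to its content as bytes.
    """
    by_digest = defaultdict(list)
    for path, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(path)
    # Keep only digests shared by more than one path.
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Each returned group is a set of candidates for "keep one, archive the rest".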

5. Access control the agent inherits

An AI agent should retrieve content with the requesting user's permissions, not a service account's. This single architectural choice prevents the worst-case data-leak scenarios: an agent reading content the requester is not authorised to see, then surfacing it in an answer.

  • For Google Workspace, use the user's OAuth token, not a domain-wide delegation service account.
  • For Microsoft 365, use delegated permissions (on-behalf-of), not application permissions.
  • Audit every retrieval — log which user requested, which files were considered, which were used in the answer.
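
The shape of a user-scoped retrieval with an audit trail can be sketched as follows. The in-memory acl dictionary is only a stand-in for illustration: in a real deployment the platform enforces permissions through the requesting user's token rather than a lookup table. AuditEntry and retrieve_for_user are hypothetical names.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditEntry:
    """One retrieval decision: who asked, what was considered, what was permitted."""
    user: str
    considered: list
    permitted: list
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def retrieve_for_user(user, candidate_files, acl, audit_log):
    """Return only the files the user may read, and record the decision.

    `acl` maps a file name to the set of users allowed to read it.
    """
    permitted = [f for f in candidate_files if user in acl.get(f, set())]
    audit_log.append(
        AuditEntry(user=user, considered=list(candidate_files), permitted=permitted)
    )
    return permitted
```

A fuller version would also record which of the permitted files were actually cited in the answer, once the answer has been generated.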

The AI-data-readiness checklist

Work through these in order. Each row depends on the rows above being done first.

Step | Outcome | How you know it's done
1. Inventory | You know what you have | File count, type breakdown, and rough age distribution per drive
2. Define convention | One-page naming standard, written down | A non-technical reader can apply it to a new file without asking questions
3. Pilot on one folder | Convention survives contact with real data | You renamed 200–500 files and edge cases produced rule additions, not exceptions
4. Roll out, with previews | Every file follows the standard | No file in the estate violates the rules; audit log of every rename is intact
5. Restructure folders | Folder paths are themselves metadata | A new joiner can locate the right folder for a new document without asking
6. Deduplicate | One canonical file per document | "final v2 FINAL" families are resolved; archives are clearly archived
7. Classify | Sensitivity labels populated | Every file has a classification; the agent can filter by it
8. Wire the agent | User-scoped retrieval with audit log | You can answer, for any given agent answer, which files were considered and why
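
The inventory in step 1 can start as a very small script. The sketch below assumes you have already exported a list of (path, modified date) pairs from the drive; the inventory function and its age buckets are illustrative choices, and years are approximated as 365 days.

```python
from collections import Counter
from datetime import date
from pathlib import PurePosixPath


def inventory(files, today):
    """Summarise a list of (path, modified_date) pairs: count, types, and rough age."""
    types = Counter(
        PurePosixPath(path).suffix.lower() or "(none)" for path, _ in files
    )
    ages = Counter()
    for _, modified in files:
        years = (today - modified).days // 365  # approximate, ignores leap days
        bucket = "<1y" if years < 1 else "1-3y" if years < 3 else "3y+"
        ages[bucket] += 1
    return {"count": len(files), "types": dict(types), "ages": dict(ages)}
```

Run per drive, this gives the file count, type breakdown, and age distribution that the first row of the checklist asks for.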

Common mistakes to avoid

  • Renaming during a migration. Do it before the move, on the source platform. Otherwise you double the migration work and break every existing link.
  • Letting each team pick its own convention. The agent crosses team boundaries; the convention has to, too. Per-team variants are fine as extensions of a single base convention, not as alternatives to it.
  • Treating naming as a one-time project. New files arrive every day. Without a creation-time enforcement mechanism — a template, a Drive Label requirement, or an agent that suggests names on upload — the estate drifts back to chaos in months.
  • Skipping the inventory. Teams that don't measure first consistently underestimate the size of the problem by an order of magnitude. Run the scan.
  • Mocking access control in pilots. An agent that "works" with a service account often fails the moment you switch to per-user OAuth. Build it correctly from day one.

For government agencies

Public-sector data-readiness work has the same five prerequisites, plus a records-management layer on top. In the United States that means alignment with the National Archives and Records Administration's General Records Schedules (NARA GRS) and any agency-specific schedules.

Before any AI agent is authorised against agency data, three additional steps are usually required:

  • Records schedule alignment. File names and folder paths should make it possible to identify the records series each file belongs to. This is what enables automated retention and disposition.
  • FOIA-readiness. Files that may be subject to Freedom of Information Act requests need to be findable by topic, date range, and custodian. The naming convention directly determines whether FOIA searches return complete results.
  • Classification consistency. Sensitivity labels (CUI, PII, etc.) must be applied uniformly. An AI agent retrieving files with inconsistent labels cannot safely filter them.

The order of operations for an agency considering a drive migration and an AI deployment is:

  1. Inventory the source estate.
  2. Apply a naming convention aligned with the records schedule.
  3. Deduplicate and classify.
  4. Migrate the clean estate.
  5. Wire the agent against the migrated, classified data.

Skipping step 2 is the single most common reason agency AI pilots stall after migration.

Where PLUMdata fits

PLUMdata handles the naming-convention, rollout, and deduplication layer of the checklist (steps 2 through 4, and step 6) for Google Drive today, with OneDrive, Dropbox, and SharePoint following.

  • Learns your convention. A short onboarding conversation produces a written naming standard.
  • Previews every rename. Nothing changes in your Drive without your explicit approval, file by file or folder by folder.
  • Full undo. Every rename is reversible at every step. The audit log is yours.
  • Private by design. Files are processed in-session, never stored, never used to train AI.
  • Free to scan, free to preview. Pay only when you apply.

Run a free scan on your own Drive →

Frequently asked questions

What does "AI-ready data" actually mean?

AI-ready data is content that an AI agent can find, identify, retrieve, and trust without human intervention. In practice that means consistent file naming, a predictable folder structure, working metadata, deduplicated versions, and access controls that the agent inherits from the requesting user.

Do I need a vector database before I can use an AI agent on my Drive?

Not necessarily. Most enterprise agents now use a hybrid of vector retrieval and metadata filtering. Metadata filtering only works if your file names and folder structure are consistent — which is why naming-convention work is the prerequisite to any retrieval architecture.

How long does it take to make a 100,000-file Drive AI-ready?

With an assisted workflow like PLUMdata, expect two to four weeks of supervised work for a single team's estate. A full enterprise migration takes longer because policy decisions (retention schedules, classification, access tiers) drive the timeline, not the renaming itself.

Is this different for government agencies?

Yes. Agencies have to align file naming with their records retention schedule (in the United States, the NARA General Records Schedules and any agency-specific schedules) before any AI agent can be deployed against the data. The data-readiness work is the same; the constraint set is wider.

Should we do this before or after migrating drives?

Before. Migrating chaos produces migrated chaos. Apply the naming convention and deduplicate on the source platform, then migrate the clean estate. This roughly halves migration time and avoids re-indexing post-migration.

Where does PLUMdata fit in this process?

PLUMdata handles steps 2 through 4 of the checklist (defining a naming convention, piloting it, and applying it to every file in your Drive) with previews, supervised approval, and full undo. It is the data-management layer that sits between your raw Drive and any downstream AI agent, search index, or vector store.

Start with your own Drive

Free to scan, free to preview, private by design. Run the first step of the checklist on your real data in minutes.

Begin a free scan →