Structured vs unstructured cloud storage — what it means for your business

Your SQL databases, CRM, and ERP get governance. Your Drive, SharePoint, and Dropbox don't — and that's where roughly 80% of new business data lives. This is the strategic frame the rest of the cluster builds on.

Definitions — with examples from your stack

The textbook definitions are short and almost useless. Here are the working definitions, with what they map to in a real mid-market business.

Structured data

Data with a defined schema — rows and columns, fields and types — that is queried with SQL or a typed API. Every field has a known meaning, and the meaning doesn't change when a new row is added.

  • Databases — Postgres, MySQL, SQL Server, Snowflake.
  • SaaS systems of record — Salesforce, HubSpot, NetSuite, Workday, Stripe.
  • Spreadsheets used as databases — a finance team's monthly revenue tracker is structured if it has a stable column layout.

Unstructured data

Content in files. The file has a name, a path, a size, a modified date, and some metadata, but the meaning is inside the file — in the prose, the slides, the image, or the audio.

  • Documents — Word, PDF, Google Docs.
  • Slides — PowerPoint, Keynote, Google Slides.
  • Spreadsheets used as documents — ad-hoc analyses, one-off models, screenshots from a system.
  • Images, audio, video — screenshots, marketing assets, meeting recordings, call transcripts.
  • Email — semi-structured headers, unstructured body and attachments. See the FAQ.

The dividing line in practice

If you can answer a question about the data by writing a query and getting a typed result, it's structured. If you have to open the thing and read it, it's unstructured. That heuristic is good enough for every decision in this guide.
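A minimal sketch of that heuristic, using an in-memory SQLite table and a made-up memo. The table name, column names, and sample values are illustrative assumptions, not taken from any real system.

```python
import sqlite3

# Structured: a schema gives every field a known name and type,
# so a question becomes a query with a typed result.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (customer TEXT, amount_cents INTEGER)")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?)",
    [("Acme", 120_000), ("Acme", 80_000), ("Globex", 50_000)],
)
(acme_total,) = conn.execute(
    "SELECT SUM(amount_cents) FROM invoices WHERE customer = ?", ("Acme",)
).fetchone()

# Unstructured: the same kind of fact sits inside prose. File metadata
# (name, size, modified date) says nothing about it; something has to
# read the content itself.
memo_text = "Following the call, Acme agreed to settle both open invoices next month."
mentions_acme = "acme" in memo_text.lower()
```

The query returns an integer you can add, compare, or report on. The memo only yields a crude keyword match, and anything more precise means a person or a model reading the text.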

Why the unstructured pile grew faster than anyone planned for

In 2010, a mid-sized business had maybe a hundred thousand files across a few file shares. In 2026, the same business — at the same headcount — has tens of millions of objects across Drive, SharePoint, OneDrive, Dropbox, Box, Slack, Teams, and however many SaaS products attach files to records.

Three forces compounded:

  • SaaS made file creation effectively free. Unlimited storage per seat removed the pressure that used to make people delete things.
  • Collaboration tools made file duplication frictionless. "Copy of...", "Sarah's copy", and shared drafts became normal artefacts of how work happens.
  • Nobody made governance cheaper. Records management, classification, and lifecycle work all remained labour-intensive even as the volume scaled. Most SMBs stopped doing it.

The result is that the structured side of the business looks roughly the way it did fifteen years ago, while the unstructured side has grown by two orders of magnitude. Governance investment did not keep pace.

Four business consequences SMB leaders feel first

1. Storage cost creep

Per-seat unlimited storage doesn't cost zero. You pay for it through the premium plan tier, and that cost grows with every seat you add. Worse, the cost of moving the estate later (to a cheaper plan or a different vendor) scales with its size. The estate is the switching cost.

2. eDiscovery and legal hold take weeks, not days

When legal asks for "all documents related to the Acme contract negotiation between June and September 2025", the answer ought to take an hour. In an ungoverned estate, it takes a fortnight, because the keyword search returns thousands of files with names that don't indicate their contents and folders that don't indicate their topics.

3. Retention exists on paper but not in practice

Your records-retention policy says client files are kept for seven years and then disposed of. In practice, every file the company ever created is still there, because nobody can identify which files belong to which retention class. The policy is a document; the disposition is not happening.
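To make the gap concrete, here is a rough sketch of what disposition needs once a file carries a retention class. The seven-year client-file period comes from the example above; the class names, the two-year draft period, and the idea that the clock starts on a "closed on" date are assumptions for illustration.

```python
from datetime import date

# Hypothetical retention schedule: class name -> years to keep.
RETENTION_YEARS = {"client-file": 7, "working-draft": 2}


def disposal_due(retention_class: str, closed_on: date) -> date:
    """Return the date on which a record becomes eligible for disposal."""
    years = RETENTION_YEARS[retention_class]
    try:
        return closed_on.replace(year=closed_on.year + years)
    except ValueError:
        # 29 February in a non-leap target year: fall back to 28 February.
        return closed_on.replace(year=closed_on.year + years, day=28)


def is_due(retention_class: str, closed_on: date, today: date) -> bool:
    """True once the retention period for this record has elapsed."""
    return today >= disposal_due(retention_class, closed_on)
```

The arithmetic is trivial. The hard part is the first argument: without a retention class attached to the file or its folder, there is nothing to look up.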

4. A security surface nobody owns

The worst breaches start with files people forgot existed: the board archive a former CFO shared via link in 2022, the customer-list spreadsheet attached to an old support ticket, the credentials someone pasted into a draft. The ungoverned unstructured estate is where these accumulate. When something leaks, nobody can answer the question "how did this end up there?"

Why AI raises the stakes on the unstructured side specifically

The structured side already has guardrails. Tables have schemas. CRMs have role-based access. Query logs are audited. The blast radius of an AI agent connected to the CRM is bounded by what queries it can run.

The unstructured side has none of that. Files don't have schemas; they have names, and the names are often meaningless. Access is governed by sharing links that accumulate over years. There is no query log; there is just a search bar.

When you connect an AI assistant to that estate, three things happen at once:

  • The agent surfaces files faster than anyone would manually. The convenience is real.
  • The agent also surfaces files nobody intended to share, because the historical access controls weren't set up to defend against a search-everything tool.
  • The agent gives confidently wrong answers, because the retrieval layer has no way to distinguish the canonical file from four stale duplicates. (For the mechanics of this, see the companion guide on why your AI assistant can't search your drive properly.)

AI doesn't create these problems. It exposes them at a scale where they stop being tolerable.

Six signs your unstructured estate is out of control

You can run these checks in an afternoon. If three or more are true, the estate has crossed the line where ad-hoc governance stops working.

  • Duplicate ratio above 15%. Roughly one file in seven, or more, is a copy of another file.
  • "Untitled / Copy of / Screenshot" rate above 10%. A meaningful fraction of files carry no signal in their name.
  • Orphaned ownership after offboarding. Files owned by departed employees still exist and still have active read-shares. Nobody has reassigned them.
  • Folder depth above seven. Trees deeper than seven levels are usually personal-archive patterns that have leaked into shared space.
  • Top-level folders named after people. Shared drives organised by "Sarah's stuff", "John's docs" — a sure sign nobody owns the taxonomy.
  • No measurable retention disposition in the past 12 months. The policy exists; nothing has been disposed of under it.
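Three of these checks (duplicate ratio, weak-name rate, folder depth) can be approximated with a short script. The sketch below assumes a locally synced copy of the drive and treats only byte-identical files as duplicates. The weak-name pattern and the function names are hypothetical, and ownership and retention checks need the platform's admin tooling instead.

```python
import hashlib
import re
from collections import Counter
from pathlib import Path, PurePosixPath

# Hypothetical pattern for names that carry no signal.
WEAK_NAME = re.compile(r"^(untitled|copy of|screenshot)", re.IGNORECASE)


def scan(root: Path) -> list[tuple[str, str]]:
    """Return (relative path, SHA-256 of content) for every file under root."""
    records = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads each file fully into memory; fine for a sketch,
            # not for very large media files.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            records.append((path.relative_to(root).as_posix(), digest))
    return records


def estate_metrics(records: list[tuple[str, str]]) -> dict[str, float]:
    """Compute duplicate ratio, weak-name rate, and maximum folder depth."""
    total = len(records)
    if total == 0:
        return {"duplicate_ratio": 0.0, "weak_name_rate": 0.0, "max_folder_depth": 0}
    copies_per_hash = Counter(digest for _, digest in records)
    # Every file beyond the first with the same content counts as a duplicate.
    duplicates = sum(n - 1 for n in copies_per_hash.values())
    weak = sum(1 for p, _ in records if WEAK_NAME.match(PurePosixPath(p).name))
    # Depth counts folders only, so "a/b/file.txt" has depth 2.
    depth = max(len(PurePosixPath(p).parts) - 1 for p, _ in records)
    return {
        "duplicate_ratio": duplicates / total,
        "weak_name_rate": weak / total,
        "max_folder_depth": depth,
    }
```

Calling `estate_metrics(scan(Path("/path/to/synced/drive")))` gives three numbers to compare against the 15%, 10%, and seven-level thresholds above. Near-duplicates with small edits will not be caught by content hashing, so treat the duplicate ratio as a lower bound.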

What "bringing it under one standard" actually means

It does not mean a migration. It does not mean a deletion project. It means three things, applied in place on the platform where your team already works:

  1. One naming convention that captures topic, document type, and date in a consistent format.
  2. A folder taxonomy shallow enough that a new joiner can place a new file correctly without asking. Three to five levels, organised by function and year.
  3. A retention class on every file, or at minimum on every folder, that maps to the organisation's records schedule. This is what makes disposition happen.
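As an illustration of the first point, the sketch below builds and checks names under one possible convention: ISO date, then document type, then a hyphenated topic, separated by underscores. The field order, separators, and list of document types are assumptions made for this example. Your own one-page convention may choose differently.

```python
import re
from datetime import date

# Hypothetical convention: <ISO date>_<document type>_<topic>.<extension>
DOC_TYPES = {"contract", "proposal", "minutes", "report"}
PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})"
    r"_(?P<doc_type>[a-z]+)"
    r"_(?P<topic>[a-z0-9]+(?:-[a-z0-9]+)*)"
    r"\.(?P<ext>[a-z0-9]+)$"
)


def build_name(day: date, doc_type: str, topic: str, ext: str) -> str:
    """Build a filename that follows the hypothetical convention."""
    if doc_type not in DOC_TYPES:
        raise ValueError(f"unknown document type: {doc_type}")
    # Lowercase the topic and join its words with hyphens.
    slug = "-".join(re.findall(r"[a-z0-9]+", topic.lower()))
    if not slug:
        raise ValueError("topic must contain at least one letter or digit")
    return f"{day.isoformat()}_{doc_type}_{slug}.{ext.lower().lstrip('.')}"


def follows_convention(filename: str) -> bool:
    """Check shape, document type, and that the date is a real calendar date."""
    match = PATTERN.match(filename)
    if match is None or match["doc_type"] not in DOC_TYPES:
        return False
    try:
        date.fromisoformat(match["date"])
    except ValueError:
        return False
    return True
```

For example, `build_name(date(2025, 6, 3), "contract", "Acme MSA renewal", "PDF")` yields `2025-06-03_contract_acme-msa-renewal.pdf`. A checker like `follows_convention` is what lets you measure how much of a drive already complies before and after a rollout.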

The detail of how to design the naming convention itself is in the AI data readiness checklist. For a side-by-side of what the result looks like in practice, see before / after — what a clean drive looks like for AI integration.

90-day starting plan

If you have one quarter and one engineer, this is the plan that produces a defensible result without trying to solve the whole estate at once.

Phase | Weeks | Outcome
Pick one team, one drive | 1 | The team whose retrieval pain is loudest. Their drive is the pilot.
Measure | 1 | File count, duplicate ratio, weak-name rate, orphaned-owner count, folder depth.
Design the convention | 2 | One page. Covers date format, separator, capitalisation, field order, required fields per document type.
Pilot on a 500-file folder | 1 | Edge cases produce rule additions, not exceptions.
Roll out across the drive, with previews | 3 | Every file follows the standard. Audit log of every rename intact.
Deduplicate | 2 | "Final v2 FINAL" families resolved to canonical files. Archive sibling folder for the rest.
Re-test against the AI assistant | 1 | Same ten queries as week one. Retrieval accuracy delta is the headline.
Decide on expansion | 1 | Apply the now-tested convention to the next team. Repeat.
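One way to compute the retrieval accuracy delta for the re-test step is sketched below. It assumes "accuracy" means the share of test queries whose canonical file shows up in the assistant's top few results. The metric, the cut-off of three, and the file IDs are illustrative choices, not a fixed standard.

```python
def hit_rate(results: dict[str, list[str]], expected: dict[str, str], k: int = 3) -> float:
    """Share of queries whose expected file ID appears in the top-k results."""
    if not expected:
        raise ValueError("expected must contain at least one query")
    hits = sum(
        1 for query, wanted in expected.items() if wanted in results.get(query, [])[:k]
    )
    return hits / len(expected)


# Hypothetical test set: each query maps to the ID of its canonical file.
expected = {"acme msa": "file-101", "q3 board pack": "file-202"}

# Ranked file IDs the assistant returned before and after the cleanup.
before = {"acme msa": ["file-900", "file-901"], "q3 board pack": ["file-202"]}
after = {"acme msa": ["file-101"], "q3 board pack": ["file-202", "file-777"]}

delta = hit_rate(after, expected) - hit_rate(before, expected)
```

Keeping the query set and the expected files fixed between the two runs is what makes the delta meaningful.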

The honest version of this plan is that the design choices get made in the pilot, not before. Teams that try to specify the convention exhaustively in advance spend a quarter on a document and zero quarters on cleanup. Build the convention by applying it.

Frequently asked questions

What is the difference between structured and unstructured data?

Structured data lives in rows and columns with a schema — Postgres tables, Salesforce records, NetSuite ledgers. Unstructured data lives in files with content — Word documents, PDFs, slide decks, images, audio, video. Structured data is queried with SQL or an API; unstructured data is searched, opened, and read. The distinction matters because every governance, security, and AI capability your business buys treats the two categories completely differently.

Is email structured or unstructured?

Email is semi-structured. The headers (from, to, date, subject) are structured fields you can filter on. The body and the attachments are unstructured content. For data-management purposes, most teams treat email as unstructured because the value and the risk live in the body and the attachments, and those are what AI agents read.
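The split is easy to see with Python's standard `email` module. The message below is invented for illustration.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

raw = """From: sarah@example.com
To: legal@example.com
Subject: Acme contract redlines
Date: Mon, 14 Jul 2025 09:30:00 +0000

Hi all, attached are the redlines. Clause 4 still needs a decision.
"""

msg = message_from_string(raw)

# Structured part: named header fields you can filter and sort on.
headers = {name: msg[name] for name in ("From", "To", "Subject", "Date")}
sent_at = parsedate_to_datetime(msg["Date"])  # a typed datetime value

# Unstructured part: free text that has to be read to be understood.
body = msg.get_payload()
```

Filtering by sender or date range is a lookup on `headers`. Finding out that clause 4 is unresolved requires reading `body`.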

Isn't this just records management with a new name?

Records management is a subset of the unstructured-data problem, and a well-run records-management programme is excellent preparation for AI integration. The difference is urgency and scope: records management has historically focused on documents the organisation is legally obligated to keep, whereas AI integration affects every file the agent can read, including the working drafts and the screenshots and the meeting recordings. The disciplines overlap, but AI integration covers a wider surface area.

Should we move all our unstructured data to a data lake?

Usually no, and certainly not as the first move. A data lake is a destination optimised for analytics and machine learning over already-curated data. Most SMB unstructured estates are not curated — moving the raw estate to a lake produces an expensive, queryable mess. Curate in place first (naming, deduplication, classification), then move only the subset that has an analytics use case. The remainder stays on the collaboration platform where your team actually works.

Where do Microsoft Purview and Google Vault fit?

They are governance and eDiscovery layers that sit on top of your unstructured estate: they let you classify, retain, and search content under policy. They are not a substitute for clean source data. A Purview retention label applied to a folder called "Untitled documents (1)" controls how long its files are kept, but does not tell you what those files are. The governance layer is more effective on top of a curated estate than on top of a chaotic one.

How much of our spend is on dead files?

In SMB engagements we typically see 20% to 40% of unstructured-storage spend going to files no one has opened in over two years, and another 10% to 20% on duplicate copies of files that are still live. The exact ratio depends on how long the estate has accumulated and how much staff turnover there has been. A duplicate-and-staleness scan returns the number in hours, and the financial case for cleanup writes itself.
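A back-of-the-envelope version of the staleness half of that scan might look like the sketch below. It assumes storage spend is roughly proportional to bytes stored and that you can export a size and last-opened date per file. Both are simplifications, and the 730-day cut-off is just one reading of "two years".

```python
from datetime import date, timedelta


def stale_share(
    files: list[tuple[int, date]], today: date, stale_after_days: int = 730
) -> float:
    """Fraction of stored bytes in files not opened within the cut-off."""
    cutoff = today - timedelta(days=stale_after_days)
    total = sum(size for size, _ in files)
    if total == 0:
        return 0.0
    stale = sum(size for size, last_opened in files if last_opened < cutoff)
    return stale / total


# Hypothetical export: (size in bytes, date last opened).
files = [(300, date(2021, 1, 1)), (700, date(2025, 5, 1))]
share = stale_share(files, today=date(2026, 1, 1))
```

Multiplying `share` by the annual storage bill gives a first estimate of what the dead files cost.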
