AI in Permitting: Where Is All This Data Coming From?

Everyone in permitting is talking about the value of data: open datasets, AI-ready records, the dashboard that finally shows the whole pipeline. Almost nobody asks the first question: where is the data coming from?

The permits themselves are public records, so that part is free. Turning a county's worth of scattered, half-digitized filings into something you can sort, query, and sell is the expensive part, and that's where the money and the moats are. The records are public. The structure is the product. This week's roundup has the live version of that: an $800 million acquisition of a development-data platform, and a public-records boom that turns out to be the same industry feeding itself. This piece is the map underneath those headlines.

To make sense of it, I sort the ways of getting data into nine methods. Most tools mix two or three, and the marketing rarely tells you which. The nine aren't a fact about the industry; they're how I keep the mess straight, and the thing that keeps them honest is a single question I ask of any product: before someone built it, what state was the data in? The answer is always one of four things. It didn't exist yet. It existed, but you couldn't use it. It existed, and you could use it, but you didn't control it. Or someone had already structured it. Those four states are the spine I hang everything on, and the nine methods are how I split the work inside each one. Your tenth method, if you've got one, probably lands in one of these buckets, or it turns out not to be acquisition at all.

Here's the whole set, grouped:

The data didn't exist yet

Synthesize: model data nobody ever recorded. Predict where the wetlands probably are.
Capture: observe the physical world for the first time. Sensors, field survey, computer vision over aerial imagery.

It existed, but not as usable data

Recover: digitize analog records. The paper and microfiche in agency basements.
Solicit: get people to hand it to you, one contributor at a time.

It exists, but you don't control it

Scrape: harvest public records scattered across thousands of jurisdictions.
Extract: file FOIA and records requests to pry loose what's held but unposted.
Mandate: run the intake everyone is required to use, and the records come to you.

Someone already structured it

Buy: pay another holder for their structured set, then clean it and resell it.
Aggregate: combine several of these and standardize to one form.

That's the four states and the nine methods. One more thing belongs on the list only so I can take it back off:

Platform: move other people's data layers instead of fetching anything. Ask the question of it and it fails: the data was already structured by someone else, and the platform never acquires it; it just distributes it. That's why it's neither a tenth method nor a fifth state. It's the exception that tells you the four are about acquisition in the first place.
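The grouping above can be written down as a small lookup table. This is only a sketch of how I'd encode the taxonomy: the names PriorState, METHODS, and group_by_state are made up for illustration, and treating Platform as an error rather than a category is one possible reading of the exception described above.

```python
from enum import Enum


class PriorState(Enum):
    """The four states the data could be in before someone built the product."""
    DID_NOT_EXIST = "The data didn't exist yet"
    NOT_USABLE = "It existed, but not as usable data"
    NOT_CONTROLLED = "It exists, but you don't control it"
    ALREADY_STRUCTURED = "Someone already structured it"


# The nine methods, each mapped to the state it starts from.
METHODS = {
    "Synthesize": PriorState.DID_NOT_EXIST,
    "Capture": PriorState.DID_NOT_EXIST,
    "Recover": PriorState.NOT_USABLE,
    "Solicit": PriorState.NOT_USABLE,
    "Scrape": PriorState.NOT_CONTROLLED,
    "Extract": PriorState.NOT_CONTROLLED,
    "Mandate": PriorState.NOT_CONTROLLED,
    "Buy": PriorState.ALREADY_STRUCTURED,
    "Aggregate": PriorState.ALREADY_STRUCTURED,
}


def group_by_state(product_methods):
    """Group the methods a product uses by the prior state of its data.

    Anything outside the nine (Platform, for example) is rejected,
    because it distributes data rather than acquiring it.
    """
    grouped = {}
    for name in product_methods:
        if name not in METHODS:
            raise ValueError(f"{name!r} is not an acquisition method")
        grouped.setdefault(METHODS[name], []).append(name)
    return grouped
```

Calling group_by_state(["Scrape", "Extract", "Buy"]) puts the first two under the "you don't control it" state and the third under "already structured", which is the kind of mix most tools turn out to be.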

Now the walk-through. The methods are simple to name; the stakes show up at the end, in who keeps the data.

Start with the data that never existed as a record. When a model predicts where ephemeral streams are likely to run, filling the gaps the federal maps leave blank, that prediction travels downstream as though someone had walked the site. A predicted wetland is a guess treated as a finding, and the gap matters most when the same model can steer a developer either toward avoiding a parcel or toward clearing it. Capture sits right next to it, and the only difference is that it measures the world instead of inferring it. Sensors and surveys are established methods; what's new is running computer vision over aerial imagery, which can flag the deck you built without a permit in overhead photos of an entire jurisdiction at once.

Then there's the data that existed but not in a form you could use. Recovery is the method the market mostly skips because it rarely pays off. One state is spending tens of millions of dollars to digitize records it already owns, and no vendor downstream will do that work for free. Solicit is the gentler cousin: some of the largest open geographic datasets in the world were built contributor by contributor, by people who simply handed over their data.

The largest group is data that's already usable but sits outside your control. Scraping is the visible one. The records are public, so the product isn't the data; it's the work of pulling a thousand-plus separate portals into a single queryable schema. Extract is the loud one, and the one in this week's news: practitioners describe an arms race of automated FOIA requests, with some firing every 24 hours and scripts racing to reach the data first. Mandate is the quietest and the deepest. The company running the intake portal already holds every application by virtue of its position, no scraping required, and people forget this one precisely because the system of record already holds the thing everyone downstream is scrambling to reconstruct.
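To make "a single queryable schema" concrete, here is a minimal sketch of the normalization step that follows a scrape. Everything in it is an assumption for illustration: the two portal layouts, the field names, the PermitRecord schema, and the sample rows are invented, not taken from any real county system.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class PermitRecord:
    """One hypothetical common schema for permits from any portal."""
    jurisdiction: str
    permit_id: str
    permit_type: str
    filed_on: date


def from_portal_a(row: dict) -> PermitRecord:
    # Hypothetical portal A: padded IDs, US-style dates.
    return PermitRecord(
        jurisdiction="County A",
        permit_id=row["PermitNo"].strip(),
        permit_type=row["Type"].strip().lower(),
        filed_on=datetime.strptime(row["FiledDate"], "%m/%d/%Y").date(),
    )


def from_portal_b(row: dict) -> PermitRecord:
    # Hypothetical portal B: different keys, ISO dates, upper-case types.
    return PermitRecord(
        jurisdiction="County B",
        permit_id=row["id"].strip(),
        permit_type=row["category"].strip().lower(),
        filed_on=date.fromisoformat(row["submitted"]),
    )


# Made-up sample rows, one per portal.
records = [
    from_portal_a({"PermitNo": " B-1042 ", "Type": "Deck", "FiledDate": "03/14/2025"}),
    from_portal_b({"id": "2025-000871", "category": "DECK", "submitted": "2025-02-02"}),
]

# Once both rows share a schema, one query covers both jurisdictions.
decks = sorted(
    (r for r in records if r.permit_type == "deck"),
    key=lambda r: r.filed_on,
)
```

The adapters are trivial on their own. The cost is in writing and maintaining one for each of a thousand-plus portals, which is why the structure, not the record, is what gets sold.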

The fourth group is data someone already structured. Buying it carries a different provenance story from collecting it raw, because you're re-enclosing something that was enclosed once already: an aggregator pays several vendors for permit data, standardizes it, and sells jurisdiction-level timelines on top. Aggregate is really a meta-method that rides on the others, and it's where most things calling themselves "data platforms" actually sit.

The methods themselves are old. Counties have been scraped and FOIA'd for years. What changed is that AI dropped the cost of nearly every one of them at the same time. Language models structure scraped text that once required a data-entry team. Computer vision turns raw imagery into capture. OCR makes recovery cheaper. Agents file records requests on a timer. The race didn't start because someone invented a new way to get data. It started because the old ways got cheap enough to run at scale, all inside the same year or two.

Naming the method is the easy part. The harder question is what happens to the structured data afterward, and that gets settled in the contract, not the technology. One common posture grants you use-only rights over data built from public records, forbids resale, and requires you to delete it when you walk away. Another posture states that the agency retains the structured data and the vendor only operates it. That second version exists. It's just rarer, because almost nobody writes it unless the buyer insists.

So when someone offers to solve your data problem, the method tells you how they got it. The contract tells you who ends up owning your own records. Ask the second question.

AI in Permitting is a recurring column on Permitting Tech covering how artificial intelligence is entering permitting workflows. Written by Boon Sheridan.
