AI Pilots: The First 90 Days | Operating Model Guide

Overview

One map for the whole program

Everything in this guide hangs off a single picture: three 30-day phases, a layer of work that never stops, and an outcome that extends past day 90.

The framework is built so that by day 90 there are deployed pilots, real measurements against documented starting points, and a written playbook. Ninety days is the window in which a new program either earns organizational trust or loses it, and that combination of evidence is what converts a mandate into momentum.

It deliberately starts with two pilot functions rather than a company-wide rollout. Two is enough to prove the approach works in more than one context, because a method that works in only one function is a one-off, not a framework. Two is also few enough that each pilot gets real attention, real measurement, and real course correction. Expansion to the rest of the company comes later, and only after the pilots have proven themselves. That sequencing is the difference between building institutional capability and running an experiment.

Figure 1. The master map. Numbers 1 through 12 correspond to the twelve deliverable sections in this guide.

How to read this diagram

The three white columns are the three 30-day phases. They run in sequence from left to right: the program listens and measures first, deploys second, and proves and packages third. Each column contains four boxes, which are that month's concrete deliverables. The red numbered circles are reference markers, and each number matches a fully explained section below, so nothing in this picture is left to interpretation.

The dark band beneath the columns is work with no end date. Executive communication, compliance gates, and enablement run continuously underneath all three phases, which is why they are drawn as a foundation rather than as steps. The red box at the bottom is the program's true finish line: a pilot only counts as a success after six months of sustained, documented results, and only then does its approach expand to other functions.

Days 1-30

Listen and baseline

The first month produces no AI deployments, and that is intentional. Everything deployed in month two depends on the quality of what is learned in month one. Deploying before understanding the work is how AI ends up layered on top of broken processes, which simply automates the brokenness.

1. Pick the two pilots

What it is

Working sessions with functional leaders across the company to select the two functions where the AI pilots will run, producing a short selection memo that names the choices and the reasoning behind each.

Why it matters

Pilot selection is the single highest-leverage decision in the entire 90 days, because the wrong pilots can sink the program regardless of execution quality. A good pilot has four properties. The functional leader actively wants it, because a pilot imposed on a reluctant leader will quietly starve. The process has meaningful volume, because improvement claims need enough repetitions to be statistically credible. The baseline can actually be measured, because a process nobody can quantify today cannot prove improvement tomorrow. And the data sensitivity is manageable, because a first pilot that needs six months of security review before touching production data produces nothing in 90 days.

Choosing pilots together with functional leaders, rather than for them, also sets the operating posture of the whole program: this is a partnership, not an inspection.
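The four properties of a good pilot can be treated as a simple screen applied to each candidate function. The sketch below is illustrative only: the class name, the field names, and both thresholds are assumptions made for the example, not values from this guide.

```python
from dataclasses import dataclass


@dataclass
class PilotCandidate:
    """Hypothetical record for one candidate function (all fields are assumptions)."""
    name: str
    leader_wants_it: bool       # the functional leader actively wants the pilot
    monthly_volume: int         # units of work flowing through the process per month
    baseline_measurable: bool   # cost, cycle time, and errors can be quantified today
    security_review_days: int   # estimated review time before touching production data


# Illustrative thresholds; a real program would set its own.
MIN_MONTHLY_VOLUME = 200
MAX_SECURITY_REVIEW_DAYS = 30


def failed_properties(c: PilotCandidate) -> list[str]:
    """Return the properties the candidate fails (an empty list means it passes the screen)."""
    failures = []
    if not c.leader_wants_it:
        failures.append("leader sponsorship")
    if c.monthly_volume < MIN_MONTHLY_VOLUME:
        failures.append("meaningful volume")
    if not c.baseline_measurable:
        failures.append("measurable baseline")
    if c.security_review_days > MAX_SECURITY_REVIEW_DAYS:
        failures.append("manageable data sensitivity")
    return failures
```

Returning the list of failed properties, rather than a single yes or no, gives the selection memo its reasoning for free: each rejected candidate comes with the specific property it missed.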

2. Baseline the workflows

What it is

For each pilot, a detailed map of how the work actually flows today, step by step, person by person, system by system, paired with quantified measurements of current performance.

What gets produced

A baseline card for each pilot process: a one-page record of the process as it stands, covering what it costs per unit of work, how long a unit takes from start to finish (cycle time), how often errors occur, and how much volume flows through. The card also names where each measurement came from, so it can be reproduced.
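One way to picture a baseline card is as a small, immutable record. The following is a minimal sketch, assuming field names and units that this guide does not prescribe; the `missing_sources` helper is a hypothetical check that every measurement names where it came from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineCard:
    """One-page record of a pilot process as it stands before any AI deployment."""
    process: str
    cost_per_unit: float      # currency per unit of work
    cycle_time_hours: float   # start-to-finish time for one unit
    error_rate: float         # fraction of units with an error, 0.0 to 1.0
    monthly_volume: int       # units of work per month
    sources: dict[str, str]   # metric name -> where the measurement came from

    def missing_sources(self) -> list[str]:
        """Metrics with no recorded source, which would make the card irreproducible."""
        metrics = ["cost_per_unit", "cycle_time_hours", "error_rate", "monthly_volume"]
        return [m for m in metrics if not self.sources.get(m)]
```

The record is frozen on purpose in this sketch: once written, the starting point should not be reassigned after results begin to arrive.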

Why it matters

Every improvement claim this program will ever make is a comparison against these numbers. Without a documented starting point, "the pilot improved quality" is an opinion; with one, it is arithmetic. Baseline cards also protect the program's credibility with executives: when results are reported in month three, no one can argue about whether the starting point was real, because it was written down before any AI touched the process. The mapping exercise has a second payoff as well. The process as documented and the process as practiced are almost always different things, and AI must be designed against the real one.
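The "arithmetic" in question is a plain comparison against the baseline. As a hedged illustration (the function name and the sign convention are assumptions), a relative change can be computed like this:

```python
def relative_change(baseline: float, current: float) -> float:
    """Signed change relative to baseline: -0.25 means 25% lower than the starting point."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero to compute a relative change")
    return (current - baseline) / baseline
```

For example, with a hypothetical baseline cycle time of 40 hours and a current value of 30 hours, `relative_change(40.0, 30.0)` returns `-0.25`, a 25 percent reduction. Whether a negative number is good depends on the metric: lower is better for cost, cycle time, and error rate.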

3. Open the intake channel

What it is

A simple, visible channel through which anyone in the company can submit an idea for where AI could help: a submission form, a published description of how ideas are evaluated, and a commitment that every submitter hears back with a decision and a reason.

Why it matters

Two reasons, one offensive and one defensive. Offensively, the people closest to the work see opportunities no central team ever will, and the intake channel turns those observations into the pipeline for the next wave of pilots, so the expansion after day 90 never starts from a blank page. Defensively, when a company gets serious about AI, employees begin using AI tools on their own, with or without permission. An official front door, opened early, channels that energy into governed paths instead of ungoverned ones. The intake process does not need to be sophisticated in month one. It needs to exist, be known, and keep its promises.

4. Draft the governance rules

What it is

The first written version of the rules under which AI may be used at the company, covering four things: data tiers (a classification of company data by sensitivity, and which AI tools may touch which tier), acceptable use (what employees may and may not do with AI tools), human-in-the-loop requirements (which decisions must always have a person accountable for the final call), and a model risk review step (who signs off before any AI system touches a production process).
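The data-tier rule lends itself to a lookup table. The sketch below assumes four tier names and three placeholder tool names, none of which come from this guide; it only shows the shape of a rule saying which AI tools may touch which tier.

```python
from enum import IntEnum


class DataTier(IntEnum):
    """Hypothetical sensitivity tiers; a higher value means more sensitive data."""
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


# Highest tier each tool is approved to touch. Tool names are placeholders.
TOOL_MAX_TIER = {
    "public_chat_assistant": DataTier.PUBLIC,
    "enterprise_assistant": DataTier.CONFIDENTIAL,
    "in_house_model": DataTier.RESTRICTED,
}


def is_use_allowed(tool: str, tier: DataTier) -> bool:
    """A tool may touch data at or below its approved tier; unknown tools are denied."""
    max_tier = TOOL_MAX_TIER.get(tool)
    return max_tier is not None and tier <= max_tier
```

Denying unknown tools by default is a design choice in this sketch: a tool that has not been through review has no approved tier, so it is allowed to touch nothing.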

Why it matters

Sequencing is the entire point. Governance is drafted in month one, before a single deployment, because governance written after deployment is governance written around whatever already shipped. For a company whose business runs on sensitive identity and fraud data, regulators, bank partners, and customers will all eventually ask the same question: how do you control this? The answer must be that the controls came first.

There is also a practical payoff. Clear guardrails agreed early make every later approval faster, because the security and compliance conversation happens once at the framework level instead of repeatedly at every deployment. Done this way, governance is an accelerant, not a brake.

Days 31-60

Prioritize and deploy

Month two converts month one's understanding into committed decisions and live deployments. The order inside this phase matters: score first, commit second, deploy third, and define measurement before any results arrive.

5. Score ROI and feasibility

What it is

Every candidate opportunity, drawn from the pilot workflow maps and the intake channel, scored on two axes and ranked into a portfolio. The first axis is return on investment: the projected improvement against the baseline card, in cost reduced, cycle time shortened, and errors avoided. The second axis is feasibility, which combines three questions: is the data this process needs actually accessible and clean, how sensitive is that data under the governance tiers, and how much human behavior change does the improvement require.

Why it matters

Without a scoring framework, prioritization defaults to whoever argues loudest or holds the most senior title. A two-axis score replaces advocacy with arithmetic and gives executives a defensible answer to "why this process and not that one." The feasibility axis exists to prevent a predictable failure: the opportunities with the largest theoretical returns often sit on the most sensitive data and the most entrenched habits, and a program that chases only the biggest number ships nothing in its first year. The right early portfolio balances return against the realistic odds of landing it.
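A two-axis score can be sketched in a few lines. Everything specific here is an assumption for illustration: the 1-to-5 scales, the averaging of the three feasibility questions, and the choice to multiply the two axes into a single priority. The guide does not prescribe a combination rule.

```python
from dataclasses import dataclass


@dataclass
class Opportunity:
    name: str
    roi: int               # 1 (small projected gain against baseline) to 5 (large)
    data_readiness: int    # 1 (inaccessible or dirty data) to 5 (accessible and clean)
    data_sensitivity: int  # 1 (lowest governance tier) to 5 (most sensitive)
    behavior_change: int   # 1 (little change asked of people) to 5 (a lot)

    @property
    def feasibility(self) -> float:
        """Average of the three questions, inverting the two scales where higher is harder."""
        return (self.data_readiness
                + (6 - self.data_sensitivity)
                + (6 - self.behavior_change)) / 3

    @property
    def priority(self) -> float:
        """Assumed rule: multiply the axes, so a weak axis drags the whole score down."""
        return self.roi * self.feasibility


def rank(portfolio: list[Opportunity]) -> list[Opportunity]:
    """Return the portfolio with the highest priority first."""
    return sorted(portfolio, key=lambda o: o.priority, reverse=True)
```

Under these assumptions, an opportunity with the top ROI score of 5 but poor data readiness (2), maximum sensitivity (5), and maximum behavior change (5) has a feasibility of about 1.33 and a priority of about 6.67. A modest opportunity with ROI 3, readiness 5, sensitivity 2, and behavior change 2 has a feasibility of about 4.33 and a priority of about 13, so it ranks first.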

6. Write the business cases

What it is

A short formal document for each opportunity that will move forward, stating the expected impact in numbers tied to the baseline card, the cost to implement, the risks and how each will be mitigated, and, critically, the exact metrics by which success or failure will be judged.

Why it matters

The business case is a pre-commitment device. By writing down the success metrics before deployment, the program removes its own ability to retrofit a success story afterward. If the pilot works, the proof was specified in advance. If it falls short, that is visible too, and a program that can show its failures honestly is a program executives learn to trust. The business case is also where risk gets confronted on paper, in front of leadership, rather than discovered in production.

7. Deploy through gates

What it is

The actual deployment of AI into the two pilot processes, with two mandatory checkpoints, called gates, that every deployment must pass before going live: a governance (compliance) gate and an adoption gate. A gate is a checkpoint that cannot be skipped or deferred.

Why it matters

Making compliance and adoption into gates, rather than recommendations, changes their nature: they stop being things a hurried team can postpone and become structural requirements of shipping at all. One more principle governs this step. The process is redesigned around the AI capability rather than bolting a tool onto the existing process. If a workflow has nine steps and AI removes the need for three of them, the redesigned workflow has six steps. Layering a tool onto all nine just automates waste.

Figure 2. The gated deployment pipeline every opportunity travels, left to right.

How to read this diagram

Rectangles are work stages; the pipeline reads left to right. An opportunity enters only after it has been scored (deliverable 5) and its business case, with success metrics committed in writing, is approved (deliverable 6). The two red diamonds are the gates. Gate 1 is the governance checkpoint: the security, privacy, and risk owners apply the rules drafted in deliverable 4 and must sign off before the AI system touches any production data. Gate 2 is the adoption checkpoint: the people whose work changes have been trained, and a continuity plan exists so the business keeps running if the AI component fails. A deployment that has not passed both gates does not go live, full stop. The dark box at the end is a live pilot, which immediately enters the weekly measurement rhythm described in deliverable 9.
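The pipeline described above reduces to a checklist with no override. The following is a minimal sketch, assuming a hypothetical `Deployment` record whose field names are invented for the example:

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical record of one opportunity moving through the pipeline."""
    name: str
    scored: bool = False
    business_case_approved: bool = False
    governance_signoff: bool = False       # Gate 1: security, privacy, and risk owners
    users_trained: bool = False            # Gate 2, part one
    continuity_plan_exists: bool = False   # Gate 2, part two


def blockers(d: Deployment) -> list[str]:
    """List everything still standing between this deployment and go-live."""
    checks = [
        (d.scored, "not scored (deliverable 5)"),
        (d.business_case_approved, "business case not approved (deliverable 6)"),
        (d.governance_signoff, "Gate 1: governance sign-off missing"),
        (d.users_trained and d.continuity_plan_exists,
         "Gate 2: training or continuity plan missing"),
    ]
    return [reason for ok, reason in checks if not ok]


def can_go_live(d: Deployment) -> bool:
    """No override parameter on purpose: a gate cannot be skipped or deferred."""
    return not blockers(d)
```

The absence of a `force` or `skip` argument is the point of the sketch. If the only way to go live is an empty blocker list, the gates are structural rather than advisory.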

8. Define the scorecards

What it is

Standardized definitions for every metric the program will report, captured in a metrics dictionary, plus a one-page scorecard template applied identically to every pilot. The dictionary defines, precisely, terms like cost per unit, throughput, error rate, and headcount leverage. Headcount leverage measures how much more output the same team produces, which frames AI as a capacity multiplier rather than a replacement program. Each definition includes its formula and data source.
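A metrics dictionary entry pairs a name with a formula and a data source. The sketch below shows three entries under stated assumptions: the source names are placeholders, and the headcount leverage formula (output per person now, relative to output per person at baseline) is one plausible reading of the definition rather than the guide's own formula. Throughput is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class MetricDefinition:
    """One metrics-dictionary entry: the definition, its formula, and its data source."""
    name: str
    formula: str                    # human-readable formula
    data_source: str                # placeholder source names, not real systems
    compute: Callable[..., float]   # the same formula in executable form


METRICS = {
    "cost_per_unit": MetricDefinition(
        "cost_per_unit", "total process cost / units completed", "finance ledger",
        lambda total_cost, units: total_cost / units,
    ),
    "error_rate": MetricDefinition(
        "error_rate", "units with an error / units completed", "QA review log",
        lambda errors, units: errors / units,
    ),
    # Assumed formula: output per person now, relative to output per person at baseline.
    "headcount_leverage": MetricDefinition(
        "headcount_leverage", "(units / FTE now) / (units / FTE at baseline)",
        "workflow system and team roster",
        lambda units_now, fte_now, units_base, fte_base:
            (units_now / fte_now) / (units_base / fte_base),
    ),
}
```

With hypothetical numbers, a team of 10 that completed 1,000 units at baseline and completes 1,500 now has a headcount leverage of `METRICS["headcount_leverage"].compute(1500, 10, 1000, 10)`, which is `1.5`. Keeping the formula text and the executable formula side by side means every pilot's scorecard is computed the same way it is described.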

Why it matters

Two pilots measured two different ways cannot be compared, and a program that cannot compare its own pilots cannot honestly say what worked. Standard definitions set now also mean that when the program expands beyond the pilots, every new deployment plugs into the same measurement system instead of inventing its own. And as with the business cases, defining metrics before results arrive prevents cherry-picking: the scorecard is locked before anyone knows what it will show.

Days 61-90

Prove and package

Month three is where the program demonstrates results and, just as importantly, packages itself so that it no longer depends on any one person.

9. Dashboards live, cadence running

What it is

The measurement system made visible and put on a clock: live dashboards showing each pilot's scorecard metrics against its baseline, plus a standing reporting rhythm at three altitudes: weekly for working teams, monthly for functional leaders, and quarterly for executives.

Why it matters

A dashboard nobody is scheduled to look at is decoration. The cadence is what turns measurement into management: drift gets caught in days instead of quarters, and pilot teams operate knowing their numbers will be discussed on a known schedule, which is a quiet but powerful driver of follow-through. The layered rhythm also matches information to audience. Working teams need weekly operational detail; executives need the quarterly pattern. One report for everyone serves no one.

Figure 3. The reporting rhythm. The same dashboard feeds all three reviews, at three altitudes.

How to read this diagram

The dark box on the left is the single source of truth: one dashboard per pilot, always showing current scorecard metrics next to the baseline card from deliverable 2, so improvement is read as a direct comparison rather than a claim. The three rows on the right are the standing meetings that consume it, and the arrows mean exactly that: the same numbers feed all three, with no separate version of the truth prepared for any audience. The frequency drops and the altitude rises as you read down: weekly for the teams doing the work, monthly for the leaders who own the function, quarterly for the executives who decide whether the program expands. The red border on the weekly row marks where problems are meant to be caught: at the working level, within days, long before they reach an executive readout.

10. AI agent layer, version one

What it is

AI agents deployed to run the program's own administrative machinery: a first set that triages incoming intake submissions, checks whether deployments have satisfied their gates, validates that weekly status updates are complete and specific rather than vague, and assembles the briefing materials for executive readouts.
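To make "complete and specific rather than vague" concrete, here is a deliberately simple rule-based stand-in for the status-update check. It is not an AI agent and not the program's implementation: the required fields, the list of vague phrases, and the function name are all assumptions chosen for illustration.

```python
import re

REQUIRED_FIELDS = ("pilot", "metric", "current_value", "next_step")         # assumed schema
VAGUE_PHRASES = ("on track", "going well", "making progress", "no issues")  # assumed list


def review_status_update(update: dict[str, str]) -> list[str]:
    """Return the problems found in one weekly update (an empty list means acceptable)."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS
                if not update.get(f, "").strip()]
    value = update.get("current_value", "")
    if value.strip() and not re.search(r"\d", value):
        problems.append("current_value contains no number")
    step = update.get("next_step", "").lower()
    if any(phrase in step for phrase in VAGUE_PHRASES):
        problems.append("next_step is vague")
    return problems
```

A check like this is only worth automating once the weekly update format has settled, which is consistent with placing the agent layer in month three.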

Why it matters

Three reasons. First, program overhead is real: intake triage, status chasing, and readout preparation can consume a coordinator's entire week, and agents collapse that work from days to hours. Second, the placement in month three is a deliberate statement of method. Agents automate the intake process, the gates, and the reporting cadence only after those things exist and demonstrably work, because automating a process before it works just produces failure at higher speed, which is exactly the mistake this program exists to prevent in every other function. Third, this deliverable is the program's proof of its own philosophy: the system that governs AI adoption is itself run by AI, built the right way. Process first, baseline first, automation second.

11. Playbook and training, version one

What it is

The repeatable method, written down: a playbook documenting every step of the framework so far (how to select a process, build a baseline card, score ROI and feasibility, write the business case, pass the gates, and stand up the scorecard), plus training materials that teach functional teams to run the early steps themselves.

Why it matters

This is the deliverable that distinguishes institutional capability from individual heroics. If the method lives only in the program lead's head, the company has rented an outcome; if it lives in a playbook that others can execute, the company owns a capability. There is a scaling argument too: one person cannot personally run AI adoption across an entire company and was never supposed to. The endgame is functional teams identifying and baselining their own opportunities using the playbook, with the central program providing scoring, governance, and measurement. Version one will be imperfect, and that is fine. A real playbook shipped at day 90 and revised quarterly beats a perfect one that never ships.

12. Executive readout and expansion roadmap

What it is

The day-90 presentation to executive leadership, in three sections. Results: each pilot's performance against its baseline, on the metrics committed in writing back in month two, including anything that fell short. Risks: what was encountered, how it was handled, what remains open. Roadmap: which opportunities from the scored portfolio and intake pipeline the program recommends for the next wave, and what resources that wave requires.

Why it matters

This is the moment the program converts 90 days of work into a mandate for the next 90. The structure is designed for credibility: results are reported against pre-committed numbers, shortfalls are presented alongside wins, and the expansion request is grounded in a scored pipeline rather than enthusiasm. Executives fund programs that demonstrate control, and control is exactly what the baseline cards, gates, and scorecards were built to demonstrate.

No end date

The always-on layer

Beneath the three phases sits a band of work that never finishes. It is drawn as a separate layer in the master map so that no reader mistakes this work for phase tasks that eventually complete.

Executive communication

Regular, unprompted updates to leadership begin in week one, long before there are results to show. Executive sponsorship is the program's oxygen, and sponsorship survives bad news but not surprises. A leadership team that has heard a steady, honest signal for three months receives the day-90 readout as a continuation, not a reveal.

Compliance gates

The governance rules drafted in month one are not a document that gets filed. They are a living checkpoint applied to every deployment, every expansion, and every new tool, forever. In a regulated, data-sensitive business the obligation never closes, and a program that treats governance as a milestone it already passed is one incident away from being shut down.

Enablement

Every deployment changes someone's daily work, and people do not adopt new ways of working because a tool appeared. Training, support, and visible responsiveness run continuously, because adoption is not an event at launch. It is a curve that has to be actively maintained, and the moment enablement stops, old habits return and the measured gains decay.

The finish line that matters

Day 90 and beyond: validated, then expanded

The framework defines a status that no pilot can claim at day 90: Validated. A pilot earns it only after sustaining its documented improvement for six full months, on the same scorecard, against the same baseline, with no decay.

The six-month bar exists because of a pattern anyone who has run transformation work will recognize. New deployments enjoy a burst of attention, novelty, and management focus, and almost anything improves for six weeks under those conditions. The honest question is what the numbers look like after the attention moves on. Six months of sustained performance is the difference between an improvement and an anecdote.
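The six-month bar can be expressed as a check over monthly scorecard values. This sketch makes one interpretive assumption and labels it as such: "no decay" is read as every one of the most recent six months meeting the target committed in the business case. The function and parameter names are hypothetical.

```python
def is_validated(monthly_values: list[float], committed_target: float,
                 lower_is_better: bool = True, required_months: int = 6) -> bool:
    """True only if the most recent `required_months` months all meet the committed target."""
    if len(monthly_values) < required_months:
        return False  # not enough history yet, however good the early numbers look
    recent = monthly_values[-required_months:]
    if lower_is_better:
        return all(v <= committed_target for v in recent)
    return all(v >= committed_target for v in recent)
```

With a hypothetical cycle-time target of 32 hours, five good months are not enough, and a single month at 35 hours inside the window fails the check. The early burst of attention cannot carry a pilot to Validated on its own.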

Validated status is also the gate for expansion: only a Validated pilot's playbook gets promoted for rollout to other functions, which means the company never scales an unproven pattern.

Figure 4. The status lifecycle every opportunity moves through, left to right.

How to read this diagram

Each pill is a status, and every opportunity in the program holds exactly one at any time, moving left to right and never skipping a step. The definitions are in the table below. The red bracket and arrow between Piloting and Validated mark the program's hardest requirement: six months of sustained, documented performance on the same scorecard against the same baseline. The red pill highlights Validated because it is the status the whole framework is built to reach, and the dark Scaled pill is what happens afterward: the pilot's playbook is rolled out to other functions through the same pipeline shown in Figure 2.

Status	What it means
Candidate	Identified and sitting in the intake pipeline. Anyone in the company can create one through the intake channel (deliverable 3).
Scored	Evaluated on the ROI and feasibility axes (deliverable 5) and ranked in the portfolio.
Piloting	Approved business case, passed both gates, deployed, and being measured weekly against its baseline card.
Validated	Six full months of sustained, documented improvement with no decay. The only status that authorizes expansion.
Scaled	The pilot's playbook deployed beyond the original function, with each new deployment entering the same pipeline at Piloting.
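The lifecycle in the table behaves like a small state machine. A minimal sketch, assuming hypothetical function names, in which an opportunity can only move one status to the right and only Validated authorizes expansion:

```python
from enum import Enum


class Status(Enum):
    CANDIDATE = 1
    SCORED = 2
    PILOTING = 3
    VALIDATED = 4
    SCALED = 5


def advance(current: Status) -> Status:
    """Move exactly one step to the right; skipping a status is impossible by construction."""
    if current is Status.SCALED:
        raise ValueError("SCALED is the final status")
    return Status(current.value + 1)


def may_expand(status: Status) -> bool:
    """Only a Validated pilot authorizes rollout to other functions."""
    return status is Status.VALIDATED
```

Because `advance` takes no target argument, there is no way to jump from Piloting to Scaled. The only route runs through Validated and the six-month requirement attached to it.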

This final element is what makes the whole framework cohere. The purpose was never to run two interesting AI experiments. It was to build a system that reliably turns AI opportunities into durable productivity gains, proves each one with arithmetic rather than enthusiasm, and grows only as fast as its evidence allows.
