A repeatable framework for identifying, deploying, governing, and measuring AI across core business processes. Built to deliver durable productivity gains in two pilot functions, then scale as the results earn it.
Everything in this guide hangs off a single picture: three 30-day phases, a layer of work that never stops, and an outcome that extends past day 90.
The framework is built so that by day 90 there are deployed pilots, real measurements against documented starting points, and a written playbook. Ninety days is the window in which a new program either earns organizational trust or loses it, and that combination of evidence is what converts a mandate into momentum.
It deliberately starts with two pilot functions rather than a company-wide rollout. Two is enough to prove the approach works in more than one context, because a method that works in only one function is a one-off, not a framework. Two is also few enough that each pilot gets real attention, real measurement, and real course correction. Expansion to the rest of the company comes later, and only after the pilots have proven themselves. That sequencing is the difference between building institutional capability and running an experiment.
The three white columns are the three 30-day phases. They run in sequence from left to right: the program listens and measures first, deploys second, and proves and packages third. Each column contains four boxes, which are that month's concrete deliverables. The red numbered circles are reference markers, and each number matches a fully explained section below, so nothing in this picture is left to interpretation.
The dark band beneath the columns is work with no end date. Executive communication, compliance gates, and enablement run continuously underneath all three phases, which is why they are drawn as a foundation rather than as steps. The red box at the bottom is the program's true finish line: a pilot counts as a success only after six months of sustained, documented results, and only then does its approach expand to other functions.
The first month produces no AI deployments, and that is intentional. Everything deployed in month two depends on the quality of what is learned in month one. Deploying before understanding the work is how AI ends up layered on top of broken processes, which simply automates the brokenness.
Working sessions with functional leaders across the company to select the two functions where the AI pilots will run, producing a short selection memo that names the choices and the reasoning behind each.
Pilot selection is the single highest-leverage decision in the entire 90 days, because the wrong pilots can sink the program regardless of execution quality. A good pilot has four properties. The functional leader actively wants it, because a pilot imposed on a reluctant leader will quietly starve. The process has meaningful volume, because improvement claims need enough repetitions to be statistically credible. The baseline can actually be measured, because a process nobody can quantify today cannot prove improvement tomorrow. And the data sensitivity is manageable, because a first pilot that needs six months of security review before touching production data produces nothing in 90 days.
Choosing pilots together with functional leaders, rather than for them, also sets the operating posture of the whole program: this is a partnership, not an inspection.
For each pilot, a detailed map of how the work actually flows today, step by step, person by person, system by system, paired with quantified measurements of current performance.
A baseline card for each pilot process: a one-page record of the process as it stands, covering what it costs per unit of work, how long a unit takes from start to finish (cycle time), how often errors occur, and how much volume flows through. The card also names where each measurement came from, so it can be reproduced.
Every improvement claim this program will ever make is a comparison against these numbers. Without a documented starting point, "the pilot improved quality" is an opinion; with one, it is arithmetic. Baseline cards also protect the program's credibility with executives: when results are reported in month three, no one can argue about whether the starting point was real, because it was written down before any AI touched the process. The mapping exercise has a second payoff as well. The process as documented and the process as practiced are almost always different things, and AI must be designed against the real one.
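As a minimal sketch of what a baseline card might look like as a record, the Python below captures the four measurements and their sources, then shows the comparison as plain arithmetic. The field names, the `percent_change` helper, and every number are hypothetical illustrations, not part of the framework itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineCard:
    """One-page record of a process before any AI touches it."""
    process: str
    cost_per_unit: float      # currency per unit of work
    cycle_time_hours: float   # start-to-finish time for one unit
    error_rate: float         # errors divided by units, between 0 and 1
    monthly_volume: int       # units flowing through per month
    sources: dict[str, str]   # metric name -> where the number came from


def percent_change(baseline: float, current: float) -> float:
    """Signed change relative to the baseline; negative means the metric fell."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero to compute a relative change")
    return (current - baseline) / baseline * 100.0


# Hypothetical numbers, for illustration only.
card = BaselineCard(
    process="example-process",
    cost_per_unit=20.0,
    cycle_time_hours=48.0,
    error_rate=0.08,
    monthly_volume=5000,
    sources={
        "cost_per_unit": "finance export (hypothetical)",
        "cycle_time_hours": "ticket timestamps (hypothetical)",
    },
)

# A later measurement of 36 hours is compared directly against the card.
cycle_time_change = percent_change(card.cycle_time_hours, 36.0)
```

The card is frozen on purpose: once written, the starting point should not be editable by the same code that later reports results against it.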
A simple, visible channel through which anyone in the company can submit an idea for where AI could help: a submission form, a published description of how ideas are evaluated, and a commitment that every submitter hears back with a decision and a reason.
The intake channel matters for two reasons, one offensive and one defensive. Offensively, the people closest to the work see opportunities no central team ever will, and the intake channel turns those observations into the pipeline for the next wave of pilots, so the expansion after day 90 never starts from a blank page. Defensively, when a company gets serious about AI, employees begin using AI tools on their own, with or without permission. An official front door, opened early, channels that energy into governed paths instead of ungoverned ones. The intake process does not need to be sophisticated in month one. It needs to exist, be known, and keep its promises.
The first written version of the rules under which AI may be used at the company, covering four things: data tiers (a classification of company data by sensitivity, and which AI tools may touch which tier), acceptable use (what employees may and may not do with AI tools), human-in-the-loop requirements (which decisions must always have a person accountable for the final call), and a model risk review step (who signs off before any AI system touches a production process).
Sequencing is the entire point. Governance is drafted in month one, before a single deployment, because governance written after deployment is governance written around whatever already shipped. The company's business runs on sensitive identity and fraud data, so its regulators, bank partners, and customers will all eventually ask the same question: how do you control this? The answer must be that the controls came first.
There is also a practical payoff. Clear guardrails agreed early make every later approval faster, because the security and compliance conversation happens once at the framework level instead of repeatedly at every deployment. Done this way, governance is an accelerant, not a brake.
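One way the data-tier rule could be expressed so that it is checkable rather than only readable is sketched below. The tier names, the tool names, and the deny-by-default choice are all assumptions made for illustration; the actual tiers and tools are whatever the written framework defines.

```python
from enum import IntEnum


class DataTier(IntEnum):
    """Hypothetical sensitivity tiers; a higher value means more sensitive."""
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


# Hypothetical tools and the most sensitive tier each is approved to touch.
TOOL_MAX_TIER = {
    "public-chat-assistant": DataTier.PUBLIC,
    "enterprise-assistant": DataTier.CONFIDENTIAL,
    "in-house-model": DataTier.RESTRICTED,
}


def may_use(tool: str, tier: DataTier) -> bool:
    """Allow a tool only on data at or below its approved tier.

    Unknown tools are denied by default, so a new tool has to be
    reviewed and added to the table before anyone can use it.
    """
    max_tier = TOOL_MAX_TIER.get(tool)
    return max_tier is not None and tier <= max_tier
```

With a table like this, the framework-level conversation happens once, and each later deployment only has to look up its tool and its data tier.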
Month two converts month one's understanding into committed decisions and live deployments. The order inside this phase matters: score first, commit second, deploy third, and define measurement before any results arrive.
Every candidate opportunity, drawn from the pilot workflow maps and the intake channel, scored on two axes and ranked into a portfolio. The first axis is return on investment: the projected improvement against the baseline card, in cost reduced, cycle time shortened, and errors avoided. The second axis is feasibility, which combines three questions: is the data this process needs actually accessible and clean, how sensitive is that data under the governance tiers, and how much human behavior change does the improvement require.
Without a scoring framework, prioritization defaults to whoever argues loudest or holds the most senior title. A two-axis score replaces advocacy with arithmetic and gives executives a defensible answer to "why this process and not that one." The feasibility axis exists to prevent a predictable failure: the opportunities with the largest theoretical returns often sit on the most sensitive data and the most entrenched habits, and a program that chases only the biggest number ships nothing in its first year. The right early portfolio balances return against the realistic odds of landing it.
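The two-axis score can be sketched in a few lines. The framework names the axes and the three feasibility questions but not the scales or the ranking rule, so the 1-to-5 scales, the equal weighting of the feasibility questions, and the multiplication of the two axes below are assumptions for illustration, as are the opportunity names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opportunity:
    name: str
    roi: int               # 1 (small projected gain) .. 5 (large)
    data_readiness: int    # 1 (inaccessible or dirty) .. 5 (clean and available)
    data_sensitivity: int  # 1 (most sensitive tier) .. 5 (least sensitive)
    change_effort: int     # 1 (heavy behavior change) .. 5 (almost none)

    @property
    def feasibility(self) -> float:
        # Assumption: the three feasibility questions weigh equally.
        return (self.data_readiness + self.data_sensitivity + self.change_effort) / 3

    @property
    def priority(self) -> float:
        # Assumption: multiply the axes so a weak score on either one
        # pulls the opportunity down the ranking.
        return self.roi * self.feasibility


def rank(opportunities: list[Opportunity]) -> list[Opportunity]:
    """Return opportunities ordered from highest to lowest priority."""
    return sorted(opportunities, key=lambda o: o.priority, reverse=True)


# Hypothetical candidates: the larger return sits on harder data and habits.
portfolio = rank([
    Opportunity("big-return-hard-data", roi=5,
                data_readiness=2, data_sensitivity=1, change_effort=2),
    Opportunity("solid-return-easy-data", roi=4,
                data_readiness=5, data_sensitivity=4, change_effort=4),
])
```

In this example the opportunity with the smaller projected return ranks first, because its odds of landing are much better. That is the behavior the feasibility axis is meant to produce.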
A short formal document for each opportunity that will move forward, stating the expected impact in numbers tied to the baseline card, the cost to implement, the risks and how each will be mitigated, and, critically, the exact metrics by which success or failure will be judged.
The business case is a pre-commitment device. By writing down the success metrics before deployment, the program removes its own ability to retrofit a success story afterward. If the pilot works, the proof was specified in advance. If it falls short, that is visible too, and a program that can show its failures honestly is a program executives learn to trust. The business case is also where risk gets confronted on paper, in front of leadership, rather than discovered in production.
The actual deployment of AI into the two pilot processes, with two mandatory checkpoints, called gates, that every deployment must pass before going live. A gate is a checkpoint that cannot be skipped or deferred.
Turning compliance and adoption into gates, rather than recommendations, changes their nature: they stop being things a hurried team can postpone and become structural requirements of shipping at all. One more principle governs this step. The process is redesigned around the AI capability; the tool is not bolted onto the existing process. If a workflow has nine steps and AI removes the need for three of them, the redesigned workflow has six steps. Layering a tool onto all nine just automates waste.
Rectangles are work stages; the pipeline reads left to right. An opportunity enters only after it has been scored (deliverable 5) and its business case, with success metrics committed in writing, is approved (deliverable 6). The two red diamonds are the gates. Gate 1 is the governance checkpoint: the security, privacy, and risk owners apply the rules drafted in deliverable 4 and must sign off before the AI system touches any production data. Gate 2 is the adoption checkpoint: the people whose work changes have been trained, and a continuity plan exists so the business keeps running if the AI component fails. A deployment that has not passed both gates does not go live, full stop. The dark box at the end is a live pilot, which immediately enters the weekly measurement rhythm described in deliverable 9.
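The pipeline's go-live rule reduces to a checklist in which every item must be true. The sketch below is one possible encoding; the `Deployment` record and its field names are hypothetical, and in practice each flag would be backed by a real sign-off rather than a boolean someone sets.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    name: str
    scored: bool = False                  # deliverable 5
    business_case_approved: bool = False  # deliverable 6
    governance_signoff: bool = False      # Gate 1
    users_trained: bool = False           # Gate 2, training
    continuity_plan: bool = False         # Gate 2, continuity


def blockers(d: Deployment) -> list[str]:
    """List everything still standing between a deployment and go-live."""
    checks = [
        (d.scored, "not scored (deliverable 5)"),
        (d.business_case_approved, "business case not approved (deliverable 6)"),
        (d.governance_signoff, "Gate 1: governance sign-off missing"),
        (d.users_trained, "Gate 2: affected users not trained"),
        (d.continuity_plan, "Gate 2: no continuity plan"),
    ]
    return [reason for passed, reason in checks if not passed]


def can_go_live(d: Deployment) -> bool:
    """A deployment goes live only when nothing is blocking it."""
    return not blockers(d)


# Hypothetical pilot that has done everything except the continuity plan.
pilot = Deployment(
    name="example-pilot",
    scored=True,
    business_case_approved=True,
    governance_signoff=True,
    users_trained=True,
)
```

Returning the list of reasons, rather than a bare yes or no, means a blocked team always sees exactly what is missing.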
Standardized definitions for every metric the program will report, captured in a metrics dictionary, plus a one-page scorecard template applied identically to every pilot. The dictionary gives precise definitions for terms like cost per unit, throughput, error rate, and headcount leverage. Headcount leverage measures how much more output the same team produces, which frames AI as a capacity multiplier rather than a replacement program. Each definition includes its formula and data source.
Two pilots measured two different ways cannot be compared, and a program that cannot compare its own pilots cannot honestly say what worked. Standard definitions set now also mean that when the program expands beyond the pilots, every new deployment plugs into the same measurement system instead of inventing its own. And as with the business cases, defining metrics before results arrive prevents cherry-picking: the scorecard is locked before anyone knows what it will show.
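A metrics dictionary can be kept as data, with each entry carrying its formula and source alongside a function that computes it. The sketch below is illustrative only: the source labels are hypothetical, and the headcount leverage formula (current output per person divided by baseline output per person) is an assumed reading of "how much more output the same team produces", not a definition taken from the framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    formula: str                   # human-readable formula
    source: str                    # where the inputs come from (hypothetical)
    compute: Callable[..., float]  # the same formula as code


METRICS = {
    "cost_per_unit": MetricDefinition(
        "cost_per_unit", "total_cost / units", "finance ledger",
        lambda total_cost, units: total_cost / units,
    ),
    "throughput": MetricDefinition(
        "throughput", "units / days", "workflow system",
        lambda units, days: units / days,
    ),
    "error_rate": MetricDefinition(
        "error_rate", "errors / units", "quality review log",
        lambda errors, units: errors / units,
    ),
    "headcount_leverage": MetricDefinition(
        "headcount_leverage",
        "(current_output / current_fte) / (baseline_output / baseline_fte)",
        "workflow system and HR roster",
        lambda current_output, current_fte, baseline_output, baseline_fte:
            (current_output / current_fte) / (baseline_output / baseline_fte),
    ),
}

# Hypothetical inputs: the same ten people now produce 1200 units instead of 1000.
leverage = METRICS["headcount_leverage"].compute(
    current_output=1200, current_fte=10, baseline_output=1000, baseline_fte=10
)
```

Because every pilot calls the same `compute` function for a given metric, two pilots cannot quietly end up measuring the same term two different ways.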
Month three is where the program demonstrates results and, just as importantly, packages itself so that it no longer depends on any one person.
The measurement system made visible and put on a clock: live dashboards showing each pilot's scorecard metrics against its baseline, plus a standing reporting rhythm at three altitudes.
A dashboard nobody is scheduled to look at is decoration. The cadence is what turns measurement into management: drift gets caught in days instead of quarters, and pilot teams operate knowing their numbers will be discussed on a known schedule, which is a quiet but powerful driver of follow-through. The layered rhythm also matches information to audience. Working teams need weekly operational detail; executives need the quarterly pattern. One report for everyone serves no one.
The dark box on the left is the single source of truth: one dashboard per pilot, always showing current scorecard metrics next to the baseline card from deliverable 2, so improvement is read as a direct comparison rather than a claim. The three rows on the right are the standing meetings that consume it, and the arrows mean exactly that: the same numbers feed all three, with no separate version of the truth prepared for any audience. The frequency drops and the altitude rises as you read down: weekly for the teams doing the work, monthly for the leaders who own the function, quarterly for the executives who decide whether the program expands. The red border on the weekly row marks where problems are meant to be caught: at the working level, within days, long before they reach an executive readout.
AI agents deployed to run the program's own administrative machinery: a first set that triages incoming intake submissions, checks whether deployments have satisfied their gates, validates that weekly status updates are complete and specific rather than vague, and assembles the briefing materials for executive readouts.
This deliverable earns its place for three reasons. First, program overhead is real: intake triage, status chasing, and readout preparation can consume a coordinator's entire week, and agents collapse that work from days to hours. Second, the placement in month three is a deliberate statement of method. Agents automate the intake process, the gates, and the reporting cadence only after those things exist and demonstrably work, because automating a process before it works just produces failure at higher speed, which is exactly the mistake this program exists to prevent in every other function. Third, this deliverable is the program's proof of its own philosophy: the system that governs AI adoption is itself run by AI, built the right way. Process first, baseline first, automation second.
The repeatable method, written down: a playbook documenting every step of the framework so far (how to select a process, build a baseline card, score ROI and feasibility, write the business case, pass the gates, and stand up the scorecard), plus training materials that teach functional teams to run the early steps themselves.
This is the deliverable that distinguishes institutional capability from individual heroics. If the method lives only in the program lead's head, the company has rented an outcome; if it lives in a playbook that others can execute, the company owns a capability. There is a scaling argument too: one person cannot personally run AI adoption across an entire company and was never supposed to. The endgame is functional teams identifying and baselining their own opportunities using the playbook, with the central program providing scoring, governance, and measurement. Version one will be imperfect, and that is fine. A real playbook shipped at day 90 and revised quarterly beats a perfect one that never ships.
The day-90 presentation to executive leadership, in three sections. Results: each pilot's performance against its baseline, on the metrics committed in writing back in month two, including anything that fell short. Risks: what was encountered, how it was handled, what remains open. Roadmap: which opportunities from the scored portfolio and intake pipeline the program recommends for the next wave, and what resources that wave requires.
This is the moment the program converts 90 days of work into a mandate for the next 90. The structure is designed for credibility: results are reported against pre-committed numbers, shortfalls are presented alongside wins, and the expansion request is grounded in a scored pipeline rather than enthusiasm. Executives fund programs that demonstrate control, and control is exactly what the baseline cards, gates, and scorecards were built to demonstrate.
Beneath the three phases sits a band of work that never finishes. It is drawn as a separate layer in the master map specifically so no reader mistakes these for phase tasks that complete.
Regular, unprompted updates to leadership begin in week one, long before there are results to show. Executive sponsorship is the program's oxygen, and sponsorship survives bad news but not surprises. A leadership team that has heard a steady, honest signal for three months receives the day-90 readout as a continuation, not a reveal.
The governance rules drafted in month one are not a document that gets filed. They are a living checkpoint applied to every deployment, every expansion, and every new tool, forever. In a regulated, data-sensitive business the obligation never closes, and a program that treats governance as a milestone it already passed is one incident away from being shut down.
Every deployment changes someone's daily work, and people do not adopt new ways of working because a tool appeared. Training, support, and visible responsiveness run continuously, because adoption is not an event at launch. It is a curve that has to be actively maintained, and the moment enablement stops, old habits return and the measured gains decay.
The framework defines a status that no pilot can claim at day 90: Validated. A pilot earns it only after sustaining its documented improvement for six full months, on the same scorecard, against the same baseline, with no decay.
The six-month bar exists because of a pattern anyone who has run transformation work will recognize. New deployments enjoy a burst of attention, novelty, and management focus, and almost anything improves for six weeks under those conditions. The honest question is what the numbers look like after the attention moves on. Six months of sustained performance is the difference between an improvement and an anecdote.
Validated status is also the gate for expansion: only a Validated pilot's playbook gets promoted for rollout to other functions, which means the company never scales an unproven pattern.
Each pill is a status, and every opportunity in the program holds exactly one at any time, moving left to right and never skipping a step. The definitions are in the table below. The red bracket and arrow between Piloting and Validated mark the program's hardest requirement: six months of sustained, documented performance on the same scorecard against the same baseline. The red pill highlights Validated because it is the status the whole framework is built to reach, and the dark Scaled pill is what happens afterward: the pilot's playbook is rolled out to other functions through the same pipeline shown in Figure 2.
| Status | What it means |
|---|---|
| Candidate | Identified and sitting in the intake pipeline. Anyone in the company can create one through the intake channel (deliverable 3). |
| Scored | Evaluated on the ROI and feasibility axes (deliverable 5) and ranked in the portfolio. |
| Piloting | Approved business case, passed both gates, deployed, and being measured weekly against its baseline card. |
| Validated | Six full months of sustained, documented improvement with no decay. The only status that authorizes expansion. |
| Scaled | The pilot's playbook deployed beyond the original function, with each new deployment entering the same pipeline at Piloting. |
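The status rules in the table can be sketched as a small state machine: one step at a time, left to right, with an extra check before Validated. Two details below are assumptions rather than statements from the framework: the six months are counted in whole calendar months from the pilot's go-live date, and "no decay" arrives as a flag supplied by whoever reviews the scorecard.

```python
from datetime import date
from enum import IntEnum
from typing import Optional


class Status(IntEnum):
    CANDIDATE = 1
    SCORED = 2
    PILOTING = 3
    VALIDATED = 4
    SCALED = 5


SUSTAIN_MONTHS = 6  # from the framework: six full months with no decay


def months_between(start: date, end: date) -> int:
    """Whole calendar months elapsed from start to end."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the final month is not yet complete
    return months


def advance(current: Status, target: Status, *,
            pilot_start: Optional[date] = None,
            today: Optional[date] = None,
            decayed: bool = False) -> Status:
    """Move to the next status, refusing skipped steps and early validation."""
    if target != current + 1:
        raise ValueError("statuses advance one step at a time, left to right")
    if target is Status.VALIDATED:
        if pilot_start is None or today is None:
            raise ValueError("validation needs the pilot start date and today's date")
        if decayed or months_between(pilot_start, today) < SUSTAIN_MONTHS:
            raise ValueError("Validated requires six sustained months with no decay")
    return target
```

Because Scaled can only be reached from Validated, the expansion rule needs no separate check: an unproven pilot has no path to rollout.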
This final element is what makes the whole framework cohere. The purpose was never to run two interesting AI experiments. It was to build a system that reliably turns AI opportunities into durable productivity gains, proves each one with arithmetic rather than enthusiasm, and grows only as fast as its evidence allows.