AI Audit: How to Pick Your First Use Case with Criteria (Not Hype)

Why 80% of AI projects don't make it to production

There's a number going around in industry reports that lines up reasonably well with what we see at clients: most AI initiatives in companies don't get past pilot. The exact figure depends on the study (60%, 75%, 80%), but the order of magnitude is consistent. The interesting thing isn't the figure — it's the cause.

Almost nobody fails because of the technology. Models are mature. Integrations exist. Cloud infrastructure holds up. Technical teams know how to build. What fails is the choice of use case: cases that sounded good in a leadership meeting or got bought under trend pressure, and that, once taken to production, discover that the data isn't there, that the actual process differs from the described one, that the estimated savings don't justify the implementation effort, or that the affected team isn't aligned.

That's why the first project matters more than the model, more than the framework, more than the partner. A well-chosen first use case starts adding value in weeks and builds the credibility for the next one. A badly chosen one burns six months, leaves the team scarred, and puts "AI" on the leadership committee's blacklist for years.

This article explains the method we use, at every company, to decide which use case goes first. It's not magic — it's scoring across three axes, operational discovery, and the honesty to drop what doesn't belong.

The scoring matrix: impact, effort, and risk

Every candidate for a first AI use case gets evaluated on three axes, each scored from 1 to 5. The winning use case is the one with the best combination of the three, not the one that excels on only one.

Impact. How much real value it brings to the business if it works. Three questions:

How many hours / euros / errors does it save? Not "a lot" — an estimated number. If nobody in the company can estimate the saving, the case isn't sufficiently understood.
How many people or processes does it touch? A case that affects 200 people/month has more impact than one affecting 5, all else equal.
Is it cost saving or revenue growth? Both count, but they create different business dynamics. Worth knowing which one is being attacked.
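
To make "an estimated number" concrete, here is a minimal sketch of the arithmetic behind a savings estimate. The function name, its parameters, and every input value are hypothetical, chosen only for illustration; a real estimate would use the client's own volumes and costs.

```python
def annual_saving_eur(
    cases_per_month: float,
    minutes_per_case: float,
    automatable_share: float,
    hourly_cost_eur: float,
) -> float:
    """Estimated yearly saving in euros for one candidate use case."""
    if not 0.0 <= automatable_share <= 1.0:
        raise ValueError("automatable_share must be between 0 and 1")
    # Hours the team currently spends on this task each month.
    hours_per_month = cases_per_month * minutes_per_case / 60
    # Only the automatable share of those hours counts as saving.
    return hours_per_month * automatable_share * hourly_cost_eur * 12


# Hypothetical inputs: 2,000 tickets a month, 6 minutes each, 25 EUR/hour,
# with a pessimistic (40%) and an optimistic (60%) automatable share.
low = annual_saving_eur(2000, 6, 0.40, 25)
high = annual_saving_eur(2000, 6, 0.60, 25)
print(f"Estimated saving: {low:,.0f} to {high:,.0f} EUR per year")
```

With these made-up inputs the estimate is a range of 24,000 to 36,000 EUR per year. The point is not the figure but that each input can be challenged and defended separately.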

Effort. What it costs to take it to real production. Three questions:

Is the needed data available and clean? If it requires extraction from a legacy system, manual normalization, and months of validation, effort skyrockets. Dirty data is the most underestimated cause of failed projects.
Which systems do you have to integrate with? A standard HubSpot integration is light. An integration with a proprietary ERP without a documented API is heavy.
Who else has to approve / participate? Cases touching legal, HR, finance, and operations at once have more organizational than technical effort.

Technical risk. The chance that the solution doesn't work in production even if it's built well. Three questions:

Is the technology mature for this case? RAG over Spanish-language documents is mature; agents that negotiate prices autonomously are not there yet. Knowing where the frontier is prevents buying hype.
Is the minimum acceptable quality reachable? Some cases tolerate 85% accuracy; others need 99.5%. Knowing the minimum threshold decides whether the case is in or out.
Is there single-vendor dependency? Cases that only work with one specific vendor (closed model, closed platform) carry higher strategic risk.

Scoring happens in a session with the client, with each candidate on the table. The conversation that scoring triggers is often more valuable than the final result — it surfaces assumptions nobody had verbalized.
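
The matrix can be sketched in a few lines of code. This is one possible way to combine the three axes, not a fixed formula: it assumes equal weights and inverts effort and risk so that a higher combined score is always better. The `Candidate` class, the `rank` helper, and the example scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    impact: int  # 1 (little value) to 5 (high value)
    effort: int  # 1 (light) to 5 (heavy)
    risk: int    # 1 (mature, safe) to 5 (likely to fail in production)

    def __post_init__(self) -> None:
        # Reject scores outside the 1-5 scale used in the session.
        for axis in ("impact", "effort", "risk"):
            value = getattr(self, axis)
            if not 1 <= value <= 5:
                raise ValueError(f"{axis} must be between 1 and 5, got {value}")

    def combined(self) -> int:
        # Assumption: equal weights, with effort and risk inverted so that
        # higher is better on every axis. Range: 3 (worst) to 15 (best).
        return self.impact + (6 - self.effort) + (6 - self.risk)


def rank(candidates: list[Candidate]) -> list[Candidate]:
    # Best combined score first; ties go to the lower-effort candidate.
    return sorted(candidates, key=lambda c: (-c.combined(), c.effort, c.name))


# Hypothetical scores, as they might come out of a scoring session.
candidates = [
    Candidate("Price-negotiation agent", impact=5, effort=4, risk=5),
    Candidate("L1 customer support", impact=5, effort=3, risk=1),
    Candidate("Automated reporting", impact=3, effort=2, risk=1),
]

for c in rank(candidates):
    print(f"{c.combined():>2}  {c.name}")
```

In this made-up example the price-negotiation agent has the maximum impact score and still ranks last, because it does badly on effort and risk. That is the behavior the matrix is meant to produce.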

The five typical candidates that show up in discovery

In mid-to-large Spanish companies, first-use-case candidates almost always converge into five families. Listed in order of how often they win the initial scoring:

1. Customer support / L1. Handling frequent queries about order status, products, and account management. High impact (large volume), medium effort (APIs usually exist), low risk (mature technology). It's the case that wins most often: it takes about three months to reach a reasonable auto-resolution rate, and the metrics are clear.

2. Sales qualification / digital SDR. Qualifying inbound leads, enriching data, booking meetings with a human salesperson. High impact (directly affects pipeline), medium effort (CRM integration, usually HubSpot or Salesforce), low risk. Excellent when the sales team is overloaded and leads go cold.

3. Document processing. Reading, classifying, and extracting data from invoices, contracts, forms. Medium-high impact (depends on volume), medium effort (depends on format variability), low risk. Particularly good in companies with saturated admin departments.

4. Internal knowledge assistant. Answering internal questions about processes, policies, project data. Medium impact (marginal but generalized improvement), medium-high effort (data is usually scattered), low risk. A good second or third project, rarely the first — because ROI is harder to measure.

5. Automated reporting. Generating recurring reports from CRM, ERP, or spreadsheet data. Medium impact (saves hours but rarely transforms), low effort (data is there), low risk. A good quick win but rarely justifies a project on its own — usually folds into a larger use case.

Other candidates appear less often but deserve mention: agents operating on WhatsApp for reservations or qualification (typical in hospitality and services), a sales copilot integrated into HubSpot, and automated employee onboarding. Each has its own scoring profile.

Anti-patterns: use cases that look good and aren't

The most useful part of a serious audit is dropping attractive but bad candidates. Five anti-patterns that show up over and over in discovery and that we almost always advise against:

"Replace the entire support team with AI." The support team solves complex cases, with judgment and with context. Replacing it wholesale is the wrong objective and is also politically toxic. The right objective is absorbing 60-70% of routine cases so the human team focuses on the complex ones. The difference between "automating tasks" and "replacing people" decides whether the project coexists with the team or fights it.

"Build a general copilot for the director." Sounds good, doesn't work. A general copilot — "an assistant for whatever the director needs" — has no bounded role, no metrics, no clear tools. It ends up being a chat window with little real use. The director needs concrete tools (pipeline summary, alerts on critical accounts, drafting of specific communications), not a vague assistant.

"Turn the whole website into a chatbot." Websites fulfill functions — navigation, exploration, comparative reading — that a linear chat doesn't replicate well. A chatbot on the web is a complement, not a substitute. Companies that set out to "kill navigation" usually end up with worse experience and worse conversion.

"Generate all commercial proposals with AI." Proposals have template parts (where AI helps) and parts of real commercial judgment (where a human decides). Success is mixing both, not automating the whole thing. 100% AI proposals usually lose the nuance that closes deals.

"Start with the highest-impact case." Counterintuitive but real. The highest-impact case is often the most complex, the most political, and the riskiest. If it fails, the committee draws conclusions about AI as a whole, not about that project. Starting with a medium-impact case that works generates the credibility to attack the big one afterwards.

Discovery: who to interview and what to ask

An audit that only interviews executives produces theoretical candidates. One that only interviews operators produces small candidates. The right mix has two layers:

Operations layer. People doing the work daily: salespeople, support agents, admin staff, technicians. Questions that pull useful information:

"What part of your day brings you the least value?" Repetitive, low-value tasks are natural candidates.
"What task have you done today that you know a computer could do if it knew what you know?" This gets at implicit knowledge that could be transferred.
"What report do they ask you for again and again?" Recurring reporting is almost always automatable.
"What question of yours goes unanswered because finding the information costs too much?" Those are the gaps where an internal search assistant would add value.

Leadership and process layer. Leads who know the critical business processes and the legal/financial constraints. Questions:

"What was the most expensive operational bottleneck in the last quarter?" Start from real pain, not the wish list.
"What process, if accelerated by X%, would move a relevant business metric?" Connect operations to real KPIs.
"What legal or regulatory constraints affect customer data?" GDPR and, by 2026, the EU AI Act constrain many use cases.

A serious discovery takes 5 to 12 interviews, depending on client size, each lasting 45 to 60 minutes. The result is a list of 10-15 raw candidates that formal scoring filters down to 5-8.

From scoring to roadmap

Scoring produces a ranking, but a ranking alone isn't enough. The audit ends with a phased roadmap:

Phase 1 — Quick wins (4-8 weeks). High-scoring cases with low effort. The goal is to generate fast evidence: a case in production the team can show, with real metrics. Usually a digital SDR, L1 support, or a reporting assistant.

Phase 2: Structural project (3-6 months). The highest-impact case that requires more work. Once the quick win is validated, you attack the project that justifies infrastructure investment: large-scale document processing, an internal knowledge assistant over all corporate documentation, or a multi-agent system combining several use cases.

Phase 3 — Platformization (6-12 months). If the first two cases work, the common platform gets built: the agent orchestration server, the MCP servers to each internal system, the eval pipeline, the monitoring dashboard. From there, every new use case enters in weeks, not months.

This progression isn't exclusive to AI — it's normal engineering applied to a new layer. What distinguishes a company that scales AI from one that stumbles is respecting the order: validate first, platformize later.
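
As a rough illustration of how scored candidates could map onto the first two phases, here is a minimal sketch. The thresholds and the `assign_phase` helper are assumptions made for this example, not fixed rules. Phase 3 is left out because it is a platform build rather than a use case.

```python
def assign_phase(impact: int, effort: int) -> str:
    """Map 1-5 impact and effort scores to a roadmap phase.

    The cut-off values are illustrative assumptions.
    """
    if impact >= 3 and effort <= 2:
        # Worthwhile and cheap to ship: fast evidence in production.
        return "phase 1: quick win"
    if impact >= 4:
        # High value but heavier work: wait until a quick win is validated.
        return "phase 2: structural project"
    return "backlog"


# Hypothetical (name, impact, effort) scores from a scoring session.
scored = [
    ("Reporting assistant", 3, 2),
    ("Large-scale document processing", 5, 4),
    ("Meeting-notes summarizer", 2, 3),
]

roadmap: dict[str, list[str]] = {}
for name, impact, effort in scored:
    roadmap.setdefault(assign_phase(impact, effort), []).append(name)

for phase, names in roadmap.items():
    print(f"{phase}: {', '.join(names)}")
```

In practice the phase boundaries come out of the conversation with the client, not out of a hard-coded threshold.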

What a serious AI audit looks like (and what it doesn't)

A serious audit has three deliverables and three things it doesn't deliver:

It does deliver:

Map of candidate processes with their operational context: who does what, how often, with what pain, with what data.
Scoring matrix with each candidate scored on the three axes, the assumptions behind each score, and the conversation that led to the numbers.
Phased prioritized roadmap, with estimated times, technical dependencies, required team, and success metrics for each phase.

It doesn't deliver:

A 40-page PDF with generic AI concepts. Your team already knows what an LLM is. The audit is about your business, not the state of the art.
A closed recommendation of vendor or platform. Stack choice is a technical decision that comes later, not part of the audit. Mixing them is a conflict of interest.
Inflated savings promises. Ranges are ranges. If nobody can defend a concrete number with a concrete calculation, it doesn't go in the report.

Typical duration: two to three weeks. One week of interviews, a few days of analysis and scoring, one week of roadmap and presentation. Going beyond three weeks usually signals that something without enough data from the start is being stretched, or that a project is being sold disguised as an audit.

How we build it in production

When a client asks us for an AI audit, the first commitment isn't to deliver a recommendation but to be honest about what's worth it and what isn't. Sometimes the conclusion is "don't do AI yet" because the data isn't clean and cleaning it costs more than the use case is worth. Other times, the recommendation is a small quick win before the big project the client wanted to attack. The audit exists to make that decision with criteria, not to confirm what was already decided.

The final deliverable belongs to the client. Process map, scoring, and roadmap live in a document the committee can read, edit, and use to make future decisions — regardless of who ends up executing the project.

Behind each audit is a team of engineers who believe serious AI engineering starts by understanding the client's business (which processes eat time, which decisions get made blind, what real pain exists) and not by celebrating what a model can do in a lab. Technical excellence isn't measured by the sophistication of the first agent; it's measured by whether the first use case adds value in weeks and builds the credibility for the next one.

Radical honesty means telling you a project doesn't make sense when it doesn't, even when it would be business for us. That's the difference between an AI audit and a services brochure.

If you're evaluating where to put AI in your company and don't want to commit budget to a project that ends in a drawer, we can run the audit: two to three weeks, honest scoring on your real candidates, and a roadmap your committee can approve or reject with data on the table.