
Minimum contract for an AI agent in production (4 non-negotiable conditions)

The 4 conditions we require before implementing an AI agent for an SMB: bounded scope, MCP interoperability, kill-switch, and production eval harness.

serpixel · 20 May 2026

Close-up of a red emergency stop button on an industrial control panel, with warning text partially visible.

Key points

One agent, one workflow, one metric: The first failure point of any AI agent implementation is generality. If you cannot define the process step by step and tie it to a concrete measurable metric by week four, the project is not an agent in production; it is an experiment billed to the client.

Interoperability via MCP as an auditable technical contract: An agent that does not touch business tools (CRM, ERP, email, WhatsApp Business) does not replace any real mechanical process. The Model Context Protocol (MCP) is currently the technical contract that lets you audit which systems the agent touches and how.

Kill-switch effective within five minutes: The mechanism to stop the agent is the first thing designed, not the last. It must be actionable by the client without depending on the vendor, and effective within five minutes of activation.

Eval harness on real traffic, not offline tests: Integration tests run once; continuous evaluation never stops. A production harness measures accuracy, latency, cost per action, and behavioral drift on real traffic, with a designated person reading the data every week.

Most AI agent implementations do not fail because of the technology. They fail because there was no starting contract.

By “contract” we do not mean a legal document. We mean a set of concrete conditions that must be true before activating the agent on real business data and processes. Without these conditions, what goes live is not an agent in production. It is an experiment billed to the client.

At serpixel (Clever European Business, S.L.) we have four non-negotiable conditions that we verify before signing any implementation. They are not there to protect us: they are there because it is the only way we know to guarantee that what we deliver works when it matters and can be stopped when it fails.

Condition 1: one agent, one workflow, one metric

The first failure point of almost every implementation is scope. Not the technology, not the model, not the integrations: the scope.

When a project starts with “we want an agent for customer service” without specifying much more, what we have is an intention, not a project. A vague process cannot be implemented productively. And, worse, it cannot fail detectably: an agent with diffuse scope fails silently, and nobody knows when.

The filter question is concrete: what metric will change by week four?

If the answer is “improve customer service” or “be more efficient,” that is not an answer: it is an aspiration. The answer we accept is of a different kind: “the percentage of order drafts accepted without human editing will go from 0% to 70% in four weeks.” Or: “mean first response time on support emails will go from 6 hours to under 30 minutes.”

With a metric like that, the project has shape. We can define success cases, edge cases, and failure conditions. We can design an evaluation harness. We can know whether the agent works or not.
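As a minimal sketch, the two example answers above can be written down as data before any agent exists. The `OutcomeMetric` class and its field names are our own illustrative schema, not part of any tool, and the four-week deadline on the second metric is an assumption carried over from the filter question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutcomeMetric:
    """One workflow, one metric: baseline, target, deadline (hypothetical schema)."""
    name: str
    unit: str
    baseline: float
    target: float
    deadline_weeks: int
    higher_is_better: bool = True

    def is_met(self, observed: float) -> bool:
        # "go to 70%" is read as >= target; "under 30 minutes" as strictly < target.
        if self.higher_is_better:
            return observed >= self.target
        return observed < self.target

    def progress(self, observed: float) -> float:
        # Fraction of the distance from baseline to target covered so far.
        return (observed - self.baseline) / (self.target - self.baseline)


drafts_accepted = OutcomeMetric(
    name="order drafts accepted without human editing",
    unit="%", baseline=0.0, target=70.0, deadline_weeks=4,
)
first_response = OutcomeMetric(
    name="mean first response time on support emails",
    unit="minutes", baseline=360.0, target=30.0,
    deadline_weeks=4,  # assumed: the source gives no deadline for this example
    higher_is_better=False,
)
```

If a project cannot fill in `baseline` and `target` with numbers, that is usually the sign that it is still an aspiration rather than a metric.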

Without it, we are doing a research project, and nobody has told the client they are paying for the research.

Condition 2: verifiable interoperability via MCP

An agent that does not touch business tools is not an agent. It is an expensive chatbot.

The distinction is functional. A chatbot converses. An agent executes real actions: read an email, identify the customer in the CRM, check stock in the ERP, create an order draft, escalate to a person when the case is ambiguous. If the agent does none of these actions, it does not reduce hours of mechanical work. It reduces response time to questions, which is a different problem.

Today, the standard that makes this interoperability auditable is the Model Context Protocol (MCP). An agent with integrations documented via MCP lets you know exactly which systems it talks to, how it talks to them, and with what permissions. If a vendor cannot describe their agent’s integrations in terms of tools, methods, and permissions, governance of the system is hard to establish.

The filter question: how many internal systems does the agent connect to via MCP, and what methods does each one expose?

The client does not need to understand the protocol in detail. They need to understand the list: “the agent reads from the CRM, writes order drafts to the ERP, does not write to the CRM, does not send emails autonomously.” With that list, the client knows exactly where the agent can fail and what the blast radius of an error is.
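The list in that example can be kept as a small, reviewable artifact. The sketch below is plain Python, not the MCP SDK or the protocol itself: `ToolGrant`, `GRANTS`, and the tool names (`find_customer`, `check_stock`, `create_order_draft`) are hypothetical and stand in for whatever tools the real MCP servers expose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolGrant:
    system: str   # business system behind the tool
    tool: str     # hypothetical tool name
    access: str   # "read" or "write"


# Everything not listed here is denied, including CRM writes and outgoing email.
GRANTS = (
    ToolGrant("CRM", "find_customer", "read"),
    ToolGrant("ERP", "check_stock", "read"),
    ToolGrant("ERP", "create_order_draft", "write"),
)


def is_allowed(system: str, tool: str, grants=GRANTS) -> bool:
    """Allow-list check: a call is permitted only if it is explicitly granted."""
    return any(g.system == system and g.tool == tool for g in grants)


def writable_systems(grants=GRANTS) -> set[str]:
    """Systems where an agent error can change data: a rough blast radius."""
    return {g.system for g in grants if g.access == "write"}
```

In this sketch `writable_systems()` contains only the ERP, which is the plain-language answer the client needs: a wrong action can create a bad order draft, but it cannot alter a CRM record or send an email.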

Condition 3: kill-switch effective within five minutes

The mechanism to stop the agent is the first thing designed, not the last.

The argument is operational. An agent that touches real data can cause damage fast. If an agent processes WhatsApp orders and starts generating incorrect entries in the ERP, the damage is proportional to the number of orders that pass until someone detects it and stops the system. If reaching the vendor takes two hours, the damage can be two hours of incorrect orders.

The right kill-switch meets three conditions:

Actionable by the client. No need to call the vendor. It can be an environment variable, a button in the admin panel, a documented API call, or a setting in the client’s own internal tool. What is not a kill-switch is saying “send an email to the serpixel team and we will respond within two hours.”
Effectiveness SLA under five minutes. From the moment the client activates it to when the agent stops processing new cases. Not from when the client sends the request: from when they press the button.
Documented human fallback. The kill-switch is not complete without knowing who picks up the process when the agent is off. If the agent processes 200 orders a month and the kill-switch is activated, someone has to process them. Who that person is, which tools they use, and how quickly they take over must be documented in the statement of work (SOW) and tested before go-live, not figured out in a hurry while the work piles up.
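One way to satisfy the three conditions above, sketched under assumptions: the switch is a flag file the client can create without the vendor, the agent re-checks it before every case, and anything that arrives while the switch is on goes to a human fallback queue. The names `agent_enabled`, `process_cases`, and the flag file itself are hypothetical; an environment variable or an admin-panel button would follow the same pattern.

```python
from pathlib import Path
from typing import Callable, Iterable


def agent_enabled(flag: Path) -> bool:
    """The agent may run only while the kill-switch flag file is absent."""
    return not flag.exists()


def process_cases(
    cases: Iterable,
    flag: Path,
    handle: Callable,
    fallback_queue: list,
) -> list:
    """Handle cases one by one, re-checking the kill-switch before each one."""
    handled = []
    for case in cases:
        if not agent_enabled(flag):
            # Switch is on: do not process, hand the case to the documented human fallback.
            fallback_queue.append(case)
            continue
        handled.append(handle(case))
    return handled
```

Because the check runs before every case, the time from pressing the button to the agent stopping is bounded by the processing time of a single case, which is what makes a sub-five-minute SLA realistic to promise and to test before go-live.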

The filter question: how many minutes does it take to stop the agent if it starts causing damage, without calling anyone?

Condition 4: eval harness in production

The difference between an agent that gets implemented and one that keeps working long-term is continuous evaluation.

An offline evaluation harness is a set of static test cases: 50 representative messages with expected responses, run automatically against the model and marked pass/fail. It verifies initial behavior and catches regressions when you change a model version. But it is not production evaluation.
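For illustration, an offline harness of this kind can be very small. In the sketch below, `toy_agent` is a stand-in keyword rule rather than a real model, and the messages and expected actions are invented examples.

```python
from typing import Callable


def run_offline_suite(agent: Callable[[str], str], cases: list[tuple[str, str]]) -> dict:
    """Run every static case through the agent and record pass/fail."""
    failures = []
    for message, expected in cases:
        actual = agent(message)
        if actual != expected:
            failures.append((message, expected, actual))
    return {"total": len(cases), "passed": len(cases) - len(failures), "failures": failures}


def toy_agent(message: str) -> str:
    # Stand-in for the real agent: a naive keyword rule, for illustration only.
    return "create_order_draft" if "order" in message.lower() else "escalate_to_human"


CASES = [
    ("I want to order 3 boxes", "create_order_draft"),
    ("Where is my invoice?", "escalate_to_human"),
    ("ORDER 10 units pls", "create_order_draft"),
    ("Can I get 5 more of the usual?", "create_order_draft"),
]

report = run_offline_suite(toy_agent, CASES)
```

The last case fails because the customer never uses the word “order”. A static suite only contains that phrasing if someone thought to add it, which is the gap real traffic exposes.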

A production harness evaluates the agent on real traffic: real cases, with the actual input distributions the business generates every day. That means messages in the customers’ own language, with their spelling mistakes and the non-standard product names they actually use. It measures four things:

Accuracy. What percentage of cases does the agent handle correctly?
Latency. How long does it take to process each case?
Cost per action. What is the inference cost per processed case?
Behavioral drift. Is today’s quality the same as a month ago?
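The first three measurements can be computed from a log of processed cases. This is a sketch under assumptions: `CaseRecord` is a hypothetical log format, the `correct` flag is assumed to come from human review or sampling, and costs are assumed to be in euros.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass(frozen=True)
class CaseRecord:
    correct: bool      # assumed to come from human review of the case
    latency_s: float   # seconds from input received to action completed
    cost_eur: float    # inference cost attributed to this case


def summarize(records: list[CaseRecord]) -> dict:
    """Weekly summary of accuracy, latency, and cost per action."""
    if not records:
        raise ValueError("no records to summarize")
    return {
        "accuracy": sum(r.correct for r in records) / len(records),
        "median_latency_s": median(r.latency_s for r in records),
        "cost_per_action_eur": mean(r.cost_eur for r in records),
    }


week = [
    CaseRecord(True, 2.0, 0.02),
    CaseRecord(True, 4.0, 0.04),
    CaseRecord(False, 9.0, 0.06),
    CaseRecord(True, 3.0, 0.04),
]
summary = summarize(week)
```

The median is used for latency here so that one slow case does not dominate the weekly figure; a percentile such as p95 would be a reasonable alternative if worst-case response time is what the client cares about.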

Drift matters. Language models change (new versions, vendor fine-tunings), business data changes (new products, new processes, new types of customers), and the agent’s behavior can degrade without anyone noticing until the client notices.
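A simple way to make drift visible is to compare the current window against a reference window, for example the first weeks after go-live. The sketch below uses a plain difference in accuracy with an illustrative tolerance of five points; it is not a statistical test, and with small samples the result will be noisy.

```python
def accuracy(outcomes: list[bool]) -> float:
    """Share of cases handled correctly in a window."""
    if not outcomes:
        raise ValueError("empty window")
    return sum(outcomes) / len(outcomes)


def drift_report(reference: list[bool], current: list[bool], max_drop: float = 0.05) -> dict:
    """Flag drift when accuracy falls more than max_drop below the reference window."""
    ref, cur = accuracy(reference), accuracy(current)
    drop = ref - cur
    return {"reference": ref, "current": cur, "drop": drop, "drifted": drop > max_drop}


go_live_window = [True] * 90 + [False] * 10   # 90% correct
this_week = [True] * 80 + [False] * 20        # 80% correct
report_drift = drift_report(go_live_window, this_week)
```

Here the ten-point drop exceeds the tolerance, so `report_drift["drifted"]` is true even though nothing in the agent's own code changed.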

The filter question: who will read the evaluation data every week, and what decision is associated with the reading?

If there is no designated person and no associated decision (if accuracy drops below threshold X, we do Y), the harness is an instrument with nobody at the wheel.
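The “if X, we do Y” rule can be written down explicitly so that the weekly reading always ends in a decision. The thresholds and decisions below are illustrative assumptions, not recommended values; the real ones belong in the agreement with the client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewRule:
    below_accuracy: float   # rule fires when accuracy is strictly below this value
    decision: str


# Ordered from most to least severe; the first matching rule wins.
RULES = (
    ReviewRule(0.60, "activate the kill-switch and hand over to the human fallback"),
    ReviewRule(0.70, "review failed cases with the vendor this week"),
)


def weekly_decision(accuracy: float, rules=RULES, default: str = "no action; log the reading") -> str:
    """Map the weekly accuracy reading to the decision agreed in advance."""
    for rule in rules:
        if accuracy < rule.below_accuracy:
            return rule.decision
    return default
```

The point of the table is less the code than the conversation it forces: each row needs a named person who is allowed to take that decision.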

The contract is the minimum guarantee

Four conditions. Bounded scope, verifiable interoperability, effective kill-switch, continuous evaluation. None of the four is sophisticated: all of them can be explained in five minutes to a non-technical operations director.

What makes them hard is not understanding them: it is that they require upfront work. Documenting the process step by step is uncomfortable. Defining an outcome metric with a baseline is difficult when there is no prior data. Designing the kill-switch and the human fallback requires imagining a failure nobody wants to happen. Setting up the production harness from day one costs more than not having it.

But the cost of skipping any of the four does not fall on the vendor. It falls on the client’s business.

This is why we do not sign implementations that do not meet all four. It is not a commercial stance. It is the only way we know to guarantee that what we put into production works when it matters and can be stopped when it fails.

If you have a process idea and want to know whether it passes this filter, let’s talk for 30 minutes. We bring the process to the table and leave knowing whether there is an agent worth implementing, and if there is, what the four conditions would look like for your specific case.

Frequently asked questions

What are the minimum conditions for putting an AI agent into production?

Four minimum conditions: scope bounded to a single workflow with a defined outcome metric, verifiable interoperability with business tools (CRM, ERP, email, or messaging), a documented kill-switch with a sub-five-minute SLA, and a continuous evaluation harness on real traffic. If any of the four is missing, what goes into production is an uncontrolled experiment.

What is the most common failure point in AI agent implementations?

Scope generality. When a project starts with “we want an agent for customer service” without specifying which exact process, what its inputs are, what outputs it produces, and which edge cases it covers, the implementation has no way to fail detectably. It fails silently, and nobody knows when.

What is MCP and why does it matter?

The Model Context Protocol (MCP) is an open standard that defines how an agent integrates with external tools: CRM, ERP, calendar, email, WhatsApp Business. It matters because it makes interoperability auditable: you can know exactly which systems the agent talks to, how, and with what permissions. An agent that cannot describe its integrations via a standard protocol makes governance and portability difficult.

Why must the kill-switch be designed first?

Because an agent that touches real operations can cause damage quickly. If an agent processes orders and starts generating incorrect entries, or if a customer service agent starts giving wrong information, the damage compounds with every message that passes until someone stops the system. The kill-switch must exist from day one so that the cost of an error does not depend on how long it takes to reach the vendor.

What is the difference between an offline harness and a production harness?

An offline harness runs a static set of test cases against the model. It verifies initial behavior and catches regressions when you change a model version. A production harness evaluates the agent on real traffic: real cases, with the actual distributions of inputs the business generates every day. The difference matters because models drift: new versions, new business data, new types of customer messages. Without continuous evaluation, you do not know when the agent stops performing as well as the day you activated it.

What should a vendor be able to answer before implementing an agent?

Five questions the vendor must answer in the first meeting without additional preparation: exactly which process the agent will handle, which measurable outcome improves and how the baseline was measured, how the agent is deactivated and in how many minutes, who covers the process when the agent is off, and how the agent's performance is periodically evaluated. If the vendor does not have answers to all five, the project is not ready for production.