Minimum contract for an AI agent in production (4 non-negotiable conditions)
The 4 conditions we require before implementing an AI agent for an SMB: bounded scope, MCP interoperability, kill-switch, and production eval harness.
Key points
Most AI agent implementations do not fail because of the technology. They fail because there was no starting contract.
By “contract” we do not mean a legal document. We mean: a set of concrete conditions that must be true before activating the agent on real business data and processes. Without these conditions, what goes live is not an agent in production. It is an experiment billed to the client.
At serpixel (Clever European Business, S.L.) we have four non-negotiable conditions that we verify before signing any implementation. They are not there to protect us. They are there because they are the only way we know to guarantee that what we deliver works when it matters and can be stopped when it fails.
Condition 1: one agent, one workflow, one metric
The first failure point of almost every implementation is scope. Not the technology, not the model, not the integrations: the scope.
When a project starts with “we want an agent for customer service” without specifying much more, what we have is an intention, not a project. A vague process cannot be implemented productively. And, worse, it cannot fail detectably: an agent with diffuse scope fails silently, and nobody knows when.
The filter question is concrete: what metric will change by week four?
If the answer is “improve customer service” or “be more efficient,” that is not an answer: it is an aspiration. The answer we accept is of a different kind: “the percentage of order drafts accepted without human editing will go from 0% to 70% in four weeks.” Or: “mean first response time on support emails will go from 6 hours to under 30 minutes.”
With a metric like that, the project has shape. We can define success cases, edge cases, and failure conditions. We can design an evaluation harness. We can know whether the agent works or not.
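As a minimal sketch, a metric of this kind can be written down as a small record with a baseline, a target, and a deadline. The `OutcomeMetric` class and its field names are hypothetical and used only for illustration. The numbers are the two examples quoted above, and the four-week deadline on the second one is an assumption taken from the filter question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutcomeMetric:
    """Hypothetical record for the single metric the agent is accountable for."""
    name: str
    baseline: float          # value before the agent goes live
    target: float            # value that counts as success
    deadline_weeks: int      # week by which the target should be reached
    higher_is_better: bool = True

    def is_met(self, observed: float) -> bool:
        """Compare an observed value against the target."""
        if self.higher_is_better:
            return observed >= self.target
        return observed < self.target  # "under 30 minutes" is a strict bound


# The two example answers from the text, expressed as data.
drafts = OutcomeMetric(
    "order drafts accepted without human editing (%)", 0.0, 70.0, 4
)
response = OutcomeMetric(
    "mean first response time on support emails (minutes)", 360.0, 30.0, 4,
    higher_is_better=False,
)

print(drafts.is_met(72.0), response.is_met(45.0))
```

An answer such as “be more efficient” cannot be written in this form, because it has no baseline and no target. That is the practical use of the filter question.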
Without it, we are doing a research project, and nobody has told the client they are paying for the research.
Condition 2: verifiable interoperability via MCP
An agent that does not touch business tools is not an agent. It is an expensive chatbot.
The distinction is functional. A chatbot converses. An agent executes real actions: read an email, identify the customer in the CRM, check stock in the ERP, create an order draft, escalate to a person when the case is ambiguous. If the agent does none of these actions, it does not reduce hours of mechanical work. It reduces response time to questions, which is a different problem.
Today, the standard that makes this interoperability verifiable and auditable is the Model Context Protocol (MCP). An agent with integrations documented via MCP lets you know exactly which systems it talks to, how it talks to them, and with what permissions. If a vendor cannot describe their agent’s integrations in terms of tools, methods, and permissions, the governance of the system is hard to establish.
The filter question: how many internal systems does the agent connect to via MCP, and what methods does each one expose?
The client does not need to understand the protocol in detail. They need to understand the list: “the agent reads from the CRM, writes order drafts to the ERP, does not write to the CRM, does not send emails autonomously.” With that list, the client knows exactly where the agent can fail and what the blast radius of an error is.
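The list can also be kept as data and checked mechanically. The sketch below is not part of the MCP specification and does not use any MCP library: `ToolGrant`, the system names, and the method names are hypothetical, chosen to mirror the example list and the actions mentioned earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolGrant:
    """One entry in a hypothetical permission manifest."""
    system: str
    method: str
    access: str  # "read" or "write"


# Mirrors the example: reads the CRM, reads stock and writes drafts in the
# ERP, has no CRM write method and no email-sending method at all.
GRANTS = [
    ToolGrant("crm", "get_customer", "read"),
    ToolGrant("erp", "check_stock", "read"),
    ToolGrant("erp", "create_order_draft", "write"),
]


def is_allowed(system: str, method: str) -> bool:
    """Anything not listed in the manifest is denied."""
    return any(g.system == system and g.method == method for g in GRANTS)


def blast_radius() -> list[str]:
    """Systems the agent can modify, i.e. where an error can leave a trace."""
    return sorted({g.system for g in GRANTS if g.access == "write"})


print(is_allowed("crm", "update_customer"))  # not in the manifest
print(blast_radius())
```

The design choice here is deny by default: the agent can only do what is listed, so the list the client reads is also the list that bounds the damage.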
Condition 3: kill-switch effective within five minutes
The mechanism to stop the agent is the first thing designed, not the last.
The argument is operational. An agent that touches real data can cause damage fast. If an agent processes WhatsApp orders and starts generating incorrect entries in the ERP, the damage is proportional to the number of orders processed before someone detects the problem and stops the system. If reaching the vendor takes two hours, the damage can be two hours of incorrect orders.
The right kill-switch meets three conditions:
- Actionable by the client. No need to call the vendor. It can be an environment variable, a button in the admin panel, a documented API call, or a setting in the client’s own internal tool. What is not a kill-switch is saying “send an email to the serpixel team and we will respond within two hours.”
- Effectiveness SLA under five minutes. From the moment the client activates it to when the agent stops processing new cases. Not from when the client sends the request: from when they press the button.
- Documented human fallback. The kill-switch is not complete without knowing who picks up the process when the agent is off. If the agent processes 200 orders a month and the kill-switch is activated, someone has to process them. Who that person is, which tools they use, and how much time they have must be documented in the statement of work (SOW) and tested before go-live, not figured out in a hurry while the backlog accumulates.
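A minimal sketch of the three conditions together, under stated assumptions: the switch is a flag file the client can create without the vendor, it is checked before every new case, and cases that arrive while it is on go to a human queue. The sketch uses a flag file rather than an environment variable because a process that is already running does not see environment changes made from outside it. The file name `AGENT_STOP` and both function names are hypothetical.

```python
from pathlib import Path


def agent_enabled(stop_flag: Path) -> bool:
    """The agent is enabled as long as the stop flag file does not exist."""
    return not stop_flag.exists()


def handle_case(case: str, stop_flag: Path, human_queue: list[str]) -> str:
    """Check the switch before each new case; fall back to humans when it is on."""
    if not agent_enabled(stop_flag):
        human_queue.append(case)       # documented human fallback
        return "queued_for_human"
    return "processed_by_agent"        # placeholder for the real agent call


if __name__ == "__main__":
    flag = Path("AGENT_STOP")          # hypothetical location
    queue: list[str] = []
    print(handle_case("order #1", flag, queue))
```

Because the flag is read per case, the time to take effect is bounded by the duration of the case in flight, which is what the five-minute SLA has to cover.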
The filter question: how many minutes does it take to stop the agent if it starts causing damage, without calling anyone?
Condition 4: eval harness in production
The difference between an agent that gets implemented and one that keeps working long-term is continuous evaluation.
An offline evaluation harness is a set of static test cases: 50 representative messages with expected responses, run automatically against the model and marked pass/fail. It verifies initial behavior and catches regressions when you change a model version. But it is not production evaluation.
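An offline harness of this kind can be sketched in a few lines. The cases, the labels, and `toy_agent` are all hypothetical stand-ins: a real harness would hold the 50 representative messages and call the actual model.

```python
from typing import Callable

# Hypothetical static cases: (input message, expected label).
CASES = [
    ("I want to order 3 boxes of item A", "order"),
    ("Where is my invoice?", "billing"),
    ("hello??", "escalate"),
]


def run_offline(agent: Callable[[str], str], cases: list[tuple[str, str]]) -> dict:
    """Run every static case and report pass/fail counts and failing inputs."""
    failures = [msg for msg, expected in cases if agent(msg) != expected]
    return {
        "total": len(cases),
        "passed": len(cases) - len(failures),
        "failures": failures,
    }


def toy_agent(message: str) -> str:
    """Stand-in for the real model call; keyword rules only."""
    text = message.lower()
    if "order" in text:
        return "order"
    if "invoice" in text:
        return "billing"
    return "escalate"


print(run_offline(toy_agent, CASES))
```

Running the same cases before and after a model version change is what catches regressions. The cases themselves never change, which is exactly why this is not production evaluation.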
A production harness evaluates the agent on real traffic. Real cases, with the actual input distributions the business generates every day: messages in the customers’ real language, with their real spelling mistakes and the non-standard product names they actually use. It measures four things:
- Accuracy. What percentage of cases does the agent handle correctly?
- Latency. How long does it take to process each case?
- Cost per action. What is the inference cost per processed case?
- Behavioral drift. Is today’s quality the same as a month ago?
Drift matters. Language models change (new versions, vendor fine-tunings), business data changes (new products, new processes, new types of customers), and the agent’s behavior can degrade without anyone noticing until the client notices.
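The four measurements can be computed from per-case records. The sketch below assumes each production case is logged with a correctness verdict (from human review or a downstream check), a latency, and a cost. `CaseRecord` and its fields are hypothetical, and drift is reduced here to the accuracy difference between two time windows, which is only one simple way to measure it.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class CaseRecord:
    """One logged production case (hypothetical schema)."""
    correct: bool      # verdict from human review or a downstream check
    latency_s: float   # seconds to process the case
    cost: float        # inference cost, in whatever currency is billed


def summarize(records: list[CaseRecord]) -> dict:
    """Accuracy, mean latency, and cost per action for a non-empty window."""
    return {
        "accuracy": mean(1.0 if r.correct else 0.0 for r in records),
        "mean_latency_s": mean(r.latency_s for r in records),
        "cost_per_action": mean(r.cost for r in records),
    }


def accuracy_drift(current: list[CaseRecord], reference: list[CaseRecord]) -> float:
    """Accuracy change between two windows; negative means quality dropped."""
    return summarize(current)["accuracy"] - summarize(reference)["accuracy"]


# Illustrative data only: 9 of 10 correct a month ago, 7 of 10 this week.
last_month = [CaseRecord(True, 2.0, 0.02)] * 9 + [CaseRecord(False, 4.0, 0.04)]
this_week = [CaseRecord(True, 2.0, 0.02)] * 7 + [CaseRecord(False, 4.0, 0.04)] * 3
print(summarize(this_week), accuracy_drift(this_week, last_month))
```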
The filter question: who will read the evaluation data every week, and what decision is associated with the reading?
If there is no designated person and no associated decision (if accuracy drops below threshold X, we do Y), the harness is a dashboard nobody is reading.
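The “below threshold X, we do Y” rule can be written down so that every reading maps to a named owner and an action. This is a sketch only: `ReviewRule`, the 0.85 threshold, the owner, and the action text are hypothetical placeholders for what a real SOW would specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewRule:
    """Hypothetical rule: who acts, on which metric, and what they do."""
    metric: str
    threshold: float   # the "X": the rule fires when the value is below it
    action: str        # the "Y"
    owner: str         # the designated person who reads the data


RULES = [
    ReviewRule(
        "accuracy", 0.85,
        "pause agent and route cases to human fallback",
        "operations lead",
    ),
]


def decisions(weekly: dict, rules: list[ReviewRule]) -> list[str]:
    """Return the actions triggered by this week's readings."""
    return [
        f"{r.owner}: {r.action}"
        for r in rules
        if weekly[r.metric] < r.threshold
    ]


print(decisions({"accuracy": 0.80}, RULES))
```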
The contract is the minimum guarantee
Four conditions. Bounded scope, verifiable interoperability, effective kill-switch, continuous evaluation. None of the four is sophisticated: all of them can be explained in five minutes to a non-technical operations director.
What makes them hard is not understanding them: it is that they require upfront work. Documenting the process step by step is uncomfortable. Defining an outcome metric with a baseline is difficult when there is no prior data. Designing the kill-switch and the human fallback requires imagining a failure nobody wants to happen. Setting up the production harness from day one costs more than not having it.
But the cost of skipping any of the four does not fall on the vendor. It falls on the client’s business.
This is why we do not sign implementations that do not meet all four. It is not a commercial stance. It is the only way we know to guarantee that what we put into production works when it matters and can be stopped when it fails.
If you have a process idea and want to know whether it passes this filter, let’s talk for 30 minutes. We bring the process to the table and leave knowing whether there is an agent worth implementing, and if there is, what the four conditions would look like for your specific case.