When people talk about applying AI to work, the conversation often begins with the prompt.

A good prompt feels powerful. You describe what you want, the model answers, and for a moment the whole process looks simple. This is why many AI demonstrations are impressive: they show the visible part of the interaction, not the operational structure required to make it reliable.

But real work is different.

In a real business environment, an answer is not useful only because it is well written. It has to be based on the right context. It has to respect internal rules. It has to avoid exposing unnecessary information. It has to stay within security boundaries. It has to be reviewed, logged, corrected, and improved over time.

That is where the prompt stops being enough.

The project I describe here started from a concrete operational problem: helping support operators prepare better replies to customer tickets in LaDesk / LiveAgent. The goal was not to create a fully autonomous AI that answers customers on its own. The goal was more practical: give operators an assistant that can help them understand the conversation, use relevant context, draft a clearer reply, and keep the final decision in human hands.

This distinction matters. In many real workflows, the first useful version of AI is not the one that replaces people. It is the one that helps them work faster, more consistently, and with fewer avoidable mistakes.

What you will learn

This article is not a tutorial about prompt writing. It is a practical reflection on what is required to make AI useful inside a real business workflow.

You will learn why a prompt alone is not enough when the system has to deal with customer context, internal data, security boundaries, costs, human review, and operational responsibility.

The article explains how I approached the design of an AI-assisted support workflow by combining precise PRDs, controlled agents, project-specific skills, persistent memory, risk logs, context logs, automated tests, and a strict human-in-the-loop boundary.

It also looks at a broader question: how do you move from an impressive AI demo to a process that can actually survive contact with real work?

The focus is not on using the most powerful model everywhere, or on giving agents unlimited autonomy. The focus is on structure: defining the problem correctly, limiting responsibilities, making decisions traceable, preserving the right context, testing fragile points, and keeping humans in control where responsibility still matters.

The broader lesson is simple: AI becomes useful when it is integrated into a process that can be inspected, tested, corrected, and trusted over time.

The real problem behind the assistant

A support ticket is rarely just a message.

A customer may have written several times. The conversation may include frustration, missing details, unclear references, or partial information. To answer properly, the operator may need to check an order, verify a product, look at internal notes, consult a procedure, or understand what has already been promised.

In this context, the problem is not simply “write a reply in French”. The real problem is deciding what should be said, what should not be said, and which information is reliable enough to use.

This is exactly the kind of work where AI can be useful, but also dangerous if used carelessly. A model can produce a fluent answer very quickly. It can also produce an answer that sounds correct while being based on assumptions.

For this reason, the first version of the system keeps a strict human-in-the-loop design. The AI prepares a draft. The operator reads it, edits it, and decides whether it can be used. Nothing is sent automatically to the customer.

That choice may sound conservative, but it is the right starting point. Before automating the final action, the intermediate process must prove that it is reliable, understandable, and useful.

From a prompt to a controlled workflow

The system is built around a simple principle: AI should not be left alone with broad access and vague instructions.

The browser extension is only the interface. It opens a sidebar in the operator’s working environment and lets the operator interact with the assistant. It does not contain API keys. It does not read the whole page. It extracts only the ticket ID.

The backend does the real orchestration. It retrieves the ticket thread, manages credentials, talks to the model, and applies the rules that should not live inside the browser. It is protected behind Cloudflare Zero Trust and designed to connect later to internal tools through MCP, including the company back office and WooCommerce.

This separation is important because it changes the nature of the system. The assistant is not a script injected into a browser with excessive permissions. It is a controlled service with boundaries.

The extension handles the interaction. The backend handles authority. The human keeps responsibility.

That is the architecture in three sentences.
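
The boundary described above can be sketched from the backend's side. This is a minimal illustration under stated assumptions, not the project's code: `DraftRequest`, `parse_draft_request`, the ticket ID pattern, and the `LIVEAGENT_API_KEY` variable name are all hypothetical.

```python
import os
import re
from dataclasses import dataclass

# Hypothetical ticket ID format; the real LiveAgent format may differ.
TICKET_ID_PATTERN = re.compile(r"[A-Za-z0-9-]{3,40}")


@dataclass(frozen=True)
class DraftRequest:
    """The only thing the extension is allowed to send: a ticket ID."""

    ticket_id: str


def parse_draft_request(payload: dict[str, object]) -> DraftRequest:
    """Reject anything that is not exactly one well-formed ticket ID."""
    if set(payload) != {"ticket_id"}:
        raise ValueError("payload must contain only 'ticket_id'")
    ticket_id = payload["ticket_id"]
    if not isinstance(ticket_id, str) or not TICKET_ID_PATTERN.fullmatch(ticket_id):
        raise ValueError("malformed ticket ID")
    return DraftRequest(ticket_id=ticket_id)


def load_api_key() -> str:
    """Credentials are read from the backend environment, never from the browser."""
    key = os.environ.get("LIVEAGENT_API_KEY")  # hypothetical variable name
    if not key:
        raise RuntimeError("LIVEAGENT_API_KEY is not configured")
    return key
```

The design choice is that the backend refuses extra fields instead of ignoring them, so an extension that starts sending page content fails loudly.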

The most important work happened before the code

The most useful part of this project was not writing code immediately. It was preparing the work so that coding agents could operate inside a clear structure.

I created precise PRDs before implementation. Not broad documents full of generic intentions, but operational specifications divided into small executable tasks. Each task had a purpose, a scope, dependencies, acceptance criteria, and a clear definition of completion. This is the same idea I described when arguing that you should structure requirements so precisely that the agent’s behaviour becomes constrained before it ever produces code.

This changed the behaviour of the agents.

When the instruction is vague, the model has to invent. When the task is precise, the model has to execute.

This is one of the main lessons I take from AI-assisted development: the quality of the decomposition matters as much as the quality of the model. A powerful model can still make poor decisions if the work unit is unclear. A less expensive model can produce useful work if the task is narrow, explicit, and testable.

The PRD became the operating contract of the project. A task was not considered complete because an agent said it was complete. It was complete only when the acceptance criteria were met, the relevant tests passed, the risk log was updated, and the change was committed as an atomic Git step.

This is not bureaucracy. It is a way to keep generated work inspectable.
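
The completion rule can be expressed as a small gate check. The sketch below is illustrative only: `TaskState` and its field names are assumptions, not the format the PRDs actually use.

```python
from dataclasses import dataclass


@dataclass
class TaskState:
    """State of one PRD task; field names are illustrative."""

    acceptance_criteria_met: bool = False
    tests_passed: bool = False
    risk_log_updated: bool = False
    committed_atomically: bool = False
    agent_claims_done: bool = False  # recorded, but never sufficient on its own


def missing_gates(task: TaskState) -> list[str]:
    """Return the completion gates that are still open."""
    gates = {
        "acceptance criteria": task.acceptance_criteria_met,
        "tests": task.tests_passed,
        "risk log": task.risk_log_updated,
        "atomic commit": task.committed_atomically,
    }
    return [name for name, closed in gates.items() if not closed]


def is_complete(task: TaskState) -> bool:
    """A task is complete only when every gate is closed."""
    return not missing_gates(task)
```

Note that `agent_claims_done` is deliberately absent from the gates: the agent's own statement is stored but does not count.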

Agents need roles, not freedom

It is tempting to describe this kind of project as “multi-agent”. Technically, that is true. But the important part is not the number of agents. The important part is that each agent has a limited responsibility.

One agent supervises the PRD workflow. Another implements backend tasks. Another works on the frontend. Another focuses on Anthropic integration, streaming, MCP, and tool use. Another writes tests when the risk justifies it. Another reviews the diff. Another records risks. Another updates the PRD state.

The point is not to create complexity for its own sake. The point is to avoid asking one model to plan, implement, review, approve, remember, and close its own work.

That is where many agentic workflows become fragile. If the same agent defines success, implements the change, and declares it complete, the process becomes too self-referential.

A useful agentic process needs separation of duties. The agent that writes code should not be the only judge of whether the code is acceptable. The agent that discovers a risk should not silently ignore it. The agent that changes implementation should not casually rewrite the PRD.

In practice, this turns AI-assisted development into something closer to a small technical organization than a single conversation.

Skills are the operating manual of the project

The agents also need project-specific context. Repeating the same rules in every prompt is inefficient and unreliable. For this reason, I created local skills: reusable instructions that explain how to work inside this repository and this workflow.

There are skills for backend work, frontend work, PRD handling, Anthropic integration, memory, commits, risk logging, and review. They act as a local operating manual.

This is a small detail, but it makes a large difference.

A backend agent should not have to rediscover the project conventions every time. A frontend agent should not have to be reminded repeatedly of the Chrome extension constraints. A reviewer should not improvise what a review means. A memory writer should not decide randomly what deserves to be stored.

Skills reduce repetition and make the process more stable across sessions.

They also express an important principle: not all context belongs in the prompt of the moment. Some context belongs to the project itself.

Cost control is part of the architecture

Another lesson is that model choice should be treated as an architectural decision, not only a preference.

It is easy to use the most powerful model for everything. It is also expensive and often unnecessary. In this project, the most capable model is reserved for judgment: supervision, review, architectural decisions, and high-risk checks. More routine implementation can be delegated to a cheaper model when the task is clear enough. This is the same principle behind delegating work across model tiers: stop paying senior rates for junior work.

This creates a practical token economy.

The stronger model is used where reasoning and evaluation matter most. The cheaper model is used where the work is more mechanical and well specified. A second independent model can be used for review when a high-risk area deserves another point of view. Automated tests provide a control layer that does not depend on model confidence.

This is a more realistic way to think about AI cost. The same economic logic appears in an AI-assisted product sourcing pipeline that reduces 300,000 supplier references to a reviewable shortlist. There, the model is cheap and the data is expensive, so the architecture puts the LLM only in the one stage where it genuinely earns its place, and protects the costly part with caching and deterministic engineering everywhere else.

The question is not simply which model is best. The better question is where the best model is actually needed.
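
The allocation rule can be sketched as a routing function. The tier names, the `WorkItem` fields, and the decision to send unclear tasks to the stronger tier are assumptions made for illustration; the article only states that cheaper models are used when the task is clear enough.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    STRONG = "strong"  # judgment: supervision, review, architecture
    CHEAP = "cheap"    # routine, well-specified implementation


JUDGMENT_KINDS = {"supervision", "review", "architecture"}


@dataclass(frozen=True)
class WorkItem:
    kind: str  # e.g. "implementation", "review"
    high_risk: bool = False
    well_specified: bool = True


def choose_tier(item: WorkItem) -> Tier:
    """Route judgment and risk to the strong tier, the rest to the cheap one."""
    if item.kind in JUDGMENT_KINDS or item.high_risk:
        return Tier.STRONG
    # Assumption: an unclear task is not delegated to the cheap tier,
    # because the cheap tier would have to invent the missing detail.
    if not item.well_specified:
        return Tier.STRONG
    return Tier.CHEAP


def needs_second_reviewer(item: WorkItem) -> bool:
    """High-risk reviews may get a second, independent model."""
    return item.kind == "review" and item.high_risk
```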

Memory should be selective

Persistent memory is useful only if it is disciplined.

For this project, durable decisions are stored in a local SQLite database dedicated to the project. This includes technical decisions, API constraints, model identifiers, gotchas, architectural choices, and workflow rules that should not be rediscovered every few sessions.

The purpose is not to remember everything. That would create noise. The purpose is to remember what would create cost, confusion, or risk if forgotten. This is a smaller, project-scoped version of the same conviction that historical data is the real foundation of useful AI: a system reasons well only when it can rely on what was already learned.

For example, if a provider name in OpenCode has already caused an error, that belongs in memory. If a specific streaming approach was chosen because another one does not work well with the frontend, that belongs in memory. If a model parameter is forbidden, that belongs in memory. If a PRD contains a legacy reference that should no longer be followed, that belongs in memory.

The rule is simple: search first, then write when something durable is discovered.

This avoids two common problems at the same time. It prevents repeated analysis, and it prevents important decisions from remaining implicit.
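
The "search first, then write" rule fits in a few lines of standard-library `sqlite3`. This is a sketch under assumptions: the `decisions` table, its columns, and the function names are hypothetical, and the real schema is not described here.

```python
import sqlite3


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the project memory; the schema is illustrative."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS decisions (
               id INTEGER PRIMARY KEY,
               topic TEXT NOT NULL,
               note TEXT NOT NULL
           )"""
    )
    return conn


def search(conn: sqlite3.Connection, term: str) -> list[tuple[str, str]]:
    """Return (topic, note) rows whose topic or note contains the term."""
    like = f"%{term}%"
    rows = conn.execute(
        "SELECT topic, note FROM decisions"
        " WHERE topic LIKE ? OR note LIKE ? ORDER BY id",
        (like, like),
    )
    return rows.fetchall()


def remember(conn: sqlite3.Connection, topic: str, note: str) -> bool:
    """Search first; write only if nothing matching the topic is stored."""
    if search(conn, topic):
        return False  # already known: read it instead of duplicating it
    conn.execute("INSERT INTO decisions (topic, note) VALUES (?, ?)", (topic, note))
    conn.commit()
    return True
```

Returning `False` instead of overwriting keeps the first recorded decision visible, so a later session has to read it before changing it.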

The risk log is more important than it looks

Most project documents describe what has been done. But in real work, many problems come from what has not been done.

A test was skipped. A decision was assumed. An endpoint was not verified. A security concern was accepted temporarily. A feature was implemented in a reduced form. A mismatch appeared between specification and code. A follow-up was mentioned but not tracked.

These are the things that disappear easily in AI-assisted work, because models tend to produce a smooth story. They make the process sound coherent, even when it involved uncertainty.

That is why this project includes a risk log.

The risk log does not exist to celebrate progress. It exists to preserve friction. It records what was skipped, assumed, blocked, reduced, or left unresolved. The same instinct drives AI-assisted internal bug reporting: the value is not a fluent narrative, it is a faithful, well-formed record of what actually went wrong.

This matters because a system is not more reliable when it hides uncertainty. It is more reliable when uncertainty is visible and can be acted on.
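
One possible shape for such an entry is sketched below. The five kinds mirror the verbs used above; the field names, the JSON-lines format, and the `open_risks` helper are assumptions, not the project's actual log format.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class RiskKind(str, Enum):
    SKIPPED = "skipped"
    ASSUMED = "assumed"
    BLOCKED = "blocked"
    REDUCED = "reduced"
    UNRESOLVED = "unresolved"


@dataclass(frozen=True)
class RiskEntry:
    task_id: str
    kind: RiskKind
    description: str
    follow_up: str = ""  # empty means nobody is tracking it yet


def to_log_line(entry: RiskEntry) -> str:
    """Serialize one entry as a single JSON line."""
    record = asdict(entry)
    record["kind"] = entry.kind.value
    return json.dumps(record, sort_keys=True)


def open_risks(lines: list[str]) -> list[dict]:
    """Return entries without a follow-up: the ones most likely to disappear."""
    records = [json.loads(line) for line in lines if line.strip()]
    return [record for record in records if not record["follow_up"]]
```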

The context log preserves the story

Alongside the PRD, memory, and risk log, there is also a context log.

The context log is different from a changelog. A changelog says what changed. The context log explains why something changed, what caused the decision, and what was still uncertain at the time.

This becomes useful very quickly.

During the project, small events matter: a model ID is wrong, a provider setting changes, a convention is corrected, an old PRD reference becomes dangerous, a manual decision is made outside the agent workflow, or a test remains pending.

If these events are not recorded, the project looks cleaner than it really is. After a few days, the team sees the final state but loses the reasoning that produced it.

With AI agents, this risk is even stronger. Agents can generate code and summaries, but they do not automatically preserve the true history of operational friction.

The context log makes the process observable, not only the result.

Testing cannot stop at the backend

For the backend, automated checks are expected: pytest for tests, ruff for linting, and mypy for type checking.

The frontend deserves attention too. A Chrome extension may look like a thin interface, but it contains fragile logic: ticket ID extraction, stream handling, parsing, abort behaviour, expired authentication, error states, and React components that connect the operator to the assistant.

The important question is not whether every visual detail has a test. The important question is whether the fragile points are covered. This is the same argument I made about why AI-written code still needs serious testing after the first commit: code that looks correct on the first commit is not the same as code that is verified.

This is especially true where browser behaviour, backend streaming, and AI output meet. Those are the parts where small mistakes can create confusing user experiences or silent failures.

The project therefore treats testing as part of the workflow, not as an optional cleanup step.
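
As an example of a fragile point worth covering, consider a stream that delivers a message split across chunks. The sketch below assumes a newline-delimited stream, which is an assumption for illustration; `LineBuffer` is a hypothetical helper, written in Python to match the backend tooling.

```python
class LineBuffer:
    """Reassemble newline-delimited messages from arbitrary chunks."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, chunk: str) -> list[str]:
        """Return the complete, non-empty lines made available by this chunk."""
        self._pending += chunk
        # Everything before the last newline is complete; the tail is kept.
        *complete, self._pending = self._pending.split("\n")
        return [line for line in complete if line]


def test_line_split_across_chunks() -> None:
    """A message cut in the middle must not be emitted early or lost."""
    buf = LineBuffer()
    assert buf.feed("hel") == []
    assert buf.feed("lo\nwor") == ["hello"]
    assert buf.feed("ld\n") == ["world"]
```

A test like this says nothing about how the sidebar looks. It pins down the behaviour that fails silently when it is wrong.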

Mocking MCP before using real systems

The backend is designed to work with MCP servers for internal systems and WooCommerce. Those systems do not need to be fully ready from day one.

The important decision is to mock them with the same tool names and schemas expected from the real servers later. That way, moving from mock to real should be an environment change, not a rewrite of prompts or orchestration logic.

This is a practical way to build before all dependencies are mature.

A mock is not just a shortcut. If designed correctly, it protects the architecture from becoming a throwaway prototype. It also forces an honest awareness of the sandbox problem, that is, what happens when your AI does not share your reality. A mock is a controlled, declared gap between the model’s world and the real one, which is far safer than an undeclared one.
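
The "environment change, not a rewrite" idea can be sketched without any MCP library. Everything here is hypothetical: the `get_order` tool, its schema, the `TOOLS_MODE` variable, and the class names. No real MCP client API is shown or implied.

```python
import os
from typing import Any, Protocol

# Hypothetical tool contract shared by mock and real backends:
# tool name -> exact set of argument names.
TOOL_SCHEMAS: dict[str, set[str]] = {"get_order": {"order_id"}}


class ToolBackend(Protocol):
    def call(self, name: str, arguments: dict[str, Any]) -> dict[str, Any]: ...


def validate(name: str, arguments: dict[str, Any]) -> None:
    """Enforce the same names and argument sets for every backend."""
    if name not in TOOL_SCHEMAS:
        raise KeyError(f"unknown tool: {name}")
    if set(arguments) != TOOL_SCHEMAS[name]:
        raise ValueError(f"arguments for {name} must be {sorted(TOOL_SCHEMAS[name])}")


class MockBackend:
    def call(self, name: str, arguments: dict[str, Any]) -> dict[str, Any]:
        validate(name, arguments)
        # Declared as mock data so it can never be mistaken for a real record.
        return {"order_id": arguments["order_id"], "source": "mock"}


class RealBackend:
    def call(self, name: str, arguments: dict[str, Any]) -> dict[str, Any]:
        validate(name, arguments)
        raise NotImplementedError("real MCP client is not wired in this sketch")


def backend_from_env() -> ToolBackend:
    """Select the backend from the environment; callers do not change."""
    return RealBackend() if os.environ.get("TOOLS_MODE") == "real" else MockBackend()
```

Because both backends validate against the same `TOOL_SCHEMAS`, a prompt or orchestration step that works against the mock is exercising the contract the real server is expected to honour.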

Why Git still matters in an AI workflow

Every completed task produces an atomic Git commit.

This sounds ordinary, but in an AI-assisted process it becomes even more important. Git is not only version control. It is a control surface.

If an agent changes too much, the diff shows it. If the task touches unexpected files, the diff shows it. If a change breaks something later, there is a clear point to return to. If the commit is too broad, that is a signal that the task was probably too broad as well.

Generated work has to become inspectable work. Git helps enforce that.
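
One way to use the diff as a control surface is to compare the files a commit touched with the paths the task was allowed to touch. The sketch below is an assumption about how that could look: the function names and the example prefixes are hypothetical, and only the standard `git diff --name-only` command is relied on.

```python
import subprocess


def changed_files(base: str = "HEAD~1", head: str = "HEAD") -> list[str]:
    """List files changed between two commits using `git diff --name-only`."""
    result = subprocess.run(
        ["git", "diff", "--name-only", base, head],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def out_of_scope(files: list[str], allowed_prefixes: tuple[str, ...]) -> list[str]:
    """Return the files that fall outside the task's declared paths."""
    return [path for path in files if not path.startswith(allowed_prefixes)]
```

For a backend task declared with the prefix `backend/`, a result such as `["extension/src/App.tsx"]` from `out_of_scope` is the signal that the commit, and probably the task, was too broad.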

Measuring value before adding autonomy

The first version of the assistant is intentionally conservative.

It helps prepare replies, but it does not send them. This creates the right conditions to measure value before increasing automation. It is the same human-gate pattern as the deployment queue that records what an AI assistant changed without granting it permission to ship: the automated actor produces a structured, inspectable artifact, and the decision that carries real risk stays with the person who owns the consequences.

The useful questions are practical. Does the assistant reduce the time needed to prepare a reply? Are the drafts usable with minor edits? Does it correctly use the ticket context? Does it avoid unsupported claims? Does it reduce repetitive writing? Does it improve consistency between operators? Does it create new review costs? Do operators trust it more over time?

This is where AI projects become real.

The question is not whether the model can write a fluent answer. It can. The question is whether the system improves the workflow once human review, business constraints, latency, cost, security, and maintenance are included.

If the answer is yes, more automation can be considered later.

If the answer is no, adding autonomy only makes the problem larger.
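
One of the questions above, whether drafts are usable with minor edits, can be approximated by comparing each draft with the reply the operator actually sent. This is a rough sketch: the character-level similarity from `difflib` and the 0.2 threshold are assumptions, not measurements or settings from the project.

```python
from difflib import SequenceMatcher


def edit_ratio(draft: str, sent: str) -> float:
    """0.0 means the draft was sent unchanged; 1.0 means no text in common."""
    return 1.0 - SequenceMatcher(None, draft, sent).ratio()


def usable_share(pairs: list[tuple[str, str]], threshold: float = 0.2) -> float:
    """Share of (draft, sent) pairs that needed only minor edits."""
    if not pairs:
        return 0.0
    usable = sum(1 for draft, sent in pairs if edit_ratio(draft, sent) <= threshold)
    return usable / len(pairs)
```

A number like this does not measure trust or correctness. It only shows how much of the draft survived human review, which is a reasonable first signal before any discussion of autonomy.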

What this project teaches

The main lesson is that applied AI is not mainly about generation. It is about coordination.

The model is only one part of the system. Around it, there must be decisions about data access, human control, tool permissions, model allocation, memory, logs, testing, review, and risk.

This is less spectacular than a demo, but much more useful.

A demo shows that AI can produce an answer. A working system has to show that the answer belongs in the workflow.

That is the difference.

Conclusion

This project started from a practical support problem, but the lesson is broader.

AI becomes useful in real business environments when it is treated as part of a controlled process. The prompt matters, but it is not the system. The system is the architecture around the prompt: the boundaries, the PRDs, the agents, the skills, the memory, the logs, the tests, the reviews, and the human decision points.

This is the direction of applied AI that interests me most.

Less focus on isolated prompts.

More focus on observable systems.

Less presentation-driven automation.

More operational discipline.

Less magic.

More responsibility.