Evolving Agentic Engineering at Prisma

This post builds on the agentic engineering series: Agentic Engineering: How Prisma Builds with AI introduces the practice, Agentic Engineering at Prisma covers the process, roles, and documentation layer, and Drive and the Maker goes deep on the process and the Maker role.

In Agentic Engineering at Prisma and Drive and the Maker, I wrote about how we were structuring engineering work around agents, specs, projects, milestones, and Makers. Like everyone else, we've learned a lot since then: where the process helped, where it became too heavy, and how we are adapting as models and harnesses improve.

The short version is that our approach has moved along two tracks that look like opposites.

The first track was highly opinionated: a structured Drive process, descriptive skills, defined stages, explicit artifacts, and a multi-agent execution model that could scale to very large changes. That gave us a shared language and brought engineers along while the industry was still working out what good even looks like. It also proved that agents could ship large, high-quality projects when planning, execution, and review were tight.

The second track moved almost the other way: leaner skills, fewer assumptions, explicit inputs and outputs, and a stronger reliance on the underlying model and harness. Instead of encoding an entire process into a series of opinionated, interdependent skills, we lean more on high-quality planning from frontier models, so cheaper models can take on the implementation work without having to make assumptions at implementation time.

I don't think the interesting question is "more process" or "less process". I think it is more precise process. We want the durable parts to stay: specs, plans, artifacts, feedback loops, and code quality standards. We want fewer brittle assumptions about the model, the harness, and the exact workflow each developer should use.

The lesson underneath all of it: an AI engineering process should help a team get productive now without becoming a system they have to maintain forever.

Why we started with an opinionated process

When we started, nobody had a stable answer for what good AI-assisted engineering looked like. The tools moved quickly. The models improved quickly. Harnesses were inconsistent, and every developer was experimenting with a different workflow.

In that environment, we made a deliberate choice. Rather than wait for the perfect process to emerge, we picked a direction and made it explicit. We built an opinionated Drive process and encoded it into descriptive skills that walked developers and agents through clear stages.

This mattered because the early problem was not really technical execution. It was organisational adoption.

Engineers needed a way to think about working with agents beyond "open the tool and prompt it".
Technical leaders needed a way to reason about quality, delivery, review, and ownership.
Teams needed to avoid every developer inventing a completely different mental model for the same work.

The process moved us from individual experimentation to team-level learning. That was its real job early on.

What the opinionated path proved

The first path was a heavy process and framework. The skills were verbose and descriptive. They encoded assumptions about how planning, execution, and review should happen, and they went as far as defining which sub-agents to create for certain kinds of work, then retaining those sub-agents so their context could persist across the workflow. The result was a large, multi-sub-agent system.

This was not a theoretical exercise. In Prisma Next, this approach let us make tens of thousands of changes at high quality. The scale was real, and the quality came from tight planning, controlled execution, and a review loop that stopped the work drifting away from the intended outcome.

I want to be careful about where the credit goes. The important part was not "using multiple agents". It was the execution and review loop. Each task and milestone ran iteratively: the agent executed, the work was reviewed, feedback was applied, and the next task started with better context. A large agent workflow only works if it has strong feedback loops. Without them, scale just gives you a larger pile of uncertain output.
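To make that loop concrete, here is a minimal Python sketch of an execute, review, apply-feedback cycle. It is an illustration only, not our actual tooling: `run_task`, `Review`, `fake_execute`, and `fake_review` are hypothetical names, and the agent and reviewer are stand-in functions rather than real model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Review:
    approved: bool
    feedback: List[str] = field(default_factory=list)


def run_task(
    task: str,
    execute: Callable[[str, List[str]], str],
    review: Callable[[str], Review],
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Execute a task, review it, and feed the feedback into the next attempt.

    Returns the approved output and the feedback accumulated along the way.
    Raises RuntimeError if the work is still not approved after max_rounds.
    """
    context: List[str] = []  # feedback carried forward between rounds
    for _ in range(max_rounds):
        output = execute(task, context)
        result = review(output)
        if result.approved:
            return output, context
        context.extend(result.feedback)
    raise RuntimeError(f"{task!r} not approved after {max_rounds} rounds")


# Hypothetical stand-ins for an agent and a reviewer.
def fake_execute(task: str, context: List[str]) -> str:
    return f"{task} (revision {len(context)})"


def fake_review(output: str) -> Review:
    if "revision 0" in output:
        return Review(approved=False, feedback=["add tests"])
    return Review(approved=True)


output, carried = run_task("rename column", fake_execute, fake_review)
```

The detail that matters in this sketch is that `context` survives between rounds, so the second attempt starts with the reviewer's feedback instead of starting cold. The `max_rounds` limit is an assumption added for the example, so that a task that never passes review surfaces as an error rather than looping.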

What I think that work proved:

Agents can scale to very large engineering changes.
Multi-agent workflows can be effective when the work is well planned.
Review loops are essential for maintaining code quality.
Artifacts matter, because they let state and decisions persist across planning, execution, and review.
A structured process can help a team align around a new way of working.
Opinionated skills can bring people along while the underlying practices are still immature.

It did not prove that this exact process should be used everywhere, that it generalises across every repository, or that highly descriptive skills will stay the right abstraction as models improve. It was tested mainly in one repository. That showed us what was possible, but it also meant we had to be careful not to over-generalise. The lesson was not "this is the universal process". It was "agents can scale when planning, execution, review, and artifacts are structured well".

Where the opinionated path started to cost us

The heavy process worked, but it had real costs, and the main one was that it baked in assumptions about the model and the harness.

At the time, those assumptions were useful. They compensated for tool limitations, reduced ambiguity, and made the workflow repeatable when the tools themselves were less reliable. The problem is that assumptions like that go stale as the tools improve.

A skill that exists to work around a model limitation can become unnecessary once the model improves. A workflow that compensates for a harness limitation can become brittle once the harness changes. A verbose instruction set that helps a weaker model can actively get in the way of a stronger one.

That creates a maintenance problem. If too much of the process is encoded, the team starts optimising the process itself. You end up owning a bespoke layer:

The process design.
The skill wording.
The assumptions about model behaviour.
The assumptions about harness behaviour.
The evaluation surface.
The ongoing maintenance.
The risk that improvements in the underlying tools make parts of your process obsolete.

None of this means opinionated processes are bad. Ours continues to be useful in some workflows, and I would make the same choice again given where the tooling was. That said, the more you encode into your own process, the more you own, and over time that cost becomes harder to ignore.

The lean path: skills as interfaces

The second path moved almost the opposite way. Instead of descriptive skills that encode a large process, we experimented with very lean ones: small, focused, explicit about inputs and outputs, narrowly scoped, and far less prescriptive about how the model should reason internally. The core skill set shrank to something close to spec, plan, and execute.

The shift in thinking is that a skill should define a contract, not narrate a process: what goes in, what artifact or output comes out, and what standard the result has to meet. For example:

Given this spec, produce a plan.
Given this plan, execute the next milestone.
Given this diff, review the implementation.
Given this review, apply corrections.
Given this completed project, produce a summary.

The contract matters more than the internal narration. A lean skill can survive model improvements because it does not over-specify how the model thinks. It defines what good output looks like and lets the model and harness get better underneath it. A skill should not read like a blog post. It should read like an interface.
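One way to picture a skill as an interface is the following minimal Python sketch. It is hypothetical: `Skill`, `run`, `stub_model`, and the specific standards are invented for illustration and are not the format of our real skills.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Skill:
    name: str
    inputs: Tuple[str, ...]  # artifacts the skill requires
    output: str              # artifact the skill produces
    standard: str            # what the result has to meet


# Hypothetical contracts mirroring the examples above.
SKILLS = [
    Skill("plan", ("spec",), "plan", "every milestone has acceptance criteria"),
    Skill("execute", ("plan",), "diff", "tests for the milestone pass"),
    Skill("review", ("diff",), "review", "each finding points at the diff"),
]

Model = Callable[[Skill, Dict[str, str]], str]


def run(skill: Skill, artifacts: Dict[str, str], model: Model) -> Dict[str, str]:
    """Check the skill's inputs, call the model, and record the output artifact."""
    missing = [name for name in skill.inputs if name not in artifacts]
    if missing:
        raise ValueError(f"{skill.name} is missing inputs: {missing}")
    given = {name: artifacts[name] for name in skill.inputs}
    return {**artifacts, skill.output: model(skill, given)}


# Stand-in for a real model call; how it reasons is not part of the contract.
def stub_model(skill: Skill, given: Dict[str, str]) -> str:
    return f"<{skill.output} from {', '.join(given)}>"


artifacts = {"spec": "Add soft deletes to the users table"}
for skill in SKILLS:
    artifacts = run(skill, artifacts, stub_model)
```

Nothing in the contract says how the model should think. Swapping `stub_model` for a stronger or cheaper model leaves `SKILLS` untouched, which is the property the lean path is after.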

I don't want to overstate this. Cheaper models with weaker reasoning sometimes still need more explicit guidance, especially around nuance, so "lean" is a direction rather than an absolute. The general pull is towards smaller, clearer skills that compose well.

Frontier planning, cheaper execution

Leaning on the model more puts more weight on the plan, and that turns out to be the point. As an industry we've learned that if agents can generate code quickly, code generation is not the scarce resource. Judgment is, and the plan is where most of the judgment gets captured.

A good plan makes the ambiguous things explicit: what should change, why, what should not change, which systems are likely involved, what the acceptance criteria are, which tests validate the work, where the risks and edge cases are, and what the agent should not assume. When that is done well, execution stops being open-ended. The execution agent is not inferring the whole product and architecture from a vague prompt. It is carrying out a defined plan against a known codebase and checking its own work.
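As a rough illustration, the questions a plan has to answer can be written down as a structure and checked for gaps before execution starts. This is a sketch under my own assumptions: the `Plan` fields simply mirror the list in the paragraph above, and `unanswered` is a hypothetical helper, not a real tool.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class Plan:
    changes: List[str] = field(default_factory=list)
    rationale: str = ""
    out_of_scope: List[str] = field(default_factory=list)
    systems_involved: List[str] = field(default_factory=list)
    acceptance_criteria: List[str] = field(default_factory=list)
    validating_tests: List[str] = field(default_factory=list)
    risks_and_edge_cases: List[str] = field(default_factory=list)
    do_not_assume: List[str] = field(default_factory=list)


def unanswered(plan: Plan) -> List[str]:
    """Names of the plan sections that are still empty, in declaration order."""
    return [f.name for f in fields(plan) if not getattr(plan, f.name)]


draft = Plan(
    changes=["add a deleted_at column to users"],
    rationale="support soft deletes",
    acceptance_criteria=["deleted users are excluded from default queries"],
)
gaps = unanswered(draft)
```

Here `gaps` lists the sections the draft has not addressed yet, such as `out_of_scope` and `validating_tests`. An empty section is exactly where an execution agent would otherwise have to guess.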

This is why we are comfortable using frontier models for planning and cheaper models for execution. Ambiguity is not spread evenly across the workflow. It concentrates in the spec and the plan, where a mistake propagates and a weak plan creates expensive downstream correction. Execution, once the plan is clear, is a narrower job.
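In code, that split can be as small as a routing table from workflow stage to model tier. The sketch below is hypothetical: `Routing`, the stage names, and the model names are placeholders, not real model identifiers or our actual configuration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Routing:
    planner: str   # used where ambiguity concentrates: spec and plan
    executor: str  # used once the plan is clear

    def model_for(self, stage: str) -> str:
        if stage in ("spec", "plan"):
            return self.planner
        if stage == "execute":
            return self.executor
        # Fail loudly rather than guess a tier for a stage we have not routed.
        raise ValueError(f"no routing defined for stage {stage!r}")


routing = Routing(planner="frontier-model", executor="cheaper-model")

# Trying a different executor is a one-field change; planning is unaffected.
open_weight = replace(routing, executor="open-weight-model")
```

Keeping the routing separate from the skills means a better planner or a cheaper executor can be swapped in without rewriting any instructions.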

Cost, without token maxing

We have never been interested in token maxing. There was no leaderboard for token usage, and we never wanted to imply that spending more automatically meant faster or higher-quality work.

At the same time, we did not want to constrain people into bad results. If someone only ever uses a weak model with a tiny context window and tight limits, they may conclude the whole workflow is useless, when really they never saw what good looked like. So the posture has been: spend responsibly, but do not optimise prematurely against the wrong constraint.

With the multi-agent workflow, we have had single tasks cost hundreds of dollars. We didn't make a habit of it, but it happened often enough to make us stop and look closely. Those moments were useful. They showed how quickly agent workflows get expensive, and they are part of why we don't want to spend our own time optimising around models or agent harnesses. There are other, very talented engineers working on those problems.

Don't overbuild around temporary limitations

As I said, we do not want to build the model or the harness ourselves. We can build skills, internal tools, and workflow glue, and we can create process where it helps us today. What I want to avoid is building a large system around the assumption that models and harnesses will stay weak.

The safer bet is that they keep improving. If they plateau, we already have the skill set to compensate, because we know how to write more structured skills and more opinionated processes. For now I would rather tolerate some short-term pain than overbuild around limitations that may be gone soon.

Lean tooling is mostly about adaptability:

If a new model plans better, we can use it.
If an open-weight model gets good enough for execution, we can route execution there.
If a harness improves at context management or review, we can drop some of our compensating instructions.
If execution gets much cheaper, the economics of delegation change.
If planning quality improves, we can simplify the skill layer further.

This is also why open-weight models matter to us. We have been using them internally, including for Gremlin, as a way to experiment with lower-cost execution while reserving frontier models for the parts of the workflow where quality has the most leverage. For planning we still want the strongest model available. For execution we are more willing to trade some speed or raw capability for lower cost, as long as the plan is clear and the review loop is strong.

Gremlin fits the broader direction. The long-term idea is not that every engineer runs every agent locally and babysits every step. A developer should be able to define the work clearly, hand it off, and come back to a reviewable PR. That does not remove the engineer from the loop. It moves them towards the highest-value parts of it: framing, planning, judgment, review, and ownership. As more work runs in parallel, a laptop is not the right long-term place to run all of it, and the interface becomes the artifact or the PR rather than a live terminal session someone has to supervise.

What leadership should standardise

The useful distinction for technical leaders is not "standardise every prompt" and it is not "let everyone do whatever they want". It is the line between principles and workflow.

I think the principles should be standardised:

Work starts from clear problem framing.
Large work produces a spec or equivalent artifact.
Plans are explicit enough to guide execution.
Acceptance criteria are testable.
Execution includes a review loop.
Agents can inspect and improve their own work.
Human review owns judgment.
Artifacts persist the important state.
Code quality standards are non-negotiable.
Uncertainty is surfaced, not hidden.

The workflow around those principles can vary from engineer to engineer, and from a leadership point of view that is mostly fine. Different engineers think differently, different parts of the codebase need different levels of structure, and different models suit different tasks. The goal is not uniformity for its own sake. It is reliable outcomes with the durable principles held constant.

Process as a bridge, not a destination

If there is one idea I would leave you with, it is that an agent process is transitional. It matters because it helps a team bridge from today's imperfect tools to tomorrow's better ones, but it should not become the thing you optimise forever.

Our early Drive skills were intentionally opinionated, and that was the right call at the time. They gave us a shared way of working, brought people along, and proved agents could scale to very large changes when the work was planned, executed, and reviewed carefully. As models and harnesses improve, the value of that structure changes. Some of it becomes redundant, some of the verbose guidance becomes noise, and some of the assumptions become maintenance burden.

So we are moving towards leaner skills, stronger artifacts, better planning, and cheaper execution. We still believe in specs, plans, feedback loops, code quality standards, and people owning judgment. We just want the workflow around those beliefs to stay adaptable.

If you want the earlier chapters of this story, start with Agentic Engineering: How Prisma Builds with AI, then Agentic Engineering at Prisma and Drive and the Maker, and see Gremlin for where the lean path is heading.