The Rise of Agent Engineering: What Scaling Enterprise AI Agents Really Demands

There is a moment in almost every enterprise AI initiative that feels like magic. Someone wires a model to a few internal documents, asks a question that used to take hours of searching, and receives a coherent answer in seconds. A prototype lands in a leadership meeting. The room lights up. The organization tells itself a new story: we are now an AI company, and the rest is rollout.

That moment is real, and it is also where many organizations quietly begin to fail.

Episode 176 of the AI After Work (AIAW) Podcast offered a rare kind of clarity about why. Danilo Nobrega, Founding Go-to-Market Lead for the Nordics at LangChain, did not speak like someone selling a new abstraction, nor like someone forecasting a distant AGI future. He spoke like someone who has watched technology waves mature in the only way they ever do: by colliding with production realities, revealing where the hidden complexity actually lives, and forcing teams to invent new disciplines to survive the collision.

The discipline is what he calls agent engineering. The phrase matters because it is an explicit rejection of the comforting assumption that agents are just chatbots with tools. In Danilo’s framing, the difference between something that looks good in a demo and something that holds up in a real organization is not primarily model choice, prompt cleverness, or a larger context window. The difference is whether the organization treats agents as systems that must be engineered: orchestrated, observed, evaluated, governed, and continuously improved once they meet real users, real data, and real business risk.

What makes that framing useful is that it is not philosophical. It is operational. It is a blueprint for how enterprises can stop mistaking prototypes for products.

The seduction of “I built it in five minutes”

Danilo opened the conversation with a story that is simultaneously inspiring and disorienting. Using Agent Builder inside LangSmith, he created a personal agent in about five minutes. He described in natural language what he wanted: scan his Gmail for customer emails that remain unanswered, scan Slack for messages he has not responded to, look at his calendar for upcoming customer meetings, and deliver a structured report to Slack at three in the morning so that by the time he starts working, nothing has slipped between the cracks. The first output was a bit odd, heavy on emojis, but he tweaked the instructions and formatting, and the system behaved.

That kind of pattern is emerging across the industry: a few lines of intent, an agent appears, and the distance between idea and execution collapses. It is exactly the kind of experience that pushes enterprises to believe scale is a procurement problem. If one person can build something useful before lunch, the logic goes, then a team can build dozens, and a company can roll them out broadly, and a platform team can industrialize them later.

But the moment you take that five-minute agent and place it inside a regulated enterprise workflow, you learn why agent engineering needs its own name. The agent is no longer a productivity hack. It becomes a participant in operations. It touches customer relationships. It influences decisions. It must handle failures gracefully, avoid data leakage, and produce consistent behavior under messy conditions, including the kinds of inputs no one thought to test. When the cost suddenly spikes because a workflow starts pulling too much context into memory, it becomes a budget issue rather than a curiosity. When an agent produces an answer that is plausible but wrong, it becomes a brand issue rather than an engineering issue. When an agent is connected to systems of record, it becomes a governance issue rather than a product feature.

In other words, the organization leaves the world of “cool demo” and enters the world of “operational liability.” That is the dividing line agent engineering exists to manage.

From chains to graphs: why linear thinking breaks in the enterprise

One of the most practical contributions Danilo made in the discussion was his explanation of why early LLM applications, even when they work, tend to become brittle as soon as they encounter real workflows. He used a metaphor that is unusually effective because it avoids jargon and still captures the core architectural shift.

LangChain, in its earliest framing, resembles a recipe: you do step one, then step two, then step three, and if nothing goes wrong, the cake comes out. That sequential framing fits a surprising number of early use cases, particularly those that are essentially “retrieve context, generate answer.” It also fits the way many teams initially think about LLM apps, because it mirrors how traditional backend services are composed: call A, then B, then C.

But enterprise work is rarely a cake recipe. It looks more like sourdough bread, where you constantly adjust based on what you see. Sometimes you need more water. Sometimes the dough is too sticky. Sometimes you have to pause and come back later. Sometimes you need to retry a step, or fork the process into parallel investigations, or hand the intermediate result to a person for approval. If you cannot represent that messiness explicitly, you end up pretending the world is linear until your system breaks in the middle of something that matters.

This is where LangGraph becomes more than just another library abstraction and instead represents an architectural shift. Danilo described it as a graph of nodes and edges with state and checkpoints at every node, allowing branches, loops, and persistence. One word in that description, state, is the quiet center of the whole story. Without state, an agent is a stream of text. With state, it becomes a workflow system that can be paused, resumed, inspected, and governed.
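To make the recipe-versus-sourdough contrast concrete, here is a minimal sketch in plain Python of a graph runner with shared state, a loop, and a checkpoint after every node. It is not LangGraph's API. The names `run_graph`, `draft`, `review`, and `END` are hypothetical, and the two nodes are stand-ins for a model call and a quality check.

```python
from copy import deepcopy
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Node = Callable[[State], State]
Router = Callable[[State], str]

END = "__end__"  # hypothetical sentinel marking the end of the graph


def run_graph(nodes: Dict[str, Node], edges: Dict[str, Router], start: str,
              state: State) -> Tuple[State, List[Tuple[str, State]]]:
    """Walk the graph from `start`, saving a checkpoint after every node."""
    checkpoints: List[Tuple[str, State]] = []
    current = start
    while current != END:
        state = nodes[current](state)
        checkpoints.append((current, deepcopy(state)))  # snapshot after each node
        current = edges[current](state)                 # next node depends on state
    return state, checkpoints


def draft(state: State) -> State:
    # Stand-in for a model call: each attempt produces a longer draft.
    attempts = state["attempts"] + 1
    return {**state, "attempts": attempts, "draft": "x" * (attempts * 10)}


def review(state: State) -> State:
    # Stand-in for a quality check on the draft.
    return {**state, "approved": len(state["draft"]) >= 30}


nodes = {"draft": draft, "review": review}
edges = {
    "draft": lambda s: "review",
    # Loop back to "draft" until the review passes; a linear chain cannot express this.
    "review": lambda s: END if s["approved"] else "draft",
}

final_state, checkpoints = run_graph(nodes, edges, "draft", {"attempts": 0})
print(final_state["attempts"], [name for name, _ in checkpoints])
```

The checkpoint list is what makes the run inspectable after the fact: every intermediate state is available, not only the final answer.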

He offered a particularly enterprise-relevant example: approvals. If a workflow needs a VP to approve a generated artifact, the system must be able to wait three days and then continue when approval arrives, without losing context or restarting from scratch. This is not a glamorous feature, but it is precisely the kind of feature that decides whether an agent can be embedded into the operational fabric of an organization.
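The approval scenario can be sketched with nothing more than a persisted checkpoint. The example below is an illustration under simplifying assumptions, not how LangGraph implements interrupts: the step names, the JSON file standing in for a durable store, and the `advance`, `save_checkpoint`, and `load_checkpoint` helpers are all hypothetical.

```python
import json
import os
import tempfile
from typing import Any, Dict

STEPS = ["generate", "await_approval", "publish"]


def advance(state: Dict[str, Any]) -> Dict[str, Any]:
    """Run steps in order until the workflow finishes or must wait for a human."""
    while state["step"] < len(STEPS):
        name = STEPS[state["step"]]
        if name == "generate":
            state["artifact"] = "Q3 pricing proposal (draft)"
        elif name == "await_approval" and state.get("approved") is None:
            state["status"] = "waiting"
            return state  # pause here; the checkpoint keeps all context
        elif name == "await_approval" and not state["approved"]:
            state["status"] = "rejected"
            return state
        elif name == "publish":
            state["published"] = state["artifact"]
        state["step"] += 1
    state["status"] = "done"
    return state


def save_checkpoint(state: Dict[str, Any], path: str) -> None:
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(state, fh)


def load_checkpoint(path: str) -> Dict[str, Any]:
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "workflow-123.json")

    # Day 0: run until the approval gate, then persist and stop.
    paused = advance({"step": 0})
    save_checkpoint(paused, path)
    paused_status = paused["status"]

    # Day 3: approval arrives; reload the checkpoint and continue from the gate.
    resumed = load_checkpoint(path)
    resumed["approved"] = True
    resumed = advance(resumed)

print(paused_status, "->", resumed["status"])
```

Because the state lives outside the process, the three-day wait costs nothing: no worker is held open, and the generated artifact is not regenerated on resume.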

The broader insight is that agent engineering is not primarily about making the model smarter; it is about making the surrounding system more resilient.

The non-deterministic problem: why old QA doesn’t work anymore

The deepest thread in the episode, and arguably the most novel, was Danilo’s explanation of why the familiar enterprise playbook for shipping software starts to fail when you replace deterministic code paths with probabilistic model behavior.

Traditional software development assumes repeatability. If a tax calculation function returns the wrong number, you write a test, fix the code, rerun the test, and deploy with confidence that the function will now produce the correct output for the tested case. Even when data changes, the logic remains stable.

LLM applications break that assumption. Even with the same prompt, you can get different answers, because the system is probabilistic by design. When you move from simple LLM wrappers to multi-agent systems that call multiple models, tools, and sub-agents in parallel, the space of possible behaviors expands dramatically. At that point, the thing you are shipping is not a set of deterministic functions, but a controlled behavior space.
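One way to picture what testing a "behavior space" means is to replace a single exact-match assertion with a pass rate measured over many runs. The sketch below assumes a hypothetical `fake_model` that stands in for a real LLM call, and an arbitrary release threshold of 85 percent chosen purely for illustration.

```python
import random
from typing import Callable


def fake_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM call: usually right, sometimes not."""
    answers = ["Paris", "Paris, France", "Lyon"]
    return rng.choices(answers, weights=[70, 25, 5])[0]


def pass_rate(check: Callable[[str], bool], prompt: str, runs: int, seed: int) -> float:
    """Call the model `runs` times and return the fraction of outputs that pass."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    passed = sum(check(fake_model(prompt, rng)) for _ in range(runs))
    return passed / runs


rate = pass_rate(lambda out: "Paris" in out, "What is the capital of France?",
                 runs=200, seed=7)
release_ok = rate >= 0.85  # gate on a rate, not on one lucky or unlucky sample
print(f"pass rate: {rate:.1%}, release ok: {release_ok}")
```

The check itself is also looser than string equality: two differently worded outputs both count as correct, which is usually what a team cares about.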

Danilo put it in a way that should stay with every enterprise CTO: in agent systems, the “code” has effectively moved from the source files to the traces. What truly defines your agent in production is not the Python code in your repository, but the sequence of runtime executions between input and output, because that is where its actual behavior takes shape.

This shift implies that the primary artifact of engineering is no longer just the source code itself, but the telemetry, traces, evaluation results, and feedback loops that make system behavior transparent, measurable, and continuously improvable.

Observability becomes the source code of agent systems

When Anders asked Danilo what a trace is, Danilo described it with a mundane analogy: it looks like a file system tree. There is a starting point, and then each call branches into subcalls. In that tree, you see every model invocation, every tool call, every step the agent took to produce the final answer. You also see two numbers that end up deciding whether your system survives: latency and cost.
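The file-system analogy maps naturally onto a small tree structure. The following sketch is a hypothetical `Span` type, not LangSmith's data model, and the step names, latencies, and costs are invented for illustration. It rolls latency and cost up from the leaves to the root.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    name: str
    latency_ms: float = 0.0   # time spent in this step itself
    cost_usd: float = 0.0     # model or tool cost of this step itself
    children: List["Span"] = field(default_factory=list)

    def total_cost(self) -> float:
        return self.cost_usd + sum(c.total_cost() for c in self.children)

    def total_latency(self) -> float:
        # Assumes children run one after another; parallel branches would use max().
        return self.latency_ms + sum(c.total_latency() for c in self.children)

    def render(self, depth: int = 0) -> str:
        line = (f"{'  ' * depth}{self.name}  "
                f"{self.total_latency():.0f} ms  ${self.total_cost():.4f}")
        return "\n".join([line] + [c.render(depth + 1) for c in self.children])


trace = Span("answer_question", children=[
    Span("retrieve_docs", latency_ms=120),
    Span("agent_loop", children=[
        Span("llm_call", latency_ms=900, cost_usd=0.0040),
        Span("tool:crm_lookup", latency_ms=300),
        Span("llm_call", latency_ms=700, cost_usd=0.0030),
    ]),
])
print(trace.render())
```

Reading the rendered tree top-down answers the two questions that matter first in production: where the time went and where the money went.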

If you have spent years building distributed systems, this should feel familiar. The enterprise world already learned that you cannot run microservices without observability. You do not debug microservices by reading the code harder. You debug them by tracing execution paths, measuring latency, detecting anomalies, and establishing feedback loops from production behavior back into engineering decisions.

Danilo’s claim is that agents inherit the same truth, but in a harsher form, because the system’s core behavior includes an opaque model. In classic software, you can read the code. In model-centric systems, you cannot meaningfully “read the model.” What you can read is the trace: what happened when you gave this input, how the system reasoned, what it called, and what it returned.

That is why LangSmith exists in LangChain’s framing: it is the platform layer that covers build, observe, evaluate, and deploy, because the engineering cycle cannot be separated from observability and evaluation once the behavior becomes probabilistic.

The industry data reinforces the point. In LangChain’s State of Agent Engineering reporting, nearly 89 percent of respondents reported implementing observability for their agents, while evaluation adoption lagged at around 52 percent, and quality was cited as the top production barrier by 32 percent of respondents.

What those numbers imply is that teams who reach production tend to converge on the same realization: if you cannot see what your agent is doing, you cannot trust it, and if you cannot trust it, you cannot scale it.

Evaluation is the forgotten half of “trust”

If observability tells you what happened, evaluation tells you whether it was any good. Danilo was explicit that the ecosystem is still early in adopting evaluation, even though it is effectively the test and QA layer of the agent world. You can build dashboards. You can see costs. You can trace failures. But if you cannot systematically measure whether outputs meet your definition of correctness, conciseness, compliance, or usefulness, you are shipping a system that you can observe but cannot govern.

What makes evaluation interesting, and harder than it first appears, is that it forces enterprises to be explicit about what “good” means. In deterministic software, correctness is often binary. In agent systems, correctness can be contextual. A response can be factually correct and still harmful if it violates policy or confuses the user. A response can be concise and still omit a crucial caveat. A response can be compliant and still unhelpful.

Danilo listed evaluation approaches that reflect how the field is evolving: LLM-as-judge, human-in-the-loop review, and criteria such as conciseness or correctness. The key is not the mechanism. The key is the feedback loop: you define what you care about, you measure it, and you use the signal to improve the system iteratively.
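The define, measure, improve loop can be sketched as a set of scoring functions run over a small dataset. Everything here is illustrative: the `agent` function is a hypothetical stand-in with canned answers, the dataset is invented, and the two evaluators are simple heuristics. An LLM-as-judge or a human review score would plug in as another function with the same signature.

```python
from statistics import mean
from typing import Callable, Dict, List

Example = Dict[str, str]
Evaluator = Callable[[Example, str], float]


def correctness(example: Example, output: str) -> float:
    # Heuristic: the expected fact must appear in the answer.
    return 1.0 if example["expected"].lower() in output.lower() else 0.0


def conciseness(example: Example, output: str) -> float:
    # Heuristic: answers longer than 20 words count as too long.
    return 1.0 if len(output.split()) <= 20 else 0.0


def agent(question: str) -> str:
    """Hypothetical agent under test; replace with a real call."""
    canned = {
        "What is the refund window?": "Refunds are accepted within 30 days of purchase.",
        "Which plan includes SSO?": "SSO is available on the Enterprise plan.",
        "How do I reset my password?": (
            "To reset your password, first open the application, then navigate to "
            "the account area, then open Settings, then choose Security, and finally "
            "select the reset password option and follow the emailed instructions."
        ),
    }
    return canned.get(question, "I am not sure.")


def evaluate(dataset: List[Example], evaluators: Dict[str, Evaluator]) -> Dict[str, float]:
    """Run the agent on every example and average each evaluator's scores."""
    scores: Dict[str, List[float]] = {name: [] for name in evaluators}
    for example in dataset:
        output = agent(example["question"])
        for name, fn in evaluators.items():
            scores[name].append(fn(example, output))
    return {name: mean(values) for name, values in scores.items()}


dataset = [
    {"question": "What is the refund window?", "expected": "30 days"},
    {"question": "Which plan includes SSO?", "expected": "Enterprise"},
    {"question": "How do I reset my password?", "expected": "Settings"},
    {"question": "Do you ship to Norway?", "expected": "yes"},
]
report = evaluate(dataset, {"correctness": correctness, "conciseness": conciseness})
print(report)
```

Rerunning the same dataset after every prompt or model change turns "it feels better" into a number that can be tracked over time.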

This is also where the phrase “agent engineering” earns its keep. It signals that we are not merely “prompting” systems into existence. We are instrumenting them, measuring them, and improving them over time based on evidence.

Shipping is how you learn, but only if you can see what you shipped

One of the most practically useful, and culturally challenging, ideas Danilo shared is that you cannot perfect an agent in a staging environment. He referenced a principle internalized by companies such as Klarna: shipping is how you learn, not what you do after learning.

This is not an argument for recklessness. It is an argument for disciplined iteration under instrumentation. The logic is simple: your staging environment cannot simulate the full diversity of real user inputs, real edge cases, and real operational constraints. The agent that looks good in staging will meet an unexpected prompt in production on day one. What matters is whether you can see the failure, understand it, and improve the system without introducing new risk.

That is also why Danilo framed “observability from day one” as a defining trait of teams who succeed. In the survey, observability adoption among teams with agents in production rose to 94 percent.

If you have lived through DevOps transformations, you will recognize the pattern. The point is not to eliminate failure. The point is to shorten the distance between failure and learning, and to put guardrails in place so that failures do not become catastrophes.

Quality is the top barrier, and the Klarna story shows why

One of the most counterintuitive points Danilo raised is that cost is not the number one barrier to production. Quality is. In LangChain’s reporting, quality was the top barrier cited by respondents, with security rising to second.

That ordering matters because it aligns with what many enterprises discover too late: the biggest risk is not that your agent will be expensive. The biggest risk is that your agent will be wrong in a way that is persuasive.

This is where it becomes useful to hold two truths at once, and to resist the temptation to tell a single triumphant story about automation. Klarna is often cited as an iconic example because it demonstrated early that agentic systems can drive real operational impact. LangChain’s own case study describes Klarna using LangGraph and LangSmith and achieving dramatically faster customer resolution times. The company’s press release around its AI assistant emphasized that it handled a large share of customer service chats in its early rollout, pointing to scale and productivity gains.

At the same time, later reporting suggests that Klarna also experienced the downside of pushing automation too far without sufficiently preserving service quality, leading to a reassessment and a renewed emphasis on human support in certain contexts.

This is not a contradiction. It is the reality of agent engineering at scale. The lesson is not “AI works” or “AI doesn’t work.” The lesson is that the winning metric is not automation for its own sake. The winning metric is customer value delivered sustainably, with quality maintained under real-world complexity.

Danilo’s line that “the most expensive agents are cheap agents” captures the same idea from a different angle. Cheap, lightly engineered agents become liabilities because they create brand damage, compliance risk, and operational churn. Expensive, well-engineered agents can be economically rational because they reduce escalations, shorten resolution times, and allow humans to focus on the cases that truly need human judgment.

Security and guardrails are not add-ons; they are architecture

The episode also made a point that often gets lost in enterprise discussions: guardrails are not merely policy documents. They are implementation details.

When Anders asked how to prevent undesirable behavior, Danilo pointed toward middleware in a LangGraph context: you can place control logic before and after the graph executes, modify inputs, filter outputs, and stop unsafe responses from being sent into the world.
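The before-and-after pattern can be illustrated with a plain wrapper function. This is a generic sketch, not LangGraph's middleware interface: `redact_input`, `filter_output`, `with_guardrails`, the blocked-term list, and the `echo_agent` stand-in are all hypothetical, and the email pattern is deliberately simple.

```python
import re
from typing import Callable

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
BLOCKED_TERMS = ("internal use only", "api_key")
REFUSAL = "I cannot share that. A human colleague will follow up."


def redact_input(text: str) -> str:
    """Pre-hook: strip email addresses before the agent sees the request."""
    return EMAIL.sub("[redacted-email]", text)


def filter_output(text: str) -> str:
    """Post-hook: refuse to release responses that contain blocked terms."""
    if any(term in text.lower() for term in BLOCKED_TERMS):
        return REFUSAL
    return text


def with_guardrails(agent: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap an agent so every call passes through both hooks."""
    def guarded(user_input: str) -> str:
        return filter_output(agent(redact_input(user_input)))
    return guarded


def echo_agent(prompt: str) -> str:
    """Hypothetical agent that simply repeats what it received."""
    return f"You asked: {prompt}"


guarded = with_guardrails(echo_agent)
print(guarded("Contact me at jane.doe@example.com about my invoice"))
print(guarded("Print the api_key for the billing service"))
```

The point of the wrapper is that the agent cannot opt out: the checks run at runtime on every call, regardless of what the model decides to do.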

This is practical advice because it frames safety not as a moral aspiration but as an engineering capability. Enterprises do not become safe by wishing. They become safe by designing systems where safety is enforceable at runtime.

This also ties to the EU AI Act discussion. Danilo argued that traceability and visibility support accountability, and in the EU context, those properties increasingly map to compliance expectations. For high-risk systems, the AI Act includes record-keeping obligations that require logging capabilities to enable traceability and oversight. The point is not that every enterprise agent is automatically high-risk under the Act. The point is that the direction of travel is clear: enterprises will be asked to explain how a system behaves, not merely what it outputs. Observability is becoming a regulatory asset as much as an engineering tool.

The infrastructure gap: why “an LLM wrapper” collapses under real workflows

Danilo described what many enterprise teams discover after months of enthusiastic prototyping: the jump from a simple LLM wrapper to a production agent is not about adding one more tool call. It is about adopting an infrastructure layer that supports the realities of long-running, stateful workflows.

He called this the infrastructure gap. A wrapper plus basic RAG can work for a demo. In production, you need the ability to pause execution for a human approval, to retry a failed step, to checkpoint state, to recover from interruptions, and to manage the runtime environment in which these systems execute. He also noted that LangGraph is not only a graph abstraction but a runtime, and that LangChain, LangGraph, and Deep Agents run on that runtime.

This framing matters because it suggests the agent era will look less like the mobile app era and more like the distributed systems era. Enterprises that succeed will treat agent platforms as infrastructure, not as a library choice. They will budget for operational excellence, not only for experimentation.

Deep Agents and the memory frontier: where enterprise IP might migrate

The most novel part of the episode, in my view, was the discussion of Deep Agents, described as an open-source “harness” inspired by the behavior patterns behind deep research tools and coding agents. The core idea is not that there is a secret new model. The idea is that you can extract dramatically different performance from the same model depending on the harness you build around it: planning, task decomposition, sub-agent delegation, context management, and memory.

Danilo spoke about memory in a way that has strategic implications beyond LangChain’s ecosystem. Deep Agents incorporate short- and long-term memory, including semantic, episodic, and procedural memory, along with an instructions file that can update based on feedback, sometimes even during the same session, creating a loop in which the system refines its own guidance over time.
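A rough way to picture the self-updating instructions loop is a memory object that turns feedback into durable rules. The sketch below is not the Deep Agents implementation. The `AgentMemory` class and its fields are hypothetical, it covers only episodic and procedural memory, and the emoji example borrows from Danilo's five-minute agent story.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentMemory:
    instructions: List[str] = field(default_factory=list)  # procedural: how to behave
    episodes: List[str] = field(default_factory=list)      # episodic: what happened

    def record_feedback(self, episode: str, lesson: str) -> None:
        """Store what happened and promote the lesson to a standing rule."""
        self.episodes.append(episode)
        if lesson not in self.instructions:  # avoid duplicate rules
            self.instructions.append(lesson)

    def system_prompt(self, base: str) -> str:
        """Build the prompt for the next run from the base text plus learned rules."""
        if not self.instructions:
            return base
        rules = "\n".join(f"- {rule}" for rule in self.instructions)
        return f"{base}\n\nLearned rules:\n{rules}"


memory = AgentMemory()
base_prompt = "You write a daily report of unanswered customer messages."

memory.record_feedback("Monday report was heavy on emojis; user disliked it.",
                       "Do not use emojis in reports.")
memory.record_feedback("Tuesday report still had an emoji in the title.",
                       "Do not use emojis in reports.")

print(memory.system_prompt(base_prompt))
```

Even in this toy form, the accumulated rules are specific to one user and one workflow, which is the sense in which memory starts to look like a repository of value.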

The practical consequence is obvious: the agent becomes more useful over time because it adapts to preferences and rules. The strategic consequence is more subtle: memory begins to look like a new repository of value.

In classic software engineering, the IP lives in the code. In agent systems, if the model is a commodity and the orchestration patterns become widely shared, the enterprise-specific advantage may increasingly live in the memory layer: how a company encodes preferences, procedures, and experience, how it structures feedback, and how it turns operational traces into durable improvements.

Danilo floated the idea that “where is the IP” might shift from code to memory. Even if you don’t buy that fully, it is a useful provocation because it forces leaders to ask what they are actually building. Are they building an application, or are they building an organizational learning loop embodied in software?

The organizational barrier is real, and it’s not just skills

When the conversation moved from technology to why value creation stalls, Danilo did not blame models. He emphasized people and process. Successful organizations, in his experience, create dedicated groups with multiple competencies, aligning engineers, product people, and data scientists so they can iterate quickly and intelligently, then disseminate value across the organization once a pattern is proven.

This is the part of agent engineering that many enterprises resist, because it implies organizational change. You cannot treat agents as just another feature for a single team to ship and forget. Danilo explicitly rejected the “ship and forget” mindset, describing continuous observation and improvement as the defining operating model.

He also pointed to a practical governance concern that executives will recognize immediately: leadership wants visibility into value and cost. If you can tag agent traces by team, feature, or agent, you can begin to ask operational questions that look like business questions: which workflows are delivering value, which teams are consuming disproportionate budget, whether a more expensive model is justified by improved outcomes, and whether a system is drifting toward becoming a liability.
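Tagging makes those questions answerable with a simple aggregation. The sketch below assumes trace records have already been exported as dictionaries. The field names (`team`, `agent`, `cost_usd`, `resolved`) and the numbers are hypothetical, not a LangSmith schema.

```python
from collections import defaultdict
from typing import Any, Dict, List

runs: List[Dict[str, Any]] = [
    {"team": "support", "agent": "refund-bot", "cost_usd": 0.125, "resolved": True},
    {"team": "support", "agent": "refund-bot", "cost_usd": 0.125, "resolved": True},
    {"team": "sales", "agent": "lead-scorer", "cost_usd": 0.5, "resolved": True},
    {"team": "sales", "agent": "lead-scorer", "cost_usd": 0.25, "resolved": False},
]


def summarize(records: List[Dict[str, Any]], tag: str) -> Dict[str, Dict[str, float]]:
    """Group runs by a tag and report volume, spend, and cost per resolved case."""
    totals: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"runs": 0, "cost_usd": 0.0, "resolved": 0}
    )
    for record in records:
        bucket = totals[record[tag]]
        bucket["runs"] += 1
        bucket["cost_usd"] += record["cost_usd"]
        bucket["resolved"] += int(record["resolved"])

    summary = {}
    for key, bucket in totals.items():
        resolved = bucket["resolved"]
        summary[key] = {
            **bucket,
            # Cost per resolved case links spend to outcome, not just to usage.
            "cost_per_resolution": bucket["cost_usd"] / resolved if resolved else float("inf"),
        }
    return summary


by_team = summarize(runs, "team")
for team, stats in by_team.items():
    print(team, stats)
```

The same function works for any tag, so switching the view from teams to agents or features is a one-argument change.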

The agent era forces a convergence of engineering, product, and finance into the same conversation, because the cost of an agent is not an internal detail. It is part of its behavior.

The pace of change, the paralysis risk, and the three categories of organizations

Danilo closed one of the most important segments with a description of why organizations get stuck even when they believe the technology is inevitable: the pace of change creates paralysis. He noted that regulated enterprises update their agent stack every three months or faster, while models evolve even more rapidly, creating an environment in which teams fear committing to an approach that will soon be outdated.

His response was pragmatic. From an enterprise perspective, the frontier is not waiting for the next model. The frontier is learning to extract reliable value from what already works, and doing so with the engineering discipline required to manage risk.

He described three categories that will feel familiar to anyone advising enterprises today: organizations that avoid AI because they fear losing control, organizations that use it well and get value, and organizations that use it poorly and incur liability.

The uncomfortable truth is that the difference between the second and third category is rarely the model. It is the engineering discipline around it.

What scaling really demands

If there is a single common thread through the episode, it is that scaling agents is less about “agent capability” and more about “agent controllability.” The core technical shift is moving from linear wrappers to stateful orchestration. The core operational shift is moving from local demos to production learning loops. The core governance shift is moving from hope-based trust to evidence-based evaluation. The core organizational shift is moving from isolated experimentation to cross-functional teams with shared accountability for outcomes.

That is what agent engineering demands: not a new buzzword, but a new operating system for how enterprises build and run probabilistic systems.

The reason this discipline is arriving now is not because the industry finally found a better metaphor. It is because enterprises have reached the point where agents can create real value, and therefore can create real damage, and therefore must be treated as infrastructure rather than experiments.

If you want a practical mental model to carry forward, Danilo’s own contrast is a good one: it is very easy to build an agent; it is very hard to make it work in production.

The companies that internalize that early will stop arguing about whether agents are the future and will instead compete on who can engineer them responsibly, sustainably, and at scale.

In Summary: What This Episode Taught Us About Scaling AI Agents

If you step back from the technical details, Episode 176 leaves you with a set of hard-earned lessons about what it actually takes to scale AI agents beyond experimentation.

The first lesson is that building an agent is no longer the difficult part. The tooling has matured to the point where a motivated individual can assemble something useful in minutes. The difficulty begins the moment that agent touches production reality. At that point, the question shifts from “Can it generate?” to “Can we rely on it?” That shift changes everything, because reliability in a probabilistic system demands engineering discipline, not prompt creativity.

The second lesson is that orchestration matters more than model size. Throughout the conversation, Danilo repeatedly emphasized that moving from simple chains to graph-based, stateful systems is not a cosmetic upgrade. It is what allows agents to behave like enterprise systems rather than chat interfaces. Real workflows require memory, checkpoints, branching logic, retries, and the ability to pause for human input without losing context. Without that architectural backbone, agents remain brittle.

The third lesson is that observability is not optional. In deterministic software, the code defines behavior. In agent systems, behavior emerges at runtime. That means the traces become the most faithful representation of what the system actually is. If you cannot see how an answer was produced (which models were called, which tools were invoked, how long each step took), then you probably cannot debug it, improve it, or justify it to leadership. Observability becomes the foundation for trust.

The fourth lesson is that evaluation lags behind ambition. Many organizations deploy agents before they have clearly defined what “good” looks like. Yet in a probabilistic environment, quality cannot be assumed. It must be measured. Enterprises that scale successfully define evaluation criteria that reflect business reality, such as correctness, compliance, conciseness, and cost, and use those signals to create feedback loops. Without evaluation, scaling becomes guesswork.

The fifth lesson is that shipping is part of learning. You cannot simulate every real-world edge case in a staging environment. Production reveals what theory cannot. But shipping only works as a learning mechanism if it is paired with instrumentation and guardrails. Scaling agents is therefore an iterative discipline, not a one-time rollout.

The sixth lesson is that cost and quality are inseparable. The conversation made it clear that cheap agents can become the most expensive agents if they damage trust or create rework. Scaling requires conscious architectural decisions about context size, model selection, and task decomposition. Economic sustainability is not an afterthought; it is a design parameter.

The seventh lesson is organizational. Technology alone does not scale. Companies that succeed typically establish cross-functional ownership, with engineers, product leaders, and domain experts working together. Leadership visibility into value creation and cost becomes essential. Agent engineering is not a side project; it becomes part of operational strategy.

Taken together, these lessons form a clear pattern. Scaling agents is less about chasing the next model release and more about mastering the engineering and organizational systems that surround the models we already have. The companies that internalize this will move beyond prototypes. The companies that do not will accumulate impressive demos and fragile deployments.

That, more than any single tool or framework, is what this episode ultimately revealed.

*This article was enhanced with the help of AI tools, drawing on the podcast transcript and complementary online research. To go deeper into the source material, I encourage you to listen to the full episode and draw your own conclusions.

You can watch the full episode here.