The deepest IT management layers are also the most dangerous. The infrastructure players focused on them are just as eager to reap the benefits from agentic AI as any other vendor. More advanced, self-policing automation also brings yet another abstraction layer, additional room for error, and a potential shake-up of the job description for IT engineers industry-wide. What impact are AI agents having already? How ambitious or cautious are some of the leading vendors? And, fundamentally, what does the future of an ‘agentic IT infrastructure’ look like? We discuss it with experts from Cisco, Google, Nutanix, SUSE and more.
To be clear, we’re not referring to an IT infrastructure for running AI. Instead, the focus here is on AI inside IT infrastructures, affecting the core systems themselves rather than merely being enabled by them. This deployment, still largely in its infancy, is really an extension of the high degree of automation already happening successfully inside these systems. Routine, monotonous operational workloads are generally handled underneath proprietary cloud architectures, Terraform IaC scripts or other pre-built tooling. Resource provisioning, automatic scaling, log analysis, anomaly detection – much of it has been realized without the aid of generative AI or agentic workflows.
These developments have allowed infrastructure teams to move on to higher-value tasks. Promises around AI have relied on exactly that premise, but there’s more to the agentic angle than that. The sense we get from all experts is that automation is advancing to another stage, one where the nature of IT infrastructure itself shifts. Let’s delve into that.
An autonomous infrastructure
The challenge for infra vendors is to go beyond simply serving AI models and workloads. “The question is: how do we build an enterprise that thinks and scales autonomously?” says Rick Spencer, General Manager of Engineering at SUSE. “So this is what we call agent-assisted infrastructure management. Infrastructure that self-heals, that is constantly vigilant about production and security issues, and that can proactively address these issues.”
This approach mirrors the machine learning-assisted revolution that security and observability tools have already undergone. The difference is that we’re talking about a more intelligent form of automation. Although ML can help automate tasks, the decision point hasn’t shifted. Humans have designed systems around the technology to enable it to take action. Determinism is key: ML will deliver you the same results every time, provided the data ingested and the model used remain static. With agentic AI, the underlying technology is inherently probabilistic. We’ll get to that point very soon.
Before getting there, it is key to stress that the overall goal Spencer mentioned becomes pretty technical at the implementation stage. Thomas Cornely, EVP of Product Management at Nutanix, emphasizes the inherent complexity. He describes how Nutanix started its AI infusion with chatbots and self-service portals, with the company now moving towards more ambitious targets. Now, the goal is an intelligent infrastructure. “At some point, I’m going to have my agent take care of getting resources for its own set of sub-agents, all with their own permissions.”
Anurag Dhingra, SVP and GM for Enterprise Connectivity and Collaboration at Cisco, has explained to us before how what his company calls AgenticOps applies to infrastructure management. Generative UX allows humans to understand processes that may not previously have been captured by traditional dashboards. AI Canvas, a technology Cisco introduced last June, “generates dashboards on the fly, and then those dashboards are completely created in service of what you’re asking in the moment.” This can also apply to third parties, such as ServiceNow, to coordinate specialized agents to understand the IT infrastructure holistically.
AgenticOps is an example of how traditional infrastructures may evolve simply by changing the ways infrastructure engineers experience them. However, an agent generating a dashboard based on the infrastructure is not the same as an agent acting upon that infrastructure. What’s to be gained from that, and what are the risks?
Entropy in the system
Agents with a complex job description are recipes for chaos. Manosiz Bhattacharyya, Nutanix’s CTO, defines the limits of AI ambitions. “If AI makes a mistake and something breaks, then it is your fault. And I think that responsibility for humans is not going anywhere. AI is by definition a probabilistic system. Even with two identical calls to the same LLM, you will not get the same response.”
“At the end of the day, you cannot hold an agent accountable for work”, Craig McLuckie, founder and CEO of StackLok, tells us at SUSECON. “You can’t hold an agent accountable for an outcome. The same people accountable today are going to be accountable. But you can only be held accountable to something that you understand.”
Technically, then, probabilistic systems must boil down to deterministic states. So far, this has meant agents operate just as humans do inside infrastructure: restricted to writing and committing declarative Infrastructure-as-Code. The difference is that humans remain in the loop.
Another option is to utilize something like the Open Policy Agent (OPA). It’s a graduated CNCF project and precedes the post-ChatGPT AI boom. In essence, it’s a template that’s already worked for IT infrastructures in the past but can be particularly useful today. OPA enforces rule-based policies that are just as useful for modern AI agents as previous automated systems.
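To make this concrete, here is a minimal sketch of the rule-based pattern OPA embodies. Real OPA policies are written in Rego and evaluated by the OPA engine; this Python stand-in is a hypothetical illustration of the idea, checking an agent’s proposed action against declarative rules before anything executes. All names and limits here are assumptions, not part of any vendor’s product.

```python
# Hypothetical policy gate in the spirit of OPA: a proposed agent action
# is only allowed if it passes every declarative rule. Real deployments
# would express these rules in Rego and query the OPA engine instead.

POLICY = {
    "allowed_actions": {"scale", "restart", "read_logs"},
    "max_replicas": 10,
    "protected_namespaces": {"prod-payments"},
}

def evaluate(action: dict, policy: dict = POLICY) -> tuple[bool, str]:
    """Return (allowed, reason) for a proposed agent action."""
    if action["verb"] not in policy["allowed_actions"]:
        return False, f"verb '{action['verb']}' not allowed"
    if action.get("namespace") in policy["protected_namespaces"]:
        return False, "namespace is protected; requires human approval"
    if action["verb"] == "scale" and action.get("replicas", 0) > policy["max_replicas"]:
        return False, "replica count exceeds policy ceiling"
    return True, "ok"

allowed, reason = evaluate({"verb": "scale", "namespace": "staging", "replicas": 4})
```

The value of the pattern is that the rules stay deterministic even when the entity proposing actions is not.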
The point is: we already know how to enable context-aware policy enforcement. However, to fully utilize agents, one might seek to rejigger these policies. Only then can they suit an agentic use case and enable greater automation. Step by step, this may allow more room for probabilistic systems to reach desired, deterministic outcomes.
The “happy path”
Agents may well remain useful, benign assistants in what Bhattacharyya calls the “happy path”. But McLuckie says agents are “fantastically useful, fantastically powerful technologies.” They must, he states, be able to perform their work in a way that doesn’t imperil infrastructure. “If [agents] suddenly turn into a gremlin at midnight, you better hope you built a gremlin-proof cage for them.” Technologies such as the Model Context Protocol (MCP) and various security measures around them can be of great help there. The cage needn’t be reinvented wholesale.
There is an inherent tension here, though. On the one hand, AI agents can simply automate more systems. On the other, they are fundamentally different entities. They can, and if unsupervised will, perform tasks that you’d normally ask a person to do. Right now, they’re unpredictable enough to require supervision, and too expensive to deploy without a second thought. Long-context workloads especially will absorb tokens at a staggering pace. OpenAI even escalates costs after several tens of thousands of tokens’ worth of context. Make a loop continuous, and you’re quickly exceeding the costs an employee doing the same tasks would represent.
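A back-of-envelope calculation shows why continuous loops add up. The per-token prices below are purely illustrative assumptions, not any vendor’s actual rates, and the workload shape is hypothetical.

```python
# Illustrative cost of a continuous agent loop. Prices are assumed
# placeholders, not real vendor rates.

PRICE_PER_M_INPUT = 2.50    # USD per million input tokens (assumed)
PRICE_PER_M_OUTPUT = 10.00  # USD per million output tokens (assumed)

def monthly_cost(context_tokens: int, output_tokens: int,
                 runs_per_hour: int) -> float:
    """Cost of re-sending a large context on every loop iteration, per 30-day month."""
    per_run = (context_tokens * PRICE_PER_M_INPUT
               + output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000
    return per_run * runs_per_hour * 24 * 30

# An agent that re-reads 100k tokens of infrastructure state every
# five minutes runs into the thousands of dollars per month.
cost = monthly_cost(context_tokens=100_000, output_tokens=2_000, runs_per_hour=12)
```

The dominant term is the re-sent context, which is exactly why intermittent rather than continuous loops are the pragmatic choice today.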
Building protection layers around agentic behavior is one step, but the fundamentals surely change when giving them proper agency? If you want your infrastructure to truly behave intelligently, you’d want it to anticipate human needs for IT resources. You’d want it to verify enterprise intent beyond the basics as well. But what skills do you need to harness that new level of automation?
The job factor
We return to Cisco’s Anurag Dhingra for some good news for network administrators, who are key to any large-scale IT infrastructure and can become even more useful with the right tools. “If you are a seasoned network administrator or network ops person now, you have very capable, almost digital teammates available to you and so you can delegate a bunch of routine tasks to them,” he explains. This allows professionals to focus on designing networks and operating them at scale rather than handling small, repetitive tasks.
Just like their network administrator counterparts, infrastructure engineers will wonder if their agentic ‘colleagues’ are setting them up for a job change. Or, at the very least, we’re curious to know if those jobs need to fundamentally evolve. Dan Ciruli, VP and GM Cloud Native at Nutanix, makes the point that the fundamentals aren’t changing. Despite all the novelty of agentic operations and intelligent infrastructures, and indeed changes to the nomenclature, “It’s still about taking a workload and putting a physical piece of infrastructure somewhere. Don’t get too excited about the new tools. Learn to become comfortable with the new tools, but the job you have, which is keeping your company running, isn’t changing.”
This is inherently true if the job description is broad. Nevertheless, the specifics can and will change, as admittedly they always do. Ciruli goes on to reflect on his comments, stating that the lack of fundamental change is a very calming message at a turbulent time. Indeed it is, even if we’re unsure how long-term the prospect is. Given agentic approaches to infrastructure management are at best fledgling and realistically far from the finished product, we won’t know how jobs change in the end. What is clear, however, is that the responsibility is staying with the same people. Humans will ultimately need to sign off on agentic actions with whatever degree of leeway such AI agents are given.
A new foundation
While most IT decision-makers will have caught on to the jump towards agentic AI, its technologies in practical terms rarely reach beyond MCP-based integrations. Often dubbed the “USB-C for AI”, it has served as the unofficial interface for agent-application communication. Andi Gutmans, VP and GM of Data Cloud at Google, recently highlighted in conversation with Techzine how MCP is “just an API”. In other words: the connective tissue to begin the integration of AI-enabled systems. Few will remember the goose framework and AGENTS.md that were simultaneously donated to the Agentic AI Foundation at its inception in December, just as Anthropic donated MCP. However, these contributions by Block and OpenAI respectively are just examples of the various building blocks required to get us to an agentic AI infrastructure.
As it happens, MCP on its own isn’t inherently secure. Just as with Kubernetes, a treasure trove of solutions has emerged to make agents observable and predictable. ServiceNow recently even introduced a kill switch, easily clickable in the event of any AI-induced mishap.
Context is everything
Whenever agentic AI comes to the fore, its context determines its implications. Agents running the show to automate customer service tickets or to escalate security alerts are wholly different propositions. At a more fundamental level, agents running on infrastructure and agents running said infrastructure present two completely different trust levels. Sandboxing an agent inside a traditional framework is but one approach to mitigate potential issues. Actually allowing agents to sandbox other agents, allocate resources, limit privileges, et cetera, that is a ball game we’ve yet to play.
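As a hypothetical sketch of what that second ball game implies: a supervisor hands each sub-agent only the tools and resource quota it is entitled to, so a misbehaving agent cannot reach beyond its cage. The class, tool names, and quota units below are invented for illustration; production systems would enforce this at the MCP server, container, or policy-engine level.

```python
# Hypothetical sandbox: a sub-agent may only invoke tools it was granted,
# within a resource quota set by its supervisor.
from dataclasses import dataclass

@dataclass
class Sandbox:
    allowed_tools: set         # tool names this agent may call
    cpu_quota: int             # arbitrary resource units
    used_cpu: int = 0

    def invoke(self, tool: str, cost: int) -> str:
        if tool not in self.allowed_tools:
            raise PermissionError(f"tool '{tool}' not granted to this agent")
        if self.used_cpu + cost > self.cpu_quota:
            raise RuntimeError("quota exhausted; escalate to supervisor")
        self.used_cpu += cost
        return f"ran {tool}"

# A supervisor allocating a narrow sandbox to a log-analysis sub-agent.
sub_agent = Sandbox(allowed_tools={"read_logs", "summarize"}, cpu_quota=100)
```

The design choice worth noting: denial is the default, and escalation goes back to the supervisor (ultimately a human) rather than being negotiated between agents.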
It’s evident from our conversations with the vendors operating at this deeper technology layer that there is plenty of appetite for ambition. Just as their SaaS counterparts are exploring a vast implementation of agents across familiar workflows, infrastructure players can readily tell you where AI agents may well become vital to managing IT. They could, in theory, do so much better than overworked humans ever could. But getting to the point where the humans can trust workflows to be automated by an intelligent infrastructure will take time. Presumably a lot of it, given the limitation isn’t just technical but also financial and psychological.
A common goal
Vendors are hardly in different camps on the matter. In the main, their collective views center on the notion that AI agents merit a low level of trust. For that reason, they need supervision. But as with previously established ML-enabled workflows or even just plainly well-designed abstraction layers, trust builds as a technology matures. Once, the accuracy of an ML-based security alert may have been merely peripheral to hunting for threats inside an infrastructure. Such alerts have now become trustworthy enough to be core to IT security posture management. That’s just one example, but it shows that such a paradigm shift requires time. What’s different with this new development is that infrastructure companies can spot the foundational change from miles away. They cannot currently anticipate how that change will affect their solutions. Thus, for the time being, the job roles of the people using said solutions won’t change.
We suspect that a meaningful number of organizations will move fast and break things not just when it comes to AI deployments. Some will also test AI-guided, “intelligent” infrastructure in the wild. We’ve already seen sporadic headlines of AI agents deleting entire production environments, hallucinating permissions, and internal LLMs being accessed externally and changing infrastructure configurations. For every news item detailing a cautionary tale, there will be ten more that are either too benign or obscure to notice. Eventually, and this may have already happened unbeknownst to the wider world, an organization will find a high degree of automation that can satisfy the promise of an “intelligent infrastructure” without suffering from adverse AI actions.
Conclusion
AI could already provide some quick wins that an organization pursuing greater automation could exploit. An example of this is resolving the problem of state drift or resource drift. In Infrastructure-as-Code, the desired infrastructure will diverge from the actual one over time. An agent can spot this divergence, propose a solution and resolve the issue after a human gives the OK. The trick right now is to have such workflows run intermittently so as to avoid the enormous cost of continuous AI usage, but often enough to provide a real benefit. This is just one use case in a familiar setup (IaC), but AI can attempt similar optimizations all over the IT infrastructure.
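That drift workflow can be sketched in a few lines, under assumed shapes for the desired (IaC) and actual states. Function and field names are hypothetical; real tooling would query Terraform state or a cloud API rather than compare dictionaries.

```python
# Minimal sketch of drift detection with a human approval gate.
# State shapes and names are illustrative assumptions.

def detect_drift(desired: dict, actual: dict) -> dict:
    """Return keys whose actual value diverges from the desired one."""
    return {k: (desired[k], actual.get(k))
            for k in desired if actual.get(k) != desired[k]}

def remediate(drift: dict, approved: bool) -> list:
    """Propose fixes; only apply them once a human has signed off."""
    verb = "APPLIED" if approved else "PROPOSED"
    return [f"{verb}: set {key} -> {want}" for key, (want, _have) in drift.items()]

desired = {"replicas": 3, "instance_type": "m5.large"}
actual = {"replicas": 5, "instance_type": "m5.large"}
drift = detect_drift(desired, actual)
```

Running `detect_drift` on a schedule rather than in a continuous loop is what keeps the token bill in check, while the `approved` flag keeps the human in the loop that the experts above insist on.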
It is up to the likes of Cisco, Google, Nutanix, SUSE and their rivals to come up with ways to evolve their solutions towards an agentic infrastructure. This means utilizing their expertise and safeguards to gradually change what it means to manage IT resources, and it means banding together. With the Agentic AI Foundation, a starting point is already there to coordinate these efforts. Will agents remain gremlins forever, locked inside tiny digital cages? If not, what benchmark do we use to allow greater automation? That is an open question. For now, the intelligence layer is sprinkled atop familiar foundations, meaning IT infrastructure management isn’t changing as much or as quickly as one might have been led to believe.