AI only works if the infrastructure is right

AI is in the spotlight, but without a robust infrastructure, it remains a promise. How do you ensure that your data, computing power, and governance are in order to truly apply AI at scale? And how do you build a foundation that grows with the demands of increasingly intelligent algorithms today, tomorrow, and the day after tomorrow? We discuss this in a roundtable with experts from AWS, NetApp, Nutanix, Pure Storage, Red Hat, and SUSE.

The rapid rise of generative AI has prompted organizations to experiment with it and put it into practice. But if you look beyond the hype, you’ll see that sustainable AI is only possible if the underlying infrastructure—from data architecture to GPU capacity—grows with it. And that requires more than just technical upgrades.

What works today may no longer be sufficient tomorrow. Consider the explosive growth of model sizes, increasing latency and privacy requirements, as well as the demand for scalable governance. AI infrastructure is therefore a strategic issue: what do you need to put in place now to ensure you are not overtaken by your own ambitions in the future?

Practical recommendations

The successful implementation of artificial intelligence is therefore closely linked to the underlying infrastructure. But how you define that AI infrastructure is open to debate. An AI infrastructure always consists of different components, which is clearly reflected in the diverse backgrounds of the participating parties. As a customer, how can you best assess such an AI infrastructure?

From left to right: Eric Lajoie (SUSE) and Pascal de Wild (NetApp)

Pascal de Wild of NetApp gets straight to the point: “80 percent of all AI projects fail not because the technology is inadequate, but because you don’t determine where you want to go beforehand.” That statistic goes to the heart of the problem: most companies do not determine in advance where they want to go. They invest a lot of money to build something big without defining a clear use case, and may end up using only a small part of that expensive infrastructure. “Ultimately, the result will also determine what you want to build, what you can put in it, what software you will use, and what hardware,” De Wild explains.

For companies looking to get started with AI infrastructure, a phased approach is crucial. Start small with a pilot, clearly define what you want to achieve, and expand step by step. The infrastructure must grow with the ambitions, not the other way around. A practical approach starts from the objectives; the software, middleware, and hardware follow from there. For virtually every use case, the necessary and desired components are available to choose from.

Different needs per company

Felipe Chies of AWS also sees that what a company wants to do with its AI infrastructure is decisive. Not every organization has the same infrastructure needs. Generative AI has emerged as a desirable application, but there is a difference between using an available foundation model and building your own. “Then you look at the GPUs, networking, storage, and everything else that goes with it. So also at the end-to-end tools that can help you with that,” says Chies. For many companies, however, an API link to existing AI services is sufficient. The cloud then acts as an abstraction layer, removing the underlying complexity.
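To make that abstraction concrete: consuming a foundation model through a managed service often comes down to a single API call. Below is a minimal sketch, assuming the AWS SDK for Python (boto3), configured credentials, and access to a Bedrock-hosted model; the model ID and prompt are purely illustrative.

```python
# Minimal sketch: calling a managed foundation model via an API,
# so the cloud absorbs the GPU, networking, and storage complexity.
# Assumes boto3 is installed, AWS credentials are configured, and the
# account has access to the referenced Bedrock model (ID is illustrative).
import boto3

client = boto3.client("bedrock-runtime", region_name="eu-west-1")

response = client.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model ID
    messages=[{"role": "user", "content": [{"text": "Summarize our Q3 incident reports."}]}],
)

print(response["output"]["message"]["content"][0]["text"])
```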

Company size also plays a role in determining what your AI infrastructure should ideally look like. In many cases, a managed cloud service is sufficient, allowing AI to be brought into production relatively easily with the available tools and infrastructure. After all, building a large model yourself is only an option for a small number of companies. If that is the case, it is important to set up an AI infrastructure that meets your requirements using the various building blocks.

Complete stack as a foundation

Ruud Zwakenberg (Red Hat)

Setting up AI infrastructure often starts small, for example with a pilot project using a powerful GPU and a dataset to train a model. Anyone who looks beyond the initial experiments quickly sees that much more is needed. Ruud Zwakenberg of Red Hat emphasizes that it ultimately comes down to building intelligent applications that support concrete use cases; here too, the use case is decisive. If you want to develop and offer a serious, intelligent application, you will need substantial GPU capacity, along with tooling, a platform, orchestration, operating systems, and a method for collecting data.

Once AI is put into production, additional requirements come into play: security, monitoring, and management. The entire stack must be robust and reliable so that the applications continue to run as intended. Zwakenberg: “If you really want to build intelligent applications, you need quite a lot. And from a customer perspective, that all falls under infrastructure.” He thus demonstrates that AI infrastructure is a broad concept that encompasses all layers of the technical stack and that the success of AI initiatives depends on the proper integration of all these building blocks.

Shift to smaller models

The panel also discussed finding the right balance between capacity and costs when setting up an AI infrastructure. Especially when it comes to GPUs, which are often expensive and scarce, it is crucial not to oversize. Organizations need to carefully consider the computing capacity required for their specific AI application. This also means taking a critical look at the type of model being used: bigger is not always better, as recent practice has shown. Training or deploying a model that supports 150 languages is unnecessary if the business need is limited to a few languages. Choosing the right model size prevents unnecessary waste of resources.
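To illustrate what right-sizing means in practice, a common rule of thumb estimates a model's serving memory from its parameter count and numeric precision. The sketch below is a deliberately simplified back-of-the-envelope calculation (it ignores KV cache, activations, and batching), not a capacity-planning tool.

```python
# Back-of-the-envelope GPU memory estimate for serving a model:
# parameters * bytes per parameter, plus a rough overhead factor.
# Simplified on purpose: it ignores KV cache, activations, and batching.

def serving_memory_gb(params_billion: float, bytes_per_param: float, overhead: float = 1.2) -> float:
    return params_billion * 1e9 * bytes_per_param * overhead / 1e9

# A 70B-parameter model in fp16 versus a 7B model quantized to 8-bit:
print(f"70B @ fp16 : ~{serving_memory_gb(70, 2):.0f} GB")  # ~168 GB -> multi-GPU territory
print(f" 7B @ int8 : ~{serving_memory_gb(7, 1):.0f} GB")   # ~8 GB   -> fits a single GPU
```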

At the same time, the AI landscape requires a high degree of flexibility. Technological developments are rapid, models change, and business requirements can shift from quarter to quarter. It is therefore essential to establish an infrastructure that is not only scalable but also adaptable to new insights or shifting objectives. Consider the possibility of dynamically scaling computing capacity up or down, compressing models where necessary, and deploying tooling that adapts to the requirements of the use case. This ensures that the infrastructure remains future-proof and cost-efficient.
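What dynamically scaling capacity up or down looks like differs per platform, but the decision logic underneath is typically a simple control loop. The following is a generic, hypothetical sketch of threshold-based scaling; in a real deployment this would be delegated to the platform's own autoscaler.

```python
# Hypothetical threshold-based scaling loop for GPU workers.
# Generic on purpose: a real setup would rely on the platform's own
# autoscaler (e.g. a Kubernetes HPA) rather than hand-rolled logic.
from dataclasses import dataclass

@dataclass
class ScalingPolicy:
    min_workers: int = 1
    max_workers: int = 8
    scale_up_util: float = 0.80    # add capacity above 80% utilization
    scale_down_util: float = 0.30  # release capacity below 30%

def desired_workers(current: int, gpu_utilization: float, policy: ScalingPolicy) -> int:
    if gpu_utilization > policy.scale_up_util:
        return min(current + 1, policy.max_workers)
    if gpu_utilization < policy.scale_down_util:
        return max(current - 1, policy.min_workers)
    return current

print(desired_workers(current=2, gpu_utilization=0.91, policy=ScalingPolicy()))  # -> 3
```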

Hybrid capacity

Marco Bal (Pure Storage)

While many discussions about AI infrastructure focus on GPU capacity, Marco Bal of Pure Storage points to the importance of the entire data pipeline. Not every step in the AI process requires the same computing power. GPUs are ideal for parallel processing, such as model training, but CPUs are more effective for other, more complex tasks. In practice, it comes down to a combination of different resources, tailored to the type of operation and the stage in the workflow. A well-designed infrastructure enables this mix, regardless of whether the model is running in the cloud or on-premises.
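Bal's point about mixing resources translates directly into code: keep data preparation on the CPU and move the parallel math to a GPU when one is present. Below is a minimal sketch assuming PyTorch is installed; the preprocessing step and the model are placeholders, the device split is the point.

```python
# Sketch of a mixed CPU/GPU pipeline: data preparation stays on the CPU,
# the parallel tensor math moves to the GPU when one is available.
# Assumes PyTorch is installed; the "model" is a placeholder.
import torch

def preprocess(raw: list[float]) -> torch.Tensor:
    # CPU-bound step: cleaning, parsing, feature extraction.
    return torch.tensor(raw).unsqueeze(0)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(4, 2).to(device)  # placeholder for a real model

batch = preprocess([0.1, 0.2, 0.3, 0.4])  # prepared on CPU
output = model(batch.to(device))          # parallel compute on GPU if present
print(output.device, output.shape)
```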

Moreover, AI projects rarely go straight from idea to end result. As projects progress, use cases shift, desired outcomes change, and additional data sources are added. All these factors influence the scalability requirements of the infrastructure. This requires an architecture that is as flexible and scalable as what organizations are used to from the public cloud. “Most projects start with an idea, but along the way, everything changes – and that has an impact on what the infrastructure has to be able to handle,” says Bal.

Gap between platform and application

When designing AI infrastructure, organization is also key, notes Eric Lajoie of SUSE. He regularly sees tensions between the platform team and the application team at customers. Both are crucial for AI implementations, but often have different focuses. While the platform team addresses issues such as scalability, security, and infrastructure choice, the application team is primarily focused on solving a specific business problem. “They don’t really care about the platform. Whether it’s SaaS or on-premises, it’s all about the business case they have to solve,” observes Lajoie. These separate mindsets can frustrate AI projects if there is no shared vision.

Additionally, the choice between cloud and on-premises solutions is playing an increasingly important role in discussions about sovereignty and control. Lajoie sees organizations with strict compliance requirements consciously opting for air-gapped infrastructures, where no connection to the internet is allowed. In such cases, fully managed SaaS solutions are not an option, and everything must run locally, from AI models to data storage. At the same time, application teams sometimes just want to access an API and pay per use. The choice of infrastructure is then not a technical discussion, but a trade-off between control, risk, and speed.

The desire for simplicity

From left to right: Ricardo van Velzen (Nutanix) and Felipe Chies (AWS)

Ricardo van Velzen of Nutanix emphasizes that on-premises AI infrastructure must be technically comparable to cloud solutions, especially from the perspective of IT administrators. While data scientists see the infrastructure as an API they can work with, it is often a completely different world for IT admins. “IT admins are used to setting up virtual machines and simple tools, but cloud-native technologies often feel complex to them,” explains Van Velzen. That complexity can lead to resistance or errors in management.

According to Van Velzen, it is therefore crucial that on-premises platforms offer the same user-friendliness and scalability as public cloud environments. This means that the tooling around storage and management must be intuitive and make maintenance and control as easy as possible. This allows organizations to reap the benefits of on-premises infrastructure without compromising on flexibility or manageability.

An equivalent experience between on-premises and cloud environments also helps organizations switch smoothly between them, which is essential in a hybrid world. By familiarizing IT teams with modern, cloud-native platforms that are easy to manage, the adoption of AI infrastructure is accelerated, and operational risks are minimized.

AI infrastructure supports strategy

The successful deployment of AI requires much more than just the latest technologies. It requires a well-thought-out and flexible infrastructure that can grow with the changing demands of organizations and technologies. From data architecture to computing capacity and from governance to operational management, everything must fit together seamlessly. Only then can companies realize their AI ambitions.

The roundtable discussion also showed that organization and collaboration are just as important as technology. Bridging the gap between platform and application teams, aligning expectations, and choosing the right infrastructure model—on-premises, cloud, or hybrid—are essential for AI projects to succeed. Simplicity and manageability, regardless of the chosen environment, are key words in this regard.

This was the first story in our AI infrastructure series. In the following article, we will explore how the AI wave is compelling organizations to reassess their infrastructure.