Starburst: Chewing through data access is key to AI adoption

At a time when governments are offering free AI training in a direct attempt to drive AI adoption in the workplace (we’ve seen the reports, UK), discussion of how, when and where AI services are being brought to bear inside modern workflows is everywhere. Evan Smith, technical content manager and instructional designer at data query and analytics company Starburst, thinks the AI adoption race is certainly here, but that to succeed it needs to overcome one key bottleneck: data access.

As we know, in a few short years AI technology of all types has gone from a set of theoretically viable technologies to a true revolution in both technology and organisational change management. The connection between the technology itself and the organisational change it promises is worth lingering over in more detail. Smith describes it as a connection that is “best visualised as a circle”: AI technology and organisational practices work together to create a virtuous feedback loop in which new technological innovations drive efficiencies, those efficiencies drive organisational change, which drives more efficiencies… and on and on.

Given all of this, it is worth asking: does he think anything can disrupt the AI revolution?

“It may seem that nothing can, but there is one substantial problem that is worth discussing. This problem is central to AI adoption as we know it, impacting all forms of AI, regardless of model. Without solving it, AI technology cannot increase productivity. It cannot drive organisational change. It cannot create a virtuous cycle. The key ingredient in the AI revolution comes in the form of context, specifically contextual data. This might sound obvious, but the nuances of AI’s need for context have far-reaching implications,” said Smith.

Why LLMs struggle with context

As we know, Large Language Models (LLMs) are trained on generic data. Smith reminds us that these datasets are vast… and the training that takes place is comprehensive and impressive. It is as a direct result of this training that models are able to demonstrate the core capabilities of AI as we know it. These abilities are what make LLMs so versatile, but there is an inherent limitation in that versatility. Like the generic data they were trained on (indeed, because of it), LLM abilities are also generic.

“We can think of this as an AI version of garbage in, garbage out. In this case, generic data in, generic abilities out. This is great if we’re looking for something approaching general intelligence, but unfortunately, a lot of what we ask LLMs to do is not general at all. It’s highly specific,” he said. 

That gap between the generic and the specific is built into the technology. It’s a problem. Solving it requires something very particular to be added to the LLM: contextual augmentation, namely access to contextual data. This is why we might agree that context is king. Without context, LLMs create generic responses. When given access to context, they begin to produce contextually relevant insights and actions. In this sense, contextually relevant insights are really what people mean when they say that AI will provide value.

Beyond generic inadequacy

Smith provides the following example and asks us to consider a user prompting an LLM to generate some content.

Anyone who has used an LLM knows that asking the model to create content based on its general training will generate inadequate, generic results. If we ask it to “create a marketing message that talks about a company’s product”… it will do its best, but it lacks the context to do it well. If, instead, we include a series of documents that summarise a company’s position in detail, the LLM will have the contextual understanding to create much better results.
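To make the difference concrete, here is a minimal sketch of that kind of contextual augmentation in Python. The build_prompt helper and the call_llm client are hypothetical stand-ins, not Starburst’s or any particular vendor’s API, and the documents are placeholders for whatever material actually describes the company’s position.

```python
# Minimal sketch of contextual augmentation for the marketing example above.
# build_prompt() and call_llm() are hypothetical stand-ins for whichever
# prompt-assembly logic and LLM client an organisation actually uses.

def build_prompt(task, context_docs=None):
    """Combine a task with optional contextual documents into a single prompt."""
    if not context_docs:
        return task  # generic prompt: the model falls back on its generic training
    context = "\n\n".join(context_docs)
    return (
        "Use the following company documents as context:\n\n"
        f"{context}\n\n"
        f"Task: {task}"
    )

task = "Create a marketing message that talks about the company's product."

# Generic data in, generic abilities out.
generic_prompt = build_prompt(task)

# The same task, grounded in documents that summarise the company's position.
contextual_prompt = build_prompt(task, context_docs=[
    "Product brief: ...",          # placeholder content
    "Positioning statement: ...",
    "Customer case study: ...",
])

# response = call_llm(contextual_prompt)  # hypothetical LLM client call
print(contextual_prompt)
```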

“What’s really going on here is that the LLM’s general training is merging with the specific context provided by the user. Only these two things together can achieve satisfactory results. The same is true of all models and the flaw is as true of individual prompts as it is of complex agentic workflows. This means that context has now become the main bottleneck in achieving actual value from AI, and therefore for fully realising the value from the AI revolution itself,” said Smith.

So what’s the problem with unlocking context? Why can’t we just add as much context as LLMs need? In short: data access. The issue comes down to the fact that data stacks are far from uniform.

Heterogeneous storage headaches

The fact that the generic nature of LLMs can be augmented by contextual data is a valuable solution to the bottleneck problem. But it presents another problem in the form of data access. Contextual data might exist, but it is typically scattered across multiple systems, held in multiple formats and generally stored heterogeneously. All of this makes data access difficult. Data silos, a perennial problem for analytics, have now become a critical roadblock to AI adoption and value realisation.

Another problem comes from compliance requirements. Many industries, organisations, and jurisdictions regulate how data is accessed and moved. This is particularly true in industries like financial services, healthcare, insurance, or government, but it is true to a greater or lesser extent in all industries. 

“All of this means that you cannot easily solve your data access problem by simply granting access. Neither can you solve it by moving all your data to a single, central location. Apart from the inherent logistical difficulties involved, which are typically vast, data centralisation still does not solve your compliance problem, as there are often strict regulations regarding the movement of data,” said Smith, but he does offer a possible solution… and it comes in the form of data federation.

Data federation for the nation

Data federation describes an approach to data processing that flips the data centralisation narrative on its head. Instead of attempting to move all data to a single location, data federation extends access across different systems, rather than copying data between them. For the first time, universal data access becomes a realistic prospect, regardless of the heterogeneity of the underlying source systems. In doing so, it provides the very thing AI needs to succeed: when applied to contextual data, Smith suggests, data federation can provide the access needed to feed and augment the generic training data of models. The result is likely the best approach that organisations have when facing their AI goals and contending with data access bottlenecks.
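As a rough illustration of what that looks like in practice, the sketch below issues a single federated query through the Trino Python client, joining customer records in an operational database with order history in a lakehouse without moving either dataset. The host, catalog, schema and table names are hypothetical placeholders, not a prescribed setup.

```python
# A sketch of data federation via the Trino Python client (pip install trino).
# The coordinator address, catalogs, schemas and tables are hypothetical; in
# practice they map to whichever source systems have been registered with Trino.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",  # placeholder coordinator address
    port=8080,
    user="analyst",
    catalog="postgresql",  # default session context only; queries are fully qualified
    schema="crm",
)

# One query spans two catalogs: the data stays where it lives.
sql = """
SELECT c.customer_id, c.segment, SUM(o.total) AS lifetime_value
FROM postgresql.crm.customers AS c      -- operational database catalog
JOIN lakehouse.sales.orders AS o        -- Iceberg lakehouse catalog
  ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.segment
ORDER BY lifetime_value DESC
LIMIT 10
"""

cur = conn.cursor()
cur.execute(sql)
for row in cur.fetchall():
    print(row)  # contextual data assembled across systems, ready to feed a model
```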

“Moving data by default is really something of a brute force approach. It was needed during the heyday of the data warehouse, but technologies like Apache Iceberg and Trino make data lakehouses built around data federation more accessible than ever,” he said. “In the past, data federation was slower than data centralisation. But in recent years, advances in Massively Parallel Processing (MPP) mean that technologies built to take advantage of federation, like Trino, are finally able to make the data federation dream a reality.” 

Trino was built to query large datasets at speeds comparable to or better than alternatives. That same performance and accessibility are perfectly suited to solving the AI contextual data bottleneck.

The new data lakehouse data stack

“The approach becomes even more powerful when combined with other technologies that enhance the data platform used to federate. In this sense, data federation should not be taken in isolation, but viewed in concert with a host of other technologies that support this workflow. These technologies are centred around the data lakehouse, and Apache Iceberg in particular. Combined with a data federation engine, whether it be Trino, Starburst, or another technology, an Iceberg data lakehouse is able to meet the needs of AI data access bottlenecks,” explained Smith.
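To sketch how the lakehouse side of that stack might fit together, the example below reuses the hypothetical Trino connection and catalog names from the earlier federation query, this time curating contextual data into an Iceberg table (here "lakehouse" is assumed to be a catalog backed by Trino’s Iceberg connector). It is an illustration of the pattern, not a prescribed Starburst workflow.

```python
# Sketch of curating federated context into an Iceberg table via Trino.
# "lakehouse" is assumed to be a catalog using the Iceberg connector; all
# names below are hypothetical placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",  # placeholder coordinator address
    port=8080,
    user="analyst",
    catalog="lakehouse",  # default session context only; statements are fully qualified
    schema="sales",
)
cur = conn.cursor()

# Ensure a schema exists for curated AI context in the lakehouse catalog.
cur.execute("CREATE SCHEMA IF NOT EXISTS lakehouse.ai_context")
cur.fetchall()

# Materialise derived context into an Iceberg table; the federated sources stay
# where they are, and only the curated result lands in the lakehouse.
cur.execute("""
CREATE TABLE IF NOT EXISTS lakehouse.ai_context.customer_profiles AS
SELECT c.customer_id, c.segment, COUNT(*) AS order_count
FROM postgresql.crm.customers AS c
JOIN lakehouse.sales.orders AS o
  ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.segment
""")
cur.fetchall()  # consume the row-count result to complete the statement

# Downstream AI workloads can now query the curated Iceberg table repeatedly.
cur.execute("SELECT * FROM lakehouse.ai_context.customer_profiles LIMIT 5")
print(cur.fetchall())
```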

We can think of this whole story as something of an emerging data stack, which takes the need for contextual data as the driving impetus behind a shift towards data federation. All of this emerges at a time when unlocking the value of AI is more important than ever. LLM technology is ready to change the world, but it cannot do so without specificity and context. Whether organisations can ultimately unlock that context adequately will depend far more on data access than anything else.