Data relationships are complicated. Your airline loyalty login doesn’t show your last flight or the miles you were given for a broken in-seat screen, your bank doesn’t remember that it offered you the same investment last month (as it peddles it again this week) and your movie-streaming login suddenly forgets who you are – and that’s just at the consumer level.
Poor data management and data integration affect organisations every day through lost commercial opportunities, reduced operational efficiency, missed or inadequate compliance actions and the introduction of all manner of business risk that should never have entered the enterprise.
All of which raises the question: if it’s a backend problem, then what’s wrong with the rear-end portion of our data backbones?
Separation Of Data Systems
Part of the answer comes down to how data is ingested, absorbed and ultimately given a place to live as it is stored and managed inside an enterprise data stack. Traditionally, there has been a separation of church and state in this area of information management: structured data (customer records, financial data representing accurately identifiable and quantifiable business transactions, the more standardised sensor data from the IoT and so on) is stored in databases, while unstructured data (videos, emails, voice recordings and audio files, documents, PDFs and less standardised sensor data) has been stored in a file system.
What’s a file system in this instance, if a database is itself a collection of files? A database is generally agreed to be a formalised repository of data that uses a tabular, row-and-column structure to enable queryable relationships between data records. A file store (which might be a data lake, an object store or a hybrid data lakehouse) lacks the structure and security of a database and, essentially, lacks its concurrent queryability, because it’s a ship built for storage, not steerage.
Data science purists argue that this separation has made sense for decades, but some now say it no longer does. Guess why?
Spoiler Alert: It’s An Agentic AI Thing
In the modern age of agentic AI services, with AI agents trying to work across an enterprise, our pre-existing data hegemonies need to be reassessed. Michel Tricot, CEO and co-founder of data pipeline company Airbyte, says he has seen this problem play out first-hand. He asks us to consider a financial services firm that is training an AI model to analyse customer interactions and recommend investment strategies. The firm has customer profiles in Salesforce for customer relationship management, as well as recorded advisory calls on a file server.
But these systems don’t talk to each other.
The AI service queries the database for customer risk tolerance, then searches files for relevant conversations, but it can’t connect them. Which advisor spoke with which client? When? Under what compliance framework? This disconnect creates more than inefficiency. It creates risk.
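To make that failure mode concrete, here is a minimal Python sketch of the two-step lookup described above; the database schema, file paths and name-matching heuristic are illustrative assumptions rather than any vendor’s actual API.

```python
import sqlite3
from pathlib import Path

# A stand-in for the CRM database (illustrative schema only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, risk_tolerance TEXT)")

# Step 1: the agent pulls a structured fact from the CRM-style database.
profile = conn.execute(
    "SELECT risk_tolerance FROM customers WHERE name = ?", ("Jane Doe",)
).fetchone()

# Step 2: it separately trawls the file share for relevant call recordings,
# falling back on filename guesswork because no shared identifier exists.
candidate_calls = [
    p for p in Path("/mnt/file-share/advisory-calls").glob("*.mp3")
    if "jane" in p.name.lower()
]

# Nothing ties `profile` to `candidate_calls`: the agent cannot say which
# advisor spoke to which client, when, or under which compliance framework.
```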
The Status Quo Is A Problem
“As an experienced data professional, I know very well that these data silos were built by enterprise IT teams for good reasons at the time,” stated Tricot, during a data analytics summit this week. “Database teams have traditionally focused on transactional integrity and structured queries. Storage teams managed files and folders. Each group developed expertise, tools and governance for their domain. It worked because analysts could talk to relevant stakeholders, pull the right data, and make reasonable interpretations based on their experience.”
For an AI agent to do the same without requiring constant human intervention, this work cannot be restricted by data silos, whatever the shape (and query structure) of the database in question.
When an agent needs to answer a question such as “What investment strategy did we propose to this customer in Q1?”, it needs to traverse both the CRM and the call recordings seamlessly. Current architectures force these agents to work like archaeologists, piecing together fragments without a full picture or full context.
“I’ve watched companies make significant investments in AI only to discover time and time again that their data architecture blocks any meaningful insights. The AI works perfectly in the training phase with clean, connected datasets. In production, with real enterprise data sprawl, accuracy plummets and results are unsatisfactory – in fact, it’s the main reason we’re seeing so many early AI projects fail,” explained Tricot.
AI Needs An All-Terrain Vehicle
In a typical enterprise setup, Airbyte’s Tricot reminds us, structured data flows through extract, transform and load (ETL) pipelines into warehouses, while files get dumped into data lakes or object storage. The metadata that describes what that data actually relates to (remembering that metadata is essentially information “about” information) either sits somewhere inaccessible or is cast aside during processing.
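A hedged sketch of that pattern, with invented field names: a typical transform step keeps the payload the warehouse wants and quietly discards the context that linked the record to its source system.

```python
def transform_call_record(raw: dict) -> dict:
    """Simplified ETL transform of the kind described above (illustrative only)."""
    # `raw` might arrive looking like:
    # {"transcript": "...", "crm_ticket_id": "T-1042", "advisor_id": "A-7",
    #  "acl": ["advisor", "compliance"], "recorded_at": "2024-03-14T10:22:00Z"}
    return {
        "text": raw["transcript"],  # kept: the content the pipeline was built to deliver
        # dropped: crm_ticket_id, advisor_id, acl, recorded_at; exactly the
        # metadata an AI agent would later need to reconnect this text to the business
    }
```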
“When establishing agentic AI workflows, this separation becomes catastrophic,” warns Tricot. “A contract PDF sitting in SharePoint needs to maintain its connection to the vendor record in the enterprise resource planning (ERP) system. A support call recording needs its link to the ticket in the service desk. Humans can intuit these connections and make good decisions, but AI cannot. It needs to be fed this relational context to interpret data correctly.”
He says that permissions, authorisations and compliance requirements are also critical elements of metadata. Without this context, AI is highly risky, especially around personally identifiable information and sensitive data.
Building Unified Data Flows
“I’ve noticed that the data movement principles of old are shifting to a new, integrated approach. Instead of parallel processes for different data types, organisations need to move structured and unstructured data in the same pipelines,” said Tricot. “This starts at ingestion. Whether data arrives as database records or files, capture relationship metadata immediately. A customer email isn’t just text, it’s text connected to a customer ID, a timestamp, a sender and potentially a case number. These connections must travel with the data.”
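A minimal sketch of that ingestion rule in Python (3.10 or later); the field names are assumptions chosen to mirror Tricot’s email example rather than a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class IngestedDocument:
    body: str                        # the content itself (email text, transcript, etc.)
    customer_id: str                 # link back to the CRM record
    sender: str
    received_at: datetime
    case_number: str | None = None   # optional link to a support case
    permissions: list[str] = field(default_factory=list)  # who may read it downstream

email = IngestedDocument(
    body="Hi, following up on the Q1 investment proposal we discussed...",
    customer_id="CUST-4821",
    sender="jane.doe@example.com",
    received_at=datetime.now(timezone.utc),
    case_number="CASE-3317",
    permissions=["advisor", "compliance"],
)
```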
To make all this work, data teams are advised to convert unstructured data into flexible formats such as Markdown (nicely defined by Alexander Obregon as a lightweight markup language widely used in writing, programming and web development) and load it into lakehouse tables built on formats such as Apache Iceberg, a high-performance table format for huge analytic datasets that is resilient to schema change. In addition, set up a permission-sync data flow that connects security metadata to records and files.
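One possible shape for that flow is sketched below; it assumes PyIceberg (0.6 or later) with a catalog already configured, an existing table named crm.call_notes and a placeholder markdown_from_file() helper, all of which are illustrative rather than part of any product mentioned here.

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

def markdown_from_file(path: str) -> str:
    # Placeholder: in practice a conversion step (transcription, OCR,
    # HTML-to-Markdown and so on) would produce the Markdown body.
    with open(path, "r", errors="ignore") as f:
        return f.read()

catalog = load_catalog("lakehouse")           # reads the configured catalog settings
table = catalog.load_table("crm.call_notes")  # an existing Iceberg table

batch = pa.table({
    "document_id":   ["call-2024-001"],
    "customer_id":   ["CUST-4821"],                # relationship metadata travels with the text
    "recorded_at":   ["2024-03-14T10:22:00Z"],
    "allowed_roles": [["advisor", "compliance"]],  # permission sync: security metadata on the row
    "body_markdown": [markdown_from_file("calls/call-2024-001.txt")],
})

table.append(batch)  # Iceberg's schema evolution tolerates new columns being added later
```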
Most critically, advises Tricot, data teams need to maintain explicit relationship mapping: recording how product documentation relates to feature requests, how sales calls connect to opportunity records and so on. These relationships already exist implicitly in how your business operates; AI needs them to be explicit and accessible.
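A lightweight way to make those implicit links explicit is a simple relationship map; the entity identifiers and relation names below are hypothetical.

```python
# Each entry records one explicit relationship between two business artefacts.
relationship_map = [
    {"from": "doc:product-guide-v3", "relation": "documents",  "to": "feature:FR-218"},
    {"from": "call:2024-03-14-jane", "relation": "relates_to", "to": "opportunity:OPP-9912"},
    {"from": "contract:acme-msa",    "relation": "governs",    "to": "erp:VENDOR-0042"},
]

def related_to(entity_id: str) -> list[dict]:
    """Return every mapped relationship that touches the given entity."""
    return [edge for edge in relationship_map if entity_id in (edge["from"], edge["to"])]

# An agent (or a human) can now ask which artefacts bear on a feature request.
print(related_to("feature:FR-218"))
```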
The Competitive Reality
“I’ve seen that organisations that get this right gain significant advantages. Their AI agents make better decisions because they have the proper data context. They satisfy audit requirements and maintain security boundaries because metadata with permissions and history flows with the data. Companies that don’t make changes to their architecture to accommodate these needs will find that their AI investments flounder,” said Tricot.
It feels like the importance of metadata preservation will only intensify as AI capabilities expand. In the age of agentic AI, metadata preservation isn’t just about better data management. It’s the difference between AI that actually works and a very expensive liability that could even cost some people their jobs through compliance and privacy violations.