Keeping the huge amounts of data in a data lake searchable is a daunting task, especially without accompanying data tables. The American-Dutch company Elastic now offers an alternative with Search AI Lake. This search and analytics engine looks inside large amounts of unstructured data without needing metadata or tables. That makes it well-suited for AI training as well as security and observability workloads.
Search AI Lake can search in both traditional ways and via vectors. Elastic also promises enormous scalability by decoupling storage from compute. The ability to make large amounts of data more searchable makes the product particularly applicable for training LLMs. These models have an unquenchable hunger for data, but they must be fed the right kind of data at the right time.
The application does not require data tables, as is the case with Databricks or Snowflake’s data lake applications. However, it does use the Elastic Common Schema (ECS) format. Elastic has donated this format to the Cloud Native Computing Foundation (CNCF) in the hope it will be adopted more widely.
Search AI Lake further leverages the existing Elasticsearch Query Language (ES|QL). This makes it possible to perform federated search across data in Elastic clusters, i.e. across sources of all shapes and sizes, and serve the results up in a unified manner.
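The idea of federated search can be illustrated with a toy sketch. This is not the Elastic API: the in-memory "sources" and the naive term-frequency scoring are purely illustrative assumptions, showing only how hits from separate sources can be merged into one relevance-ranked list.

```python
# Toy illustration (not the Elastic API): search several sources and
# merge the hits into a single relevance-ranked result list.
from dataclasses import dataclass

@dataclass
class Hit:
    source: str
    doc: str
    score: float

def search_source(name, docs, term):
    # Naive relevance: how often the term occurs in each document.
    return [Hit(name, d, d.lower().split().count(term))
            for d in docs if term in d.lower()]

def federated_search(sources, term):
    hits = []
    for name, docs in sources.items():
        hits.extend(search_source(name, docs, term))
    # Unified view: one list, ordered by relevance regardless of source.
    return sorted(hits, key=lambda h: h.score, reverse=True)

sources = {
    "logs": ["error error in payment service", "startup complete"],
    "metrics": ["cpu error spike detected"],
}
for hit in federated_search(sources, "error"):
    print(hit.source, hit.score)
```

In a real deployment the per-source scoring and merging are handled by the engine; the point here is only that callers see one ranked result set.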
Particularly suitable for GenAI training
Speaking to VentureBeat, Elastic CEO Ash Kulkarni states that Search AI Lake can quickly search large amounts of data in real time. It also provides native support for searching dense vectors, he says, which means vectors where most elements are ‘non-zero’ and thus contain relevant data.
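The distinction Kulkarni draws can be made concrete. A dense vector (such as an embedding) has mostly non-zero components, whereas a sparse vector (such as a bag-of-words term count) is mostly zeros. The example values below are made up for illustration:

```python
# Dense vs. sparse vectors: a dense embedding carries information in
# nearly every component, a sparse vector in only a few.
dense = [0.12, -0.48, 0.33, 0.91, -0.07, 0.54]   # embedding-style values
sparse = [0.0, 0.0, 3.0, 0.0, 0.0, 1.0]          # term-count-style values

def density(v):
    """Fraction of components that are non-zero."""
    return sum(1 for x in v if x != 0) / len(v)

print(density(dense))   # every component is non-zero
print(density(sparse))  # only two of six are non-zero
```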
The search engine also supports hybrid search, faceted search (where users can add filters or attributes to search results), and information ordering based on relevance. According to Kulkarni, these options are particularly important for applications such as GenAI training and Retrieval Augmented Generation (RAG). Prioritizing and organizing the source information provides a more efficient learning process for AIs.
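One common way to combine a keyword ranking with a vector ranking into a single hybrid result is reciprocal rank fusion (RRF). The sketch below is a generic RRF implementation, not Elastic's internal code; the document IDs and the two input rankings are invented for the example:

```python
# Toy hybrid search: fuse a keyword ranking and a vector ranking with
# reciprocal rank fusion (RRF). Documents that rank well in both lists
# rise to the top of the fused ranking.
def rrf(rankings, k=60):
    # rankings: ordered lists of doc IDs, best first.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_ranking = ["doc_a", "doc_c", "doc_b"]   # e.g. lexical/BM25 order
vector_ranking  = ["doc_b", "doc_a", "doc_d"]   # e.g. kNN vector order
print(rrf([keyword_ranking, vector_ranking]))
```

Here `doc_a` wins because it ranks highly in both lists, which is exactly the behavior that makes hybrid retrieval attractive for RAG pipelines.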
According to Elastic, Search AI Lake should become the preferred data platform for generative AI models, which can benefit immensely from scalable search of vector databases. The application is available in preview, either standalone or within the new Elastic Cloud Serverless service, which provides a specialized interface for different use cases.
Real-time data processing
Founded in Amsterdam in 2012, Elastic gained particular recognition with Elasticsearch. This open-source search engine for distributed search and analysis can process large amounts of data in real time. It is built on Apache Lucene and provides a RESTful API for indexing and searching data. It’s used for tasks like enterprise data search, big data analytics, processing sensor data from IoT applications, and searching logs from security and DevOps operations.
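That RESTful interface means documents go in and queries go out as JSON over HTTP. A minimal sketch of the request bodies, assuming a hypothetical local node and index name (`app-logs`) for illustration:

```python
import json

# Sketch of Elasticsearch's RESTful interface: documents are indexed and
# queried as JSON over HTTP. The host and index name are assumptions.
host = "http://localhost:9200"            # assumed local node
index_url = f"{host}/app-logs/_doc/1"     # PUT here indexes a document
search_url = f"{host}/app-logs/_search"   # POST here runs a query

document = {"service": "checkout", "level": "error", "message": "timeout"}
query = {"query": {"match": {"message": "timeout"}}}

# An HTTP client (curl, requests, or an Elastic SDK) would send these
# JSON bodies to the URLs above.
print(json.dumps(document))
print(json.dumps(query))
```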
The company already anticipated the increasing search workload required by AI with the launch of the Elasticsearch Relevance Engine (ESRE) last year. This engine combines traditional search with vector search.