Databricks launches API to generate synthetic datasets

Databricks introduces an API that allows customers to generate synthetic data for their machine learning projects.

SiliconAngle writes that the API is available in Mosaic AI Agent Evaluation, a tool Databricks offers as part of its data-lakehouse platform that helps developers compare the output quality, cost, and latency of artificial intelligence applications. Mosaic AI Agent Evaluation launched in June alongside the Mosaic AI Agent Framework, which facilitates the implementation of retrieval-augmented generation.

Synthetic data is information generated by AI specifically for neural network development. Creating training datasets this way is significantly faster and more cost-effective than compiling them manually. Databricks’ new API is designed for generating question-and-answer collections, which are useful in developing applications that use large language models (LLMs).

A three-step process

Developers must first upload a DataFrame, a tabular collection of records, containing business information relevant to the task their AI application is to perform. The DataFrame must be in a format supported by Apache Spark or Pandas. Spark is the open-source data processing engine on which Databricks’ platform is based, while Pandas is a popular analytics library for the Python programming language.
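For illustration, this first step might look like the following minimal sketch in Python, assuming the documents are passed as a Pandas DataFrame; the column names content and doc_uri are assumptions for this example and are not confirmed by the article.

```python
import pandas as pd

# Illustrative input for the first step: a small DataFrame of business
# documents. The column names ("content", "doc_uri") are assumptions for
# this sketch; the schema expected by the actual API may differ.
docs = pd.DataFrame(
    {
        "doc_uri": [
            "kb/returns-policy.md",
            "kb/shipping-faq.md",
        ],
        "content": [
            "Customers may return items within 30 days of delivery for a full refund.",
            "Standard shipping takes 3 to 5 business days within the EU.",
        ],
    }
)
```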

After uploading sample data, developers must specify how many questions and answers the API should generate. They can optionally provide additional instructions to customize the API’s output. For example, a software team can specify the style in which the questions should be generated, the purpose for which they are used, and the end users who will work with the AI application.
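The sketch below illustrates this second step, assuming a generation function along the lines of generate_evals_df from the databricks-agents package; the import path, function name, and parameter names are assumptions based on the workflow described here, not a confirmed interface.

```python
# Hypothetical call for the second step: specify how many question-and-answer
# pairs to generate and optionally pass extra instructions. The function name
# and parameters below are assumptions for this sketch.
from databricks.agents.evals import generate_evals_df

guidelines = """
# Task Description
The agent answers customer-support questions about returns and shipping.

# User Personas
- First-time customers unfamiliar with the policy
- Support staff looking up edge cases

# Question Style
Short, conversational questions, one topic per question.
"""

evals = generate_evals_df(
    docs,                      # source documents from the first step
    num_evals=25,              # number of question-and-answer pairs to generate
    agent_description="A customer-support assistant for an e-commerce shop.",
    question_guidelines=guidelines,
)
```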

Simple workflow

Inconsistent or low-quality training data can reduce the quality of an AI model’s output, which is why companies often have synthetic datasets reviewed by subject-matter experts for errors before feeding them to a neural network. Databricks says it has designed its API to simplify this part of the workflow.

“Importantly, the generated synthetic answer is a set of facts that are required to answer the question rather than a response written by the LLM,” Databricks engineers report in a blog post. “This approach has the distinct benefit of making it faster for an SME to review and edit these facts vs. a full, generated response.”
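To make that review step concrete, the sketch below shows one way a subject-matter expert could skim the generated facts for each question; the column names request and expected_facts are assumptions carried over from the example above.

```python
# Illustrative review pass over the generated rows. The column names
# ("request", "expected_facts") are assumptions for this sketch.
for _, row in evals.iterrows():
    print("Q:", row["request"])
    for fact in row["expected_facts"]:
        print("  -", fact)
    print()
```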

Databricks plans to release several enhancements to the API early next year. A new graphical interface will allow reviewers to quickly check question-and-answer pairs for errors and add more pairs as needed. In addition, Databricks will add a tool to track how a company’s synthetic datasets change over time.