Databricks launches API to generate synthetic datasets

Databricks introduces an API that allows customers to generate synthetic data for their machine learning projects.

SiliconAngle writes that the API is available in Mosaic AI Agent Evaluation, a tool Databricks offers as part of its data-lakehouse platform that helps developers compare the output quality, cost, and latency of artificial intelligence applications. Mosaic AI Agent Evaluation launched in June alongside the Mosaic AI Agent Framework, which facilitates the implementation of retrieval-augmented generation.

Synthetic data is information generated by AI specifically for neural network development. Creating training datasets this way is significantly faster and more cost-effective than compiling them manually. Databricks’ new API is designed for generating question-and-answer collections, which are useful in developing applications that use large language models (LLMs).

A three-step process

Developers must first upload a DataFrame, a tabular collection of records, containing business information relevant to the task their AI application is to perform. The DataFrame must be in a format supported by Apache Spark or Pandas. Spark is the open-source data processing engine on which Databricks’ platform is based, while Pandas is a popular analytics library for the Python programming language.
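For illustration, this first step might look like the following minimal sketch in Python, assuming the documents are passed as a Pandas DataFrame; the column names content and doc_uri are assumptions for this example and are not confirmed by the article.

```python
import pandas as pd

# Illustrative input for the first step: a small DataFrame of business
# documents. The column names ("content", "doc_uri") are assumptions for
# this sketch; the schema expected by the actual API may differ.
docs = pd.DataFrame(
    {
        "doc_uri": [
            "kb/returns-policy.md",
            "kb/shipping-faq.md",
        ],
        "content": [
            "Customers may return items within 30 days of delivery for a full refund.",
            "Standard shipping takes 3 to 5 business days within the EU.",
        ],
    }
)
```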

After uploading sample data, developers must specify how many questions and answers the API should generate. They can optionally provide additional instructions to customize the API’s output. For example, a software team can specify the style in which the questions should be generated, the purpose for which they are used, and the end users who will work with the AI application.
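The sketch below illustrates this second step, assuming a generation function along the lines of generate_evals_df from the databricks-agents package; the import path, function name, and parameter names are assumptions based on the workflow described here, not a confirmed interface.

```python
# Hypothetical call for the second step: specify how many question-and-answer
# pairs to generate and optionally pass extra instructions. The function name
# and parameters below are assumptions for this sketch.
from databricks.agents.evals import generate_evals_df

guidelines = """
# Task Description
The agent answers customer-support questions about returns and shipping.

# User Personas
- First-time customers unfamiliar with the policy
- Support staff looking up edge cases

# Question Style
Short, conversational questions, one topic per question.
"""

evals = generate_evals_df(
    docs,                      # source documents from the first step
    num_evals=25,              # number of question-and-answer pairs to generate
    agent_description="A customer-support assistant for an e-commerce shop.",
    question_guidelines=guidelines,
)
```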

Simple workflow

Inconsistent or low-quality training data can reduce the quality of an AI model’s output, which is why companies often have synthetic datasets reviewed by subject-matter experts for errors before feeding them to a neural network. Databricks says it has designed its API to simplify this part of the workflow.

“Importantly, the generated synthetic answer is a set of facts that are required to answer the question rather than a response written by the LLM,” Databricks engineers report in a blog post. “This approach has the distinct benefit of making it faster for an SME to review and edit these facts vs. a full, generated response.”
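To make that review step concrete, the sketch below shows one way a subject-matter expert could skim the generated facts for each question; the column names request and expected_facts are assumptions carried over from the example above.

```python
# Illustrative review pass over the generated rows. The column names
# ("request", "expected_facts") are assumptions for this sketch.
for _, row in evals.iterrows():
    print("Q:", row["request"])
    for fact in row["expected_facts"]:
        print("  -", fact)
    print()
```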

Databricks plans to release several enhancements to the API early next year. A new graphical interface will allow reviewers to quickly check question-and-answer pairs for errors and add more pairs as needed. In addition, Databricks will add a tool to track how a company’s synthetic datasets change over time.