
OpenAI swaps Nvidia for Cerebras with GPT-5.3-Codex-Spark

OpenAI releases GPT-5.3-Codex-Spark, a smaller AI coding model that generates over 1,000 tokens per second on Cerebras hardware. It is OpenAI’s first GPT model that does not run on Nvidia hardware.

The model is optimized for ultra-fast inference on Cerebras’ Wafer Scale Engine 3, with OpenAI adding a latency-first serving tier to its existing infrastructure. That speed matters most for interactive work, where developers need immediate feedback.

In January, OpenAI announced a multi-year partnership with Cerebras under which OpenAI purchases large-scale computing capacity to support its AI services. That deal reportedly includes up to 750 megawatts of computing power over three years. Codex-Spark is the first concrete result of this collaboration.

Speed versus intelligence

OpenAI’s latest frontier models can work autonomously for hours, days, or weeks on long-running tasks. Codex-Spark complements them with a model built for real-time interaction: developers can interrupt it or steer it mid-task, and it responds immediately with targeted edits to code, logic, or interfaces.

By focusing on speed, Codex-Spark keeps its workflow lightweight, making minimal, targeted adjustments. At launch, the model has a 128k context window and is text-only. During the preview, separate rate limits apply and may fluctuate during periods of high demand.

GPT-5.3-Codex-Spark performs strongly on benchmarks such as SWE-Bench Pro and Terminal-Bench 2.0, completing tasks in a fraction of the time needed by GPT-5.3-Codex. On Terminal-Bench 2.0, Codex-Spark achieved 77.3 percent accuracy, an improvement over the 64 percent scored by GPT-5.2-Codex.

Latency improvements for all models

OpenAI implemented latency improvements across the entire request-response pipeline, benefiting all models. The company streamlined how responses stream between client and server, rewrote parts of the inference stack, and adjusted session initialization.

Through a WebSocket connection and targeted optimizations in the Responses API, the overhead per client-server roundtrip decreased by 80 percent. Per-token overhead also decreased by 30 percent, while time-to-first-token was cut in half. The WebSocket path is enabled by default for Codex-Spark and will soon become the default for all models.
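To illustrate why a persistent WebSocket cuts roundtrip overhead, here is a minimal sketch of a client that keeps one connection open and reads streamed token events. The endpoint URL, message fields, and event names are assumptions for illustration only; OpenAI has not published the wire format described here.

```python
import asyncio
import json

import websockets  # third-party client: pip install websockets


async def stream_edit(prompt: str) -> None:
    # Hypothetical endpoint and message schema, used only to illustrate the
    # pattern: one long-lived socket instead of a new HTTP request per turn.
    uri = "wss://example.invalid/v1/responses"
    async with websockets.connect(uri) as ws:
        # The TCP/TLS handshake happens once; every later request and its
        # streamed tokens reuse the same connection, which is where the
        # per-roundtrip savings come from.
        await ws.send(json.dumps({
            "model": "gpt-5.3-codex-spark",
            "input": prompt,
        }))
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "done":
                break
            # Token deltas arrive as incremental events on the open socket.
            print(event.get("delta", ""), end="", flush=True)


if __name__ == "__main__":
    asyncio.run(stream_edit("Rename the helper function and update its call sites."))
```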

Collaboration with Cerebras

Codex-Spark runs on Cerebras’ Wafer Scale Engine 3, a specialized AI accelerator for fast inference. OpenAI integrated this low-latency path into the same production serving stack as the rest of its infrastructure, so it works seamlessly within Codex and supports future models.

GPUs remain fundamental for training and inference at OpenAI and deliver the most cost-effective tokens for broad use. Cerebras complements that by excelling in workflows that require extremely low latency. According to OpenAI, GPUs and Cerebras can be combined for certain workloads to achieve optimal performance.

Codex-Spark is available immediately to ChatGPT Pro users in the latest versions of the Codex app, CLI, and VS Code extension. Because the model runs on specialized low-latency hardware, separate rate limits apply and may change based on demand. Codex-Spark is also available via the API for a small group of design partners. Access will be expanded in the coming weeks.
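For design partners with API access, calling the model would presumably look like any other Responses API request with the new model id swapped in. A minimal sketch using the official openai Python SDK, assuming the model id string from this article is how the preview model is exposed:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stream a quick, targeted edit. The model id is taken from the article and
# may differ for preview/design-partner access.
stream = client.responses.create(
    model="gpt-5.3-codex-spark",
    input="Add a null check before dereferencing `config` in load_settings().",
    stream=True,
)

for event in stream:
    # Streaming events include incremental output-text deltas; printing them
    # as they arrive gives the low-latency, interactive feel described above.
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
```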