
OpenAI swaps Nvidia for Cerebras with GPT-5.3-Codex-Spark

OpenAI releases GPT-5.3-Codex-Spark, a smaller AI coding model that generates over 1,000 tokens per second on Cerebras hardware. It is OpenAI’s first GPT model that does not run on Nvidia hardware.

The model is optimized for ultra-fast inference on Cerebras’ Wafer Scale Engine 3, with OpenAI adding a latency-first serving tier to its existing infrastructure. That speed matters most for interactive work, where developers need immediate feedback.

In January, OpenAI announced a multi-year partnership with Cerebras under which OpenAI purchases large-scale computing capacity to support its AI services. That deal reportedly includes up to 750 megawatts of computing power over three years. Codex-Spark is the first concrete result of this collaboration.

Speed versus intelligence

OpenAI’s latest frontier models can work autonomously for hours, days, or weeks on long-running tasks. Codex-Spark complements them with a model built for real-time interaction: developers can interrupt it or steer it mid-task, and it responds immediately with targeted edits to code, logic, or interfaces.

By focusing on speed, Codex-Spark keeps its workflow lightweight, making minimal, targeted adjustments. At launch, the model has a 128k context window and is text-only. During the preview, separate rate limits apply and may fluctuate during periods of high demand.

GPT-5.3-Codex-Spark performs strongly on benchmarks such as SWE-Bench Pro and Terminal-Bench 2.0, completing tasks in a fraction of the time needed by GPT-5.3-Codex. On Terminal-Bench 2.0, Codex-Spark achieved 77.3 percent accuracy, an improvement over the 64 percent scored by GPT-5.2-Codex.

Latency improvements for all models

OpenAI implemented latency improvements across the entire request-response pipeline, benefiting all models. The company streamlined how responses stream between client and server, rewrote parts of the inference stack, and adjusted session initialization.

Through a WebSocket connection and targeted optimizations in the Responses API, the overhead per client-server roundtrip decreased by 80 percent. Per-token overhead also decreased by 30 percent, while time-to-first-token was cut in half. The WebSocket path is enabled by default for Codex-Spark and will soon become the default for all models.
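To illustrate why a persistent WebSocket cuts roundtrip overhead, here is a minimal sketch of a client that keeps one connection open and reads streamed token events. The endpoint URL, message fields, and event names are assumptions for illustration only; OpenAI has not published the wire format described here.

```python
import asyncio
import json

import websockets  # third-party client: pip install websockets


async def stream_edit(prompt: str) -> None:
    # Hypothetical endpoint and message schema, used only to illustrate the
    # pattern: one long-lived socket instead of a new HTTP request per turn.
    uri = "wss://example.invalid/v1/responses"
    async with websockets.connect(uri) as ws:
        # The TCP/TLS handshake happens once; every later request and its
        # streamed tokens reuse the same connection, which is where the
        # per-roundtrip savings come from.
        await ws.send(json.dumps({
            "model": "gpt-5.3-codex-spark",
            "input": prompt,
        }))
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "done":
                break
            # Token deltas arrive as incremental events on the open socket.
            print(event.get("delta", ""), end="", flush=True)


if __name__ == "__main__":
    asyncio.run(stream_edit("Rename the helper function and update its call sites."))
```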

Collaboration with Cerebras

Codex-Spark runs on Cerebras’ Wafer Scale Engine 3, a specialized AI accelerator for fast inference. OpenAI integrated this low-latency path into the same production serving stack as the rest of its infrastructure, so it works seamlessly within Codex and supports future models.

GPUs remain fundamental for training and inference at OpenAI and deliver the most cost-effective tokens for broad use. Cerebras complements that by excelling in workflows that require extremely low latency. According to OpenAI, GPUs and Cerebras can be combined for certain workloads to achieve optimal performance.

Codex-Spark is available immediately to ChatGPT Pro users in the latest versions of the Codex app, CLI, and VS Code extension. Because the model runs on specialized low-latency hardware, separate rate limits apply and may change based on demand. Codex-Spark is also available via the API for a small group of design partners. Access will be expanded in the coming weeks.
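For design partners with API access, calling the model would presumably look like any other Responses API request with the new model id swapped in. A minimal sketch using the official openai Python SDK, assuming the model id string from this article is how the preview model is exposed:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stream a quick, targeted edit. The model id is taken from the article and
# may differ for preview/design-partner access.
stream = client.responses.create(
    model="gpt-5.3-codex-spark",
    input="Add a null check before dereferencing `config` in load_settings().",
    stream=True,
)

for event in stream:
    # Streaming events include incremental output-text deltas; printing them
    # as they arrive gives the low-latency, interactive feel described above.
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
```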