
OpenAI built a coding AI that runs 15x faster than its flagship - not on Nvidia, but on a chip the size of a dinner plate with 4 trillion transistors. Nvidia's monopoly just got its first serious crack.
GPT-5.3 Codex-Spark: 1,000 Tokens/Sec & OpenAI Just Dumped Nvidia
GPT-5.3 Codex-Spark — Fast Facts (February 12, 2026):
- Speed: 1,000+ tokens per second — 15x faster than GPT-5.3 Codex, faster than any developer can read code as it generates
- The chip: Cerebras Wafer Scale Engine 3 — a single processor the size of a dinner plate with 4 trillion transistors, designed from scratch to eliminate the GPU latency bottleneck
- Historic first: OpenAI's first production model running on non-Nvidia hardware — the first real crack in Nvidia's ironclad grip on AI inference
- Pipeline improvements: 80% less client-server overhead, 50% faster time-to-first-token, 30% less per-token processing time — and these improvements will roll out to ALL OpenAI models
- The catch: Pro-only ($200/month), text-only, 128K context, no API access yet, and Cerebras infrastructure is still scaling — geographic availability is limited
Every AI coding tool released in the last two years has the same fundamental problem: it makes you wait. You write a prompt, hit enter, and stare at a spinner while the model generates. By the time 200 lines of code appear, you've lost the thread of what you were building. You're not pair programming with AI — you're submitting a ticket and waiting for it to come back.
Agentic coding has fundamentally changed software development. For the first time, machines can autonomously work for hours or days without human supervision. But this mode of interaction can also leave developers feeling out of the loop with long wait times and less opportunity to direct the work. As software development is iterative, developers need to inject taste, direction, and sensibility along the way.
GPT-5.3 Codex-Spark is OpenAI's first model designed for real-time coding — optimized to feel near-instant when served on ultra-low latency hardware, delivering more than 1,000 tokens per second while remaining highly capable for real-world coding tasks. At that speed, a 300-line function takes under two seconds. You can read it as it writes. You can interrupt mid-generation and redirect. The AI stops being a batch processor and starts being a collaborator. And the hardware behind it — Cerebras' Wafer Scale Engine 3 — is the most consequential chip story in AI since Nvidia's H100 became the backbone of the entire industry. This is OpenAI's first production deployment on silicon outside its long-standing core stack with Nvidia. That sentence carries more weight than most coverage has acknowledged.
The Latency Problem: Why Speed Is the Feature Nobody Talks About
Raw capability benchmarks dominate AI coverage. SWE-bench scores, GPQA percentages, ARC-AGI-2 results — these are the numbers that fill announcement posts and comparison articles. But there's a variable that benchmarks don't capture: how it feels to use the model when you're actually building something.
AI coding assistants have a latency problem. Even the best models — GPT-5.3-Codex, Claude Opus 4.6 — take seconds to minutes to generate substantial code. That delay breaks the flow state that makes developers productive. At 1,000+ tokens per second, Codex-Spark generates code faster than most developers can read it.
The research on flow state in software development is unambiguous: interruptions longer than 2–3 seconds cause context switching that takes 15–20 minutes to recover from fully. Every "generating..." spinner that runs for 8 seconds is a 20-minute tax on your productivity. The frontier models that win benchmarks are often the worst offenders — their extended reasoning chains are brilliant and slow. Speed is not just about the chip. Two systems may have similar intelligence, but the faster one will feel significantly better to use. Users don't just evaluate what the model can do — they evaluate how quickly they can do it.
What Exactly Is the Cerebras Wafer Scale Engine 3? (The Non-Nvidia Hardware)
To understand why Codex-Spark is fast, you need to understand why traditional AI models are slow. And to understand that, you need to understand what a GPU cluster actually is.
Every major AI model you've ever used — GPT-5, Claude Opus 4.6, Gemini 3.1 Pro — runs inference on clusters of Nvidia GPUs. Not one GPU. Hundreds of them, connected by high-speed interconnects (NVLink, InfiniBand). When a model generates a token, data has to travel between GPUs over those connections — and that inter-chip communication is the bottleneck: every hop adds latency that no amount of raw compute can claw back.
Cerebras built a completely different answer. The WSE-3 is not a GPU — it is a single wafer-scale processor designed from the ground up for neural network inference and training. Instead of dozens of chips talking over cables, the WSE-3 is one chip. Cerebras' purpose-built Wafer Scale Engine features the largest on-chip memory of any AI processor, enabling high-speed inference at thousands of tokens per second per user. The architecture scales out to thousands of systems, extending fast memory capacity into the multi-terabyte domain to support trillion-parameter models for both training and inference.
The physical scale of the thing is hard to convey in text. The WSE-3 contains 4 trillion transistors on a single piece of silicon roughly the size of a dinner plate. An Nvidia H100 — the GPU that powers most of the AI industry — has 80 billion transistors on a chip smaller than your palm. The WSE-3 has 50 times more transistors, all on one die, with no inter-chip communication bottleneck. OpenAI has agreed to deploy 750 megawatts of Cerebras-backed compute online in phases through 2028. This is not a pilot experiment. This is an infrastructure commitment.
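The communication argument above can be made concrete with a toy latency model. Everything here — the microsecond figures, the hop counts — is an illustrative assumption, not a vendor specification; the point is only that removing off-chip hops removes a term from the per-token latency budget.

```python
# Toy model of per-token latency: on-chip compute plus inter-chip transfers.
# All numbers below are illustrative placeholders, not measured specs.

def per_token_latency_us(compute_us: float, hops: int, hop_us: float) -> float:
    """Latency for one token: compute time plus interconnect hop time."""
    return compute_us + hops * hop_us

# Multi-GPU cluster: weights sharded across devices, several hops per token.
cluster = per_token_latency_us(compute_us=5.0, hops=8, hop_us=2.0)  # 21.0 us

# Wafer-scale: everything on one die, zero off-chip hops.
wafer = per_token_latency_us(compute_us=5.0, hops=0, hop_us=2.0)    # 5.0 us

print(f"cluster: {cluster:.1f} us/token -> {1e6 / cluster:,.0f} tok/s ceiling")
print(f"wafer:   {wafer:.1f} us/token -> {1e6 / wafer:,.0f} tok/s ceiling")
```

With these made-up numbers the wafer-scale path has a 4x higher throughput ceiling from identical compute — which is the structural claim Cerebras makes, independent of the specific figures.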
The Three Infrastructure Changes OpenAI Made (That Affect Every Model)
Codex-Spark isn't just a model on fast hardware. To make real-time coding possible, OpenAI also re-engineered the entire request-response pipeline. The three specific changes they made — and this is the part most coverage skips — are being deployed across all OpenAI models, not just Codex-Spark:
1. Persistent WebSocket Connection (80% Less Overhead)
OpenAI moved away from traditional HTTP request methods and introduced a persistent WebSocket connection. This change reduces client-server round-trip overhead by 80%. Traditional AI API calls open a new HTTP connection for every request — connection setup, TLS handshake, routing overhead, all happening before a single token generates. WebSocket keeps the connection alive, eliminating that setup cost on every call. For interactive coding sessions making dozens of small requests, this compounds dramatically.
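The compounding effect is easy to see with back-of-envelope arithmetic. This sketch is not OpenAI's actual protocol implementation — the millisecond costs are assumed values — but it shows why paying connection setup once instead of per-request matters most for sessions with many small calls.

```python
# Back-of-envelope: why a persistent connection compounds over many requests.
# Setup/request costs are illustrative assumptions, not measured OpenAI figures.

TCP_TLS_SETUP_MS = 120.0  # handshake + TLS + routing per new HTTP connection
REQUEST_MS = 40.0         # the request/response work itself

def http_total_ms(n_requests: int) -> float:
    """New connection per request: pay the setup cost every time."""
    return n_requests * (TCP_TLS_SETUP_MS + REQUEST_MS)

def websocket_total_ms(n_requests: int) -> float:
    """One persistent connection: pay the setup cost once."""
    return TCP_TLS_SETUP_MS + n_requests * REQUEST_MS

n = 50  # an interactive coding session with dozens of small edits
print(f"HTTP per-request: {http_total_ms(n):,.0f} ms")
print(f"Persistent WS:    {websocket_total_ms(n):,.0f} ms")
```

Under these assumptions, 50 requests cost 8,000 ms over per-request HTTP versus 2,120 ms over one persistent connection — nearly all of the setup overhead simply disappears after the first call.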
2. Rewritten Inference Stack (30% Less Per-Token Time)
OpenAI rewrote key pieces of their inference stack, reducing per-token processing time by 30%. This is internal model serving infrastructure — the code that takes model weights and turns them into token probabilities. A 30% reduction in per-token time is the kind of engineering win that usually takes a hardware generation to achieve; OpenAI got it through software optimization.
3. Reworked Session Initialization (50% Faster First Token)
OpenAI reworked how sessions are initialized so that the first visible token appears sooner — improving time-to-first-token by 50%. Time-to-first-token is arguably the most psychologically important latency metric. A model that starts streaming immediately feels fast even if the total generation time is longer. Users perceive responsiveness from the first character, not from the last. Cutting TTFT in half is a felt improvement — the kind users notice without being told it changed.
These improvements are already rolling out to ALL OpenAI models:
These end-to-end latency improvements — 80% less roundtrip overhead, 30% less per-token processing, 50% faster first token — will become the default for all models across OpenAI's serving stack. If GPT-5.3 Instant or GPT-5.2 Pro has felt snappier in the last few weeks, this is why. Codex-Spark was the testbed. The entire fleet gets the benefit.
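To see how the three improvements combine, here is a sketch that applies the stated percentages to a hypothetical request. The 80/50/30 reductions come from the article; the baseline millisecond values are invented for illustration.

```python
# Apply the three stated improvements to a hypothetical baseline request.
# The percentages are from OpenAI's announcement; the baseline is made up.

baseline = {
    "roundtrip_overhead_ms": 200.0,  # connection/routing cost per request
    "ttft_ms": 1000.0,               # time to first visible token
    "per_token_ms": 15.0,            # serving time per generated token
}

improved = {
    "roundtrip_overhead_ms": baseline["roundtrip_overhead_ms"] * (1 - 0.80),
    "ttft_ms": baseline["ttft_ms"] * (1 - 0.50),
    "per_token_ms": baseline["per_token_ms"] * (1 - 0.30),
}

def total_ms(cfg: dict, n_tokens: int) -> float:
    """End-to-end latency: overhead + first token + remaining generation."""
    return cfg["roundtrip_overhead_ms"] + cfg["ttft_ms"] + n_tokens * cfg["per_token_ms"]

n = 500  # a medium-sized code edit
print(f"baseline: {total_ms(baseline, n):,.0f} ms")
print(f"improved: {total_ms(improved, n):,.0f} ms")
```

With these assumed baselines, a 500-token response drops from 8,700 ms to 5,790 ms from the pipeline changes alone — before the Cerebras hardware speedup is even factored in.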
Codex-Spark Benchmarks: Fast AND Capable
The legitimate concern with any speed-first model is the capability tradeoff. Distilled or smaller models are almost always weaker on reasoning benchmarks. OpenAI says Codex-Spark achieved results between GPT-5.1-Codex-mini and GPT-5.3-Codex on SWE-Bench Pro and Terminal-Bench 2.0 — in a fraction of the time. Here's the full picture:
| Benchmark | GPT-5.3 Codex-Spark | GPT-5.3 Codex (full) | What It Measures |
|---|---|---|---|
| SWE-Bench Pro | Near-parity with full Codex | New SOTA | Real-world software engineering, multi-language, contamination-resistant |
| Terminal-Bench 2.0 | 77.3% | New SOTA (higher) | CLI, bash, system tasks — actual agentic terminal work |
| Speed | 1,000+ tokens/sec | ~65 tokens/sec | 15x faster — crosses "real-time" threshold |
| Time-to-first-token | 50% faster than prior Codex | Baseline | Critical for perceived responsiveness |
| Cybersecurity risk | ✅ NOT "High" classified | "High" classified | Spark is smaller — doesn't hit the concerning cyberattack automation thresholds |
The SWE-Bench Pro parity is the headline: Codex-Spark solves real-world software engineering tasks at nearly the same accuracy as the full model, but in a fraction of the time. For most practical development work, this is the benchmark that matters. The Terminal-Bench 2.0 gap shows where the smaller model struggles — complex multi-step terminal operations where reasoning depth matters more than speed.
GPT-5.3 Codex-Spark is optimized for throughput, not deep complexity. The practical split: use Codex-Spark for editing specific sections of code, writing unit tests, generating boilerplate, renaming and refactoring, quick bug fixes, and rapid iteration cycles. Use full GPT-5.3 Codex for complex multi-file architecture decisions, long-running autonomous tasks, and anything that requires extended reasoning across a large codebase.
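The throughput difference is simple division. This sketch reproduces the article's rough numbers; the token count for a 300-line function is an assumption (tokens per line vary considerably by language and style).

```python
# Rough generation-time math behind the "~20 s vs ~2 s" comparison.
# The ~1,300-token estimate for a 300-line function is an assumption.

def generation_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    """Time to stream n_tokens at a given serving speed."""
    return n_tokens / tokens_per_sec

function_tokens = 1300  # roughly a 300-line function at ~4 tokens/line

full_codex = generation_seconds(function_tokens, 65)     # ~20.0 s
codex_spark = generation_seconds(function_tokens, 1000)  # ~1.3 s

print(f"GPT-5.3 Codex: {full_codex:.1f} s")
print(f"Codex-Spark:   {codex_spark:.1f} s")
```

The same arithmetic explains why the speedup changes the interaction model rather than just the wait: 1.3 seconds sits inside the 2–3 second interruption threshold cited earlier, while 20 seconds does not.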
What "Real-Time Steering" Actually Means
The capability that 1,000 tokens/sec unlocks is more interesting than just "it's faster." The model's speed allows developers to interrupt and redirect logic in real-time, shifting the workflow from batch-processing to live pair-programming. At 65 tokens/second, interrupting mid-generation is awkward — by the time you've decided you want to change direction, the model has already written 30 more lines down the wrong path. You stop it, delete the output, and restart. At 1,000 tokens/second, code generates faster than you can read it, but the latency between "I want to change direction" and "the model responds to my new direction" compresses to under a second. You steer in real time rather than batching instructions.
GPT-5.3 Codex-Spark is tuned for interactive development workflows such as editing specific sections of code and running targeted tests, and will not automatically execute tests unless instructed. That last detail matters: the model defaults to minimal edits. It won't auto-run test suites or trigger side effects without explicit instruction — a conservative default that's the right call for real-time interactive sessions where you might interrupt and change direction multiple times before committing to a change.
OpenAI Breaks Up With Nvidia (Sort Of)
The hardware story here is at least as significant as the model story, and it's been underreported. Codex-Spark is the first OpenAI model not to use Nvidia's hardware, running solely on Cerebras Wafer-Scale Engine 3 chips. The release comes as both Cerebras and OpenAI are trying to prove their worth to enterprises against their respective competitors.
To be precise about what this does and doesn't mean: OpenAI said the partnership with Nvidia is "foundational" and stated the company is anchored on Nvidia as the core of its training and inference stack, while also expanding the ecosystem around it through partnerships with Cerebras and others. OpenAI's most powerful models continue to be trained and served on Nvidia systems. This is not a divorce. It's a second relationship. The strategic logic is clear: Nvidia hardware dominates training and bulk inference. Cerebras hardware wins on low-latency interactive inference. Different workloads, different silicon.
The strategic significance extends beyond raw performance. By deploying a production model on Cerebras hardware, OpenAI is actively diversifying away from its near-total dependency on Nvidia GPUs — a calculated move to reduce supply chain risk and negotiate better terms with hardware partners. For the broader AI industry, it validates Cerebras's wafer-scale approach as a viable alternative for inference workloads.
The industry implications are significant. OpenAI's use of Cerebras' wafer-scale engine infrastructure could represent an opportunity for other AI hardware vendors that specialize in application-specific integrated circuits: making these services available on AI ASICs, particularly ones designed specifically for inference, creates a business model for similar players in the market. OpenAI just proved the market exists. Expect every other AI lab to evaluate non-Nvidia inference options within 12 months.
Who Can Access Codex-Spark Right Now
| Access Method | Status | Requirement |
|---|---|---|
| Codex app (chatgpt.com/codex) | ✅ Live — research preview | ChatGPT Pro ($200/month) |
| Codex CLI | ✅ Live — research preview | ChatGPT Pro; update CLI to latest version |
| VS Code extension | ✅ Live — research preview | ChatGPT Pro; update extension |
| API (design partners) | ⚠️ Limited — small set of design partners only | Apply via platform.openai.com |
| API (general availability) | Not yet — expanding "over coming weeks" | Monitor platform.openai.com/docs/models |
| ChatGPT Plus ($20/month) | Not available | Pro plan required |
| Free tier | Not available | Paid plan required |
Because Codex-Spark runs on specialized low-latency hardware, usage is governed by a separate rate limit that may adjust based on demand during the research preview. The Cerebras infrastructure is still scaling — OpenAI is expanding access as capacity comes online. Codex itself now has more than 1 million weekly active users, according to OpenAI, and will expand beyond Pro users in the coming weeks as the company evaluates performance and demand.
What Codex-Spark Cannot Do (The Honest Limitations)
Codex-Spark is currently text-only at a 128K context window and is the first in a family of ultra-fast models. The specific limitations matter for production use:
- Text-only: No image input, no screenshot analysis, no diagram understanding — code and text only. If you rely on uploading UI screenshots to your coding workflow, Codex-Spark can't do that yet
- 128K context window: Solid for single files and small projects; limiting for large codebases. GPT-5.3 Codex standard handles much longer contexts for full-repo work. Gemini 3.1 Pro offers 1M tokens standard for context-heavy work
- No API (yet): Only a handful of design partners have API access. Enterprise teams building automated pipelines can't integrate Codex-Spark until general API access opens
- Pro-only: The $200/month Pro plan is a real barrier for individual developers. Plus users ($20/month) don't get access during the research preview
- Cerebras geography: Currently limited to Cerebras WSE-3 hardware, which constrains geographic availability and total capacity. Users in regions where Cerebras data centers aren't yet deployed will experience higher latency or no access
- Not for security tasks: Codex-Spark does not meet the "High capability" threshold for cybersecurity in OpenAI's Preparedness Framework, making it unsuitable for sensitive auth or security tasks. For those, use the full GPT-5.3 Codex (with appropriate access controls)
What's Coming Next for Codex-Spark
As OpenAI learns more with the developer community about where fast models shine for coding, they will introduce even more capabilities — including larger models, longer context lengths, and multimodal input. The roadmap as stated: this is the first in a family of ultra-fast models. The 128K text-only limitation is explicitly called out as a current state, not permanent architecture. Cerebras expects to bring ultra-fast inference capability to the largest frontier models in 2026 — meaning GPT-5.3 Codex full and potentially GPT-5.4 could eventually run on Cerebras hardware at Spark-like speeds. The combination of frontier-class reasoning and 1,000 token/second generation doesn't exist yet — it's the product roadmap.
Frequently Asked Questions
What Is GPT-5.3 Codex-Spark?
GPT-5.3 Codex-Spark is a smaller version of GPT-5.3 Codex and OpenAI's first model designed for real-time coding — delivering more than 1,000 tokens per second while remaining highly capable for real-world coding tasks. It is the first milestone in OpenAI's partnership with Cerebras, running on the Cerebras Wafer Scale Engine 3 rather than Nvidia GPUs.
How Is Codex-Spark 1,000 Tokens Per Second Possible?
The speed comes from two sources: the Cerebras WSE-3 chip eliminates the inter-GPU communication bottleneck that slows traditional inference, and OpenAI rewrote the serving stack with a persistent WebSocket connection (80% less overhead), a rewritten inference stack (30% less per-token time), and reworked session initialization (50% faster first token). The hardware and software improvements compound — neither alone would produce 1,000 tokens/second.
Is Codex-Spark Free?
No — Codex-Spark is rolling out as a research preview for ChatGPT Pro users at $200/month. Not available on Plus ($20/month) or free plans during the research preview. API access is currently limited to a small set of design partners, with general API availability expanding over the coming weeks.
How Fast Is Codex-Spark vs. GPT-5.3 Codex?
Codex-Spark runs at roughly 1,000 tokens per second — about 15x faster than GPT-5.3 Codex, which operates at approximately 65 tokens per second in standard serving. Time-to-first-token is 50% faster than prior Codex serving. A 300-line function that takes ~20 seconds with full Codex takes ~2 seconds with Codex-Spark.
Does Codex-Spark Work in VS Code?
Yes — Codex-Spark is available in the latest versions of the Codex app, CLI, and VS Code extension for ChatGPT Pro users. Update your VS Code Codex extension to the latest version and select Codex-Spark from the model picker. The model defaults to minimal edits and won't auto-run tests unless you explicitly instruct it to.
Why Is OpenAI Using Cerebras Instead of Nvidia?
OpenAI describes Nvidia as "foundational" with its most powerful models continuing to run on Nvidia systems. Cerebras complements that by excelling at workflows demanding extremely low latency — it is not replacing Nvidia but adding a dedicated low-latency inference tier. The strategic reasons: reducing supply chain dependency on a single vendor, diversifying hardware for cost negotiation leverage, and accessing inference speeds that GPU clusters structurally cannot match for interactive workloads.
What Is the Cerebras Wafer Scale Engine 3?
The WSE-3 is a single wafer-scale processor designed from the ground up for neural network inference and training — not a GPU. It contains 4 trillion transistors on a single die, with massive on-chip memory that eliminates data movement bottlenecks between chips. Unlike GPU clusters where chips communicate over interconnects, the WSE-3 processes everything on one piece of silicon. It is the largest chip ever built for AI computation.
What Are the Limits of Codex-Spark?
Codex-Spark is currently text-only at a 128K context window. No image input, no multimodal capabilities, shorter context than GPT-5.3 Codex standard. API access is limited to design partners currently. Rate limits apply separately from standard Codex limits due to the specialized Cerebras hardware. Not recommended for security-sensitive tasks — it does not reach the "High" cybersecurity threshold, making it unsuitable for auth or vulnerability research work.
Will Codex-Spark Get Multimodal and Longer Context?
Yes — OpenAI confirmed plans to introduce larger models, longer context lengths, and multimodal input as they learn from developer feedback during the research preview. The 128K text-only limitation is a current state of the first release in what they describe as "a family of ultra-fast models" — not a permanent design constraint.