GPT-5.3 Codex-Spark

OpenAI built a coding AI that runs 15x faster than its flagship - not on Nvidia, but on a chip the size of a dinner plate with 4 trillion transistors. Nvidia's monopoly just got its first serious crack.

GPT-5.3 Codex-Spark: 1,000 Tokens/Sec & OpenAI Just Dumped Nvidia

GPT-5.3 Codex-Spark — Fast Facts (February 12, 2026):

  • Speed: 1,000+ tokens per second — 15x faster than GPT-5.3 Codex, faster than any developer can read code as it generates
  • The chip: Cerebras Wafer Scale Engine 3 — a single processor the size of a dinner plate with 4 trillion transistors, designed from scratch to eliminate the GPU latency bottleneck
  • Historic first: OpenAI's first production model running on non-Nvidia hardware — the first real crack in Nvidia's ironclad grip on AI inference
  • Pipeline improvements: 80% less client-server overhead, 50% faster time-to-first-token, 30% less per-token processing time — and these improvements will roll out to ALL OpenAI models
  • The catch: Pro-only ($200/month), text-only, 128K context, no API access yet, and Cerebras infrastructure is still scaling — geographic availability is limited

Every AI coding tool released in the last two years has the same fundamental problem: it makes you wait. You write a prompt, hit enter, and stare at a spinner while the model generates. By the time 200 lines of code appear, you've lost the thread of what you were building. You're not pair programming with AI — you're submitting a ticket and waiting for it to come back.

Agentic coding has fundamentally changed software development. For the first time, machines can autonomously work for hours or days without human supervision. But this mode of interaction can also leave developers feeling out of the loop with long wait times and less opportunity to direct the work. As software development is iterative, developers need to inject taste, direction, and sensibility along the way.

GPT-5.3 Codex-Spark is OpenAI's first model designed for real-time coding — optimized to feel near-instant when served on ultra-low latency hardware, delivering more than 1,000 tokens per second while remaining highly capable for real-world coding tasks. At that speed, a 300-line function takes under two seconds. You can read it as it writes. You can interrupt mid-generation and redirect. The AI stops being a batch processor and starts being a collaborator. And the hardware behind it — Cerebras' Wafer Scale Engine 3 — is the most consequential chip story in AI since Nvidia's H100 became the backbone of the entire industry. This is OpenAI's first production deployment on silicon outside its long-standing core stack with Nvidia. That sentence carries more weight than most people have stopped to consider.

The Latency Problem: Why Speed Is the Feature Nobody Talks About

Raw capability benchmarks dominate AI coverage. SWE-bench scores, GPQA percentages, ARC-AGI-2 results — these are the numbers that fill announcement posts and comparison articles. But there's a variable that benchmarks don't capture: how it feels to use the model when you're actually building something.

AI coding assistants have a latency problem. Even the best models — GPT-5.3-Codex, Claude Opus 4.6 — take seconds to minutes to generate substantial code. That delay breaks the flow state that makes developers productive. At 1,000+ tokens per second, Codex-Spark generates code faster than most developers can read it.

Research on flow state in software development points the same way: interruptions of more than a few seconds trigger context switches that can take 15–20 minutes to fully recover from. Every "generating..." spinner that runs for 8 seconds risks a 20-minute tax on your productivity. The frontier models that win benchmarks are often the worst offenders: their extended reasoning chains are brilliant and slow. Speed is not just about the chip. Two systems may have similar intelligence, but the faster one will feel significantly better to use. Users don't just evaluate what the model can do; they evaluate how quickly they can do it.

What Exactly Is the Cerebras Wafer Scale Engine 3? (The Non-Nvidia Hardware)

To understand why Codex-Spark is fast, you need to understand why traditional AI models are slow. And to understand that, you need to understand what a GPU cluster actually is.

Every major AI model you've ever used — GPT-5, Claude Opus 4.6, Gemini 3.1 Pro — runs inference on clusters of Nvidia GPUs. Not one GPU. Hundreds of them, connected by high-speed interconnects (NVLink, InfiniBand). Every time the model generates a token, data has to travel between GPUs over those links, and that inter-chip communication is the bottleneck that caps generation speed.

Cerebras built a completely different answer. The WSE-3 is not a GPU — it is a single wafer-scale processor designed from the ground up for neural network inference and training. Instead of dozens of chips talking over cables, the WSE-3 is one chip. Cerebras' purpose-built Wafer Scale Engine features the largest on-chip memory of any AI processor, enabling high-speed inference at thousands of tokens per second per user. The architecture scales out to thousands of systems, extending fast memory capacity into the multi-terabyte domain to support trillion-parameter models for both training and inference.

The physical scale of the thing is hard to convey in text. The WSE-3 contains 4 trillion transistors on a single piece of silicon roughly the size of a dinner plate. An Nvidia H100 — the GPU that powers most of the AI industry — has 80 billion transistors on a chip smaller than your palm. The WSE-3 has 50 times more transistors, all on one die, with no inter-chip communication bottleneck. OpenAI has agreed to bring 750 megawatts of Cerebras-backed compute online in phases through 2028. This is not a pilot experiment. It is an infrastructure commitment.

The Three Infrastructure Changes OpenAI Made (That Affect Every Model)

Codex-Spark isn't just a model on fast hardware. To make real-time coding possible, OpenAI also re-engineered the entire request-response pipeline. The three specific changes they made — and this is the part most coverage skips — are being deployed across all OpenAI models, not just Codex-Spark:

1. Persistent WebSocket Connection (80% Less Overhead)

OpenAI moved away from traditional HTTP request methods and introduced a persistent WebSocket connection. This change reduces client-server round-trip overhead by 80%. Traditional AI API calls open a new HTTP connection for every request — connection setup, TLS handshake, routing overhead, all happening before a single token generates. WebSocket keeps the connection alive, eliminating that setup cost on every call. For interactive coding sessions making dozens of small requests, this compounds dramatically.
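The arithmetic behind that 80% figure can be sketched with a toy overhead model. Everything here is illustrative: the millisecond costs are assumptions chosen for readability, not OpenAI's measured numbers.

```python
# Toy latency model (assumed numbers, not OpenAI's measurements): a fresh HTTPS
# request pays connection setup (TCP + TLS handshake, routing) on every call,
# while a persistent WebSocket pays setup once, then only per-message framing.

def http_session_overhead_ms(n_requests: int, setup_ms: float = 100.0,
                             per_request_ms: float = 20.0) -> float:
    """Total connection overhead when every request opens a new HTTPS connection."""
    return n_requests * (setup_ms + per_request_ms)

def websocket_session_overhead_ms(n_requests: int, setup_ms: float = 100.0,
                                  per_message_ms: float = 20.0) -> float:
    """Total overhead with one persistent WebSocket: setup is paid exactly once."""
    return setup_ms + n_requests * per_message_ms

n = 40  # an interactive coding session making many small requests
http_total = http_session_overhead_ms(n)     # 40 * 120 = 4800 ms
ws_total = websocket_session_overhead_ms(n)  # 100 + 40 * 20 = 900 ms
savings = 1 - ws_total / http_total
print(f"overhead saved: {savings:.0%}")      # prints "overhead saved: 81%"
```

The longer the session runs, the closer the savings get to eliminating setup cost entirely, which is why the win compounds for interactive coding in particular.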

2. Rewritten Inference Stack (30% Less Per-Token Time)

OpenAI rewrote key pieces of their inference stack, reducing per-token processing time by 30%. This is internal model serving infrastructure — the code that takes model weights and turns them into token probabilities. A 30% reduction in per-token time is the kind of engineering win that usually takes a hardware generation to achieve; OpenAI got it through software optimization.

3. Reworked Session Initialization (50% Faster First Token)

OpenAI reworked how sessions are initialized so that the first visible token appears sooner — improving time-to-first-token by 50%. Time-to-first-token is arguably the most psychologically important latency metric. A model that starts streaming immediately feels fast even if the total generation time is longer. Users perceive responsiveness from the first character, not from the last. Cutting TTFT in half is a felt improvement — the kind users notice without being told it changed.

These improvements are already rolling out to ALL OpenAI models:

These end-to-end latency improvements — 80% less roundtrip overhead, 30% less per-token processing, 50% faster first token — will become the default for all models across OpenAI's serving stack. If GPT-5.3 Instant or GPT-5.2 Pro has felt snappier in the last few weeks, this is why. Codex-Spark was the testbed. The entire fleet gets the benefit.
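Stacking the three percentages gives a feel for what a single interactive request gains from the software changes alone. The baseline milliseconds below are assumptions for illustration; only the three percentage reductions come from OpenAI's announcement.

```python
# Back-of-envelope: apply the three announced reductions to an assumed baseline
# request. Only the percentages (80% / 50% / 30%) are from the announcement.

baseline = {
    "roundtrip_overhead_ms": 250.0,  # client-server setup and routing (assumed)
    "ttft_ms": 600.0,                # time until the first token streams (assumed)
    "per_token_ms": 15.4,            # ~65 tokens/sec serving (assumed)
}

improved = {
    "roundtrip_overhead_ms": baseline["roundtrip_overhead_ms"] * (1 - 0.80),
    "ttft_ms": baseline["ttft_ms"] * (1 - 0.50),
    "per_token_ms": baseline["per_token_ms"] * (1 - 0.30),
}

def request_ms(profile: dict, n_tokens: int = 400) -> float:
    """End-to-end time for one request that generates n_tokens."""
    return (profile["roundtrip_overhead_ms"] + profile["ttft_ms"]
            + n_tokens * profile["per_token_ms"])

print(round(request_ms(baseline)), round(request_ms(improved)))  # 7010 4662
```

Note what the sketch implies: the software changes alone buy roughly a third, not 15x. The remaining order of magnitude comes from the Cerebras hardware driving per-token time down, which is why the hardware and software stories are inseparable.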

Codex-Spark Benchmarks: Fast AND Capable

The legitimate concern with any speed-first model is the capability tradeoff. Distilled or smaller models are almost always weaker on reasoning benchmarks. OpenAI says Codex-Spark achieved results between GPT-5.1-Codex-mini and GPT-5.3-Codex on SWE-Bench Pro and Terminal-Bench 2.0 — in a fraction of the time. Here's the full picture:

| Benchmark | GPT-5.3 Codex-Spark | GPT-5.3 Codex (full) | What It Measures |
|---|---|---|---|
| SWE-Bench Pro | Near-parity with full Codex | New SOTA | Real-world software engineering, multi-language, contamination-resistant |
| Terminal-Bench 2.0 | 77.3% | New SOTA (higher) | CLI, bash, and system tasks; actual agentic terminal work |
| Speed | 1,000+ tokens/sec | ~65 tokens/sec | 15x faster; crosses the "real-time" threshold |
| Time-to-first-token | 50% faster than prior Codex | Baseline | Critical for perceived responsiveness |
| Cybersecurity risk | ✅ NOT classified "High" | Classified "High" | Spark is smaller; it doesn't hit the concerning cyberattack-automation thresholds |

The SWE-Bench Pro parity is the headline: Codex-Spark solves real-world software engineering tasks at nearly the same accuracy as the full model, but in a fraction of the time. For most practical development work, this is the benchmark that matters. The Terminal-Bench 2.0 gap shows where the smaller model struggles — complex multi-step terminal operations where reasoning depth matters more than speed.

GPT-5.3 Codex-Spark is optimized for throughput, not deep complexity. The practical split: use Codex-Spark for editing specific sections of code, writing unit tests, generating boilerplate, renaming and refactoring, quick bug fixes, and rapid iteration cycles. Use full GPT-5.3 Codex for complex multi-file architecture decisions, long-running autonomous tasks, and anything that requires extended reasoning across a large codebase.
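That split can be expressed as a simple client-side routing heuristic. To be clear, this is a suggestion sketched from the guidance above, not an OpenAI API: the model names are real, but `pick_codex_model`, its task labels, and its thresholds are hypothetical.

```python
# Hypothetical client-side router for the Spark-vs-full split described above.
# The function, task labels, and thresholds are illustrative; only the model
# names are real.

FAST_TASKS = {"edit_section", "unit_tests", "boilerplate", "rename", "refactor", "quick_fix"}

def pick_codex_model(task: str, files_touched: int, needs_long_autonomy: bool) -> str:
    """Route latency-sensitive, well-scoped tasks to Spark; everything else to full Codex."""
    if task in FAST_TASKS and files_touched <= 3 and not needs_long_autonomy:
        return "gpt-5.3-codex-spark"  # real-time iteration wins
    return "gpt-5.3-codex"            # architecture, multi-file, long-running work

print(pick_codex_model("unit_tests", 1, False))    # gpt-5.3-codex-spark
print(pick_codex_model("architecture", 12, True))  # gpt-5.3-codex
```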

What "Real-Time Steering" Actually Means

The capability that 1,000 tokens/sec unlocks is more interesting than just "it's faster." The model's speed allows developers to interrupt and redirect logic in real-time, shifting the workflow from batch-processing to live pair-programming. At 65 tokens/second, interrupting mid-generation is awkward — by the time you've decided you want to change direction, the model has already written 30 more lines down the wrong path. You stop it, delete the output, and restart. At 1,000 tokens/second, code generates faster than you can read it, but the latency between "I want to change direction" and "the model responds to my new direction" compresses to under a second. You steer in real time rather than batching instructions.
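The interaction pattern is easy to see in code. The sketch below is not OpenAI's API: `fake_stream` is a stand-in async token generator, and cancelling the consumer task stands in for the user interrupting mid-generation to redirect.

```python
import asyncio

async def fake_stream(rate_tps: float):
    """Stand-in for a streaming model response at roughly rate_tps tokens/second."""
    i = 0
    while True:
        await asyncio.sleep(1 / rate_tps)
        yield f"tok{i}"
        i += 1

async def generate_until_interrupted(rate_tps: float, interrupt_after_s: float):
    """Consume the stream until the 'user' interrupts, then return what was emitted."""
    received = []

    async def consume():
        async for tok in fake_stream(rate_tps):
            received.append(tok)

    task = asyncio.create_task(consume())
    await asyncio.sleep(interrupt_after_s)  # the user decides to change direction...
    task.cancel()                           # ...and hits stop
    try:
        await task
    except asyncio.CancelledError:
        pass
    return received

# At ~1,000 tok/s, a 100 ms decision window costs only a screenful of discarded
# tokens, and the redirected response starts streaming again almost instantly.
tokens = asyncio.run(generate_until_interrupted(1000, 0.1))
print(len(tokens) > 0)  # prints True
```

The design choice worth noticing is that the discard cost of an interruption shrinks as the redirect latency shrinks, so steering stops feeling like restarting.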

GPT-5.3 Codex-Spark is tuned for interactive development workflows such as editing specific sections of code and running targeted tests, and will not automatically execute tests unless instructed. That last detail matters: the model defaults to minimal edits. It won't auto-run test suites or trigger side effects without explicit instruction — a conservative default that's the right call for real-time interactive sessions where you might interrupt and change direction multiple times before committing to a change.

OpenAI Breaks Up With Nvidia (Sort Of)

The hardware story here is at least as significant as the model story, and it's been underreported. Codex-Spark is the first OpenAI model not to use Nvidia hardware, running solely on Cerebras Wafer Scale Engine 3 chips. The release comes as both Cerebras and OpenAI are trying to prove their worth to enterprises against their competitors.

To be precise about what this does and doesn't mean: OpenAI said the partnership with Nvidia is "foundational" and stated the company is anchored on Nvidia as the core of its training and inference stack, while also expanding the ecosystem around it through partnerships with Cerebras and others. OpenAI's most powerful models continue to be trained and served on Nvidia systems. This is not a divorce. It's a second relationship. The strategic logic is clear: Nvidia hardware dominates training and bulk inference. Cerebras hardware wins on low-latency interactive inference. Different workloads, different silicon.

The strategic significance extends beyond raw performance. By deploying a production model on Cerebras hardware, OpenAI is actively diversifying away from its near-total dependency on Nvidia GPUs — a calculated move to reduce supply chain risk and negotiate better terms with hardware partners. For the broader AI industry, it validates Cerebras's wafer-scale approach as a viable alternative for inference workloads.

The industry implications are significant. OpenAI's use of Cerebras' wafer-scale infrastructure could open the door for other AI hardware vendors that specialize in application-specific integrated circuits: shipping a production service on inference-focused ASICs creates a business model for similar players in the market. OpenAI just proved the market exists. Expect every other AI lab to evaluate non-Nvidia inference options within 12 months.

Who Can Access Codex-Spark Right Now

| Access Method | Status | Requirement |
|---|---|---|
| Codex app (chatgpt.com/codex) | ✅ Live (research preview) | ChatGPT Pro ($200/month) |
| Codex CLI | ✅ Live (research preview) | ChatGPT Pro; update CLI to latest version |
| VS Code extension | ✅ Live (research preview) | ChatGPT Pro; update extension |
| API (design partners) | ⚠️ Limited; small set of design partners only | Apply via platform.openai.com |
| API (general availability) | Not yet; expanding "over coming weeks" | Monitor platform.openai.com/docs/models |
| ChatGPT Plus ($20/month) | Not available | Pro plan required |
| Free tier | Not available | Paid plan required |

Because Codex-Spark runs on specialized low-latency hardware, usage is governed by a separate rate limit that may adjust based on demand during the research preview. The Cerebras infrastructure is still scaling — OpenAI is expanding access as capacity comes online. Codex itself now has more than 1 million weekly active users, according to OpenAI, and will expand beyond Pro users in the coming weeks as the company evaluates performance and demand.

What Codex-Spark Cannot Do (The Honest Limitations)

Codex-Spark is currently text-only at a 128K context window and is the first in a family of ultra-fast models. The specific limitations matter for production use:

  • Text-only: No image input, no screenshot analysis, no diagram understanding — code and text only. If you rely on uploading UI screenshots to your coding workflow, Codex-Spark can't do that yet
  • 128K context window: Solid for single files and small projects; limiting for large codebases. GPT-5.3 Codex standard handles much longer contexts for full-repo work. Gemini 3.1 Pro offers 1M tokens standard for context-heavy work
  • No API (yet): Only a handful of design partners have API access. Enterprise teams building automated pipelines can't integrate Codex-Spark until general API access opens
  • Pro-only: The $200/month Pro plan is a real barrier for individual developers. Plus users ($20/month) don't get access during the research preview
  • Cerebras geography: Currently limited to Cerebras WSE-3 hardware, which constrains geographic availability and total capacity. Users in regions where Cerebras data centers aren't yet deployed will experience higher latency or no access
  • Not for security tasks: Codex-Spark does not meet the "High capability" threshold for cybersecurity in OpenAI's Preparedness Framework, making it unsuitable for sensitive auth or security tasks. For those, use the full GPT-5.3 Codex (with appropriate access controls)

What's Coming Next for Codex-Spark

As OpenAI learns alongside the developer community where fast models shine for coding, they will introduce even more capabilities — including larger models, longer context lengths, and multimodal input. The roadmap as stated: this is the first in a family of ultra-fast models. The 128K text-only limitation is explicitly called out as a current state, not permanent architecture. Cerebras expects to bring ultra-fast inference capability to the largest frontier models in 2026 — meaning GPT-5.3 Codex full and potentially GPT-5.4 could eventually run on Cerebras hardware at Spark-like speeds. The combination of frontier-class reasoning and 1,000 token/second generation doesn't exist yet — it's the product roadmap.

Frequently Asked Questions

What Is GPT-5.3 Codex-Spark?

GPT-5.3 Codex-Spark is a smaller version of GPT-5.3 Codex and OpenAI's first model designed for real-time coding — delivering more than 1,000 tokens per second while remaining highly capable for real-world coding tasks. It is the first milestone in OpenAI's partnership with Cerebras, running on the Cerebras Wafer Scale Engine 3 rather than Nvidia GPUs.

How Is Codex-Spark 1,000 Tokens Per Second Possible?

The speed comes from two sources: the Cerebras WSE-3 chip eliminates the inter-GPU communication bottleneck that slows traditional inference, and OpenAI rewrote the serving stack with a persistent WebSocket connection (80% less overhead), a rewritten inference stack (30% less per-token time), and reworked session initialization (50% faster first token). The hardware and software improvements compound — neither alone would produce 1,000 tokens/second.

Is Codex-Spark Free?

No — Codex-Spark is rolling out as a research preview for ChatGPT Pro users at $200/month. Not available on Plus ($20/month) or free plans during the research preview. API access is currently limited to a small set of design partners, with general API availability expanding over the coming weeks.

How Fast Is Codex-Spark vs. GPT-5.3 Codex?

Codex-Spark runs at roughly 1,000 tokens per second — about 15x faster than GPT-5.3 Codex, which operates at approximately 65 tokens per second in standard serving. Time-to-first-token is 50% faster than prior Codex serving. A 300-line function that takes ~20 seconds with full Codex takes ~2 seconds with Codex-Spark.
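The ~20 s vs ~2 s claim falls out of simple arithmetic once you assume a token count for the function; here a ~300-line function is taken to be roughly 1,300 tokens, which is an assumption for illustration, not a measured figure.

```python
def generation_seconds(n_tokens: int, tokens_per_sec: float, ttft_s: float = 0.0) -> float:
    """Wall-clock time to stream n_tokens at a steady rate, after time-to-first-token."""
    return ttft_s + n_tokens / tokens_per_sec

N = 1300  # assumed token count for a ~300-line function
print(round(generation_seconds(N, 65)))       # 20  (full GPT-5.3 Codex, ~65 tok/s)
print(round(generation_seconds(N, 1000), 1))  # 1.3 (Codex-Spark, ~1,000 tok/s)
```

Even with a realistic time-to-first-token added on top, Spark stays comfortably under the few-second interruption threshold that breaks developer flow.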

Does Codex-Spark Work in VS Code?

Yes — Codex-Spark is available in the latest versions of the Codex app, CLI, and VS Code extension for ChatGPT Pro users. Update your VS Code Codex extension to the latest version and select Codex-Spark from the model picker. The model defaults to minimal edits and won't auto-run tests unless you explicitly instruct it to.

Why Is OpenAI Using Cerebras Instead of Nvidia?

OpenAI describes Nvidia as "foundational" with its most powerful models continuing to run on Nvidia systems. Cerebras complements that by excelling at workflows demanding extremely low latency — it is not replacing Nvidia but adding a dedicated low-latency inference tier. The strategic reasons: reducing supply chain dependency on a single vendor, diversifying hardware for cost negotiation leverage, and accessing inference speeds that GPU clusters structurally cannot match for interactive workloads.

What Is the Cerebras Wafer Scale Engine 3?

The WSE-3 is a single wafer-scale processor designed from the ground up for neural network inference and training — not a GPU. It contains 4 trillion transistors on a single die, with massive on-chip memory that eliminates data movement bottlenecks between chips. Unlike GPU clusters where chips communicate over interconnects, the WSE-3 processes everything on one piece of silicon. It is the largest chip ever built for AI computation.

What Are the Limits of Codex-Spark?

Codex-Spark is currently text-only at a 128K context window. No image input, no multimodal capabilities, shorter context than GPT-5.3 Codex standard. API access is limited to design partners currently. Rate limits apply separately from standard Codex limits due to the specialized Cerebras hardware. Not recommended for security-sensitive tasks — it does not reach the "High" cybersecurity threshold, making it unsuitable for auth or vulnerability research work.

Will Codex-Spark Get Multimodal and Longer Context?

Yes — OpenAI confirmed plans to introduce larger models, longer context lengths, and multimodal input as they learn from developer feedback during the research preview. The 128K text-only limitation is a current state of the first release in what they describe as "a family of ultra-fast models" — not a permanent design constraint.

GPT-5.3 Codex-Spark Alternatives

Similar tools in Code Development:

  • GPT-5.3 Codex (AI Coding Assistant, Paid)
  • SkillMaps (AI Coding Assistant, Freemium)
  • Codeium (AI Coding Assistant, Freemium)
  • GitHub Copilot Workspace (App Development, Freemium)
  • Antigravity (App Development, Freemium, rated 3.5)
  • Cursor (App Development, Freemium, rated 5.0)
  • v0 (App Development, Freemium, rated 5.0)
  • Windsurf (App Development, Freemium)
  • BlackBox AI (App Development, Freemium, rated 5.0)
  • Lovable AI (No-Code App Builders, Freemium, rated 5.0)
  • Replit Agent v3 (No-Code App Builders, Paid, rated 2.0)
  • Replit Agent v2 (No-Code App Builders, Paid)
  • Ask Codi (Code Optimization, Freemium)
  • Workik AI (Code Optimization, Freemium)
  • Raygun (Code Optimization, Paid)
  • Code Mentor AI (Code Optimization, Freemium)
  • GTmetrix (Code Optimization, Freemium)
  • Cloud Defence (Code Optimization, Freemium)
  • AppDynamics (Code Optimization, Freemium)
  • Dynatrace (Code Optimization, Freemium)
  • New Relic (Code Optimization, Paid)
  • Taskade (Code Optimization, Freemium)
  • Appli Tools (Code Testing, Freemium)
  • LambdaTest (Code Testing, Freemium)
  • BrowserStack (Code Testing, Freemium)
  • Appium (Code Testing, Freemium)
  • Smart Bear (Code Testing, Paid)
  • Cypress (Code Testing, Freemium)
  • Cucumber (Code Testing, Freemium)
  • Test Sigma (Code Testing, Freemium)
  • Codium (Code Testing, Freemium)
  • Selenium (Code Testing, Freemium)
  • TrackJS (Code Debugging, Paid)
  • OverOps (Code Debugging, Freemium)
  • Honeybadger (Code Debugging, Freemium)
  • GlitchTip (Code Debugging, Freemium)
  • LogRocket (Code Debugging, Freemium)
  • Bugsnag (Code Debugging, Freemium)
  • Raygun Debug (Code Debugging, Paid)
  • Airbrake (Code Debugging, Paid)
  • Rollbar (Code Debugging, Freemium)
  • Sentry (Code Debugging, Freemium)
  • Codara (Code Review, Paid)
  • SonarQube (Code Review, Paid)
  • PullRequest (Code Review, Freemium)
  • Code Rabbit AI (Code Review, Freemium)
  • ZZZCode AI (Code Review, Freemium)
  • Reviewable (Code Review, Paid)
  • CodeClimate (Code Review, Paid)
  • Codacy (Code Review, Freemium)
  • snyk.io (Code Review, Paid)
  • CodeWP (Code Editing, Freemium)
  • Sourcery (Code Editing, Paid)
  • Snyk (Code Editing, Paid)
  • Repl.it (Code Editing, Freemium)
  • Codota (Code Editing, Freemium)
  • Kite (Code Editing, Freemium)
  • Tabnine Editor (Code Editing, Freemium)
  • GitHub Copilot (App Development, Freemium, rated 4.0)
