Podcast Episode
AWS and Cerebras Join Forces to Build the Fastest AI Inference in the Cloud
March 13, 2026
Amazon Web Services and Cerebras Systems have announced a collaboration to deploy Cerebras CS-3 systems inside AWS data centres, using a technique called inference disaggregation to split AI workloads between AWS Trainium and the CS-3 chip. The service will launch on Amazon Bedrock in the coming months and promises speeds thousands of times faster than traditional GPU-based alternatives.
A New Approach to AI Speed
Amazon Web Services and Cerebras Systems have announced a landmark collaboration that could reshape how artificial intelligence runs in the cloud. The partnership will see Cerebras CS-3 systems deployed directly inside AWS data centres, paired with Amazon's custom Trainium chips to deliver what both companies describe as the fastest AI inference available.
How Inference Disaggregation Works
At the heart of the partnership is a technique called inference disaggregation, which splits AI workloads into two distinct phases handled by purpose-built processors. AWS Trainium tackles the computationally heavy prefill phase, processing the user's input prompt, before handing off to the Cerebras CS-3 for the decode phase, where output tokens are generated at extraordinary speed. The two systems communicate via Amazon's high-speed Elastic Fabric Adapter networking.
The arrangement promises a fivefold increase in high-speed token capacity within the same hardware footprint, and the service will be available exclusively through Amazon Bedrock.
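To make the division of labour concrete, here is a minimal, purely illustrative Python sketch of a disaggregated serving loop. Everything in it, including the worker classes and the toy "KV cache" hand-off, is a simplification invented for this example; the actual AWS/Cerebras pipeline, its internal APIs, and its Elastic Fabric Adapter transport have not been published.

    # Illustrative only: a toy disaggregated inference loop in which
    # "prefill" (prompt processing) and "decode" (token generation)
    # run on separate workers, mimicking the Trainium -> CS-3 split.
    from dataclasses import dataclass

    @dataclass
    class KVCache:
        """Stands in for the attention key/value state that the prefill
        stage would ship to the decode stage over the network."""
        prompt_tokens: list[str]

    class PrefillWorker:
        """Compute-bound stage: ingests the whole prompt at once."""
        def run(self, prompt: str) -> KVCache:
            return KVCache(prompt_tokens=prompt.split())

    class DecodeWorker:
        """Latency-bound stage: emits output tokens one at a time."""
        def run(self, cache: KVCache, max_tokens: int):
            for i in range(max_tokens):
                # A real decoder samples from a model; we just echo state.
                yield f"token_{i}_from_{len(cache.prompt_tokens)}_prompt_tokens"

    def serve(prompt: str, max_tokens: int = 3) -> list[str]:
        cache = PrefillWorker().run(prompt)   # phase 1: Trainium's role
        decoder = DecodeWorker()              # phase 2: the CS-3's role
        return list(decoder.run(cache, max_tokens))

    if __name__ == "__main__":
        print(serve("Explain inference disaggregation"))

The point of the split is that the two phases have opposite performance profiles: prefill is a large parallel batch computation, while decode is a sequential, latency-sensitive loop, so each can be mapped to hardware built for that shape of work.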
A Milestone for Cerebras
The deal represents a major distribution win for Cerebras, whose Wafer-Scale Engine processors contain nine hundred thousand AI cores and four trillion transistors on a single dinner-plate-sized chip. The Sunnyvale startup already powers inference for OpenAI, Cognition, and Meta, and signed a ten billion dollar agreement with OpenAI in January. Cerebras closed a one billion dollar funding round in February at a valuation exceeding twenty-two billion dollars and is preparing for an IPO as soon as April, with Morgan Stanley leading the offering.
Why Inference Speed Matters Now
The collaboration reflects a broader industry shift as AI workloads move from training to real-time inference. Agentic coding tools now generate roughly fifteen times more tokens per query than conversational chat, driving urgent demand for faster output. AWS will support both the new disaggregated configuration and traditional setups, giving customers flexibility to route workloads as needed.
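For developers, the relevant surface is Amazon Bedrock rather than the hardware itself. The sketch below uses the existing boto3 Converse API to call a Bedrock-hosted model; the model identifier is a placeholder, and how the Cerebras-backed configuration will be selected, whether automatically, by model choice, or via an option like Bedrock's current latency-optimized inference setting, has not been announced, so treat the performanceConfig line as an assumption.

    # Hedged sketch: calling a model on Amazon Bedrock with boto3.
    # The model ID is a placeholder; routing to the new disaggregated
    # Trainium/CS-3 path is assumed here, not documented anywhere.
    import boto3

    client = boto3.client("bedrock-runtime", region_name="us-east-1")

    response = client.converse(
        modelId="example.placeholder-model-v1",  # hypothetical model ID
        messages=[
            {"role": "user",
             "content": [{"text": "Summarize inference disaggregation."}]}
        ],
        inferenceConfig={"maxTokens": 256, "temperature": 0.2},
        # Bedrock's existing knob for faster decoding; whether the
        # Cerebras-backed service hangs off this option is an assumption.
        performanceConfig={"latency": "optimized"},
    )

    print(response["output"]["message"]["content"][0]["text"])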