CS-2: The most powerful solution for AI compute

The Cerebras CS-2 is built from the ground up to accelerate deep learning in the data center.

It is the complete solution for AI compute, powered by the world’s largest chip, co-designed with Cerebras Software so it’s simple to program, and packaged in an innovative system that fits directly into your infrastructure.

Overview

Faster time to solution with the Cerebras CS-2

CS-2 is designed to enable fast, flexible training and low-latency datacenter inference.

Powered by the 2nd generation Wafer-Scale Engine (WSE-2), CS-2 has greater compute density, more fast memory, and higher bandwidth interconnect than any other datacenter AI solution.

Easily programmable with leading ML frameworks, CS-2 helps industry and research organizations unlock cluster-scale AI performance with the simplicity of a single device. Faster time to solution with greater power and space efficiency.

  • 850,000 AI-optimized cores (123x more cores)
  • 40 GB on-chip SRAM (1,000x more on-chip memory)
  • 220 Pb/s interconnect bandwidth (45,000x more bandwidth)
  • 20 PB/s memory bandwidth (12,800x more bandwidth)
  • 1.2 Tb/s system I/O
  • 15 RU system dimensions

Compared to the leading industry GPU.

WSE-2 Advantages

Why is our big chip the right solution for AI?

The right solution for AI goes beyond the table stakes of designing a flexible core optimized for sparse linear algebra computations (though we did that too).

Today’s state-of-the-art models take days or weeks to train. Organizations often need to distribute training across tens, hundreds, even thousands of GPUs to make training times more tractable. These huge clusters of legacy, general-purpose processors are hard to program and bottlenecked by communication and synchronization overheads.

Rather than build a slightly smaller cluster of slightly faster small devices, Cerebras' wafer-scale innovation brings the AI compute and memory resources of an entire cluster onto a single device, making orders-of-magnitude faster training and lower-latency inference easy to use and simple to deploy.

  • AI acceleration with no communication bottlenecks
    With 850,000 cores on one chip, CS-2 delivers cluster-scale speedup without the communication slowdowns that come from parallelizing work across a massive cluster of devices.
  • Accessible performance for every organization
    One chip in one system means no distributed training or parallel computing experience needed. CS-2 makes massive-scale acceleration easy to program for.
  • Real-time inference latencies for large models
    Keeping compute and memory on chip means extremely low latencies. On CS-2, you can deploy large inference models within a real-time latency budget without quantizing, downsizing, or sacrificing accuracy.

Chip Technology

Cerebras Wafer Scale Engine 2

With over 50x the silicon area of the largest GPU, the WSE-2 provides world-leading AI compute density and memory for efficient data access.

On the WSE-2, data movement between cores and memory happens entirely on-silicon, resulting in the highest-bandwidth, lowest-latency communication fabric in a datacenter AI solution.

850,000 AI-optimized Cores

Cluster-scale speedup on a single chip

Each core is independently programmable and optimized for the computations that underpin neural networks, enabling it to deliver maximum speed and flexibility.

Now pack 850,000 of these cores into one device, and any data scientist can run state-of-the-art models and explore innovative algorithmic techniques at record speed and scale, without ever touching distributed scaling complexities.

High bandwidth, ultra low latency fabric

No communication overheads

Keeping all cores on silicon eliminates the inefficiencies of connecting hundreds of small devices via slow wires and cables.

Our Swarm communication fabric connects all cores in a 2D on-chip mesh, delivering an unprecedented interconnect bandwidth of 220 Pb/s, 45,000x more than the leading GPU.

This means no synchronization headaches, no communication overheads, and sub-millisecond latencies for real-time workloads, all at a fraction of the power draw of traditional GPU clusters.

Massive on-chip memory

Larger models and data on one device

Unlike traditional devices, in which the working cache memory is tiny, the WSE-2 takes 40 GB of super-fast on-chip SRAM and spreads it evenly across the entire surface of the chip.

This gives every core single-clock-cycle access to fast memory at extremely high bandwidth of 20 PB/s. That is 1,000x more capacity and 12,800x greater bandwidth than the leading GPU.

Fit industry-leading models entirely on a single chip. Maximum training acceleration. Inference within real-time latency budgets.
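
As a rough illustration of what fitting a model entirely on a single chip means, here is a minimal back-of-the-envelope sketch that checks whether a model's weights fit in 40 GB of on-chip SRAM. The model names and sizes are generic examples rather than a statement about supported models, it assumes FP16 weights, and it counts only the weights, not activations or optimizer state.

```python
# Back-of-the-envelope check: does a model's parameter memory fit in
# the WSE-2's 40 GB of on-chip SRAM? Model names and sizes below are
# illustrative examples only; activations and optimizer state are ignored.

WSE2_SRAM_BYTES = 40 * 1024**3   # 40 GB of on-chip SRAM
BYTES_PER_PARAM = 2              # assuming FP16 weights

models = {
    "BERT-Large": 340e6,         # ~340 M parameters (example)
    "GPT-2 XL":   1.5e9,         # ~1.5 B parameters (example)
}

for name, params in models.items():
    weight_bytes = params * BYTES_PER_PARAM
    fits = weight_bytes <= WSE2_SRAM_BYTES
    print(f"{name}: {weight_bytes / 1024**3:.2f} GB of weights "
          f"-> {'fits' if fits else 'does not fit'} in 40 GB of SRAM")
```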

The New CS-2 System

Revolutionary system design

You can’t achieve revolutionary performance gains if you’re limited by standard chip packaging.

Every detail of the CS-2 system — from power and data delivery to cooling to packaging — has been carefully engineered to drive the colossal WSE-2.

This means no compromises from us and no large-scale cluster deployment complexity for you.

CS-2 Technology

Custom power and cooling

To solve the 70-year-old problem of wafer-scale integration, we needed not only to yield a big chip, but also to invent new mechanisms for powering, packaging, and cooling it.

The traditional method of powering a chip from its edges cannot deliver power uniformly across a chip this large: too much is lost to resistance before it reaches the center. To prevent this, CS-2's innovative design delivers power perpendicular to the wafer, directly to each core.

To uniformly cool the entire wafer, pumps inside CS-2 move water across the back of the WSE-2, then into a heat exchanger where the internal water is cooled by either cold datacenter water or air.

CS-2 Benefits

Faster insights at lower cost and in less space

At 15 RU and a maximum system power of 23 kW, the CS-2 packs the performance of a room full of servers into a single unit the size of a dorm-room mini-fridge.

With cluster-scale compute available in a single device, you can push your research further – at a fraction of the cost.

CS-2 Benefits

Easy to deploy and integrate

CS-2 connects to your surrounding infrastructure via 12x standard 100 Gigabit Ethernet links and converts standard TCP/IP traffic into the Cerebras protocol to feed the WSE-2's 850,000 cores.

Simply plug CS-2 into power and a 100 Gb Ethernet switch, and you’re ready to start accelerating your AI workloads at wafer-scale speed.
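
As a quick sanity check on the numbers above, the sketch below derives the 1.2 Tb/s aggregate system I/O from the 12x 100 Gigabit Ethernet links and estimates how long streaming an example dataset would take at line rate. The dataset size is arbitrary, and the estimate ignores protocol overhead and any input-pipeline bottlenecks.

```python
# Aggregate ingest bandwidth from the CS-2's 12x 100 Gigabit Ethernet links,
# plus a rough streaming-time estimate for an example dataset.
# The dataset size is an arbitrary illustration; protocol overhead and
# input-pipeline bottlenecks are ignored.

NUM_LINKS = 12
LINK_GBPS = 100                          # Gigabits per second per link

aggregate_gbps = NUM_LINKS * LINK_GBPS   # 1,200 Gb/s = 1.2 Tb/s system I/O
aggregate_gbytes_per_s = aggregate_gbps / 8

dataset_gb = 10_000                      # e.g. a 10 TB training dataset
seconds = dataset_gb / aggregate_gbytes_per_s

print(f"Aggregate I/O: {aggregate_gbps / 1000:.1f} Tb/s "
      f"({aggregate_gbytes_per_s:.0f} GB/s)")
print(f"Streaming a {dataset_gb / 1000:.0f} TB dataset takes roughly "
      f"{seconds:.0f} seconds at line rate")
```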

Software Platform

Software that integrates seamlessly with your workflows

The Cerebras software platform integrates with popular machine learning frameworks like TensorFlow and PyTorch, so researchers can use familiar tools and effortlessly bring their models to the CS-2.
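
For illustration, the snippet below is the kind of standard PyTorch model the platform is designed to accept unchanged. Nothing in it is Cerebras-specific, and the compile-and-run step that maps it onto the CS-2 is deliberately not shown.

```python
# A standard PyTorch model of the sort the Cerebras software platform is
# designed to accept. Nothing here is Cerebras-specific: the point is that
# the model is written with the familiar framework API, and the platform's
# compiler (not shown) handles mapping it to the CS-2.

import torch
import torch.nn as nn


class SimpleClassifier(nn.Module):
    """A small feed-forward network written in plain PyTorch."""

    def __init__(self, in_features: int = 784, hidden: int = 256, classes: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


if __name__ == "__main__":
    model = SimpleClassifier()
    dummy_batch = torch.randn(32, 784)
    logits = model(dummy_batch)
    print(logits.shape)  # torch.Size([32, 10])
```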

A programmable low-level interface allows researchers to extend the platform and develop custom kernels – empowering them to push the limits of ML innovation.

Graph Compiler

Cerebras Graph Compiler drives full hardware utilization

The Cerebras Graph Compiler (CGC) automatically translates your neural network to a CS-2 executable.

Every stage of CGC is designed to maximize WSE-2 utilization. Kernels are intelligently sized so that more cores are allocated to more complex work. CGC then generates a placement and routing, unique for each neural network, to minimize communication latency between adjacent layers.
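
The compiler's actual sizing and placement algorithms are not described here; the sketch below only illustrates the underlying idea of allocating more cores to more compute-heavy layers, using an invented per-layer cost breakdown.

```python
# Illustration only: allocate a fixed budget of cores across layers in
# proportion to each layer's relative compute cost, which is the intuition
# behind "more cores for more complex work". This is not the Cerebras
# Graph Compiler's actual algorithm; the layer costs below are invented.

TOTAL_CORES = 850_000

# Hypothetical relative compute cost (e.g. FLOPs) per layer.
layer_flops = {
    "embedding":   1.0,
    "attention_1": 6.0,
    "ffn_1":      12.0,
    "attention_2": 6.0,
    "ffn_2":      12.0,
    "classifier":  3.0,
}

total_flops = sum(layer_flops.values())
allocation = {
    name: round(TOTAL_CORES * flops / total_flops)
    for name, flops in layer_flops.items()
}

for name, cores in allocation.items():
    print(f"{name:>12}: {cores:>7,} cores")
print(f"{'total':>12}: {sum(allocation.values()):>7,} cores")
```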

Software Tools

Designed for flexibility and extensibility

The Cerebras software platform includes an extensive library of primitives for standard deep learning computations, as well as a familiar C-like interface for developing custom software kernels.

A complete suite of debug and profiling tools allows researchers to optimize the platform for their work.

CS-2 Cluster

Datacenter-scale AI processing with a CS-2 cluster

Multiple CS-2 systems can be clustered together for even greater scale and performance, with easier deployment, lower engineering cost, and more flexibility than an equivalent cluster of traditional devices.

  • Higher performance with lower complexity
    Even the most ambitious, extreme-scale deep learning applications require far fewer CS-2 systems to achieve the same effective compute as large-scale clusters of traditional small devices. This means faster deployments and less complexity.
  • Easier and more flexible research
    Scaling to fewer, more powerful nodes takes less time and energy than massive-scale deployment of small devices. Smaller clusters incur much lower synchronization overheads, so researchers don't have to use extreme batch sizes and brittle, over-tuned model configurations.