Scaling Up and Out: Training Massive Models on Cerebras Systems using Weight Streaming

Michael James, Chief Architect, Advanced Technologies, and Co-Founder | September 14, 2021

The Cerebras Wafer-Scale Engine is the world’s largest and most powerful computer processor. Designed to solve the hardest problems in artificial intelligence, it performs massive number crunching in a communication intensive environment. The AI race taking place between leading technology companies is producing larger and more intricate models at an astounding pace. We’ll look at how the Wafer-Scale Engine trains models much larger than even itself—approaching the scale of a human brain.

(Recommended further reading: Weight Streaming whitepaper)

Wafer-Scale Engine
The Wafer-Scale Engine (or simply the “Engine”) is at the center of the show. Its job is to provide a computation environment without bottlenecks. It is a single silicon substrate that has three physical planes: an arithmetic plane, a memory plane, and a communication plane.

The arithmetic plane has 850,000 independently programmable cores and a total of 3.4 million floating point units.
The cores are directly coupled to a memory plane: 40 GB of on-chip “cache” memory, enough to hold an entire BERT_LARGE model. It provides a remarkable 20 PB/s of bandwidth on random access patterns.
A communication plane interconnects the cores in a Cartesian mesh with 28 PB/s of homogeneous bidirectional bandwidth. Unicast and multicast messages are hardware primitives. Each message has 38 bits and is sent by hardware with guaranteed in-order delivery and end-to-end flow control.

The bottleneck-free design eliminates memory walls that plague other computer architectures.

Modern AI is based on the mathematics of large sparse tensors. Huge bandwidth to random access memory is the key to accelerating sparse-tensor math. Accordingly, the Engine can train neural networks with 90% unstructured weight sparsity (i.e., with weights equal to zero arbitrarily scattered across the network) without ever spending time executing a multiply by zero. This 10x acceleration transforms the Engine’s raw throughput of 5.8 PFLOP/s into a peak training performance of 58 PFLOP/s.

Streaming Weights

For models up to about a billion weights, we can hold the entire network in the Engine’s cache memory. This approach works brilliantly. But what if the network is bigger than that? Over the past three years, the size of the largest AI models has increased by three orders of magnitude, with the largest models now using 1 trillion parameters. This exponential growth will continue as researchers race to develop “true” AI, also known as artificial general intelligence.

Cerebras makes training these massive models practical. Our new “weight streaming” execution mode gives a seamless scaling path from BERT_LARGE through GPT-3 and all the way up to models with more than 100 trillion parameters, similar to the number of synapses in the human brain!

In a biological brain, information flows from neuron to neuron across synapses. In an artificial neural network, the state of each neuron is called an “activation”, and “weights” record the connection strength of synapses. The human cerebral cortex has about 16 billion neurons and 125 trillion synapses. The Engine’s cache memory holds up to 20 billion activations, and we will show how to train at scales of 100 trillion weights. There is still much work to be done as we learn how to harness the potential of models at this scale. This will require lots of experimentation, trial-and-error, and more fundamental insights. These are the types of challenges that inspired us to create an engine to power our technological future.

Figure 1 Weight streaming execution mode system diagram

The Engine holds complete layers of activations – enormous layers – directly on chip. This avoids a classic inefficiency of “blocking” a matrix multiplication. A blocked matrix multiply is designed for computers with a small cache and hierarchically larger and slower memory tiers beyond the cache. A blocked multiplication must repeatedly scan a matrix, reloading blocks from a next tier of memory many times. The blocking technique is useful in low bandwidth situations, but it is also inefficient since the same blocks are repeatedly loaded.

A human-brain-scale model will employ more than a hundred trillion parameters, clearly beyond any on-chip memory capacity, even that of Cerebras’ behemoth Engine. Common optimizers such as Adam require four scalar values per parameter. Stored in single precision (4 bytes per value), 100T trainable parameters require 1.6 PB of memory capacity.
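The capacity figure above is simple arithmetic, and a short sketch makes it easy to repeat for other model sizes. The function name is hypothetical; the four values per parameter and 4-byte single precision come from the text, and the 175 billion figure is GPT-3's published parameter count.

```python
# Back-of-the-envelope model memory sizing, using the figures from the text.
VALUES_PER_PARAM = 4   # scalar values per parameter, as stated for Adam-style optimizers
BYTES_PER_VALUE = 4    # single precision (FP32)

def model_memory_bytes(n_params, values_per_param=VALUES_PER_PARAM,
                       bytes_per_value=BYTES_PER_VALUE):
    """Bytes of model memory needed to hold parameters plus optimizer state."""
    return n_params * values_per_param * bytes_per_value

brain_scale_pb = model_memory_bytes(100 * 10**12) / 1e15  # 100T parameters, in PB
gpt3_tb = model_memory_bytes(175 * 10**9) / 1e12          # GPT-3 (175B parameters), in TB

print(f"brain scale: {brain_scale_pb:.1f} PB")  # 1.6 PB
print(f"GPT-3:       {gpt3_tb:.1f} TB")         # 2.8 TB
```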

Because huge models require capacity beyond the Engine’s capacity, we make no attempt to store the model parameters on the Engine. Instead, we use an off-chip model memory tier to hold parameters. The optimizer itself operates directly on this model memory using co-processors. To support the training process, the model memory has a bandwidth of 4 TB/s. Updating model parameters is an elementwise operation that can be distributed across processing shards. This distribution effectively makes the communications overhead disappear.
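To see why an elementwise update shards so cleanly, here is a minimal sketch using the textbook Adam update in NumPy. It is not Cerebras's optimizer code; the function `adam_step`, the shard count, and the hyperparameters are illustrative assumptions. Each shard updates its own slice with no knowledge of the others, and the result matches the unsharded update.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Textbook Adam update; every operation is elementwise."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)   # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

rng = np.random.default_rng(0)
n, n_shards = 1_000, 4                      # illustrative sizes
w, g = rng.normal(size=n), rng.normal(size=n)
m, v = np.zeros(n), np.zeros(n)

# Reference: update all parameters in one place.
w_full, m_full, v_full = adam_step(w, g, m, v, t=1)

# Sharded: split every state array the same way and update each slice independently.
splits = [np.array_split(a, n_shards) for a in (w, g, m, v)]
parts = [adam_step(ws, gs, ms, vs, t=1) for ws, gs, ms, vs in zip(*splits)]
w_sharded = np.concatenate([p[0] for p in parts])

print(np.allclose(w_full, w_sharded))
```

Because no shard ever reads another shard's state, the only traffic is the incoming gradients and the outgoing weights.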

Figure 2 Training Timing Diagram

Messages propagate from the model memory to the Engine in a continuous stream. The middle row of Figure 2 shows that a stream of incoming weights keeps the Engine busy. It uses this stream to model neural state changes and to generate a stream of return gradients. During training, the Engine is used 100% of the time without interruption.

The off-chip memory tier is elastic, and Cerebras will provide configurations from 4 TB to 2.4 PB. If you aren’t provisioning for brain-scale training, then a smaller 8 TB configuration is adequate for models like GPT-3. The predictable layer-wise access patterns for model weights allow us to use DRAM and flash storage in a hybrid fashion to achieve both high performance and high capacity.

The training set – the data the neural network learns from – is kept on separate MemoryX shards because it has different requirements and scaling properties from the model parameters. GPT-3 used 300 billion words, only twice the model parameter count. Text is small though. Imagine a training database with every movie ever filmed, or the vast streams of events generated by particle accelerators.

Figure 3 Animated Diagram Showing the Weight Streaming Process

Accelerating All Linear Algebra

Graphics processors can only run matrix-matrix operations at full performance. This greatly restricts the algorithms that can be explored efficiently on a graphics processor. Linear algebra operations at the heart of numerical algorithms come in three flavors—matrix-matrix (like GEMM), matrix-vector (like GEMV) and vector-vector (like DOT and AXPY). The Engine runs all these operations at full performance.

Figure 4 Massive memory bandwidth enables full performance for all linear algebra operations

Matrix-matrix multiplication is the heavyweight operation. Each row of A interacts with every column of B to create a cube of products. The cube is quashed flat by summing terms in the third dimension for the result C. Matrices, like shipping pallets, require goods to be packed in rows and columns, to be properly aligned, and all sent to the same destination. It is OK to send a matrix slowly by ship or by truck because the receiver has a lot of work to perform to generate the multiplication result. In other words, the receiver will not have finished processing one matrix before the next one arrives. The execution time masks the shipping time.

Matrix-vector multiplication requires vastly more bandwidth. While the structure of the operation is the same as matrix-multiply, the skinny vector admits no data re-use of the matrix terms. Because there is little data re-use, the computation might complete over a thousand times faster. However, it still requires a full third of the data transfer for the matrix operand. Without extra bandwidth, matrix-vector multiplication is not much faster than matrix-matrix in practice. Graphics processors encounter this bandwidth limitation, and it is why they are poor at real-time inference. Instead of using matrix-vector, GPUs wait for a batch of inference inputs, increasing latency a thousand-fold. Cerebras’ Engine has full bandwidth for matrix-vector and produces phenomenally low latency inference results.

Vector-vector operations are commandos: versatile, fast, and efficient. They can be used as building blocks to construct the matrix operations. Unconstrained by shipping pallets, they can also do much more. These operations give raw access to floating point units on unique data streams. But there’s a catch: vector-vector operations have no data reuse and require three times as much bandwidth as even matrix-vector operations. The Engine has bandwidth to sustain vector-vector operations over the activation planes in its memory. We will see that this gives us the means to take advantage of sparsity. The brain relies on sparsity to an astonishing degree: about 1,000,000x. If the brain were not sparse, its surface area would be larger than the state of Texas! Its actual surface area, if spread flat, would be just two square feet.

The Engine’s ability to efficiently leverage sparsity saturates at around 3,000x – shy of the brain’s sparsity, but still enormously ahead of other processors.
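The bandwidth comparison between the three flavors of linear algebra can be made concrete with an idealized counting model. The sketch below is an assumption-laden simplification, not a measurement: it counts each operand word as moved exactly once (perfect reuse for GEMM) and each multiply-add as two floating point operations. The function names are hypothetical.

```python
# Idealized (flops, words moved) for square problems of size n.
def gemm(n):   # C = A @ B, all n x n: read A and B, write C
    return 2 * n**3, 3 * n**2

def gemv(n):   # y = A @ x: read A and x, write y
    return 2 * n**2, n**2 + 2 * n

def axpy(n):   # y = a*x + y: read x and y, write y
    return 2 * n, 3 * n

def words_per_flop(op, n):
    """Memory traffic needed to keep one floating point operation fed."""
    flops, words = op(n)
    return words / flops

n = 4096
for op in (gemm, gemv, axpy):
    print(f"{op.__name__}: {words_per_flop(op, n):.4f} words per flop")

axpy_vs_gemv = words_per_flop(axpy, n) / words_per_flop(gemv, n)
```

Under these assumptions GEMV needs roughly half a word per flop, AXPY needs one and a half, and GEMM needs a tiny fraction of either, which is the ordering the text describes.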

Sparse GEMM

Writing matrix multiply in terms of vector-vector operations unlocks the power of sparsity. We make one AXPY call for every non-zero element of the matrix. Zero values are omitted – automatically – since they would have no effect anyway. This is how the Engine leverages its enormous memory bandwidth to train with sparse weights. Computation speeds up directly from unstructured weight sparsity. It is both fast and remarkably simple.
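A minimal NumPy sketch of this idea follows. It runs on a CPU and says nothing about how the Engine schedules the work; the function name, the `(out_feature, in_feature, weight)` tuple layout, and the matrix sizes are illustrative assumptions. One AXPY is issued per non-zero weight, and zeros are never visited.

```python
import numpy as np

def sparse_gemm_axpy(nonzeros, activations, n_out):
    """Compute W @ activations given only the non-zero entries of W.

    nonzeros:    iterable of (out_feature, in_feature, weight)
    activations: array of shape [in_features, tokens]
    """
    out = np.zeros((n_out, activations.shape[1]))
    for i, j, w in nonzeros:
        out[i] += w * activations[j]   # AXPY: y <- a*x + y over the token dimension
    return out

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))
W[rng.random(W.shape) < 0.9] = 0.0     # roughly 90% unstructured sparsity
X = rng.normal(size=(6, 5))            # [in_features, tokens]

# Stream only the non-zero weights, each tagged with its matrix location.
stream = [(i, j, W[i, j]) for i, j in zip(*np.nonzero(W))]
Y = sparse_gemm_axpy(stream, X, n_out=W.shape[0])

print(np.allclose(Y, W @ X))
```

The number of AXPY calls equals the number of non-zeros, so the work shrinks in direct proportion to sparsity.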

Figure 5 Cerebras Sparse GEMM architecture

Imagine the Engine running a sparse GEMM: Activations have feature and token dimensions. Let them start out spread thinly over the surface of the Engine. Activations are jam, spread uniformly across the Engine. Features to rows. Tokens to columns. Only non-zero weights stream from model memory through the Engine’s intake valves. Weights arrive tagged with an index field to identify their matrix location. Some cores near the intake read the index and fan-out the weights to their target row. Cores in the target row also use the index to select the appropriate activations. When an output feature in the sparse weight matrix is complete, a control message (think carriage return) arrives instead of a weight. The cores respond by moving the accumulated sum to a partial sum thread for reduction along the column. The partial sum is a ring so it can start and end on any row. Thus, the output is uniformly distributed over the Engine. Same as our starting condition. Ready for the next sparse GEMM.

The gradient pass is the same, in reverse. Broadcasts swap place with reductions. Output activations are renamed “deltas”. The original input activations are still resident from the forward pass. This time the model memory only transmits the sparsity mask via a sequence of indices. This tells the Engine which components of the gradient to compute. They are sent out via outlet valves to the optimizer which updates the state variables in model memory.
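The weight-gradient part of this pass can be sketched with the standard backpropagation identity: the gradient for weight (i, j) is the dot product of delta row i with activation row j over the token dimension. The code below is a CPU illustration under that assumption, with a hypothetical function name and made-up mask indices; it computes gradients only at the positions named by the sparsity mask.

```python
import numpy as np

def masked_weight_grads(mask_indices, deltas, activations):
    """Return (out_feature, in_feature, gradient) for each index in the mask.

    deltas:      array of shape [out_features, tokens]
    activations: array of shape [in_features, tokens], kept from the forward pass
    """
    return [(i, j, float(np.dot(deltas[i], activations[j])))  # one DOT per index
            for i, j in mask_indices]

rng = np.random.default_rng(0)
deltas = rng.normal(size=(8, 5))
activations = rng.normal(size=(6, 5))
mask_indices = [(0, 2), (3, 1), (7, 5)]   # illustrative sparsity mask

grads = masked_weight_grads(mask_indices, deltas, activations)

# Cross-check against the dense gradient, which computes every entry.
dense = deltas @ activations.T
print(all(np.isclose(g, dense[i, j]) for i, j, g in grads))
```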

Other machines are not able to run this algorithm with any performance because bandwidth bottlenecks preclude it. We don’t have bandwidth bottlenecks, so let’s look at the second order effects.

Unstructured sparsity has hot spots and cold spots. A hot spot assigned to a core causes all the other cores to sit idle until it completes, a problem known as the “tail worker effect”. Avoiding synchronization barriers mitigates this effect. Therefore, the Engine runs partial sum reduction in a separate parallel thread. Consider a GPT-3-size matrix with 600 million values. At 90% sparsity, there are still over 70 thousand non-zeros sent to each row. The law of large numbers minimizes the tail worker effect. We can do better though. Matrix multiplication is permutation invariant. The optimizer sorts matrix rows based on their non-zero count and assigns them in round-robin to physical Engine rows. Together with the law of large numbers, this ensures that even extreme power law distributions have an essentially uniform spread over the Engine.
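The sort-then-round-robin assignment can be sketched in a few lines. This is an illustration of the idea only; the function names, the Pareto-distributed non-zero counts, and the choice of 64 physical rows are assumptions made for the example.

```python
import random

def assign_rows_round_robin(nnz_per_row, n_engine_rows):
    """Sort matrix rows by non-zero count, then deal them out like cards."""
    order = sorted(range(len(nnz_per_row)),
                   key=lambda r: nnz_per_row[r], reverse=True)
    buckets = [[] for _ in range(n_engine_rows)]
    for rank, row in enumerate(order):
        buckets[rank % n_engine_rows].append(row)
    return buckets

def loads(buckets, nnz_per_row):
    """Total non-zeros each physical row has to process."""
    return [sum(nnz_per_row[r] for r in b) for b in buckets]

rnd = random.Random(0)
# Heavy-tailed non-zero counts, standing in for a power law distribution.
nnz = [int(100 * rnd.paretovariate(1.5)) for _ in range(4096)]

buckets = assign_rows_round_robin(nnz, n_engine_rows=64)
work = loads(buckets, nnz)
print("max/mean load:", max(work) / (sum(work) / len(work)))
```

A ratio close to 1 means the busiest physical row finishes at nearly the same time as the average one, so little time is lost waiting on a tail worker.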

So, what limits sparsity acceleration? The last effect to consider is Amdahl’s law. Partial sum work needs to be done regardless of the level of sparsity. The Engine reduces partial sum overhead to four machine cycles. One cycle to copy the accumulated sum to another thread; one cycle to reset the accumulator to zero; and two cycles to run a double-wide add operation on the partial sum chain.

With Amdahl’s law characterized, we can see how the Engine converts sparsity into acceleration. Transformer networks like GPT-3 follow well established scaling laws[i]. The laws relate the number of network parameters to the layer width and critical batch size. Following these laws, we see how sparse matrix-multiplication is accelerated by varying levels of sparsity for different network sizes. The plots show acceleration using a single Engine and a cluster of sixteen Engines. The more Engines that are present, the smaller the batch size must be per Engine. Less work on an Engine means that Amdahl’s law will be more pronounced there.
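A toy Amdahl's law model shows the shape of this trade-off. It is an assumption made for illustration, not the model behind the plots: each output feature costs some number of cycles of multiply-accumulate work that shrinks with sparsity, plus the fixed four-cycle partial sum overhead quoted above. The function name and the cycle counts in the loop are hypothetical.

```python
def sparse_speedup(sparsity, dense_cycles, overhead_cycles=4):
    """Speedup over dense execution when only the compute part scales with sparsity.

    dense_cycles:    cycles of multiply-accumulate work per output feature when dense
    overhead_cycles: fixed partial sum cost per output feature (four, per the text)
    """
    sparse_cycles = (1.0 - sparsity) * dense_cycles + overhead_cycles
    return (dense_cycles + overhead_cycles) / sparse_cycles

# Less work per Engine (smaller dense_cycles) means the fixed overhead bites sooner.
for dense_cycles in (100_000, 1_000, 100):
    row = [round(sparse_speedup(s, dense_cycles), 1) for s in (0.5, 0.9, 0.99)]
    print(dense_cycles, row)
```

In this model the speedup tracks 1 / (1 - sparsity) while the compute term dominates, and flattens toward (dense_cycles + 4) / 4 as sparsity approaches 100%.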

Figure 6 Converting sparsity to acceleration

Scaling Out is Easy

As you’d expect given the result above, training time can be accelerated further by employing more Engines. Cerebras’ scale-out fabric, called SwarmX, operates seamlessly. No changes are needed to the Engine, or its software. No changes are needed to the Optimizer, or its software, either. What works on one node works on a cluster without porting.

Figure 7 The SwarmX fabric enables Linear Performance Scaling

The scale-out fabric has a tree structure. When weights travel toward the Engines, they are replicated. This is a broadcast. When gradients travel toward the optimizer, they are reduced. Broadcast and reduce are duals. We built the data reduction operations into the data transport, which is an extremely efficient way to perform these tasks.
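The duality of broadcast and reduce over a tree can be sketched as follows. This is a software illustration only; the function names, fan-out, and depth are assumptions and do not describe the SwarmX hardware. Weights are replicated level by level on the way down, and gradients are summed level by level on the way back up.

```python
import numpy as np

def broadcast(weights, fanout, depth):
    """Replicate weights at every tree level; returns one copy per leaf (Engine)."""
    level = [weights]
    for _ in range(depth):
        level = [w for w in level for _ in range(fanout)]
    return level

def reduce_tree(grads, fanout):
    """Sum gradients in groups of `fanout` at every level until one remains."""
    level = list(grads)
    while len(level) > 1:
        level = [sum(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0]

rng = np.random.default_rng(0)
weights = rng.normal(size=16)

leaves = broadcast(weights, fanout=4, depth=2)      # 16 Engines
# Stand-in for each Engine computing a gradient on its own share of the batch.
grads = [rng.normal(size=16) for _ in leaves]
total = reduce_tree(grads, fanout=4)

print(len(leaves), np.allclose(total, np.sum(grads, axis=0)))
```

Because the sum happens inside the tree, the optimizer receives a single reduced gradient stream regardless of how many Engines sit at the leaves.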

Figure 8 Cerebras CS-2 cluster scaling performance for increasing model sizes

And how do we expect the cluster to perform? As Figure 8 shows, the bigger the model, the further the linear trend persists to larger cluster sizes. Note that the 10x in the legend indicates the speed up we achieve from a conservative 90% sparsity. The multiple lines indicate results for models with different aspect ratios. This data shows that it’s possible to train a model with a trillion parameters in just a few days.

GPT-3 was trained for months, using over a thousand GPUs. Let’s ask: What is possible with a thousand Cerebras Engines? The brain-scale model we have been considering is 600 times larger than GPT-3. The scaling chart shows this will complete with only a year of training time on current generation equipment. While less than the 20 years it takes to train a human brain (plus the billion years it takes to evolve a human brain), it is also clear that this is out-of-reach for most. The important point is this is now architecturally possible. When research advancements make 100x sparse training viable, the runtime shrinks to a month.

The Ultimate AI Accelerator

As research advances to the stratosphere, a trail of extraordinarily valuable applications will be created in the wake of that research. We are witnessing this happen right now. These applications will require infrastructure that can work at huge scale.

Our weight streaming architecture is a practical solution for both today’s mainstream models and the enormous models of the future. The architecture and techniques we’ve discussed above are applicable to all models, not just NLP. Our paper[ii] from SC ’20 shows the Engine is 10,000x faster than graphics processors at the core algorithm of physics-based simulation.

The Cerebras Wafer-Scale Engine is the ultimate scale-up accelerator for AI workloads. Weight streaming makes the Engine the ultimate scale-out accelerator as well. With this kind of horsepower, the possibilities are endless.


Recommended further reading:

Weight Streaming whitepaper: A deeper dive into the technology of weight streaming, including a survey of existing approaches used to scale training to clusters of compute units and an exploration of the limitations of each in the face of giant models.

Harnessing the Power of Sparsity for Large GPT AI Models: A look at how Cerebras is enabling innovation of novel sparse ML techniques to accelerate training and inference on large-scale language models.

[i] Kaplan et al, “Scaling Laws for Neural Language Models”, arxiv.org/abs/2001.08361

[ii] Rocki et al, “Fast stencil-code computation on a wafer-scale processor”, dl.acm.org/doi/10.5555/3433701.3433778