Wafer-Scale Processors: The Time Has Come

What is wafer-scale integration?

Let’s begin with the concepts of wafers and integration — the basics of chip making.

Silicon chips are made in chip fabrication facilities, or fabs, owned by Intel, Taiwan Semiconductor (TSMC), Samsung, and a few other companies. A chip fab is a sort of printing press. The electronic circuits of a processor or memory or any other computer chip are printed onto a thin circular disk of silicon. The disk is called a wafer, and it plays the role of the paper in this printing process. (Fabs are uber-fancy printing presses costing billions, using photolithography, chemical deposition and etching to do the printing, in super-clean rooms run by employees who have to wear bunny suits. The kind without floppy ears.)

Engineer in a bunny suit, with the Cerebras Wafer Scale Engine (WSE)

A wafer is a circle, today about 12 inches in diameter. A chip is typically a square no more than an inch on a side — much smaller than a wafer. Many copies of the same chip are printed onto the wafer. They are arranged into a grid with spaces called scribe lines between them. (In fact, the printing involves a step and repeat process that uses light to chemically change the silicon surface, one chip at a time.) The wafer is then cut along those scribe lines into individual chips. The chips are tested, and after any defective ones are discarded, the rest can be packaged and sold.

Wafer-scale integration is the idea that you make a single chip out of the whole wafer. You skip the step above concerning cutting the wafer up — with one chip there is nothing to cut.

Chip testing is necessary because both the silicon wafer and the printing process are not quite perfect, and there are flaws that can cause partial or total chip failures. These flaws occur at random places on the wafer, a few per wafer. And the flaws are very small.

If we cover the wafer with many small, identical chips, then only a small fraction will be affected by flaws, and we achieve a high yield of working chips. The larger the chip, the greater the fraction of chips that contain at least one flaw. Yield therefore goes down for larger chips.
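One common first-order way to put numbers on this argument is the Poisson yield model, in which a chip of area A on a wafer with defect density D is flaw-free with probability exp(-D·A). The sketch below is illustrative only: the model is a textbook simplification, and the figure of five flaws per wafer is an assumption standing in for "a few per wafer", not a number from any fab.

```python
import math

WAFER_DIAMETER_MM = 300.0   # "about 12 inches"
FLAWS_PER_WAFER = 5.0       # assumption standing in for "a few per wafer"

WAFER_AREA_MM2 = math.pi * (WAFER_DIAMETER_MM / 2) ** 2
DEFECT_DENSITY = FLAWS_PER_WAFER / WAFER_AREA_MM2  # flaws per mm^2

def poisson_yield(area_mm2: float) -> float:
    """Probability that a chip of this area contains no flaw at all."""
    return math.exp(-DEFECT_DENSITY * area_mm2)

# Small chips, a large conventional chip, and one chip covering the whole wafer.
for area in (100.0, 400.0, 800.0, WAFER_AREA_MM2):
    print(f"{area:9.0f} mm^2 -> yield {poisson_yield(area):.3f}")
```

Under these assumptions the yield falls steadily as area grows, and a chip the size of the whole wafer is almost never flaw-free.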

And if we cover the whole wafer with one super-large chip then it will certainly contain flaws. Why would we do that? Until now, there was no good answer.

What is the advantage of wafer-scale integration?

The reason to think about wafer-scale, and making it work, is that it is a way around the chief barrier to higher computer performance: off-chip communication.

Today’s processor chips are enormously good at computation. Once the data they need to process are loaded onto the chip, the work gets done quickly, and requires relatively little power. This is because silicon devices are extraordinarily efficient.

But a processor chip normally cannot work alone. It doesn’t have enough internal computer memory to hold all the data that it needs to work on. And with computational jobs getting larger and larger with each passing day, a single processor does not have enough processing power to perform them in entirety in the desired time. So systems are built from groups of chips. Multiple processor chips to boost the processing power, and other chips, most importantly memory chips, to boost the memory capacity. Now if only these groups of processors could talk to each other fast enough….

Data move in and out of processor chips. They move in and out of the associated memory chips. They move between processor chips when, say, the result of a computation on one chip is required to carry out another computation on a different chip.

High Performance Computing Center Stuttgart (HLRS) of the University of Stuttgart

Data move between chips as electrical signals sent along plain old metal wires. It’s a technology as old as Morse and the telegraph. To send a single bit, let’s say with the value 1, the voltage is raised to a reference value at one end of the wire; electrons flow into the wire; the voltage rises at the other end; the voltage there is sensed, and the value is stored by the receiving chip. On-chip data movement also happens through wires. The difference between on-chip and off-chip communication is that the off-chip wires are longer and thicker.

Now the bad news: the time it takes to send the bit and the energy it takes (proportional to the number of electrons that flow into the wire) depend on the length and thickness of the wire. Technically it is something called the wire’s capacitance that matters. And long, thick, off-chip wires have a lot more capacitance than do on-chip wires. So the bottom line is that chip-to-chip data movement is much slower and much less energy efficient than on-chip data movement.

How much slower? When a processor needs to get a value, say a number, from its on-chip memory, it takes only a few nanoseconds for the data to arrive in the processor’s arithmetic unit. When it has to move the data in from off-chip memory, it can take hundreds of nanoseconds. There is a comparable energy difference as well.
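The role of capacitance can be made concrete with two standard circuit relations: the charge needed to raise a wire of capacitance C to voltage V is Q = C·V, and charging it from a fixed supply draws roughly C·V² of energy. The sketch below applies them. The capacitance and voltage values are placeholders chosen only to show the scaling, not measurements of any real wire.

```python
ELECTRON_CHARGE_C = 1.602176634e-19  # coulombs per electron (SI definition)

def electrons_needed(cap_farads: float, volts: float) -> float:
    """Number of electrons that must flow to charge the wire (Q = C * V)."""
    return cap_farads * volts / ELECTRON_CHARGE_C

def supply_energy_j(cap_farads: float, volts: float) -> float:
    """Energy drawn from a fixed supply to charge the wire (about C * V^2)."""
    return cap_farads * volts ** 2

# Placeholder values, assumed purely for illustration.
ON_CHIP_CAP_F = 0.1e-12    # hypothetical short on-chip wire
OFF_CHIP_CAP_F = 10e-12    # hypothetical long, thick off-chip wire
SWING_V = 1.0              # hypothetical signal voltage

for name, cap in (("on-chip", ON_CHIP_CAP_F), ("off-chip", OFF_CHIP_CAP_F)):
    print(f"{name:8s}: {electrons_needed(cap, SWING_V):.2e} electrons, "
          f"{supply_energy_j(cap, SWING_V):.2e} J per bit")
```

Whatever the true values are, both the electron count and the energy scale linearly with capacitance, which is why the longer, thicker wire costs more per bit.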

And there is a third problem, on top of the time and energy costs of accessing off-chip data described above. There is also a bottleneck at the chip boundary, again due to those thick off-chip wires. On a conventional chip, the bottom of the chip package is covered with tiny connection points for wires. Most are used to provide power, but many, perhaps a thousand, are for moving data. And that number of connections often isn’t enough to move all the data in a timely way. (This is a choke point, in the same way freeway onramps become congested at rush hour as everyone tries to use the same onramp lanes at the same time.) In contrast, on-chip wires can be packed much more tightly, and we can afford way more of them. In fact, there are around 20,000 on-chip wires connecting to each of the un-diced “chips” on the Cerebras Wafer Scale Engine (WSE), with more than 80 percent of them dedicated to data rather than control. This is more than a tenfold improvement over the communication bandwidth that is possible with separate, diced, packaged chips.
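The tenfold figure can be sanity-checked from the wire counts quoted above. The sketch assumes, as a simplification not stated in the text, that each wire carries data at a comparable rate, so that wire count is a fair proxy for bandwidth.

```python
ON_WAFER_WIRES_PER_DIE = 20_000  # from the text: "around 20,000"
DATA_PERCENT = 80                # from the text: "more than 80 percent"
OFF_CHIP_DATA_PINS = 1_000       # from the text: "perhaps a thousand"

# Integer arithmetic keeps the wire count exact.
data_wires = ON_WAFER_WIRES_PER_DIE * DATA_PERCENT // 100
ratio = data_wires / OFF_CHIP_DATA_PINS

print(f"data wires per die: {data_wires}")
print(f"ratio to packaged-chip data pins: {ratio:.0f}x")
```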

We now can understand the advantage of using the whole wafer. A wafer has over 50 times the area of a conventional chip. More room for compute cores, more room for fast, local memory, and lots of room for efficient on-chip communication between the cores.

Why hasn’t it been tried before?

It has been. In the 1970s and 1980s there were attempts, but they failed. Texas Instruments and ITT tried to develop wafer-scale processors. Trilogy raised over $200 million to build wafer-scale, high-performance systems. Even though the wafer of that era was only about 90 mm across, the technical challenges of making a wafer, of powering, cooling, and packaging it, and of overcoming the flaws proved to be beyond what was then possible. As we’ll see below, things are different today.

Why is wafer-scale needed now?

The gap between compute demand and compute supply is growing, and growing rapidly. Artificial intelligence is a primary reason for the demand. We are seeing a revolution in what’s possible due to the development of deep artificial neural networks. They talk, they listen, they translate English into Urdu, they play games, they drive, they read X-rays and CT scans, and the boundaries are being pushed back every day. But before they are useful, they have to be trained. And training demands enormous compute power. It can take weeks to months to train the latest networks. And the compute demand for training them is doubling every 3.5 months.

At the same time, improvements to the chip-making technology that has driven the modern revolution in computing and communication are slowing down. The reason gets back to the chip-printing process. For several decades, we have been able to print more and more circuit devices, leading to more and more powerful processors and to larger and larger memories, on that same one-inch silicon chip. Why? Because they are made of transistors and wires, and our printing process kept making the transistors and the wires smaller and smaller. Every 18 months we got twice as many transistors and twice as many wires in the same silicon area, a phenomenon first noted by Intel’s Gordon Moore. This repeated doubling is called Moore’s Law. Twice the transistors on the same chip area might have made for hotter and hotter chips; but that didn’t happen. It didn’t happen because we could use lower and lower voltages to drive the smaller wires and transistors. This phenomenon, known as Dennard scaling, let us pack more and more transistors, and switch them faster, without increasing the power per unit silicon area. Every few years we doubled the transistors per unit area of silicon without increasing power — for both chip vendors and customers it was summertime, and the livin’ was easy.
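Converting the two doubling periods quoted above (3.5 months for training demand, 18 months for transistor density) into growth per year shows how quickly the gap opens. The sketch is plain arithmetic on those two figures and assumes both trends are smooth exponentials.

```python
def annual_growth(doubling_months: float) -> float:
    """Growth factor over 12 months for a quantity that doubles every N months."""
    return 2 ** (12 / doubling_months)

demand = annual_growth(3.5)   # training compute demand
density = annual_growth(18)   # transistors per unit area

print(f"demand grows about {demand:.1f}x per year")
print(f"transistor density grows about {density:.2f}x per year")
print(f"the gap widens about {demand / density:.1f}x per year")
```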

All good things come to an end. For the last ten years, the Dennard scaling “free lunch” has gone away. We haven’t been able to increase the speed of circuits. The clocks of processors no longer get faster with each generation. We did, thanks to continued Moore’s Law improvement in transistor density, squeeze more and more cores onto chips. But that may change fairly soon too. The size of wires and transistors is now down to the 50 angstrom range, and the unit-cell of the silicon crystal lattice is 5 angstroms on a side. (An angstrom is one ten-billionth of a meter; a human hair is half a million angstroms thick.) There isn’t much room left to shrink. As Moore himself said, there’s no getting around the fact that we make these things out of atoms, and that may mean the end of Moore’s Law. Another reason is the growing cost of new fabs. Moore’s Law is likely to die an economic death before it hits the ultimate physical limits.

Expected design costs through 5nm. Source: IBS via ExtremeTech

So we’re at a crossroads. AI needs much more compute than is available from the chips of today, and we cannot expect the industry trends of the past few decades to provide that needed extra compute. With Dennard scaling and Moore’s Law running out of gas, we need something new. What will be the answer? The economic value provided by ever-better AI will be a huge driver of new ways to boost performance. Fortunately, as we will explain, wafer scale is here, now, and it gives a huge boost to compute. Since we won’t be putting ever more transistors onto each square millimeter of silicon, we will instead put more square millimeters of silicon onto each chip.

The Challenges of Wafer-Scale and How Cerebras Has Overcome Them

The promise of wafer-scale is clear, but the challenges thwarted attempts in the past. Can the challenges of wafer-scale integration be overcome? The answer, clearly, is yes: Cerebras has done it. At Hot Chips in August 2019, we announced our Wafer Scale Engine (WSE), which, at 1.2 trillion transistors and 46,225 mm² of silicon, is the largest chip ever built, 56 times larger than the largest GPU.

The Cerebras WSE is 56x larger than the largest GPU
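The quoted figures can be unpacked with a little arithmetic. The sketch below derives the side length the WSE would have if its outline were a perfect square (an assumption) and the GPU die area implied by the 56x ratio.

```python
import math

WSE_AREA_MM2 = 46_225   # from the text
SIZE_RATIO = 56         # from the text: "56x larger than the largest GPU"

# Side length if the chip outline is a perfect square (assumption).
side_mm = math.isqrt(WSE_AREA_MM2)

# GPU die area implied by the quoted ratio.
implied_gpu_area_mm2 = WSE_AREA_MM2 / SIZE_RATIO

print(f"square side: {side_mm} mm")
print(f"implied largest-GPU area: {implied_gpu_area_mm2:.0f} mm^2")
```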

What were the problems? First, one must invent a way to communicate across scribe lines. Second, the problem of yield must be resolved. Third, a package must be made for the wafer, one that is compatible with power delivery and cooling and that makes the wafer structurally stable and durable, despite cycles of power on and off and variable workloads. Fourth, power delivery and cooling must be resolved. Here is how we solved these problems.

Cross Scribe Line Connections

The standard fabrication process is one of step and repeat, which produces identical independent die on the wafer and leaves scribe lines between them. The scribe lines are where the wafer is cut to create separate chips. But fabs also place structures they need to test and control their fab process into the scribe line spaces between the die.

Cerebras had other ideas. We did not want software and algorithms to have to know anything about die; rather, we wanted a model of a homogeneous mesh of cores connected to their mesh neighbors at high bandwidth, regardless of whether the neighboring core is on the same die or on an adjacent die. In order to make this a reality, we knew what to put in the scribe lines. Wires. Lots and lots of wires. The Cerebras cores are tightly connected to their neighbors in a high-bandwidth mesh. We add enough wires across scribe lines to make the die and scribe lines into aspects of the physical design that have no impact on software and performance, and that can be ignored by software. Cores on adjacent die are connected with the same bandwidth as cores on the same die. This is a key development in making a true wafer-scale system with no architectural inhomogeneities.

Tens of thousands of on-silicon wires connect the die in the Cerebras Wafer-Scale Engine

The cross scribe line wiring has been developed by Cerebras in partnership with TSMC. TSMC allowed us to use the scribe lines for tens of thousands of wires. We were also allowed to create certain keep-out zones with no TSMC test structures where we could embed Cerebras technology. The short wires (inter-die spacing is less than a millimeter) enable ultra-high bandwidth with low latency. The wire pitch is also comparable to on-die, so we can run the inter-die wires at the same clock as the normal wires, with no expensive serialization/deserialization. The overheads and performance of this homogeneous communication are far more attractive than those of multi-chip systems that involve communication through package boundaries, transceivers, connectors or cables, and communication software interfaces.

Yield: Wafer-Scale Chips That Work

Compared with the 1980s, things have changed to make yield at wafer scale possible. First, in mature silicon fabs today, defects are fewer than they were in the earlier days of large-scale chip integration. Second, the technique of routing around bad cells in an array of small identical cells, extended with spares, is now standard practice for memory, be it DRAM or on-chip SRAM. Cerebras has taken that approach to processing, building a large array of small, identical processing cells, with provision of spares, just as in memory. We’ll explain this a bit more.

As we mentioned, chip fabrication naturally creates an array of identical designs across the surface of the wafer due to the step and repeat of the photolithography. There are around 100 of these “die” per wafer. So we could, conceivably, make each of these die into a single enormous core with huge internal structures like caches, register files, and arithmetic units. Or a small number of somewhat smaller cores. But then, the chip flaws would likely kill a large fraction of the cores. Instead, noting that the AI workload can make use of massive parallelism, we choose a design with a large array of well-interconnected, small cores. Flaws will make a few of them unusable; these will be a very small fraction of the total number of cores. Moreover, we can place extra cores on each die so as to have spares. In that way, we can always deliver systems with identical logical structure, say a mesh-connected array of a certain fixed shape.
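How many spares are enough can be estimated with a simple binomial model: if each core fails independently with probability p, the array can be delivered at full logical size whenever the number of bad cores does not exceed the number of spares. The sketch below is illustrative only. The core count, spare count, and failure probability are hypothetical, and real flaws need not be independent.

```python
import math

def prob_enough_spares(physical: int, spares: int, p_bad: float) -> float:
    """P(bad cores <= spares), assuming independent per-core failures."""
    return sum(
        math.comb(physical, k) * p_bad ** k * (1 - p_bad) ** (physical - k)
        for k in range(spares + 1)
    )

# Hypothetical die: 1,000 physical cores, each bad with probability 0.3%.
PHYSICAL_CORES = 1_000
P_BAD = 0.003

for spares in (0, 3, 10):
    ok = prob_enough_spares(PHYSICAL_CORES, spares, P_BAD)
    print(f"{spares:2d} spares -> P(full logical array) = {ok:.4f}")
```

With no spares, any single bad core spoils the die; in this model a handful of spares is enough to make a full-size logical array very likely.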

What we’re doing borrows the strategy used for memory chips. Flaws are common in the manufacture of dynamic random access memory (DRAM). The bits of a DRAM are arranged as an array of cells and some of the cells don’t work. So the DRAM is designed with a larger physical array, and when a small number of cells is bad, then the rows and columns of the array containing bad cells are simply left unused. A fully functional working array of the needed size normally remains, and the chip works as advertised.

The Cerebras WSE employs a variant of this strategy. The nature of a memory array makes it hard to map out a single cell, but easier to eliminate a row or column of the array of cells. In the Cerebras design, however, the interconnections between cores make it feasible to map out individual bad cores. The bad cores are identified at manufacturing time, and the interconnect between processors is configured to avoid them, using instead extra processors that are part of the physical design in order to present a defect-free array of the designed size in the shipped product.

WSE’s uniform small core architecture enables redundancy to address yield at very low cost
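The idea of mapping out individual bad cores can be illustrated with a toy remapping routine. This is a hypothetical sketch of the general technique, not the Cerebras mechanism: it treats one row of physical cores, some marked bad at test time, and assigns each logical position to the next working physical core. The function name and data layout are invented for the example.

```python
from typing import Optional

def map_row(good: list[bool], logical_width: int) -> Optional[list[int]]:
    """Map logical columns 0..logical_width-1 onto working physical columns.

    `good[i]` says whether physical core i passed manufacturing test.
    Returns None when the row has too few working cores.
    """
    working = [col for col, ok in enumerate(good) if ok]
    if len(working) < logical_width:
        return None
    return working[:logical_width]

# Six physical cores (five needed plus one spare); physical core 2 is bad.
row = [True, True, False, True, True, True]
print(map_row(row, logical_width=5))  # [0, 1, 3, 4, 5]
```

Software addressing logical core 2 would reach physical core 3 without knowing a defect was skipped, which is the sense in which the shipped array looks defect-free.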

Packaging and Assembly

In assembly, the fact that materials respond differently to heat presents a fundamental issue. The wafer-scale chip is mounted on a printed circuit board (PCB). Silicon and PCB materials expand at different rates under changes in temperature. Thus, things that are aligned when the system is cool get slightly displaced when it heats up. The largest displacement occurs at the edges of the connection between the silicon chip and the main PCB. In a smaller chip, this displacement is small enough that the chip-to-PCB connections (wires) can flex slightly and still work. But at the size of the wafer, the differences in expansion between the two materials would stress these connections enough to break some, if traditional packaging techniques were used.
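The scale of the mismatch follows from the linear expansion relation ΔL = α·L·ΔT applied to both materials. The sketch below compares a point at the edge of a one-inch chip with a point at the edge of a wafer-scale part roughly 215 mm on a side (the square root of the 46,225 mm² quoted earlier, assuming a square outline). The expansion coefficients and the temperature swing are assumed typical values, not Cerebras data.

```python
# Assumed typical coefficients of thermal expansion, per kelvin.
CTE_SILICON = 2.6e-6
CTE_PCB = 14e-6          # in-plane value for a common laminate (assumption)
TEMP_SWING_K = 50.0      # hypothetical cool-to-hot temperature change

def mismatch_um(distance_from_center_mm: float) -> float:
    """Relative displacement between PCB and silicon, in micrometers."""
    delta_mm = (CTE_PCB - CTE_SILICON) * distance_from_center_mm * TEMP_SWING_K
    return delta_mm * 1000.0

SMALL_CHIP_EDGE_MM = 12.7    # half of a one-inch chip
WAFER_PART_EDGE_MM = 107.5   # half of about 215 mm

print(f"one-inch chip edge: {mismatch_um(SMALL_CHIP_EDGE_MM):.1f} um")
print(f"wafer-scale edge:   {mismatch_um(WAFER_PART_EDGE_MM):.1f} um")
```

Because the displacement grows in proportion to distance from the center, the edge of the larger part moves several times farther than the edge of the small chip for the same temperature change.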

With all traditional attachment techniques foreclosed due to thermal mismatch, Cerebras invented a new material and designed a connector from it. This custom connector mates the wafer to the main PCB, while absorbing the thermal displacement without breaking any electrical connections.

This sandwich of wafer, connector, and main PCB must be packaged with a fourth component, a cold plate that maintains the wafer temperature at a level comfortable for the electronics, despite an overall power delivery in the mid-teen kilowatts range. There is no existing package that can maintain thermal and electrical contact and tolerate variable expansion in three dimensions, for a system of this size. And there is no packaging machinery that can assemble one. To build the package, the four components have to be fitted together to achieve precise alignment and then held in place with techniques that maintain that alignment through multiple power and thermal cycles. Cerebras invented custom machinery along with tools, fixtures, and process software that make this all possible.

The Cerebras sandwich: Cold plate, wafer, connector compatible with thermal expansion, on PCB

Power and Cooling

Our wafer does not draw more power per unit area than ordinary chips; actually, it is a little cooler. But it is much, much bigger, so the total power draw is beyond that of a standard chip. And it is a challenge to bring in that much power, from voltage regulators, through the PCB, and into the face of the silicon wafer.

A first innovation concerns the physical placement of the power supplies. They are mounted on the PCB, so that the power flows in the through-PCB direction, rather than laterally across the PCB from the side, which is conventional. The 3D arrangement makes for a shorter, low resistance path, which curtails resistive heating and power loss in the power supply and distribution network. And there simply isn’t enough room and enough copper in the PCB to handle longer, lateral paths anyway.
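Why path resistance matters so much follows from P_loss = I²·R. At a low supply voltage, kilowatts of power imply a very large current, so even micro-ohms of resistance turn into substantial heat. In the sketch below, the 15 kW figure stands in for the "mid-teen kilowatts" mentioned earlier; the supply voltage and both resistance values are hypothetical.

```python
def resistive_loss_w(power_w: float, supply_v: float, path_ohms: float) -> float:
    """Heat dissipated in the delivery path: I^2 * R with I = P / V."""
    current_a = power_w / supply_v
    return current_a ** 2 * path_ohms

POWER_W = 15_000.0   # stands in for "mid-teen kilowatts"
SUPPLY_V = 0.8       # hypothetical on-wafer supply voltage

# Hypothetical path resistances: short vertical path vs. longer lateral path.
for label, ohms in (("through-PCB", 1e-6), ("lateral", 10e-6)):
    loss = resistive_loss_w(POWER_W, SUPPLY_V, ohms)
    print(f"{label:12s}: {loss:8.1f} W lost in the delivery path")
```

Loss scales linearly with resistance, so under these assumptions a tenfold shorter path saves a tenfold share of wasted power.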

The heat generated by the electronics has to be removed by a system that reliably maintains the wafer at a fairly cool temperature. Traditional air flow over a heat sink cannot handle this heat load in the confined space of a wafer. Water is far more effective than air at carrying away a high flux of heat, and so we extract heat from the surface of the wafer by holding it in tight contact with a cold plate that is itself cooled by the flow of water through a manifold. With both power and heat flowing through the wafer surfaces, there is no significant difference in temperature across the surface of the wafer, and all parts of it are effectively and uniformly cooled.

Current flows in from the bottom of the sandwich, heat flows out the top

Conclusion

Training neural networks has created computational demand beyond the reach of traditional chips like graphics processors. The industry is now building clusters of thousands of expensive graphics processors, consuming megawatts of power, taking months to deploy, and we still can’t meet the demand.

With wafer-scale chips we can, and we have.

At Cerebras Systems we had a vision to make wafer-scale work. It took gumption to build a new, more powerful chip technology. It took fearless engineering to solve problems never before solved. We convinced the leading technologists and venture capitalists of Silicon Valley to take a chance on our vision. We invented solutions for each of the key barriers to wafer-scale. In some cases, we extended existing ideas and techniques and put them to work in a new context, and in other areas solutions required invention from whole cloth. Now, we have delivered our Wafer Scale Engine, the largest chip ever built. The Wafer Scale Engine is 56 times larger than any previous chip. It contains 3,000 times more on-chip memory, 10,000 times more memory bandwidth, and 33,000 times more on-chip communication bandwidth. More cores, more memory, and more low-latency bandwidth between cores, all made possible by wafer scale.

The advantages of wafer-scale chips have long been known. What we did not know was whether they could be made to work. But now we do know: Wafer Scale is real. And the Cerebras Wafer Scale Engine is ready to transform AI compute.