Chipmaker Cerebras is working on linking its chips – already the largest in the world – into what could be the largest compute cluster ever built for AI.
The supersized “wafer-scale cluster,” as Cerebras calls it, can link 16 CS-2s together into a single system containing 13.6 million cores for natural language processing. And the cluster could grow bigger still.
“We can support up to 192 CS-2s in a cluster,” Cerebras CEO Andrew Feldman told HPCwire.
The chipmaker made its announcement at an artificial intelligence summit, where the company is presenting a paper on the technology behind clustering at this scale. The company first previewed the technology at Hot Chips last year, but expanded on the idea at this week’s show.
Cerebras claimed that a single CS-2 system – built around a single chip the size of a wafer, with 850,000 cores – trained an AI natural language processing model with 20 billion parameters, the largest model ever trained on a single chip. Cerebras’ goal is to train bigger models in less time.
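The 13.6-million-core figure quoted for the 16-system cluster follows directly from this per-chip core count:

```python
# Simple arithmetic check: 16 CS-2s at 850,000 cores each.
cores_per_cs2 = 850_000
cluster_cores = 16 * cores_per_cs2
print(f"{cluster_cores:,}")  # 13,600,000
```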
“We ran the largest NLP networks on clusters of CS-2s. We saw linear performance scaling as we added CS-2s, which means that as you go from one to two CS-2s, training time is cut in half,” Feldman said.
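The linear scaling Feldman describes can be illustrated with a toy calculation; the baseline figure below is invented for illustration, not a Cerebras benchmark:

```python
def training_time(base_hours: float, num_cs2: int) -> float:
    """Ideal linear scaling: training time is inversely
    proportional to the number of CS-2 systems."""
    return base_hours / num_cs2

base = 96.0  # hypothetical single-system training time, in hours
for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} CS-2s -> {training_time(base, n):5.1f} hours")
```

Under this ideal model, each doubling of systems halves the wall-clock training time, which is what a claim of linear scaling amounts to.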
Larger NLP models deliver more accurate results. The largest models currently have hundreds of billions of parameters, and they keep growing: Google researchers have proposed an NLP model with 540 billion parameters, and neural models can reach 1 trillion parameters.
Each CS-2 system can support models with more than 1 trillion parameters, and Cerebras previously told HPCwire that CS-2 systems can handle models with up to 100 trillion parameters. Clusters of CS-2 systems can be combined to train even larger AI models.
Cerebras provides a fabric called SwarmX to connect CS-2 systems into a cluster. The execution model is built on a technology called “weight streaming,” which disaggregates memory, compute, and networking into separate components, making interconnection straightforward.
AI computing needs depend on model size and training speed, and disaggregation allows users to right-size the compute resources for the problems they are trying to solve. For each CS-2 system, model parameters are stored in a dedicated unit called MemoryX, which serves as the memory element of the system, while computation is performed on the 850,000 on-chip cores.
“The weight streaming execution model disaggregates compute and parameter storage. This allows compute and memory to scale separately and independently,” Feldman said.
The SwarmX interconnect is a separate system that binds a large group of CS-2 systems into a cluster. SwarmX operates at the cluster level much as MemoryX operates for a single CS-2 – it disaggregates the memory and compute elements across the cluster, and can grow the number of compute cores available to attack larger problems.
SwarmX connects MemoryX to clusters of CS-2s. “Clustered together, the systems are dead easy to configure and operate, and they produce linear scaling of performance,” Feldman said.
SwarmX takes the parameters stored in MemoryX and broadcasts them through the SwarmX fabric to multiple CS-2s. Parameters are replicated from MemoryX systems across the cluster.
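The broadcast step can be sketched in plain Python. The `MemoryX`, `CS2`, and `swarmx_broadcast` names below are stand-ins for illustration, not Cerebras APIs:

```python
class MemoryX:
    """Stand-in for the parameter store that holds model weights
    outside the compute systems."""
    def __init__(self, params):
        self.params = params

class CS2:
    """Stand-in for a compute system that receives a copy of the
    current weights over the fabric."""
    def __init__(self):
        self.weights = None

    def receive(self, params):
        # Each CS-2 keeps its own replica of the broadcast weights.
        self.weights = list(params)

def swarmx_broadcast(memoryx, systems):
    """Fan the stored parameters out to every CS-2 in the cluster."""
    for cs2 in systems:
        cs2.receive(memoryx.params)

store = MemoryX(params=[0.1, -0.3, 0.7])
cluster = [CS2() for _ in range(4)]
swarmx_broadcast(store, cluster)
```

After the broadcast, every system in the cluster holds an identical, independent copy of the parameters, which is what lets them work on different slices of the training data in parallel.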
Feldman said the system-to-system SwarmX fabric uses multiple 100-gigabit links as its transport medium, while the Swarm fabric on the chip itself is built from wires in silicon.
Cerebras is targeting CS-2 clusters at NLP models with over a billion parameters, although a single CS-2 system is enough for many problems. Cerebras says moving from one CS-2 to two in a cluster cuts training time in half, and so on as systems are added.
“Clusters deliver linear scaling of performance,” Feldman said, adding, “a cluster of 16 or 32 CS-2s can train a trillion-parameter model in less time than current GPU clusters take to train 80-billion-parameter models.”
Buying two CS-2 systems could set customers back millions of dollars, but Cerebras argued in its presentation that such systems are cheaper than comparable GPU clusters, which cannot scale as effectively and draw more power.
Cerebras argued that GPUs need to operate in lockstep across thousands of cores to deliver coordinated response times. Computations must also be distributed across a complex network of cores, which can be slow and power-inefficient.
By comparison, SwarmX breaks data sets into pieces for training, creating a scalable stream that distributes weights to the CS-2 systems in a cluster, which send gradients back through the fabric to the MemoryX systems.
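The full loop – shard the data, compute gradients per system, reduce them back to the parameter store – can be sketched as a toy data-parallel training run. The sharding, gradient, and reduction functions below are illustrative assumptions, not Cerebras code; the model is a one-parameter linear fit:

```python
def shard(dataset, n):
    """Split the dataset into n equal pieces, one per CS-2."""
    k = len(dataset) // n
    return [dataset[i * k:(i + 1) * k] for i in range(n)]

def local_gradient(weights, batch):
    """Toy gradient of mean squared error for a model y = w * x,
    computed on one system's shard."""
    w = weights[0]
    return [sum((w * x - y) * x / (x or 1) for x, y in batch) / len(batch)]

def swarmx_reduce(grads):
    """Average the per-system gradients -- the reduction the fabric
    performs on the way back to the parameter store."""
    n = len(grads)
    return [sum(g[i] for g in grads) / n for i in range(len(grads[0]))]

weights = [0.0]                              # held in the parameter store
data = [(x, 2.0 * x) for x in range(1, 9)]   # target model: y = 2x
for step in range(50):
    grads = [local_gradient(weights, b) for b in shard(data, 4)]
    g = swarmx_reduce(grads)                 # gradients flow back
    weights = [w - 0.05 * gi for w, gi in zip(weights, g)]
```

The parameter store applies the averaged update and rebroadcasts the new weights on the next step, so each system only ever sees its own shard of the data.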
Switching the training of an NLP model from a single CS-2 system to a cluster requires only changing the number of systems in a Python script.
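As a hypothetical illustration of what such a one-line change might look like – the `run_config` dict and `num_systems` key below are invented, not the real Cerebras interface:

```python
# Invented configuration sketch: scaling a run from one CS-2 to a
# 16-system cluster by editing a single value.
run_config = {
    "model": "gpt-style-nlp",
    "num_systems": 1,   # train on a single CS-2
}

# The same script targets a 16-system cluster after one change:
run_config["num_systems"] = 16
```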
Large language models such as GPT-3 can be deployed on a cluster of CS-2s with a single keystroke. That is how easy it is, Feldman said.