Cerebras beats GPUs, breaking record for largest AI models trained on a single device

Cerebras, the company behind the world's largest accelerator chip, the CS-2 Wafer Scale Engine, has announced a milestone: training the world's largest NLP (Natural Language Processing) AI model on a single device. On its own, that claim could mean many things (it wouldn't be much of a record if the previous largest model had, say, been trained on a smartwatch), but the model Cerebras trained reached a staggering, and unprecedented, 20 billion parameters, all without scaling the workload across multiple accelerators. That's enough to fit the internet's latest sensation, OpenAI's 12-billion-parameter text-to-image generator DALL-E.

The most important part of Cerebras' achievement is the reduction in infrastructure and software complexity it requires. Granted, a single CS-2 system is practically a supercomputer all by itself. The Wafer Scale Engine-2, which, as the name suggests, is etched onto a single wafer on a 7 nm process, an area usually enough for hundreds of mainstream chips, packs a staggering 2.6 trillion transistors, 850,000 cores, and 40 GB of on-chip memory into one package, consuming about 15 kW.
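To put those figures in perspective, a quick back-of-the-envelope sketch (an illustration only, not a number from Cerebras) shows why roughly 40 GB of on-chip memory is in the right ballpark for a 20-billion-parameter model, assuming the weights are stored in 16-bit precision and ignoring gradients, optimizer state, and activations:

```python
# Rough estimate: memory needed just to hold 20 billion weights in 16-bit precision.
# Illustration only -- it ignores gradients, optimizer state, and activations,
# and says nothing about how Cerebras actually lays the model out on the wafer.
params = 20e9          # 20 billion parameters
bytes_per_param = 2    # fp16 / bf16 storage

weight_gb = params * bytes_per_param / 1e9
print(f"~{weight_gb:.0f} GB of weights")   # -> ~40 GB
```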

Cerebras' Wafer Scale Engine-2 in all its wafer-sized glory. (Image credit: Cerebras)

By keeping up to 20 billion parameters of an NLP model on a single chip, Cerebras dramatically reduces the cost of training across thousands of GPUs (and the associated hardware and scaling requirements), while eliminating the technical difficulty of partitioning models across them. Cerebras calls this "one of the most painful aspects of NLP workloads," one that sometimes "takes months to complete."

It's a bespoke problem, unique not only to each neural network being processed but to the specifications of each GPU and the network that ties them all together, elements that must be worked out ahead of time before the first training run ever starts. And the work cannot be transferred across systems.
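For a sense of what that manual splitting looks like, here is a minimal, hypothetical PyTorch-style sketch (not Cerebras' tooling, and vastly simpler than a real 20-billion-parameter pipeline) in which the layer-to-device assignment is hard-coded for two specific GPUs:

```python
# Minimal sketch of manual model parallelism: the layer-to-device mapping is
# hand-tuned to one particular machine with two GPUs, which is exactly the
# bespoke, non-portable planning the article describes. Hypothetical example.
import torch
import torch.nn as nn

class SplitModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Which layers land on which GPU depends on each card's memory.
        self.part1 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        x = self.part2(x.to("cuda:1"))  # activations must hop between devices
        return x

model = SplitModel()                    # requires a machine with two GPUs
out = model(torch.randn(8, 4096))
```

Even in this toy case, the device mapping and the cross-device activation transfers are decisions tied to one specific machine, which is the portability problem Cerebras says it sidesteps by fitting the whole model on a single chip.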

Cerebras' CS-2 is a self-contained supercomputing cluster that includes not only the Wafer Scale Engine-2 but also all associated power, memory, and storage subsystems. (Image credit: Cerebras)

Sheer numbers may make Cerebras' achievement look underwhelming: OpenAI's GPT-3, an NLP model that can write entire articles that sometimes fool human readers, has a staggering 175 billion parameters. DeepMind's Gopher, launched late last year, raises that figure to 280 billion. Researchers at Google Brain have even announced training a trillion-plus-parameter model, the Switch Transformer.
