Cerebras, the company behind the world’s largest accelerator chip, the CS-2 Wafer Scale Engine, has just announced a milestone: training the world’s largest NLP (Natural Language Processing) AI model on a single device. While that claim on its own could mean many things (it wouldn’t be much of a record if the previous largest model had been trained on, say, a smartwatch), the model Cerebras trained weighs in at a staggering, unprecedented 20 billion parameters, all without scaling the workload across multiple accelerators. That’s enough to fit the latest sensation on the internet, OpenAI’s 12-billion-parameter text-to-image generator, DALL-E.
The most significant part of Cerebras’ achievement is the reduction in infrastructure and software-complexity requirements. Granted, a single CS-2 system looks like a supercomputer all by itself. The Wafer Scale Engine-2, which, as the name implies, is etched onto a single 7 nm wafer (one that would normally yield hundreds of mainstream chips), packs a staggering 2.6 trillion transistors, 850,000 cores, and 40 GB of integrated cache into a single package, consuming about 15 kW.
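That 40 GB figure lines up neatly with the 20-billion-parameter ceiling. As a back-of-the-envelope check (our arithmetic, not a breakdown from Cerebras, and it assumes weights held in 16-bit precision):

```python
# Rough sanity check: 20 billion parameters at 2 bytes each (fp16/bf16)
# would exactly fill the WSE-2's 40 GB of on-chip memory. This is an
# assumption-laden estimate, not an official Cerebras figure.
params = 20e9
bytes_per_param = 2  # 16-bit weights
print(f"{params * bytes_per_param / 1e9:.0f} GB")  # -> 40 GB
```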
Keeping NLP models of up to 20 billion parameters on a single chip dramatically reduces the overhead of training across thousands of GPUs (and the associated hardware and scaling requirements), while eliminating the technical pain of partitioning models among them, which Cerebras calls “one of the most painful aspects of NLP workloads,” one that can sometimes “take months to complete.”
Partitioning is a bespoke problem, unique not only to each neural network being processed, but also to the specifications of each GPU and the network that ties them all together, elements that must be worked out in advance, before the first training run ever starts. And the resulting partition cannot be carried over to a different system.
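To give a flavor of what that hand-partitioning looks like in practice, here is a deliberately tiny, hypothetical sketch in PyTorch (not Cerebras’ method, and far simpler than any real deployment): each stage is pinned to a specific GPU by hand, and the split has to be redone whenever the model, the GPU count, or the interconnect changes.

```python
import torch
import torch.nn as nn

# Toy two-stage model-parallel split. The layer sizes and device choices
# are hypothetical; the point is that placement is hand-tuned and brittle.
class ManuallyPartitionedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))
        # Activations cross the interconnect here; choosing where this
        # boundary falls is part of the engineering effort described above.
        return self.stage2(x.to("cuda:1"))
```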
Raw numbers may make Cerebras’ achievement look underwhelming: OpenAI’s GPT-3, an NLP model that can write entire articles that sometimes fool human readers, has a staggering 175 billion parameters. DeepMind’s Gopher, launched late last year, raises that figure to 280 billion. Google Brain has even announced training a trillion-plus-parameter model, the Switch Transformer.
“In NLP, larger models are shown to be more accurate. But traditionally, only a very select few companies had the resources and expertise necessary to do the painstaking work of breaking up these large models and spreading them across hundreds or thousands of graphics processing units,” says Andrew Feldman, CEO and co-founder of Cerebras Systems. “As a result, very few companies could train large NLP models; it was too expensive, time-consuming and inaccessible for the rest of the industry. Today we are proud to democratize access to GPT-3XL 1.3B, GPT-J 6B, GPT-3 13B and GPT-NeoX 20B, enabling the entire AI ecosystem to set up large models in minutes and train them on a single CS-2.”
Yet, much like clock speeds in the world’s fastest CPUs, the parameter count is only one possible indicator of performance. Recently, work has been done on achieving better results with fewer parameters: Chinchilla, for example, routinely outperforms both GPT-3 and Gopher with just 70 billion of them. The goal is to work smarter, not harder. As such, Cerebras’ achievement is more significant than it might first appear: researchers are bound to be able to fit increasingly complex models on a single CS-2, especially as the company says its system has the potential to support models with “hundreds of billions, even trillions, of parameters.”
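The Chinchilla comparison is worth unpacking with the published numbers: DeepMind trained Chinchilla’s 70 billion parameters on roughly 1.4 trillion tokens, versus about 300 billion tokens for the 280-billion-parameter Gopher, so a comparable compute budget bought far more data per parameter. A quick illustration, using the common ~6ND rule of thumb for training FLOPs:

```python
# Figures from the published Gopher and Chinchilla papers; 6 * N * D is a
# widely used approximation for training FLOPs, not an exact accounting.
models = {"Gopher": (280e9, 300e9), "Chinchilla": (70e9, 1.4e12)}
for name, (params, tokens) in models.items():
    flops = 6 * params * tokens
    print(f"{name}: {tokens / params:.0f} tokens/param, ~{flops:.2e} training FLOPs")
```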
This explosion in the number of workable parameters is made possible by Cerebras’ Weight Streaming technology, which decouples the compute and memory footprints, allowing memory to be scaled to whatever amount is needed to store the rapidly growing parameter counts of AI workloads. This drops setup times from months to minutes and makes switching between models such as GPT-J and GPT-Neo possible “with just a few keystrokes.”
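Conceptually, weight streaming looks something like the following sketch (an illustration of the general idea in plain NumPy, not Cerebras’ actual implementation): parameters live in an external store and flow to the accelerator one layer at a time, so on-device memory stays constant regardless of total model size.

```python
import numpy as np

rng = np.random.default_rng(0)
# "External" parameter store: could hold arbitrarily many layers.
layer_weights = [rng.standard_normal((512, 512)) for _ in range(8)]

def forward(x, weight_store):
    for w in weight_store:          # stream one layer's weights at a time
        x = np.maximum(x @ w, 0.0)  # compute with only this layer resident
    return x

out = forward(rng.standard_normal((4, 512)), layer_weights)
print(out.shape)  # (4, 512)
```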
“Cerebras’ ability to bring large language models to the masses with cost-efficient, easy access opens up an exciting new era in artificial intelligence. It gives organizations that can’t spend tens of millions an easy and inexpensive on-ramp to major-league NLP,” said Dan Olds, Chief Research Officer at Intersect360 Research. “It will be interesting to see the new applications and discoveries CS-2 customers make as they train GPT-3 and GPT-J class models on massive datasets.”