As we step into 2024, we’ve gathered the Cerebras Machine Learning team to make its top predictions for generative AI, Large Language Models (LLMs), and High-Performance Computing (HPC) for the coming year. We believe 2024 will be marked by a move away from a single hardware vendor toward a diverse ecosystem of computing platforms. LLMs will move from the cloud to highly tuned implementations within enterprises and on mobile devices. Models will evolve to be smarter, tapping into new modes of sparsity. Finally, HPC is set to tap into LLM innovations, fostering alternative methods for conducting scientific computation.

Compute Diversification

The 1990s were defined by the dominance of Windows and Intel, a period that served as a cautionary tale about the stifling effect of monopolies on innovation and progress. Learning from this, today’s AI industry is embracing a new paradigm by moving away from a reliance on GPUs. From OpenAI to AWS, AI companies are building alternative software and hardware systems to bring diversity to the AI ecosystem.

This pivot is enabled by the versatility of open-source AI frameworks and compiler infrastructure such as PyTorch and MLIR. Unlike x86, these tools are fully hardware agnostic and can target a variety of CPUs, GPUs, and AI accelerators. Even CUDA is no longer a necessary or desirable component in the ML stack thanks to OpenAI Triton. The Cerebras platform taps into these open-source technologies – our stack is powered by PyTorch 2.0, Lazy Tensor Core, MLIR, and our native runtime.

Moreover, AI model weights are entirely hardware agnostic. It is now commonplace for models initially trained on GPUs to be fine-tuned on Cerebras hardware and vice versa, without any code changes.
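
To make this concrete, here is a minimal sketch (ordinary PyTorch, not Cerebras-specific code) of why checkpoints are portable: a state_dict is just a collection of named tensors, so weights saved on one backend can be remapped onto another at load time. The model shape and file name below are illustrative.

```python
import torch
import torch.nn as nn

# A toy model; its weights are plain tensors with no device information baked in.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Save a checkpoint after training on whatever accelerator was available.
torch.save(model.state_dict(), "checkpoint.pt")

# Later, load the same weights onto a different backend by remapping devices.
device = "cuda" if torch.cuda.is_available() else "cpu"
state = torch.load("checkpoint.pt", map_location=device)
model.load_state_dict(state)
model.to(device)  # fine-tuning or inference continues on the new hardware
```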

Looking to the near future, 2024 is poised to be a turning point where we’ll see a significant move toward a more dynamic and diversified array of AI hardware platforms. This shift is driven by the industry’s pursuit of reduced costs and more control over technology, setting the stage for a robust and versatile environment for AI training and inference.

Cloud, Enterprise, and Mobile Models

Large language models will be deployed in three different ways in 2024 – large cloud models, fine-tuned enterprise models, and efficient mobile models.

The most capable general models will continue to run in the cloud given their huge demands for compute. Inference hardware will have much more memory in 2024, which will support much larger foundation models. State-of-the-art models utilizing sparse techniques will approach 10 trillion parameters in size.

New technology eventually trickles down to smaller form factors and we’ll see that happen in earnest for LLMs in 2024.

One major trend over the next decade will be enterprise LLMs. LLMs are only as good as their training data, and today’s LLMs are mostly trained on generic internet and public-domain data; the next step is for enterprises to fine-tune models on their own proprietary, domain-specific data. For instance, financial services organizations could leverage LLMs trained on their proprietary datasets to enhance fraud detection algorithms or personalize customer service interactions, providing a more secure and tailored experience. In healthcare, a company could train LLMs on anonymized patient records and medical literature to assist in diagnosing diseases or generating treatment plans with greater accuracy and speed.
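
As a hedged sketch of what such enterprise fine-tuning might look like, the snippet below adapts an open base model to proprietary text with the Hugging Face transformers Trainer; the base model, toy records, and hyperparameters are illustrative assumptions rather than a recommended recipe.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from datasets import Dataset

# Illustrative in-house records; in practice this would be a large internal corpus.
records = ["Claim 48213 was flagged for manual review after ...",
           "Customer requested a credit limit increase on ..."]

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = Dataset.from_dict({"text": records}).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="enterprise-llm", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```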

LLMs will run natively on phones starting in 2024. Thanks to advancements in efficient model training, small models like BTLM and Phi often rival state-of-the-art models from a year ago and can run natively on phones. Qualcomm has shown 10B parameter models running on the Snapdragon processor. This will allow highly performant models like Mistral-7B to run on the phone without relying on cloud services. This will be a huge step toward democratizing AI access, bringing the cost of inference effectively to zero and improving user privacy by processing data directly on the device.
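
As a rough sketch of the on-device pattern, here is a locally run, 4-bit-quantized model using the llama-cpp-python bindings as a stand-in for a mobile runtime (an actual phone deployment would use Qualcomm’s or Apple’s own stacks); the GGUF file path is hypothetical.

```python
from llama_cpp import Llama

# Load a locally stored, 4-bit quantized checkpoint; no network access required.
llm = Llama(model_path="mistral-7b-instruct.Q4_K_M.gguf", n_ctx=2048)

# All tokens are generated on-device, so the prompt never leaves the machine.
out = llm("Summarize today's calendar in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```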

Models Evolve Toward Multi-modal and Sparsity

Multi-modal models emerged toward the end of 2023 and will proliferate and gain wider adoption starting in 2024. Today’s multi-modal models are still quite slow and brittle when it comes to consuming non-text inputs. Multi-modal models next year will natively ingest text, image, sound, and potentially video. We will likely see the first examples of action transformers, which can directly take actions on a computer interface on behalf of the user.

In terms of technical developments, there will be a focused effort to enhance the use of Mixture of Experts (MoE) models. The benefit of MoE models lies in their ability to dynamically allocate computational resources to the most relevant parts of a problem, increasing both the efficiency and accuracy of AI systems while activating only a fraction of the parameters of a comparably capable dense model on each input. These models are complex and require intricate management of different model components, which is why only a handful of leading labs, such as OpenAI, are believed to use them in production today. But we expect new techniques to be developed that will simplify these processes while preserving the models’ computational efficiency. Particular attention will likely be given to methods such as speculative decoding and model distillation, which improve model performance and efficiency.
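
To illustrate the core routing idea, here is a deliberately tiny PyTorch sketch of top-k expert routing; the sizes are made up, and it omits the load-balancing losses and parallelism machinery that production MoE systems require.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy Mixture-of-Experts layer: route each token to its top-k experts."""
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)          # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, x):                                   # x: (tokens, d_model)
        scores = self.gate(x)                               # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)          # keep top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                          # run only chosen experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(16, 64)
print(TinyMoE()(tokens).shape)   # torch.Size([16, 64])
```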

We expect sparsity will evolve from coarse-grained to fine-grained. MoE models are used today as they provide a coarser form of sparsity that works on current GPU hardware. Moving forward, we expect to see the exploration of other sparsity techniques such as weight sparsity that can further enhance the performance and efficiency of large-scale models. With hardware support for unstructured and dynamic sparsity, Cerebras will be actively contributing in this area.
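
As a minimal illustration of fine-grained, unstructured weight sparsity, the sketch below applies one-shot magnitude pruning to a single layer in PyTorch; real sparse training, including hardware-accelerated forms, involves far more than this static mask.

```python
import torch
import torch.nn as nn

def magnitude_prune(layer: nn.Linear, sparsity: float = 0.9) -> torch.Tensor:
    """Zero the smallest-magnitude weights, returning the binary mask used."""
    w = layer.weight.data
    k = int(sparsity * w.numel())                       # number of weights to drop
    threshold = w.abs().flatten().kthvalue(k).values    # k-th smallest magnitude
    mask = (w.abs() > threshold).float()
    layer.weight.data.mul_(mask)                        # unstructured: any weight may go
    return mask

layer = nn.Linear(1024, 1024)
mask = magnitude_prune(layer, sparsity=0.9)
print(f"remaining weights: {int(mask.sum())} / {mask.numel()}")
```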

The Convergence of AI + HPC

We expect to see continued convergence between High-Performance Computing (HPC) and AI in 2024, leading to substantial enhancements in various HPC applications. AI methodologies are poised to either augment or entirely revamp traditional computational approaches across numerous scientific fields.

In the realm of climate modeling and environmental simulations, AI can provide more accurate predictions by processing vast datasets faster than ever before, identifying patterns that elude traditional finite element models. This will enable real-time adjustments to models and the ability to simulate complex climate phenomena with greater precision.

For materials science, generative AI will expedite the discovery of new materials by predicting properties and behaviors, replacing or augmenting molecular dynamics simulations with a more efficient, data-driven strategy. This could lead to breakthroughs in developing materials with desired properties for electronics, pharmaceuticals, and renewable energy technologies.

In the energy sector, particularly for fusion research, new HPC algorithms can dramatically improve the simulation of nuclear physics. For example, a new method for Monte Carlo Particle Transport runs 130 times faster on the Cerebras WSE than a highly efficient GPU implementation. This leap in computational speed not only accelerates the pace at which simulations can be conducted but also allows for greater model complexity and higher-resolution modeling, providing deeper insights into the behavior of plasma and fusion reactions. Such advancements could lead to more effective designs of fusion reactors and bring us closer to the goal of achieving sustainable and clean nuclear fusion power.
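
For readers unfamiliar with the method, here is a deliberately simplified, generic Monte Carlo particle transport sketch (a 1D slab with made-up cross-sections); it shows the embarrassingly parallel random-walk structure that such solvers exploit, and it is not the Cerebras or GPU implementation referenced above.

```python
import math
import random

# Hypothetical material data for a 1D slab.
SIGMA_TOTAL = 0.5        # total macroscopic cross-section (1/cm)
ABSORPTION_PROB = 0.3    # probability that a collision absorbs the particle
SLAB_THICKNESS = 10.0    # slab thickness (cm)

def transport_one_particle(rng: random.Random) -> str:
    """Random-walk one particle until it is absorbed, transmitted, or reflected."""
    x, direction = 0.0, 1.0
    while True:
        # Distance to the next collision, sampled from an exponential distribution.
        x += direction * (-math.log(1.0 - rng.random()) / SIGMA_TOTAL)
        if x < 0.0:
            return "reflected"
        if x > SLAB_THICKNESS:
            return "transmitted"
        if rng.random() < ABSORPTION_PROB:
            return "absorbed"
        # Isotropic scatter in 1D: pick a new direction at random.
        direction = 1.0 if rng.random() < 0.5 else -1.0

rng = random.Random(0)
histories = [transport_one_particle(rng) for _ in range(100_000)]
for fate in ("absorbed", "transmitted", "reflected"):
    print(fate, histories.count(fate) / len(histories))
```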

Conclusion

In summary, 2024 stands to be a seminal year, marking the dawn of an ecosystem where diverse AI hardware platforms will flourish. This diversity promises to accelerate the pace of innovation, drive down costs, and expand the reach of AI, ultimately ushering in a future where the full potential of AI and high-performance computing can be realized by anyone, anywhere.