LLM inference speed of light

This post explores the theoretical limits of language model inference speed, using 'calm', a fast CUDA implementation of transformer-based language model inference, to show why the process is inherently bandwidth-limited: generating each token requires streaming all of the model's weights (and the KV cache) from memory for matrix-vector multiplication and attention. It contrasts ALU (Arithmetic Logic Unit) throughput with memory bandwidth, working through numerical examples on modern CPUs and GPUs to show why inference cannot exceed certain speeds. The piece also explores the potential for optimization through hardware and software improvements, discussing how techniques like speculative decoding and batching can raise ALU utilization and inference efficiency.
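The core bound is easy to estimate: if every generated token must read every weight from memory once, tokens per second cannot exceed memory bandwidth divided by model size. A minimal back-of-envelope sketch of that argument follows; the parameter count, precision, and bandwidth figures are illustrative assumptions, not measurements from the article.

```python
# Bandwidth-bound ceiling on single-stream decoding speed.
# Each decoded token performs matrix-vector multiplies that touch
# every weight once, so throughput <= bandwidth / model size.

def max_tokens_per_second(params_billion: float,
                          bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s from memory bandwidth alone."""
    model_size_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / model_size_gb

# Illustrative (assumed) numbers: a 7B-parameter model in fp16
# (2 bytes/param = 14 GB) on a GPU with ~1000 GB/s of bandwidth.
print(f"{max_tokens_per_second(7.0, 2.0, 1000.0):.1f} tok/s ceiling")
# -> ~71.4 tok/s, regardless of how many TFLOPS the GPU offers:
# matrix-vector multiply needs only ~2 FLOPs per byte read, so the
# ALUs sit mostly idle and the memory bus sets the speed of light.
```

Speculative decoding and batching improve on this by reusing each weight read across several tokens, turning memory-bound matrix-vector work into denser matrix-matrix work with higher ALU utilization.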
