LLM inference speed of light

This post explores the theoretical limits of language model inference speed, using 'calm', a fast CUDA implementation of transformer-based language model inference, to show why the process is inherently bandwidth-limited: generating each token requires streaming all of the model's weights (and the KV cache) from memory for matrix-vector multiplication and attention. It contrasts ALU (Arithmetic Logic Unit) throughput with memory bandwidth, working through numerical examples on modern CPUs and GPUs to show why inference cannot exceed certain speeds. The piece also explores the potential for optimization through hardware and software improvements, discussing how techniques like speculative decoding and batching can raise ALU utilization and inference efficiency.
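The core bound is easy to estimate: if every generated token must read every weight from memory once, tokens per second cannot exceed memory bandwidth divided by model size. A minimal back-of-envelope sketch of that argument follows; the parameter count, precision, and bandwidth figures are illustrative assumptions, not measurements from the article.

```python
# Bandwidth-bound ceiling on single-stream decoding speed.
# Each decoded token performs matrix-vector multiplies that touch
# every weight once, so throughput <= bandwidth / model size.

def max_tokens_per_second(params_billion: float,
                          bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s from memory bandwidth alone."""
    model_size_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / model_size_gb

# Illustrative (assumed) numbers: a 7B-parameter model in fp16
# (2 bytes/param = 14 GB) on a GPU with ~1000 GB/s of bandwidth.
print(f"{max_tokens_per_second(7.0, 2.0, 1000.0):.1f} tok/s ceiling")
# -> ~71.4 tok/s, regardless of how many TFLOPS the GPU offers:
# matrix-vector multiply needs only ~2 FLOPs per byte read, so the
# ALUs sit mostly idle and the memory bus sets the speed of light.
```

Speculative decoding and batching improve on this by reusing each weight read across several tokens, turning memory-bound matrix-vector work into denser matrix-matrix work with higher ALU utilization.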
