LLM in a flash: Efficient Large Language Model Inference with Limited Memory

The paper presents a method for running large language models (LLMs) whose parameters exceed available DRAM by storing the model in flash memory and loading parameters into DRAM on demand. It introduces two key techniques: "windowing," which reduces data transfer by reusing neurons already loaded for recently processed tokens, and "row-column bundling," which plays to flash memory's strength at sequential access by storing related rows and columns together so they can be read in larger contiguous chunks. Together, these techniques speed up inference and allow models up to twice the size of the available DRAM to run effectively on memory-constrained devices.
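
To make the two ideas concrete, here is a minimal, self-contained Python sketch of a sliding-window neuron cache with bundled reads. It is illustrative only and not the paper's implementation: the names (`NeuronCache`, `flash`, `step`), the toy dimensions, and the eviction policy are assumptions; the "flash" is simulated with an in-memory array, and the predictor that decides which neurons are active is replaced by random choices.

```python
import numpy as np

HIDDEN = 8      # FFN input dimension (toy size, assumed for illustration)
NEURONS = 32    # number of FFN neurons stored in simulated "flash"
WINDOW = 4      # how many recent tokens' active neurons stay resident in DRAM

# Row-column bundling (sketch): keep each neuron's up-projection row and
# down-projection column adjacent, so one contiguous read fetches both.
rng = np.random.default_rng(0)
flash = rng.standard_normal((NEURONS, 2 * HIDDEN)).astype(np.float32)

class NeuronCache:
    """Sliding-window cache of FFN neurons resident in DRAM (illustrative only)."""

    def __init__(self, window: int):
        self.window = window
        self.history: list[set[int]] = []        # active-neuron sets of recent tokens
        self.resident: dict[int, np.ndarray] = {}  # neuron id -> bundled weights in DRAM

    def step(self, active: set[int]) -> int:
        """Make the active neurons resident; return how many were read from flash."""
        to_load = active - self.resident.keys()   # windowing: skip already-resident neurons
        for n in to_load:
            self.resident[n] = flash[n]           # one bundled, contiguous read per neuron
        self.history.append(active)
        if len(self.history) > self.window:
            expired = self.history.pop(0)
            still_needed = set().union(*self.history)
            for n in expired - still_needed:      # evict neurons no recent token used
                self.resident.pop(n, None)
        return len(to_load)

# Usage: with overlapping activations across tokens, later steps read fewer neurons.
cache = NeuronCache(WINDOW)
for t in range(8):
    active = set(rng.choice(NEURONS, size=10, replace=False).tolist())
    loaded = cache.step(active)
    print(f"token {t}: {loaded} of {len(active)} active neurons read from flash")
```

The sketch captures the two effects the summary describes: neurons still in the window are not re-read from flash, and each read that does happen is a larger sequential transfer because the row and column weights are bundled.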
