Beyond Self-Attention: How a Small Language Model Predicts the Next Token

This article explores how a small transformer language model predicts the next token, focusing on the role of the transformer blocks and feed-forward networks beyond multi-head self-attention. The author shares findings from a six-month investigation, proposing that each transformer block predicts the next token based on learned associations with classes of strings from the training data.
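For context on the components the summary names, here is a minimal sketch of a standard pre-norm transformer block in PyTorch. It is not the author's code; the dimensions, module names, and pre-norm layout are illustrative assumptions. It simply makes concrete that every block pairs multi-head self-attention with a position-wise feed-forward network, the sub-layer the article focuses on.

```python
# Minimal, illustrative transformer block: multi-head self-attention followed by
# a position-wise feed-forward network (FFN). Not taken from the article.
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4, d_ff: int = 256):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        # The feed-forward network: applied to each position independently,
        # after attention has mixed information across positions.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask so each position attends only to itself and earlier tokens.
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        # Residual stream: each sub-layer adds its contribution to x.
        x = x + attn_out
        x = x + self.ffn(self.ln2(x))
        return x


# Usage: a batch of 2 sequences, 8 tokens each, 64-dimensional embeddings.
block = TransformerBlock()
tokens = torch.randn(2, 8, 64)
print(block(tokens).shape)  # torch.Size([2, 8, 64])
```

In a full language model, a stack of such blocks feeds a final projection onto the vocabulary; the article's question is how much of the next-token prediction is carried by the blocks' feed-forward networks rather than by attention alone.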

Visit Original Article →