Beyond Self-Attention: How a Small Language Model Predicts the Next Token
2024-02-02
This article explores how a small transformer language model predicts the next token, focusing on the role of transformer blocks and feed-forward networks beyond multi-head self-attention. The author shares findings from a six-month investigation and proposes that each transformer block predicts the next token based on learned associations with classes of strings seen in the training data.
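For orientation, here is a minimal sketch (not the author's model) of a standard pre-norm transformer block, showing the two components the article contrasts: multi-head self-attention and the feed-forward network. The dimensions, class name, and use of PyTorch are illustrative assumptions, not details from the article.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """A generic pre-norm transformer block: multi-head self-attention
    followed by a position-wise feed-forward network, each wrapped in a
    residual connection. Sizes here are arbitrary placeholders."""

    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        # The feed-forward network: the part of the block the article
        # focuses on, beyond self-attention.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each position may attend only to itself and earlier tokens.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out                 # residual around attention
        x = x + self.ffn(self.ln2(x))    # residual around the feed-forward network
        return x

if __name__ == "__main__":
    block = TransformerBlock()
    tokens = torch.randn(1, 16, 128)     # (batch, sequence, embedding)
    print(block(tokens).shape)           # torch.Size([1, 16, 128])
```

In a decoder-only language model, a stack of such blocks transforms token embeddings before a final projection to vocabulary logits; the article's investigation concerns what the per-block outputs contribute to that final prediction.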