The important thing is that what we are doing here is not so much “understanding” as learning the rules for collecting and mixing information. Each token (unit of input) in a sentence gathers the information it needs from other tokens in the same sentence, then updates its own representation with what it has gathered. By repeating this process across dozens of stacked layers, the model can handle everything from local word connections to long-distance dependencies in stages. In this article, without relying on mathematical formulas, I will use text alone to explain what self-attention does, why it works, and where it tends to break down, in enough detail that it almost feels like an implementation.
Table of Contents
- The inside of LLM is a “stacking machine”: the premise is that it is a decoder-type Transformer
- What is self-attention? Each token decides for itself “where to look”
- Causal mask: A model that can be “generated” by not looking into the future
- Multi-Head Attention: One point of view is not enough, so have multiple “viewpoints” in parallel.
- Residual connection and LayerNorm: A framework for not breaking even when stacked deeply
- MLP: “Process” the information gathered by attention and make it into a usable form
- Problems that occur during implementation: There are many patterns that work but are incorrect.
- Why long sentences are painful: Self-attention has a heavy cost of looking at “all versus all”
- Summary: Self-attention is “an information gathering device that learns and decides where to refer.”
The inside of LLM is a “stacking machine”: the premise is that it is a decoder-type Transformer

Many generative LLMs have a structure in which blocks of the same shape are stacked in many layers. The input is not a raw string but a sequence of tokens, and each token is first turned into a vector by an embedding. From there it passes through Transformer blocks of the same shape in order, and is finally converted into an output for choosing the token most likely to come next.
An important constraint for a generative model is that a token at a given position cannot refer to future tokens. Since text is generated from left to right, it would be cheating to look at a future that has not yet been produced. The Transformer therefore limits the range each position can see during the self-attention calculation. This mechanism is called a causal mask, and by treating later tokens as invisible, the model is always in a state of predicting the next token from the past alone.
What is self-attention? Each token decides for itself “where to look”
An intuitive explanation of self-attention goes like this. Every time a token in a sentence is processed, it determines which parts of the sentence are useful given its current situation, extracts information from those parts, and updates its own representation. In other words, each token acts as an “inquiry,” while every token also serves as an “information source.”
It is easier to understand if you think of each token as carrying three role representations internally. The first is the inquiry: “what am I looking for?” The second is the name tag: “what characteristics do I have?” The third is the content: the information the token can provide. (In standard terminology, these correspond to the query, key, and value.) An inquiry is compared against the name tags, and the more strongly an information source is judged to be relevant, the more strongly its content is taken in. As a result, each token is updated to a new representation that gathers and mixes information from multiple places in the sentence.
What matters here is that the rules for choosing reference targets are not fixed. Rather than simply looking at nearby words, a token can also look far away depending on the input. For example, even when a subject and its predicate are far apart, the predicate can reference the subject and update itself to a consistent representation when needed. This is why self-attention is said to handle long-distance dependencies well.
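The inquiry / name tag / content picture above can be sketched directly in code. Below is a minimal single-head self-attention in NumPy; the names (`self_attention`, `W_q`, `W_k`, `W_v`) are illustrative and not taken from any specific library.

```python
# Minimal sketch of single-head self-attention (illustrative names, NumPy only).
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """x: (seq_len, d) token representations."""
    q = x @ W_q          # inquiry:  "what am I looking for?"
    k = x @ W_k          # name tag: "what do I offer?"
    v = x @ W_v          # content:  the information each token provides
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                        # all-pairs relevance
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)          # softmax over sources
    return weights @ v   # each token mixes in content from relevant tokens

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # 4 tokens, 8-dim embeddings
W = [rng.normal(size=(8, 8)) for _ in range(3)]
out = self_attention(x, *W)
print(out.shape)  # (4, 8): same shape, updated representations
```

Note that the weights are recomputed from the input every time, which is exactly why the reference pattern can adapt to each sentence.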
Causal mask: A model that can be “generated” by not looking into the future
During generation, each position can only refer to the tokens before it. Unrestricted self-attention would let every position look anywhere in the text, but allowing that during training would amount to “solving while looking at the answer.” A causal mask therefore forces each position to be unable to refer to anything that comes after it, so the model always predicts the next token from the past alone.
This constraint works the same way during training and inference. Even though the entire text is available during training, the calculation at each position is arranged so the future cannot be seen, matching the situation at inference time. Because of this, the behavior learned in training carries over directly to generation. In other words, the causal mask is the safety device that makes the Transformer work as a generator.
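Concretely, the mask is usually applied to the relevance scores before normalization: positions in the future are set to negative infinity so they receive zero weight. A minimal sketch in NumPy (the function name `causal_softmax` is mine):

```python
# Sketch of a causal mask applied to attention scores (NumPy, illustrative).
import numpy as np

def causal_softmax(scores):
    """scores: (seq_len, seq_len); row i may only attend to columns <= i."""
    n = scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True = "future"
    masked = np.where(mask, -np.inf, scores)          # future -> -inf
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

w = causal_softmax(np.zeros((4, 4)))
# Row 0 attends only to token 0; row 3 spreads weight over tokens 0..3.
print(np.round(w, 2))
```

Because the mask is applied identically in training and inference, the learned behavior transfers directly to generation.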
Multi-Head Attention: One point of view is not enough, so have multiple “viewpoints” in parallel.
If there were only one self-attention, each token would have only one criterion for deciding where to refer. In natural language, however, multiple relationships overlap at the same time: similarity in word meaning, syntactic dependency, coreference, topic continuity, and the scope of negation and conditions. Reading the same sentence requires multiple perspectives.
Multi-Head Attention is the practical answer to this problem. The internal representation is divided into multiple groups, and each group independently decides where to look. A division of labor can emerge, with one head becoming better at local connections and another at tracking the subject from the beginning of a sentence. The role of each head is not fixed in advance, but the very freedom to refer from multiple viewpoints at the same time boosts expressive power.
Finally, the information collected by each head is concatenated and projected back to the original dimension. It is therefore best to think of Multi-Head not as mere parallelization, but as building a comprehensive contextual representation by adding up different reference patterns.
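The split-and-merge bookkeeping is mostly shape manipulation. The following sketch shows only the shapes (each head would run its own attention where indicated); all names and sizes are illustrative:

```python
# Shape-only sketch of the multi-head split/merge (NumPy, illustrative).
import numpy as np

seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
x = np.random.default_rng(1).normal(size=(seq_len, d_model))

# Split the representation into independent "viewpoints" (heads).
heads = x.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)  # (heads, seq, d_head)

# Each head would run its own attention over its (seq, d_head) slice here;
# for the shape demo we simply pass the slices through unchanged.
merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)

print(merged.shape)  # back to (seq_len, d_model)
```

In real models a final learned projection then mixes the heads' outputs, which is what turns the separate viewpoints into one combined representation.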
Residual connection and LayerNorm: A framework for not breaking even when stacked deeply
Transformers get very deep. Deep networks have high expressive power, but their training tends to be unstable. The Transformer block therefore wraps residual connections around self-attention and the MLP. A residual connection is a mechanism that adds the original input to the transformation’s output and passes the sum on, rather than passing only the transformation result. Even when a transformation has not yet been learned well, information is hard to lose, which makes training easier to progress.
LayerNorm adjusts the scale and bias of each token’s internal representation, preventing the representations from drifting out of control as layers are added. In large-scale training, where you place LayerNorm affects stability. In practice, an arrangement that normalizes before each transformation (so-called Pre-LN) is often adopted, with the aim of keeping training stable even in deep stacks.
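Putting the two ideas together gives the familiar block skeleton: normalize, transform, add back. Below is a minimal Pre-LN sketch where `attention` and `mlp` are stand-in callables for the real sublayers:

```python
# Sketch of a Pre-LN Transformer block skeleton: normalize, transform, add back.
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def block(x, attention, mlp):
    x = x + attention(layer_norm(x))  # residual around attention
    x = x + mlp(layer_norm(x))        # residual around the MLP
    return x

x = np.random.default_rng(2).normal(size=(4, 8))
# Stand-in sublayers: even a weak transformation leaves x recoverable,
# because the residual path carries the original input through unchanged.
out = block(x, attention=lambda h: 0.1 * h, mlp=lambda h: 0.1 * h)
print(out.shape)  # (4, 8)
```

If both sublayers returned zero, `block` would be the identity: that is the sense in which residual connections keep information from being lost when a transformation is not yet learned.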
MLP: “Process” the information gathered by attention and make it into a usable form
Self-attention is good at deciding where to gather information from, but the gathered information is not necessarily usable as is. Each block therefore contains an MLP, a small neural network applied at each position. The MLP operates independently at every token position, nonlinearly transforming the representation into usable features.
Intuitively, if self-attention is the person who gathers the ingredients, the MLP is the person who preps and cooks them. It organizes the mixed contextual information into a form that is easy to classify and predict from. Repeating this through every block makes the representation progressively more abstract and more useful for the task.
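The typical shape of this sublayer is expand, apply a nonlinearity, project back. A minimal sketch (the GELU approximation and the 4x hidden width are common conventions; the names are mine):

```python
# Sketch of the position-wise MLP: expand, nonlinearity, project back (NumPy).
import numpy as np

def gelu(x):  # tanh approximation of GELU, a common Transformer nonlinearity
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W1, b1, W2, b2):
    """Applied independently at every token position (no mixing across tokens)."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(3)
d_model, d_hidden = 8, 32          # hidden width is often ~4x d_model
x = rng.normal(size=(4, d_model))
out = mlp(x,
          rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden),
          rng.normal(size=(d_hidden, d_model)), np.zeros(d_model))
print(out.shape)  # (4, 8)
```

Note there is no token-to-token interaction here: mixing across positions is attention's job, and processing within a position is the MLP's.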
Problems that occur during implementation: There are many patterns that work but are incorrect.
You can write a self-attention implementation without knowing the formulas, but it is easy to write one that runs and is still wrong. A typical problem is normalizing along the wrong axis. Each token should assign weights over its reference targets, but if the axes are mixed up, the normalization ends up meaning something different. Because the code still produces output, training appears to progress, yet performance stalls or behavior shifts.
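The wrong-axis bug is easy to demonstrate: the softmax must normalize over the sources (keys) so that each token's weights sum to one. Normalizing over the other axis still runs without error, which is exactly why it slips through:

```python
# Demo of the "wrong axis" bug: softmax must normalize over *keys* (axis=-1),
# so each token's weights over its sources sum to 1. (NumPy, illustrative.)
import numpy as np

def softmax(z, axis):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

scores = np.random.default_rng(4).normal(size=(4, 4))  # (queries, keys)
right = softmax(scores, axis=-1)   # rows sum to 1: per-query distribution
wrong = softmax(scores, axis=0)    # columns sum to 1: a different meaning

print(right.sum(axis=-1))  # [1. 1. 1. 1.]
print(wrong.sum(axis=-1))  # not all ones -- yet downstream code still "runs"
```

A one-line sanity check that each row of the weight matrix sums to one catches this class of bug immediately.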
Another issue is mask handling. The constraint that blocks references to the future can be weakened by low-precision arithmetic formats or implementation shortcuts, or, conversely, necessary references can be accidentally removed. In particular, when mixed precision is used for speed, masking by adding a very large negative value may not behave as expected due to rounding. Unit tests and visualization that check whether the future is really unreachable are effective against such problems.
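One such unit test is a perturbation check: change only a future token and verify that the outputs at earlier positions do not move. A sketch (here `causal_attn` is a minimal stand-in for whatever implementation you are testing):

```python
# Sketch of a "no future leakage" unit test: perturb a future token and
# check that earlier positions' outputs are unchanged. (NumPy, illustrative.)
import numpy as np

def causal_attn(x):
    """Minimal causal self-attention without learned projections."""
    n, d = x.shape
    scores = (x @ x.T) / np.sqrt(d)
    scores = np.where(np.triu(np.ones((n, n), dtype=bool), 1), -np.inf, scores)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ x

rng = np.random.default_rng(5)
x = rng.normal(size=(6, 8))
y = causal_attn(x)

x2 = x.copy()
x2[5] += 10.0                     # change only the *last* token
y2 = causal_attn(x2)

# Positions 0..4 must be identical: they may not see position 5.
assert np.allclose(y[:5], y2[:5])
print("no future leakage")
```

The same test run under the actual low-precision settings used in training is what catches the rounding-related mask failures mentioned above.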
Why long sentences are painful: Self-attention has a heavy cost of looking at “all versus all”
The weakness of self-attention is computational cost. Because every token can treat any token in the sentence as a reference candidate, the number of reference pairs grows quadratically with sentence length. As the sentence gets longer, the number of candidate reference relationships explodes, driving up both compute and memory. Inference on long texts tends to be expensive not just because there are more tokens, but because the cost of forming reference relationships grows faster than the token count.
Inference adds further complications. Since generation proceeds one token at a time, each newly added token must compute its relationship with the entire past. A mechanism called the KV cache avoids recomputing past information, which improves speed. Even so, the need to look over the whole past remains, so the slowdown on longer texts never fully goes away. This is why long-context support is not only a question of positional representation but also of computational design.
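The cache idea can be sketched in a few lines: keep the past keys and values, and per step compute only the new token's relationship to them. All names here (`attend`, `K_cache`, `V_cache`) are illustrative, and the learned projections are omitted:

```python
# Hypothetical sketch of a KV cache: keep past keys/values and append one
# new token per step instead of recomputing everything. (NumPy, illustrative.)
import numpy as np

def attend(q, K, V):
    """One query vector attending over all cached keys/values."""
    s = (q @ K.T) / np.sqrt(q.shape[-1])
    e = np.exp(s - s.max())
    return (e / e.sum()) @ V

rng = np.random.default_rng(6)
d = 8
K_cache = np.empty((0, d))   # grows by one row per generated token
V_cache = np.empty((0, d))

for step in range(5):
    x_new = rng.normal(size=(d,))          # representation of the new token
    # In a real model, q/k/v come from learned projections of x_new.
    K_cache = np.vstack([K_cache, x_new])
    V_cache = np.vstack([V_cache, x_new])
    out = attend(x_new, K_cache, V_cache)  # new token vs. all cached past

print(K_cache.shape)  # the cache still grows linearly with sequence length
```

The cache removes redundant recomputation, but as the final line shows, each step still touches the entire past, which is why long generations remain slow and memory-hungry.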
Summary: Self-attention is “an information gathering device that learns and decides where to refer.”
Transformer self-attention works by computing, from the input, where each token in a sentence should look, and then updating that token’s representation by incorporating the necessary information. The causal mask forbids references to the future, which is what lets the model function as a generator. Multi-Head attention provides multiple simultaneous viewpoints, residual connections and LayerNorm make deep stacking learnable, and the MLP processes the gathered information nonlinearly to increase expressive power. On the other hand, self-attention becomes costly on long texts, and the normalization axis and mask handling are breeding grounds for implementation bugs. Understanding the Transformer means being able to verbalize where information flows from and to, and what governs performance and cost, rather than treating the model as magic.
