Understanding Attention
How do AI models focus on what's important? Through Query-Key-Value attention, each word learns to draw on relevant context from the other words in the sequence.
Tokens are mapped to Query, Key, and Value vectors. The model computes attention scores by comparing a token's Query vector to the Key vectors of every token in the sequence; these scores are scaled and passed through a softmax to become attention weights. The weights determine how much focus the token places on each other token's Value vector, blending a weighted portion of the other tokens' Values into its own representation.
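The core computation is scaled dot-product attention. Below is a minimal NumPy sketch of just that step; it leaves out the causal masking, learned projections, and multiple heads that GPT-2 layers on top.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) arrays of Query, Key, and Value vectors."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # compare each Query to every Key
    scores -= scores.max(axis=-1, keepdims=True)    # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: scores -> attention weights
    return weights @ V, weights                     # blend Value vectors by their weights
```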
The attention mechanism can be visualized as a matrix where each cell shows how much one token attends to another; brighter yellow cells indicate stronger attention weights. In GPT-2, each transformer block contains multiple attention matrices, one per attention head. Each head may capture a specific kind of relationship, and running several heads in parallel lets the block represent more complex relationships.
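To make the "multiple matrices per block" idea concrete, the toy sketch below gives each head its own Query/Key/Value projection matrices, so each head produces its own attention matrix over the same tokens. The random weights, sizes, and function names here are illustrative assumptions only; in a trained model these projections are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, n_heads, d_head, rng):
    """X: (seq_len, d_model) token embeddings. Returns concatenated head outputs
    and the per-head attention matrices."""
    d_model = X.shape[-1]
    outputs, attn_maps = [], []
    for _ in range(n_heads):
        # Each head has its own projections (randomly initialized in this sketch),
        # which is why different heads can produce different attention patterns.
        W_q = rng.normal(size=(d_model, d_head))
        W_k = rng.normal(size=(d_model, d_head))
        W_v = rng.normal(size=(d_model, d_head))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        A = softmax(Q @ K.T / np.sqrt(d_head))   # this head's attention matrix
        outputs.append(A @ V)
        attn_maps.append(A)
    return np.concatenate(outputs, axis=-1), attn_maps

X = rng.normal(size=(5, 16))                      # 5 tokens, toy embedding size 16
out, attn_maps = multi_head_attention(X, n_heads=4, d_head=8, rng=rng)
print(out.shape, attn_maps[0].shape)              # (5, 32) (5, 5)
```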
Note: Each attention head in GPT-2 can specialize in different types of relationships. For example, one head might focus on subject-verb agreement, while another might track semantic relationships between words. The model learns these specializations during training.
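As a rough way to inspect this in practice, the sketch below pulls the per-head attention matrices out of GPT-2 and checks which earlier token " flag" attends to most in one head. It assumes the Hugging Face `transformers` library, that " flag" is a single BPE token in GPT-2's vocabulary, and that the block and head indices chosen are arbitrary; which head highlights which relationship only becomes apparent by inspection.

```python
import torch
from transformers import GPT2TokenizerFast, GPT2Model

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

inputs = tokenizer("The American flag waved in the wind", return_tensors="pt")
with torch.no_grad():
    # attentions: one tensor per block, each (batch, n_heads, seq_len, seq_len)
    attentions = model(**inputs, output_attentions=True).attentions

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head = attentions[5][0, 3]            # block 5, head 3 -- arbitrary choices
flag_pos = tokens.index("Ġflag")      # GPT-2's BPE marks a leading space with "Ġ"
top = head[flag_pos].argmax().item()
print(f'"{tokens[flag_pos]}" attends most strongly to "{tokens[top]}"')
```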
For example, one head shows a strong semantic connection between "American" and "flag".