Attention Mechanisms & Transformers · Page 1 of 2
Attention Mechanism
Attention & Transformers
The Problem with RNNs/LSTMs
Sequential processing is slow:
Step 1: Input word 1 → hidden state 1 (wait for completion)
Step 2: Input word 2 → hidden state 2 (can't start until step 1 is done)
...
Step N: Process word N
For a 1000-word sentence: 1000 sequential steps! Bottleneck!
Solution: Process all words at once (parallelization)
But then how do we know which words relate to which?
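The sequential bottleneck can be sketched with a toy recurrent loop. This is a minimal illustration, not a real RNN or LSTM: the scalar hidden state, the `rnn_step` function, and the weights `w_x` and `w_h` are all assumptions made for the example. The point is only that each step reads the state produced by the step before it, so the loop cannot be spread across parallel workers.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.9):
    """One toy recurrent update: the new state needs the previous state."""
    return math.tanh(w_x * x + w_h * h_prev)

def run_rnn(inputs):
    h = 0.0                      # initial hidden state
    states = []
    for x in inputs:             # step t cannot start before step t-1 finishes
        h = rnn_step(x, h)
        states.append(h)
    return states

# Stand-in for a 1000-word sentence: one number per word (hypothetical data).
inputs = [0.1 * t for t in range(1000)]
states = run_rnn(inputs)
print(len(states))  # one sequential step per word
```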
Attention: "Focus on Relevant Words"
When processing a word, don't treat all other words equally. Attend to relevant ones.
Sentence: "The cat sat on the mat"
word: 1 2 3 4 5 6
Processing word 3 ("sat"):
- High attention to "cat" (subject)
- Low attention to "the" (article)
- Medium attention to "mat" (object)
This tells the model which words matter for "sat"!
Self-Attention Mechanism
Query Q = W_q × x (what am I looking for?)
Key K = W_k × x (what do I have?)
Value V = W_v × x (what info to use?)
Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V
Intuition:
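- Q × K^T computes the similarity between each pair of words
- Dividing by √d_k keeps the scores from growing with the key dimension d_k
- softmax converts each row of scores into attention weights that sum to 1
- Weights × V gives a weighted average of the value vectors (relevant words contribute most)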
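The formula above can be sketched in plain Python. This is a minimal illustration under stated assumptions: matrices are lists of rows, the weights are random rather than learned, and the sizes (`n_words`, `d_model`, `d_k`) and helper names (`matmul`, `transpose`, `rand_matrix`) are made up for the example. Words are stored as rows, so the projections are written `x × W_q` instead of `W_q × x`; the computation is otherwise the same. Real code would use a tensor library.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """x: n_words x d_model; each weight matrix: d_model x d_k."""
    q = matmul(x, w_q)                        # what each word is looking for
    k = matmul(x, w_k)                        # what each word offers
    v = matmul(x, w_v)                        # the information each word carries
    d_k = len(k[0])
    scores = matmul(q, transpose(k))          # n_words x n_words similarities
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights        # weighted average of values

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n_words, d_model, d_k = 6, 8, 4               # hypothetical sizes
x = rand_matrix(n_words, d_model)             # random stand-in for word embeddings
w_q, w_k, w_v = (rand_matrix(d_model, d_k) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(len(output), len(output[0]))            # one d_k-dimensional vector per word
```

Note that nothing in `self_attention` loops over time steps in order: every row of `scores` can be computed independently, which is what makes the operation parallelizable.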
Example: Computing Attention
Query for "sat": "what verbs happened?"
Keys for all words: ["article", "noun", "verb", "preposition", "article", "noun"]
Similarity: [very_low, low, high, low, very_low, low]
After softmax (illustrative values that sum to 1): [0.01, 0.04, 0.85, 0.05, 0.01, 0.04]
→ High attention to "sat" (the verb!)
Weighted sum of values: 0.85 × V_sat + 0.05 × V_on + ...
Result: Information focused on the verb!
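The same steps can be traced numerically. The raw similarity scores and the 2-dimensional value vectors below are hypothetical numbers chosen for illustration, not outputs of a trained model; the sketch only shows how softmax turns scores into weights that sum to 1 and how those weights blend the value vectors.

```python
import math

words = ["The", "cat", "sat", "on", "the", "mat"]

# Hypothetical similarity scores between the query for "sat" and each key.
scores = [0.0, 1.0, 4.0, 1.2, 0.0, 1.0]

def softmax(values):
    """Numerically stable softmax: exponentiate, then normalize to sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax(scores)

# Hypothetical 2-dimensional value vectors, one per word.
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]

# Weighted sum of values: each word contributes in proportion to its weight.
context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(2)]

for word, w in zip(words, weights):
    print(f"{word:>4}: {w:.3f}")
print("context vector:", context)
```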
Why Attention Works
- Parallelization: All words processed simultaneously (GPU-friendly)
- Long-range dependencies: Can attend to any word (no sequential bottleneck)
- Interpretability: Attention weights show which words the model focused on
- Flexibility: Can learn different attention patterns per layer
Multi-Head Attention
Use multiple attention mechanisms in parallel:
Head 1: Attends to subjects
Head 2: Attends to verbs
Head 3: Attends to adjectives
...
Concatenate: [head1_output, head2_output, head3_output, ...]
Why? Different "heads" learn different relationships!
Example: 8 heads × 64 dimensions = 512-dimensional output
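Running several heads and concatenating their outputs can be sketched as follows. This is a simplified illustration under stated assumptions: the weights are random, the sizes are shrunk to 4 heads × 4 dimensions so the plain-Python version runs quickly, and the helper names are made up for the example. The standard Transformer also applies a final output projection after concatenation, which is omitted here.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    d_k = len(k[0])
    scores = matmul(q, [list(col) for col in zip(*k)])   # Q x K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_attention(x, heads):
    """heads: list of (w_q, w_k, w_v) triples, one triple per head."""
    outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the head outputs along the feature dimension, word by word.
    return [[val for head_out in outputs for val in head_out[i]]
            for i in range(len(x))]

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n_words, d_model, n_heads, d_head = 6, 16, 4, 4          # hypothetical small sizes
x = rand_matrix(n_words, d_model)
heads = [tuple(rand_matrix(d_model, d_head) for _ in range(3))
         for _ in range(n_heads)]

out = multi_head_attention(x, heads)
print(len(out), len(out[0]))   # n_words rows, n_heads * d_head features each
```

With the sizes from the example above (8 heads × 64 dimensions), the same concatenation would yield 512 features per word.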