A PyTorch implementation of the Transformer architecture as described in the paper "Attention Is All You Need" (Vaswani et al., 2017). This implementation focuses on clarity and educational value.
- Complete transformer architecture implementation
- Multi-head self-attention mechanism
- Positional encoding
- Layer normalization
- Feed-forward networks
- Encoder and decoder stacks
- Masked attention for autoregressive decoding
```bash
git clone https://github.com/yourusername/transformer-implementation.git
cd transformer-implementation
pip install -r requirements.txt
```
- Python 3.8+
- PyTorch 2.0+
- NumPy
- tqdm
The implementation includes the following components:
```
transformer/
├── model/
│   ├── attention.py      # Multi-head attention implementation
│   ├── encoder.py        # Transformer encoder
│   ├── decoder.py        # Transformer decoder
│   ├── positional.py     # Positional encoding
│   ├── feed_forward.py   # Feed-forward network
│   └── transformer.py    # Complete transformer model
├── utils/
│   ├── masking.py        # Attention masking utilities
│   └── preprocessing.py  # Data preprocessing tools
├── train.py              # Training script
└── inference.py          # Inference utilities
```
```python
from transformer.model import Transformer

# Initialize model
model = Transformer(
    src_vocab_size=32000,
    tgt_vocab_size=32000,
    d_model=512,
    n_heads=8,
    n_layers=6,
    d_ff=2048,
    dropout=0.1
)

# Training example
outputs = model(source_sequences, target_sequences)

# Generate sequence
generated = model.generate(
    source_sequence,
    max_length=50,
    temperature=0.7
)
```
The core self-attention mechanism computes attention scores using queries (Q), keys (K), and values (V):

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where:
- Q ∈ ℝ^(seq_len × d_k): Query matrix
- K ∈ ℝ^(seq_len × d_k): Key matrix
- V ∈ ℝ^(seq_len × d_v): Value matrix
- d_k: Dimension of key vectors
- d_v: Dimension of value vectors
- seq_len: Sequence length
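For reference, here is a minimal PyTorch sketch of this computation; the function name, tensor shapes, and the optional mask argument are illustrative and may not match the repository's attention.py exactly:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q: (..., len_q, d_k); k: (..., len_k, d_k); v: (..., len_k, d_v)."""
    d_k = q.size(-1)
    # Scores scaled by sqrt(d_k) to keep softmax gradients well-behaved
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 are excluded from attention
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v), weights
```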
Multi-head attention performs h parallel attention operations:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$

where each head is computed as:

$$\text{head}_i = \text{Attention}(QW^Q_i, KW^K_i, VW^V_i)$$
Matrix dimensions:
- W^Q_i ∈ ℝ^(d_model × d_k)
- W^K_i ∈ ℝ^(d_model × d_k)
- W^V_i ∈ ℝ^(d_model × d_v)
- W^O ∈ ℝ^(hd_v × d_model)
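A compact sketch of the project-split-attend-concatenate pattern, assuming d_k = d_v = d_model / h as in the paper and reusing the scaled_dot_product_attention function sketched above (class and attribute names are illustrative):

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        # W^Q, W^K, W^V and W^O, each as one linear layer covering all heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        batch = q.size(0)
        # Project, then reshape to (batch, n_heads, seq_len, d_k)
        def split(x, proj):
            return proj(x).view(batch, -1, self.n_heads, self.d_k).transpose(1, 2)
        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        out, _ = scaled_dot_product_attention(q, k, v, mask)
        # Concatenate heads and apply the output projection W^O
        out = out.transpose(1, 2).contiguous().view(batch, -1, self.n_heads * self.d_k)
        return self.w_o(out)
```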
Position is encoded using sine and cosine functions:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

where:
- pos: Position in sequence
- i: Dimension index
- d_model: Model dimension
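A sketch of the fixed encoding table, assuming an even d_model; the function name is illustrative, and the repository's positional.py may instead wrap this in a module that adds the table to the embeddings:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """Returns a (max_len, d_model) tensor of fixed positional encodings (d_model assumed even)."""
    pos = torch.arange(max_len).unsqueeze(1).float()              # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))             # 10000^(-2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe
```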
Masked attention for decoder self-attention:

$$\text{MaskedAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$

where M is the mask matrix:
$$M_{i,j} = \begin{cases} -\infty & \text{if } i < j \\ 0 & \text{otherwise} \end{cases}$$
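In code, the mask is commonly built either as a boolean lower-triangular matrix checked inside the attention function, or directly in the additive form above. A sketch of both (function names are illustrative; the repository's masking.py may differ):

```python
import torch

def causal_mask(seq_len):
    """Boolean (seq_len, seq_len) mask: True where position j <= i may be attended."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def additive_causal_mask(seq_len):
    """Additive form: 0 where attention is allowed, -inf for future positions."""
    mask = torch.full((seq_len, seq_len), float("-inf"))
    return torch.triu(mask, diagonal=1)
```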
Each FFN layer applies two linear transformations with a ReLU in between:

$$\text{FFN}(x) = \max(0,\, xW_1 + b_1)W_2 + b_2$$
Matrix dimensions:
- W₁ ∈ ℝ^(d_model × d_ff)
- W₂ ∈ ℝ^(d_ff × d_model)
- b₁ ∈ ℝ^d_ff
- b₂ ∈ ℝ^d_model
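A sketch of the position-wise feed-forward block; the dropout placement is an assumption and is not part of the formula above:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: two linear maps with a ReLU in between."""
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # W1, b1
            nn.ReLU(),                  # max(0, .)
            nn.Dropout(dropout),        # assumed placement, not in the formula
            nn.Linear(d_ff, d_model),   # W2, b2
        )

    def forward(self, x):
        return self.net(x)
```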
Each encoder layer consists of:
- Multi-head self-attention
- Layer normalization
- Feed-forward network
- Residual connections
out = LayerNorm(x + MultiHeadAttention(x))
out = LayerNorm(out + FeedForward(out))
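Putting the pieces together, a post-norm encoder layer following the two update equations above, built from the MultiHeadAttention and FeedForward sketches; the sub-layer dropout and exact module names are assumptions:

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads)   # sketch above
        self.ff = FeedForward(d_model, d_ff, dropout)           # sketch above
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, src_mask=None):
        # out = LayerNorm(x + MultiHeadAttention(x))
        x = self.norm1(x + self.dropout(self.self_attn(x, x, x, src_mask)))
        # out = LayerNorm(out + FeedForward(out))
        return self.norm2(x + self.dropout(self.ff(x)))
```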
Each decoder layer consists of:
- Masked multi-head self-attention
- Multi-head cross-attention with encoder outputs
- Feed-forward network
- Layer normalization and residual connections
out = LayerNorm(x + MaskedMultiHeadAttention(x))
out = LayerNorm(out + MultiHeadAttention(out, enc_out))
out = LayerNorm(out + FeedForward(out))
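The corresponding decoder layer sketch, again reusing the earlier MultiHeadAttention and FeedForward sketches; in the cross-attention step the queries come from the decoder while the keys and values come from the encoder output:

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads)
        self.cross_attn = MultiHeadAttention(d_model, n_heads)
        self.ff = FeedForward(d_model, d_ff, dropout)
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, tgt_mask=None, src_mask=None):
        # out = LayerNorm(x + MaskedMultiHeadAttention(x))
        x = self.norms[0](x + self.dropout(self.self_attn(x, x, x, tgt_mask)))
        # out = LayerNorm(out + MultiHeadAttention(out, enc_out)):
        # queries from the decoder, keys/values from the encoder output
        x = self.norms[1](x + self.dropout(self.cross_attn(x, enc_out, enc_out, src_mask)))
        # out = LayerNorm(out + FeedForward(out))
        return self.norms[2](x + self.dropout(self.ff(x)))
```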
Applied after each sub-layer:

$$\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where:
- μ: Mean of the input
- σ: Standard deviation of the input
- γ, β: Learned parameters
- ε: Small constant for numerical stability
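PyTorch's nn.LayerNorm implements this directly; an explicit version of the formula, for reference:

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the last (feature) dimension, then scale by gamma and shift by beta."""
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta
```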
Cross-entropy loss for sequence prediction:

$$L = -\sum_{t=1}^{T} \sum_{v=1}^{V} y_{t,v} \log(p_{t,v})$$

where:
- T: Sequence length
- V: Vocabulary size
- y: True labels
- p: Predicted probabilities
Label smoothing is applied to the target distributions:

$$y'_{t,v} = (1-\alpha)\,y_{t,v} + \frac{\alpha}{V}$$

where α is the smoothing parameter (typically 0.1).
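In recent PyTorch versions both pieces can be combined via the label_smoothing argument of the built-in cross-entropy loss; a sketch (the padding index of 0 is an assumption):

```python
import torch.nn as nn

# label_smoothing is available on nn.CrossEntropyLoss since PyTorch 1.10
criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=0)  # 0 assumed to be <pad>

def sequence_loss(logits, targets):
    """logits: (batch, tgt_len, vocab_size); targets: (batch, tgt_len) of token ids."""
    return criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```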