A PyTorch implementation of the Transformer architecture as described in the paper "Attention Is All You Need" (Vaswani et al., 2017). This implementation focuses on clarity and educational value.
- Complete transformer architecture implementation
- Multi-head self-attention mechanism
- Positional encoding
- Layer normalization
- Feed-forward networks
- Encoder and decoder stacks
- Masked attention for autoregressive decoding
```bash
git clone https://github.com/yourusername/transformer-implementation.git
cd transformer-implementation
pip install -r requirements.txt
```
- Python 3.8+
- PyTorch 2.0+
- NumPy
- tqdm
The implementation includes the following components:
```
transformer/
├── model/
│   ├── attention.py      # Multi-head attention implementation
│   ├── encoder.py        # Transformer encoder
│   ├── decoder.py        # Transformer decoder
│   ├── positional.py     # Positional encoding
│   ├── feed_forward.py   # Feed-forward network
│   └── transformer.py    # Complete transformer model
├── utils/
│   ├── masking.py        # Attention masking utilities
│   └── preprocessing.py  # Data preprocessing tools
├── train.py              # Training script
└── inference.py          # Inference utilities
```
```python
from transformer.model import Transformer

# Initialize model
model = Transformer(
    src_vocab_size=32000,
    tgt_vocab_size=32000,
    d_model=512,
    n_heads=8,
    n_layers=6,
    d_ff=2048,
    dropout=0.1
)

# Training example
outputs = model(source_sequences, target_sequences)

# Generate sequence
generated = model.generate(
    source_sequence,
    max_length=50,
    temperature=0.7
)
```
The core self-attention mechanism computes attention scores using queries (Q), keys (K), and values (V):

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where:
- Q ∈ ℝ^(seq_len × d_k): Query matrix
- K ∈ ℝ^(seq_len × d_k): Key matrix
- V ∈ ℝ^(seq_len × d_v): Value matrix
- d_k: Dimension of key vectors
- d_v: Dimension of value vectors
- seq_len: Sequence length
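For reference, here is a minimal PyTorch sketch of this computation; the function name, tensor shapes, and the optional mask argument are illustrative and may not match the repository's attention.py exactly:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q: (..., len_q, d_k); k: (..., len_k, d_k); v: (..., len_k, d_v)."""
    d_k = q.size(-1)
    # Scores scaled by sqrt(d_k) to keep softmax gradients well-behaved
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 are excluded from attention
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v), weights
```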
Multi-head attention performs h parallel attention operations:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$

where each head is computed as:

$$\text{head}_i = \text{Attention}(QW^Q_i, KW^K_i, VW^V_i)$$
Matrix dimensions:
- W^Q_i ∈ ℝ^(d_model × d_k)
- W^K_i ∈ ℝ^(d_model × d_k)
- W^V_i ∈ ℝ^(d_model × d_v)
- W^O ∈ ℝ^(hd_v × d_model)
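A compact sketch of the project-split-attend-concatenate pattern, assuming d_k = d_v = d_model / h as in the paper and reusing the scaled_dot_product_attention function sketched above (class and attribute names are illustrative):

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        # W^Q, W^K, W^V and W^O, each as one linear layer covering all heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        batch = q.size(0)
        # Project, then reshape to (batch, n_heads, seq_len, d_k)
        def split(x, proj):
            return proj(x).view(batch, -1, self.n_heads, self.d_k).transpose(1, 2)
        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        out, _ = scaled_dot_product_attention(q, k, v, mask)
        # Concatenate heads and apply the output projection W^O
        out = out.transpose(1, 2).contiguous().view(batch, -1, self.n_heads * self.d_k)
        return self.w_o(out)
```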
Position is encoded using sine and cosine functions:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

where:
- pos: Position in sequence
- i: Dimension index
- d_model: Model dimension
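A sketch of the fixed encoding table, assuming an even d_model; the function name is illustrative, and the repository's positional.py may instead wrap this in a module that adds the table to the embeddings:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """Returns a (max_len, d_model) tensor of fixed positional encodings (d_model assumed even)."""
    pos = torch.arange(max_len).unsqueeze(1).float()              # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))             # 10000^(-2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe
```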
Masked attention for decoder self-attention:

$$\text{MaskedAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$

where M is the mask matrix:
$$M_{i,j} = \begin{cases} -\infty & \text{if } i < j \\ 0 & \text{otherwise} \end{cases}$$
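In code, the mask is commonly built either as a boolean lower-triangular matrix checked inside the attention function, or directly in the additive form above. A sketch of both (function names are illustrative; the repository's masking.py may differ):

```python
import torch

def causal_mask(seq_len):
    """Boolean (seq_len, seq_len) mask: True where position j <= i may be attended."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def additive_causal_mask(seq_len):
    """Additive form: 0 where attention is allowed, -inf for future positions."""
    mask = torch.full((seq_len, seq_len), float("-inf"))
    return torch.triu(mask, diagonal=1)
```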
Each FFN layer applies two linear transformations with a ReLU in between:

$$\text{FFN}(x) = \max(0,\, xW_1 + b_1)W_2 + b_2$$
Matrix dimensions:
- W₁ ∈ ℝ^(d_model × d_ff)
- W₂ ∈ ℝ^(d_ff × d_model)
- b₁ ∈ ℝ^d_ff
- b₂ ∈ ℝ^d_model
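A sketch of the position-wise feed-forward block; the dropout placement is an assumption and is not part of the formula above:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: two linear maps with a ReLU in between."""
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # W1, b1
            nn.ReLU(),                  # max(0, .)
            nn.Dropout(dropout),        # assumed placement, not in the formula
            nn.Linear(d_ff, d_model),   # W2, b2
        )

    def forward(self, x):
        return self.net(x)
```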
Each encoder layer consists of:
- Multi-head self-attention
- Layer normalization
- Feed-forward network
- Residual connections
out = LayerNorm(x + MultiHeadAttention(x))
out = LayerNorm(out + FeedForward(out))
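Putting the pieces together, a post-norm encoder layer following the two update equations above, built from the MultiHeadAttention and FeedForward sketches; the sub-layer dropout and exact module names are assumptions:

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads)   # sketch above
        self.ff = FeedForward(d_model, d_ff, dropout)           # sketch above
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, src_mask=None):
        # out = LayerNorm(x + MultiHeadAttention(x))
        x = self.norm1(x + self.dropout(self.self_attn(x, x, x, src_mask)))
        # out = LayerNorm(out + FeedForward(out))
        return self.norm2(x + self.dropout(self.ff(x)))
```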
Each decoder layer consists of:
- Masked multi-head self-attention
- Multi-head cross-attention with encoder outputs
- Feed-forward network
- Layer normalization and residual connections
out = LayerNorm(x + MaskedMultiHeadAttention(x))
out = LayerNorm(out + MultiHeadAttention(out, enc_out))
out = LayerNorm(out + FeedForward(out))
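The corresponding decoder layer sketch, again reusing the earlier MultiHeadAttention and FeedForward sketches; in the cross-attention step the queries come from the decoder while the keys and values come from the encoder output:

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads)
        self.cross_attn = MultiHeadAttention(d_model, n_heads)
        self.ff = FeedForward(d_model, d_ff, dropout)
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, tgt_mask=None, src_mask=None):
        # out = LayerNorm(x + MaskedMultiHeadAttention(x))
        x = self.norms[0](x + self.dropout(self.self_attn(x, x, x, tgt_mask)))
        # out = LayerNorm(out + MultiHeadAttention(out, enc_out)):
        # queries from the decoder, keys/values from the encoder output
        x = self.norms[1](x + self.dropout(self.cross_attn(x, enc_out, enc_out, src_mask)))
        # out = LayerNorm(out + FeedForward(out))
        return self.norms[2](x + self.dropout(self.ff(x)))
```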
Applied after each sub-layer:

$$\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where:
- μ: Mean of the input
- σ: Standard deviation of the input
- γ, β: Learned parameters
- ε: Small constant for numerical stability
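PyTorch's nn.LayerNorm implements this directly; an explicit version of the formula, for reference:

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the last (feature) dimension, then scale by gamma and shift by beta."""
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta
```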
Cross-entropy loss for sequence prediction:

$$L = -\sum_{t=1}^{T} \sum_{v=1}^{V} y_{t,v} \log(p_{t,v})$$

where:
- T: Sequence length
- V: Vocabulary size
- y: True labels
- p: Predicted probabilities
Label smoothing is applied to the target distributions:

$$y'_{t,v} = (1-\alpha)\,y_{t,v} + \frac{\alpha}{V}$$

where α is the smoothing parameter (typically 0.1).
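In recent PyTorch versions both pieces can be combined via the label_smoothing argument of the built-in cross-entropy loss; a sketch (the padding index of 0 is an assumption):

```python
import torch.nn as nn

# label_smoothing is available on nn.CrossEntropyLoss since PyTorch 1.10
criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=0)  # 0 assumed to be <pad>

def sequence_loss(logits, targets):
    """logits: (batch, tgt_len, vocab_size); targets: (batch, tgt_len) of token ids."""
    return criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```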