The Mathematics Behind Modern AI


A Deep Dive into GPT, DeepSeek, Gemini, and Claude

Hene Aku Kwapong, PhD, MBA, MIT Practice School Alumnus

Document prepared with the help of:
Jeremie Kwapong, New England Innovation Academy

Date: October 9, 2025

 

Decoding the Language of Intelligent Machines

In 2017, a team of researchers at Google published a paper with a bold title: “Attention Is All You Need.” This work introduced the Transformer architecture, which would revolutionize artificial intelligence and give birth to the large language models that now shape our digital world. But what makes these systems work? What mathematical principles allow a computer to understand context, generate coherent text, and even reason about complex problems?

This document explores the mathematical foundations of four groundbreaking AI systems: OpenAI’s GPT models, which demonstrated the power of scaling; DeepSeek’s efficient design, which proved that smart engineering can rival brute force; Google’s Gemini, which extended AI’s reach into the multimodal realm of images, audio, and video; and Anthropic’s Claude, which put safety and alignment at the center of its training methodology.

We’ll present the actual mathematical formulations that power these systems, but explain them in a way that reveals their elegant simplicity. Our goal is to make sophisticated concepts accessible without sacrificing accuracy or depth.

 

Part 1: The Foundational Architecture – The Transformer

Before diving into the specific innovations of GPT, DeepSeek, and Gemini, we must understand the Transformer architecture that underlies them all. Introduced by Vaswani et al. in 2017, the Transformer solved a fundamental problem: how to process sequences of information in parallel while still capturing relationships between distant elements.

 

The Attention Mechanism: Learning What Matters

Imagine reading a detective novel. When you encounter the phrase “the butler did it,” your mind instantly connects this to earlier mentions of the butler, the crime scene, and various clues scattered throughout hundreds of pages. You’re not giving equal weight to every word you’ve read—you’re selectively attending to relevant information.

The Transformer does exactly this through its attention mechanism. The mathematics, while precise, capture this intuitive process beautifully.

 

Scaled Dot-Product Attention

The core attention formula is:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Let’s unpack this step by step:

The Three Matrices:

  • Q (Query): Represents “what am I looking for?” For each word, this encodes what information it needs from other words.
  • K (Key): Represents “what information do I have?” Each word advertises what it can offer.
  • V (Value): Represents “what is my actual content?” The information that will be retrieved.

The Computation:

  1. QK^T: We multiply queries by keys (transposed). This creates a matrix where each entry measures how relevant each key is to each query. Think of it as computing compatibility scores.
  2. Division by √d_k: This scaling factor is crucial. Without it, when d_k (the dimension of our keys) is large, the dot products can become very large, pushing the softmax function into regions where gradients become tiny. The square root scaling keeps the values in a reasonable range. It’s like adjusting the volume on a stereo—too loud and you get distortion, too quiet and you miss important details.
  3. softmax(…): This converts our compatibility scores into probabilities that sum to 1. It’s a way of saying “distribute your attention across all the words, but focus more on the relevant ones.”
  4. Multiply by V: Finally, we use these attention weights to compute a weighted sum of the values. We’re gathering information from all words, but in proportion to their relevance.
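To make these four steps concrete, here is a minimal NumPy sketch of scaled dot-product attention (an illustration only; the function names, the 5-token toy sequence, and the random matrices are made up for this example, not drawn from any real model):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # compatibility scores, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ V                          # weighted sum of the values

rng = np.random.default_rng(0)                  # toy 5-token sequence with 64-dimensional vectors
Q, K, V = (rng.normal(size=(5, 64)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)     # shape (5, 64)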

Example in Action:

Consider the sentence: “The cat sat on the mat because it was comfortable.”

When processing “it,” the attention mechanism computes:

  • High attention to “mat” (it could refer to the mat)
  • High attention to “cat” (it could refer to the cat)
  • Lower attention to “sat,” “on,” “the” (less relevant for resolving the pronoun)

The context (“because it was comfortable”) helps the model determine that “it” likely refers to the mat, not the cat.

 

Multi-Head Attention: Multiple Perspectives

A single attention mechanism is powerful, but multiple attention mechanisms working in parallel are even better. This is multi-head attention:

MultiHead(Q, K, V) = Concat(head₁, …, headₕ) W^O

where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

Why Multiple Heads?

Different heads can learn to attend to different types of relationships:

  • One head might focus on syntactic relationships (subject-verb agreement)
  • Another might capture semantic relationships (synonyms, antonyms)
  • Yet another might track long-range dependencies (pronouns to their antecedents)

The Mathematics:

Each head has its own learned projection matrices:

  • W_i^Q ∈ ℝ^(d_model × d_k): Projects the input into query space for head i
  • W_i^K ∈ ℝ^(d_model × d_k): Projects the input into key space for head i
  • W_i^V ∈ ℝ^(d_model × d_v): Projects the input into value space for head i

After computing attention for each head independently, we concatenate all the outputs and project them back to the model dimension using W^O ∈ ℝ^(hd_v × d_model).

Typical Configuration:

  • Number of heads (h): 8 to 96 (depending on model size)
  • d_k = d_v = d_model / h (typically 64 for each head in a 512-dimensional model)

This design ensures that the total computational cost is similar to single-head attention with full dimensionality, but we gain the benefit of multiple specialized attention patterns.
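As a rough illustration of the bookkeeping involved, the sketch below fuses the per-head projections W_i^Q, W_i^K, W_i^V into single d_model × d_model matrices and then splits the result into h heads, a common implementation convenience; the weight matrices are placeholders rather than trained parameters:

import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    # X: (seq_len, d_model); W_q, W_k, W_v, W_o: (d_model, d_model) fused projections
    seq_len, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Split each projection into h heads of width d_k: (h, seq_len, d_k)
    Q = Q.reshape(seq_len, h, d_k).transpose(1, 0, 2)
    K = K.reshape(seq_len, h, d_k).transpose(1, 0, 2)
    V = V.reshape(seq_len, h, d_k).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)             # (h, seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    heads = weights @ V                                          # one output per head
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # Concat(head_1, ..., head_h)
    return concat @ W_o                                          # final projection W^O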

 

Positional Encoding: Understanding Order

Unlike recurrent neural networks that process sequences step by step, Transformers process all positions simultaneously. This parallelism is great for speed, but it creates a problem: the model has no inherent sense of word order. “Dog bites man” and “Man bites dog” would look identical!

The solution is positional encoding, which adds position information to each word’s representation:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Understanding the Formula:

  • pos: The position of the word in the sequence (0, 1, 2, …)
  • i: The dimension index (0, 1, 2, …, d_model/2 − 1)
  • 2i and 2i+1: Even and odd dimensions use sine and cosine respectively
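The formula translates almost directly into code. The following sketch (the helper name positional_encoding and the sizes are illustrative) builds the full PE table in one pass:

import numpy as np

def positional_encoding(max_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)     # added element-wise to the token embeddings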

Why Sine and Cosine?

This choice is brilliant for several reasons:

  1. Unique Fingerprints: Each position gets a unique pattern of sine and cosine values across all dimensions.
  2. Relative Positions: The model can learn to attend to relative positions. Due to trigonometric identities, PE(pos+k) can be expressed as a linear function of PE(pos), making it easy for the model to learn patterns like “attend to the word 3 positions back.”
  3. Extrapolation: The model can potentially handle sequences longer than those seen during training, since the sinusoidal pattern continues smoothly.
  4. No Learning Required: Unlike learned positional embeddings, these are fixed functions, reducing the number of parameters.

Visualization:

Imagine each position as a unique musical chord. Position 0 might be a C major chord, position 1 a slightly different chord, and so on. Each dimension contributes a different frequency to this chord, creating a rich, unique signature for each position.

 

Position-Wise Feed-Forward Networks

After attention, each position’s representation is processed through a simple two-layer network:

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂

Breaking It Down:

  1. First Linear Layer (xW₁ + b₁): Expands the representation from d_model dimensions to d_ff dimensions (typically d_ff = 4 × d_model). This expansion gives the network more capacity to process information.
  2. ReLU Activation (max(0, …)): The Rectified Linear Unit acts as a gatekeeper, setting negative values to zero. This non-linearity is crucial—without it, the entire network would collapse to a single linear transformation.
  3. Second Linear Layer (·W₂ + b₂): Projects the expanded representation back down to d_model dimensions.

Why “Position-Wise”?

The same FFN is applied to each position independently. It’s like having the same expert analyze each word separately, after the attention mechanism has already allowed words to share information with each other.

Typical Dimensions:

  • Input/Output: d_model = 512 to 12,288 (depending on model size)
  • Hidden layer: d_ff = 2,048 to 49,152 (typically 4× the model dimension)
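A minimal sketch of the position-wise FFN, assuming NumPy arrays for the weights (the names W1, b1, W2, b2 are placeholders for learned parameters):

import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    # X: (seq_len, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model)
    hidden = np.maximum(0, X @ W1 + b1)   # ReLU(xW1 + b1), expanding to d_ff dimensions
    return hidden @ W2 + b2               # project back down to d_model

# The same weights are applied to every position (row of X) independently.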

 

Layer Normalization and Residual Connections

Two additional components stabilize training:

Residual Connections:

output = LayerNorm(x + Sublayer(x))

The “+x” is a residual connection—we add the input back to the output. This allows gradients to flow directly through the network during training, preventing the vanishing gradient problem in deep networks.

Layer Normalization:

Normalizes the activations to have mean 0 and variance 1, then applies learned scaling and shifting. This keeps the values in a reasonable range and stabilizes training.
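Putting the two together, a hedged sketch of the add-and-norm step might look like this (gamma and beta stand in for the learned scale and shift):

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each position's vector to mean 0 and variance 1, then scale and shift
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer_out, gamma, beta):
    # output = LayerNorm(x + Sublayer(x)): the residual path lets gradients flow straight through
    return layer_norm(x + sublayer_out, gamma, beta)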

 

The Complete Transformer Block

Putting it all together, a Transformer block consists of:

  1. Multi-head self-attention
  2. Residual connection and layer normalization
  3. Position-wise feed-forward network
  4. Another residual connection and layer normalization

This block is stacked multiple times (6 to 96+ layers) to create the full model.

 


Part 2: OpenAI GPT – Scaling the Transformer

 

The GPT Philosophy: Simplicity at Scale

OpenAI’s approach with GPT (Generative Pre-trained Transformer) was elegantly simple: take the Transformer decoder, scale it up massively, train it on enormous amounts of text, and let capabilities emerge from scale.

 

Architecture Details

GPT uses the standard Transformer architecture with a few key choices:


Figure 1: OpenAI GPT Architecture. The model consists of stacked Transformer blocks with causal self-attention, allowing each position to only attend to previous positions. Each block contains multi-head attention followed by a feed-forward network, with residual connections and layer normalization.

Decoder-Only Architecture:

Unlike the original Transformer (which had both encoder and decoder), GPT uses only the decoder. This means:

  • It processes text left-to-right
  • Each position can only attend to previous positions (causal masking)
  • This makes it naturally suited for text generation

Causal Masking:

The attention mechanism is modified to prevent “looking ahead”:

Attention(Q, K, V) = softmax((QK^T + M) / √d_k) V

where M_ij = 0 if i ≥ j, -∞ if i < j

The mask M ensures that position i can only attend to positions j where j ≤ i. The -∞ values become 0 after softmax, effectively blocking attention to future positions.
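A small sketch of how the mask can be built and applied, assuming NumPy and the same toy shapes as before:

import numpy as np

def causal_mask(seq_len):
    # M[i, j] = 0 when j <= i (visible) and -inf when j > i (future positions, blocked)
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def masked_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # the -inf entries become exactly 0 here
    return weights @ V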

 

The Mathematics of Scale

Model Dimensions (GPT-3 as example):

  • Layers: 96
  • d_model: 12,288
  • Number of heads: 96
  • d_k = d_v: 128 per head
  • d_ff: 49,152
  • Context length: 2,048 tokens (later extended to 128K in GPT-4)
  • Total parameters: ~175 billion (GPT-3)

Attention Complexity:

For a sequence of length n:

  • Attention computation: O(n² × d_model)
  • Memory for attention scores: O(n² × h)

This quadratic scaling with sequence length is why context length was initially limited.

 

Training Objective: Next-Token Prediction

GPT is trained with a simple but powerful objective:

L = -∑_{t=1}^T log P(x_t | x_1, …, x_{t-1})

What This Means:

For each position t in the training text:

  1. Given all previous tokens (x_1 through x_{t-1})
  2. Predict the probability distribution over the next token (x_t)
  3. Maximize the log probability of the actual next token
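In code, this objective is just a cross-entropy over the model’s next-token logits; the sketch below assumes the logits have already been computed for every position:

import numpy as np

def next_token_loss(logits, targets):
    # logits: (T, vocab_size) model outputs, one row per position; targets: (T,) the actual next tokens
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # L = -sum_t log P(x_t | x_1, ..., x_{t-1})
    return -log_probs[np.arange(len(targets)), targets].sum()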

This seemingly simple task forces the model to learn:

  • Grammar and syntax
  • Factual knowledge
  • Reasoning patterns
  • Common sense
  • And much more

Why It Works:

The key insight is that predicting the next word requires understanding context, relationships, and patterns. To predict the next word in “The capital of France is ___”, the model must know geography. To predict the next word in “She was happy because ___”, the model must understand causation and emotion.

 

Computational Requirements

Training:

  • GPT-3 training compute: ~3.14 × 10²³ FLOPs
  • Training time: Several months on thousands of GPUs
  • Training data: ~300 billion tokens

Inference:

  • FLOPs per generated token: ~2 × (number of parameters)
  • All parameters are active for every token
  • Memory: Must store all parameters plus the KV cache

 

Strengths and Limitations

Strengths:

  • Proven, reliable architecture
  • Excellent general-purpose performance
  • Strong few-shot learning capabilities
  • Predictable scaling behavior

Limitations:

  • High computational cost (all parameters always active)
  • Quadratic attention complexity limits context length
  • Large memory footprint
  • Expensive to train and run

 


Part 3: DeepSeek – The Efficiency Revolution

 

The DeepSeek Philosophy: Matching Performance with Fraction of Cost

DeepSeek asked a provocative question: “Can we achieve GPT-level performance while using only 5% of the computational resources per token?” Their answer came through two major innovations: Multi-Head Latent Attention (MLA) for memory efficiency and DeepSeekMoE for computational efficiency.



Figure 2: DeepSeek Architecture. The model features Multi-Head Latent Attention (MLA) with 98% KV cache reduction through compression, and DeepSeekMoE with sparse expert routing that activates only 37B out of 671B parameters per token. The architecture combines shared experts (always active) with routed experts (selectively activated based on token affinity).

 

Innovation 1: Multi-Head Latent Attention (MLA)

The KV cache problem is one of the biggest bottlenecks in LLM inference. For each token generated, the model must store the keys and values from all previous tokens. With long contexts and many attention heads, this memory requirement becomes enormous.

 

The Compression Strategy

MLA solves this through low-rank compression. Instead of storing full keys and values, it stores compressed representations:

Step 1: Compress Keys and Values

c_t^KV = W^DKV h_t

Where:

  • h_t ∈ ℝ^d: The input hidden state for token t
  • W^DKV ∈ ℝ^(d_c × d): Down-projection matrix
  • c_t^KV ∈ ℝ^(d_c): Compressed latent vector (d_c ≪ d_h × n_h)

Typical Values:

  • d (model dimension): 5,120
  • n_h (number of heads): 128
  • d_h (dimension per head): 128
  • d_c (compressed dimension): 512

This means instead of storing 128 × 128 = 16,384 values per token, we store only 512—a 97% reduction!

Step 2: Decompress Keys

[k_{t,1}^C; k_{t,2}^C; …; k_{t,n_h}^C] = k_t^C = W^UK c_t^KV

Where:

  • W^UK ∈ ℝ^(d_h×n_h × d_c): Up-projection matrix for keys
  • k_t^C: Decompressed keys for all heads (concatenated)

Step 3: Add Rotary Position Information

k_t^R = RoPE(W^KR h_t)
k_{t,i} = [k_{t,i}^C; k_t^R]

Where:

  • W^KR ∈ ℝ^(d_h^R × d): Matrix for RoPE keys
  • RoPE(·): Rotary Position Embedding function
  • k_t^R: Position-dependent component (shared across heads)
  • [·; ·]: Concatenation

Why Separate Position Information?

RoPE (Rotary Position Embedding) is more efficient than sinusoidal encoding. By keeping position information separate and shared across heads, we save even more memory.

Step 4: Decompress Values

[v_{t,1}^C; v_{t,2}^C; …; v_{t,n_h}^C] = v_t^C = W^UV c_t^KV

Where: – W^UV ∈ ℝ^(d_h×n_h × d_c): Up-projection matrix for values

 

Query Compression

Queries are also compressed, but this is for reducing activation memory during training, not inference:

c_t^Q = W^DQ h_t
[q_{t,1}^C; q_{t,2}^C; …; q_{t,n_h}^C] = q_t^C = W^UQ c_t^Q
[q_{t,1}^R; q_{t,2}^R; …; q_{t,n_h}^R] = q_t^R = RoPE(W^QR c_t^Q)
q_{t,i} = [q_{t,i}^C; q_{t,i}^R]

Where:

  • c_t^Q ∈ ℝ^(d_c'): Compressed query latent (d_c' ≪ d_h × n_h)
  • W^DQ ∈ ℝ^(d_c' × d): Query down-projection
  • W^UQ ∈ ℝ^(d_h×n_h × d_c'): Query up-projection
  • W^QR ∈ ℝ^(d_h×n_h × d_c'): Query RoPE matrix

 

Final Attention Computation

o_{t,i} = ∑_{j=1}^t softmax_j(q_{t,i}^T k_{j,i} / √(d_h + d_h^R)) v_{j,i}^C

u_t = W^O [o_{t,1}; o_{t,2}; …; o_{t,n_h}]

Where: – W^O ∈ ℝ^(d × d_h×n_h): Output projection matrix – o_{t,i}: Attention output for head i at position t

Memory Savings:

For DeepSeek-V3 with 128 heads:

  • Standard MHA: 128 × 128 × 2 = 32,768 values per token
  • MLA: 512 + 128 = 640 values per token
  • Reduction: ~98%
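The following sketch walks through the compression arithmetic with the illustrative sizes quoted above; the projection matrices are random stand-ins for the learned W^DKV, W^UK, and W^UV, and the point is only to show what gets cached and how small it is:

import numpy as np

d, n_h, d_h, d_c, d_h_R = 5120, 128, 128, 512, 128   # illustrative sizes from the text

W_DKV = np.random.randn(d_c, d) * 0.01          # down-projection (compression)
W_UK  = np.random.randn(n_h * d_h, d_c) * 0.01  # up-projection for keys
W_UV  = np.random.randn(n_h * d_h, d_c) * 0.01  # up-projection for values

h_t = np.random.randn(d)                        # hidden state for one token
c_KV = W_DKV @ h_t                              # cached: only d_c numbers per token
k_C = (W_UK @ c_KV).reshape(n_h, d_h)           # keys recomputed on the fly at attention time
v_C = (W_UV @ c_KV).reshape(n_h, d_h)           # values recomputed on the fly

cache_standard = 2 * n_h * d_h                  # per-token K and V in standard MHA
cache_mla = d_c + d_h_R                         # compressed latent + shared RoPE key
print(cache_standard, cache_mla, 1 - cache_mla / cache_standard)   # 32768 640 ≈ 0.98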

 

Innovation 2: DeepSeekMoE Architecture

The second major innovation is the Mixture-of-Experts (MoE) architecture, which activates only a small fraction of parameters for each token.

 

The Basic MoE Formula

h_t’ = u_t + ∑_{i=1}^{N_s} FFN_i^{(s)}(u_t) + ∑_{i=1}^{N_r} g_{i,t} FFN_i^{(r)}(u_t)

Where:

  • u_t: Input to the FFN layer
  • N_s: Number of shared experts (always activated)
  • N_r: Number of routed experts (selectively activated)
  • FFN_i^(s)(·): i-th shared expert
  • FFN_i^(r)(·): i-th routed expert
  • g_{i,t}: Gating value for routed expert i at token t
  • h_t': Output of the FFN layer

DeepSeek-V3 Configuration:

  • N_s = 2 shared experts
  • N_r = 256 routed experts
  • K_r = 8 routed experts activated per token
  • Expert size: Each expert has ~15M parameters

Total Computation:

  • Shared experts: 2 experts (always active)
  • Routed experts: 8 experts (selected from 256)
  • Total active: 10 experts per token
  • Percentage: 10/258 ≈ 3.9% of FFN parameters

 

The Routing Mechanism

Step 1: Compute Affinity Scores

s_{i,t} = Sigmoid(u_t^T e_i)

Where: – e_i ∈ ℝ^d: Centroid vector for expert i (learned parameter) – s_{i,t}: Token-to-expert affinity score (between 0 and 1) – Sigmoid: Ensures scores are in (0, 1) range

Intuition:

Each expert has a “specialty” represented by its centroid vector e_i. The affinity score measures how well the current token matches that specialty. Think of it as asking “Is this expert relevant for processing this token?”

Step 2: Select Top-K Experts

g_{i,t}’ = s_{i,t}    if s_{i,t} ∈ TopK({s_{j,t} | 1 ≤ j ≤ N_r}, K_r)
g_{i,t}’ = 0          otherwise

Only the K_r experts with highest affinity scores are selected.

Step 3: Normalize Gating Values

g_{i,t} = g_{i,t}’ / ∑_{j=1}^{N_r} g_{j,t}’

The gating values are normalized so they sum to 1, ensuring the output is a proper weighted combination.
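A compact sketch of the three routing steps for a single token, assuming NumPy and a hypothetical route_token helper (the expert centroids would be learned parameters in the real model):

import numpy as np

def route_token(u_t, centroids, k_r):
    # u_t: (d,) token representation; centroids: (N_r, d), one e_i per routed expert
    scores = 1.0 / (1.0 + np.exp(-(centroids @ u_t)))   # s_{i,t} = Sigmoid(u_t^T e_i)
    top_idx = np.argsort(scores)[-k_r:]                 # indices of the K_r highest-affinity experts
    gates = np.zeros_like(scores)
    gates[top_idx] = scores[top_idx]                    # g'_{i,t}: keep scores only for selected experts
    gates /= gates.sum()                                # normalize selected gates to sum to 1
    return gates                                        # zero for every unselected expert

# Usage: output = sum_i gates[i] * Expert_i(u_t), summed over the selected experts only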

 

Innovation 3: Auxiliary-Loss-Free Load Balancing

Traditional MoE models face a critical problem: some experts become overused while others are underutilized. This is typically solved with an auxiliary loss that penalizes imbalance, but this loss can hurt model performance.

DeepSeek’s solution is elegant: adjust the routing dynamically without modifying the training loss.

 

The Bias Adjustment Mechanism

g_{i,t}’ = s_{i,t}    if s_{i,t} + b_i ∈ TopK({s_{j,t} + b_j | 1 ≤ j ≤ N_r}, K_r)
g_{i,t}’ = 0          otherwise

Where: – b_i: Bias term for expert i (dynamically adjusted)

Key Insight:

The bias b_i is used only for routing decisions (TopK selection), not for computing the gating values. This means: – Routing is influenced by the bias (encouraging balance) – Gating values remain based on true affinity (preserving performance)

 

Dynamic Bias Update

At the end of each training step:

b_i(step+1) = b_i(step) – γ    if expert i is overloaded
b_i(step+1) = b_i(step) + γ    if expert i is underloaded
b_i(step+1) = b_i(step)        if expert i is balanced

Where: – γ: Bias update speed (hyperparameter, typically 0.01)

Load Criteria:

An expert is considered:

  • Overloaded if it receives more than (1 + ε) × (K_r / N_r) × batch_size tokens
  • Underloaded if it receives fewer than (1 – ε) × (K_r / N_r) × batch_size tokens
  • Balanced otherwise

Where ε is a tolerance parameter (typically 0.2)

How It Works:

  1. If an expert is overloaded, we decrease its bias, making it less likely to be selected
  2. If an expert is underloaded, we increase its bias, making it more likely to be selected
  3. Over time, this naturally balances the load without forcing artificial constraints

Analogy:

Think of it like dynamic pricing: if a restaurant is too crowded, prices go up slightly (negative bias), discouraging some customers. If it’s empty, prices go down (positive bias), attracting more customers. The quality of the food (true affinity) doesn’t change, but the selection behavior adjusts.
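A rough sketch of the per-step bias update described above, using the γ and ε values quoted in the text as defaults (the function and argument names are illustrative, not DeepSeek's actual code):

import numpy as np

def update_routing_bias(b, tokens_per_expert, k_r, n_r, batch_tokens, gamma=0.01, eps=0.2):
    # Expected tokens per expert if routing were perfectly balanced
    expected = (k_r / n_r) * batch_tokens
    for i in range(n_r):
        if tokens_per_expert[i] > (1 + eps) * expected:
            b[i] -= gamma        # overloaded: make expert i less likely to be selected
        elif tokens_per_expert[i] < (1 - eps) * expected:
            b[i] += gamma        # underloaded: make expert i more likely to be selected
        # balanced experts keep their current bias
    return b

# During routing, TopK is taken over s_{i,t} + b[i]; the gate value itself still uses s_{i,t} alone.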

 

Complementary Sequence-Wise Balance Loss

While the auxiliary-loss-free strategy handles batch-level balance, a small auxiliary loss prevents extreme imbalance within individual sequences:

L_Bal = α ∑_{i=1}^{N_r} f_i P_i

Where:

f_i = (N_r / (K_r T)) ∑_{t=1}^T 𝟙[s_{i,t} ∈ TopK({s_{j,t} | 1 ≤ j ≤ N_r}, K_r)]

P_i = (1/T) ∑_{t=1}^T s_{i,t}’

s_{i,t}’ = s_{i,t} / ∑_{j=1}^{N_r} s_{j,t}

Breaking It Down:

  • f_i: Fraction of tokens in the sequence routed to expert i
  • P_i: Average routing probability for expert i
  • α: Balance loss coefficient (typically 0.001-0.01)
  • T: Sequence length
  • 𝟙[·]: Indicator function (1 if condition is true, 0 otherwise)

Why This Loss?

The product f_i × P_i is minimized when experts are balanced. If an expert has high routing probability (P_i) but low actual usage (f_i), or vice versa, the loss increases. This gently encourages balance within each sequence without the strong performance penalty of traditional auxiliary losses.

 

Multi-Token Prediction

DeepSeek-V3 also employs a multi-token prediction objective:

L = -∑_{t=1}^T ∑_{k=1}^K λ_k log P(x_{t+k} | x_{≤t})

Where: – K: Number of future tokens to predict (typically 2-4) – λ_k: Weight for predicting k steps ahead (typically decreasing)

Why Predict Multiple Tokens?

  1. Better Representations: Forces the model to learn representations that are useful for multiple future predictions
  2. Speculative Decoding: The additional prediction heads can be used for speculative decoding during inference, generating multiple tokens in parallel
  3. Improved Performance: Empirically shown to improve benchmark scores

 

DeepSeek-V3 Complete Specifications

Model Architecture:

  • Total parameters: 671B
  • Active parameters per token: 37B (5.5%)
  • Layers: 61
  • Model dimension (d): 7,168
  • Number of attention heads (n_h): 128
  • FFN hidden dimension: 18,432
  • Number of shared experts (N_s): 2
  • Number of routed experts (N_r): 256
  • Activated routed experts (K_r): 8
  • Context length: 128K tokens
  • KV compression dimension (d_c): 512
  • Query compression dimension (d_c'): 1,536

Training Details:

  • Training tokens: 14.8 trillion
  • Training cost: $5.576 million (2.788M H800 GPU hours)
  • Training time: ~2 months
  • Batch size: 9,216 sequences
  • Learning rate: Peak 4.2 × 10⁻⁴, cosine decay
  • Precision: FP8 mixed precision

Performance:

  • MMLU: 88.5
  • MMLU-Pro: 75.9
  • GPQA: 59.1
  • MATH-500: 90.2
  • Codeforces: 78.3 percentile

 


Part 4: Google Gemini – The Multimodal Frontier

 

The Gemini Philosophy: Unified Multimodal Understanding

While GPT and DeepSeek primarily process text, Gemini was designed from the ground up to understand multiple modalities—text, images, audio, and video—in a unified architecture. This isn’t just about processing different types of data separately; it’s about understanding relationships across modalities.


Figure 3: Google Gemini Multimodal Architecture. The model processes multiple input modalities (text, images, audio, video) through specialized tokenizers, then combines them in a unified token sequence. Sparse MoE layers with cross-modal attention enable understanding across modalities. Optional extended reasoning provides step-by-step verification and refinement.

 

Sparse Mixture-of-Experts Foundation

Like DeepSeek, Gemini 2.5 uses a sparse MoE architecture, but optimized for multimodal processing:

FFN_MoE(x) = ∑_{i=1}^N w_i(x) Expert_i(x)

Where:

w_i(x) = softmax(Router(x))_i    if i ∈ TopK(Router(x), K)
w_i(x) = 0                        otherwise

Components:

  • N: Total number of experts (not publicly disclosed, estimated 100-200)
  • K: Number of activated experts per token (estimated 10-20)
  • Router(x): Learned routing function
  • Expert_i(·): i-th expert FFN

Routing Function:

Router(x) = Softmax(W_r x + b_r)

Where: – W_r ∈ ℝ^(N × d): Routing weight matrix – b_r ∈ ℝ^N: Routing bias vector

 

Multimodal Token Processing

The revolutionary aspect of Gemini is how it processes different modalities in a unified token space:

h_multimodal = Transformer([TokenEmbed_text(x_text);
TokenEmbed_image(x_image);
TokenEmbed_audio(x_audio);
TokenEmbed_video(x_video)])

Where [·; ·] denotes concatenation in the sequence dimension.

 

Text Tokenization

Standard subword tokenization:

tokens_text = BPE(text)
embeddings_text = E_text[tokens_text]

Where: – BPE: Byte-Pair Encoding algorithm – E_text ∈ ℝ^(V × d): Text embedding matrix – V: Vocabulary size (~256K tokens)

 

Image Tokenization

Images are divided into patches and embedded:

patches = Reshape(image, patch_size)
tokens_image = Linear(Flatten(patches))
embeddings_image = tokens_image + PE_2D

Where: – patch_size: Typically 14×14 or 16×16 pixels – Linear: Learned linear projection to model dimension – PE_2D: 2D positional encoding (preserves spatial relationships)

For a 224×224 image with 14×14 patches: – Number of patches: (224/14)² = 256 patches – Each patch becomes one token – Total: 256 image tokens
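A hedged sketch of the patch-extraction step, assuming a NumPy image array and a hypothetical patchify helper; the subsequent linear projection and 2D positional encoding are only noted in a comment:

import numpy as np

def patchify(image, patch_size=14):
    # image: (H, W, C) array; returns (num_patches, patch_size * patch_size * C)
    H, W, C = image.shape
    patches = image.reshape(H // patch_size, patch_size, W // patch_size, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * C)
    return patches

image = np.zeros((224, 224, 3))
patches = patchify(image)        # shape (256, 588): 256 patch tokens for one image
# Each row would then be linearly projected to d_model and given a 2D positional encoding.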

 

Audio Tokenization

Audio is processed in overlapping windows:

spectrograms = STFT(audio)
tokens_audio = CNN(spectrograms)
embeddings_audio = tokens_audio + PE_temporal

Where: – STFT: Short-Time Fourier Transform (converts audio to frequency domain) – CNN: Convolutional neural network for feature extraction – PE_temporal: Temporal positional encoding

Typical Configuration: – Window size: 25ms – Hop length: 10ms – For 1 second of audio: ~100 audio tokens

 

Video Tokenization

Video combines spatial and temporal processing:

frames = Sample(video, frame_rate)
tokens_video = [TokenEmbed_image(frame) for frame in frames]
embeddings_video = tokens_video + PE_spatiotemporal

Where: – Sample: Samples frames at specified rate (typically 1-4 fps) – PE_spatiotemporal: 3D positional encoding (spatial + temporal)

For a 1-minute video at 1 fps: – 60 frames – 256 tokens per frame – Total: 15,360 video tokens

 

Cross-Modal Attention

The key to multimodal understanding is allowing different modalities to attend to each other:

Attention_multimodal(Q_modality_A, K_modality_B, V_modality_B) =
softmax(Q_modality_A K_modality_B^T / √d_k) V_modality_B

Example:

When processing the query “What color is the car in the image?”: – Text tokens attend to image tokens to locate the car – Image tokens attend to text tokens to understand what’s being asked – The model generates a response by integrating both modalities

 

Long Context Processing

Gemini 2.5 Pro supports context lengths exceeding 1 million tokens through several optimizations:

 

Efficient Attention Mechanisms

Sparse Attention Patterns:

Instead of full O(n²) attention, use structured sparsity:

Attention_sparse(Q, K, V) = Attention(Q, K_local ∪ K_global, V_local ∪ V_global)

Where: – K_local, V_local: Keys and values from nearby positions (e.g., within 512 tokens) – K_global, V_global: Keys and values from selected global positions (e.g., every 64th token)

This reduces complexity from O(n²) to O(n × √n) or better.

 

Memory-Efficient KV Cache

Quantization:

K_quantized = Quantize(K, bits=8)
V_quantized = Quantize(V, bits=8)

Storing keys and values in 8-bit precision instead of 16-bit reduces memory by 50% with minimal quality loss.
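The idea can be sketched with simple symmetric int8 quantization (one possible scheme; the exact method Gemini uses is not public):

import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization to 8-bit integers
    scale = float(np.abs(x).max()) / 127.0
    if scale == 0.0:
        scale = 1.0                                  # avoid division by zero for an all-zero tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

K = np.random.randn(1024, 128).astype(np.float16)    # keys for 1,024 cached positions
K_q, s = quantize_int8(K)
K_approx = dequantize(K_q, s)                        # approximate keys used at attention time
print(K.nbytes, K_q.nbytes)                          # 262144 vs 131072 bytes: roughly half the memory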

Selective Caching:

importance_i = ∑_{t=recent} attention_weight_{t,i}
cache_positions = TopK(importance, cache_size)

Only cache the most important positions based on recent attention patterns.

 

Extended Reasoning: The Thinking Process

Gemini 2.5 incorporates “thinking” capabilities through extended reasoning:

Output = Chain(Input) → Verify → Refine → Answer

Chain-of-Thought Generation:

P(reasoning_step_i | Input, reasoning_step_{<i})

The model generates intermediate reasoning steps before the final answer.

Verification:

confidence = Evaluate(reasoning_chain)
if confidence < threshold:
reasoning_chain = Regenerate(Input, reasoning_chain)

The model evaluates its own reasoning and regenerates if confidence is low.

Mathematical Formulation:

L_reasoning = -∑_{i=1}^M log P(r_i | x, r_{<i}) – log P(y | x, r_{1:M})

Where: – r_i: i-th reasoning step – M: Number of reasoning steps – y: Final answer – x: Input

 

Distillation for Smaller Models

Smaller Gemini models (Flash, Flash-Lite) learn from larger models through distillation:

L_distillation = KL(P_student(y|x) || P_teacher(y|x))

Where:

KL(P || Q) = ∑_y P(y) log(P(y) / Q(y))

K-Sparse Approximation:

To reduce storage, only the top-K logits from the teacher are stored:

P_teacher_sparse(y|x) = TopK(P_teacher(y|x), K)
Normalize(P_teacher_sparse)

Typical K: 256-1024 (out of ~256K vocabulary)
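A hedged sketch of the sparse distillation loss as written above, restricting both distributions to the teacher’s top-K support (one reasonable implementation choice, not necessarily Google’s):

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def topk_distillation_loss(student_logits, teacher_logits, k=256):
    top = np.argsort(teacher_logits)[-k:]        # indices of the teacher's top-k logits
    p_teacher = softmax(teacher_logits[top])     # renormalized sparse teacher distribution
    p_student = softmax(student_logits[top])     # student restricted to the same support
    # KL(P_student || P_teacher), as in the formula above
    return np.sum(p_student * (np.log(p_student + 1e-12) - np.log(p_teacher + 1e-12)))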

 

Training Objectives

Gemini uses multiple training objectives:

1. Next-Token Prediction (Primary):

L_next = -∑_{t=1}^T log P(x_t | x_{<t}, context_multimodal)

2. Multimodal Alignment:

L_align = -log P(text | image) – log P(image | text)

Ensures text and image representations are aligned.

3. Instruction Following:

L_instruct = -log P(response | instruction, context)

4. Preference Learning:

L_preference = -log σ(r(x, y_preferred) – r(x, y_rejected))

Where: – σ: Sigmoid function – r(x, y): Reward model score

 

Gemini 2.5 Pro Specifications

Model Architecture:

  • Architecture: Sparse MoE Transformer
  • Total parameters: Not disclosed (estimated 500B-1T)
  • Active parameters: Estimated 100-200B per token
  • Context length: 1M+ tokens (2M for some versions)
  • Supported modalities: Text, image, audio, video
  • Output modalities: Text, audio (experimental)
  • Output length: Up to 64K tokens

Training Details:

  • Training data cutoff: January 2025
  • Training infrastructure: TPUv5p pods
  • Precision: Mixed precision (BF16/FP8)
  • Training scale: Multiple datacenters, thousands of accelerators

Performance Highlights:

  • MMLU: ~90 (estimated)
  • Codeforces: High percentile
  • Multimodal benchmarks: State-of-the-art
  • Video understanding: Up to 3 hours of video
  • Long context: Excellent performance up to 1M tokens

 

Part 5: Anthropic Claude – Constitutional AI and Safety-First Design

 

The Claude Philosophy: Helpfulness Through Constitutional Principles

Anthropic’s Claude represents a different approach to LLM development, prioritizing safety and alignment with human values from the ground up. While the exact architectural details remain proprietary, Claude’s distinguishing feature is its training methodology: Constitutional AI (CAI), which uses AI feedback guided by explicit principles rather than relying solely on human feedback.


Figure 4: Anthropic Claude Architecture. Built on a Transformer foundation, Claude employs Constitutional AI (CAI) for alignment, combining supervised learning, AI feedback (RLAIF), and reinforcement learning from human feedback (RLHF). The model processes multimodal inputs (text and images) and uses a constitution of ethical principles to guide outputs toward helpful, harmless, and honest responses.

 

Base Architecture: Transformer with Multimodal Capabilities

Claude 3 models (Opus, Sonnet, and Haiku) are built on the Transformer architecture with multimodal capabilities, similar to Gemini. The models can process both text and visual inputs.

Model Family Specifications:

Model           | Capability Level     | Speed    | Context Length | Use Case
Claude 3 Opus   | Highest intelligence | Moderate | 200K tokens    | Complex reasoning, coding, research
Claude 3 Sonnet | Balanced             | Fast     | 200K tokens    | General-purpose, balanced performance
Claude 3 Haiku  | Fast and efficient   | Fastest  | 200K tokens    | Quick tasks, high-volume processing

Context Window:

Claude 3 models support a 200,000 token context window, which is approximately 150,000 words or about 500 pages of text. This enables:

Context_capacity = 200K tokens ≈ 150K words

 

Constitutional AI: Training with Principles

The core innovation in Claude is Constitutional AI, which modifies the standard RLHF (Reinforcement Learning from Human Feedback) pipeline.

 

Standard RLHF Process

Traditional RLHF involves three stages:

Stage 1: Supervised Fine-Tuning (SFT)

L_SFT = -∑_{(x,y)∈D_SFT} log P_θ(y | x)

Where: – D_SFT: Dataset of (prompt, response) pairs curated by humans – P_θ(y | x): Probability of generating response y given prompt x – θ: Model parameters

Stage 2: Reward Model Training

L_RM = -E_{(x,y_w,y_l)∈D_comp} [log σ(r_φ(x, y_w) – r_φ(x, y_l))]

Where: – D_comp: Dataset of comparison pairs (prompt, winning response, losing response) – r_φ: Reward model with parameters φ – σ: Sigmoid function – y_w, y_l: Winning and losing responses as judged by humans

Stage 3: Reinforcement Learning

L_RL = E_{x∼D,y∼π_θ} [r_φ(x, y)] – β KL(π_θ || π_ref)

Where: – π_θ: Policy (the LLM being trained) – π_ref: Reference policy (initial supervised model) – β: KL divergence coefficient (prevents the model from deviating too far) – KL: Kullback-Leibler divergence
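Stage 2’s pairwise objective is straightforward to compute once the reward model has scored each pair; this sketch assumes the scores are already available as plain numbers:

import numpy as np

def reward_model_loss(r_winning, r_losing):
    # L_RM = -E[log sigmoid(r(x, y_w) - r(x, y_l))], averaged over a batch of comparison pairs
    diff = np.asarray(r_winning) - np.asarray(r_losing)
    return np.mean(np.logaddexp(0.0, -diff))     # equals -mean(log sigmoid(diff)), numerically stable

# Example with made-up scores from a hypothetical reward model on three comparison pairs
print(reward_model_loss([2.1, 0.3, 1.5], [0.4, 0.1, 1.9]))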

 

Constitutional AI: RLAIF (RL from AI Feedback)

Constitutional AI modifies this process by using AI-generated feedback based on a constitution:

The Constitution:

A set of principles that guide the model’s behavior. Examples from Claude’s constitution include:

  1. “Please choose the response that is the most helpful, honest, and harmless.”
  2. “Please choose the response that is least intended to build a relationship with the user.”
  3. “Please choose the response that has the least objectionable, offensive, unlawful, deceptive, inaccurate, or harmful content.”
  4. Principles derived from the UN Declaration of Human Rights
  5. Principles for accessibility and disability rights (from Collective Constitutional AI)

Stage 1: Supervised Learning (SL-CAI)

For each harmful prompt x_h:
1. Generate initial response: y_0 = LLM(x_h)
2. Critique using constitution: c = LLM(“Critique: ” + x_h + y_0 + constitution_i)
3. Revise response: y_revised = LLM(“Revise: ” + x_h + y_0 + c + constitution_i)
4. Fine-tune on revised responses

This creates a dataset of (harmful_prompt, safe_response) pairs without human labeling.

Stage 2: AI Feedback for Preference Modeling

Instead of human comparisons, the model generates its own preference data:

For each prompt x:
1. Generate multiple responses: {y_1, y_2, …, y_n}
2. For each pair (y_i, y_j):
Ask LLM: “Which response better follows principle P from the constitution?”
3. Create preference dataset D_AI from AI judgments
4. Train reward model on D_AI

Mathematical Formulation:

L_RLAIF = -E_{(x,y_i,y_j)∈D_AI} [log σ(r_φ(x, y_better) – r_φ(x, y_worse))]

Where: – D_AI: Dataset of AI-generated preference comparisons – y_better, y_worse: Responses ranked by AI according to constitutional principles

Stage 3: Reinforcement Learning

Same as standard RLHF, but using the reward model trained on AI feedback:

L_RL-CAI = E_{x∼D,y∼π_θ} [r_φ(x, y)] – β KL(π_θ || π_ref)

 

Advantages of Constitutional AI

1. Scalability

AI feedback is cheaper and faster than human feedback:

Cost_human >> Cost_AI
Time_human >> Time_AI

2. Consistency

AI feedback based on explicit principles is more consistent than human judgments:

Variance_AI < Variance_human (for principle-based judgments)

3. Transparency

The constitution makes the training objectives explicit and auditable.

4. Harmlessness Without Helpfulness Trade-off

Traditional RLHF often creates a trade-off between helpfulness and harmlessness. Constitutional AI reduces this:

Maximize: Helpfulness + Harmlessness + Honesty
Subject to: Constitutional principles

 

Multimodal Processing

Claude 3 models process images alongside text:

Image Input Processing:

image_tokens = VisionEncoder(image)
combined_input = Concat([text_tokens, image_tokens])
output = Transformer(combined_input)

Where: – VisionEncoder: Converts images to token representations – Transformer: Processes the combined multimodal sequence

Supported Image Formats: – JPEG, PNG, GIF, WebP – Up to 10MB per image – Maximum resolution: 8000×8000 pixels

Image Understanding Capabilities: – Chart and graph interpretation – Document analysis (tables, diagrams) – Photo analysis and description – Visual reasoning

 

Training Infrastructure

Hardware: – Amazon Web Services (AWS) – Google Cloud Platform (GCP)

Frameworks: – PyTorch – JAX – Triton (for optimized kernels)

Training Data: – Proprietary mix of public internet data (as of August 2023) – Third-party licensed data – Data from labeling services – Internally generated data – No user data from Claude conversations

 

Performance Characteristics

Claude 3 Opus Benchmarks:

Benchmark | Score | Comparison
MMLU      | 86.8% | Comparable to GPT-4
GPQA      | 50.4% | State-of-the-art
MATH      | 60.1% | Strong mathematical reasoning
HumanEval | 84.9% | Excellent coding capability
MMMU      | 59.4% | Multimodal understanding

Context Length Performance:

Claude maintains strong performance across its full 200K token context:

Recall@200K > 95% (on needle-in-haystack tests)

 

Safety and Alignment Features

1. Reduced False Refusals

Claude 3 models are better at distinguishing between harmful and harmless requests:

False_refusal_rate_Claude3 < False_refusal_rate_Claude2

2. Improved Honesty

The models are more likely to express uncertainty when appropriate:

P(admit_uncertainty | uncertain) > P(confident_wrong_answer | uncertain)

3. Steerable Personality

Users can guide Claude’s tone and style while maintaining safety:

Output = f(Input, Personality_guidance, Constitutional_constraints)

 

Claude 3 Model Family Comparison

Computational Efficiency:

Quality: Opus > Sonnet > Haiku
Speed: Haiku > Sonnet > Opus
Cost: Haiku < Sonnet < Opus

Use Case Optimization:

  • Opus: When quality is paramount (research, complex analysis, difficult coding)
  • Sonnet: When balancing quality and speed (general-purpose applications)
  • Haiku: When speed and cost matter most (high-volume processing, simple tasks)

 

Key Innovations Summary

1. Constitutional AI (CAI)
  • Uses AI feedback guided by explicit principles
  • Reduces reliance on expensive human labeling
  • Improves consistency and transparency
  • Mathematical formulation combines supervised learning, RLAIF, and RLHF

2. Harmlessness Without Helpfulness Loss
  • Maintains high capability while being safe
  • Reduces false refusals compared to earlier models
  • Explicit optimization for helpful + harmless + honest

3. Long Context Understanding
  • 200K token context window
  • Strong performance across the full context length
  • Enables processing of entire books, codebases, or long documents

4. Multimodal Capabilities
  • Native image understanding
  • Vision-language reasoning
  • Document and chart analysis

5. Model Family Design
  • Three models (Opus, Sonnet, Haiku) for different use cases
  • Optimized for different points on the quality-speed-cost curve
  • Allows users to choose the right tool for their needs

 

Part 6: Comparative Analysis

 

Mathematical Complexity Comparison

Operation            | GPT (Dense)         | DeepSeek (MoE + MLA)            | Gemini (MoE)      | Claude (Dense)
Attention per Layer  | O(n² × d)           | O(n² × d_c), d_c ≪ d            | O(n² × d), sparse | O(n² × d)
FFN per Layer        | O(n × d²)           | O(n × d² × (N_s + K_r)/N_total) | O(n × d² × K/N)   | O(n × d²)
Parameters per Token | 100%                | 5.5% (37B/671B)                 | ~10-20%           | ~100% (dense)
KV Cache per Token   | O(n_h × d_h) = O(d) | O(d_c + d_h^R) ≈ 0.02 × O(d)    | O(d), quantized   | O(d)
Context Length       | 128K                | 128K                            | 1M+               | 200K

 

Training Efficiency Comparison

Metric                      | GPT-3         | DeepSeek-V3 | Gemini 2.5 Pro       | Claude 3 Opus
Total Parameters            | 175B          | 671B        | ~500B-1T (est.)      | Not disclosed
Active Parameters           | 175B          | 37B         | ~100-200B (est.)     | Not disclosed (dense)
Training Tokens             | ~300B         | 14.8T       | Not disclosed        | Not disclosed
Training Cost               | Not disclosed | $5.576M     | Not disclosed        | Not disclosed
Training Method             | Standard RLHF | MoE + RLAIF | Distillation + RLHF  | Constitutional AI (RLAIF)
FLOPs per Token (Training)  | High          | Low (MoE)   | Medium               | Medium-High (dense)
FLOPs per Token (Inference) | ~2 × 175B     | ~2 × 37B    | ~2 × 100-200B (est.) | Not disclosed

 

Architectural Philosophy Comparison

OpenAI GPT:
  • Philosophy: Simplicity and scale
  • Key Innovation: Demonstrating that scaled Transformers develop emergent capabilities
  • Trade-off: Maximum capability at maximum cost
  • Best For: Applications where cost is secondary to reliability and performance

DeepSeek:
  • Philosophy: Efficiency without compromise
  • Key Innovations: MLA for 98% KV cache reduction; auxiliary-loss-free load balancing; multi-token prediction
  • Trade-off: Architectural complexity for computational efficiency
  • Best For: Cost-sensitive deployments, organizations with limited resources

Gemini:
  • Philosophy: Unified multimodal intelligence
  • Key Innovations: Native multimodal processing; extended reasoning capabilities; distillation from thinking models
  • Trade-off: Complexity in multimodal integration for broader capability
  • Best For: Applications requiring cross-modal understanding, long-context processing

Claude:
  • Philosophy: Safety and alignment through constitutional principles
  • Key Innovations: Constitutional AI (RLAIF); harmlessness without helpfulness loss; reduced false refusals
  • Trade-off: Dense architecture (higher inference cost) for safety guarantees
  • Best For: Applications where safety, alignment, and ethical behavior are paramount

 

Performance Characteristics

Latency:
  • GPT: Moderate (all parameters active)
  • DeepSeek: Low (fewer active parameters, efficient attention)
  • Gemini: Variable (depends on modality and context length)

Memory:
  • GPT: High (full KV cache, all parameters)
  • DeepSeek: Low (compressed KV cache, sparse activation)
  • Gemini: High for long contexts (but optimized)

Throughput:
  • GPT: Moderate (limited by memory bandwidth)
  • DeepSeek: High (sparse activation enables larger batch sizes)
  • Gemini: Moderate to high (depends on modality mix)

 

Use Case Suitability

GPT is Ideal For:
  • General-purpose text generation
  • Applications requiring proven reliability
  • Organizations with substantial computational resources
  • Use cases where consistency is critical

DeepSeek is Ideal For:
  • Cost-sensitive deployments
  • High-throughput applications
  • Organizations with limited GPU resources
  • Applications requiring long context at reasonable cost

Gemini is Ideal For:
  • Multimodal applications (image + text, video + text)
  • Applications requiring very long context (>100K tokens)
  • Complex reasoning tasks
  • Agentic systems combining multiple capabilities

 

Part 7: The Future of LLM Architectures

 

1. Convergence of Approaches

We’re seeing convergence toward hybrid architectures:
  • GPT models exploring sparse activation
  • DeepSeek expanding into multimodal processing
  • Most frontier models adopting MoE for efficiency

2. Efficiency as a First-Class Concern

The success of DeepSeek demonstrates that efficiency innovations can match or exceed brute-force scaling:

Performance ∝ (Scale × Efficiency × Data Quality)

Not just:

Performance ∝ Scale

3. Multimodal as Default

Future models will likely be multimodal by default, as the world’s information exists in multiple modalities.

4. Extended Reasoning

The “thinking” capabilities in Gemini 2.5 and similar models represent a shift toward more deliberate, verifiable reasoning:

Answer = f(Input)  →  Answer = Reasoning(Input) → Verify → Answer

 

Open Research Questions

1. Optimal Sparsity Patterns

What is the optimal balance between: – Number of experts – Expert size – Activation rate – Routing mechanism

2. Scaling Laws for MoE

Traditional scaling laws:

Loss ∝ (Parameters)^(-α) × (Data)^(-β)

How do these change for MoE models where active parameters ≠ total parameters?

3. Cross-Modal Understanding

How can we better measure and improve cross-modal reasoning? Current benchmarks focus on single-modality performance.

4. Efficient Long Context

Can we achieve O(n) or O(n log n) attention without sacrificing quality?

Candidates: – Linear attention mechanisms – State space models – Hierarchical attention

 

Mathematical Frontiers

1. Better Attention Mechanisms

Current research explores alternatives to softmax attention:

Attention_linear(Q, K, V) = φ(Q) (φ(K)^T V)

Where φ is a feature map that allows O(n) complexity.
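A minimal sketch of this idea, using φ(x) = elu(x) + 1, a feature map proposed in the linear-Transformer literature (a choice made for this example, not specified in the text):

import numpy as np

def elu_plus_one(x):
    # phi(x) = elu(x) + 1 keeps features positive so the normalizer stays well-behaved
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Attention_linear(Q, K, V) = phi(Q) (phi(K)^T V), costing O(n × d_k × d_v) instead of O(n² × d)
    Qf, Kf = elu_plus_one(Q), elu_plus_one(K)
    KV = Kf.T @ V                                   # (d_k, d_v) running summary of keys and values
    Z = Qf @ Kf.sum(axis=0, keepdims=True).T        # normalizer, analogous to the softmax denominator
    return (Qf @ KV) / (Z + 1e-6)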

2. Adaptive Computation

Allow models to use more computation for harder problems:

Computation(x) = f(Difficulty(x))

3. Continuous Learning

Current models are static after training. Future models might continuously update:

θ_{t+1} = θ_t + η ∇L(x_t; θ_t)

While maintaining stability and preventing catastrophic forgetting.

 

Conclusion: The Elegant Mathematics of Machine Intelligence

We’ve journeyed through the mathematical foundations of four remarkable AI architectures. What emerges is a picture of elegant simplicity giving rise to extraordinary complexity.

 

The Core Insights

1. Attention is Universal

The attention mechanism—a simple weighted sum based on compatibility scores—turns out to be sufficient for modeling complex relationships in language, vision, and beyond:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

This formula, simple enough to fit on a single line, powers systems that can write poetry, prove theorems, and understand images.

2. Sparsity is Powerful

DeepSeek demonstrated that we don’t need to activate all parameters for every token. Selective activation through MoE:

Output = ∑_{i ∈ TopK} w_i Expert_i(Input)

This achieves comparable performance to dense models while using a fraction of the computation.

3. Compression Preserves Information

MLA showed that we can compress keys and values by 98%:

c_t^KV = W^DKV h_t  (compression)
k_t^C = W^UK c_t^KV  (decompression)

And still maintain performance, because the essential information is preserved in a low-dimensional subspace.

4. Multimodality is Natural

Gemini demonstrated that different modalities can be processed in a unified token space:

h = Transformer([tokens_text; tokens_image; tokens_audio; tokens_video])

The same attention mechanism that relates words to words can relate images to text, audio to video, and any modality to any other.

 

The Human Element

Despite the sophisticated mathematics, these systems ultimately serve human needs. They help us: – Communicate across language barriers – Access and synthesize information – Solve complex problems – Create new forms of art and expression – Understand the world in new ways

 

A Note of Wonder

There’s something profound about the fact that intelligence—whether natural or artificial—can be approximated by mathematical functions. The same principles that describe the motion of planets and the behavior of particles also describe the processing of information and the generation of meaning.

We’ve seen that: – A simple attention formula captures the essence of focus and relevance – Sparse activation mirrors how human experts specialize – Compression reveals the low-dimensional structure of information – Unified processing across modalities reflects how humans integrate sensory information

 

Looking Forward

The mathematics we’ve explored represents our current understanding, but the field continues to evolve rapidly. Future architectures will likely: – Be more efficient (doing more with less) – Be more capable (understanding deeper relationships) – Be more accessible (available to more people and organizations) – Be more aligned (better serving human values and needs)

The journey from simple formulas to intelligent behavior is a testament to both the power of mathematics and the ingenuity of human creativity. As we continue to refine these architectures, we move closer to systems that can truly augment human intelligence and help us tackle the great challenges of our time.

 

Acknowledgments

This document was prepared to illuminate the mathematical foundations of modern AI for a broad audience. We are grateful to the research teams at OpenAI, DeepSeek, Google DeepMind, and Anthropic for their groundbreaking work and for publishing detailed technical reports that make this kind of analysis possible.

Special thanks to the authors of “Attention Is All You Need” (Vaswani et al., 2017), whose elegant architecture laid the foundation for the current generation of AI systems.

 

References

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998-6008. https://arxiv.org/abs/1706.03762
  2. DeepSeek-AI. (2024). DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437. https://arxiv.org/abs/2412.19437
  3. Dai, D., Deng, C., Zhao, C., Xu, R. X., Gao, H., Chen, D., Li, J., Ding, W., Li, M., Xiao, Y., Wang, P., Huang, K., Sui, Y., Ruan, C., Zheng, Z., Yu, K., Cheng, X., Guo, X., Gu, S., … & Bi, J. (2024). DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066.
  4. Gemini Team, Google. (2023). Gemini: A Family of Highly Capable Multimodal Models. arXiv preprint arXiv:2312.11805. https://arxiv.org/abs/2312.11805
  5. Gemini Team, Google. (2025). Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. Technical Report. https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf
  6. Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2024). RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568, 127063.
  7. Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
  8. Fedus, W., Zoph, B., & Shazeer, N. (2021). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1-39.
  9. Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., & Chen, Z. (2020). GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.
  10. Clark, A., de Las Casas, D., Guy, A., Mensch, A., Paganini, M., Hoffmann, J., Damoc, B., Hechtman, B., Cai, T., Borgeaud, S., et al. (2022). Unified scaling laws for routed language models. arXiv preprint arXiv:2202.01169.

Document prepared by:

Hene Aku Kwapong, PhD, MBA
MIT Practice School Alumnus

With the assistance of:

Jeremie Kwapong
New England Innovation Academy

October 9, 2025
