Document prepared with the help of: Jeremie Kwapong, New England Innovation Academy
Date: October 9, 2025
Introduction: Decoding the Language of Intelligent Machines
In 2017, a team of researchers at Google published a paper with a bold title: “Attention Is All You Need.” This work introduced the Transformer architecture, which would revolutionize artificial intelligence and give birth to the large language models that now shape our digital world. But what makes these systems work? What mathematical principles allow a computer to understand context, generate coherent text, and even reason about complex problems?
This document explores the mathematical foundations of three groundbreaking AI architectures: OpenAI’s GPT models, which demonstrated the power of scaling; DeepSeek’s efficient design, which proved that smart engineering can rival brute force; and Google’s Gemini, which extended AI’s reach into the multimodal realm of images, audio, and video.
We’ll present the actual mathematical formulations that power these systems, but explain them in a way that reveals their elegant simplicity. Our goal is to make sophisticated concepts accessible without sacrificing accuracy or depth.
Part 1: The Foundational Architecture – The Transformer
Before diving into the specific innovations of GPT, DeepSeek, and Gemini, we must understand the Transformer architecture that underlies them all. Introduced by Vaswani et al. in 2017, the Transformer solved a fundamental problem: how to process sequences of information in parallel while still capturing relationships between distant elements.
The Attention Mechanism: Learning What Matters
Imagine reading a detective novel. When you encounter the phrase “the butler did it,” your mind instantly connects this to earlier mentions of the butler, the crime scene, and various clues scattered throughout hundreds of pages. You’re not giving equal weight to every word you’ve read—you’re selectively attending to relevant information.
The Transformer does exactly this through its attention mechanism. The mathematics, while precise, capture this intuitive process beautifully.
Scaled Dot-Product Attention
The core attention formula is:

Attention(Q, K, V) = softmax(QK^T / √d_k) V
Let’s unpack this step by step:
The Three Matrices:
- Q (Query): Represents “what am I looking for?” For each word, this encodes what information it needs from other words.
- K (Key): Represents “what information do I have?” Each word advertises what it can offer.
- V (Value): Represents “what is my actual content?” The information that will be retrieved.
The Computation:
- QK^T: We multiply queries by keys (transposed). This creates a matrix where each entry measures how relevant each key is to each query. Think of it as computing compatibility scores.
- Division by √d_k: This scaling factor is crucial. Without it, when d_k (the dimension of our keys) is large, the dot products can become very large, pushing the softmax function into regions where gradients become tiny. The square root scaling keeps the values in a reasonable range. It’s like adjusting the volume on a stereo—too loud and you get distortion, too quiet and you miss important details.
- softmax(…): This converts our compatibility scores into probabilities that sum to 1. It’s a way of saying “distribute your attention across all the words, but focus more on the relevant ones.”
- Multiply by V: Finally, we use these attention weights to compute a weighted sum of the values. We’re gathering information from all words, but in proportion to their relevance.
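To make these four steps concrete, here is a minimal NumPy sketch of scaled dot-product attention. The sequence length, dimensions, and random Q, K, V matrices are illustrative assumptions, not values from any real model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # steps 1 and 2: compatibility scores, scaled
    weights = softmax(scores, axis=-1)    # step 3: each row sums to 1
    return weights @ V, weights           # step 4: weighted sum of the values

# Toy example: 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
output, attn = scaled_dot_product_attention(Q, K, V)
print(output.shape, attn.sum(axis=-1))   # (4, 8) and a vector of ones
```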
Example in Action:
Consider the sentence: “The cat sat on the mat because it was comfortable.”
When processing “it,” the attention mechanism computes:
- High attention to “mat” (it could refer to the mat)
- High attention to “cat” (it could refer to the cat)
- Lower attention to “sat,” “on,” “the” (less relevant for resolving the pronoun)
The context (“because it was comfortable”) helps the model determine that “it” likely refers to the mat, not the cat.
Multi-Head Attention: Multiple Perspectives
A single attention mechanism is powerful, but multiple attention mechanisms working in parallel are even better. This is multi-head attention:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
Why Multiple Heads?
Different heads can learn to attend to different types of relationships:
- One head might focus on syntactic relationships (subject-verb agreement)
- Another might capture semantic relationships (synonyms, antonyms)
- Yet another might track long-range dependencies (pronouns to their antecedents)
The Mathematics:
Each head has its own learned projection matrices:
- W_i^Q ∈ ℝ^(d_model × d_k): Projects the input into query space for head i
- W_i^K ∈ ℝ^(d_model × d_k): Projects the input into key space for head i
- W_i^V ∈ ℝ^(d_model × d_v): Projects the input into value space for head i
After computing attention for each head independently, we concatenate all the outputs and project them back to the model dimension using W^O ∈ ℝ^(hd_v × d_model).
Typical Configuration:
- Number of heads (h): 8 to 96 (depending on model size)
- d_k = d_v = d_model / h (typically 64 for each head in a 512-dimensional model)
This design ensures that the total computational cost is similar to single-head attention with full dimensionality, but we gain the benefit of multiple specialized attention patterns.
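The following sketch, again in NumPy with illustrative random weights, shows one way to realize multi-head attention by slicing the projected queries, keys, and values into h heads and concatenating the results. Production implementations batch this differently, but the arithmetic is the same.

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention, as in the previous sketch
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    # X: (seq_len, d_model); each W_*: (d_model, d_model)
    d_model = X.shape[-1]
    d_k = d_model // h
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for i in range(h):
        # Each head works on its own d_k-sized slice of the projections
        sl = slice(i * d_k, (i + 1) * d_k)
        heads.append(attention(Q[:, sl], K[:, sl], V[:, sl]))
    # Concatenate the h heads and project back to d_model with W_o
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(1)
d_model, h, seq_len = 512, 8, 10
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, h).shape)  # (10, 512)
```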
Positional Encoding: Understanding Order
Unlike recurrent neural networks that process sequences step by step, Transformers process all positions simultaneously. This parallelism is great for speed, but it creates a problem: the model has no inherent sense of word order. “Dog bites man” and “Man bites dog” would look identical!
The solution is positional encoding, which adds position information to each word’s representation:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
Understanding the Formula:
- pos: The position of the word in the sequence (0, 1, 2, …)
- i: The dimension index (0, 1, 2, …, d_model/2)
- 2i and 2i+1: Even and odd dimensions use sine and cosine respectively
Why Sine and Cosine?
This choice is brilliant for several reasons:
- Unique Fingerprints: Each position gets a unique pattern of sine and cosine values across all dimensions.
- Relative Positions: The model can learn to attend to relative positions. Due to trigonometric identities, PE(pos+k) can be expressed as a linear function of PE(pos), making it easy for the model to learn patterns like “attend to the word 3 positions back.”
- Extrapolation: The model can potentially handle sequences longer than those seen during training, since the sinusoidal pattern continues smoothly.
- No Learning Required: Unlike learned positional embeddings, these are fixed functions, reducing the number of parameters.
Visualization:
Imagine each position as a unique musical chord. Position 0 might be a C major chord, position 1 a slightly different chord, and so on. Each dimension contributes a different frequency to this chord, creating a rich, unique signature for each position.
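Below is a short NumPy sketch of the sinusoidal encoding, assuming the standard formulation above. The printed checks confirm that position 0 yields sin(0) = 0 on the even dimensions and cos(0) = 1 on the odd ones.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)    # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)                                                        # (50, 512)
print(np.allclose(pe[0, 0::2], 0.0), np.allclose(pe[0, 1::2], 1.0))    # True True
```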
Position-Wise Feed-Forward Networks
After attention, each position’s representation is processed through a simple two-layer network:

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
Breaking It Down:
- First Linear Layer (xW₁ + b₁): Expands the representation from d_model dimensions to d_ff dimensions (typically d_ff = 4 × d_model). This expansion gives the network more capacity to process information.
- ReLU Activation (max(0, …)): The Rectified Linear Unit acts as a gatekeeper, setting negative values to zero. This non-linearity is crucial—without it, the entire network would collapse to a single linear transformation.
- Second Linear Layer (…W₂ + b₂): Projects back down to d_model dimensions.
Why “Position-Wise”?
The same FFN is applied to each position independently. It’s like having the same expert analyze each word separately, after the attention mechanism has already allowed words to share information with each other.
Typical Dimensions:
- Input/Output: d_model = 512 to 12,288 (depending on model size)
- Hidden layer: d_ff = 2,048 to 49,152 (typically 4× the model dimension)
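A minimal sketch of the position-wise FFN with the typical 4× expansion; the random weights and the d_model = 512, d_ff = 2,048 sizes are purely illustrative.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # x: (seq_len, d_model). The same weights are applied to every position independently.
    hidden = np.maximum(0.0, x @ W1 + b1)   # expand to d_ff and apply ReLU
    return hidden @ W2 + b2                  # project back down to d_model

rng = np.random.default_rng(2)
d_model, d_ff, seq_len = 512, 2048, 10
x = rng.normal(size=(seq_len, d_model))
W1 = rng.normal(size=(d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (10, 512)
```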
Layer Normalization and Residual Connections
Two additional components stabilize training:
Residual Connections:

output = x + Sublayer(x)
The “+x” is a residual connection—we add the input back to the output. This allows gradients to flow directly through the network during training, preventing the vanishing gradient problem in deep networks.
Layer Normalization:

LayerNorm(x) = γ · (x − μ) / √(σ² + ε) + β
Normalizes the activations to have mean 0 and variance 1, then applies learned scaling and shifting. This keeps the values in a reasonable range and stabilizes training.
The Complete Transformer Block
Putting it all together, a Transformer block consists of:
- Multi-head self-attention
- Residual connection and layer normalization
- Position-wise feed-forward network
- Another residual connection and layer normalization
This block is stacked multiple times (6 to 96+ layers) to create the full model.
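Putting the pieces together, here is a toy single-head Transformer block in NumPy that follows the four steps above, using the post-norm arrangement of the original paper. The parameter dictionary p and its key names are assumptions of this sketch, not any library’s API.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each position's vector to mean 0 and variance 1, then scale and shift
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def transformer_block(x, p):
    # 1. Self-attention (a single head here, to keep the sketch short)
    Q, K, V = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    # 2. Residual connection + layer normalization
    x = layer_norm(x + attn @ p["Wo"], p["g1"], p["b1"])
    # 3. Position-wise feed-forward network
    ffn = np.maximum(0.0, x @ p["W1"] + p["c1"]) @ p["W2"] + p["c2"]
    # 4. Another residual connection + layer normalization
    return layer_norm(x + ffn, p["g2"], p["b2"])

rng = np.random.default_rng(3)
d_model, d_ff, seq_len = 64, 256, 10
p = {
    "Wq": rng.normal(size=(d_model, d_model)) * 0.05,
    "Wk": rng.normal(size=(d_model, d_model)) * 0.05,
    "Wv": rng.normal(size=(d_model, d_model)) * 0.05,
    "Wo": rng.normal(size=(d_model, d_model)) * 0.05,
    "W1": rng.normal(size=(d_model, d_ff)) * 0.05, "c1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d_model)) * 0.05, "c2": np.zeros(d_model),
    "g1": np.ones(d_model), "b1": np.zeros(d_model),
    "g2": np.ones(d_model), "b2": np.zeros(d_model),
}
x = rng.normal(size=(seq_len, d_model))
print(transformer_block(x, p).shape)  # (10, 64): same shape in, same shape out
```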
Part 2: OpenAI GPT – Scaling the Transformer
The GPT Philosophy: Simplicity at Scale
OpenAI’s approach with GPT (Generative Pre-trained Transformer) was elegantly simple: take the Transformer decoder, scale it up massively, train it on enormous amounts of text, and let capabilities emerge from scale.
Architecture Details
GPT uses the standard Transformer architecture with a few key choices:
Decoder-Only Architecture:
Unlike the original Transformer (which had both encoder and decoder), GPT uses only the decoder. This means:
- It processes text left-to-right
- Each position can only attend to previous positions (causal masking)
- This makes it naturally suited for text generation
Causal Masking:
The attention mechanism is modified to prevent “looking ahead”:

Attention(Q, K, V) = softmax( QK^T / √d_k + M ) V, with M_ij = 0 if j ≤ i and −∞ if j > i
The mask M ensures that position i can only attend to positions j where j ≤ i. The -∞ values become 0 after softmax, effectively blocking attention to future positions.
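A small NumPy sketch of causal masking: adding −∞ above the diagonal before the softmax guarantees that each position’s attention weights over future positions are exactly zero. The toy sizes are illustrative.

```python
import numpy as np

def causal_self_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Mask M: 0 where j <= i (allowed), -inf where j > i (future positions)
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    scores = scores + mask
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)                        # exp(-inf) = 0: future positions blocked
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(4)
Q = K = V = rng.normal(size=(5, 8))
_, attn = causal_self_attention(Q, K, V)
print(np.allclose(attn, np.tril(attn)))  # True: no attention to future positions
```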
The Mathematics of Scale
Model Dimensions (GPT-3 as example):
- Layers: 96
- d_model: 12,288
- Number of heads: 96
- d_k = d_v: 128 per head
- d_ff: 49,152
- Context length: 2,048 tokens (later extended to 128K in GPT-4)
- Total parameters: ~175 billion (GPT-3)
Attention Complexity:
For a sequence of length n: – Attention computation: O(n² × d_model) – Memory for attention scores: O(n² × h)
This quadratic scaling with sequence length is why context length was initially limited.
Training Objective: Next-Token Prediction
GPT is trained with a simple but powerful objective:

L(θ) = Σ_t log P(x_t | x_1, …, x_{t−1}; θ), maximized over the training corpus
What This Means:
For each position t in the training text:
1. Given all previous tokens (x_1 through x_{t−1})
2. Predict the probability distribution over the next token (x_t)
3. Maximize the log probability of the actual next token
This seemingly simple task forces the model to learn:
- Grammar and syntax
- Factual knowledge
- Reasoning patterns
- Common sense
- And much more
Why It Works:
The key insight is that predicting the next word requires understanding context, relationships, and patterns. To predict the next word in “The capital of France is “, the model must know geography. To predict the next word in “She was happy because “, the model must understand causation and emotion.
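The objective itself is just cross-entropy over next tokens. A minimal sketch, assuming logits of shape (sequence length, vocabulary size): for random predictions the loss is close to the log of the vocabulary size, which is why the printed value is near 6.9.

```python
import numpy as np

def next_token_loss(logits, targets):
    # logits: (seq_len, vocab_size), the model's prediction at each position
    # targets: (seq_len,), the actual next token at each position
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Negative log probability of the correct next token, averaged over positions
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(5)
vocab_size, seq_len = 1000, 6
logits = rng.normal(size=(seq_len, vocab_size))
targets = rng.integers(0, vocab_size, size=seq_len)
print(next_token_loss(logits, targets))   # roughly log(1000) ≈ 6.9 for random logits
```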
Computational Requirements
Training:
- GPT-3 training: ~3.14 × 10²³ FLOPs
- Training time: Several months on thousands of GPUs
- Training data: ~300 billion tokens
Inference:
- For each token generated: ~2 × (number of parameters) FLOPs
- All parameters are active for every token
- Memory: Must store all parameters plus KV cache
Strengths and Limitations
Strengths:
- Proven, reliable architecture
- Excellent general-purpose performance
- Strong few-shot learning capabilities
- Predictable scaling behavior
Limitations:
- High computational cost (all parameters always active)
- Quadratic attention complexity limits context length
- Large memory footprint
- Expensive to train and run
Part 3: DeepSeek – The Efficiency Revolution
The DeepSeek Philosophy: Matching Performance with Fraction of Cost
DeepSeek asked a provocative question: “Can we achieve GPT-level performance while using only 5% of the computational resources per token?” Their answer came through two major innovations: Multi-Head Latent Attention (MLA) for memory efficiency and DeepSeekMoE for computational efficiency.
Innovation 1: Multi-Head Latent Attention (MLA)
The KV cache problem is one of the biggest bottlenecks in LLM inference. For each token generated, the model must store the keys and values from all previous tokens. With long contexts and many attention heads, this memory requirement becomes enormous.
The Compression Strategy
MLA solves this through low-rank compression. Instead of storing full keys and values, it stores compressed representations:
Step 1: Compress Keys and Values

c_t^KV = W^DKV h_t
Where:
- h_t ∈ ℝ^d: The input hidden state for token t
- W^DKV ∈ ℝ^(d_c × d): Down-projection matrix
- c_t^KV ∈ ℝ^(d_c): Compressed latent vector (d_c ≪ d_h × n_h)
Typical Values:
- d (model dimension): 5,120
- n_h (number of heads): 128
- d_h (dimension per head): 128
- d_c (compressed dimension): 512
This means that instead of storing 128 × 128 = 16,384 values per token, we store only 512, a 97% reduction!
Step 2: Decompress Keys

k_t^C = W^UK c_t^KV
Where:
- W^UK ∈ ℝ^(d_h×n_h × d_c): Up-projection matrix for keys
- k_t^C: Decompressed keys for all heads (concatenated)
Step 3: Add Rotary Position Information

k_t^R = RoPE(W^KR h_t)
k_{t,i} = [k_{t,i}^C; k_t^R]
Where:
- W^KR ∈ ℝ^(d_h^R × d): Matrix for RoPE keys
- RoPE(·): Rotary Position Embedding function
- k_t^R: Position-dependent component (shared across heads)
- [·; ·]: Concatenation
Why Separate Position Information?
RoPE (Rotary Position Embedding) encodes position by rotating query and key vectors, which captures relative positions more directly than the original sinusoidal encoding. By keeping the position-dependent component separate and shared across heads, we save even more memory.
Step 4: Decompress Values

v_t^C = W^UV c_t^KV
Where:
- W^UV ∈ ℝ^(d_h×n_h × d_c): Up-projection matrix for values
Query Compression
Queries are also compressed, but this is for reducing activation memory during training, not inference:

c_t^Q = W^DQ h_t
q_t^C = W^UQ c_t^Q
q_t^R = RoPE(W^QR c_t^Q)
q_t = [q_t^C; q_t^R]
Where:
- c_t^Q ∈ ℝ^(d_c’): Compressed query latent (d_c’ ≪ d_h × n_h)
- W^DQ ∈ ℝ^(d_c’ × d): Query down-projection
- W^UQ ∈ ℝ^(d_h×n_h × d_c’): Query up-projection
- W^QR ∈ ℝ^(d_h×n_h × d_c’): Query RoPE matrix
Final Attention Computation

o_{t,i} = Σ_{j=1}^{t} softmax_j( q_{t,i}^T k_{j,i} / √(d_h + d_h^R) ) v_{j,i}^C

u_t = W^O [o_{t,1}; o_{t,2}; …; o_{t,n_h}]
Where:
- W^O ∈ ℝ^(d × d_h×n_h): Output projection matrix
- o_{t,i}: Attention output for head i at position t
Memory Savings:
For DeepSeek-V3 with 128 heads:
- Standard MHA: 128 × 128 × 2 = 32,768 values per token
- MLA: 512 + 128 = 640 values per token
- Reduction: 98%
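A sketch of the compress-then-cache idea behind MLA, using the dimensions quoted above and ignoring the RoPE component; the random matrices stand in for learned projections, and real implementations fuse these steps for speed.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n_h, d_h, d_c = 5120, 128, 128, 512             # dimensions quoted in the text

# Learned projections (random stand-ins here)
W_DKV = rng.normal(size=(d_c, d)) * 0.02           # down-projection
W_UK  = rng.normal(size=(n_h * d_h, d_c)) * 0.02   # up-projection for keys
W_UV  = rng.normal(size=(n_h * d_h, d_c)) * 0.02   # up-projection for values

h_t = rng.normal(size=(d,))     # hidden state for one token

# Cache only the compressed latent instead of the full keys and values
c_KV = W_DKV @ h_t              # shape (512,): this is what gets cached
k_C  = W_UK @ c_KV              # decompressed keys,   shape (16384,)
v_C  = W_UV @ c_KV              # decompressed values, shape (16384,)

full_cache = 2 * n_h * d_h      # standard MHA: keys + values per token
mla_cache  = d_c                # MLA: compressed latent per token (RoPE part excluded here)
print(full_cache, mla_cache, f"{1 - mla_cache / full_cache:.1%} smaller")
```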
Innovation 2: DeepSeekMoE Architecture
The second major innovation is the Mixture-of-Experts (MoE) architecture, which activates only a small fraction of parameters for each token.
The Basic MoE Formula

h_t’ = u_t + Σ_{i=1}^{N_s} FFN_i^{(s)}(u_t) + Σ_{i=1}^{N_r} g_{i,t} FFN_i^{(r)}(u_t)
Where:
- u_t: Input to the FFN layer
- N_s: Number of shared experts (always activated)
- N_r: Number of routed experts (selectively activated)
- FFN_i^{(s)}(·): i-th shared expert
- FFN_i^{(r)}(·): i-th routed expert
- g_{i,t}: Gating value for routed expert i at token t
- h_t’: Output of the FFN layer
DeepSeek-V3 Configuration:
- N_s = 2 shared experts
- N_r = 256 routed experts
- K_r = 8 routed experts activated per token
- Expert size: Each expert has ~15M parameters

Total Computation:
- Shared experts: 2 experts (always active)
- Routed experts: 8 experts (selected from 256)
- Total active: 10 experts per token
- Percentage: 10/258 ≈ 3.9% of FFN parameters
The Routing Mechanism
Step 1: Compute Affinity Scores

s_{i,t} = Sigmoid(u_t^T e_i)
Where:
- e_i ∈ ℝ^d: Centroid vector for expert i (learned parameter)
- s_{i,t}: Token-to-expert affinity score (between 0 and 1)
- Sigmoid: Ensures scores are in (0, 1) range
Intuition:
Each expert has a “specialty” represented by its centroid vector e_i. The affinity score measures how well the current token matches that specialty. Think of it as asking “Is this expert relevant for processing this token?”
Step 2: Select Top-K Experts

g’_{i,t} = s_{i,t} if s_{i,t} ∈ TopK({s_{j,t} | 1 ≤ j ≤ N_r}, K_r), and 0 otherwise
Only the K_r experts with highest affinity scores are selected.
Step 3: Normalize Gating Values

g_{i,t} = g’_{i,t} / Σ_{j=1}^{N_r} g’_{j,t}
The gating values are normalized so they sum to 1, ensuring the output is a proper weighted combination.
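A toy version of the three routing steps, with far fewer and far smaller experts than DeepSeek-V3 and randomly initialized expert weights; the helper names are assumptions of this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_expert(rng, d):
    # A tiny two-layer FFN with random weights standing in for a trained expert
    W1 = rng.normal(size=(d, 2 * d)) * 0.05
    W2 = rng.normal(size=(2 * d, d)) * 0.05
    return lambda x: np.maximum(0.0, x @ W1) @ W2

def moe_layer(u_t, shared_experts, routed_experts, centroids, k_r):
    # Step 1: affinity of this token to each routed expert
    s = sigmoid(centroids @ u_t)                  # (N_r,)
    # Step 2: keep only the K_r highest-affinity experts
    top = np.argsort(s)[-k_r:]
    g = np.zeros_like(s)
    g[top] = s[top]
    # Step 3: normalize the gating values so they sum to 1
    g = g / g.sum()
    # Shared experts always run; routed experts run only if selected
    out = u_t + sum(f(u_t) for f in shared_experts)
    out = out + sum(g[i] * routed_experts[i](u_t) for i in top)
    return out

rng = np.random.default_rng(7)
d, N_s, N_r, K_r = 64, 2, 16, 4    # toy sizes; DeepSeek-V3 uses 2 / 256 / 8
shared = [make_expert(rng, d) for _ in range(N_s)]
routed = [make_expert(rng, d) for _ in range(N_r)]
centroids = rng.normal(size=(N_r, d))
u_t = rng.normal(size=(d,))
print(moe_layer(u_t, shared, routed, centroids, K_r).shape)  # (64,)
```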
Innovation 3: Auxiliary-Loss-Free Load Balancing
Traditional MoE models face a critical problem: some experts become overused while others are underutilized. This is typically solved with an auxiliary loss that penalizes imbalance, but this loss can hurt model performance.
DeepSeek’s solution is elegant: adjust the routing dynamically without modifying the training loss.
The Bias Adjustment Mechanism

g’_{i,t} = s_{i,t} if (s_{i,t} + b_i) ∈ TopK({s_{j,t} + b_j | 1 ≤ j ≤ N_r}, K_r), and 0 otherwise
Where: – b_i: Bias term for expert i (dynamically adjusted)
Key Insight:
The bias b_i is used only for routing decisions (TopK selection), not for computing the gating values. This means:
- Routing is influenced by the bias (encouraging balance)
- Gating values remain based on true affinity (preserving performance)
Dynamic Bias Update
At the end of each training step:

b_i ← b_i − γ  if expert i is overloaded
b_i ← b_i + γ  if expert i is underloaded
Where: – γ: Bias update speed (hyperparameter, typically 0.01)
Load Criteria:
An expert is considered:
- Overloaded if it receives more than (1 + ε) × (K_r / N_r) × batch_size tokens
- Underloaded if it receives fewer than (1 − ε) × (K_r / N_r) × batch_size tokens
- Balanced otherwise
Where ε is a tolerance parameter (typically 0.2)
How It Works:
- If an expert is overloaded, we decrease its bias, making it less likely to be selected
- If an expert is underloaded, we increase its bias, making it more likely to be selected
- Over time, this naturally balances the load without forcing artificial constraints
Analogy:
Think of it like dynamic pricing: if a restaurant is too crowded, prices go up slightly (negative bias), discouraging some customers. If it’s empty, prices go down (positive bias), attracting more customers. The quality of the food (true affinity) doesn’t change, but the selection behavior adjusts.
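A sketch of the bias-update rule under the load criteria above, assuming we have already counted how many tokens each expert received in the batch; γ and ε take the typical values quoted in the text.

```python
import numpy as np

def update_expert_biases(bias, tokens_per_expert, k_r, n_r, batch_tokens, gamma=0.01, eps=0.2):
    # Expected load per expert if routing were perfectly balanced
    expected = (k_r / n_r) * batch_tokens
    overloaded  = tokens_per_expert > (1 + eps) * expected
    underloaded = tokens_per_expert < (1 - eps) * expected
    # Overloaded experts become less likely to be selected, underloaded ones more likely.
    # The bias affects only TopK selection, never the gating values themselves.
    return bias - gamma * overloaded + gamma * underloaded

rng = np.random.default_rng(8)
n_r, k_r, batch_tokens = 16, 4, 10_000
bias = np.zeros(n_r)
# Simulated (imbalanced) per-expert token counts for one batch
tokens_per_expert = rng.multinomial(batch_tokens * k_r, rng.dirichlet(np.ones(n_r)))
print(update_expert_biases(bias, tokens_per_expert, k_r, n_r, batch_tokens))
```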
Complementary Sequence-Wise Balance Loss
While the auxiliary-loss-free strategy handles batch-level balance, a small auxiliary loss prevents extreme imbalance within individual sequences:

L_Bal = α Σ_{i=1}^{N_r} f_i P_i

where

f_i = (N_r / (K_r T)) Σ_{t=1}^{T} 1[s_{i,t} ∈ TopK({s_{j,t}}, K_r)]
P_i = (1/T) Σ_{t=1}^{T} s_{i,t} / Σ_{j=1}^{N_r} s_{j,t}

Breaking It Down:
- f_i: Fraction of tokens in the sequence routed to expert i
- P_i: Average routing probability for expert i
- α: Balance loss coefficient (typically 0.001-0.01)
- T: Sequence length
- 1[·]: Indicator function (1 if condition is true, 0 otherwise)
Why This Loss?
The product f_i × P_i is minimized when experts are balanced. If an expert has high routing probability (P_i) but low actual usage (f_i), or vice versa, the loss increases. This gently encourages balance within each sequence without the strong performance penalty of traditional auxiliary losses.
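A sketch of the sequence-wise balance loss under the formulation above, assuming sigmoid affinity scores for one sequence and normalizing them per token to obtain routing probabilities.

```python
import numpy as np

def sequence_balance_loss(scores, k_r, alpha=0.003):
    # scores: (T, N_r) token-to-expert affinity scores for one sequence
    T, n_r = scores.shape
    # Which experts each token actually routes to (top-K_r by score)
    selected = np.argsort(scores, axis=-1)[:, -k_r:]
    routed = np.zeros_like(scores)
    routed[np.arange(T)[:, None], selected] = 1.0
    f = (n_r / (k_r * T)) * routed.sum(axis=0)            # fraction routed to each expert
    probs = scores / scores.sum(axis=-1, keepdims=True)    # normalize scores per token
    P = probs.mean(axis=0)                                 # average routing probability
    return alpha * np.sum(f * P)

rng = np.random.default_rng(9)
scores = 1.0 / (1.0 + np.exp(-rng.normal(size=(128, 16))))  # sigmoid affinities
print(sequence_balance_loss(scores, k_r=4))
```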
Multi-Token Prediction
DeepSeek-V3 also employs a multi-token prediction objective:

L_MTP = Σ_{k=1}^{K} λ_k · L_k, where L_k is the cross-entropy loss for predicting the token k positions ahead

Where:
- K: Number of future tokens to predict (typically 2-4)
- λ_k: Weight for predicting k steps ahead (typically decreasing)
Why Predict Multiple Tokens?
- Better Representations: Forces the model to learn representations that are useful for multiple future predictions
- Speculative Decoding: The additional prediction heads can be used for speculative decoding during inference, generating multiple tokens in parallel
- Improved Performance: Empirically shown to improve benchmark scores
DeepSeek-V3 Complete Specifications
Model Architecture:
- Total parameters: 671B
- Active parameters per token: 37B (5.5%)
- Layers: 61
- Model dimension (d): 7,168
- Number of attention heads (n_h): 128
- FFN hidden dimension: 18,432
- Number of shared experts (N_s): 2
- Number of routed experts (N_r): 256
- Activated routed experts (K_r): 8
- Context length: 128K tokens
- KV compression dimension (d_c): 512
- Query compression dimension (d_c’): 1,536

Training Details:
- Training tokens: 14.8 trillion
- Training cost: $5.576 million (2.788M H800 GPU hours)
- Training time: ~2 months
- Batch size: 9,216 sequences
- Learning rate: Peak 4.2 × 10⁻⁴, cosine decay
- Precision: FP8 mixed precision
Performance:
- MMLU: 88.5
- MMLU-Pro: 75.9
- GPQA: 59.1
- MATH-500: 90.2
- Codeforces: 78.3 percentile
Part 4: Google Gemini – The Multimodal Frontier
The Gemini Philosophy: Unified Multimodal Understanding
While GPT and DeepSeek primarily process text, Gemini was designed from the ground up to understand multiple modalities—text, images, audio, and video—in a unified architecture. This isn’t just about processing different types of data separately; it’s about understanding relationships across modalities.
Sparse Mixture-of-Experts Foundation
Like DeepSeek, Gemini 2.5 uses a sparse MoE architecture, but optimized for multimodal processing:

y = Σ_{i ∈ TopK(Router(x), K)} Router_i(x) · Expert_i(x)

Where:
- N: Total number of experts (not publicly disclosed, estimated 100-200)
- K: Number of activated experts per token (estimated 10-20)
- Router(x): Learned routing function
- Expert_i(·): i-th expert FFN
Routing Function:

Router(x) = softmax(W_r x + b_r)

Where:
- W_r ∈ ℝ^(N × d): Routing weight matrix
- b_r ∈ ℝ^N: Routing bias vector
Multimodal Token Processing
The revolutionary aspect of Gemini is how it processes different modalities in a unified token space:

tokens = [T_text; T_image; T_audio; T_video]

Where [·; ·] denotes concatenation in the sequence dimension.

Text Tokenization
Standard subword tokenization:

T_text = E_text[BPE(text)]

Where:
- BPE: Byte-Pair Encoding algorithm
- E_text ∈ ℝ^(V × d): Text embedding matrix
- V: Vocabulary size (~256K tokens)
Image Tokenization
Images are divided into patches and embedded:

T_image = Linear(Patchify(image, patch_size)) + PE_2D

Where:
- patch_size: Typically 14×14 or 16×16 pixels
- Linear: Learned linear projection to model dimension
- PE_2D: 2D positional encoding (preserves spatial relationships)
For a 224×224 image with 14×14 patches: – Number of patches: (224/14)² = 256 patches – Each patch becomes one token – Total: 256 image tokens
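A sketch of the patch-embedding arithmetic for the 224×224 example, with a random projection standing in for the learned linear layer and the 2D positional encoding omitted.

```python
import numpy as np

def patchify(image, patch_size):
    # image: (H, W, C) -> (num_patches, patch_size * patch_size * C)
    H, W, C = image.shape
    rows, cols = H // patch_size, W // patch_size
    patches = image[:rows * patch_size, :cols * patch_size]
    patches = patches.reshape(rows, patch_size, cols, patch_size, C)
    return patches.transpose(0, 2, 1, 3, 4).reshape(rows * cols, -1)

rng = np.random.default_rng(10)
image = rng.random((224, 224, 3))
patches = patchify(image, patch_size=14)
print(patches.shape)                       # (256, 588): 256 patches, each 14*14*3 values
W_embed = rng.normal(size=(patches.shape[1], 512)) * 0.02
image_tokens = patches @ W_embed           # linear projection to the model dimension
print(image_tokens.shape)                  # (256, 512): one token per patch
```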
Audio Tokenization
Audio is processed in overlapping windows:

T_audio = CNN(STFT(audio)) + PE_temporal

Where:
- STFT: Short-Time Fourier Transform (converts audio to frequency domain)
- CNN: Convolutional neural network for feature extraction
- PE_temporal: Temporal positional encoding
Typical Configuration: – Window size: 25ms – Hop length: 10ms – For 1 second of audio: ~100 audio tokens
Video Tokenization
Video combines spatial and temporal processing:

T_video = [Patchify(frame_1); …; Patchify(frame_n)] + PE_spatiotemporal, with frame_1, …, frame_n = Sample(video, fps)

Where:
- Sample: Samples frames at a specified rate (typically 1-4 fps)
- PE_spatiotemporal: 3D positional encoding (spatial + temporal)
For a 1-minute video at 1 fps: – 60 frames – 256 tokens per frame – Total: 15,360 video tokens
Cross-Modal Attention
The key to multimodal understanding is allowing different modalities to attend to each other:

Attention(Q_text, K_all, V_all) = softmax(Q_text K_all^T / √d_k) V_all, where K_all and V_all are computed over the tokens of every modality
Example:
When processing the query “What color is the car in the image?”: – Text tokens attend to image tokens to locate the car – Image tokens attend to text tokens to understand what’s being asked – The model generates a response by integrating both modalities
Long Context Processing
Gemini 2.5 Pro supports context lengths exceeding 1 million tokens through several optimizations:
Efficient Attention Mechanisms

Sparse Attention Patterns:

Instead of full O(n²) attention, use structured sparsity:

Attention_sparse(Q, K, V) = softmax(Q [K_local; K_global]^T / √d_k) [V_local; V_global]

Where:
- K_local, V_local: Keys and values from nearby positions (e.g., within 512 tokens)
- K_global, V_global: Keys and values from selected global positions (e.g., every 64th token)
This reduces complexity from O(n²) to O(n × √n) or better.
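A sketch of which key positions a single query would attend to under a local-window-plus-strided-global pattern; the window and stride values follow the examples above, and the exact sparse pattern Gemini uses is not public.

```python
import numpy as np

def sparse_attention_positions(query_pos, local_window=512, global_stride=64):
    # Positions this query may attend to: a local window plus strided "global" positions
    local = np.arange(max(0, query_pos - local_window), query_pos + 1)
    global_ = np.arange(0, query_pos + 1, global_stride)
    return np.union1d(local, global_)

positions = sparse_attention_positions(query_pos=50_000)
print(len(positions))   # ~1,290 key positions instead of 50,001 under full causal attention
```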
Memory-Efficient KV Cache Quantization:
Storing keys and values in 8-bit precision instead of 16-bit reduces memory by 50% with minimal quality loss.
Selective Caching:
Only cache the most important positions based on recent attention patterns.
Extended Reasoning: The Thinking Process
Gemini 2.5 incorporates “thinking” capabilities through extended reasoning:
Chain-of-Thought Generation:
The model generates intermediate reasoning steps before the final answer.
Verification:
The model evaluates its own reasoning and regenerates if confidence is low.
Mathematical Formulation:

P(r_1, …, r_M, y | x) = [ Π_{i=1}^{M} P(r_i | x, r_1, …, r_{i−1}) ] · P(y | x, r_1, …, r_M)

Where:
- r_i: i-th reasoning step
- M: Number of reasoning steps
- y: Final answer
- x: Input
Distillation for Smaller Models
Smaller Gemini models (Flash, Flash-Lite) learn from larger models through distillation:

L_distill = KL(P_teacher ‖ P_student) = Σ_v P_teacher(v | x) log [ P_teacher(v | x) / P_student(v | x) ]

Where P_teacher and P_student are the teacher’s and student’s next-token distributions over the vocabulary.

K-Sparse Approximation:

To reduce storage, only the top-K logits from the teacher are stored:

P̂_teacher(v | x) ∝ P_teacher(v | x) if v ∈ TopK(P_teacher(· | x), K), and 0 otherwise
Typical K: 256-1024 (out of ~256K vocabulary)
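A sketch of distillation restricted to the teacher’s top-K logits, assuming a KL-style objective; the exact loss Gemini uses is not disclosed, so this only illustrates the K-sparse idea.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def topk_distillation_loss(teacher_logits, student_logits, k):
    # Keep only the teacher's top-k tokens and renormalize both models over that support
    top = np.argsort(teacher_logits)[-k:]
    p_teacher = softmax(teacher_logits[top])
    p_student = softmax(student_logits)[top]
    p_student = p_student / p_student.sum()
    # KL(teacher || student) restricted to the stored top-k tokens
    return np.sum(p_teacher * np.log(p_teacher / p_student))

rng = np.random.default_rng(11)
vocab = 256_000
teacher = rng.normal(size=vocab)
student = teacher + 0.5 * rng.normal(size=vocab)   # a noisy copy of the teacher
print(topk_distillation_loss(teacher, student, k=256))
```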
Training Objectives
Gemini uses multiple training objectives:
- Next-Token Prediction (Primary): the same autoregressive objective used by GPT, applied to the unified multimodal token sequence.
- Multimodal Alignment: Ensures text and image representations are aligned.
- Instruction Following: supervised fine-tuning on instruction-response pairs.
- Preference Learning: responses preferred by human raters are scored above rejected ones, with a pairwise loss of the form

L_pref = −log σ( r(x, y_preferred) − r(x, y_rejected) )

Where:
- σ: Sigmoid function
- r(x, y): Reward model score
Gemini 2.5 Pro Specifications
Model Architecture:
- Architecture: Sparse MoE Transformer
- Total parameters: Not disclosed (estimated 500B-1T)
- Active parameters: Estimated 100-200B per token
- Context length: 1M+ tokens (2M for some versions)
- Supported modalities: Text, Image, Audio, Video
- Output modalities: Text, Audio (experimental)
- Output length: Up to 64K tokens
Training Details:
- Training data cutoff: January 2025
- Training infrastructure: TPUv5p pods
- Precision: Mixed precision (BF16/FP8)
- Training scale: Multiple datacenters, thousands of accelerators
Performance Highlights:
- MMLU: ~90 (estimated)
- Codeforces: High percentile
- Multimodal benchmarks: State-of-the-art
- Video understanding: Up to 3 hours of video
- Long context: Excellent performance up to 1M tokens
Part 5: Comparative Analysis
Mathematical Complexity Comparison
| Operation | GPT (Dense) | DeepSeek (MoE + MLA) | Gemini (MoE) |
| --- | --- | --- | --- |
| Attention per Layer | O(n² × d) | O(n² × d_c), d_c ≪ d | O(n² × d) with sparse patterns |
| FFN per Layer | O(n × d²) | O(n × d² × (N_s + K_r)/N_total) | O(n × d² × K/N) |
| Parameters per Token | 100% | 5.5% (37B/671B) | ~10-20% |
| KV Cache per Token | O(n_h × d_h) = O(d) | O(d_c + d_h^R) ≈ 0.02 × O(d) | O(d) with quantization |
| Context Length | 128K | 128K | 1M+ |
Training Efficiency Comparison
| Metric | GPT-3 | DeepSeek-V3 | Gemini 2.5 Pro |
| --- | --- | --- | --- |
| Total Parameters | 175B | 671B | ~500B-1T (est.) |
| Active Parameters | 175B | 37B | ~100-200B (est.) |
| Training Tokens | ~300B | 14.8T | Not disclosed |
| Training Cost | Not disclosed | $5.576M | Not disclosed |
| FLOPs per Token (Training) | High | Low (due to MoE) | Medium |
| FLOPs per Token (Inference) | ~2 × 175B | ~2 × 37B | ~2 × 100-200B (est.) |
Architectural Philosophy Comparison
OpenAI GPT:
- Philosophy: Simplicity and scale
- Key Innovation: Demonstrating that scaled Transformers develop emergent capabilities
- Trade-off: Maximum capability at maximum cost
- Best For: Applications where cost is secondary to reliability and performance

DeepSeek:
- Philosophy: Efficiency without compromise
- Key Innovations: MLA for 98% KV cache reduction; auxiliary-loss-free load balancing; multi-token prediction
- Trade-off: Architectural complexity for computational efficiency
- Best For: Cost-sensitive deployments, organizations with limited resources

Gemini:
- Philosophy: Unified multimodal intelligence
- Key Innovations: Native multimodal processing; extended reasoning capabilities; extreme long context (1M+ tokens)
- Trade-off: Complexity in handling multiple modalities
- Best For: Applications requiring vision, audio, or video understanding
Performance Characteristics
Latency: – GPT: Moderate (all parameters active) – DeepSeek: Low (fewer active parameters, efficient attention) – Gemini: Variable (depends on modality and context length)
Memory: – GPT: High (full KV cache, all parameters) – DeepSeek: Low (compressed KV cache, sparse activation) – Gemini: High for long contexts (but optimized)
Throughput: – GPT: Moderate (limited by memory bandwidth) – DeepSeek: High (sparse activation enables larger batch sizes) – Gemini: Moderate to high (depends on modality mix)
Use Case Suitability
GPT is Ideal For: – General-purpose text generation – Applications requiring proven reliability – Organizations with substantial computational resources – Use cases where consistency is critical
DeepSeek is Ideal For: – Cost-sensitive deployments – High-throughput applications – Organizations with limited GPU resources – Applications requiring long context at reasonable cost
Gemini is Ideal For: – Multimodal applications (image + text, video + text) – Applications requiring very long context (>100K tokens) – Complex reasoning tasks – Agentic systems combining multiple capabilities
Part 6: The Future of LLM Architectures
Emerging Trends
- Convergence of Approaches

We’re seeing convergence toward hybrid architectures: GPT models exploring sparse activation, DeepSeek expanding into multimodal, and all models adopting MoE for efficiency.
- Efficiency as a First-Class Concern

The success of DeepSeek demonstrates that efficiency innovations can match or exceed brute-force scaling: capability is a function of architecture, data, and compute together, not of raw compute alone.
- Multimodal as Default
Future models will likely be multimodal by default, as the world’s information exists in multiple modalities.
- Extended Reasoning

The “thinking” capabilities in Gemini 2.5 and similar models represent a shift toward more deliberate, verifiable reasoning.
Open Research Questions
- Optimal Sparsity Patterns

What is the optimal balance between the number of experts, expert size, activation rate, and routing mechanism?
- Scaling Laws for MoE

Traditional scaling laws relate loss to parameter count N and data size D (for example, the Chinchilla form L(N, D) ≈ E + A/N^α + B/D^β). How do these change for MoE models where active parameters ≠ total parameters?
- Cross-Modal Understanding
How can we better measure and improve cross-modal reasoning? Current benchmarks focus on single-modality performance.
- Efficient Long Context
Can we achieve O(n) or O(n log n) attention without sacrificing quality?
Candidates include linear attention mechanisms, state space models, and hierarchical attention.
Mathematical Frontiers
- Better Attention Mechanisms

Current research explores alternatives to softmax attention, such as linear attention:

Attention_linear(Q, K, V) = φ(Q) (φ(K)^T V)
Where φ is a feature map that allows O(n) complexity.
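A sketch of linear attention with a simple non-negative feature map φ (a ReLU shifted by a small constant); by computing φ(K)^T V once, the cost grows linearly with sequence length instead of quadratically.

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # phi is the feature map; a shifted ReLU keeps all features positive
    Qp, Kp = phi(Q), phi(K)                 # (n, d)
    KV = Kp.T @ V                            # (d, d), computed once: O(n * d^2)
    normalizer = Qp @ Kp.sum(axis=0)         # (n,)
    return (Qp @ KV) / normalizer[:, None]   # O(n * d^2) instead of O(n^2 * d)

rng = np.random.default_rng(12)
n, d = 4096, 64
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)   # (4096, 64), cost linear in sequence length
```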
- Adaptive Computation
Allow models to use more computation for harder problems, for example by varying the number of reasoning steps or active experts with input difficulty.
- Continuous Learning

Current models are static after training. Future models might continuously update their parameters as new data arrives, while maintaining stability and preventing catastrophic forgetting.
Conclusion: The Elegant Mathematics of Machine Intelligence
We’ve journeyed through the mathematical foundations of three remarkable AI architectures. What emerges is a picture of elegant simplicity giving rise to extraordinary complexity.
The Core Insights
- Attention is Universal

The attention mechanism—a simple weighted sum based on compatibility scores—turns out to be sufficient for modeling complex relationships in language, vision, and beyond:

Attention(Q, K, V) = softmax(QK^T / √d_k) V
This formula, simple enough to fit on a single line, powers systems that can write poetry, prove theorems, and understand images.
- Sparsity is Powerful

DeepSeek demonstrated that we don’t need to activate all parameters for every token. Selective activation through MoE, in which each token is processed by only a handful of the available experts, achieves comparable performance to dense models while using a fraction of the computation.
- Compression Preserves Information

MLA showed that we can compress keys and values by 98% and still maintain performance, because the essential information is preserved in a low-dimensional subspace.
- Multimodality is Natural

Gemini demonstrated that different modalities can be processed in a unified token space.
The same attention mechanism that relates words to words can relate images to text, audio to video, and any modality to any other.
The Human Element
Despite the sophisticated mathematics, these systems ultimately serve human needs. They help us: – Communicate across language barriers – Access and synthesize information – Solve complex problems – Create new forms of art and expression – Understand the world in new ways
A Note of Wonder
There’s something profound about the fact that intelligence—whether natural or artificial—can be approximated by mathematical functions. The same principles that describe the motion of planets and the behavior of particles also describe the processing of information and the generation of meaning.
We’ve seen that: – A simple attention formula captures the essence of focus and relevance – Sparse activation mirrors how human experts specialize – Compression reveals the low-dimensional structure of information – Unified processing across modalities reflects how humans integrate sensory information
Looking Forward
The mathematics we’ve explored represents our current understanding, but the field continues to evolve rapidly. Future architectures will likely: – Be more efficient (doing more with less) – Be more capable (understanding deeper relationships) – Be more accessible (available to more people and organizations) – Be more aligned (better serving human values and needs)
The journey from simple formulas to intelligent behavior is a testament to both the power of mathematics and the ingenuity of human creativity. As we continue to refine these architectures, we move closer to systems that can truly augment human intelligence and help us tackle the great challenges of our time.
Acknowledgments
This document was prepared to illuminate the mathematical foundations of modern AI for a broad audience. We are grateful to the research teams at OpenAI, DeepSeek, and Google DeepMind for their groundbreaking work and for publishing detailed technical reports that make this kind of analysis possible.
Special thanks to the authors of “Attention Is All You Need” (Vaswani et al., 2017), whose elegant architecture laid the foundation for the current generation of AI systems.
References
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998-6008. https://arxiv.org/abs/1706.03762
- DeepSeek-AI. (2024). DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437. https://arxiv.org/abs/2412.19437
- Dai, D., Deng, C., Zhao, C., Xu, R. X., Gao, H., Chen, D., Li, J., Ding, W., Li, M., Xiao, Y., Wang, P., Huang, K., Sui, Y., Ruan, C., Zheng, Z., Yu, K., Cheng, X., Guo, X., Gu, S., … & Bi, J. (2024). DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066.
- Gemini Team, Google. (2023). Gemini: A Family of Highly Capable Multimodal Models. arXiv preprint arXiv:2312.11805. https://arxiv.org/abs/2312.11805
- Gemini Team, Google. (2025). Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. Technical Report. https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf
- Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2024). RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568, 127063.
- Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
- Fedus, W., Zoph, B., & Shazeer, N. (2021). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1-39.
- Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., & Chen, Z. (2020). GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.
- Clark, A., de Las Casas, D., Guy, A., Mensch, A., Paganini, M., Hoffmann, J., Damoc, B., Hechtman, B., Cai, T., Borgeaud, S., et al. (2022). Unified scaling laws for routed language models. arXiv preprint arXiv:2202.01169.
Document prepared by:
Hene Aku Kwapong, PhD, MBA
MIT Practice School Alumni
With the assistance of: Jeremie Kwapong
New England Innovation Academy
October 9, 2025