The Mathematics Behind Modern AI: A Deep Dive into GPT, DeepSeek, and Gemini


Document prepared with the help of:

Jeremie Kwapong, New England Innovation Academy

Date: October 9, 2025

Introduction: Decoding the Language of Intelligent Machines

In 2017, a team of researchers at Google published a paper with a bold title: “Attention Is All You Need.” This work introduced the Transformer architecture, which would revolutionize artificial intelligence and give birth to the large language models that now shape our digital world. But what makes these systems work? What mathematical principles allow a computer to understand context, generate coherent text, and even reason about complex problems?

This document explores the mathematical foundations of three groundbreaking AI architectures: OpenAI’s GPT models, which demonstrated the power of scaling; DeepSeek’s efficient design, which proved that smart engineering can rival brute force; and Google’s Gemini, which extended AI’s reach into the multimodal realm of images, audio, and video.

We’ll present the actual mathematical formulations that power these systems, but explain them in a way that reveals their elegant simplicity. Our goal is to make sophisticated concepts accessible without sacrificing accuracy or depth.

Part 1: The Foundational Architecture – The Transformer

Before diving into the specific innovations of GPT, DeepSeek, and Gemini, we must understand the Transformer architecture that underlies them all. Introduced by Vaswani et al. in 2017, the Transformer solved a fundamental problem: how to process sequences of information in parallel while still capturing relationships between distant elements.

The Attention Mechanism: Learning What Matters

Imagine reading a detective novel. When you encounter the phrase “the butler did it,” your mind instantly connects this to earlier mentions of the butler, the crime scene, and various clues scattered throughout hundreds of pages. You’re not giving equal weight to every word you’ve read—you’re selectively attending to relevant information.

The Transformer does exactly this through its attention mechanism. The mathematics, while precise, capture this intuitive process beautifully.

Scaled Dot-Product Attention

The core attention formula is:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

Let’s unpack this step by step:

The Three Matrices:
– Q (Query): Represents “what am I looking for?” For each word, this encodes what information it needs from other words.
– K (Key): Represents “what information do I have?” Each word advertises what it can offer.
– V (Value): Represents “what is my actual content?” The information that will be retrieved.

The Computation:

  1. QK^T: We multiply queries by keys (transposed). This creates a matrix where each entry measures how relevant each key is to each query. Think of it as computing compatibility scores.
  2. Division by √d_k: This scaling factor is crucial. Without it, when d_k (the dimension of our keys) is large, the dot products can become very large, pushing the softmax function into regions where gradients become tiny. The square root scaling keeps the values in a reasonable range. It’s like adjusting the volume on a stereo—too loud and you get distortion, too quiet and you miss important details.
  3. softmax(…): This converts our compatibility scores into probabilities that sum to 1. It’s a way of saying “distribute your attention across all the words, but focus more on the relevant ones.”
  4. Multiply by V: Finally, we use these attention weights to compute a weighted sum of the values. We’re gathering information from all words, but in proportion to their relevance.
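To make these four steps concrete, here is a minimal NumPy sketch of scaled dot-product attention. It is illustrative only: the function name, the toy shapes, and the single-sequence setup are our own choices, not code from any of the models discussed.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal attention: Q, K are (n, d_k); V is (n, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # compatibility scores, shape (n, n)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                              # weighted sum of the values

# Toy example: 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```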

Example in Action:

Consider the sentence: “The cat sat on the mat because it was comfortable.”

When processing “it,” the attention mechanism computes: – High attention to “mat” (it could refer to the mat) – High attention to “cat” (it could refer to the cat) – Lower attention to “sat,” “on,” “the” (less relevant for resolving the pronoun)

The context (“because it was comfortable”) helps the model determine that “it” likely refers to the mat, not the cat.

Multi-Head Attention: Multiple Perspectives

A single attention mechanism is powerful, but multiple attention mechanisms working in parallel are even better. This is multi-head attention:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) · W^O,  where head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)

Why Multiple Heads?

Different heads can learn to attend to different types of relationships: – One head might focus on syntactic relationships (subject-verb agreement) – Another might capture semantic relationships (synonyms, antonyms) – Yet another might track long-range dependencies (pronouns to their antecedents)

The Mathematics:

Each head has its own learned projection matrices:
– W_i^Q ∈ ℝ^(d_model × d_k): Projects the input into query space for head i
– W_i^K ∈ ℝ^(d_model × d_k): Projects the input into key space for head i
– W_i^V ∈ ℝ^(d_model × d_v): Projects the input into value space for head i

After computing attention for each head independently, we concatenate all the outputs and project them back to the model dimension using W^O ∈ ℝ^((h·d_v) × d_model).

Typical Configuration:
– Number of heads (h): 8 to 96 (depending on model size)
– d_k = d_v = d_model / h (typically 64 for each head in a 512-dimensional model)

This design ensures that the total computational cost is similar to single-head attention with full dimensionality, but we gain the benefit of multiple specialized attention patterns.
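A compact sketch of multi-head attention under the configuration above (h = 8, d_model = 512, so d_k = 64 per head). The helper name, random weights, and sequence length are assumptions for illustration; real implementations add batching, masking, and learned initialization.

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    n, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Split each projection into h heads of width d_k: shape (h, n, d_k)
    split = lambda M: M.reshape(n, h, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)       # (h, n, n)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    heads = weights @ Vh                                     # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)    # concatenate the heads
    return concat @ W_o                                      # final output projection

d_model, h, n = 512, 8, 10
rng = np.random.default_rng(1)
X = rng.normal(size=(n, d_model))
W = [rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4)]
print(multi_head_attention(X, *W, h).shape)  # (10, 512)
```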

Positional Encoding: Understanding Order

Unlike recurrent neural networks that process sequences step by step, Transformers process all positions simultaneously. This parallelism is great for speed, but it creates a problem: the model has no inherent sense of word order. “Dog bites man” and “Man bites dog” would look identical!

The solution is positional encoding, which adds position information to each word’s representation:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Understanding the Formula:

pos: The position of the word in the sequence (0, 1, 2, …)

i: The dimension index (0, 1, 2, …, d_model/2)

2i and 2i+1: Even and odd dimensions use sine and cosine respectively

Why Sine and Cosine?

This choice is brilliant for several reasons:

  1. Unique Fingerprints: Each position gets a unique pattern of sine and cosine values across all dimensions.
  2. Relative Positions: The model can learn to attend to relative positions. Due to trigonometric identities, PE(pos+k) can be expressed as a linear function of PE(pos), making it easy for the model to learn patterns like “attend to the word 3 positions back.”
  3. Extrapolation: The model can potentially handle sequences longer than those seen during training, since the sinusoidal pattern continues smoothly.
  4. No Learning Required: Unlike learned positional embeddings, these are fixed functions, reducing the number of parameters.
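Below is a small sketch that builds the sinusoidal table defined above; the function name and the chosen max_len are illustrative assumptions.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]                # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angle = pos / np.power(10000, 2 * i / d_model)   # one frequency per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                      # even dimensions
    pe[:, 1::2] = np.cos(angle)                      # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)    # (50, 512)
print(pe[0, :4])   # position 0: alternating sin(0)=0, cos(0)=1
```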

Visualization:

Imagine each position as a unique musical chord. Position 0 might be a C major chord, position 1 a slightly different chord, and so on. Each dimension contributes a different frequency to this chord, creating a rich, unique signature for each position.

Position-Wise Feed-Forward Networks

After attention, each position’s representation is processed through a simple two-layer network:

FFN(x) = max(0, xW_1 + b_1) · W_2 + b_2

Breaking It Down:

  1. First Linear Layer (xW_1 + b_1): Expands the representation from d_model dimensions to d_ff dimensions (typically d_ff = 4 × d_model). This expansion gives the network more capacity to process information.
  2. ReLU Activation (max(0, …)): The Rectified Linear Unit acts as a gatekeeper, setting negative values to zero. This non-linearity is crucial—without it, the entire network would collapse to a single linear transformation.
  3. Second Linear Layer (W_2, b_2): Projects back down to d_model dimensions.

Why “Position-Wise”?

The same FFN is applied to each position independently. It’s like having the same expert analyze each word separately, after the attention mechanism has already allowed words to share information with each other.

Typical Dimensions: – Input/Output: d_model = 512 to 12,288 (depending on model size) – Hidden layer: d_ff = 2,048 to 49,152 (typically 4× the model dimension)
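A minimal sketch of the position-wise FFN with the typical 4× expansion (d_model = 512, d_ff = 2,048). The helper name and random weights are illustrative assumptions.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, xW1 + b1) W2 + b2, applied to every position independently."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # expand to d_ff and apply ReLU
    return hidden @ W2 + b2                # project back down to d_model

d_model, d_ff, n = 512, 2048, 10
rng = np.random.default_rng(2)
x = rng.normal(size=(n, d_model))
W1 = rng.normal(size=(d_model, d_ff)) * 0.02; b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02; b2 = np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (10, 512)
```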

Layer Normalization and Residual Connections

Two additional components stabilize training:

Residual Connections:

output = Sublayer(x) + x

The “+ x” is a residual connection—we add the input back to the output. This allows gradients to flow directly through the network during training, preventing the vanishing gradient problem in deep networks.

Layer Normalization:

LayerNorm(x) = γ · (x − μ) / √(σ² + ε) + β

This normalizes the activations to have mean 0 and variance 1, then applies learned scaling (γ) and shifting (β). It keeps the values in a reasonable range and stabilizes training.

The Complete Transformer Block

Putting it all together, a Transformer block consists of:

  1. Multi-head self-attention
  2. Residual connection and layer normalization
  3. Position-wise feed-forward network
  4. Another residual connection and layer normalization

This block is stacked multiple times (6 to 96+ layers) to create the full model.

Part 2: OpenAI GPT – Scaling the Transformer

The GPT Philosophy: Simplicity at Scale

OpenAI’s approach with GPT (Generative Pre-trained Transformer) was elegantly simple: take the Transformer decoder, scale it up massively, train it on enormous amounts of text, and let capabilities emerge from scale.

Architecture Details

GPT uses the standard Transformer architecture with a few key choices:

Decoder-Only Architecture:

Unlike the original Transformer (which had both encoder and decoder), GPT uses only the decoder. This means: – It processes text left-to-right – Each position can only attend to previous positions (causal masking) – This makes it naturally suited for text generation

Causal Masking:

The attention mechanism is modified to prevent “looking ahead”:

MaskedAttention(Q, K, V) = softmax(QK^T / √d_k + M) · V,  where M_{ij} = 0 if j ≤ i and −∞ if j > i

The mask M ensures that position i can only attend to positions j where j ≤ i. The −∞ entries become 0 after the softmax, effectively blocking attention to future positions.
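A hedged sketch of causal masking in NumPy: positions above the diagonal are set to −∞ before the softmax, so each row attends only to itself and earlier positions. Shapes and names are our own.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Attention where position i may only attend to positions j <= i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal (the future)
    scores = np.where(mask, -np.inf, scores)          # block attention to future positions
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(3)
Q, K, V = (rng.normal(size=(5, 16)) for _ in range(3))
out = causal_attention(Q, K, V)
# out[0] depends on V[0] only: row 0 of the attention weights is [1, 0, 0, 0, 0].
```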

The Mathematics of Scale

Model Dimensions (GPT-3 as example):
– Layers: 96
– d_model: 12,288
– Number of heads: 96
– d_k = d_v: 128 per head
– d_ff: 49,152
– Context length: 2,048 tokens (later extended to 128K in GPT-4)
– Total parameters: ~175 billion

Attention Complexity:

For a sequence of length n: – Attention computation: O(n² × d_model) – Memory for attention scores: O(n² × h)

This quadratic scaling with sequence length is why context length was initially limited.

Training Objective: Next-Token Prediction

GPT is trained with a simple but powerful objective, namely maximizing the likelihood of each token given everything that came before it:

L(θ) = Σ_t log P(x_t | x_1, …, x_{t−1}; θ)

What This Means:

For each position t in the training text:
  1. Given all previous tokens (x_1 through x_{t−1})
  2. Predict the probability distribution over the next token (x_t)
  3. Maximize the log probability of the actual next token

This seemingly simple task forces the model to learn: – Grammar and syntax – Factual knowledge – Reasoning patterns – Common sense – And much more

Why It Works:

The key insight is that predicting the next word requires understanding context, relationships, and patterns. To predict the next word in “The capital of France is ___”, the model must know geography. To predict the next word in “She was happy because ___”, the model must understand causation and emotion.
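As a sketch of the training objective, the snippet below computes the average negative log-likelihood of the true next token from a matrix of logits. The toy vocabulary size and helper name are assumptions for illustration.

```python
import numpy as np

def next_token_loss(logits, targets):
    """Average negative log-likelihood of the true next token at each position.

    logits:  (n, V) unnormalized scores over the vocabulary at each position
    targets: (n,)   index of the actual next token at each position
    """
    logits = logits - logits.max(axis=-1, keepdims=True)              # stability
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(4)
vocab, n = 1000, 6
logits = rng.normal(size=(n, vocab))
targets = rng.integers(0, vocab, size=n)
print(next_token_loss(logits, targets))  # roughly log(vocab) for random logits
```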

Computational Requirements

Training:
– GPT-3 training compute: ~3.14 × 10²³ FLOPs
– Training time: several months on thousands of GPUs
– Training data: ~300 billion tokens

Inference:
– For each token generated: ~2 × (number of parameters) FLOPs
– All parameters are active for every token
– Memory: must store all parameters plus the KV cache

Strengths and Limitations

Strengths:
– Proven, reliable architecture
– Excellent general-purpose performance
– Strong few-shot learning capabilities
– Predictable scaling behavior

Limitations:
– High computational cost (all parameters always active)
– Quadratic attention complexity limits context length
– Large memory footprint
– Expensive to train and run

Part 3: DeepSeek – The Efficiency Revolution

The DeepSeek Philosophy: Matching Performance with Fraction of Cost

DeepSeek asked a provocative question: “Can we achieve GPT-level performance while using only 5% of the computational resources per token?” Their answer came through two major innovations: Multi-Head Latent Attention (MLA) for memory efficiency and DeepSeekMoE for computational efficiency.

Innovation 1: Multi-Head Latent Attention (MLA)

The KV cache problem is one of the biggest bottlenecks in LLM inference. For each token generated, the model must store the keys and values from all previous tokens. With long contexts and many attention heads, this memory requirement becomes enormous.

The Compression Strategy

MLA solves this through low-rank compression. Instead of storing full keys and values, it stores compressed representations:

Step 1: Compress Keys and Values

c_t^KV = W^DKV · h_t

Where:
– h_t ∈ ℝ^d: The input hidden state for token t
– W^DKV ∈ ℝ^(d_c × d): Down-projection matrix
– c_t^KV ∈ ℝ^(d_c): Compressed latent vector (d_c ≪ d_h × n_h)

Typical Values:
– d (model dimension): 5,120
– n_h (number of heads): 128
– d_h (dimension per head): 128
– d_c (compressed dimension): 512

This means that instead of storing 128 × 128 = 16,384 values per token, we store only 512—a 97% reduction!

Step 2: Decompress Keys

k_t^C = W^UK · c_t^KV

Where:
– W^UK ∈ ℝ^((d_h·n_h) × d_c): Up-projection matrix for keys
– k_t^C: Decompressed keys for all heads (concatenated)

Step 3: Add Rotary Position Information

k_t^R = RoPE(W^KR · h_t)
k_{t,i} = [k_{t,i}^C; k_t^R]

Where:
– W^KR ∈ ℝ^(d_h^R × d): Matrix for RoPE keys
– RoPE(·): Rotary Position Embedding function
– k_t^R: Position-dependent key component (shared across heads)
– [·; ·]: Concatenation

Why Separate Position Information?

RoPE (Rotary Position Embedding) encodes position by rotating query and key vectors, so relative position emerges directly from their dot products. Because this rotation depends on the token’s position, it cannot be folded into the low-rank compression, so MLA keeps a small position-dependent key that is shared across all heads, adding only d_h^R extra values per token to the cache.

Step 4: Decompress Values

v_t^C = W^UV · c_t^KV

Where:
– W^UV ∈ ℝ^((d_h·n_h) × d_c): Up-projection matrix for values

Query Compression

Queries are also compressed, but this reduces activation memory during training rather than inference:

c_t^Q = W^DQ · h_t
q_t^C = W^UQ · c_t^Q
q_t^R = RoPE(W^QR · c_t^Q)
q_{t,i} = [q_{t,i}^C; q_{t,i}^R]

Where:
– c_t^Q ∈ ℝ^(d_c′): Compressed query latent (d_c′ ≪ d_h × n_h)
– W^DQ ∈ ℝ^(d_c′ × d): Query down-projection
– W^UQ ∈ ℝ^((d_h·n_h) × d_c′): Query up-projection
– W^QR ∈ ℝ^((d_h^R·n_h) × d_c′): Query RoPE matrix

Final Attention Computation

o_{t,i} = Σ_{j≤t} softmax_j( q_{t,i}^T · k_{j,i} / √(d_h + d_h^R) ) · v_{j,i}^C
u_t = W^O · [o_{t,1}; o_{t,2}; …; o_{t,n_h}]

Where:
– W^O ∈ ℝ^(d × (d_h·n_h)): Output projection matrix
– o_{t,i}: Attention output for head i at position t

Memory Savings:

For DeepSeek-V3 with 128 heads:
– Standard MHA: 128 × 128 × 2 = 32,768 values per token
– MLA: 512 + 128 = 640 values per token
– Reduction: ~98%
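The sketch below mirrors the compression arithmetic above at the level of shapes, using the dimensions quoted in the text (d = 5,120, n_h = d_h = 128, d_c = 512, plus a 128-value shared RoPE key implied by the 640 figure). It is a toy illustration of low-rank KV compression, not DeepSeek’s actual code.

```python
import numpy as np

# Illustrative dimensions taken from the text; d_r is the shared RoPE key size
d, n_h, d_h, d_c, d_r = 5120, 128, 128, 512, 128

rng = np.random.default_rng(5)
h_t = rng.normal(size=(d,))                        # hidden state for one token
W_DKV = rng.normal(size=(d_c, d)) * 0.02           # down-projection (only its output is cached)
W_UK = rng.normal(size=(n_h * d_h, d_c)) * 0.02    # up-projection for keys
W_UV = rng.normal(size=(n_h * d_h, d_c)) * 0.02    # up-projection for values

c_kv = W_DKV @ h_t                                 # compressed latent: this (plus the shared
                                                   # RoPE key) is all that enters the KV cache
k_full = (W_UK @ c_kv).reshape(n_h, d_h)           # decompressed keys for all heads
v_full = (W_UV @ c_kv).reshape(n_h, d_h)           # decompressed values for all heads

cache_standard = n_h * d_h * 2                     # keys + values per token in standard MHA
cache_mla = d_c + d_r                              # compressed latent + shared RoPE key
print(cache_standard, cache_mla, 1 - cache_mla / cache_standard)  # 32768, 640, ~0.98
```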

Innovation 2: DeepSeekMoE Architecture

The second major innovation is the Mixture-of-Experts (MoE) architecture, which activates only a small fraction of parameters for each token.

The Basic MoE Formula

h_t′ = u_t + Σ_{i=1}^{N_s} FFN_i^{(s)}(u_t) + Σ_{i=1}^{N_r} g_{i,t} · FFN_i^{(r)}(u_t)

Where:
– u_t: Input to the FFN layer
– N_s: Number of shared experts (always activated)
– N_r: Number of routed experts (selectively activated)
– FFN_i^{(s)}(·): i-th shared expert
– FFN_i^{(r)}(·): i-th routed expert
– g_{i,t}: Gating value for routed expert i at token t
– h_t′: Output of the FFN layer

DeepSeek-V3 Configuration:
– N_s = 2 shared experts
– N_r = 256 routed experts
– K_r = 8 routed experts activated per token
– Expert size: each expert has ~15M parameters

Total Computation:
– Shared experts: 2 (always active)
– Routed experts: 8 (selected from 256)
– Total active: 10 experts per token
– Percentage: 10/258 ≈ 3.9% of FFN parameters

The Routing Mechanism

Step 1: Compute Affinity Scores

s_{i,t} = Sigmoid(u_t^T · e_i)

Where:
– e_i ∈ ℝ^d: Centroid vector for expert i (learned parameter)
– s_{i,t}: Token-to-expert affinity score (between 0 and 1)
– Sigmoid: Ensures scores are in the (0, 1) range

Intuition:

Each expert has a “specialty” represented by its centroid vector e_i. The affinity score measures how well the current token matches that specialty. Think of it as asking “Is this expert relevant for processing this token?”

Step 2: Select Top-K Experts

g′_{i,t} = s_{i,t} if s_{i,t} ∈ TopK({s_{j,t} | 1 ≤ j ≤ N_r}, K_r), and 0 otherwise

Only the K_r experts with the highest affinity scores are selected.

Step 3: Normalize Gating Values

g_{i,t} = g′_{i,t} / Σ_{j=1}^{N_r} g′_{j,t}

The gating values are normalized so they sum to 1, ensuring the output is a proper weighted combination.
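Putting the three routing steps together, here is a small sketch with made-up sizes (16 experts, top-4): sigmoid affinities, top-K selection, and normalized gates. Names and dimensions are assumptions for illustration.

```python
import numpy as np

def route_token(u_t, centroids, k_r):
    """Sigmoid affinity scores -> top-K selection -> normalized gating values."""
    s = 1.0 / (1.0 + np.exp(-(centroids @ u_t)))  # s_{i,t} in (0, 1), one score per expert
    top = np.argsort(s)[-k_r:]                    # indices of the K_r best experts
    g = np.zeros_like(s)
    g[top] = s[top] / s[top].sum()                # gating values over selected experts sum to 1
    return top, g

d, n_experts, k_r = 64, 16, 4
rng = np.random.default_rng(6)
u_t = rng.normal(size=(d,))
centroids = rng.normal(size=(n_experts, d))       # e_i: one learned vector per expert
top, gates = route_token(u_t, centroids, k_r)
print(sorted(top), gates[top].sum())              # 4 experts selected, gates sum to 1.0
```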

Innovation 3: Auxiliary-Loss-Free Load Balancing

Traditional MoE models face a critical problem: some experts become overused while others are underutilized. This is typically solved with an auxiliary loss that penalizes imbalance, but this loss can hurt model performance.

DeepSeek’s solution is elegant: adjust the routing dynamically without modifying the training loss.

The Bias Adjustment Mechanism

g′_{i,t} = s_{i,t} if (s_{i,t} + b_i) ∈ TopK({s_{j,t} + b_j | 1 ≤ j ≤ N_r}, K_r), and 0 otherwise

Where:
– b_i: Bias term for expert i (dynamically adjusted)

Key Insight:

The bias b_i is used only for routing decisions (TopK selection), not for computing the gating values. This means: – Routing is influenced by the bias (encouraging balance) – Gating values remain based on true affinity (preserving performance)

Dynamic Bias Update

At the end of each training step:

b_i ← b_i − γ  if expert i is overloaded
b_i ← b_i + γ  if expert i is underloaded

Where:
– γ: Bias update speed (hyperparameter, typically 0.01)

Load Criteria:

An expert is considered: – Overloaded if it receives more than (1 + ε) × (K_r / N_r) × batch_size tokens – Underloaded if it receives fewer than (1 – ε) × (K_r / N_r) × batch_size tokens – Balanced otherwise

Where ε is a tolerance parameter (typically 0.2)

How It Works:

  1. If an expert is overloaded, we decrease its bias, making it less likely to be selected
  2. If an expert is underloaded, we increase its bias, making it more likely to be selected
  3. Over time, this naturally balances the load without forcing artificial constraints

Analogy:

Think of it like dynamic pricing: if a restaurant is too crowded, prices go up slightly (negative bias), discouraging some customers. If it’s empty, prices go down (positive bias), attracting more customers. The quality of the food (true affinity) doesn’t change, but the selection behavior adjusts.
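A toy sketch of the balancing loop described above. The load counts, ε, and γ are illustrative; the key point is that the bias only shifts top-K selection, while the gating values still use the raw affinities.

```python
import numpy as np

def update_biases(bias, tokens_per_expert, k_r, n_r, batch_tokens, gamma=0.01, eps=0.2):
    """Nudge routing biases after each step: overloaded experts down, underloaded up."""
    target = k_r / n_r * batch_tokens                  # expected load per expert
    for i, load in enumerate(tokens_per_expert):
        if load > (1 + eps) * target:
            bias[i] -= gamma                           # overloaded: make less selectable
        elif load < (1 - eps) * target:
            bias[i] += gamma                           # underloaded: make more selectable
    return bias

def select_experts(s, bias, k_r):
    """TopK over (affinity + bias); the gating values are later computed from s alone."""
    return np.argsort(s + bias)[-k_r:]

n_r, k_r, batch_tokens = 8, 2, 1000
bias = np.zeros(n_r)
loads = np.array([400, 300, 100, 80, 50, 40, 20, 10])  # hypothetical per-expert token counts
bias = update_biases(bias, loads, k_r, n_r, batch_tokens)
print(bias)  # busy experts get a negative bias, idle experts a positive one
```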

Complementary Sequence-Wise Balance Loss

While the auxiliary-loss-free strategy handles batch-level balance, a small auxiliary loss prevents extreme imbalance within individual sequences:

L_Bal = α · Σ_{i=1}^{N_r} f_i · P_i

Where:

f_i = (N_r / (K_r · T)) · Σ_{t=1}^{T} [ s_{i,t} ∈ TopK({s_{j,t}}, K_r) ]
P_i = (1/T) · Σ_{t=1}^{T} s′_{i,t}

with s′_{i,t} the affinity scores normalized across experts.

Breaking It Down:

f_i: Fraction of tokens in the sequence routed to expert i

P_i: Average routing probability for expert i

α: Balance loss coefficient (typically 0.001-0.01)

T: Sequence length

[·]: Indicator function (1 if condition is true, 0 otherwise)

Why This Loss?

The product f_i × P_i is minimized when experts are balanced. If an expert has high routing probability (P_i) but low actual usage (f_i), or vice versa, the loss increases. This gently encourages balance within each sequence without the strong performance penalty of traditional auxiliary losses.
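A sketch of the sequence-wise balance term for a single sequence, following the definitions above; the random affinities and the chosen value of α are assumptions.

```python
import numpy as np

def sequence_balance_loss(affinities, k_r, alpha=0.003):
    """L_bal = alpha * sum_i f_i * P_i for one sequence.

    affinities: (T, N_r) token-to-expert scores s_{i,t}
    f_i: normalized fraction of tokens actually routed to expert i (~1.0 when balanced)
    P_i: average normalized routing probability of expert i
    """
    T, n_r = affinities.shape
    topk = np.argsort(affinities, axis=1)[:, -k_r:]        # experts chosen for each token
    selected = np.zeros_like(affinities)
    selected[np.arange(T)[:, None], topk] = 1.0
    f = selected.sum(0) * n_r / (k_r * T)                  # per-expert usage
    probs = affinities / affinities.sum(1, keepdims=True)  # normalized affinity scores
    P = probs.mean(0)                                      # per-expert average probability
    return alpha * np.sum(f * P)

rng = np.random.default_rng(7)
aff = rng.uniform(0.01, 1.0, size=(32, 8))  # 32 tokens, 8 routed experts
print(sequence_balance_loss(aff, k_r=2))
```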

Multi-Token Prediction

DeepSeek-V3 also employs a multi-token prediction objective: in addition to the standard next-token loss, auxiliary heads predict tokens further ahead, and their losses are combined:

L_MTP = Σ_{k=1}^{K} λ_k · L_k

Where:
– K: Number of future tokens to predict (typically 2-4)
– L_k: Cross-entropy loss for predicting k steps ahead
– λ_k: Weight for predicting k steps ahead (typically decreasing with k)

Why Predict Multiple Tokens?

  1. Better Representations: Forces the model to learn representations that are useful for multiple future predictions
  2. Speculative Decoding: The additional prediction heads can be used for speculative decoding during inference, generating multiple tokens in parallel
  3. Improved Performance: Empirically shown to improve benchmark scores

DeepSeek-V3 Complete Specifications

Model Architecture:
– Total parameters: 671B
– Active parameters per token: 37B (5.5%)
– Layers: 61
– Model dimension (d): 7,168
– Number of attention heads (n_h): 128
– FFN hidden dimension: 18,432
– Number of shared experts (N_s): 2
– Number of routed experts (N_r): 256
– Activated routed experts (K_r): 8
– Context length: 128K tokens
– KV compression dimension (d_c): 512
– Query compression dimension (d_c′): 1,536

Training Details:
– Training tokens: 14.8 trillion
– Training cost: $5.576 million (2.788M H800 GPU hours)
– Training time: ~2 months
– Batch size: 9,216 sequences
– Learning rate: peak 2.2 × 10⁻⁴ with cosine decay
– Precision: FP8 mixed precision

Performance:
– MMLU: 88.5
– MMLU-Pro: 75.9
– GPQA: 59.1
– MATH-500: 90.2
– Codeforces: 78.3 percentile

Part 4: Google Gemini – The Multimodal Frontier

The Gemini Philosophy: Unified Multimodal Understanding

While GPT and DeepSeek primarily process text, Gemini was designed from the ground up to understand multiple modalities—text, images, audio, and video—in a unified architecture. This isn’t just about processing different types of data separately; it’s about understanding relationships across modalities.

Sparse Mixture-of-Experts Foundation

Like DeepSeek, Gemini 2.5 uses a sparse MoE architecture, but optimized for multimodal processing. In generic form:

y = Σ_{i ∈ TopK(Router(x), K)} Router(x)_i · Expert_i(x)

Where:
– N: Total number of experts (not publicly disclosed; estimated 100-200)
– K: Number of activated experts per token (estimated 10-20)
– Router(x): Learned routing function
– Expert_i(·): i-th expert FFN

Routing Function:

Router(x) = softmax(W_r · x + b_r)

Where:
– W_r ∈ ℝ^(N × d): Routing weight matrix
– b_r ∈ ℝ^N: Routing bias vector

Multimodal Token Processing

The revolutionary aspect of Gemini is how it processes different modalities in a unified token space:

X = [X_text; X_image; X_audio; X_video]

Where [·; ·] denotes concatenation in the sequence dimension.

Text Tokenization

Standard subword tokenization:

tokens = BPE(text),  X_text = E_text[tokens]

Where:
– BPE: Byte-Pair Encoding algorithm
– E_text ∈ ℝ^(V × d): Text embedding matrix
– V: Vocabulary size (~256K tokens)

Image Tokenization

Images are divided into patches and embedded:

X_image = Linear(Patchify(image, patch_size)) + PE_2D

Where:
– patch_size: Typically 14×14 or 16×16 pixels
– Linear: Learned linear projection to the model dimension
– PE_2D: 2D positional encoding (preserves spatial relationships)

For a 224×224 image with 14×14 patches:
– Number of patches: (224/14)² = 256
– Each patch becomes one token
– Total: 256 image tokens
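The patch arithmetic above can be reproduced with a few lines of NumPy; the helper name and the random projection matrix are illustrative assumptions, not Gemini’s implementation.

```python
import numpy as np

def patchify(image, patch_size, W_embed):
    """Split an image into non-overlapping patches and project each to a token."""
    H, W, C = image.shape
    p = patch_size
    patches = image.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)  # (n_patches, p*p*C)
    return patches @ W_embed                                           # (n_patches, d_model)

d_model, patch = 512, 14
rng = np.random.default_rng(8)
image = rng.random((224, 224, 3))
W_embed = rng.normal(size=(patch * patch * 3, d_model)) * 0.02
tokens = patchify(image, patch, W_embed)
print(tokens.shape)  # (256, 512): a 224x224 image becomes 256 image tokens
```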

Audio Tokenization

Audio is processed in overlapping windows:

X_audio = CNN(STFT(audio)) + PE_temporal

Where:
– STFT: Short-Time Fourier Transform (converts audio to the frequency domain)
– CNN: Convolutional neural network for feature extraction
– PE_temporal: Temporal positional encoding

Typical Configuration:
– Window size: 25ms
– Hop length: 10ms
– For 1 second of audio: ~100 audio tokens

Video Tokenization

Video combines spatial and temporal processing:

X_video = [Patchify(frame_1); Patchify(frame_2); …; Patchify(frame_T)] + PE_spatiotemporal

Where:
– Sample: Frames are sampled at a specified rate (typically 1-4 fps)
– PE_spatiotemporal: 3D positional encoding (spatial + temporal)

For a 1-minute video at 1 fps:
– 60 frames
– 256 tokens per frame
– Total: 15,360 video tokens

Cross-Modal Attention

The key to multimodal understanding is allowing different modalities to attend to each other. Because all modalities share one token sequence, the standard attention formula applies directly, with queries from one modality attending to keys and values from another:

CrossModalAttention(Q_text, K_image, V_image) = softmax(Q_text · K_image^T / √d_k) · V_image

Example:

When processing the query “What color is the car in the image?”: – Text tokens attend to image tokens to locate the car – Image tokens attend to text tokens to understand what’s being asked – The model generates a response by integrating both modalities

Long Context Processing

Gemini 2.5 Pro supports context lengths exceeding 1 million tokens through several optimizations:

Efficient Attention Mechanisms

Sparse Attention Patterns:

Instead of full O(n²) attention, use structured sparsity:

Attention_sparse(Q, K, V) = softmax(Q · [K_local; K_global]^T / √d_k) · [V_local; V_global]

Where:
– K_local, V_local: Keys and values from nearby positions (e.g., within 512 tokens)
– K_global, V_global: Keys and values from selected global positions (e.g., every 64th token)

This reduces complexity from O(n²) to roughly O(n × √n) or better, depending on the pattern.
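As a rough illustration of local-plus-global sparsity, the sketch below lists, for each query position, which key positions it would attend to; the window and stride values are assumptions taken from the example above.

```python
import numpy as np

def sparse_attention_indices(n, window=512, stride=64):
    """For each query position, the key positions it may attend to:
    a local window plus every `stride`-th position globally."""
    global_idx = np.arange(0, n, stride)
    allowed = []
    for q in range(n):
        local = np.arange(max(0, q - window), min(n, q + window + 1))
        allowed.append(np.union1d(local, global_idx))
    return allowed

idx = sparse_attention_indices(n=4096, window=512, stride=64)
avg = np.mean([len(a) for a in idx])
print(avg, 4096)  # each query attends to roughly a quarter of the 4,096 keys
```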

Memory-Efficient KV Cache

Quantization:

Storing keys and values in 8-bit precision instead of 16-bit reduces memory by 50% with minimal quality loss.
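A minimal sketch of symmetric 8-bit quantization of cached keys and values; per-tensor scaling is a simplifying assumption (real systems often quantize per channel or per head).

```python
import numpy as np

def quantize_int8(x):
    """Per-tensor symmetric quantization to 8-bit integers plus a scale factor."""
    scale = np.abs(x).max() / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

kv = np.random.default_rng(11).normal(size=(1024, 128)).astype(np.float32)
q, s = quantize_int8(kv)
err = np.abs(dequantize(q, s) - kv).mean()
print(q.nbytes / kv.nbytes, err)  # 0.25 vs float32 (0.5 vs 16-bit), small mean error
```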

Selective Caching:

Only cache the most important positions based on recent attention patterns.

Extended Reasoning: The Thinking Process

Gemini 2.5 incorporates “thinking” capabilities through extended reasoning:

Chain-of-Thought Generation:

The model generates intermediate reasoning steps before the final answer.

Verification:

The model evaluates its own reasoning and regenerates if confidence is low.

Mathematical Formulation:

P(y | x) = Σ_{r_1,…,r_M} P(y | x, r_1, …, r_M) · Π_{i=1}^{M} P(r_i | x, r_1, …, r_{i−1})

Where:
– r_i: i-th reasoning step
– M: Number of reasoning steps
– y: Final answer
– x: Input

Distillation for Smaller Models

Smaller Gemini models (Flash, Flash-Lite) learn from larger models through distillation:

L_distill = Σ_t KL( P_teacher(· | x_{<t}) ‖ P_student(· | x_{<t}) )

Where:
– P_teacher: The larger model’s predicted next-token distribution (the soft target)
– P_student: The smaller model’s predicted distribution
– KL(·‖·): Kullback-Leibler divergence

K-Sparse Approximation:

To reduce storage, only the top-K logits from the teacher are stored, and the teacher distribution is renormalized over them.

Typical K: 256-1024 (out of a ~256K vocabulary)
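The sketch below shows the idea of distilling from only the teacher’s top-K logits: the teacher distribution is renormalized over those K entries and the student is trained toward it. The vocabulary size, K, and helper names are illustrative.

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def topk_distillation_loss(student_logits, teacher_logits, k=256):
    """Cross-entropy of the student against the teacher's renormalized top-K probabilities."""
    top = np.argsort(teacher_logits)[-k:]                 # only these teacher entries are stored
    teacher_p = softmax(teacher_logits[top])              # renormalize over the top-K
    student_logp = np.log(softmax(student_logits)[top] + 1e-12)
    return -np.sum(teacher_p * student_logp)

vocab = 256_000
rng = np.random.default_rng(9)
teacher = rng.normal(size=vocab)
student = teacher + 0.1 * rng.normal(size=vocab)          # a student already close to the teacher
print(topk_distillation_loss(student, teacher, k=256))
```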

Training Objectives

Gemini uses multiple training objectives:

  1. Next-Token Prediction (Primary):

L_LM = −Σ_t log P(x_t | x_1, …, x_{t−1})

  2. Multimodal Alignment:

Ensures text and image representations of the same content are close in embedding space (e.g., via a contrastive objective).

  3. Instruction Following:

Supervised fine-tuning on instruction-response pairs.

  4. Preference Learning:

L_pref = −log σ( r(x, y_chosen) − r(x, y_rejected) )

Where:
– σ: Sigmoid function
– r(x, y): Reward model score
– y_chosen, y_rejected: The preferred and rejected responses in a comparison pair

Gemini 2.5 Pro Specifications

Model Architecture:
– Architecture: Sparse MoE Transformer
– Total parameters: Not disclosed (estimated 500B-1T)
– Active parameters: Estimated 100-200B per token
– Context length: 1M+ tokens (2M for some versions)
– Supported modalities: Text, Image, Audio, Video
– Output modalities: Text, Audio (experimental)
– Output length: Up to 64K tokens

Training Details:
– Training data cutoff: January 2025
– Training infrastructure: TPUv5p pods
– Precision: Mixed precision (BF16/FP8)
– Training scale: Multiple datacenters, thousands of accelerators

Performance Highlights:
– MMLU: ~90 (estimated)
– Codeforces: High percentile
– Multimodal benchmarks: State-of-the-art
– Video understanding: Up to 3 hours of video
– Long context: Excellent performance up to 1M tokens

Part 5: Comparative Analysis

Mathematical Complexity Comparison

| Operation | GPT (Dense) | DeepSeek (MoE + MLA) | Gemini (MoE) |
| --- | --- | --- | --- |
| Attention per layer | O(n² × d) | O(n² × d_c), d_c ≪ d | O(n² × d) with sparse patterns |
| FFN per layer | O(n × d²) | O(n × d² × (N_s + K_r)/N_total) | O(n × d² × K/N) |
| Parameters per token | 100% | 5.5% (37B/671B) | ~10-20% |
| KV cache per token | O(n_h × d_h) = O(d) | O(d_c + d_h^R) ≈ 0.02 × O(d) | O(d) with quantization |
| Context length | 128K | 128K | 1M+ |

Training Efficiency Comparison

| Metric | GPT-3 | DeepSeek-V3 | Gemini 2.5 Pro |
| --- | --- | --- | --- |
| Total parameters | 175B | 671B | ~500B-1T (est.) |
| Active parameters | 175B | 37B | ~100-200B (est.) |
| Training tokens | ~300B | 14.8T | Not disclosed |
| Training cost | Not disclosed | $5.576M | Not disclosed |
| FLOPs per token (training) | High | Low (due to MoE) | Medium |
| FLOPs per token (inference) | ~2 × 175B | ~2 × 37B | ~2 × 100-200B (est.) |

Architectural Philosophy Comparison

OpenAI GPT:
– Philosophy: Simplicity and scale
– Key Innovation: Demonstrating that scaled Transformers develop emergent capabilities
– Trade-off: Maximum capability at maximum cost
– Best For: Applications where cost is secondary to reliability and performance

DeepSeek:
– Philosophy: Efficiency without compromise
– Key Innovations: MLA for 98% KV cache reduction; auxiliary-loss-free load balancing; multi-token prediction
– Trade-off: Architectural complexity for computational efficiency
– Best For: Cost-sensitive deployments, organizations with limited resources

Gemini:
– Philosophy: Unified multimodal intelligence
– Key Innovations: Native multimodal processing; extended reasoning capabilities; extreme long context (1M+ tokens)
– Trade-off: Complexity in handling multiple modalities
– Best For: Applications requiring vision, audio, or video understanding

Performance Characteristics

Latency:
– GPT: Moderate (all parameters active)
– DeepSeek: Low (fewer active parameters, efficient attention)
– Gemini: Variable (depends on modality and context length)

Memory:
– GPT: High (full KV cache, all parameters)
– DeepSeek: Low (compressed KV cache, sparse activation)
– Gemini: High for long contexts (but optimized)

Throughput:
– GPT: Moderate (limited by memory bandwidth)
– DeepSeek: High (sparse activation enables larger batch sizes)
– Gemini: Moderate to high (depends on modality mix)

Use Case Suitability

GPT is Ideal For: – General-purpose text generation – Applications requiring proven reliability – Organizations with substantial computational resources – Use cases where consistency is critical

DeepSeek is Ideal For: – Cost-sensitive deployments – High-throughput applications – Organizations with limited GPU resources – Applications requiring long context at reasonable cost

Gemini is Ideal For: – Multimodal applications (image + text, video + text) – Applications requiring very long context (>100K tokens) – Complex reasoning tasks – Agentic systems combining multiple capabilities

Part 6: The Future of LLM Architectures

  1. Convergence of Approaches

We’re seeing convergence toward hybrid architectures: – GPT models exploring sparse activation – DeepSeek expanding into multimodal – All models adopting MoE for efficiency

  2. Efficiency as a First-Class Concern

The success of DeepSeek demonstrates that efficiency innovations can match or exceed brute-force scaling: performance is a function of architecture, data quality, and compute, not just of raw parameter count.

  3. Multimodal as Default

Future models will likely be multimodal by default, as the world’s information exists in multiple modalities.

  4. Extended Reasoning

The “thinking” capabilities in Gemini 2.5 and similar models represent a shift toward more deliberate, verifiable reasoning: the model generates intermediate steps, evaluates them, and only then commits to an answer.

Open Research Questions

  1. Optimal Sparsity Patterns

What is the optimal balance between:
– Number of experts
– Expert size
– Activation rate
– Routing mechanism

  2. Scaling Laws for MoE

Traditional scaling laws relate loss to model size N and data size D, for example:

L(N, D) = E + A/N^α + B/D^β

How do these change for MoE models, where active parameters ≠ total parameters?

  3. Cross-Modal Understanding

How can we better measure and improve cross-modal reasoning? Current benchmarks focus on single-modality performance.

  4. Efficient Long Context

Can we achieve O(n) or O(n log n) attention without sacrificing quality?

Candidates: – Linear attention mechanisms – State space models – Hierarchical attention

Mathematical Frontiers

  1. Better Attention Mechanisms

Current research explores alternatives to softmax attention, such as linear attention:

Attention_linear(Q, K, V) ≈ φ(Q) · (φ(K)^T · V)

Where φ is a feature map that allows O(n) complexity.
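A sketch of linear attention with a simple non-negative feature map (the original linear-attention papers typically use elu(x) + 1); the point is that it never materializes the n × n score matrix. Shapes and names are illustrative assumptions.

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """O(n) attention: softmax(QK^T)V is replaced by phi(Q) (phi(K)^T V) / normalizer."""
    Qf, Kf = phi(Q), phi(K)                  # feature-mapped queries and keys
    KV = Kf.T @ V                            # (d_k, d_v): summed over all positions once
    Z = Kf.sum(0)                            # (d_k,): normalization terms
    return (Qf @ KV) / (Qf @ Z)[:, None]     # (n, d_v)

rng = np.random.default_rng(10)
Q = rng.normal(size=(1000, 32)); K = rng.normal(size=(1000, 32)); V = rng.normal(size=(1000, 64))
print(linear_attention(Q, K, V).shape)       # (1000, 64) without an n x n score matrix
```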

  2. Adaptive Computation

Allow models to spend more computation on harder problems, for example by taking a variable number of reasoning or refinement steps per input.

  3. Continuous Learning

Current models are static after training. Future models might continuously update their parameters as new data arrives, while maintaining stability and preventing catastrophic forgetting.

Conclusion: The Elegant Mathematics of Machine Intelligence

We’ve journeyed through the mathematical foundations of three remarkable AI architectures. What emerges is a picture of elegant simplicity giving rise to extraordinary complexity.

The Core Insights

  1. Attention is Universal

The attention mechanism—a simple weighted sum based on compatibility scores—turns out to be sufficient for modeling complex relationships in language, vision, and beyond:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

This formula, simple enough to fit on a single line, powers systems that can write poetry, prove theorems, and understand images.

  2. Sparsity is Powerful

DeepSeek demonstrated that we don’t need to activate all parameters for every token. Selective activation through MoE routes each token to a handful of experts rather than the full network, achieving comparable performance to dense models while using a fraction of the computation.

  3. Compression Preserves Information

MLA showed that we can compress the key-value cache by roughly 98% and still maintain performance, because the essential information is preserved in a low-dimensional subspace.

  4. Multimodality is Natural

Gemini demonstrated that different modalities can be processed in a unified token space, X = [X_text; X_image; X_audio; X_video]. The same attention mechanism that relates words to words can relate images to text, audio to video, and any modality to any other.

The Human Element

Despite the sophisticated mathematics, these systems ultimately serve human needs. They help us: – Communicate across language barriers – Access and synthesize information – Solve complex problems – Create new forms of art and expression – Understand the world in new ways

A Note of Wonder

There’s something profound about the fact that intelligence—whether natural or artificial—can be approximated by mathematical functions. The same principles that describe the motion of planets and the behavior of particles also describe the processing of information and the generation of meaning.

We’ve seen that: – A simple attention formula captures the essence of focus and relevance – Sparse activation mirrors how human experts specialize – Compression reveals the low-dimensional structure of information – Unified processing across modalities reflects how humans integrate sensory information

Looking Forward

The mathematics we’ve explored represents our current understanding, but the field continues to evolve rapidly. Future architectures will likely: – Be more efficient (doing more with less) – Be more capable (understanding deeper relationships) – Be more accessible (available to more people and organizations) – Be more aligned (better serving human values and needs)

The journey from simple formulas to intelligent behavior is a testament to both the power of mathematics and the ingenuity of human creativity. As we continue to refine these architectures, we move closer to systems that can truly augment human intelligence and help us tackle the great challenges of our time.

Acknowledgments

This document was prepared to illuminate the mathematical foundations of modern AI for a broad audience. We are grateful to the research teams at OpenAI, DeepSeek, and Google DeepMind for their groundbreaking work and for publishing detailed technical reports that make this kind of analysis possible.

Special thanks to the authors of “Attention Is All You Need” (Vaswani et al., 2017), whose elegant architecture laid the foundation for the current generation of AI systems.

References

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998-6008. https://arxiv.org/abs/1706.03762
  2. DeepSeek-AI. (2024). DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437. https://arxiv.org/abs/2412.19437
  3. Dai, D., Deng, C., Zhao, C., Xu, R. X., Gao, H., Chen, D., Li, J., Ding, W., Li, M., Xiao, Y., Wang, P., Huang, K., Sui, Y., Ruan, C., Zheng, Z., Yu, K., Cheng, X., Guo, X., Gu, S., … & Bi, J. (2024). DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066.
  4. Gemini Team, Google. (2023). Gemini: A Family of Highly Capable Multimodal Models. arXiv preprint arXiv:2312.11805. https://arxiv.org/abs/2312.11805
  5. Gemini Team, Google. (2025). Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. Technical Report. https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf
  6. Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2024). RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568, 127063.
  7. Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
  8. Fedus, W., Zoph, B., & Shazeer, N. (2021). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1-39.
  9. Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., & Chen, Z. (2020). GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.
  10. Clark, A., de Las Casas, D., Guy, A., Mensch, A., Paganini, M., Hoffmann, J., Damoc, B., Hechtman, B., Cai, T., Borgeaud, S., et al. (2022). Unified scaling laws for routed language models. arXiv preprint arXiv:2202.01169.

Document prepared by:

Hene Aku Kwapong, PhD, MBA

MIT Practice School Alumni

With the assistance of: Jeremie Kwapong

New England Innovation Academy

October 9, 2025
