Document prepared with the help of: Jeremie Kwapong, New England Innovation Academy
Date: October 9, 2025
Introduction: Decoding the Language of Intelligent Machines
In 2017, a team of researchers at Google published a paper with a bold title: “Attention Is All You Need.” This work introduced the Transformer architecture, which would revolutionize artificial intelligence and give birth to the large language models that now shape our digital world. But what makes these systems work? What mathematical principles allow a computer to understand context, generate coherent text, and even reason about complex problems?
This document explores the mathematical foundations of three groundbreaking AI architectures: OpenAI’s GPT models, which demonstrated the power of scaling; DeepSeek’s efficient design, which proved that smart engineering can rival brute force; and Google’s Gemini, which extended AI’s reach into the multimodal realm of images, audio, and video.
We’ll present the actual mathematical formulations that power these systems, but explain them in a way that reveals their elegant simplicity. Our goal is to make sophisticated concepts accessible without sacrificing accuracy or depth.
Part 1: The Foundational Architecture – The Transformer
Before diving into the specific innovations of GPT, DeepSeek, and Gemini, we must understand the Transformer architecture that underlies them all. Introduced by Vaswani et al. in 2017, the Transformer solved a fundamental problem: how to process sequences of information in parallel while still capturing relationships between distant elements.
The Attention Mechanism: Learning What Matters
Imagine reading a detective novel. When you encounter the phrase “the butler did it,” your mind instantly connects this to earlier mentions of the butler, the crime scene, and various clues scattered throughout hundreds of pages. You’re not giving equal weight to every word you’ve read—you’re selectively attending to relevant information.
The Transformer does exactly this through its attention mechanism. The mathematics, while precise, capture this intuitive process beautifully.
Scaled Dot-Product Attention
The core attention formula is:

Attention(Q, K, V) = softmax(QK^T / √d_k) V
Let’s unpack this step by step:
The Three Matrices:
- Q (Query): Represents “what am I looking for?” For each word, this encodes what information it needs from other words.
- K (Key): Represents “what information do I have?” Each word advertises what it can offer.
- V (Value): Represents “what is my actual content?” The information that will be retrieved.
The Computation:
- QK^T: We multiply queries by keys (transposed). This creates a matrix where each entry measures how relevant each key is to each query. Think of it as computing compatibility scores.
- Division by √d_k: This scaling factor is crucial. Without it, when d_k (the dimension of our keys) is large, the dot products can become very large, pushing the softmax function into regions where gradients become tiny. The square root scaling keeps the values in a reasonable range. It’s like adjusting the volume on a stereo—too loud and you get distortion, too quiet and you miss important details.
- softmax(…): This converts our compatibility scores into probabilities that sum to 1. It’s a way of saying “distribute your attention across all the words, but focus more on the relevant ones.”
- Multiply by V: Finally, we use these attention weights to compute a weighted sum of the values. We’re gathering information from all words, but in proportion to their relevance.
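To make these four steps concrete, here is a minimal NumPy sketch of scaled dot-product attention. The sequence length, dimensions, and random Q, K, V matrices are illustrative assumptions, not values from any real model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # steps 1 and 2: compatibility scores, scaled
    weights = softmax(scores, axis=-1)    # step 3: each row sums to 1
    return weights @ V, weights           # step 4: weighted sum of the values

# Toy example: 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
output, attn = scaled_dot_product_attention(Q, K, V)
print(output.shape, attn.sum(axis=-1))   # (4, 8) and a vector of ones
```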
Example in Action:
Consider the sentence: “The cat sat on the mat because it was comfortable.”
When processing “it,” the attention mechanism computes:
- High attention to “mat” (it could refer to the mat)
- High attention to “cat” (it could refer to the cat)
- Lower attention to “sat,” “on,” “the” (less relevant for resolving the pronoun)
The context (“because it was comfortable”) helps the model determine that “it” likely refers to the mat, not the cat.
Multi-Head Attention: Multiple Perspectives
A single attention mechanism is powerful, but multiple attention mechanisms working in parallel are even better. This is multi-head attention:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
Why Multiple Heads?
Different heads can learn to attend to different types of relationships:
- One head might focus on syntactic relationships (subject-verb agreement)
- Another might capture semantic relationships (synonyms, antonyms)
- Yet another might track long-range dependencies (pronouns to their antecedents)
The Mathematics:
Each head has its own learned projection matrices:
- W_i^Q ∈ ℝ^(d_model × d_k): Projects the input into query space for head i
- W_i^K ∈ ℝ^(d_model × d_k): Projects the input into key space for head i
- W_i^V ∈ ℝ^(d_model × d_v): Projects the input into value space for head i
After computing attention for each head independently, we concatenate all the outputs and project them back to the model dimension using W^O ∈ ℝ^(hd_v × d_model).
Typical Configuration:
- Number of heads (h): 8 to 96 (depending on model size)
- d_k = d_v = d_model / h (typically 64 for each head in a 512-dimensional model)
This design ensures that the total computational cost is similar to single-head attention with full dimensionality, but we gain the benefit of multiple specialized attention patterns.
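The following sketch, again in NumPy with illustrative random weights, shows one way to realize multi-head attention by slicing the projected queries, keys, and values into h heads and concatenating the results. Production implementations batch this differently, but the arithmetic is the same.

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention, as in the previous sketch
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    # X: (seq_len, d_model); each W_*: (d_model, d_model)
    d_model = X.shape[-1]
    d_k = d_model // h
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for i in range(h):
        # Each head works on its own d_k-sized slice of the projections
        sl = slice(i * d_k, (i + 1) * d_k)
        heads.append(attention(Q[:, sl], K[:, sl], V[:, sl]))
    # Concatenate the h heads and project back to d_model with W_o
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(1)
d_model, h, seq_len = 512, 8, 10
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, h).shape)  # (10, 512)
```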
Positional Encoding: Understanding Order
Unlike recurrent neural networks that process sequences step by step, Transformers process all positions simultaneously. This parallelism is great for speed, but it creates a problem: the model has no inherent sense of word order. “Dog bites man” and “Man bites dog” would look identical!
The solution is positional encoding, which adds position information to each word’s representation:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
Understanding the Formula:
- pos: The position of the word in the sequence (0, 1, 2, …)
- i: The dimension index (0, 1, 2, …, d_model/2)
- 2i and 2i+1: Even and odd dimensions use sine and cosine respectively
Why Sine and Cosine?
This choice is brilliant for several reasons:
- Unique Fingerprints: Each position gets a unique pattern of sine and cosine values across all dimensions.
- Relative Positions: The model can learn to attend to relative positions. Due to trigonometric identities, PE(pos+k) can be expressed as a linear function of PE(pos), making it easy for the model to learn patterns like “attend to the word 3 positions back.”
- Extrapolation: The model can potentially handle sequences longer than those seen during training, since the sinusoidal pattern continues smoothly.
- No Learning Required: Unlike learned positional embeddings, these are fixed functions, reducing the number of parameters.
Visualization:
Imagine each position as a unique musical chord. Position 0 might be a C major chord, position 1 a slightly different chord, and so on. Each dimension contributes a different frequency to this chord, creating a rich, unique signature for each position.
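Below is a short NumPy sketch of the sinusoidal encoding, assuming the standard formulation above. The printed checks confirm that position 0 yields sin(0) = 0 on the even dimensions and cos(0) = 1 on the odd ones.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)    # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)                                                        # (50, 512)
print(np.allclose(pe[0, 0::2], 0.0), np.allclose(pe[0, 1::2], 1.0))    # True True
```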
Position-Wise Feed-Forward Networks
After attention, each position’s representation is processed through a simple two-layer network:

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
Breaking It Down:
- First Linear Layer (xW₁ + b₁): Expands the representation from d_model dimensions to d_ff dimensions (typically d_ff = 4 × d_model). This expansion gives the network more capacity to process information.
- ReLU Activation (max(0, …)): The Rectified Linear Unit acts as a gatekeeper, setting negative values to zero. This non-linearity is crucial—without it, the entire network would collapse to a single linear transformation.
- Second Linear Layer (…W₂ + b₂): Projects back down to d_model dimensions.
Why “Position-Wise”?
The same FFN is applied to each position independently. It’s like having the same expert analyze each word separately, after the attention mechanism has already allowed words to share information with each other.
Typical Dimensions:
- Input/Output: d_model = 512 to 12,288 (depending on model size)
- Hidden layer: d_ff = 2,048 to 49,152 (typically 4× the model dimension)
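A minimal sketch of the position-wise FFN with the typical 4× expansion; the random weights and the d_model = 512, d_ff = 2,048 sizes are purely illustrative.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # x: (seq_len, d_model). The same weights are applied to every position independently.
    hidden = np.maximum(0.0, x @ W1 + b1)   # expand to d_ff and apply ReLU
    return hidden @ W2 + b2                  # project back down to d_model

rng = np.random.default_rng(2)
d_model, d_ff, seq_len = 512, 2048, 10
x = rng.normal(size=(seq_len, d_model))
W1 = rng.normal(size=(d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (10, 512)
```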
Layer Normalization and Residual Connections
Two additional components stabilize training:
Residual Connections:

output = x + Sublayer(x)
The “+x” is a residual connection—we add the input back to the output. This allows gradients to flow directly through the network during training, preventing the vanishing gradient problem in deep networks.
Layer Normalization:

LayerNorm(x) = γ · (x − μ) / √(σ² + ε) + β
Normalizes the activations to have mean 0 and variance 1, then applies learned scaling and shifting. This keeps the values in a reasonable range and stabilizes training.
The Complete Transformer Block
Putting it all together, a Transformer block consists of:
- Multi-head self-attention
- Residual connection and layer normalization
- Position-wise feed-forward network
- Another residual connection and layer normalization
This block is stacked multiple times (6 to 96+ layers) to create the full model.
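Putting the pieces together, here is a toy single-head Transformer block in NumPy that follows the four steps above, using the post-norm arrangement of the original paper. The parameter dictionary p and its key names are assumptions of this sketch, not any library’s API.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each position's vector to mean 0 and variance 1, then scale and shift
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def transformer_block(x, p):
    # 1. Self-attention (a single head here, to keep the sketch short)
    Q, K, V = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    # 2. Residual connection + layer normalization
    x = layer_norm(x + attn @ p["Wo"], p["g1"], p["b1"])
    # 3. Position-wise feed-forward network
    ffn = np.maximum(0.0, x @ p["W1"] + p["c1"]) @ p["W2"] + p["c2"]
    # 4. Another residual connection + layer normalization
    return layer_norm(x + ffn, p["g2"], p["b2"])

rng = np.random.default_rng(3)
d_model, d_ff, seq_len = 64, 256, 10
p = {
    "Wq": rng.normal(size=(d_model, d_model)) * 0.05,
    "Wk": rng.normal(size=(d_model, d_model)) * 0.05,
    "Wv": rng.normal(size=(d_model, d_model)) * 0.05,
    "Wo": rng.normal(size=(d_model, d_model)) * 0.05,
    "W1": rng.normal(size=(d_model, d_ff)) * 0.05, "c1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d_model)) * 0.05, "c2": np.zeros(d_model),
    "g1": np.ones(d_model), "b1": np.zeros(d_model),
    "g2": np.ones(d_model), "b2": np.zeros(d_model),
}
x = rng.normal(size=(seq_len, d_model))
print(transformer_block(x, p).shape)  # (10, 64): same shape in, same shape out
```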
Part 2: OpenAI GPT – Scaling the Transformer
The GPT Philosophy: Simplicity at Scale
OpenAI’s approach with GPT (Generative Pre-trained Transformer) was elegantly simple: take the Transformer decoder, scale it up massively, train it on enormous amounts of text, and let capabilities emerge from scale.
Architecture Details
GPT uses the standard Transformer architecture with a few key choices:
Decoder-Only Architecture:
Unlike the original Transformer (which had both encoder and decoder), GPT uses only the decoder. This means:
- It processes text left-to-right
- Each position can only attend to previous positions (causal masking)
- This makes it naturally suited for text generation
Causal Masking:
The attention mechanism is modified to prevent “looking ahead”:

Attention(Q, K, V) = softmax( QK^T / √d_k + M ) V, with M_ij = 0 if j ≤ i and −∞ if j > i
The mask M ensures that position i can only attend to positions j where j ≤ i. The -∞ values become 0 after softmax, effectively blocking attention to future positions.
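A small NumPy sketch of causal masking: adding −∞ above the diagonal before the softmax guarantees that each position’s attention weights over future positions are exactly zero. The toy sizes are illustrative.

```python
import numpy as np

def causal_self_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Mask M: 0 where j <= i (allowed), -inf where j > i (future positions)
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    scores = scores + mask
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)                        # exp(-inf) = 0: future positions blocked
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(4)
Q = K = V = rng.normal(size=(5, 8))
_, attn = causal_self_attention(Q, K, V)
print(np.allclose(attn, np.tril(attn)))  # True: no attention to future positions
```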
The Mathematics of Scale
Model Dimensions (GPT-3 as example):
- Layers: 96
- d_model: 12,288
- Number of heads: 96
- d_k = d_v: 128 per head
- d_ff: 49,152
- Context length: 2,048 tokens (later extended to 128K in GPT-4)
- Total parameters: ~175 billion (GPT-3)
Attention Complexity:
For a sequence of length n: – Attention computation: O(n² × d_model) – Memory for attention scores: O(n² × h)
This quadratic scaling with sequence length is why context length was initially limited.
Training Objective: Next-Token Prediction
GPT is trained with a simple but powerful objective:

L(θ) = Σ_t log P(x_t | x_1, …, x_{t−1}; θ), maximized over the training corpus
What This Means:
For each position t in the training text:
1. Given all previous tokens (x_1 through x_{t−1})
2. Predict the probability distribution over the next token (x_t)
3. Maximize the log probability of the actual next token
This seemingly simple task forces the model to learn:
- Grammar and syntax
- Factual knowledge
- Reasoning patterns
- Common sense
- And much more
Why It Works:
The key insight is that predicting the next word requires understanding context, relationships, and patterns. To predict the next word in “The capital of France is “, the model must know geography. To predict the next word in “She was happy because “, the model must understand causation and emotion.
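The objective itself is just cross-entropy over next tokens. A minimal sketch, assuming logits of shape (sequence length, vocabulary size): for random predictions the loss is close to the log of the vocabulary size, which is why the printed value is near 6.9.

```python
import numpy as np

def next_token_loss(logits, targets):
    # logits: (seq_len, vocab_size), the model's prediction at each position
    # targets: (seq_len,), the actual next token at each position
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Negative log probability of the correct next token, averaged over positions
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(5)
vocab_size, seq_len = 1000, 6
logits = rng.normal(size=(seq_len, vocab_size))
targets = rng.integers(0, vocab_size, size=seq_len)
print(next_token_loss(logits, targets))   # roughly log(1000) ≈ 6.9 for random logits
```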
Computational Requirements
Training:
- GPT-3 training: ~3.14 × 10²³ FLOPs
- Training time: Several months on thousands of GPUs
- Training data: ~300 billion tokens
Inference:
- For each token generated: ~2 × (number of parameters) FLOPs
- All parameters are active for every token
- Memory: Must store all parameters plus KV cache
Strengths and Limitations
Strengths:
- Proven, reliable architecture
- Excellent general-purpose performance
- Strong few-shot learning capabilities
- Predictable scaling behavior
Limitations:
- High computational cost (all parameters always active)
- Quadratic attention complexity limits context length
- Large memory footprint
- Expensive to train and run
Part 3: DeepSeek – The Efficiency Revolution
The DeepSeek Philosophy: Matching Performance with Fraction of Cost
DeepSeek asked a provocative question: “Can we achieve GPT-level performance while using only 5% of the computational resources per token?” Their answer came through two major innovations: Multi-Head Latent Attention (MLA) for memory efficiency and DeepSeekMoE for computational efficiency.
Innovation 1: Multi-Head Latent Attention (MLA)
The KV cache problem is one of the biggest bottlenecks in LLM inference. For each token generated, the model must store the keys and values from all previous tokens. With long contexts and many attention heads, this memory requirement becomes enormous.
The Compression Strategy
MLA solves this through low-rank compression. Instead of storing full keys and values, it stores compressed representations:
Step 1: Compress Keys and Values

c_t^KV = W^DKV h_t
Where:
- h_t ∈ ℝ^d: The input hidden state for token t
- W^DKV ∈ ℝ^(d_c × d): Down-projection matrix
- c_t^KV ∈ ℝ^(d_c): Compressed latent vector (d_c ≪ d_h × n_h)
Typical Values:
- d (model dimension): 5,120
- n_h (number of heads): 128
- d_h (dimension per head): 128
- d_c (compressed dimension): 512
This means that instead of storing 128 × 128 = 16,384 values per token, we store only 512, a 97% reduction!
Step 2: Decompress Keys

k_t^C = W^UK c_t^KV
Where:
- W^UK ∈ ℝ^(d_h×n_h × d_c): Up-projection matrix for keys
- k_t^C: Decompressed keys for all heads (concatenated)
Step 3: Add Rotary Position Information

k_t^R = RoPE(W^KR h_t)
k_{t,i} = [k_{t,i}^C; k_t^R]
Where:
- W^KR ∈ ℝ^(d_h^R × d): Matrix for RoPE keys
- RoPE(·): Rotary Position Embedding function
- k_t^R: Position-dependent component (shared across heads)
- [·; ·]: Concatenation
Why Separate Position Information?
RoPE (Rotary Position Embedding) encodes position by rotating query and key vectors, which captures relative positions more directly than the original sinusoidal encoding. By keeping the position-dependent component separate and shared across heads, we save even more memory.
Step 4: Decompress Values

v_t^C = W^UV c_t^KV
Where:
- W^UV ∈ ℝ^(d_h×n_h × d_c): Up-projection matrix for values
Query Compression
Queries are also compressed, but this is for reducing activation memory during training, not inference:

c_t^Q = W^DQ h_t
q_t^C = W^UQ c_t^Q
q_t^R = RoPE(W^QR c_t^Q)
q_t = [q_t^C; q_t^R]
Where:
- c_t^Q ∈ ℝ^(d_c’): Compressed query latent (d_c’ ≪ d_h × n_h)
- W^DQ ∈ ℝ^(d_c’ × d): Query down-projection
- W^UQ ∈ ℝ^(d_h×n_h × d_c’): Query up-projection
- W^QR ∈ ℝ^(d_h×n_h × d_c’): Query RoPE matrix
Final Attention Computation

o_{t,i} = Σ_{j=1}^{t} softmax_j( q_{t,i}^T k_{j,i} / √(d_h + d_h^R) ) v_{j,i}^C

u_t = W^O [o_{t,1}; o_{t,2}; …; o_{t,n_h}]
Where:
- W^O ∈ ℝ^(d × d_h×n_h): Output projection matrix
- o_{t,i}: Attention output for head i at position t
Memory Savings:
For DeepSeek-V3 with 128 heads:
- Standard MHA: 128 × 128 × 2 = 32,768 values per token
- MLA: 512 + 128 = 640 values per token
- Reduction: 98%
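A sketch of the compress-then-cache idea behind MLA, using the dimensions quoted above and ignoring the RoPE component; the random matrices stand in for learned projections, and real implementations fuse these steps for speed.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n_h, d_h, d_c = 5120, 128, 128, 512             # dimensions quoted in the text

# Learned projections (random stand-ins here)
W_DKV = rng.normal(size=(d_c, d)) * 0.02           # down-projection
W_UK  = rng.normal(size=(n_h * d_h, d_c)) * 0.02   # up-projection for keys
W_UV  = rng.normal(size=(n_h * d_h, d_c)) * 0.02   # up-projection for values

h_t = rng.normal(size=(d,))     # hidden state for one token

# Cache only the compressed latent instead of the full keys and values
c_KV = W_DKV @ h_t              # shape (512,): this is what gets cached
k_C  = W_UK @ c_KV              # decompressed keys,   shape (16384,)
v_C  = W_UV @ c_KV              # decompressed values, shape (16384,)

full_cache = 2 * n_h * d_h      # standard MHA: keys + values per token
mla_cache  = d_c                # MLA: compressed latent per token (RoPE part excluded here)
print(full_cache, mla_cache, f"{1 - mla_cache / full_cache:.1%} smaller")
```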
Innovation 2: DeepSeekMoE Architecture
The second major innovation is the Mixture-of-Experts (MoE) architecture, which activates only a small fraction of parameters for each token.
The Basic MoE Formula

h_t’ = u_t + Σ_{i=1}^{N_s} FFN_i^{(s)}(u_t) + Σ_{i=1}^{N_r} g_{i,t} FFN_i^{(r)}(u_t)
Where:
- u_t: Input to the FFN layer
- N_s: Number of shared experts (always activated)
- N_r: Number of routed experts (selectively activated)
- FFN_i^{(s)}(·): i-th shared expert
- FFN_i^{(r)}(·): i-th routed expert
- g_{i,t}: Gating value for routed expert i at token t
- h_t’: Output of the FFN layer
DeepSeek-V3 Configuration:
- N_s = 2 shared experts
- N_r = 256 routed experts
- K_r = 8 routed experts activated per token
- Expert size: Each expert has ~15M parameters

Total Computation:
- Shared experts: 2 experts (always active)
- Routed experts: 8 experts (selected from 256)
- Total active: 10 experts per token
- Percentage: 10/258 ≈ 3.9% of FFN parameters
The Routing Mechanism
Step 1: Compute Affinity Scores

s_{i,t} = Sigmoid(u_t^T e_i)
Where:
- e_i ∈ ℝ^d: Centroid vector for expert i (learned parameter)
- s_{i,t}: Token-to-expert affinity score (between 0 and 1)
- Sigmoid: Ensures scores are in (0, 1) range
Intuition:
Each expert has a “specialty” represented by its centroid vector e_i. The affinity score measures how well the current token matches that specialty. Think of it as asking “Is this expert relevant for processing this token?”
Step 2: Select Top-K Experts

g’_{i,t} = s_{i,t} if s_{i,t} ∈ TopK({s_{j,t} | 1 ≤ j ≤ N_r}, K_r), and 0 otherwise
Only the K_r experts with highest affinity scores are selected.
Step 3: Normalize Gating Values

g_{i,t} = g’_{i,t} / Σ_{j=1}^{N_r} g’_{j,t}
The gating values are normalized so they sum to 1, ensuring the output is a proper weighted combination.
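A toy version of the three routing steps, with far fewer and far smaller experts than DeepSeek-V3 and randomly initialized expert weights; the helper names are assumptions of this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_expert(rng, d):
    # A tiny two-layer FFN with random weights standing in for a trained expert
    W1 = rng.normal(size=(d, 2 * d)) * 0.05
    W2 = rng.normal(size=(2 * d, d)) * 0.05
    return lambda x: np.maximum(0.0, x @ W1) @ W2

def moe_layer(u_t, shared_experts, routed_experts, centroids, k_r):
    # Step 1: affinity of this token to each routed expert
    s = sigmoid(centroids @ u_t)                  # (N_r,)
    # Step 2: keep only the K_r highest-affinity experts
    top = np.argsort(s)[-k_r:]
    g = np.zeros_like(s)
    g[top] = s[top]
    # Step 3: normalize the gating values so they sum to 1
    g = g / g.sum()
    # Shared experts always run; routed experts run only if selected
    out = u_t + sum(f(u_t) for f in shared_experts)
    out = out + sum(g[i] * routed_experts[i](u_t) for i in top)
    return out

rng = np.random.default_rng(7)
d, N_s, N_r, K_r = 64, 2, 16, 4    # toy sizes; DeepSeek-V3 uses 2 / 256 / 8
shared = [make_expert(rng, d) for _ in range(N_s)]
routed = [make_expert(rng, d) for _ in range(N_r)]
centroids = rng.normal(size=(N_r, d))
u_t = rng.normal(size=(d,))
print(moe_layer(u_t, shared, routed, centroids, K_r).shape)  # (64,)
```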
Innovation 3: Auxiliary-Loss-Free Load Balancing
Traditional MoE models face a critical problem: some experts become overused while others are underutilized. This is typically solved with an auxiliary loss that penalizes imbalance, but this loss can hurt model performance.
DeepSeek’s solution is elegant: adjust the routing dynamically without modifying the training loss.
The Bias Adjustment Mechanism

g’_{i,t} = s_{i,t} if (s_{i,t} + b_i) ∈ TopK({s_{j,t} + b_j | 1 ≤ j ≤ N_r}, K_r), and 0 otherwise
Where: – b_i: Bias term for expert i (dynamically adjusted)
Key Insight:
The bias b_i is used only for routing decisions (TopK selection), not for computing the gating values. This means:
- Routing is influenced by the bias (encouraging balance)
- Gating values remain based on true affinity (preserving performance)
Dynamic Bias Update
At the end of each training step:

b_i ← b_i − γ  if expert i is overloaded
b_i ← b_i + γ  if expert i is underloaded
Where: – γ: Bias update speed (hyperparameter, typically 0.01)
Load Criteria:
An expert is considered:
- Overloaded if it receives more than (1 + ε) × (K_r / N_r) × batch_size tokens
- Underloaded if it receives fewer than (1 − ε) × (K_r / N_r) × batch_size tokens
- Balanced otherwise
Where ε is a tolerance parameter (typically 0.2)
How It Works:
- If an expert is overloaded, we decrease its bias, making it less likely to be selected
- If an expert is underloaded, we increase its bias, making it more likely to be selected
- Over time, this naturally balances the load without forcing artificial constraints
Analogy:
Think of it like dynamic pricing: if a restaurant is too crowded, prices go up slightly (negative bias), discouraging some customers. If it’s empty, prices go down (positive bias), attracting more customers. The quality of the food (true affinity) doesn’t change, but the selection behavior adjusts.
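A sketch of the bias-update rule under the load criteria above, assuming we have already counted how many tokens each expert received in the batch; γ and ε take the typical values quoted in the text.

```python
import numpy as np

def update_expert_biases(bias, tokens_per_expert, k_r, n_r, batch_tokens, gamma=0.01, eps=0.2):
    # Expected load per expert if routing were perfectly balanced
    expected = (k_r / n_r) * batch_tokens
    overloaded  = tokens_per_expert > (1 + eps) * expected
    underloaded = tokens_per_expert < (1 - eps) * expected
    # Overloaded experts become less likely to be selected, underloaded ones more likely.
    # The bias affects only TopK selection, never the gating values themselves.
    return bias - gamma * overloaded + gamma * underloaded

rng = np.random.default_rng(8)
n_r, k_r, batch_tokens = 16, 4, 10_000
bias = np.zeros(n_r)
# Simulated (imbalanced) per-expert token counts for one batch
tokens_per_expert = rng.multinomial(batch_tokens * k_r, rng.dirichlet(np.ones(n_r)))
print(update_expert_biases(bias, tokens_per_expert, k_r, n_r, batch_tokens))
```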
Complementary Sequence-Wise Balance Loss
While the auxiliary-loss-free strategy handles batch-level balance, a small auxiliary loss prevents extreme imbalance within individual sequences:

L_Bal = α Σ_{i=1}^{N_r} f_i P_i

where

f_i = (N_r / (K_r T)) Σ_{t=1}^{T} 1[s_{i,t} ∈ TopK({s_{j,t}}, K_r)]
P_i = (1/T) Σ_{t=1}^{T} s_{i,t} / Σ_{j=1}^{N_r} s_{j,t}

Breaking It Down:
- f_i: Fraction of tokens in the sequence routed to expert i
- P_i: Average routing probability for expert i
- α: Balance loss coefficient (typically 0.001-0.01)
- T: Sequence length
- 1[·]: Indicator function (1 if condition is true, 0 otherwise)
Why This Loss?
The product f_i × P_i is minimized when experts are balanced. If an expert has high routing probability (P_i) but low actual usage (f_i), or vice versa, the loss increases. This gently encourages balance within each sequence without the strong performance penalty of traditional auxiliary losses.
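A sketch of the sequence-wise balance loss under the formulation above, assuming sigmoid affinity scores for one sequence and normalizing them per token to obtain routing probabilities.

```python
import numpy as np

def sequence_balance_loss(scores, k_r, alpha=0.003):
    # scores: (T, N_r) token-to-expert affinity scores for one sequence
    T, n_r = scores.shape
    # Which experts each token actually routes to (top-K_r by score)
    selected = np.argsort(scores, axis=-1)[:, -k_r:]
    routed = np.zeros_like(scores)
    routed[np.arange(T)[:, None], selected] = 1.0
    f = (n_r / (k_r * T)) * routed.sum(axis=0)            # fraction routed to each expert
    probs = scores / scores.sum(axis=-1, keepdims=True)    # normalize scores per token
    P = probs.mean(axis=0)                                 # average routing probability
    return alpha * np.sum(f * P)

rng = np.random.default_rng(9)
scores = 1.0 / (1.0 + np.exp(-rng.normal(size=(128, 16))))  # sigmoid affinities
print(sequence_balance_loss(scores, k_r=4))
```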
Multi-Token Prediction
DeepSeek-V3 also employs a multi-token prediction objective:

L_MTP = Σ_{k=1}^{K} λ_k · L_k, where L_k is the cross-entropy loss for predicting the token k positions ahead

Where:
- K: Number of future tokens to predict (typically 2-4)
- λ_k: Weight for predicting k steps ahead (typically decreasing)
Why Predict Multiple Tokens?
- Better Representations: Forces the model to learn representations that are useful for multiple future predictions
- Speculative Decoding: The additional prediction heads can be used for speculative decoding during inference, generating multiple tokens in parallel
- Improved Performance: Empirically shown to improve benchmark scores
DeepSeek-V3 Complete Specifications
Model Architecture:
- Total parameters: 671B
- Active parameters per token: 37B (5.5%)
- Layers: 61
- Model dimension (d): 7,168
- Number of attention heads (n_h): 128
- FFN hidden dimension: 18,432
- Number of shared experts (N_s): 2
- Number of routed experts (N_r): 256
- Activated routed experts (K_r): 8
- Context length: 128K tokens
- KV compression dimension (d_c): 512
- Query compression dimension (d_c’): 1,536

Training Details:
- Training tokens: 14.8 trillion
- Training cost: $5.576 million (2.788M H800 GPU hours)
- Training time: ~2 months
- Batch size: 9,216 sequences
- Learning rate: Peak 4.2 × 10⁻⁴, cosine decay
- Precision: FP8 mixed precision
Performance:
- MMLU: 88.5
- MMLU-Pro: 75.9
- GPQA: 59.1
- MATH-500: 90.2
- Codeforces: 78.3 percentile
Part 4: Google Gemini – The Multimodal Frontier
The Gemini Philosophy: Unified Multimodal Understanding
While GPT and DeepSeek primarily process text, Gemini was designed from the ground up to understand multiple modalities—text, images, audio, and video—in a unified architecture. This isn’t just about processing different types of data separately; it’s about understanding relationships across modalities.
Sparse Mixture-of-Experts Foundation
Like DeepSeek, Gemini 2.5 uses a sparse MoE architecture, but optimized for multimodal processing:

y = Σ_{i ∈ TopK(Router(x), K)} Router_i(x) · Expert_i(x)

Where:
- N: Total number of experts (not publicly disclosed, estimated 100-200)
- K: Number of activated experts per token (estimated 10-20)
- Router(x): Learned routing function
- Expert_i(·): i-th expert FFN
Routing Function:

Router(x) = softmax(W_r x + b_r)

Where:
- W_r ∈ ℝ^(N × d): Routing weight matrix
- b_r ∈ ℝ^N: Routing bias vector
Multimodal Token Processing
The revolutionary aspect of Gemini is how it processes different modalities in a unified token space:

tokens = [T_text; T_image; T_audio; T_video]

Where [·; ·] denotes concatenation in the sequence dimension.

Text Tokenization
Standard subword tokenization:

T_text = E_text[BPE(text)]

Where:
- BPE: Byte-Pair Encoding algorithm
- E_text ∈ ℝ^(V × d): Text embedding matrix
- V: Vocabulary size (~256K tokens)
Image Tokenization
Images are divided into patches and embedded:

T_image = Linear(Patchify(image, patch_size)) + PE_2D

Where:
- patch_size: Typically 14×14 or 16×16 pixels
- Linear: Learned linear projection to model dimension
- PE_2D: 2D positional encoding (preserves spatial relationships)
For a 224×224 image with 14×14 patches: – Number of patches: (224/14)² = 256 patches – Each patch becomes one token – Total: 256 image tokens
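A sketch of the patch-embedding arithmetic for the 224×224 example, with a random projection standing in for the learned linear layer and the 2D positional encoding omitted.

```python
import numpy as np

def patchify(image, patch_size):
    # image: (H, W, C) -> (num_patches, patch_size * patch_size * C)
    H, W, C = image.shape
    rows, cols = H // patch_size, W // patch_size
    patches = image[:rows * patch_size, :cols * patch_size]
    patches = patches.reshape(rows, patch_size, cols, patch_size, C)
    return patches.transpose(0, 2, 1, 3, 4).reshape(rows * cols, -1)

rng = np.random.default_rng(10)
image = rng.random((224, 224, 3))
patches = patchify(image, patch_size=14)
print(patches.shape)                       # (256, 588): 256 patches, each 14*14*3 values
W_embed = rng.normal(size=(patches.shape[1], 512)) * 0.02
image_tokens = patches @ W_embed           # linear projection to the model dimension
print(image_tokens.shape)                  # (256, 512): one token per patch
```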
Audio Tokenization
Audio is processed in overlapping windows:

T_audio = CNN(STFT(audio)) + PE_temporal

Where:
- STFT: Short-Time Fourier Transform (converts audio to frequency domain)
- CNN: Convolutional neural network for feature extraction
- PE_temporal: Temporal positional encoding
Typical Configuration: – Window size: 25ms – Hop length: 10ms – For 1 second of audio: ~100 audio tokens
Video Tokenization
Video combines spatial and temporal processing:

T_video = [Patchify(frame_1); …; Patchify(frame_n)] + PE_spatiotemporal, with frame_1, …, frame_n = Sample(video, fps)

Where:
- Sample: Samples frames at a specified rate (typically 1-4 fps)
- PE_spatiotemporal: 3D positional encoding (spatial + temporal)
For a 1-minute video at 1 fps: – 60 frames – 256 tokens per frame – Total: 15,360 video tokens
Cross-Modal Attention
The key to multimodal understanding is allowing different modalities to attend to each other:

Attention(Q_text, K_all, V_all) = softmax(Q_text K_all^T / √d_k) V_all, where K_all and V_all are computed over the tokens of every modality
Example:
When processing the query “What color is the car in the image?”: – Text tokens attend to image tokens to locate the car – Image tokens attend to text tokens to understand what’s being asked – The model generates a response by integrating both modalities
Long Context Processing
Gemini 2.5 Pro supports context lengths exceeding 1 million tokens through several optimizations:
Efficient Attention Mechanisms

Sparse Attention Patterns:

Instead of full O(n²) attention, use structured sparsity:

Attention_sparse(Q, K, V) = softmax(Q [K_local; K_global]^T / √d_k) [V_local; V_global]

Where:
- K_local, V_local: Keys and values from nearby positions (e.g., within 512 tokens)
- K_global, V_global: Keys and values from selected global positions (e.g., every 64th token)
This reduces complexity from O(n²) to O(n × √n) or better.
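A sketch of which key positions a single query would attend to under a local-window-plus-strided-global pattern; the window and stride values follow the examples above, and the exact sparse pattern Gemini uses is not public.

```python
import numpy as np

def sparse_attention_positions(query_pos, local_window=512, global_stride=64):
    # Positions this query may attend to: a local window plus strided "global" positions
    local = np.arange(max(0, query_pos - local_window), query_pos + 1)
    global_ = np.arange(0, query_pos + 1, global_stride)
    return np.union1d(local, global_)

positions = sparse_attention_positions(query_pos=50_000)
print(len(positions))   # ~1,290 key positions instead of 50,001 under full causal attention
```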
Memory-Efficient KV Cache Quantization:
Storing keys and values in 8-bit precision instead of 16-bit reduces memory by 50% with minimal quality loss.
Selective Caching:
Only cache the most important positions based on recent attention patterns.
Extended Reasoning: The Thinking Process
Gemini 2.5 incorporates “thinking” capabilities through extended reasoning:
Chain-of-Thought Generation:
The model generates intermediate reasoning steps before the final answer.
Verification:
The model evaluates its own reasoning and regenerates if confidence is low.
Mathematical Formulation:

P(r_1, …, r_M, y | x) = [ Π_{i=1}^{M} P(r_i | x, r_1, …, r_{i−1}) ] · P(y | x, r_1, …, r_M)

Where:
- r_i: i-th reasoning step
- M: Number of reasoning steps
- y: Final answer
- x: Input
Distillation for Smaller Models
Smaller Gemini models (Flash, Flash-Lite) learn from larger models through distillation:

L_distill = KL(P_teacher ‖ P_student) = Σ_v P_teacher(v | x) log [ P_teacher(v | x) / P_student(v | x) ]

Where P_teacher and P_student are the teacher’s and student’s next-token distributions over the vocabulary.

K-Sparse Approximation:

To reduce storage, only the top-K logits from the teacher are stored:

P̂_teacher(v | x) ∝ P_teacher(v | x) if v ∈ TopK(P_teacher(· | x), K), and 0 otherwise
Typical K: 256-1024 (out of ~256K vocabulary)
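A sketch of distillation restricted to the teacher’s top-K logits, assuming a KL-style objective; the exact loss Gemini uses is not disclosed, so this only illustrates the K-sparse idea.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def topk_distillation_loss(teacher_logits, student_logits, k):
    # Keep only the teacher's top-k tokens and renormalize both models over that support
    top = np.argsort(teacher_logits)[-k:]
    p_teacher = softmax(teacher_logits[top])
    p_student = softmax(student_logits)[top]
    p_student = p_student / p_student.sum()
    # KL(teacher || student) restricted to the stored top-k tokens
    return np.sum(p_teacher * np.log(p_teacher / p_student))

rng = np.random.default_rng(11)
vocab = 256_000
teacher = rng.normal(size=vocab)
student = teacher + 0.5 * rng.normal(size=vocab)   # a noisy copy of the teacher
print(topk_distillation_loss(teacher, student, k=256))
```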
Training Objectives
Gemini uses multiple training objectives:
- Next-Token Prediction (Primary): the same autoregressive objective used by GPT, applied to the unified multimodal token sequence.
- Multimodal Alignment: Ensures text and image representations are aligned.
- Instruction Following: supervised fine-tuning on instruction-response pairs.
- Preference Learning: responses preferred by human raters are scored above rejected ones, with a pairwise loss of the form

L_pref = −log σ( r(x, y_preferred) − r(x, y_rejected) )

Where:
- σ: Sigmoid function
- r(x, y): Reward model score
Gemini 2.5 Pro Specifications
Model Architecture:
- Architecture: Sparse MoE Transformer
- Total parameters: Not disclosed (estimated 500B-1T)
- Active parameters: Estimated 100-200B per token
- Context length: 1M+ tokens (2M for some versions)
- Supported modalities: Text, Image, Audio, Video
- Output modalities: Text, Audio (experimental)
- Output length: Up to 64K tokens
Training Details:
- Training data cutoff: January 2025
- Training infrastructure: TPUv5p pods
- Precision: Mixed precision (BF16/FP8)
- Training scale: Multiple datacenters, thousands of accelerators
Performance Highlights:
- MMLU: ~90 (estimated)
- Codeforces: High percentile
- Multimodal benchmarks: State-of-the-art
- Video understanding: Up to 3 hours of video
- Long context: Excellent performance up to 1M tokens
Part 5: Comparative Analysis
Mathematical Complexity Comparison
| Operation | GPT (Dense) | DeepSeek (MoE + MLA) | Gemini (MoE) |
| --- | --- | --- | --- |
| Attention per Layer | O(n² × d) | O(n² × d_c), d_c ≪ d | O(n² × d) with sparse patterns |
| FFN per Layer | O(n × d²) | O(n × d² × (N_s + K_r)/N_total) | O(n × d² × K/N) |
| Parameters per Token | 100% | 5.5% (37B/671B) | ~10-20% |
| KV Cache per Token | O(n_h × d_h) = O(d) | O(d_c + d_h^R) ≈ 0.02 × O(d) | O(d) with quantization |
| Context Length | 128K | 128K | 1M+ |
Training Efficiency Comparison
| Metric | GPT-3 | DeepSeek-V3 | Gemini 2.5 Pro |
| --- | --- | --- | --- |
| Total Parameters | 175B | 671B | ~500B-1T (est.) |
| Active Parameters | 175B | 37B | ~100-200B (est.) |
| Training Tokens | ~300B | 14.8T | Not disclosed |
| Training Cost | Not disclosed | $5.576M | Not disclosed |
| FLOPs per Token (Training) | High | Low (due to MoE) | Medium |
| FLOPs per Token (Inference) | ~2 × 175B | ~2 × 37B | ~2 × 100-200B (est.) |
Architectural Philosophy Comparison
OpenAI GPT:
- Philosophy: Simplicity and scale
- Key Innovation: Demonstrating that scaled Transformers develop emergent capabilities
- Trade-off: Maximum capability at maximum cost
- Best For: Applications where cost is secondary to reliability and performance

DeepSeek:
- Philosophy: Efficiency without compromise
- Key Innovations: MLA for 98% KV cache reduction; auxiliary-loss-free load balancing; multi-token prediction
- Trade-off: Architectural complexity for computational efficiency
- Best For: Cost-sensitive deployments, organizations with limited resources

Gemini:
- Philosophy: Unified multimodal intelligence
- Key Innovations: Native multimodal processing; extended reasoning capabilities; extreme long context (1M+ tokens)
- Trade-off: Complexity in handling multiple modalities
- Best For: Applications requiring vision, audio, or video understanding
Performance Characteristics
Latency: – GPT: Moderate (all parameters active) – DeepSeek: Low (fewer active parameters, efficient attention) – Gemini: Variable (depends on modality and context length)
Memory: – GPT: High (full KV cache, all parameters) – DeepSeek: Low (compressed KV cache, sparse activation) – Gemini: High for long contexts (but optimized)
Throughput: – GPT: Moderate (limited by memory bandwidth) – DeepSeek: High (sparse activation enables larger batch sizes) – Gemini: Moderate to high (depends on modality mix)
Use Case Suitability
GPT is Ideal For: – General-purpose text generation – Applications requiring proven reliability – Organizations with substantial computational resources – Use cases where consistency is critical
DeepSeek is Ideal For: – Cost-sensitive deployments – High-throughput applications – Organizations with limited GPU resources – Applications requiring long context at reasonable cost
Gemini is Ideal For: – Multimodal applications (image + text, video + text) – Applications requiring very long context (>100K tokens) – Complex reasoning tasks – Agentic systems combining multiple capabilities
Part 6: The Future of LLM Architectures
Emerging Trends
- Convergence of Approaches

We’re seeing convergence toward hybrid architectures: GPT models exploring sparse activation, DeepSeek expanding into multimodal, and all models adopting MoE for efficiency.
- Efficiency as a First-Class Concern

The success of DeepSeek demonstrates that efficiency innovations can match or exceed brute-force scaling: capability is a function of architecture, data, and compute together, not of raw compute alone.
- Multimodal as Default
Future models will likely be multimodal by default, as the world’s information exists in multiple modalities.
- Extended Reasoning

The “thinking” capabilities in Gemini 2.5 and similar models represent a shift toward more deliberate, verifiable reasoning.
Open Research Questions
- Optimal Sparsity Patterns

What is the optimal balance between the number of experts, expert size, activation rate, and routing mechanism?
- Scaling Laws for MoE

Traditional scaling laws relate loss to parameter count N and data size D (for example, the Chinchilla form L(N, D) ≈ E + A/N^α + B/D^β). How do these change for MoE models where active parameters ≠ total parameters?
- Cross-Modal Understanding
How can we better measure and improve cross-modal reasoning? Current benchmarks focus on single-modality performance.
- Efficient Long Context
Can we achieve O(n) or O(n log n) attention without sacrificing quality?
Candidates include linear attention mechanisms, state space models, and hierarchical attention.
Mathematical Frontiers
- Better Attention Mechanisms

Current research explores alternatives to softmax attention, such as linear attention:

Attention_linear(Q, K, V) = φ(Q) (φ(K)^T V)
Where φ is a feature map that allows O(n) complexity.
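A sketch of linear attention with a simple non-negative feature map φ (a ReLU shifted by a small constant); by computing φ(K)^T V once, the cost grows linearly with sequence length instead of quadratically.

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # phi is the feature map; a shifted ReLU keeps all features positive
    Qp, Kp = phi(Q), phi(K)                 # (n, d)
    KV = Kp.T @ V                            # (d, d), computed once: O(n * d^2)
    normalizer = Qp @ Kp.sum(axis=0)         # (n,)
    return (Qp @ KV) / normalizer[:, None]   # O(n * d^2) instead of O(n^2 * d)

rng = np.random.default_rng(12)
n, d = 4096, 64
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)   # (4096, 64), cost linear in sequence length
```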
- Adaptive Computation
Allow models to use more computation for harder problems, for example by varying the number of reasoning steps or active experts with input difficulty.
- Continuous Learning

Current models are static after training. Future models might continuously update their parameters as new data arrives, while maintaining stability and preventing catastrophic forgetting.
Conclusion: The Elegant Mathematics of Machine Intelligence
We’ve journeyed through the mathematical foundations of three remarkable AI architectures. What emerges is a picture of elegant simplicity giving rise to extraordinary complexity.
The Core Insights
- Attention is Universal

The attention mechanism—a simple weighted sum based on compatibility scores—turns out to be sufficient for modeling complex relationships in language, vision, and beyond:

Attention(Q, K, V) = softmax(QK^T / √d_k) V
This formula, simple enough to fit on a single line, powers systems that can write poetry, prove theorems, and understand images.
- Sparsity is Powerful

DeepSeek demonstrated that we don’t need to activate all parameters for every token. Selective activation through MoE, in which each token is processed by only a handful of the available experts, achieves comparable performance to dense models while using a fraction of the computation.
- Compression Preserves Information

MLA showed that we can compress keys and values by 98% and still maintain performance, because the essential information is preserved in a low-dimensional subspace.
- Multimodality is Natural

Gemini demonstrated that different modalities can be processed in a unified token space.
The same attention mechanism that relates words to words can relate images to text, audio to video, and any modality to any other.
The Human Element
Despite the sophisticated mathematics, these systems ultimately serve human needs. They help us: – Communicate across language barriers – Access and synthesize information – Solve complex problems – Create new forms of art and expression – Understand the world in new ways
A Note of Wonder
There’s something profound about the fact that intelligence—whether natural or artificial—can be approximated by mathematical functions. The same principles that describe the motion of planets and the behavior of particles also describe the processing of information and the generation of meaning.
We’ve seen that: – A simple attention formula captures the essence of focus and relevance – Sparse activation mirrors how human experts specialize – Compression reveals the low-dimensional structure of information – Unified processing across modalities reflects how humans integrate sensory information
Looking Forward
The mathematics we’ve explored represents our current understanding, but the field continues to evolve rapidly. Future architectures will likely: – Be more efficient (doing more with less) – Be more capable (understanding deeper relationships) – Be more accessible (available to more people and organizations) – Be more aligned (better serving human values and needs)
The journey from simple formulas to intelligent behavior is a testament to both the power of mathematics and the ingenuity of human creativity. As we continue to refine these architectures, we move closer to systems that can truly augment human intelligence and help us tackle the great challenges of our time.
Acknowledgments
This document was prepared to illuminate the mathematical foundations of modern AI for a broad audience. We are grateful to the research teams at OpenAI, DeepSeek, and Google DeepMind for their groundbreaking work and for publishing detailed technical reports that make this kind of analysis possible.
Special thanks to the authors of “Attention Is All You Need” (Vaswani et al., 2017), whose elegant architecture laid the foundation for the current generation of AI systems.
References
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998-6008. https://arxiv.org/abs/1706.03762
- DeepSeek-AI. (2024). DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437. https://arxiv.org/abs/2412.19437
- Dai, D., Deng, C., Zhao, C., Xu, R. X., Gao, H., Chen, D., Li, J., Ding, W., Li, M., Xiao, Y., Wang, P., Huang, K., Sui, Y., Ruan, C., Zheng, Z., Yu, K., Cheng, X., Guo, X., Gu, S., … & Bi, J. (2024). DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066.
- Gemini Team, Google. (2023). Gemini: A Family of Highly Capable Multimodal Models. arXiv preprint arXiv:2312.11805. https://arxiv.org/abs/2312.11805
- Gemini Team, Google. (2025). Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. Technical Report. https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf
- Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2024). RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568, 127063.
- Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
- Fedus, W., Zoph, B., & Shazeer, N. (2021). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1-39.
- Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., & Chen, Z. (2020). GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.
- Clark, A., de Las Casas, D., Guy, A., Mensch, A., Paganini, M., Hoffmann, J., Damoc, B., Hechtman, B., Cai, T., Borgeaud, S., et al. (2022). Unified scaling laws for routed language models. arXiv preprint arXiv:2202.01169.
Document prepared by:
Hene Aku Kwapong, PhD, MBA
MIT Practice School Alumni
With the assistance of: Jeremie Kwapong
New England Innovation Academy
October 9, 2025