Understanding the Internal Mechanics of LLMs

Large Language Models (LLMs) have transformed natural language processing and artificial intelligence over the past few years. While their impressive capabilities are readily apparent, the underlying mechanics that drive these systems remain mysterious to many. This article delves into the technical intricacies that power today’s cutting-edge LLMs.

The Transformer Architecture

Modern LLMs are built upon the Transformer architecture, first introduced in the seminal paper “Attention Is All You Need” (Vaswani et al., 2017). Unlike previous recurrent neural network approaches, Transformers process entire sequences simultaneously through self-attention mechanisms.

At its core, the Transformer relies on the multi-head attention mechanism:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where $Q$ (queries), $K$ (keys), and $V$ (values) are linear projections of the input embeddings, and $d_k$ is the dimension of the keys, serving as a scaling factor to stabilize gradients during training.

Multi-head attention extends this further by computing attention multiple times in parallel:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$

Where each head is:

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
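
As a concrete illustration, here is a minimal NumPy sketch of scaled dot-product attention and a naive multi-head wrapper. The projection matrices, shapes, and head count are placeholders for illustration, not any particular model's configuration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (L_q, L_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    # Project once, split the projections into heads, attend per head,
    # then concatenate and apply the output projection W^O.
    d_model = X.shape[-1]
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = [
        scaled_dot_product_attention(
            Q[:, i * d_head:(i + 1) * d_head],
            K[:, i * d_head:(i + 1) * d_head],
            V[:, i * d_head:(i + 1) * d_head],
        )
        for i in range(n_heads)
    ]
    return np.concatenate(heads, axis=-1) @ W_o

L, d_model, n_heads = 10, 64, 8
X = np.random.randn(L, d_model)
W_q, W_k, W_v, W_o = (np.random.randn(d_model, d_model) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)  # shape (10, 64)
```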

The Scaling Laws: Why Size Matters

One of the most significant discoveries in LLM research is the emergence of scaling laws. These empirical relationships show how model performance improves predictably with increases in model size, dataset size, and compute resources.

Kaplan et al. (2020) identified that loss LL scales with model parameters NN approximately as:

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}$$

Where $N_c$ is a constant and $\alpha_N \approx 0.076$, indicating that each doubling of model size reduces loss by approximately 5%.

Similarly, loss scales with dataset size DD as:

$$L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}$$

Where $\alpha_D \approx 0.095$; the slightly larger exponent means that, per doubling, loss falls a bit faster with more data (roughly 6%) than with more parameters.
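
A quick back-of-the-envelope check of what these exponents imply per doubling, using only the values quoted above:

```python
# Loss reduction implied by one doubling of N or D, given the exponents above.
alpha_N, alpha_D = 0.076, 0.095
drop_N = 1 - 2 ** -alpha_N   # ~0.051 -> roughly 5% lower loss per doubling of parameters
drop_D = 1 - 2 ** -alpha_D   # ~0.064 -> roughly 6% lower loss per doubling of data
print(f"{drop_N:.3f} {drop_D:.3f}")
```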

Training Dynamics: Navigating Loss Landscapes

Training LLMs involves navigating an extraordinarily high-dimensional loss landscape. The objective function in language modeling is typically next-token prediction, formulated as:

$$\mathcal{L}(\theta) = -\frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} \log p_\theta(y \mid x)$$

Where $\theta$ represents the model parameters, $\mathcal{D}$ is the training dataset, and $p_\theta(y \mid x)$ is the probability assigned by the model to the correct next token $y$ given context $x$.

Most LLMs use variants of adaptive optimization algorithms, with AdamW being particularly popular:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta_t = \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \eta \lambda \theta_{t-1}$$

Where $g_t$ is the gradient, $m_t$ and $v_t$ are the first and second moment estimates, $\beta_1$ and $\beta_2$ are decay rates, $\eta$ is the learning rate, $\lambda$ is the weight decay coefficient, and $\epsilon$ is a small constant for numerical stability.
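
The update maps almost line for line onto code. Below is a minimal NumPy sketch of a single AdamW step; the default hyperparameters are illustrative rather than tied to any specific training run.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # One AdamW update, mirroring the equations above.
    m = beta1 * m + (1 - beta1) * grad              # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2         # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                    # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * weight_decay * theta
    return theta, m, v

theta = np.zeros(4)
m, v = np.zeros(4), np.zeros(4)
for t in range(1, 4):                               # a few toy steps
    grad = np.random.randn(4)
    theta, m, v = adamw_step(theta, grad, m, v, t)
```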

Tokenization: The Critical First Step

Before processing text, LLMs must convert it into numerical representations through tokenization. Modern LLMs typically use subword tokenization algorithms like Byte-Pair Encoding (BPE) or SentencePiece.

BPE starts with individual characters and iteratively merges the most frequent adjacent pairs to build a vocabulary of a specified size. The process can be formalized as follows, with a toy implementation sketched after the list:

  1. Initialize vocabulary $V$ with all unique characters in the corpus.
  2. Compute the frequencies of all adjacent token pairs $(a, b)$ where $a, b \in V$.
  3. Find the most frequent pair $(a, b)$ and add the merged token $ab$ to $V$.
  4. Replace all occurrences of $(a, b)$ with $ab$ in the corpus.
  5. Repeat steps 2-4 until the vocabulary reaches the desired size or the merge frequency falls below a threshold.
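
A toy Python sketch of this merge loop on a tiny hypothetical corpus (a production tokenizer would also handle pre-tokenization, byte-level fallback, and special tokens):

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    # corpus: list of words; tokens start out as single characters.
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]         # most frequent adjacent pair
        merges.append((a, b))
        merged = a + b
        for w in words:                             # replace (a, b) with the merged token
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

print(bpe_train(["low", "lower", "lowest", "low"], num_merges=3))
```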

Positional Encodings: Capturing Sequential Information

Unlike RNNs, Transformers have no inherent understanding of sequential order. To address this, position encodings are added to input embeddings. The original Transformer used sinusoidal encodings:

$$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Where $pos$ is the position index, $i$ is the dimension index, and $d_{model}$ is the model dimension.
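
A short NumPy sketch that builds the sinusoidal encoding table (assuming an even model dimension):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(max_len=512, d_model=64)  # added to the input embeddings
```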

Later models adopted learned positional encodings, while more recent architectures like RoPE (Rotary Position Embedding) incorporate position information directly into attention calculations:

$$\text{RoPE}(q, k, m, n) = \left(\begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} q_0 \\ q_1 \end{pmatrix}\right)^{\!T} \begin{pmatrix} \cos n\theta & -\sin n\theta \\ \sin n\theta & \cos n\theta \end{pmatrix} \begin{pmatrix} k_0 \\ k_1 \end{pmatrix}$$

Where $m$ and $n$ are the query and key positions; because the rotations compose, the resulting score depends only on the relative offset $n - m$.
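
A simplified single-frequency sketch shows why the rotated dot product depends only on relative position; the `theta` value here is a hypothetical placeholder, and real RoPE uses a different frequency for each dimension pair.

```python
import numpy as np

def rope_rotate(x, pos, theta=0.1):
    # Rotate each 2-D pair (x[2i], x[2i+1]) by the angle pos * theta.
    # Toy version with one shared frequency; real RoPE uses theta_i = 10000^(-2i/d).
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = c * x_even - s * x_odd
    out[1::2] = s * x_even + c * x_odd
    return out

q, k = np.random.randn(4), np.random.randn(4)
m, n = 3, 7
# The score between a rotated query at position m and a rotated key at
# position n depends only on the relative offset n - m.
score = rope_rotate(q, m) @ rope_rotate(k, n)
```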

Computational Efficiency Innovations

Training and running LLMs requires enormous computational resources. Several innovations have made this more manageable:

Mixed Precision Training

Computing operations in lower precision (FP16 or BF16 instead of FP32) reduces memory requirements and computation time. Because small gradients can underflow in FP16, the loss is scaled up before backpropagation and the resulting gradients are unscaled in FP32:

$$\text{Loss}_{scaled} = \text{Loss} \times \text{scale\_factor}$$
$$\text{Gradients}_{FP16} = \text{Compute\_Gradients}(\text{Loss}_{scaled})$$
$$\text{Gradients}_{FP32} = \text{Gradients}_{FP16} / \text{scale\_factor}$$
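
A framework-free sketch of this arithmetic, using made-up loss and scale values, shows why the scaling step matters:

```python
import numpy as np

scale_factor = np.float32(2.0 ** 14)
loss_fp32 = np.float32(3.1e-5)
grad_fraction = 1e-4          # stand-in for "gradient = tiny fraction of the loss"

# Without scaling, the gradient underflows to zero in FP16:
print(np.float16(loss_fp32 * grad_fraction))           # -> 0.0

# With scaling, it survives the FP16 round trip and is unscaled in FP32:
loss_scaled = loss_fp32 * scale_factor
grads_fp16 = np.float16(loss_scaled * grad_fraction)   # stand-in for FP16 backprop
grads_fp32 = np.float32(grads_fp16) / scale_factor
print(grads_fp32)                                       # -> roughly 3.1e-09
```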

Attention Optimizations

FlashAttention reduces memory-bandwidth bottlenecks by computing attention block by block in fast on-chip memory and recomputing intermediate results on the fly during the backward pass, rather than materializing the full attention matrix:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V = \hat{P}V$$

Traditional implementation: $O(N^2)$ memory for the attention matrix $\hat{P}$.

FlashAttention: $O(N)$ memory, by tiling the computation into smaller blocks that fit in fast on-chip SRAM.
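
The following NumPy sketch captures the tiling idea with an online (streaming) softmax, so only one block of logits exists at a time. It is a simplification: the real kernels also fuse the backward pass and manage SRAM explicitly.

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    # Process K/V in blocks while maintaining running softmax statistics,
    # so the full L x L attention matrix is never materialized.
    L, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    row_max = np.full(L, -np.inf)        # running max of logits per query
    row_sum = np.zeros(L)                # running softmax denominator per query
    for start in range(0, K.shape[0], block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        logits = (Q @ Kb.T) * scale                    # (L, block)
        new_max = np.maximum(row_max, logits.max(axis=1))
        correction = np.exp(row_max - new_max)         # rescale previous accumulators
        p = np.exp(logits - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vb
        row_max = new_max
    return out / row_sum[:, None]

L, d = 256, 64
Q, K, V = (np.random.randn(L, d) for _ in range(3))
out = tiled_attention(Q, K, V)   # matches dense softmax(QK^T/sqrt(d))V up to float error
```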

Inference Optimization Techniques

Large LLMs present significant inference challenges. Key techniques to address these include:

KV Caching

During autoregressive generation, previous key-value pairs from attention layers are cached to avoid recomputation:

$$\text{KV-Cache}_t = \{\text{KV-Cache}_{t-1}, (K_t, V_t)\}$$

Where $K_t$ and $V_t$ are the key-value pairs generated at time step $t$.
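
A minimal sketch of the caching pattern, with hypothetical projection matrices and a toy hidden size:

```python
import numpy as np

def generate_step(x_t, W_k, W_v, kv_cache):
    # Compute only this step's key/value and append them to the cache,
    # instead of recomputing K and V for the whole prefix.
    kv_cache["K"].append(x_t @ W_k)
    kv_cache["V"].append(x_t @ W_v)
    K = np.stack(kv_cache["K"])          # (t, d): all keys seen so far
    V = np.stack(kv_cache["V"])          # (t, d)
    return K, V                          # used by attention for the new token

d_model = 8
W_k, W_v = np.random.randn(d_model, d_model), np.random.randn(d_model, d_model)
cache = {"K": [], "V": []}
for _ in range(5):                       # five decoding steps
    K, V = generate_step(np.random.randn(d_model), W_k, W_v, cache)
```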

Quantization

Reducing the precision of model weights and activations after training:

$$W_q = \text{round}\!\left(\frac{W - \min(W)}{\max(W) - \min(W)} \times (2^b - 1)\right) \times \frac{\max(W) - \min(W)}{2^b - 1} + \min(W)$$

Where $W$ is the original weight tensor, $W_q$ is the quantized tensor, and $b$ is the bit width (e.g., 4 or 8 bits).
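
A direct NumPy translation of this min-max scheme (quantize, then dequantize to inspect the error):

```python
import numpy as np

def minmax_quantize(W, bits=8):
    # Map W onto 2^bits uniformly spaced levels, then map back to floats
    # to see what the dequantized weights look like.
    lo, hi = W.min(), W.max()
    levels = 2 ** bits - 1
    q = np.round((W - lo) / (hi - lo) * levels)     # integer codes in [0, levels]
    return q * (hi - lo) / levels + lo              # dequantized approximation

W = np.random.randn(4, 4).astype(np.float32)
W_q = minmax_quantize(W, bits=4)
print(np.abs(W - W_q).max())                        # worst-case quantization error
```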

Understanding LLM Capabilities Through Scaling

Emergent abilities—capabilities not present in smaller models but appearing in larger ones—remain one of the most fascinating aspects of LLMs. These can be visualized through step function-like improvements:

$$P(\text{correct}) = \sigma\!\left(\alpha \log\!\left(\frac{N}{N_0}\right)\right)$$

Where $P(\text{correct})$ is the probability of correctly performing a task, $N$ is the number of parameters, $N_0$ is a threshold parameter count, and $\sigma$ is the logistic function.
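
Plugging illustrative values into this curve shows the characteristic sharp transition; the $N_0$ and $\alpha$ below are hypothetical, chosen only to make the shape visible:

```python
import numpy as np

def p_correct(N, N0=1e9, alpha=2.0):
    # Logistic curve over log parameter count (illustrative constants).
    return 1.0 / (1.0 + np.exp(-alpha * np.log(N / N0)))

for N in [1e8, 1e9, 1e10, 1e11]:
    print(f"{N:.0e}: {p_correct(N):.2f}")   # ~0.01, 0.50, 0.99, 1.00
```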

The Challenge of Context Windows

LLMs’ ability to process long contexts has improved dramatically, but longer context windows are expensive because standard attention scales quadratically with sequence length:

$$\text{Complexity}_{\text{attention}} = O(L^2 \cdot d)$$

Where $L$ is the sequence length and $d$ is the hidden dimension.

Techniques like sparse attention patterns reduce this to:

$$\text{Complexity}_{\text{sparse}} = O(L \cdot \log(L) \cdot d)$$
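
For a sense of scale, comparing the two operation counts at a hypothetical 32k-token context with a 4096-dimensional hidden state:

```python
import math

L, d = 32_768, 4_096
dense = L ** 2 * d                 # O(L^2 * d)
sparse = L * math.log2(L) * d      # O(L * log L * d)
print(f"dense:  {dense:.2e}")      # ~4.4e+12
print(f"sparse: {sparse:.2e}")     # ~2.0e+09, roughly 2000x fewer operations
```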

Conclusion

The internal mechanics of LLMs represent one of the most sophisticated achievements in AI engineering. From the elegant mathematics of attention to the empirical scaling laws that guide development, these systems combine theoretical insights with practical engineering solutions. As research continues, we can expect further refinements to these architectures that will expand their capabilities while addressing current limitations in reasoning, factuality, and computational efficiency.

Understanding these mechanics not only satisfies scientific curiosity but provides the foundation for developing more capable, efficient, and reliable AI systems in the future.