Training and Inference

Training and inference are the two core phases of Large Language Models (LLMs): training builds the model by teaching it language from vast amounts of data (like building a library), while inference uses that trained model in real time to understand prompts and generate responses (like a librarian answering a question). Training creates the model's foundational knowledge and patterns, whereas inference applies that knowledge to new, unseen inputs, predicting the most likely next word to form coherent answers.

Training (Building the Brain)
- What it is: The process of feeding massive datasets of text to the model so it can learn the statistical patterns of language.
- Goal: To adjust billions of internal parameters (weights) so the model becomes good at predicting the next word in a sequence.
- Process: Computationally intensive, time-consuming, and requires huge resources (like months of GPU time).
- Analogy: Reading every book in a library to learn all the information and how language works.

Inference (Putting It to Work)
- What it is: The stage where the trained model is deployed to serve users, taking a prompt and generating a response.
- Goal: To generate human-like text by predicting likely sequences of tokens (words/sub-words) based on the prompt and learned patterns.
- Process: Faster and less power-hungry than training (though still significant), with each token generated in milliseconds.
- Analogy: Asking the librarian (the model) a question and getting an instant, crafted answer.
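A minimal sketch of the inference loop described above, assuming greedy next-token decoding. The "model" here is a hypothetical lookup table of next-token probabilities standing in for a real trained LLM:

```python
# Inference: repeatedly predict the most likely next token and append it.
# next_token_probs is a toy stand-in for a trained model's output.
next_token_probs = {
    "the":  {"cat": 0.6, "dog": 0.4},
    "cat":  {"sat": 0.7, "ran": 0.3},
    "sat":  {"down": 0.9, "up": 0.1},
    "down": {"<end>": 1.0},
}

def generate(prompt_token, max_tokens=10):
    sequence = [prompt_token]
    for _ in range(max_tokens):
        probs = next_token_probs.get(sequence[-1])
        if probs is None:
            break
        # Greedy decoding: take the single most likely next token.
        next_token = max(probs, key=probs.get)
        if next_token == "<end>":
            break
        sequence.append(next_token)
    return " ".join(sequence)

print(generate("the"))  # -> "the cat sat down"
```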
Matrices

First thing to do is brush up on matrices.
- Matrix Math
- Dot Product (20 min in)
- Matrices vs Tensors
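A quick refresher in code, assuming NumPy is available; the values are arbitrary toy numbers:

```python
import numpy as np

# Dot product: multiply paired elements and sum them.
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(a @ b)  # 1*4 + 2*5 + 3*6 = 32

# A matrix is a 2-D grid of numbers; a tensor generalizes this to
# any number of dimensions (0-D scalar, 1-D vector, 2-D matrix, ...).
m = np.array([[1, 2], [3, 4]])  # 2x2 matrix (rank-2 tensor)
t = np.zeros((2, 3, 4))         # rank-3 tensor
print(m.ndim, t.ndim)           # 2 3
```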
DL Basics

An epoch is one full pass through the training dataset. Brush up on Deep Learning, then build an LLM from scratch for a deep dive into the inner workings.
- Deep Learning
- LLM from scratch
- Deep Dive into LLMs like ChatGPT
- Let's reproduce GPT-2 (124M)
- LLM Visualization
- Tiktokenizer
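A self-contained sketch of where an epoch fits in a training loop, assuming PyTorch; the model and data are toy placeholders, not a real LLM:

```python
import torch
from torch import nn

# Toy setup: a linear model learning to sum 4 numbers.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

inputs = torch.randn(32, 4)                   # toy dataset: 32 examples
targets = inputs.sum(dim=1, keepdim=True)

for epoch in range(5):                        # 5 epochs = 5 full passes over the data
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()                           # backpropagation computes gradients
    optimizer.step()                          # learnable parameters get updated
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```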
Transformer

Model architecture.
- Attention Is All You Need
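A sketch of scaled dot-product attention, the core operation of the Transformer introduced in "Attention Is All You Need"; shapes are toy values and NumPy is assumed:

```python
import numpy as np

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len, d_k = 4, 8
Q = np.random.randn(seq_len, d_k)   # queries
K = np.random.randn(seq_len, d_k)   # keys
V = np.random.randn(seq_len, d_k)   # values

scores = Q @ K.T / np.sqrt(d_k)     # how much each token attends to every other token
weights = softmax(scores)           # each row sums to 1
output = weights @ V                # weighted mix of the values
print(output.shape)                 # (4, 8)
```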
ReLU

The formula for ReLU is straightforward: if the input x is greater than zero, the output is x; if x is less than or equal to zero, the output is 0.
- Positive input: relu(5) = 5
- Negative input: relu(-2) = 0
- Zero input: relu(0) = 0

Mathematical Definition: f(x) = max(0, x)
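A direct translation of the definition into code, reproducing the three examples above:

```python
# f(x) = max(0, x)
def relu(x):
    return max(0, x)

print(relu(5))   # 5  (positive input passes through unchanged)
print(relu(-2))  # 0  (negative input is clipped to zero)
print(relu(0))   # 0
```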
Subscripts

Subscripts are just a way to distinguish multiple variables that share the same base symbol, like r. They behave the same way as any other variables.
- Subscripts

Mathematical Definition: r_a = r_1 + r_2
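In code, subscripted symbols are ordinary variables, usually written with suffixes or stored in an indexed container; the numbers below are arbitrary:

```python
# r_a = r_1 + r_2, with subscripts written as name suffixes.
r1 = 3.0
r2 = 4.5
r_a = r1 + r2
print(r_a)        # 7.5

# Equivalently, as an indexed collection:
r = [3.0, 4.5]
print(sum(r))     # 7.5
```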
Learnable Parameter Matrix

W_i^Q represents the learnable parameter matrix for the query (Q) transformation specific to the i-th attention head.
- W stands for Weight: a matrix of parameters that the model learns during the training process via backpropagation.
- Q stands for Query: this weight matrix is used for the Query transformation, as opposed to W_i^K for Key or W_i^V for Value. A query represents the information a particular token is looking for from other tokens in the sequence.
- i is an index: it specifies which of the multiple attention heads the matrix belongs to. In multi-head attention, each head has its own unique set of W^Q, W^K, and W^V matrices.
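A sketch of how W_i^Q is applied, assuming NumPy; the random matrices below stand in for values that would actually be learned via backpropagation, and the dimensions are toy choices:

```python
import numpy as np

seq_len, d_model, d_k, num_heads = 4, 16, 8, 2

X = np.random.randn(seq_len, d_model)        # token embeddings

for i in range(num_heads):
    # Each head i has its own unique weight matrices (here random,
    # standing in for learned parameters).
    W_Q = np.random.randn(d_model, d_k)      # W_i^Q
    W_K = np.random.randn(d_model, d_k)      # W_i^K
    W_V = np.random.randn(d_model, d_k)      # W_i^V
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # per-head projections
    print(f"head {i}: Q shape {Q.shape}")    # (4, 8)
```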
Membership Sign ∈

The "is an element of" or "membership" sign. The expression x ∈ R means that x is a member (or element) of the set R. For example, 2 ∈ {1, 2, 3} is true, while 5 ∈ {1, 2, 3} is false.
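Python's `in` operator plays the same role as ∈, which makes the examples above easy to check:

```python
# "in" tests set membership, just like the ∈ sign.
R = {1, 2, 3}
print(2 in R)        # True:  2 ∈ R
print(5 in R)        # False: 5 ∉ R
```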