Training and Inference

Training and inference are the two core phases of Large Language Models (LLMs). Training builds the model by teaching it language from vast data (like building a library), while inference uses that trained model in real time to understand prompts and generate responses (like a librarian answering a question). Training creates the model's foundational knowledge and patterns, whereas inference applies that knowledge to new, unseen inputs, predicting the most likely next word to form coherent answers.

Training (Building the Brain)
- What it is: The process of feeding massive datasets through the model so it learns the statistical patterns of language.
- Goal: To adjust billions of internal parameters (weights) so the model becomes good at predicting the next word in a sequence.
- Process: Computationally intensive, time-consuming, and requires huge resources (like months of GPU time).
- Analogy: Reading every book in a library to learn all the information and how language works.

Inference (Putting It to Work)
- What it is: The stage where the trained model is deployed to serve users, taking a prompt and generating a response.
- Goal: To generate human-like text by predicting likely sequences of tokens (words/sub-words) based on the prompt and learned patterns.
- Process: Faster and uses less power than training (though still significant); responses arrive in milliseconds.
- Analogy: Asking the librarian (the model) a question and getting an instant, crafted answer.
Epoch

An epoch in Artificial Intelligence (specifically in machine learning and deep learning) is defined as one complete pass of the entire training dataset through the neural network: a complete cycle in which the model has been exposed to every example in the training dataset exactly once. An epoch almost always involves multiple iterations (batches), each consisting of both a forward pass and backpropagation.

Key components of an epoch:
- Forward pass: The input data is passed through the neural network to generate predictions.
- Loss calculation: Predictions are compared to actual targets to calculate the error. A loss function is part of a cost function, which is a type of objective function.
- Backpropagation & weight update: The model propagates this error back through the network to adjust parameters (weights and biases) and improve accuracy.
- Batching: Because datasets are usually too large to process at once, data is divided into smaller "mini-batches".
- Iteration: One iteration is the completion of a forward and backward pass for a single batch.

Formula: Iterations per epoch = Total training examples / Batch size

Example: If you have 10,000 images (dataset) and a batch size of 100:
- 1 iteration: The model takes 100 images, performs a forward pass, calculates loss, and backpropagates to update weights.
- 1 epoch: The model repeats this 10,000 / 100 = 100 times, meaning it has done 100 iterations (100 forward/backward passes) to cover the entire dataset once.

After one epoch, the entire dataset has been used to train the model. Multiple epochs are typically required for the model to learn effectively and improve performance.
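The iterations-per-epoch arithmetic from the example above can be sketched in a couple of lines of Python (the dataset size and batch size are the hypothetical numbers from the example):

```python
# Hypothetical numbers from the worked example above.
dataset_size = 10_000   # total training examples
batch_size = 100        # examples per mini-batch

# One epoch = enough iterations to see every example exactly once.
iterations_per_epoch = dataset_size // batch_size
print(iterations_per_epoch)  # 100 forward/backward passes per epoch
```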
Slope

Slope is the rate of change of a line (rise over run): m = (y2 - y1) / (x2 - x1)
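Rise over run is a one-liner in Python; the two points below are made up for illustration:

```python
def slope(x1, y1, x2, y2):
    """Rise over run: m = (y2 - y1) / (x2 - x1)."""
    return (y2 - y1) / (x2 - x1)

# The line through (0, 1) and (2, 5) rises 4 over a run of 2.
print(slope(0, 1, 2, 5))  # 2.0
```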
Gradient

A gradient represents the rate and direction of the steepest change of a multi-dimensional function. It is a vector of partial derivatives pointing toward the maximum increase, like the slope on a hill, and is used to find optimal paths in machine learning (gradient descent).

m = Δy / Δx, or m = dy/dx. If y is described as y = f(x), then in Lagrange notation m = f′(x1).
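The "vector of partial derivatives" idea can be sketched numerically with central differences; the function f below and its input point are made up for illustration:

```python
def numerical_gradient(f, point, h=1e-6):
    """Approximate the gradient of f at `point` as a vector of partial
    derivatives, nudging one coordinate at a time (central differences)."""
    grad = []
    for i in range(len(point)):
        plus = list(point); plus[i] += h
        minus = list(point); minus[i] -= h
        grad.append((f(plus) - f(minus)) / (2 * h))
    return grad

# f(x, y) = x^2 + 3y has gradient (2x, 3); at (2, 1) that is (4, 3).
f = lambda p: p[0] ** 2 + 3 * p[1]
print(numerical_gradient(f, [2.0, 1.0]))
```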
Algebra

Brush up on simplifying. See: Algebra for Derivatives
Questions:
- Why would input-to-hidden and hidden-to-output connections use multiplication while hidden-to-hidden connections use addition?
- Do hidden-to-hidden layers always use addition in neural networks?
- In backpropagation, when using the chain rule, does the derivative that is the output of the inner function become the input for the outer function?
Derivative

The limit definition of the derivative defines the derivative, or instantaneous rate of change, of a function f(x) as the limit of the difference quotient as the change in x (often denoted h) approaches zero:

f′(x) = lim(h→0) [f(x + h) - f(x)] / h

A good tutorial: The Limit Definition of a Derivative
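The limit definition translates directly into code: compute the difference quotient with a small h instead of taking the actual limit. The test function x^2 is an illustrative choice:

```python
def derivative(f, x, h=1e-6):
    """Difference quotient (f(x+h) - f(x)) / h with a small h,
    approximating the limit as h -> 0."""
    return (f(x + h) - f(x)) / h

# d/dx x^2 = 2x, so the derivative at x = 3 should be close to 6.
print(derivative(lambda x: x ** 2, 3.0))
```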
Chain Rule

The chain rule states that knowing the instantaneous rate of change of z relative to y, and that of y relative to x, allows one to calculate the instantaneous rate of change of z relative to x as the product of the two rates of change. As put by George F. Simmons: "If a car travels twice as fast as a bicycle and the bicycle is four times as fast as a walking man, then the car travels 2 × 4 = 8 times as fast as the man."

A function is composite if you can write it as f(g(x)). In other words, it is a function within a function, or a function of a function.

d/dx [f(g(x))] = f′(g(x)) · g′(x)

In backpropagation, when using the chain rule, the output of the inner function becomes the point at which the outer derivative is evaluated: the f′ in f′(g(x)) is evaluated at g(x).
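Simmons' car/bicycle/walker example is literally a multiplication of two rates, which is all the chain rule says (the variable names below are just labels for the quote):

```python
# dz/dx = dz/dy * dy/dx
dz_dy = 2  # the car is twice as fast as the bicycle
dy_dx = 4  # the bicycle is four times as fast as the walker
dz_dx = dz_dy * dy_dx
print(dz_dx)  # 8: the car is 8 times as fast as the walker
```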
Cost as inefficiency. In business, a high "cost" means wasted resources or poor performance. Similarly, in a Machine Learning (ML) Artificial Neural Network (ANN), a high "cost" indicates an inefficient model that needs correction. A neuron processes inputs, producing an output (prediction). The cost function compares this prediction to the actual correct answer, calculating the error: error = output - desired output. In neural networks, the "curving lines" (or more accurately, multi-dimensional surfaces) represent the cost function (or loss landscape), which maps the network's weights to the total error of its predictions. Gradient descent is an optimization algorithm that traverses this curve, iteratively moving down the steepest slope to find the lowest point (a local or global minimum), where the model's errors are minimized.

Here is how gradient descent works in relation to these cost curves:

1. Representing the Cost Curve
- The curve (loss function): The curve represents the relationship between the network's parameters (weights and biases) and the cost (error). The goal is to reach the bottom of this "valley".
- Initialization: Gradient descent starts at a random point on this curve, which corresponds to the initial, untrained weights of the neural network.
- Steepness: When the curve is steep, the gradient is large, indicating a high error rate and a need for significant changes to the weights. When the curve is gentle, the gradient is small, indicating that the model is approaching a minimum.

2. The Mechanics of Gradient Descent
Gradient descent reduces the cost by iteratively updating the weights in the direction that reduces the error:
- Calculating the gradient: At the current point on the curve, the algorithm calculates the gradient (the slope) of the cost function.
- Moving downhill (opposite direction): Because the gradient points uphill (steepest increase), the algorithm moves in the opposite direction (negative gradient) to move downhill.
- Updating weights: The weights are updated by taking a small step in that downhill direction: new weight = old weight - learning rate * gradient
- The learning rate (α): This hyperparameter defines the size of the steps taken. A high learning rate takes large steps but risks overshooting the minimum, while a low learning rate takes small steps, ensuring stability but causing slower convergence.
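The update rule above can be sketched on a toy one-dimensional cost curve, cost(w) = (w - 3)^2, whose gradient is 2(w - 3); the function, starting weight, and learning rate are all illustrative choices, not from any particular library:

```python
# Minimal gradient descent on cost(w) = (w - 3)^2, minimum at w = 3.
learning_rate = 0.1
w = 0.0  # initial, "untrained" weight

for _ in range(100):
    gradient = 2 * (w - 3)               # slope of the cost curve at w
    w = w - learning_rate * gradient     # new weight = old weight - learning rate * gradient

print(round(w, 4))  # 3.0: the steps have walked down into the valley
```

Raising the learning rate above 1.0 here makes the steps overshoot and diverge, which is the risk the learning-rate bullet describes.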
Forward Pass and Back Propagation

A simple tutorial on the forward pass and back propagation: Simple Neural Network in 3 Minutes
The pre-activation function in a neural network is the weighted sum of all inputs plus a bias term, before the result is passed to the activation function. The mathematical expression, read as "z equals w one times x one, plus w two times x two, continuing up to w n times x n, plus b", is:

z = (w1*x1) + (w2*x2) + ⋯ + (wn*xn) + b

This can also be written using summation notation, read as "z equals the weighted sum of the inputs plus the bias term":

z = ∑ from i=1 to n of (wi*xi) + b
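The summation translates directly into Python; the weights, inputs, and bias below are made-up illustrative values:

```python
def pre_activation(weights, inputs, bias):
    """z = sum(w_i * x_i) + b: weighted sum of inputs plus bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

# 0.5*1.0 + (-1.0)*2.0 + 2.0*3.0 + 0.5 = 5.0
print(pre_activation([0.5, -1.0, 2.0], [1.0, 2.0, 3.0], 0.5))  # 5.0
```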
Activation functions apply a non-linear transformation to the output of a neuron, allowing the network to learn complex patterns and relationships.
Sigmoid

This function takes any real value as input and outputs values in the range 0 to 1. The larger the input value, the closer the output will be to 1.0; the lower the input value, the closer the output will be to 0.0.

Mathematical Definition: f(x) = 1 / (1 + e^-x)
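A direct transcription of the definition, using only the standard library:

```python
import math

def sigmoid(x):
    """f(x) = 1 / (1 + e^-x): squashes any real input into (0, 1)."""
    return 1 / (1 + math.exp(-x))

print(sigmoid(0))    # 0.5 exactly, the midpoint
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```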
TanH

The tanh function is very similar to the sigmoid activation function but has an output range of -1 to 1. The larger the input value, the closer the output will be to 1.0; the lower the input value, the closer the output will be to -1.0.

Mathematical Definition: f(x) = (e^x - e^-x) / (e^x + e^-x)
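The same definition in code (the standard library also ships `math.tanh`, which this should match):

```python
import math

def tanh(x):
    """f(x) = (e^x - e^-x) / (e^x + e^-x): like sigmoid, but over (-1, 1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

print(tanh(0))   # 0.0, the midpoint
print(tanh(5))   # close to 1
print(tanh(-5))  # close to -1
```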
ReLU

The formula for ReLU is straightforward: if the input x is greater than zero, the output is x; if x is less than or equal to zero, the output is 0.
- Positive input: relu(5) = 5
- Negative input: relu(-2) = 0
- Zero input: relu(0) = 0

Mathematical Definition: f(x) = max(0, x)
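The definition is a single `max` call, reproducing the three cases above:

```python
def relu(x):
    """f(x) = max(0, x): pass positives through, clamp the rest to 0."""
    return max(0, x)

print(relu(5), relu(-2), relu(0))  # 5 0 0
```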
Softmax

The softmax function is used in machine learning, especially in neural networks for multi-class classification. It converts a vector of raw scores (logits) into a probability distribution, ensuring all outputs are between 0 and 1 and sum to 1, representing the likelihood of each class. It works by exponentiating each input value, then dividing each result by the sum of all exponentiated values, so larger inputs become higher probabilities.
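A minimal sketch of exponentiate-then-normalize (the max is subtracted first, a common numerical-stability trick that does not change the result; the logits are made-up values):

```python
import math

def softmax(logits):
    """Exponentiate each score, then divide by the sum of all exponentials."""
    m = max(logits)                            # stability shift, cancels out
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # larger logits get larger probabilities
print(sum(probs))  # sums to 1 (up to floating-point error)
```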
Links
- Deep Learning
- LLM from scratch
- Deep Dive into LLMs like ChatGPT
- Let's reproduce GPT-2 (124M)
- LLM Visualization
- Tiktokenizer
- Attention is all you need
Learnable Parameter Matrix

The symbol W_i^Q represents the learnable parameter matrix for the query (Q) transformation specific to the i-th attention head.
- W stands for Weight: a matrix of parameters that the model learns during the training process via backpropagation.
- Q stands for Query: this specific weight matrix is used for the Query transformation, as opposed to W_i^K for Key or W_i^V for Value. A query represents the information a particular token is looking for from other tokens in the sequence.
- i is an index: it specifies which of the multiple attention heads the matrix belongs to. In multi-head attention, each head has its own unique set of W^Q, W^K, and W^V matrices.
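A toy sketch of the query projection for one head: each token embedding is multiplied by W_i^Q to produce that token's query vector. The 2×2 shapes and all numbers are purely illustrative (real models use large learned matrices):

```python
def matmul(A, B):
    """Plain nested-list matrix multiply: row of A dotted with column of B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

X = [[1.0, 0.0],
     [0.0, 1.0]]      # two token embeddings (identity rows, for clarity)
W_Q = [[0.5, 0.2],
       [0.1, 0.9]]    # hypothetical learned query weights for head i

Q = matmul(X, W_Q)    # one query vector per token
print(Q)
```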
Matrices

First thing to do is brush up on matrices.
- Matrix Math
- Dot Product (20 min in)
- Matrices vs Tensors
Dot Product

In AI, the dot product, often seen as matrix multiplication (A · B or AB), calculates the alignment or similarity between vectors (data points) by multiplying corresponding elements and summing the results, forming a single number or new vector. It is fundamental to neural networks (weighted sums, activations), Transformers (self-attention), and recommendation systems (cosine similarity), enabling the fast, parallel computation essential for deep learning's efficiency, as shown by built-in functions in libraries like NumPy.
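"Multiply corresponding elements and sum" is one line of Python; the vectors are illustrative:

```python
def dot(a, b):
    """Multiply corresponding elements and sum them; this is the core of
    weighted sums, attention scores, and cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

print(dot([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```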
Subscripts

Subscripts are just a way to identify multiple variables defined by a certain value, like r. They behave the same way as other variables. Example: ra = r1 + r2
Membership Sign ∈ The "is an element of" or "membership" sign. The expression x∈R means that x is a member (or element) of the set R.