CityZenUSA
Love Your Gov

Deep Learning

1.0

Training and Inference Training and inference are the two core phases of a Large Language Model (LLM): training builds the model by teaching it language from vast amounts of data (like building a library), while inference uses the trained model in real time to understand prompts and generate responses (like a librarian answering a question). Training creates the model's foundational knowledge and patterns, whereas inference applies that knowledge to new, unseen inputs, predicting the most likely next word to form coherent answers.

Training (Building the Brain)

What it is: The process of feeding the model massive datasets of text so it can learn the patterns of language.
Goal: To adjust billions of internal parameters (weights) so the model becomes good at predicting the next word in a sequence.
Process: Computationally intensive, time-consuming, and requires huge resources (like months of GPU time).
Analogy: Reading every book in a library to learn all the information and how language works.

Inference (Putting it to Work)

What it is: The stage where the trained model is deployed to serve users, taking a prompt and generating a response.
Goal: To generate human-like text by predicting likely sequences of tokens (words/sub-words) based on the prompt and learned patterns.
Process: Faster, uses less power than training (though still significant), and happens in milliseconds.
Analogy: Asking the librarian (the model) a question and getting an instant, crafted answer.
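The inference loop can be sketched as repeated next-token prediction. The `toy_model` below is a made-up stand-in for a trained LLM (a real model outputs a probability distribution over a large vocabulary), but the generate-one-token-at-a-time loop has the same shape:

```python
# Sketch of the inference loop: repeatedly predict the next token.
# `toy_model` is a hypothetical stand-in for a trained LLM.

def toy_model(tokens):
    # Pretend "model": always continues a fixed pattern (illustration only).
    pattern = ["the", "cat", "sat", "on", "the", "mat", "<eos>"]
    return pattern[len(tokens) % len(pattern)]

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = toy_model(tokens)  # inference: one forward pass per token
        if next_token == "<eos>":       # stop when the model signals "done"
            break
        tokens.append(next_token)
    return tokens

print(generate([]))
```

A real LLM runs this same loop, just with a neural network in place of `toy_model` and a sampling step instead of a fixed lookup.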

1.1

Matrices The first thing to do is brush up on matrices.

Matrix Math
Dot Product (20 minutes in)
Matrices vs. Tensors
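For reference while brushing up, the dot product and matrix multiplication can be written in a few lines of plain Python:

```python
# The dot product and matrix multiplication, the two operations worth
# knowing cold before tackling attention math.

def dot(u, v):
    # Sum of elementwise products: u . v
    assert len(u) == len(v)
    return sum(a * b for a, b in zip(u, v))

def matmul(A, B):
    # (m x n) @ (n x p) -> (m x p); each entry is a row-column dot product.
    cols_B = list(zip(*B))  # columns of B
    return [[dot(row, col) for col in cols_B] for row in A]

print(dot([1, 2, 3], [4, 5, 6]))                  # 1*4 + 2*5 + 3*6 = 32
print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

In practice a library like NumPy does this, but writing it once by hand makes the row-times-column pattern stick.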

1.2

DL Basics An epoch is one full pass through the entire training dataset. Brush up on deep learning, then build an LLM from scratch for a deep dive into the inner workings.

Deep Learning
LLM from scratch
Deep Dive into LLMs like ChatGPT
Let's reproduce GPT-2 (124M)
LLM Visualization
Tiktokenizer
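A minimal sketch of what an epoch means inside a training loop, fitting a toy one-parameter model y = w * x with gradient descent (the dataset and learning rate are illustrative, not taken from any of the resources above):

```python
# One epoch = one full pass over the training data.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy dataset: y = 2x
w = 0.0     # the single learnable parameter (a real LLM has billions)
lr = 0.05   # learning rate

for epoch in range(20):            # 20 epochs = 20 full passes over the data
    for x, y in data:              # one epoch: visit every example once
        pred = w * x
        grad = 2 * (pred - y) * x  # derivative of squared error w.r.t. w
        w -= lr * grad             # gradient descent step

print(round(w, 3))  # converges close to 2.0
```

Training an LLM is this same loop at enormous scale: many parameters, many examples, many epochs (or fractions of one, for web-scale data).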

1.3

Transformer The Transformer model architecture.

Attention is all you need
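The paper's core operation, scaled dot-product attention (softmax(QK^T / sqrt(d_k)) V), can be sketched in plain Python; the tiny Q, K, V matrices below are made-up values for illustration:

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Output is a weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]  # two tokens, d_k = 2 (illustrative values)
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
```

Each output row is a blend of the value vectors, weighted by how well that token's query matches each key.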

1.4

ReLU The formula for ReLU is straightforward:
if the input (x) is greater than zero, the output is (x);
if (x) is less than or equal to zero, the output is 0.

Positive Input: relu(5) = 5
Negative Input: relu(-2) = 0
Zero Input: relu(0) = 0

Mathematical Definition: f(x)=max(0,x)
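That definition is one line of Python:

```python
# ReLU exactly as defined above: f(x) = max(0, x).

def relu(x):
    return max(0, x)

print(relu(5))    # positive input passes through: 5
print(relu(-2))   # negative input is clipped: 0
print(relu(0))    # zero stays zero: 0
```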

1.5

Subscripts Subscripts are just a way to label multiple variables that share a base name, like r. They behave the same way as any other variables.

Subscripts

Example: r_a = r_1 + r_2
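In code, subscripted variables usually become indexed collections; the values below are made up for illustration:

```python
# r_1 and r_2 become entries of a collection; the subscript is just a label.
r = {1: 10.0, 2: 5.0}   # r_1 = 10, r_2 = 5
r_a = r[1] + r[2]       # r_a = r_1 + r_2
print(r_a)              # 15.0
```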

1.6

Learnable Parameter Matrix This symbol represents the learnable parameter matrix for the query (Q) transformation specific to the i-th attention head.

W_i^Q

W stands for Weight. It refers to a matrix of parameters that the model learns during the training process via backpropagation.

Q stands for Query. It indicates that this specific weight matrix is used for the Query transformation, as opposed to W_i^K for Key or W_i^V for Value. A query represents the information a particular token is looking for from other tokens in the sequence.

i is an index. It specifies which of the multiple attention heads the matrix belongs to. In multi-head attention, each head has its own unique set of W^Q, W^K, and W^V matrices.
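A sketch of the per-head query projection Q_i = X W_i^Q; the weight values below are made up for illustration, since in a real model every W matrix is learned via backpropagation:

```python
# Each attention head i gets its own (W_Q, W_K, W_V) triple.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

d_model, d_k, num_heads = 4, 2, 2
X = [[1.0, 0.0, 1.0, 0.0]]  # one token's embedding, shape (1, d_model)

# One (W_Q, W_K, W_V) triple per head; constant entries for illustration only.
heads = []
for i in range(num_heads):
    W_Q = [[0.1 * (i + 1)] * d_k for _ in range(d_model)]
    W_K = [[0.2 * (i + 1)] * d_k for _ in range(d_model)]
    W_V = [[0.3 * (i + 1)] * d_k for _ in range(d_model)]
    heads.append((W_Q, W_K, W_V))

for i, (W_Q, W_K, W_V) in enumerate(heads):
    Q = matmul(X, W_Q)  # Q_i = X @ W_i^Q, the query vector for head i
    print(f"head {i}: Q = {Q}")
```

Because each head has different weights, each head projects the same embedding X into a different query space and can attend to different relationships.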

1.7

Membership Sign ∈ The "is an element of" or "membership" sign. The expression x ∈ R means that x is a member (or element) of the set R; in deep-learning papers R usually denotes ℝ, the set of real numbers, so x ∈ ℝ simply says x is a real number.
