The LLM Inference Pipeline

Understanding Tokenization, Embeddings, and Retrieval-Augmented Generation (RAG)

GPT

Before we look at how an LLM works internally, let's have a quick overview of GPT. GPT stands for Generative Pre-Trained Transformer. Let's break down each part of that name.

Generative

Generative means able to generate something. Traditional applications like search engines (e.g., Google, Yahoo) do not generate content when you make a search. They have crawled the web and stored an index of it; they find the relevant results and return them to you, which is why they are called search engines. An LLM, on the other hand, generates its predictive output (the next tokens in the sequence) on the spot, based on the user's input (prompt). That is why an LLM is generative in nature.

Pre-Trained

Although an LLM is generative in nature, that does not mean it has anything built in from birth. It requires pre-training before it can generate output, and this pre-training is based on data such as internet text, historical records, or any other content. Based on this pre-trained data, it can generate output.

Transformer

The whole LLM process is built on the Transformer architecture, which was introduced in a paper published by Google researchers in 2017. All modern AI applications and LLM models are built on the Transformer architecture, so we can call it the heart of an LLM. Putting the three terms together: GPT is a transformer that is generative in nature and works on pre-trained data. A transformer takes some input, such as text, voice, images, or video, and produces output based on its neural network.

Let's dive into the internal process of an LLM responding to a user's input, which can be broken down into three main phases:

Step 1: Input Processing aka Encoding Phase

This is the initial stage where the user's plain-text input is transformed into a numerical format that the LLM can understand.

Input

The user sends a text query (e.g., "What is the capital of France?").

Tokenization

Tokenization is the crucial step in which a large language model (LLM) breaks down raw text into smaller, manageable units called tokens. These tokens can be words, parts of words, or even single characters, depending on the tokenization method. Once the text is tokenized, each token is assigned a unique numerical ID from the model's vocabulary dictionary, which is what the LLM understands and processes. Each model has its own tokenization vocabulary dictionary that maps each token to a numerical value. You can examine this using https://tiktokenizer.vercel.app/

The LLM's tokenizer breaks the text down into these smaller units. For example, "What is the capital of France?" might be tokenized into ['What', ' is', ' the', ' capital', ' of', ' France', '?'].
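To make this concrete, here is a minimal sketch using OpenAI's tiktoken library. The cl100k_base vocabulary is an assumption; every model family ships its own vocabulary, so the exact splits and IDs will differ from model to model.

    # pip install tiktoken
    import tiktoken

    # Load a tokenizer vocabulary (cl100k_base is an assumption; each model has its own).
    enc = tiktoken.get_encoding("cl100k_base")

    text = "What is the capital of France?"

    # Encode: text -> numerical token IDs from the vocabulary dictionary
    token_ids = enc.encode(text)
    print(token_ids)    # a list of integers, one per token

    # Look up the text behind each ID to see how the sentence was split
    tokens = [enc.decode([tid]) for tid in token_ids]
    print(tokens)       # e.g. ['What', ' is', ' the', ' capital', ' of', ' France', '?']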

Vector Embedding

As we know, a computer can't perform meaningful mathematical operations on token IDs by themselves, as they're just arbitrary numbers. Once the raw input (prompt) is tokenized, the model converts each token into a vector embedding in a high-dimensional vector space. Each token is now represented by a long list of numbers (a vector) that captures its semantic meaning and its relationships with other tokens. This also allows the model to understand the semantic meaning and intent of your query rather than just recognizing individual words. Unlike older methods, an LLM's vector embedding for a word like "apple" can change based on the sentence it's in. In "I ate an apple," the vector would be closer to "fruit," while in "I have an Apple computer," it would be closer to "technology." This contextual awareness is a cornerstone of modern LLMs.

A vector embedding is a numerical representation of an object, such as a word, sentence, or even an image, in a high-dimensional space. In the context of large language models (LLMs), these vectors are crucial because they transform human language into a format that computers can understand and process.

Think of an embedding as a point in a multidimensional map. This map is designed so that objects with similar meanings are located closer together. For example, the vector for "king" would be mathematically closer to the vector for "queen" than to the vector for "banana." This spatial relationship allows the LLM to understand semantic similarity and context, which is key to its ability to generate coherent and meaningful responses.

Instead of a simple one-to-one mapping (like a dictionary), an embedding captures a word's rich meaning, including its relationships to other words. The process of creating these embeddings is a core part of an LLM's training, where it learns to assign these numerical values based on the massive amounts of text it processes.
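As a rough illustration, the sketch below uses numpy and hypothetical 4-dimensional toy vectors (real embeddings have hundreds or thousands of dimensions, and the numbers here are invented). It shows how cosine similarity places "king" much closer to "queen" than to "banana."

    import numpy as np

    # Hypothetical toy embeddings; the values are invented purely for illustration.
    embeddings = {
        "king":   np.array([0.80, 0.65, 0.10, 0.05]),
        "queen":  np.array([0.78, 0.70, 0.12, 0.07]),
        "banana": np.array([0.05, 0.10, 0.90, 0.80]),
    }

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """How closely two vectors point in the same direction (1.0 = identical)."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine_similarity(embeddings["king"], embeddings["queen"]))   # ~0.998, very similar
    print(cosine_similarity(embeddings["king"], embeddings["banana"]))  # ~0.19, not similar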

Positional Encoding

Positional encoding is a technique used in transformer models to give the model information about the position of each token in a sequence. Unlike traditional recurrent neural networks (RNNs) that process data sequentially, transformers process all tokens in a sequence simultaneously. This parallel processing makes them very efficient but also means they lose the crucial information about word order. Positional encoding solves this problem by adding a unique numerical vector to each token's embedding. This vector contains information about its position, which allows the model to understand syntax and the relative position of words.

Positional encoding assigns a unique vector to each time step in a sequence. The most common method uses sine and cosine functions to generate these vectors. This is preferred over a simple integer count (e.g., 1, 2, 3...) because it has two key advantages:

  1. Unique Representation: Every time step gets a unique encoding.
  2. Relative Positioning: A model can easily learn the distance between any two positions. For example, the distance between positions 2 and 5 is the same as the distance between positions 7 and 10.

By adding these vectors to the token embeddings, the transformer model can distinguish between sentences like "The dog bit the man" and "The man bit the dog," which have the same words but completely different meanings due to word order.

The resulting vector now contains both the semantic meaning of the token and its positional information. This is the final numerical representation that is fed into the Transformer's attention and feed-forward layers.
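Here is a minimal numpy sketch of the sine/cosine scheme described above, based on the formulas from the original Transformer paper. The sequence length of 7 and the 8-dimensional embeddings are assumptions chosen only to keep the output small.

    import numpy as np

    def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
        """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
        positions = np.arange(seq_len)[:, np.newaxis]         # (seq_len, 1)
        dims = np.arange(0, d_model, 2)[np.newaxis, :]        # (1, d_model/2)
        angles = positions / np.power(10000, dims / d_model)  # (seq_len, d_model/2)

        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
        pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
        return pe

    # Assumed toy sizes: 7 tokens, 8-dimensional embeddings.
    seq_len, d_model = 7, 8
    token_embeddings = np.random.rand(seq_len, d_model)       # stand-in for real embeddings

    # What actually enters the attention layers: semantic meaning + position
    model_input = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
    print(model_input.shape)   # (7, 8)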

Step 2: Inference

During this phase, the model processes the encoded input to generate a response. This is the core of the LLM's work.

  1. Transformer Blocks: The combined embeddings (with positional information) are fed into the transformer's stack of blocks (an encoder-decoder stack in the original Transformer; GPT-style LLMs use a decoder-only stack).
  2. Attention Mechanism: Inside these blocks, the attention mechanism identifies the relationships between different tokens in the input. For example, it helps the model understand that "France" is related to "capital" in the context of the user's question. This allows the model to "pay attention" to the most relevant parts of the input.
  3. Prediction Loop: The model starts with a token-by-token prediction loop. Based on the processed input, it predicts the first word of the response. This new token is then added to the sequence, and the model predicts the next token. This process repeats until a complete response is formed.
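The prediction loop in step 3 can be sketched in a few lines using the Hugging Face transformers library. GPT-2 is used here only as a small, freely available stand-in model (an assumption); production LLMs follow the same loop at a much larger scale and usually sample from the probabilities rather than always taking the single most likely token.

    # pip install transformers torch
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")   # assumed stand-in model
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    prompt = "The capital of France is"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    # Token-by-token prediction loop: append the most likely next token at each step.
    with torch.no_grad():
        for _ in range(10):                                   # generate at most 10 new tokens
            logits = model(input_ids).logits                  # a score for every vocabulary token
            next_id = torch.argmax(logits[:, -1, :], dim=-1)  # greedy: take the top-scoring token
            input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)
            if next_id.item() == tokenizer.eos_token_id:      # stop at end-of-sequence
                break

    print(tokenizer.decode(input_ids[0]))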

RAG (Retrieval-Augmented Generation) takes place during the inference phase, specifically as a preliminary step to the final output generation. It's an augmentation process that happens after the user provides their input but before the LLM begins to generate a response.

RAG - Retrieval-Augmented Generation

An LLM responds to a query with confidence based on its training, but sometimes the answer can be wrong because the model lacks a source, and sometimes the information is simply outdated. This is a major challenge that LLM models face.

To overcome this, the RAG framework steps in. Instead of relying only on what the LLM knows, RAG adds a content source (this could be open, like the internet, or closed, like a collection of documents) and instructs the LLM to look at that content source before responding. The LLM talks to the content store to retrieve the information that is relevant to the user's query. In this way, RAG augments the model with new information, so the model can answer the same questions accurately. RAG also forces the model to pay attention to the primary source data before giving a response, and in fact the model is now able to provide evidence for its answers.

RAG (Retrieval-Augmented Generation) is a framework that sits on top of the core LLM process. It's an architecture or technique that augments the LLM's capabilities.

Think of an LLM as a very smart student who has only read certain textbooks up to a specific date. Processes like tokenization, embedding, and positional encoding are how the student reads and understands those textbooks.

RAG, in this analogy, is like giving the student a personal librarian. When a user asks a question, the librarian first goes out and finds the most relevant, up-to-date information from a separate, external library (a database or knowledge base). The librarian then provides this information to the student along with the original question. The student can now use this fresh, new context to formulate a much more accurate and comprehensive answer than they could have with just their existing knowledge.

Why RAG is a Separate Process

RAG is a distinct phase because it involves steps that happen outside the LLM's core model. The process typically looks like this:

  1. Retrieval: The user's query is used to search a separate, external knowledge base to find relevant information. This knowledge base is typically a vector database containing chunks of text from documents, articles, or other proprietary data. This search often uses its own embedding model to find documents semantically similar to the user's query.
  2. Augmentation: The retrieved information is not sent to the user. Instead, it's used to augment the original user prompt. This means the system creates a new, enriched prompt that includes both the user's question and the retrieved context.
  3. Generation: The augmented prompt is then sent to the LLM. The LLM processes this prompt, using its internal mechanisms (tokenization, embedding, etc.), to generate a final response, as sketched after this list. This ensures the output is grounded in factual, relevant information rather than just the model's pre-trained knowledge.
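The sketch below walks through these three steps end to end. The tiny in-memory knowledge base, the word-overlap retrieve() function, and the call_llm() placeholder are all hypothetical stand-ins; a real system would use a vector database with embedding-based similarity search and an actual LLM API.

    # Hypothetical in-memory "knowledge base"; a real system would use a vector
    # database and an embedding model for semantic search instead of word overlap.
    knowledge_base = [
        "Paris is the capital and largest city of France.",
        "The Eiffel Tower was completed in 1889.",
        "Mount Everest is the highest mountain on Earth.",
    ]

    def retrieve(question: str) -> str:
        """Toy retrieval: score each chunk by how many words it shares with the question."""
        q_words = set(question.lower().split())
        return max(knowledge_base, key=lambda doc: len(q_words & set(doc.lower().split())))

    def call_llm(prompt: str) -> str:
        """Placeholder for the actual LLM call (e.g. an API request)."""
        return f"<LLM response grounded in: {prompt!r}>"

    def rag_answer(question: str) -> str:
        # 1. Retrieval: find the most relevant chunk in the external knowledge base.
        context = retrieve(question)

        # 2. Augmentation: enrich the original prompt with the retrieved context.
        augmented_prompt = (
            "Answer using only the context below.\n"
            f"Context: {context}\n"
            f"Question: {question}"
        )

        # 3. Generation: the LLM produces the final, grounded response.
        return call_llm(augmented_prompt)

    print(rag_answer("What is the capital of France?"))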

RAG is an active, dynamic process that occurs in real-time for each user query. It's the key to making LLM’s responses more accurate, current, and verifiable without the need for expensive and time-consuming retraining.

Step 3: Output Generation

This is the final stage where the LLM's numerical output is converted back into human-readable text.

  1. Predicted Tokens: The prediction loop generates a sequence of new tokens.
  2. Detokenization: The tokenizer performs the reverse process (detokenization), converting the sequence of tokens back into a single, coherent string of text.
  3. Final Output: The user receives the final, formatted text response (e.g., "The capital of France is Paris.").
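A minimal sketch of this reverse mapping with tiktoken (the same assumed cl100k_base vocabulary as in the tokenization example; the IDs are obtained here by encoding the answer, since the exact values depend on the model's vocabulary):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")   # assumed vocabulary; varies by model

    # Suppose the prediction loop produced this sequence of token IDs.
    predicted_ids = enc.encode("The capital of France is Paris.")

    # Detokenization: map the IDs back into one human-readable string.
    print(enc.decode(predicted_ids))   # "The capital of France is Paris."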