LLM / Text Vectors for Product Managers

April 18, 2025 By Scott

Intro

Understanding how these things work matters.

Not because you're going to build the next GPT yourself, but because understanding just enough of how LLMs and vector math work can change how you think about products, teams, and strategy. It can help inspire better solutions, make smarter tradeoffs when AI promises start sounding magical, and maybe even help you call BS when needed. Whatever strategic product decisions you may be making, your implementation team could be internal or perhaps a contract shop. In either case, there are operational impacts and costs that will likely affect your roadmap. If you also have P&L responsibility, you're going to need to look at the costs here with regard to your business case. And if you don't, chances are you may be the one who still has to justify the spend to others. As usual in product manager land, even if you're not the one executing the actual work, you likely need to understand enough about the pieces to know what they can do and what this might cost.

tl;dr

  • LLMs turn text into numbers using math called vector embeddings. We’re going to look at this below.
  • These vectors live in a high-dimensional space, where “distance” equals “semantic similarity.” Again, we’ll look at an example below.
  • Transformers (not the ones from the comics/movies) are the model architecture that makes GPT-style LLMs so powerful.
  • All this lets us build apps that “understand” language enough to generate answers, categorize, summarize, translate, and more.
  • But it’s still math, not magic. And it’s expensive from a lot of perspectives. The question is where do we want to take the expense hit(s) and for what level of benefit.

The Classic Question: How Deep does Product Need to Go?

Let's face it. Some of us are reasonably bright. At least. (Hopefully.) And some of us may be wicked smart. Now, our developer counterparts, (the good ones anyway), are usually really smart. Or at least, super bright, and amazingly patient and persistent. Next up, we have the truly gifted. Yes, these are value judgements on my part. But here I'm talking about the computer and data scientists that actually figure out some of the core algorithms. They don't just do the math. They invent the math. While we don't need to be able to build our own engines to operate a vehicle, we need to know enough to understand what kind of vehicle and capabilities are worth building for our markets.

High Level Metaphors for GPTs to Fine Tuning to Vector RAGs

LET’S JUST NOT WORRY about the tech words too much yet. We’ll sort that out. Just think of things like this… The Generative Pre-trained Transformer is – as the word implies – PRE-trained. Think of it just like yourself. Consider your own brain. You are “wetware.” Growing up you went to school. Let’s call that YOUR pre-training. You now have factual knowledge, maybe some degree of belief system, (well… ideally anyway), and some processing power for doing certain types of problems. Then you get a job somewhere. You start to learn that some of the time, what you learned in theory doesn’t work in practice. Your boss tells you, “Hey, that works great in the book, but out here you may need to tweak this a bit. Here, let me show you.”

Your former knowledge is embedded really deep though. (It's literally re-wired some of how your neural pathways are laid out.) So this new information, (which we'll call Fine Tuning), gets layered on top. (This fine tuning isn't just pre-training from existing examples, it might have involved supervised learning with specifically labeled examples for you.) Meanwhile, maybe your model tries hard to adjust some of those highest levels of neuron layers. And it does, to some degree. Maybe not that much, but at least a little. How much so depends on how deeply you study, and inculcate such knowledge into your world. And especially, if you use the knowledge. Many ideas and skills are at least somewhat perishable. If you learned to ride a bicycle really well, maybe you never forget. As to everything else? It varies. Anyway, you're solving those work problems a little better with a combination of your theoretical knowledge and some of this fine tuning. You're still getting some things wrong on occasion though. Try as you might, your deeply ingrained ideas bust through some of that fine-tuning. Not always. But sometimes. So now, you get a bit into some Retrieval Augmented Generation (RAG) to add some more specific knowledge, maybe from contextually appropriate documents.

For your next task, you're fortunately not just told what to do in generic terms. Rather, the question you're asked is itself loaded with more background information. More context. You're given a bunch of documents with details on your job site, measurements, work rules, and more. (Or whatever your specific problem space may entail.) These aren't there to fine tune your understandings or anything. They're fundamentally part of the question, part of the task. Maybe there are specific flows for how things work, and examples showing a whole "Chain of Thought" (CoT) for some solutions. We'll call this your "full prompt." So when you're prompted with both your original question/task, (which may be a small request), along with this additional augmented information, those additional bits and pieces are essentially treated as true. That is, any output should be offered within the context of this set of information.

At a very high level, this is how these tiers work.

Now, you’ve likely been hearing a lot about agents lately, reasoning models and so on. Really, all these do is add additional context, maybe use special tools to gather particular information, and “think” about their own answers, honing them somewhat, before offering a final conclusion. For this article, I did a quick draft, then went back several times, checking some facts along the way. Same thing. (Yes, professional AI folks are screaming at my rough descriptions right now, but for our purposes here, this is true enough.)

About How Original Search Works

Sorry, but we’re going to rewind a moment to build more context. While generative AI is arguably wholly different than search, in this case we’re talking primarily about information retrieval concepts related to text. It’s useful to consider early search techniques as we go deeper, even though Gen AI, (and embedding and vectors), will also be used for other object types. Please bear with me. I think it will help to understand what comes next. Early “search” was just keyword lookup. Databases scanned for exact matches or indexed fields. Great solutions for the time. But with significant limits. It wasn’t real smart by today’s standards. Not a lot of nuance. I remember using an older information research service in the late 1990s. I forget the details, but I’d made a typo in a search term and got 16 results. I’m not sure why I remember that number. I think it was because it struck me that if anyone in the future typed that word in seeking that info, they wouldn’t find it. The info was essentially gone forever. At least, until some linguistic similarity algorithms came along.

Fuzzy matching did show up soon enough. If you typed “Jon” instead of “John,” you might still find what you were looking for. This was thanks to tricks like Soundex, which encoded names phonetically. If “Jawn” and “John” sounded close enough, they got treated the same.

Still, the basic retrieval was index-based and used some relatively simple math to determine recall and precision. For an in-depth discussion of evaluation metrics, you can see my video presentation on Metrics for AI Product Managers. It was taught for an ML/AI class, but includes some earlier metrics. In short, recall measures how many of the actual positive cases your model correctly identified. Precision measures how many of the predicted positive cases were actually correct. So if things were misspelled, recall would obviously suffer. It might also be interesting to remember that, for the most part, keyword spam didn't exist in that early world. If a paper was added to a system, typically a professional indexer would carefully construct an abstract with as perfect a summary as possible. Taxonomists would perhaps set up specific labels and classifications. People with formal training in Library Science or Information Resources Management might be doing this. It wasn't until much later that simple indexing became a lousy tool, not just due to its inherent limitations, but because a cottage industry of junior marketers began spamming keywords however they could under the guise of Search Engine Optimization (SEO).
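
If a quick worked example helps, here's a tiny sketch in Python with made-up retrieval counts (the numbers are purely illustrative):

# Toy precision/recall example with made-up counts.
true_positives = 8   # relevant documents we actually retrieved
false_positives = 2  # retrieved documents that weren't relevant
false_negatives = 4  # relevant documents we missed (say, due to a typo in the index)

precision = true_positives / (true_positives + false_positives)  # 0.80
recall = true_positives / (true_positives + false_negatives)     # ~0.67

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")

A misspelled search term doesn't so much create false positives as it causes misses, which is why recall is the number that suffers.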

As we widen scope to think about folks like Marvin Minsky (who imagined computers might someday reason), and Steven Pinker (who made linguistics cool for mainstream audiences), you start to see a bridge forming between how humans use language and how machines might model it. (By the way, check out Steven Pinker’s works on Amazon. Great stuff.)

IDF / TF math

Enter TF-IDF: Term Frequency–Inverse Document Frequency.

It was one of the first scalable tricks to determine which words were “important” in a document. If “banana” shows up 15 times in Document A but only 3 times across a million other docs, that word probably says something specific about A.

TF = how often a word appears in a document
IDF = how rare that word is across all documents

Multiply those, and you get a signal. It’s simple, powerful, and it ‘kind of’ worked for a long time.

But it didn’t understand meaning. It just measured frequency and rarity.

I'm going to put in the math part now, but here's the thing… here's all it really does… "this word or phrase is more or less important based on how often it occurs in relation to both the document and other documents in the corpus." In other words, if a word shows up a lot in this document but rarely anywhere else, it's probably telling us something specific about this document. (Assuming honest writing and no keyword stuffing to game the system.)

Here’s how TF-IDF works behind the scenes:

Term Frequency (TF)

TF(t, d) = (number of times term t appears in document d) ÷ (total number of terms in document d)

This gives us the relative importance of the word inside that document.

Inverse Document Frequency (IDF)

IDF tells us how rare or common a term is across the entire collection of documents, also known as the corpus.

IDF(t) = log(N / df(t))

Where:

  • N = total number of documents in the corpus
  • df(t) = number of documents that contain term t
  • read as: the inverse document frequency of term t equals the log of N over df of t (that is, the total number of documents divided by the document frequency of t)

This increases the score of rare words (which appear in fewer documents), and lowers the score of common words (like “the” or “and”).

If a term appears in every document, IDF = log(1) = 0, which zeroes out its TF-IDF score. That’s by design.

So a rare word that shows up in only a few documents will get a higher IDF score. A common word (like "the", "and", "is") that appears in almost every document will get a low IDF score.
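
If you'd rather see this run than read it, here's a minimal sketch using scikit-learn's TfidfVectorizer on a tiny made-up corpus. (Note that scikit-learn uses a smoothed variant of the IDF formula above, but the intuition is the same.)

from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny made-up corpus. "banana" is rare across documents,
# while "report" shows up in every one of them.
docs = [
    "banana banana banana report",
    "quarterly report for the sales team",
    "annual report and sales figures",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# Show each term's TF-IDF weight within the first document.
terms = vectorizer.get_feature_names_out()
for term, weight in zip(terms, tfidf_matrix.toarray()[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")

"banana" ends up with a much higher weight than "report" in that first document, because it's frequent there but rare across the corpus.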

Here's the problem with this. It's a great basic measure, as long as everyone is playing nicely. Once upon a time, the people building information databases would take a document and it would, ideally, just be itself. That is, the information would be an honest attempt to faithfully represent the topic at hand, often in the form of scientific papers. As mentioned, a well-trained person would write a summary, called an Abstract, a concise summary meant to reflect the core content and intent of the document. If metadata was used, they would carefully select keywords. Also, they'd take great care in labeling any documents and placing them in appropriate places within a taxonomy.

Then came the World Wide Web. And publishers. Then everybody. And what did those pages have on them? Advertising. The moment usage became money and "eyeballs" became the prize, this carefully and lovingly crafted information science became an arms race of what we now call Search Engine Optimization (SEO), and how to get usage dollars. All of a sudden, the best documents might not rank well due to some very simple gaming. Of course, not everyone was being completely greedy; often SEO is applied honestly to just try to be in the mix. The major insight by Google's founders was trying another way. And we all know that story now. Rank ordering was attempted using various methods of link cardinality. (Link analysis leading to what they called PageRank became a proxy for authority and relevance.) In short, what's most popular within some kind of context. (Similar to citations of academic papers, co-citation analysis, and so on.) We don't need to go deeper here. Research this history on your own if you like, as it's fascinating if you're into information science. But for now, this brief history fills in a gap and now we can move on.

Oh, just one last thing… Besides being easy to game on an open web, TF-IDF can also break down when we have synonyms, when words have multiple meanings, or when word order and semantics matter. TF-IDF treats words and phrases independently. There's no true context or nuance.

OK. That’s history. Thank you for persevering. NOW we can move on.

Defining Text Vectors

Great! We’re finally here. And here’s where things change. Here’s where we go from Web2, plain web with search, to Web3, which was supposed to be the semantic web. Oh, damn. Web3 got hijacked for decentralized blockchains and crypto. Fine. We’ll just say here’s where we go from TradSearch and whatnot to GenAI then.

Modern information management models don’t just count words. They embed them into vectors. These are lists of numbers that represent meaning based on context. These aren’t hand-crafted. They’re learned from data. This is the main differentiation from what came before. We’re no longer simply matching based on word or word form lookups of strings. (Recall that a text string is just a sequence of characters, (letters, numbers, or symbols), treated as data in programming and computing.)

In a little while, we’re going to look at exactly how we can do this with some Google Colab code samples. But for now, let’s just consider a couple examples.

Here are a few examples used to explain vector 'closeness'.

If “Paris” and “France” are close in vector space, then “Paris – France + Italy = Rome.”
This shows how relationships between countries and their capitals are captured mathematically.

If “iOS” is to “Apple” as “Android” is to “Google,” the model might solve: “iOS – Apple + Google = Android.” It learns product-brand relationships just from patterns in the data.

If “CEO” is to “company” as “principal” is to “school,” then: “CEO – company + school = principal.”
Vector math encodes structural roles and domains, even across different industries.

These vectors can live in hundreds or thousands of dimensions. We can remember graphing in two or maybe three dimensions as far back as grade school to high school. But beyond three dimensions, the graphic representations are too big for human brains to visualize, but perfect for machines to measure with math tools like cosine similarity and dot products. (Which is beyond both the scope of this article and my ability to fully explain. I’ll make a mild attempt at doing so anyway with some examples that use the math, but not pretend to try to explain more deeply why it works.)
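
To make that concrete, here's a minimal sketch of the analogy arithmetic using tiny, made-up 3-dimensional vectors. Real embeddings have hundreds of dimensions and are learned from data; these numbers are invented purely so the example works out.

import numpy as np

# Made-up 3-D "embeddings," chosen so the analogy works out.
# A real model learns these values from data.
vectors = {
    "Paris":  np.array([0.9, 0.8, 0.1]),
    "France": np.array([0.1, 0.8, 0.1]),
    "Italy":  np.array([0.1, 0.7, 0.9]),
    "Rome":   np.array([0.9, 0.7, 0.9]),
    "banana": np.array([0.0, 0.1, 0.2]),
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# "Paris - France + Italy" should land closest to "Rome."
query = vectors["Paris"] - vectors["France"] + vectors["Italy"]

for word, vec in vectors.items():
    print(f"{word:>7}: {cosine(query, vec):.3f}")
# "Rome" scores highest (a perfect 1.0 with these toy numbers).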

How Words Get Turned Into Vectors → Indexing and Search

We’ll get to doing the vectors in a moment, but suffice it to say for now, once you’ve transformed all your documents (or product descriptions, customer service interactions, chat logs, etc.) into vectors, the next challenge is retrieval. How do you quickly find the most relevant vectors to a given input?

What’s a Vector Database?

A vector database is a system designed to store and search these high-dimensional embeddings efficiently. If you tried to brute-force this search across millions of documents, it would be too slow.

Enter ANN (Approximate Nearest Neighbor) search. These algorithms use clever shortcuts, such as hashing (video explanation) or tree structures (video explanation), to retrieve the top-N most relevant vectors quickly.

Popular vector DBs: Pinecone, Weaviate, FAISS (by Facebook), Milvus.
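
For a sense of what using one of these libraries looks like, here's a minimal FAISS sketch with random vectors standing in for real document embeddings. (This uses FAISS's exact IndexFlatL2 for simplicity; the approximate index types, like IVF or HNSW, are what you'd reach for at serious scale.)

import numpy as np
import faiss  # Facebook AI Similarity Search

dimension = 384        # matching a small embedding model's output size
num_vectors = 10_000   # pretend we indexed 10,000 document chunks

# Random stand-ins for document embeddings; real ones come from an embedding model.
rng = np.random.default_rng(0)
doc_vectors = rng.random((num_vectors, dimension), dtype=np.float32)

index = faiss.IndexFlatL2(dimension)  # exact (brute-force) nearest neighbor search
index.add(doc_vectors)

query_vector = rng.random((1, dimension), dtype=np.float32)
distances, ids = index.search(query_vector, 5)  # the 5 nearest stored vectors
print(ids)
print(distances)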

Vector Search in Practice

Here’s what happens when a user asks a question:

  1. Their query is turned into a vector using the same embedding model. (Again, getting to that shortly. Just want to show why it’s useful first.)
  2. The system searches the vector database for the “nearest neighbors” to that query vector.
  3. The top matches (documents, FAQs, etc.) are returned as retrieval context.
  4. This context is added to the prompt before calling the LLM (i.e., this is RAG in action). Main takeaway: RAG is used as part of a much more specific prompt into the large language model. It's not changing the model. (That is, there's no impact on the fundamental weights in the model.) What we're doing is giving the model much more context, which we're 'kind of' insisting the answer take into account. Newer ideas about using agents and reasoning can also use RAG techniques to maintain a kind of memory for completing tasks that may take multiple steps. (You can think of this as stuffing the model's short-term memory with information it never trained on. Kind of like giving an AI assistant a briefing packet before asking for a response.) A minimal sketch of this flow appears right after this list.
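
Here's that sketch, using sentence-transformers for the embeddings and a plain in-memory list standing in for a real vector database. The documents, model name, and prompt format are all illustrative assumptions.

import numpy as np
from sentence_transformers import SentenceTransformer

# Tiny made-up "knowledge base" standing in for a real vector database.
documents = [
    "Our return policy allows refunds within 30 days of purchase.",
    "Shipping to Canada typically takes 5 to 7 business days.",
    "The warranty covers manufacturing defects for one year.",
]

# 1. Embed the documents (normally done once, at indexing time).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

# 2. Turn the user's query into a vector with the SAME embedding model.
query = "How long do I have to send something back?"
query_vector = embedder.encode([query], normalize_embeddings=True)[0]

# 3. Find the nearest neighbors (on normalized vectors, cosine similarity is just a dot product).
scores = doc_vectors @ query_vector
top_indices = np.argsort(scores)[::-1][:2]
retrieved = [documents[i] for i in top_indices]

# 4. Add the retrieved context to the prompt before calling the LLM.
prompt = "Answer using only the context below.\n\nContext:\n"
prompt += "\n".join(retrieved)
prompt += f"\n\nQuestion: {query}"
print(prompt)  # this is what would actually get sent to the LLM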

The Great Embedding

When we say words are “turned into vectors,” we mean they are mapped to points in a high-dimensional numerical space. This mapping captures the meaning of a word based on how it’s used in context across millions or billions of examples.

At a technical level, these word embeddings are learned using neural networks, which are models that adjust weights to reduce prediction error. The objective is often something like: “Given the words around this word, predict what word goes in the blank.” Over time, this process encodes words with similar meanings into similar locations in vector space.

If all of this sounds complicated, think about what you already know. If I say, “It’s getting late and I’m getting hungry. After work, where do you want to go to eat ________”. Chances are good that blank gets filled in by “Dinner.” (Maybe not. But most probably.) This is all like the old Mad Libs fill in the blank game. Simple things are easy for us because our brains just do it. But for our computers to learn, we have to be explicit.

Let’s say a token (like “laptop”) is mapped to a 768-dimensional vector. That means it’s now represented as a list of 768 numbers, like this:

laptop → [0.21, -0.87, 1.02, ..., 0.43]

If you’re wondering about why the dimensional space is 768, don’t be overly concerned. There’s a reason for it having to do with standard embedding models, but the details are out of scope for this writeup. (And there are actually much larger dimensional spaces. We’re just using a small one here as a base example.) Really want to know? Here’s a discussion on Reddit about it.

Or at least just to kind of understand the basics, you can think of the dimensionality issues this way:

Aspect               | Low Dimensions (e.g., 384)                         | High Dimensions (e.g., 1024–1536)
Information Capacity | Less capacity to represent fine-grained meaning    | Can encode more subtle semantic nuances
Model Size & Memory  | Smaller, faster, less memory-intensive             | Slower, uses more memory/compute
Search Performance   | Good enough for many tasks (e.g., FAQs, basic RAG) | Better for large-scale, nuanced search or reasoning
Risk of Overfitting  | Lower                                              | Higher, especially if used with small training data
Similarity Quality   | May blur close meanings                            | Better at distinguishing similar but distinct texts
Cost (API-based)     | Cheaper (smaller tokens/vectors)                   | More expensive due to vector size and processing needs

But how do so many numbers get assigned to just one word? This is kind of the tricky part. You've heard that models get trained on massive amounts of text data, right? This is part of that. Before the model ever sees "laptop" it's been trained on billions of words. And it has a 'goal' like this: "Given the surrounding words, can I predict the next one?" To do this, it has to learn relationships between words; which ones tend to appear near each other, in what contexts, and how often. To start, each word, (or token really), gets a unique ID, (kind of like a vocabulary index), and is mapped to a vector of real numbers, starting at random. As the model gets feedback from training, it adjusts those numbers to reflect the word's meaning in context. (The result is that words that appear in similar contexts end up with similar vectors.)

This, by the way, is what always confused me personally… how can just one word get a vector??? The answer is that any embedding model is based on some kind of corpus. So there's always a relationship somehow. It might start with random numbers and then get trained, but a single word, in and of itself, can't be meaningfully vectorized. Trying to vectorize a single word with no training data would be like asking someone who's never seen or heard a piano to describe how it sounds, or to estimate the average speed of a car with zero trip data. (Maybe this never bothered you. But it bugged me. So there you go.)

Here are the steps:

  • Each word starts as an index. (So "laptop" might be word #5,843 in a vocabulary. That index becomes a one-hot vector… a long vector with all zeros except a 1 at position 5,843.)
  • That one-hot vector is multiplied by an embedding matrix which is [vocabulary_size x embedding_size]. So, for example, a 50,000 word vocabulary x a 768-dimension vector is 50,000 rows x 768 columns. (Technically, we call this getting "big like moose." And this is still small!) Each row = one word's embedding. Each column = one "feature" the model learns, e.g., things like "is-technology," "is-device" and so on. (A small sketch of this lookup follows right after this list.)
  • Each of the 768 numbers is a weight that the model learned over time. It gets adjusted gradually to reflect the word's behavior in thousands or millions of contexts.
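
Here's a minimal sketch of that lookup with a deliberately tiny vocabulary and embedding size so the numbers fit on screen. Real models use tens of thousands of tokens and hundreds of dimensions, and the values below are random stand-ins for learned weights.

import numpy as np

vocab_size = 10      # pretend vocabulary of 10 words (real: tens of thousands)
embedding_dim = 4    # pretend embedding size of 4 (real: 768 or more)

# The embedding matrix: one row per word, one column per learned "feature."
# In a real model these values are learned during training; here they're random.
rng = np.random.default_rng(42)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# Say "laptop" is word #7 in our tiny vocabulary.
laptop_id = 7
one_hot = np.zeros(vocab_size)
one_hot[laptop_id] = 1.0

# Multiplying the one-hot vector by the matrix just selects row 7.
laptop_embedding = one_hot @ embedding_matrix
print(laptop_embedding)
print(np.allclose(laptop_embedding, embedding_matrix[laptop_id]))  # True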

Have you ever seen a huge audio mixer board like what studios use for music or movie sound? Each word can maybe be thought of as having its own "mix." The model learns how high or low to set each slider so that when it sees "laptop," the sound it generates, (its prediction), is as accurate as possible.

So, if “laptop” and “notebook” end up with vectors that are very close in direction, the model understands that they mean nearly the same thing — even if they’re different words.

Let’s Try It

I've built a Google Colab notebook to actually run some code to generate sentence embeddings and visualize a chart with how their similarity would look. Colab is a great, (and free), way to try out code snippets. If you want to learn more about it, you can try Getting Started with Google Colab: A Beginner's Guide and Google Colab Tutorial for Beginners (YouTube). But for now, you can just go to the notebook below and follow the instructions. (Though really, you would need to learn enough to get your own HuggingFace API key, (which is free), and also if you want to see the optional OpenAI embeddings, you'd need an API key there as well.) You don't need to do this. I'm just providing the notebook if you want to play with it. You can just look at the output below instead.

Here’s the notebook: LLM_Vector_Embedding_Colab_Initial.ipynb and you can use it if you want to run these examples yourself. (Feel free to copy it to your own Google Drive.) If the link opens with a bunch of code gibberish, just click the “Open With…” button on the top of the screen.

But here’s the bottom line:

If we use the following four sentences:

sentences = [
    "The laptop is on the desk.",
    "A computer rests on a wooden table.",
    "The sky is clear and blue today.",
    "Apples and bananas are delicious.",
]

We’d get vectors something like this… (it goes out to 384 columns):

Note that for this example, these are embeddings for whole sentences, not just words. Most GPTs with which you’re likely familiar embed at the token level. We’re using sentence embedding in this example as it’s somewhat easier to represent and understand how contextual similarity works at this level.
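
If you're curious what that step looks like in code, here's a minimal sketch along the lines of what the notebook produces, assuming the all-MiniLM-L6-v2 sentence-transformers model (which happens to output 384-dimensional vectors, matching the 384 columns mentioned above). The actual notebook may use a different setup, such as the Hugging Face API mentioned earlier.

from sentence_transformers import SentenceTransformer

sentences = [
    "The laptop is on the desk.",
    "A computer rests on a wooden table.",
    "The sky is clear and blue today.",
    "Apples and bananas are delicious.",
]

# This model maps each whole sentence to a 384-dimensional vector.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences)

print(embeddings.shape)   # (4, 384): four sentences, 384 numbers each
print(embeddings[0][:5])  # the first few numbers of the first sentence's vector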

Okay. You might be saying, “That’s nice. So what?” Well, let’s look at the next section to find out.

Sentence Similarity

Now that we have things in vectors, we can calculate how similar they are in a number of different ways. In our Colab notebook, we'll try two: PCA (to visualize closeness in two dimensions) and cosine similarity (to measure it directly).

Principal Component Analysis (PCA)

PCA is a way to take a big, complicated set of data and make it simpler without losing the important stuff. Imagine you have a bunch of points in 3D space. PCA helps you flatten that into 2D or 1D so you can see patterns more easily. It’s like taking a photo of something from the best angle to understand its shape.

Here's how the closeness of our sentences looks if we run them through PCA and plot them on a 2D graph.

See how the sentences with "laptop" and "computer" are close together even though these are different words? This was not done through rules or a synonym list. This was done by actually "learning" that these two concepts are tightly related.
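
Here's a minimal sketch of the code behind that kind of plot, assuming the embeddings and sentences variables from the previous snippet. The truncated labels are just for readability.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Squash the 384-dimensional sentence vectors down to 2 dimensions for plotting.
pca = PCA(n_components=2)
points = pca.fit_transform(embeddings)

plt.figure(figsize=(6, 4))
plt.scatter(points[:, 0], points[:, 1])
for (x, y), sentence in zip(points, sentences):
    plt.annotate(sentence[:20] + "...", (x, y))
plt.title("Sentence embeddings projected to 2D with PCA")
plt.show()
# The laptop and computer sentences land near each other;
# the sky and fruit sentences land farther apart.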

Cosine Similarity

Cosine similarity measures how similar two vectors are by calculating the angle between them, with a value of 1 meaning they point in exactly the same direction. Anyway, the “distance” or angle between these vectors is what tells the model how semantically similar two tokens are. The value of cosine similarity ranges from -1 to 1:

cosine_similarity(A, B) = (A ⋅ B) / (‖A‖ × ‖B‖)

Where:

  • A ⋅ B is the dot product of the two vectors
  • ‖A‖ and ‖B‖ are the magnitudes (lengths) of vectors A and B
  • A similarity of 1 = perfect match, 0 = orthogonal (no relation), -1 = complete opposite
  • read as: "Cosine similarity of A and B equals A dot B over the norm of A times the norm of B." Or in more detail: cosine similarity of A and B equals the dot product of A and B (the two vectors multiplied element by element, then summed), divided by the product of the magnitudes of A and B (with "magnitude" being the length of the vector).
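
And here's a minimal sketch of computing it for our four sentence embeddings, again assuming the embeddings and sentences variables from the earlier snippet:

from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between all four sentence vectors.
similarity_matrix = cosine_similarity(embeddings)

for i, row in enumerate(similarity_matrix):
    print(sentences[i][:25], [f"{value:.2f}" for value in row])
# The laptop/computer pair scores noticeably higher with each other
# than either does with the sky or fruit sentences.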

So when you hear people say something like “Paris – France + Italy = Rome,” they mean that the vector arithmetic produces a new point in space that’s very close to the vector for “Rome.” This kind of math shows how the model captures relationships, not just between words, but between concepts.

Here’s a visualization of what it looks like from our example:


Note: The visualizations were created using scikit-learn (sklearn), a popular Python library that provides easy-to-use tools for machine learning, including classification, regression, clustering, and data preprocessing. (See the Colab file for details, and if you want, you can put in your own sentences in Step 3 to see how they come out.)

Edit: After posting this article, I got some recommendations for others on similar topics. For those who want to dive deeper into the linear algebra, here's a great article by Joshua Wheeler: Applications to Machine Learning – Image Compression. (It's focused on image object types, but the math is generally the same. The 'embedding' models may be different, but the vectors and similarity functions are generally the same.)

The Magic

This is the magic: we’re not matching characters or words. We’re comparing meaning, learned from context.

Video Explanation of one way to do it: Word Embedding and Word2Vec, Clearly Explained!!!

By the way, cosine similarity is just one way to measure how close two vectors are, and PCA is really a way to visualize that closeness in fewer dimensions. There are other distance measures too, like Euclidean distance, dot product, or Manhattan distance, depending on the use case. The key takeaway is this: we've turned text into math via high-dimensional vectors and now we're just asking, "How close are these concepts to each other?" I don't personally pretend to understand the depth of the math here. Just glossing over it, though, will ideally be enough to grasp the high-level concepts of how this all comes to actually work. How do you know which model to use? This is where you can either play with the options yourself for hours or months, or talk to your development lead or data science team lead. (Pro Tip: Ask the pros.)

If it helps, go back to middle or high school math for a second and think about the classic 2D Cartesian plane. We're doing the same thing, just not in 2 dimensions anymore. Now it's hundreds or thousands. It can definitely be a bit mind bending, but that's what all the fancy math does: it measures how close two vectors are, just across more dimensions than we can visualize.

In 2D, we can see which lines are closer together. Which is for the most part also saying which are most similar. We’re just doing the same thing now with much more complicated lines of a sort. And seeing what’s closest to what.

Again, the challenge in understanding this is that the human mind kind of breaks down trying to see this beyond three dimensions. If you’re an ace mathematician, you can research this further and perhaps come to understand more of how it works. For the rest of us, we’re probably ok if we can at least make the jump to just understanding that it’s the “closeness” that falls out of the math. In the future, might there be other, better algorithms? Probably. Or maybe. But this type of solution is our current Flavor-of-the-Month and brought us the LLM frenzy.

LLMs

So how do we go from word vectors to full-blown text generation?

We train Large Language Models (LLMs) on mountains of text. Big mountains. As in, as much as they can sensibly get that’s representative of the kind of information they need to represent. These models learn the probability that a word follows another word, then chains of words, then whole ideas.

At their core, LLMs are glorified autocomplete machines; just way, way better. They don’t know facts. They know patterns.

And what makes them special isn’t just their size, it’s their architecture.

Neural Networks and Transformer Architecture (Clarified)

Neural networks are loosely inspired by the brain: they use layers of connected "neurons" (really, math functions) to process input. But earlier architectures struggled with sequential data like sentences because they processed inputs in order, one word at a time.

Transformers solved this using self-attention, which allows the model to consider the entire sentence at once and assign weights to which words matter most in context. Or rather, they can pay attention to different parts of sentences at the same time. When they do this they assign dynamic importance weights to each token in context. Example: Take the sentence: “The cat sat on the mat.” When processing the word “sat,” the transformer might give high weight to “cat” (to know who sat) and “mat” (to know where it sat), and lower weight to “the” or “on.” The model does this for every word—building a web of weighted relationships that reflect meaning through context.

Another Example:

  • In “The apple on the table is red,” the word “apple” might attend to “red.”
  • In “Apple’s stock rose,” the word “Apple” might attend to “stock.”

This attention mechanism lets the model “disambiguate” based on context.
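
Here's a minimal sketch of the core self-attention computation (scaled dot-product attention) with made-up numbers, just to show that those "importance weights" are ordinary arithmetic. A real transformer also learns separate query, key, and value projections, multiple attention heads, and much more; this is only the kernel of the idea.

import numpy as np

def softmax(x):
    """Turn raw scores into weights that sum to 1 along each row."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Pretend we have 4 tokens, each already embedded as a 3-D vector.
# These numbers are invented; real models learn them.
tokens = ["the", "cat", "sat", "mat"]
Q = K = V = np.array([
    [0.1, 0.2, 0.3],   # the
    [0.9, 0.1, 0.4],   # cat
    [0.8, 0.2, 0.5],   # sat
    [0.3, 0.9, 0.1],   # mat
])

d = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d)   # how strongly each token "looks at" every other token
weights = softmax(scores)       # each row sums to 1: the attention weights
output = weights @ V            # each token's new, context-mixed representation

print(np.round(weights, 2))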

This architecture scaled like crazy and gave rise to GPTs (Generative Pre-trained Transformers).

LLMs vs. Fine-Tuning vs. RAG (Clarified Metaphor + Summary Table)

So the GPTs themselves are very high level. Then they maybe get fine-tuned for particular subject areas using documents and data that are more specific to the selected topic area at hand. (Fine-tuning actually adjusts the weights to some level within the foundational model.) Then, Retrieval Augmented Generation (RAG) can be added on, which really just extends the query. (This is true whether using RAG or the more recent, somewhat similar Cache Augmented Generation (CAG). CAG is slightly different in that such a cache stores vector representations of prior LLM completions. So RAG is used for injecting new knowledge, whereas CAG is more about including previous results and optimizing speed.) Finally, we can use agents that gather information from various sources or perform other tasks towards getting us to a final result.

Key takeaway: RAG information needs to be encoded using the same embedding model as the prompt/query sent to the RAG repository, which at this point is a vector database. But it doesn't necessarily have to be the same as the foundation model's. This can be useful, as there may be reasons a different embedding model makes sense for RAG. (Either for cost reasons or because one is functionally better than another for a given task.) You might be wondering, "waitaminute… what about the foundational model's embedding? Doesn't a new query have to match that?" Well, remember that after a RAG operation has completed, it's the full results that get made part of a new prompt, and that is what gets submitted to the foundation model, which uses its own embeddings internally. So if a different embedding model is used in a RAG operation vs. the one used in the foundational model, that's fine.

Let’s try all of this in table form:

Concept         | Real-world Analogy                 | When You'd Use It
Pretraining     | Going to school                    | Learning general knowledge
Fine-tuning     | On-the-job training                | Specializing for a domain
RAG (Retrieval) | Asking for documents at work       | Answering with the latest/project info
Agent w/ tools  | Talking to an expert or calculator | Solving complex or multi-step problems

Reasoning Models

LLMs don’t actually “think,” but they simulate reasoning by chaining outputs together, especially when you “prompt” them well.

Some of the best models now use chain-of-thought or tool-using methods, where the LLM can call on itself or other systems to build complex answers step-by-step. Do note, though, that there are philosophical arguments as to whether Chain-of-Thought always works. It's possible some LLMs are internally sort of ignoring some of it. They may 'pretend' to use the info, but not as much as one might think. This is the stuff of deep research by those who are trying harder than ever to delve into observability within the neural nets of LLMs. (If you want to get a sense of this issue, see Chain of Thought is not what we thought it was… (YouTube).)

Using these methods is like talking to yourself to figure something out. Except the "self" is a giant probabilistic engine with billions of parameters and no real understanding. That metaphor is just another attempt to convey some understanding.

Still, it works. Or at least it seems to work well in a variety of cases.

Consider this attribution graph:
(Try opening the link to play with the interactive version.)


If this seems confusing, consider that a lot of digital folks know about marketing. One of the things marketers look at when it comes to advertising is attribution of various channels and consumer touch points towards a sale. Trying to figure out to what degree each element contributes is an attempt at attribution. This is conceptually similar: which concepts are most responsible for producing the response.

The Impact on Costs and Capabilities

Each token processed by an LLM requires compute. The longer your prompt (especially with RAG), the more tokens. That’s why efficient chunking, good embeddings, and smart filtering matter. Costs aren’t just about total queries. They’re about total tokens.

LLMs are expensive in three ways:

  1. Training: It takes millions of dollars to train a state-of-the-art model.
  2. Inference: Every time you ask a question, a lot of compute happens under the hood.
  3. Storage & Memory: These models are huge. Hosting and managing them isn’t cheap.

But the upside is massive: scalable insight, automation, customer support, content generation, language translation, and more.

That’s why vector-based LLMs are showing up in everything from sales tools to healthcare to HR chatbots.

If we use a GPT to answer short questions similar to how we would use Google, the costs remain relatively low. However, if we use GPT to answer questions that require providing extensive context, such as personal data, the query can quickly accumulate thousands of tokens. That increases the cost significantly. But don’t worry, you can set a cost limit.
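
Here's a minimal sketch of that kind of back-of-the-envelope math, using OpenAI's tiktoken library to count tokens. The per-token price below is a placeholder assumption, not a real quote, so plug in your provider's current pricing.

import tiktoken

# A short query vs. the same query stuffed with retrieved context.
short_prompt = "What is our refund policy?"
long_prompt = short_prompt + "\n\nContext:\n" + ("Policy document text... " * 500)

# cl100k_base is the token encoding used by several OpenAI chat models.
encoding = tiktoken.get_encoding("cl100k_base")

# Placeholder: assume $0.01 per 1,000 input tokens. Purely illustrative.
price_per_1k_tokens = 0.01

for name, prompt in [("short query", short_prompt), ("query with RAG context", long_prompt)]:
    num_tokens = len(encoding.encode(prompt))
    cost = num_tokens / 1000 * price_per_1k_tokens
    print(f"{name}: {num_tokens} tokens, ~${cost:.4f} per call")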

Quick Word on Tokens

Using the word token for linguistic purposes here is actually a rather perfect metaphor. This section is especially challenging and annoying to me because I’ve recently also written about crypto tokens. This isn’t that. This is a different use of the word token. The word token is actually a polyseme. That is, the word has different, but related meanings. Both usages imply something representative of a larger structure. So it is yet another great example of why all of this vector stuff is important. Much more so than traditional text search, vectorization can collapse a lot of the ambiguity that is tragically endemic to language. (Perhaps especially to English.) The word “token” in this case clearly isn’t a synonym or homophone. Ideally, this illustrates the point that language ambiguity (like polysemy) is why vector-based systems are so powerful and useful. Tokenization in NLP turns “ambiguous symbols” into measurable representations, helping us disambiguate based on context.

In any case, tokens are not really whole words, though most folks use the term as if they were words, as it's generally close enough to think about them this way and do rough calculations of costs.

In this context, a token may be a whole word or just a piece of one. Think about English, though. We have plurals, tenses, and other forms of words. In traditional information retrieval, lemmatization is often used. Lemmas are base dictionary forms of words. (running → run, was → be, etc.) Vector-based retrieval can largely replace lemmatization. OK, perhaps not always. (There are some edge cases, like keyword filters or exact matching, where lemmatization may still be the best option.)

I hadn't meant to turn this into a full linguistics discussion, but suffice it to say, traditional Natural Language Processing (NLP) mostly relied on rules-based systems. (Search, document classifiers.) For these, matching word forms was critical because the machine had no understanding of meaning. (So "running shoes" and "ran in sneakers" would have had little chance of connecting without lemmatization or synonym mapping.)

With modern embedding models for vector retrieval though, the machines actually learn semantic relationships directly from massive corpora. Instead of worrying about base forms, they operate on contextual similarity. For example, “She is running fast” and “She ran quickly” will produce similar vectors, even if the verb form differs.
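
A minimal sketch of that claim, again assuming a sentence-transformers model; the exact similarity numbers will vary by model, so treat them as illustrative:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

a = model.encode("She is running fast")
b = model.encode("She ran quickly")
c = model.encode("The invoice is overdue")  # an unrelated sentence for contrast

# No lemmatization anywhere: the different verb forms still land close together.
print(util.cos_sim(a, b))  # relatively high similarity
print(util.cos_sim(a, c))  # noticeably lower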

So what? Well… when you embed and store documents as vectors and a user asks a question, the model retrieves based on semantic closeness, not exact matches.

One caveat to the above statement. The closeness is based on how the model was trained, not necessarily perfect semantic equivalence. So vector similarity depends heavily on the model’s training corpus and objective. The real takeaway here is that proximity in the vector space indicates learned similarity, which could reflect correlation more than any deep “understanding.”

If you want to really see how tokens translate into words for the purpose of estimates, see the Tokenizer App at OpenAI.

Recapping How Words Get Turned Into Vectors and Drive LLMs

Let’s try to finish this off.

  • Words go through tokenization.
  • Each token gets mapped to an embedding vector. (Which is a numerical representation based on its learned context as described earlier.)
  • These embeddings get fed into the transformer, which processes the whole sequence and returns predictions or completions.

So what we have is this: Language → Numbers → Magic (but really math).
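
To tie that recap together, here's a minimal sketch of those three steps using the Hugging Face transformers library with a small BERT-style model. Any similar model would do; bert-base-uncased is just a common, lightweight choice for illustration.

import torch
from transformers import AutoTokenizer, AutoModel

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "The laptop is on the desk."

# Step 1: tokenization (words and sub-words become token IDs).
inputs = tokenizer(text, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))

# Steps 2 and 3: token IDs -> embedding vectors -> transformer layers.
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token, 768 numbers each for this model.
print(outputs.last_hidden_state.shape)  # (1, number_of_tokens, 768)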

Ah, one other thing. Text vectors also have to be indexed and this happens in the vector database.

What’s Next?

Prediction of what’s next in this category of ridiculously fast-changing updates is probably the utmost in arrogance. I won’t pretend to be able to do so beyond maybe next week! There are some things that seem clear to consider though. Vector databases and RAG will likely become an increasingly important part of the LLM focused product stack.

They do at least a few things: provide more context, offer more up-to-date information, and, perhaps most interestingly of all, provide the potential for more agentic workflows, which are still in very early stages. Regardless of how we define agents, (and those definitions are evolving), these helper tools will increasingly be going about their business across a variety of information spaces. When they gather their bits and pieces in an attempt to complete their tasks, often they'll be feeding some form of LLM with gathered information in an attempt to synthesize summaries, insights, or otherwise offer a conclusion of sorts. A lot of those data sources are likely best retrieved via some sort of vectors. I think we'll see more tooling to convert data from various forms to allow autonomous agents to go about their day. Product managers and data architects should pay attention to this space, as the tools here will likely be enabling for new capabilities.

See Also…

Word Embedding and Word2Vec, Clearly Explained!!!
All You Need to Know to Build Your First LLM App
On the Biology of a Large Language Model
Day 4: What Is RAG, and What Does It Mean to Make It Agentic?
All about Vector Databases by Anthony Sadarangani
Understanding the Basics of Vectors and Matrices

