
In a previous post, Building a PM Helper with AI, I showed how a fun personal AI project I’d built as my own Product Management tool searched across multiple sources before synthesizing answers. Unfortunately, I made both a strategic and a tactical error in that 1.0 version. The solution? Retrieval Augmented Generation with a vector database. What I’m going to do here is offer some super-fast, high-level definitions as I go through the problem space, and maybe, in future posts, go more deeply into the value of RAG and vector databases.
tl;dr:
- If you’re a product manager working with AI at any level, you will likely need to understand Retrieval Augmented Generation (RAG) to some degree. The following is a small, practical use case to help you see the value in action.
- For the most part, when using LLMs for your own custom work, you’re stuck with the foundation model.
- Fine-tuning changes a model’s weights to varying degrees, depending on how deep you want to go. These weights basically control how the model transforms input, and there can be billions of them. The deeper the impact you want, the higher the cost. (You’re not likely fine-tuning for personal projects, though. And if you are, it will almost certainly be with open source foundation models.)
- Retrieval Augmented Generation (RAG) doesn’t change weights at all. RAG just passes more information into a prompt (which is a fancy name for an information query, unless you really add fuller instructions), but it’s limited by something called a context window: basically, how much info you can pass in. It’s like saying, “Here, read this before answering.” So theoretically RAG reduces the chances of hallucinations and offers more “truthy” answers. (Assuming good data in what you feed it.)
So What’s the Problem with My Solution?
The fundamental mistake I made in my personal coaching solution (Building a PM Helper with AI) was having my AI agent run “just” a search query against my Confluence Wiki data to gather known, good info. (At least, known to be good based on my judgment of data quality gathered over years.) My goal was – and remains – for my agent to use the high-value, trusted data I’ve collected over the years regarding Digital Product Management. But use it as part of the LLM’s synthesized, “transformed” answers, not just as a set of search results.

Using simple search relies on the search mechanism to return the best documents relevant to a query. Then I’d have to parse those results to deliver the content only (as opposed to all the codes, HTML, etc. on a page) in order to feed the info to my chosen LLM. This relies heavily on a solid search engine. And maybe that’s ok. But possibly not. All I’d be getting would be a few top documents. More importantly, I also committed a simple tactical error: the node I used to retrieve data and parse the pages wasn’t getting all the page data. While that might be fixable relatively easily, it doesn’t solve the core issue. I was using keyword search, with the search terms determined by an AI agent expanding the initial query, to get some subset of documents rather than converting my entire corpus of information into a format that could be more directly queried and synthesized by the LLM in use.

The solution? Dump all my data into a vector database. (This is the crude way. A pro would be looking more at chunking, different embedding strategies, and metadata.) We still end up with a search happening, but it’s via text embeddings and vectors, which is closer to the LLM’s “native” understanding. The minimum viable solution is just to dump the info into a vector database and start using it. Later, I can add version checks and upsert the database if there are changes in the underlying corpora. And actually, since nothing here requires anything close to real time, it might be better (that is, less expensive at runtime) to use a wholly separate process to update the vector database, maybe just on a daily basis. Why? Because queries usually aren’t nearly as intensive as encoding. (And “intensive” equals CPU cycles equals costs.) Here is what the new flow looks like:

You can see that the Search Agent (one of four that go against various info repositories) now explicitly checks the vector database before merging and aggregating content to produce a final answer, which is then sent (via webhook) back to the main web chat application.
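Under the n8n hood, the retrieval step amounts to roughly this. It’s a sketch, not my actual workflow: it assumes the Pinecone Node SDK and OpenAI embeddings, and the index name and model are placeholders.

```javascript
// Rough shape of the retrieval step: embed the user's question with the SAME
// model used at indexing time, then ask Pinecone for the nearest chunks.
import OpenAI from 'openai';
import { Pinecone } from '@pinecone-database/pinecone';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });

export async function retrieveContext(question, topK = 5) {
  const { data } = await openai.embeddings.create({
    model: 'text-embedding-3-small', // placeholder embedding model
    input: question,
  });

  const index = pc.index('pm-helper'); // hypothetical index name
  const results = await index.query({
    vector: data[0].embedding,
    topK,
    includeMetadata: true,
  });

  // Each match carries the original chunk text in its metadata (set at upsert time).
  return results.matches.map((m) => m.metadata?.text ?? '');
}
```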
How Can RAG / Vector Help?
First, the quickie definitions. Retrieval Augmented Generation (RAG) is where an AI Large Language Model (LLM) goes and gets some trusted external information from somewhere else and uses it to offer a better response than it otherwise might. (There are deeper and more precise explanations. Maybe I’ll get to them in a future post, but really, this area has been well covered, so I’ll give links at the end of this section.) Then we have vector databases. Databases store stuff. Different types of databases represent information in different ways within their internal structure. Why so many database types? Great question. When you get all the way back to basics, almost everything in computing comes back to CPU cycles. And sure, storage matters as well. As it turns out, organizing things in certain ways can be more efficient than others for particular use cases. Hence, varying structures. (Relational databases, graph databases, vector databases.)
The advantage of vector databases for this use case is that they can store a selected set of information in a format most amenable for a large language model to use. Specifically, the model can “somewhat intelligently” query this dataset for information pertinent to its current information-seeking, processing, and response mission, and upon retrieval, incorporate that information into its processing and response. It does this by ingesting the retrieved information and casting it as if it were part of a new, fuller query that includes the original. (Just note that RAG is not persistent. That is, it doesn’t update the LLM or hang on to this info as part of your future prompts unless special work is done to allow for this.)
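To make “a new, fuller query that includes the original” concrete, here’s a minimal sketch. It reuses the hypothetical retrieveContext() from the earlier sketch; the instructions, model name, and formatting are illustrative, not any framework’s canonical approach.

```javascript
// Fold retrieved chunks into the prompt, then send the combined text to the model.
// retrieveContext() is the hypothetical lookup sketched earlier.
async function answerWithRag(openai, question) {
  const chunks = await retrieveContext(question);

  const prompt = [
    'Answer the question using only the context below.',
    'If the context does not contain the answer, say you do not know.',
    '',
    'Context:',
    chunks.join('\n---\n'),
    '',
    `Question: ${question}`,
  ].join('\n');

  const completion = await openai.chat.completions.create({
    model: 'gpt-4o-mini', // placeholder model name
    messages: [{ role: 'user', content: prompt }],
  });
  return completion.choices[0].message.content;
}
```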
See Also…
RAG Explained (YouTube)
RAG vs. Fine Tuning (YouTube)
Learn RAG From Scratch – Python AI Tutorial from a LangChain Engineer (YouTube)
(Way deeper, but the first 6 minutes cover the basics)
OK, Let’s Do It!
I’m going to use Pinecone as my vector database. There are many options for this. In my case, I’m using free or super-cheap tiers because this is just a personal project for fun. If this were a production-level product requiring a lot more horsepower, I’d need to do a fuller vendor review and look at all-in costs, from upfront to operational, and so on.
The first step is to get API access to my Confluence Wiki, retrieve the desired page IDs, and extract the text from those pages. I’m not worried about embedded graphics or other data types for this first effort. However, obvious next steps involve getting info from other data types, from images to PDFs to presentations, and so on.
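For reference, pulling one page’s content from Confluence Cloud looks roughly like this. The site name and page ID are placeholders, and authentication here assumes an Atlassian API token used with basic auth.

```javascript
// Fetch one Confluence Cloud page and return its stored HTML body.
// Site, email, and token are placeholders; body.storage is the page content
// in Confluence "storage format" (XHTML-ish markup).
async function fetchConfluencePage(pageId) {
  const site = 'your-site'; // e.g. https://your-site.atlassian.net
  const auth = Buffer.from(
    `${process.env.ATLASSIAN_EMAIL}:${process.env.ATLASSIAN_API_TOKEN}`
  ).toString('base64');

  const res = await fetch(
    `https://${site}.atlassian.net/wiki/rest/api/content/${pageId}?expand=body.storage`,
    { headers: { Authorization: `Basic ${auth}`, Accept: 'application/json' } }
  );
  if (!res.ok) throw new Error(`Confluence request failed: ${res.status}`);

  const page = await res.json();
  return { title: page.title, html: page.body.storage.value };
}
```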
Next, this info needs to be delivered to a Pinecone vector database, embedded with the same embedding model that will later be used to embed queries. You need to use the same embedding model on both sides or else you get garbage. Using the wrong embedding would be like filing documents in a different language than your search system understands. Everything’s technically there, but when you go looking for answers, all you get is gibberish. The whole point of vector search in this case is to do a better job at similarity search, in service of synthesizing answers, than simple keyword search does.
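Here’s roughly what that delivery step looks like in code. Again a sketch: same placeholder index and embedding model as above, openai and pc are the clients from the earlier sketch, and chunking is assumed to have already happened.

```javascript
// Embed a batch of text chunks and upsert them into Pinecone.
// Each vector stores the chunk text in metadata so the retrieval step can
// hand readable text back to the LLM. Names are placeholders.
async function upsertChunks(openai, pc, pageId, chunks) {
  const { data } = await openai.embeddings.create({
    model: 'text-embedding-3-small', // must match the model used at query time
    input: chunks,
  });

  const vectors = data.map((d, i) => ({
    id: `${pageId}-${i}`,
    values: d.embedding,
    metadata: { text: chunks[i], pageId },
  }));

  await pc.index('pm-helper').upsert(vectors);
}
```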
Here’s what the flow looks like. Again, I’m using n8n.com as my low-skill / almost no-skill agent tool. You could probably do this for a production-ready product with a properly hosted instance of n8n, though many would argue for a more pure-code solution.

This part is what a lot of folks would call “not the fun part.” While there should be careful thought about this kind of flow, there’s not a lot of high-level ideation on product value or fancy AI models. It’s about data access and preparation. This fairly simple flow gets the information, makes a half-hearted attempt to clean it up a bit, chunks it into pieces, applies an embedding model, and finally loads it into a vector database. (In my case, the chunking and splitting decisions were all based on making things small enough to fit into free or inexpensive service tiers, not on any real product-quality concerns.) Some people may enjoy this part. (I’m not really one of them.) From a product perspective, the main thing is to account for the time and cost of the initial build and ongoing maintenance.
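For what it’s worth, the chunking step itself is simple. Here’s a minimal sketch of fixed-size chunking with overlap; the sizes are arbitrary placeholders, tuned in my case mostly to stay inside free-tier limits rather than for retrieval quality.

```javascript
// Naive fixed-size chunking with overlap, measured in characters rather than
// tokens. Real pipelines often split on paragraphs/sentences and count tokens,
// but this is enough to get text into pieces a free tier will accept.
function chunkText(text, chunkSize = 1500, overlap = 200) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    start += chunkSize - overlap;
  }
  return chunks;
}
```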
See Also:
Vector Databases simply explained! (Embeddings & Indexes) (YouTube)
What are Word Embeddings? (YouTube)
A Beginner’s Guide to Vector Embeddings (YouTube)
Word Embedding and Word2Vec, Clearly Explained!!! (YouTube)
The Result?
I’m running both my old workflow model and my new model in separate workflows so I can test the output. And when I say test here, I mean I’m kind of looking at answers and seeing what I think is best. I am not developing a full-on evaluation rubric for what is mostly a just-for-fun thought experiment. (By the way, a rubric is a means to evaluate LLMs. If unfamiliar, please see my article Intro to AI Rubrics for Product Managers.) Many types of LLM and agentic value assessments are subjective. There are ways to try to quantify even subjective assessments, and if doing serious work, creating evaluation rubrics makes sense. But even so, this is a potentially expensive and time-consuming process. For a little personal test app? It’s easy enough to create a separate workflow version, try out some examples, and then evaluate the answers at a simpler, personal-judgment level.
Learnings
Token Limits
You’re going to run into them. Quickly. And there are costs involved. The costs include not only the obvious raw cost of LLM processing, but also the preparation work. There are Extract-Transform-Load (ETL) type efforts to prep the data in the first place. Then, most likely, the need to split into batches prior to creating embeddings.
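Here’s a sketch of the batching I mean, since embedding APIs cap how much you can send per request. The batch size is an arbitrary placeholder; the real limit is usually expressed in tokens, not items.

```javascript
// Send chunks to the embeddings endpoint in batches instead of one giant call.
// Batch size is a placeholder; the actual constraint is the provider's
// per-request token limit, so very long chunks may need smaller batches.
async function embedInBatches(openai, chunks, batchSize = 100) {
  const embeddings = [];
  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    const { data } = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: batch,
    });
    embeddings.push(...data.map((d) => d.embedding));
  }
  return embeddings;
}
```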
Even testing can eat up allotments fast. It’s kind of like those amusement park games with the claw. You know, the one where you put in at least $0.50 to move the claw to pick up what you want, but it rarely works? $10 later, you have a $0.30 piece of candy. Sensible adults don’t play these at all, except maybe it’s worth $5.00 for your kid to learn that this kind of thing doesn’t work. Well, it’s like that. Until you actually get some data through. Then things are fine.
But… now you have to really budget for what your true usage might be.
Edit: The very evening I posted this, I caught up on the news about Meta’s launch of Llama 4 Scout, their smallest new model. (Which was yesterday, April 6. Or Hugging Face says April 5. And they have other big models still baking and not quite ready for download.) Even their smallest comes with a context window of 10 million tokens. That’s a lot. That’s roughly 15,000 pages of text. (We’ll leave aside other data types for now.) What does this mean? It means that RAG can be even better. It means enhanced comprehension of extensive data, and probably better performance on complex tasks. Nevertheless, keep in mind there’s always a hit somewhere. The larger the context window used, the higher the costs. Algorithms and chips may get better every day, but transformer models process context using attention mechanisms whose time and memory costs grow with the number of tokens, and not linearly: standard self-attention scales roughly quadratically with context length. Using a much larger context window could be a lot more expensive to run.
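A back-of-the-envelope sketch of why that matters. This is purely illustrative and ignores the real-world optimizations (attention variants, caching, and so on) that change the constants in practice; the token counts are made-up round numbers.

```javascript
// Rough intuition only: if attention cost grows ~quadratically with context
// length, compare a typical RAG-sized prompt to a "just shove everything in"
// prompt.
const ragPromptTokens = 8_000;     // question + a handful of retrieved chunks
const fullDumpTokens = 1_000_000;  // a big slice of the whole corpus

const relativeAttentionCost = (fullDumpTokens / ragPromptTokens) ** 2;
console.log(`~${relativeAttentionCost.toLocaleString()}x the attention compute`);
// => ~15,625x -- which is why "the context window is huge now" doesn't
//    automatically mean "stuff everything into every prompt."
```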
APIs & Parameter Passing
The hardest part of making it all go, at least for this effort, was getting all the permissions set up properly and getting everyone’s formatting just so. For example, taking content out of a Confluence Wiki is easy enough, but parsing out the HTML has to be done well enough that, in a following step, any stray entities that look like JSON are handled. Otherwise, the JavaScript used to feed the embeddings model can choke on the input, thinking there’s some kind of command in there. Alternatively (such as if you actually want code to be passed along), you need to consider other methods for embedding. If your JavaScript skills are tragically rusty (like mine), or wholly non-existent, this can be a struggle. However, as they say, perfect is the enemy of the good. And we can punt a bit here and rely on the LLM and tokenization to simply put the right textual connections together. (Which is what I did. I estimate I got 90% of the junk out, and that was good enough for a noticeable lift in my goal results.)
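Here’s a rough stand-in for that cleanup step. In my build this lived in an n8n Code node; the function below is just plain JavaScript, and it’s the crude, “get 90% of the junk out” version rather than a proper HTML parser.

```javascript
// Crude HTML-to-text cleanup: strip tags, decode a few common entities,
// and collapse whitespace. A real pipeline would use a proper HTML parser
// instead of regexes.
function cleanConfluenceHtml(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ') // drop embedded scripts
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')   // drop embedded styles
    .replace(/<[^>]+>/g, ' ')                    // strip remaining tags
    .replace(/&nbsp;/g, ' ')
    .replace(/&amp;/g, '&')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/\s+/g, ' ')                        // collapse whitespace
    .trim();
}
```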
Key Danger
Here’s a rookie mistake. I was getting some help from an LLM with one of my n8n nodes that wasn’t working well. I cut/pasted an error message into the LLM, but… by accident, I pasted an API key in there too. I’d hit enter so fast it was just up there. Probably nothing happens, but I had to revoke the key and create a new one.
Bottom Line
By adding RAG into the mix for my personal AI assistant chat bot, I got results that clearly drew on specific information whose quality I trust. This is not the only use case for RAG. Others include making sure newer, up-to-date information is available, if that’s a need. Or that customer-specific history is considered. Or that some industry-specific information is available. Or… whatever your needs might be to pile on top of a foundation model.
The point is, this is getting increasingly easy to do for even simple needs or wants. And it can be done on top of existing models. Or you can use free, open source foundation models, which for many purposes may be plenty good enough. For example, even though language drifts over time, you don’t necessarily need a constantly updated foundation model. If you’re using LLMs primarily for their core benefit of highly advanced natural language processing, you don’t necessarily need the latest variant of the most expensive paid options.
All of these elements along the Machine Learning Operations (MLOps) workflow are somewhat composable. This is one of the reasons we’ll likely see increasing use cases that merge crypto blockchains into agentic workflows. In order to operate, agents will need to interact with other agents. They’ll need to verify the provenance of other agents as well as have a means to transact. And that’s likely better done via some form of tokenomics. (A token-based economy.)
But, that’s a whole other topic in itself.