Watch the episode 👇
More resources
You can browse a public version of the Braintrust eval on the Coda help center content here: https://www.braintrustdata.com/app/braintrustdata.com/p/coda-help-desk
Here is the open source repository for the evaluation code and a simple Next.js playground: https://github.com/braintrustdata/braintrust-examples/tree/main/help-docs/js
About the show
Check out more at www.bitsandbots.ai, or find the show on YouTube, Spotify, Apple, or your favorite podcast spot. We’re always eager for ideas and feedback, so please get in touch!
Episode transcript
David: Today, Ankur and I are doing a deep dive on RAG: retrieval-augmented generation. Here's a one-minute teaser of the episode.
Retrieval-augmented generation is taking a query, producing something to search with, looking at a database, finding relevant things, and then putting it into a chatty response. One of my personal favorite things is you could have a primary source that you could link to in the response. You could say, here's where you found that thing.
This is actually OpenAI's memory of Coda. I think it'll be really useful to just baseline how well this works. We have a really nice data set of several hundred questions and answers about Coda, as well as exactly which documents those questions came from. The simple eval is just asking the model, with this prompt, the question, and then this generates an answer.
And it's comparing that answer to the original answer using a scoring method called factuality. Seems like it got a lot [00:01:00] of these questions mostly right. There are so many layers where there are models, including the grading itself.
David: Hello and welcome to Bits and Bots. Today, Ankur and I are super excited to run through a very real RAG scenario. What is RAG?
Ankur: It stands for retrieval-augmented generation, and the definition has changed every month for the past year.
David: What was the first definition?
Ankur: When I first heard the term, it wasn't called RAG; it was just called retrieval-augmented generation, and it was referring to a bunch of hardcore, cool research. I think there were a bunch of papers. The most popular was a paper from DeepMind called RETRO, which was about providing a model with additional context: while it was running the attention mechanism, it could retrieve embeddings inside of the model itself. And the original idea was, can we reduce the number of parameters that we need [00:02:00] to train a model on if we externalize the knowledge, and still have it perform as well as a model with far more parameters? The original idea was like, hey, can we get GPT-3-level performance with a model that's much smaller if it externalizes the knowledge and still learns how to do reasoning as well? And the RETRO paper was really compelling. Maybe it has actually made its way into the real world, and Google or OpenAI or someone is doing something like that.
David: It's been super interesting to see how many people have gotten excited about this new riff on RAG. One of the reasons I got excited about it was, for a long while, people thought in the enterprise space, the way you would have a model that really deeply knew about you was you would fine-tune it on your data.
Ankur: Right.
David: I was really excited about that idea. Fine-tune on my data, know all my things. But as you get into thinking more deeply about it, you realize actually everyone has slightly different data, and a tiny percentage is data that truly everyone in your organization has access to.
Maybe there's your one-on-one [00:03:00] docs, those are private to two people.
There's your team docs. Maybe everyone can't have access to that. So while there is some sort of general organizational terminology and memory, actually access controls are really important. So this idea of, can you have an LLM that is conversing with a user, but is able to go and search for things that they specifically have access to, is super interesting.
Ankur: Yeah, for sure. I got interested for a slightly different reason, which is a more nerdy reason, I think. I think RAG as it's currently framed is way more debuggable than fine-tuning, and even the other definitions of RAG, and that's really powerful. I sometimes think about how these really large models are the new fabbed chip, and I think we thought fine-tuning would be more like writing a program that modifies what instructions run on the chip. But it seems increasingly like fine-tuning could be more like [00:04:00] getting yourself a custom chip, and you don't have a custom Coda chip.
Maybe you do have custom Coda hardware. I would hope not, or imagine not. I think it's still very much in flux, and a lot of people are exploring fine-tuning, and I'm optimistic that a good chunk of them will find value. And, not to get too far ahead of ourselves, but after this episode, I think we're gonna try to explore fine-tuning literally with this use case and see what happens.
David: Yeah, super excited to compare fine-tuning with retrieval-augmented generation. Maybe before we jump into how to implement it, do you mind if I share a quick motivating example?
Ankur: Please.
David: So for viewers who haven't seen it before, this is Coda. Coda is an amazing document tool, like a Google Doc on steroids.
And we have this Coda AI feature, which is sort of like ChatGPT but knows about your work. And one of the challenges we see on this is you can ask it general knowledge questions, maybe even about Coda. You could say, how large of an attachment can I upload [00:05:00] using Coda? And it'll say something like this: the max attachment size is a hundred megabytes. The challenge is if you go and look at our help center,
you'll notice here it says the max single attachment size is actually 10 megabytes. And actually, that's a typo there, that should be MB.
And so it's just completely hallucinating this, and it's hard to know why it said a hundred megabytes. There's no attribution, there's no hyperlink, there's no source. It's learned this at some point. It's unclear when it learned this information, when was the last time it crawled Coda's help center or other pieces of information.
Man, wouldn't it be nice to have a primary source, have it not hallucinate, have a bunch of places I could read more about it and find some images. And that's one of the things that retrieval-augmented generation really helps with.
Ankur: Yeah, I was just checking the sample dataset we're using. It does not have the attachment size thing, but I'll find something, I'll find something similarly precise.
David: Cool. Well, Ankur, whenever you're ready, do you wanna take us through an example of how to do retrieval [00:06:00] augmented generation?
Ankur: Yes.
So we're gonna use a really simple TypeScript app. This is just a wrapper around some basic stuff.
Well, this is actually OpenAI's memory of Coda. So currently it's prompted with just a really simple prompt: you're a help desk assistant for Coda, answer the following question, and then the question. I've set it up this way for a few reasons. One, it's really valuable to be able to just play with stuff like this. Two, I'm on a mission now to make all of our cool content and demos and this kind of stuff in TypeScript. Maybe surprisingly, almost 90% of our users are using TypeScript, not Python, but it seems like most of the content out there along these lines is still in Python. There are certain things that are harder about TypeScript, but actually a lot of things that are easier. Either way, I'm gonna put a lot of effort, through these sessions and stuff that we do with [00:07:00] Braintrust, into expanding the universe of TypeScript sample code out there. With that said, this is a really simple app setup. It's using Vercel AI. Thank you, friends at Vercel. It's really convenient. A really simple
Next.js web app that just lets you type a question here. Before we get too far down the RAG side of things, I think it'll be really useful to just baseline how well this works. So I have a really simple script here, simple-eval.ts. This is Braintrust code, and essentially what we do, and I'll jump into the script in a second, is we get a bunch of QA pairs, and for now we just test 20 of them, and then run exactly that function with OpenAI's chat library and use a factuality scoring method to see how well it performs. So let's just dig into what that means really quickly. If I go to build-data, it's basically a script [00:08:00] that is downloading the Coda help center docs. Thanks to Kenny Wong at Coda for setting this all up for us; it made this much more straightforward. So we download the docs, which are HTML. Then we use a library to convert them into markdown and break them down into sections. And then we do this really cool trick, which is useful in question answering in general, but I think specifically in RAG. What we do is basically go through the snippets of sections and ask, in this case, GPT-3.5 to generate question-answer pairs for each section. Now, the reason that's so powerful is that it's a significantly easier task for the model to look at one section of Coda's help center docs and generate good questions and answers than it is to throw a whole help center document or a bunch of [00:09:00] documents at a model and ask it to come up with questions and answers.
And so we're basically cheating so that we can generate good question-answer pairs, and David and I didn't have to spend hours handwriting them from scratch. I think this is just, in general, a really useful technique in AI and ML problems to get good data that you can use to test, fine-tune, and so on.
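For readers following along, here's a minimal sketch of what that build-data step could look like in TypeScript. It assumes the help center pages are fetched as HTML, that the turndown npm package does the HTML-to-markdown conversion, and that sections are separated by horizontal rules; the exact library and delimiter in the repo may differ.

```typescript
// Sketch: download a help center page, convert HTML to markdown, and
// split it into sections on horizontal rules ("---").
// Assumes Node 18+ (global fetch) and the "turndown" npm package.
import TurndownService from "turndown";

const turndown = new TurndownService();

export async function fetchSections(url: string): Promise<string[]> {
  const html = await (await fetch(url)).text();
  const markdown = turndown.turndown(html);
  // Each run of three-or-more dashes on its own line marks a section break.
  return markdown
    .split(/\n-{3,}\s*\n/)
    .map((section) => section.trim())
    .filter((section) => section.length > 0);
}
```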
David: And so you could think about this as sort of data augmentation. We're going from one data set, you're changing the shape and making a few variations of it. Do you actually have an example of the before and after?
Ankur: Yeah, for sure. So basically, what we're doing is taking these documents and breaking them down into sections.
David: So this is the HTML version of that same help center article that I was showing earlier.
Ankur: Yeah, this is markdown, but yeah.
David: Oh, thanks. Yep.
Ankur: Then we're breaking this into [00:10:00] sections.
So each time there's a dash-dash-dash, we split things out. And then what we're doing is we're basically giving the model, well, let's see if we can actually look at a side by side. We're taking these sections and then basically saying, hey, GPT-3.5, given this section, can you generate a bunch of questions and answers just on this section? And the idea is it's really easy for a model to just look at this text and generate question-answer pairs. So if we look over here, here are some of the QA pairs generated for this section. This is a really easy task for a model, 'cause it's literally just looking at these characters of text, like this much text, and trying to generate questions and [00:11:00] answers. But what we're gonna do is use these questions in many different ways, with much harder tasks for the model. So the first thing we're gonna do is actually use these questions and not give the model any context, and see if it can recover something that's similar to this answer without having this green stuff to help it.
Does that make sense?
David: Sounds great.
Ankur: Great. So yeah, this is basically the data munging code that does that. I'll allude to a few tricks here, and we're gonna publish all this code, so please look at it and, you know, rip it to shreds and so on. We're using function calling here because the data structure that we're asking the model to generate is actually kind of subtle. We're asking it, for each section, to generate eight question-answer pairs. For each pair, we want it to come up with two different phrasings of the question, for one answer. And so [00:12:00] what we're able to do with function calling is actually define a schema that makes it so that even GPT-3.5 can pretty reliably output question-answer pairs in that format. The reason we do tricks like this, like two questions with the same answer, is that eventually we will be able to use that when we do fine-tuning to have different cuts of the data, so we can test for things like overfitting and have a train and test split that we know covers the same bodies of content but doesn't have overlapping question text and stuff like that.
And so we're gonna use a little bit of that today. It's set up really well so that we can use this data in a variety of ways. But function calling works out to be a really effective technique to generate rich data structures like this, which you can use to do more interesting things with the data. And the last thing that we do is just kind of organize this into this format. We [00:13:00] figure out whether it's in the train or test split, we give it a bunch of IDs so you know which document it's part of, which section it's part of, and so on. And so, you know, I ran that ahead of time. It takes just a few minutes to run, and then it generates all these QA pairs, and now we have a really nice data set of several hundred questions and answers about Coda, as well as exactly which documents those questions came from.
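As a rough illustration of the function calling trick Ankur describes, a sketch like the following could work. The tool name and prompt wording are assumptions; the shape of the schema (eight pairs per section, two phrasings per answer) follows the episode.

```typescript
// Sketch: ask GPT-3.5 for structured question-answer pairs via function
// calling. The schema forces pairs with two phrasings of the question and
// a single answer. Assumes the "openai" npm package (v4).
import OpenAI from "openai";

const openai = new OpenAI();

export async function generateQAPairs(section: string) {
  const completion = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [
      {
        role: "user",
        content:
          "Generate 8 question-answer pairs covering the following help " +
          "center section. For each pair, provide two different phrasings " +
          `of the question.\n\n${section}`,
      },
    ],
    tools: [
      {
        type: "function",
        function: {
          name: "record_qa_pairs", // hypothetical tool name
          description: "Record question-answer pairs for a section.",
          parameters: {
            type: "object",
            properties: {
              pairs: {
                type: "array",
                items: {
                  type: "object",
                  properties: {
                    questions: {
                      type: "array",
                      items: { type: "string" },
                      description: "Two phrasings of the same question.",
                    },
                    answer: { type: "string" },
                  },
                  required: ["questions", "answer"],
                },
              },
            },
            required: ["pairs"],
          },
        },
      },
    ],
    // Force the model to call the function so the output is always structured.
    tool_choice: { type: "function", function: { name: "record_qa_pairs" } },
  });

  const call = completion.choices[0].message.tool_calls?.[0];
  return call ? JSON.parse(call.function.arguments).pairs : [];
}
```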
So now that we have this set up, let's go and actually just run this eval thing and get a baseline. So what we're gonna do is run simple eval, which is doing no RAG. We're basically seeing how well OpenAI has memorized information about Coda.
So the simple eval is just asking the model, with this prompt, the question, and then this generates an answer, and it's comparing that answer to the original answer using [00:14:00] a scoring method called Factuality. Factuality is one of many different scoring tools that come with Braintrust in our open source library called autoevals. Basically, it's a model-based scoring method which looks at the original question, the expected answer, and the output answer, and it comes up with a score that tells you how well the new answer compares to the original answer.
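The no-RAG baseline boils down to something like this sketch, using Braintrust's Eval() harness and the Factuality scorer from autoevals. The project name, data file, and prompt are placeholders, not the episode's exact code.

```typescript
// Sketch: baseline eval with no retrieval. Each QA pair becomes an eval
// case; the task asks the model the question with a bare prompt, and
// Factuality compares the output against the expected answer.
import fs from "fs";
import OpenAI from "openai";
import { Eval } from "braintrust";
import { Factuality } from "autoevals";

const openai = new OpenAI();

// Assumed shape and location of the data from the build-data step.
const qaPairs: { question: string; answer: string }[] = JSON.parse(
  fs.readFileSync("qa_pairs.json", "utf-8")
);

Eval("coda-help-desk", {
  data: () =>
    qaPairs.slice(0, 20).map(({ question, answer }) => ({
      input: question,
      expected: answer,
    })),
  task: async (input: string) => {
    const completion = await openai.chat.completions.create({
      model: "gpt-3.5-turbo",
      messages: [
        { role: "system", content: "You are a help desk assistant for Coda." },
        { role: "user", content: `Answer the following question: ${input}` },
      ],
    });
    return completion.choices[0].message.content ?? "";
  },
  scores: [Factuality],
});
```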
David: Very cool.
Ankur: Cool. Not too bad. So it seems like it got a lot of these questions mostly right, and some of them it got totally wrong. You can actually go in here and see all the gory details. We'll make this public so anyone can take a look at these. And here's, for example, one of the questions. This is what the model generated, and this is the sort of expected response. One thing that we're not doing here, by the way, is grading the [00:15:00] outputs in terms of stuff like their length. I think those are the kinds of things that academic benchmarks often miss. But as you know, and as you've optimized really well in Coda AI, for example, it's really important to make the results not only factually correct but ergonomic for people to use.
David: Yeah, so just reading that one, it looks like it has basically some steps on how to create a document and share it, when the question was how do you star a document. And then the expected one was totally different, so it just totally ignored that part of the question.
Ankur: Yeah. And it seems like the model sort of arrived at the same conclusion. So here's its explanation for why it gave it a 0% score. Basically, it's looking step by step at what the new answer was and saying none of these things match the expert answer. If we look at one, I think this one probably had a higher score.
So this one, the model had a hundred percent. You wanna take a quick look at this?
David: Yeah. So, how [00:16:00] are starred docs different from pinned docs? And its output had a few ways they contrast from each other: here's what starred docs do, here's what pinned docs do. In summary, starred docs are personal to each user, pinned docs are visible to all collaborators. And then the expected answer was one sentence that seems to communicate a similar thing. So we got it factually right, but yeah, definitely a little more verbose.
Ankur: Yep. And here's the explanation. So yeah, we won't go too far in depth here, but I think it's really important to establish a baseline like this, even if your product isn't as popular as Coda, because in a second here, we'll be able to use it to see how RAG compares. Okay, great. So now let's do some RAG. The first thing I'll say, jumping into this, is that RAG generally involves a few different steps when we build the data here. In the other one, we just loaded those question-answer pairs; here, we're actually gonna do some more interesting stuff. We're gonna split the [00:17:00] content back into these sections. Then we're gonna embed each section, and then we're going to create the world's dumbest vector database. Vectors here is literally just an array of these objects, and the vector is an array of numbers. I really, really didn't want to do this.
Like, I tried to use three different vector database libraries while preparing for this, but most of them are still broken, and none of them actually really support this TypeScript environment well. I mean, there are some APIs and stuff that you can use, but I didn't want to use that here, and so I just hand-implemented the vector search, which sounds much harder than it actually is. Big shout-out to compute-cosine-similarity, which could use some love. I think it doesn't have that many [00:18:00] weekly downloads or stars. Actually, let me go and star it while we're here. Thank you. But you know, this stuff isn't that hard, and I think it's actually quite valuable to play around with implementing stuff at a lower level like this.
Compared to the vector database code I had here an hour or so ago, this actually ended up being a lot less code.
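In code, the "world's dumbest vector database" amounts to something like this: embed every section, keep the vectors in a plain array, and brute-force cosine similarity at query time. compute-cosine-similarity is the npm package Ankur mentions; the embedding model name is an assumption.

```typescript
// Sketch: embed each section and do brute-force nearest-neighbor search
// with cosine similarity. No vector database required.
import OpenAI from "openai";
import similarity from "compute-cosine-similarity";

const openai = new OpenAI();

export type Entry = { section: string; vector: number[] };

export async function embed(text: string): Promise<number[]> {
  const res = await openai.embeddings.create({
    model: "text-embedding-ada-002", // assumed model
    input: text,
  });
  return res.data[0].embedding;
}

export async function buildIndex(sections: string[]): Promise<Entry[]> {
  return Promise.all(
    sections.map(async (section) => ({
      section,
      vector: await embed(section),
    }))
  );
}

export async function search(index: Entry[], query: string, topK = 2) {
  const queryVector = await embed(query);
  return index
    .map((entry) => ({
      ...entry,
      // Cosine similarity: 1 for identical directions, 0 for orthogonal.
      score: similarity(queryVector, entry.vector) ?? 0,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```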
David: And for folks who aren't familiar, vectors are a numeric representation, and embeddings usually mean you run input through some part of a model to get into a sort of deep vector space.
And the idea is that it encodes some deeper idea of what your text is. And so once you have this vector, you can throw it in a database, find similar vectors, do other operations with it. You don't actually have to do RAG over vectors; you could do it over plain text search if you want.
The idea is, you're just doing a search as part of your operation. But embeddings are one powerful way to [00:19:00] represent a concept, as opposed to just the raw string.
Ankur: For sure. And I would just add that embeddings are surprisingly easy to use, flawed in many ways, but much easier to use than often advertised. Okay, so here we're just computing the embeddings, putting them in an array, and then we're returning the same pairs. So when we actually run the task, now we're gonna do some more interesting stuff. First, we're gonna compute an embedding of the input query, so we'll be able to visually follow this example a little bit more. And then what we do is a vector search. So this is what you were just talking about: we don't have to do a vector search, but we could, and it makes it much easier to scale to more content, and in many ways vectors are more robust. And so here we do that in just a very simple way. We compute the similarity score, looking at the vector [00:20:00] of the input query and each embedding itself. The similarity score is just a really simple computation called cosine distance, which compares the distance between two vectors: if they're the same, it's one; if they're orthogonal, it's zero. And then we sort by the similarity score, and we select the top several. In this case, we're gonna select three. And just to help audit the work that we're doing, we're actually gonna compute another score called the relevance score. This is basically also going to use a prompt: it looks at the question and the document and asks the model to estimate how relevant the question is to the document. And again, we're gonna use function calling, in this case because it helps the model generate more sort of continuous numbers between zero and one. So we'll actually get a score for each document, and that becomes a really [00:21:00] useful way to sift the best vector searches from the worst vector searches, and debug, if we got a low factuality score, why, and whether that had to do with the vector search process itself. Then we're gonna compute some scores about the relevance, so it'll be useful to see in general how well the vector search performs relative to the other scores, kind of at an aggregate level, and then log some stuff. And then once we do that, it's really simple. We have the context, which is the sections returned from the vector search, and we run a very slightly more complex prompt over here, where the assistant thing is the same, and here we say, given the following context, answer the following question. So, pretty much the same as above, except now with the context of whatever was returned by the vector search.
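Putting the pieces together, the RAG task looks roughly like this sketch, reusing the embed/search helpers from the snippet above; the module path and prompt wording are approximations, not the demo's actual code.

```typescript
// Sketch: retrieval-augmented answer. Embed the query, pull the top-K
// sections, and stuff them into the prompt as context.
import OpenAI from "openai";
import { Entry, search } from "./vector-search"; // helpers from the previous sketch

const openai = new OpenAI();

export async function answerWithRAG(index: Entry[], question: string) {
  const hits = await search(index, question, 2);
  const context = hits.map((hit) => hit.section).join("\n\n---\n\n");

  const completion = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [
      { role: "system", content: "You are a help desk assistant for Coda." },
      {
        role: "user",
        content:
          `Given the following context:\n\n${context}\n\n` +
          `Answer the following question: ${question}`,
      },
    ],
  });
  return completion.choices[0].message.content ?? "";
}
```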
David: Very cool. So the prompt is basically, here's a dump of some stuff you found. Answer the original question.
Ankur: Exactly. [00:22:00] And let's see how it did. Awesome.
David: 63%, so we're up 3%.
Ankur: Yeah, so it did a little bit better. I'm sure there's a lot more we could improve here. But let's actually look at one of these things to understand in depth what's going on. And by the way, just a quick plug: Braintrust makes it really easy for you to actually look at examples that got better and examples that got worse.
So actually, let's look at one that got better. This one went from 0% to 60%. So let's try to understand what happened here.
David: This is the one where it just had seven random steps for creating a document.
Ankur: Yeah. So let's look at what the new answer is. It's the green thing: to star a document in Coda, you can follow these steps. Hover over the document, click the star icon...
David: Oh wow. I hadn't noticed that originally, but notice in red, on the first line, it says "to start [00:23:00] a document in Coda." That's where the original run messed up: it thought you asked how do you start a document, instead of star.
Ankur: That's really interesting. Yeah, I wonder if it just misinterpreted the question. I wish there were a way to know. So yeah, let's understand. Basically, it's dinging this answer very slightly because the returned answer is a superset of the expected answer. This is one of those nuances in how you actually compute scores; it can also be valuable to use these experiments as a way to figure out whether your scoring method is bogus or not.
In this case, I would say it should probably get a hundred percent score instead of 60%, but, you know, it's fine. And let's actually look at the documents that it retrieved. I'm just gonna turn the diff mode off for a second. But let's look at the documents that it [00:24:00] retrieved, to see if they were actually relevant to answering the question. The question is, what is the process to star a document? And these are the documents, these are the sections of markdown, that got returned by the vector search. It seems pretty relevant, right?
David: Yeah, that looks good.
Ankur: Cool. By the way, one thing that I always do when I'm building RAG stuff (this is gonna be really RAG-jargony for a second, so please forgive me if this doesn't make sense): we set the value of top-K to be two. What that means is, when we run the completion itself, we give two documents in the context. However, if you notice here, there are actually three documents that we logged. The reason I did that is so that we could look at the top two documents and the document that came after them, to see if the third [00:25:00] document was better than the second document. And what that tells us, if the relevance score is to be trusted, is that even though the similarity score with the embedding for the second document was higher than the similarity score for the third document (although obviously not by much; it looks like 0.3%), the third document worked out to be more relevant. And sometimes people actually solve these problems by re-ranking the documents during the RAG process, so that they don't fully rely on embeddings, because embeddings are wonderful but really hacky. And so what someone might do is actually compute the relevance score, like I just did, over the first five documents, then take the top two documents with the highest relevance scores, and then use those in the RAG thing itself.
And so, you know, again, we're not doing that here. [00:26:00] It's useful to track metrics and information that help you potentially arrive at that as an idea, but I'm just kind of alluding to it because it sort of hints at the insane level of complexity that can go into an application like this, and, I think, at the rich set of engineering trade-offs that are available when you're building one of these applications.
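The re-ranking idea Ankur alludes to could be sketched like this: over-fetch with the embedding search, score each candidate's relevance with a model (function calling nudges it toward a continuous 0-to-1 number), and keep the top two. The prompt, tool name, and schema here are assumptions, not the episode's actual code.

```typescript
// Sketch: model-based re-ranking over the top-N embedding hits.
import OpenAI from "openai";
import { Entry, search } from "./vector-search"; // helpers from the earlier sketch

const openai = new OpenAI();

async function relevanceScore(question: string, document: string) {
  const completion = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [
      {
        role: "user",
        content:
          `How relevant is this document to the question?\n\n` +
          `Question: ${question}\n\nDocument:\n${document}`,
      },
    ],
    tools: [
      {
        type: "function",
        function: {
          name: "record_relevance", // hypothetical tool name
          parameters: {
            type: "object",
            properties: {
              relevance: {
                type: "number",
                description: "0 = unrelated, 1 = directly answers the question",
              },
            },
            required: ["relevance"],
          },
        },
      },
    ],
    tool_choice: { type: "function", function: { name: "record_relevance" } },
  });
  const call = completion.choices[0].message.tool_calls?.[0];
  return call ? (JSON.parse(call.function.arguments).relevance as number) : 0;
}

export async function rerank(index: Entry[], question: string, widen = 5, topK = 2) {
  // Over-fetch with the cheap embedding search, then re-rank with the model.
  const candidates = await search(index, question, widen);
  const scored = await Promise.all(
    candidates.map(async (candidate) => ({
      ...candidate,
      relevance: await relevanceScore(question, candidate.section),
    }))
  );
  return scored.sort((a, b) => b.relevance - a.relevance).slice(0, topK);
}
```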
David: Very cool. Should we look at one of the cases where it got worse?
Ankur: Let's do it. Let's do one that's kind of extreme. So, I actually don't trust these ones that went from a hundred to 60, because I think 60... let's actually just quickly look at this. Yeah. So, (B), the submitted answer is a superset of the expert answer, which is 0.6. I feel like this may not be the fairest [00:27:00] criterion for this particular problem. And so let's not look at one of the ones that went from a hundred to 60. What does 40 mean? 40 means the submitted answer is a subset.
So yeah, let's look at this one. In this one, the old answer was a superset, it got 60%, and the new answer is a subset. So the question is, where do starred docs appear in your doc list? My Shortcuts. Yeah, this is crazy.
David: Oh,
interesting. So the new, the new answer basically doesn't say, uh, so the new answer is actually correct,
but it doesn't at the top, which
the.
Ankur: So models can be wrong too, when they're evaluating things. I think this is a good example of that. If I were encountering this in the real world, the next thing I would do, especially over this limited set, is actually rerun this evaluation with GPT-4 as the [00:28:00] grading model and see if it does a better job at evaluating some of these things. But I think this is pretty emblematic of what happens when you're working on these apps: there are so many layers where there are models, including the grading itself, that you have to actually audit stuff like this to get a pretty reasonable sense of whether what you're seeing is real or not.
David: One thing I'd also add, for product managers and designers: I know sometimes looking at this stuff can feel a little overwhelming, where it's like, you're not a data scientist. But you know, if you go back up to the top of this page, every change in quality and model will be two steps forward, one step backward, three steps sideways.
It's almost never that you get a unilateral win on quality when you make changes. And so I think one of the really interesting product questions is what actually matters: [00:29:00] which set of prompts and scenarios need to be good, and which ones are okay to be a little bit worse. And those are pretty significant choices.
And so I always find it super helpful to go and look at examples like this and to basically make some calls. Like, we're okay with it being a little less comprehensive; the thing we want is for it to always be right and not hallucinate. Or to be like, no, we'd actually rather it be more elaborate, even if it fails some of the time.
And those are, you know, hard trade-offs that you make in a tool like this.
Ankur: For sure. I think the engineering version of that, I would say, is that as engineers, we should always be paranoid about certain things, and I would always be paranoid about aggregate metrics. In fact, I never trust aggregate metrics anymore when it comes to AI or ML stuff. I think you should only use aggregate metrics as a compass to help you figure out what are some examples that are worth looking at, so that you can build data intuition or product [00:30:00] intuition about what's actually happening. But yeah, I totally agree. I really like the framing about a few steps backwards, forwards, and sideways.
David: Awesome. Super cool. Thank you, Ankur, for putting this together and walking through it. Thank you again to Kenny Wong for collecting this dataset on the Coda help center. It's amazing. I feel like retrieval-augmented generation sounds really complicated, but when you get down to it, it's actually really simple: taking a query, producing something to search with, whether it's embeddings or keywords, looking at a database, finding relevant things, and then putting it into a chatty response. It's super cool that even on a small data set, in a toy setup like this, you can see cases where it's already starting to improve things, and what the path to reliability looks like.
Oh, and one thing we forgot to show was the backlink piece. One of my personal favorite things about retrieval-augmented generation is you could have, not just a paragraph as input, but a webpage and a primary source that you could link to in the [00:31:00] response. You could say,
here's where you found that thing.
So even if you get something that is not a superset, that is a subset of the answer, you know where to go to find a reliable, full answer.
Ankur: Yeah. And actually, even though that's not in the sort of UI in the one that we built, you'll notice that we did actually log all of that sort of raw information.
David: Awesome. Cool. Any other closing thoughts? Remarks?
Ankur: I'll just say one more time: I encourage those of us who use TypeScript a lot to contribute more into the open realm of TypeScript. Pretty much everything here I had to build from scratch, including literally the vector search and stuff. And so I'm looking forward to there being, this time next year, a really, really rich ecosystem of AI tooling in TypeScript.
David: Very cool, and hopefully improved with this set of examples. So we'll have links in the description of the episode to both the Braintrust eval page, where you can go and look at these examples yourself, as well as [00:32:00] the TypeScript repo, where you can try this out on your own. Thanks. See you next week.