Is RAG Dead? Lessons from Building AI for Tax Law with Alex Bowcut - #769
The TWIML AI Podcast · 2026-06-09 · 52 min
Substance score
62 / 100
Five dimensions, 20 points each
What our scoring noted
Our reviewer’s read on each dimension, with quotes from the episode.
Insight Density
The episode contains genuine technical depth on multi-step retrieval pipelines, LLM-as-reranker, and RFT with human expert feedback as training signal. However, a significant portion is domain setup and background, diluting the per-minute yield of novel ideas.
we have a multi step process to essentially build up the relevant context for that query before we eventually send it off to like the final reasoning model
we don't need tax experts that speak every language. Um, you know, LLMs great, you know, notoriously great translators
Originality
The episode offers a domain-grounded rebuttal to 'RAG is dead' and a credible insight that reasoning models better utilize full context windows, but most of the architectural choices (dense+sparse, semantic chunking, LLM reranking) are established best practices rather than contrarian or first-principles thinking.
I think reasoning models are much more capable of reasoning over their full context, whereas non reasoning models, yeah, you got real degradation
LLMs compared to lawyers, like human, you know, tax lawyers are considerably cheaper, even the most expensive LLMs
Guest Caliber
Alex Bowcut is a legitimate technical practitioner who built and runs a production AI system at a Series A company backed by a16z, with a credible engineering background including CUDA kernel work and a prior startup acquired by Bain. Not a thought-leader, but not a marquee operator either.
we did our series A last year from Andreessen Horowitz
we saw uh, improvements with uh, during the Alpha program with OpenAI on RFT
Specificity & Evidence
The episode includes specific metrics (10-second review cycles, two orders of magnitude speedup), named customers (Lovable, Replit), a concrete jurisdiction example (Manitoba taxing SaaS from January 1, 2026), and named tooling (Pinecone, OpenAI embeddings). However, accuracy improvement figures from RFT and retrieval changes are consistently described only as 'improvements' without any numbers.
Manitoba in Canada. Um, they changed, they began to tax SaaS at the beginning of 2026. That was flagged very early in our system
right now that takes them around 10 seconds, 9 seconds to review each of them on average
Conversational Craft
Sam asks clarifying technical questions that usefully tease out architecture details (dense vs sparse distinction, confidence termination logic, undocumented model drift), but he rarely challenges quantitative claims or pushes on failure modes. The 'RAG is dead' framing is raised but not probed further when the guest gives a measured non-answer.
When you refer to dense and sparse, are you. It sounds like you're talking about embeddings versus full text search as opposed to like two tiers of embeddings or something like that.
You continue until you reach a certain level of confidence. Is that based on an LLM-as-judge type of scenario?
Conversation analysis
Computed from the transcript - who did the talking, and the verbal tics along the way.
Share of words spoken
- Speaker C: 82%
- Speaker B: 16%
- Speaker A: 2%
Episode notes
As context windows grow into the millions of tokens, many AI practitioners are questioning whether retrieval-augmented generation (RAG) is still necessary. If modern models can ingest entire libraries of documents, why bother with retrieval at all? In this episode, Alex Bowcut, Head of Engineering at Sphere, explains why the answer depends on the application. Sphere uses AI to automate global tax compliance - an environment where getting the answer right isn’t enough. Every conclusion must be backed by the correct legal citation, and every decision must withstand expert review. We explore how Sphere built TRAM (Tax Review and Assessment Model), a production AI system that combines retrieval, reasoning models, legal review workflows, reinforcement learning, and deterministic systems to help tax experts move nearly two orders of magnitude faster while maintaining accuracy.
Full transcript
52 min · Transcribed and scored by The B2B Podcast Index.
Speaker A: As context windows get larger and larger, one question that keeps coming up is whether retrieval-augmented generation, or RAG, is becoming obsolete. If models can ingest millions of tokens of context and reason over enormous collections of documents, why bother with retrieval at all? The answer, it turns out, depends a lot on the application. I recently sat down with Alex Bowcut, Head of Engineering at Sphere, which builds AI systems for sales tax automation and compliance, exactly the kind of domain where getting the right answer isn't enough. You also need to know where it came from. I asked him this simple question.
Speaker B: What's your take on the whole "RAG is dead" argument that some folks make?
Speaker C: I think for some use cases it's certainly true. I think for us, or at least for this particular problem, because we are so sensitive to accuracy and we're so sensitive to the exact right citation, as of today I don't think, you know, agents just searching over the file system, grepping over it, is at a point where we could switch over and not lose accuracy.
Speaker A: I'm, um, Sam Charrington and this is the TWIML AI podcast. For over a decade I've been exploring the ideas and innovations shaping the future of AI through conversations like this one that help you understand what's real, what's next, and what matters. Let's jump in.
Speaker C: A little bit about Sphere briefly, so this makes a little more sense. Sphere is a revenue-based compliance company. So we help companies with all of their revenue-based compliance needs. The main one of those is sales tax in the US, and internationally that's called VAT or GST. Um, and, you know, there's other companies in this space, of course. There's some big companies that have been around for quite a while.
Speaker B: Tax is not a new problem.
Speaker C: It's, it's not unfortunately for companies and for consumers, I suppose. Um, so, uh, this isn't a new problem. The incumbents, they face a particular problem which is in order to support, you know, every jurisdiction in the U.S. because every, of course, every U.S. state has different rules. In some U.S. states, even the cities have different rules. Then internationally, of, of course, every country and potentially province has their own rules as well. And so the companies need, um, the incumbents need a way to understand how are products taxed in each of these different jurisdictions. Um, and the way that traditionally they've done these sorts of things is they've hired these massive teams of essentially tax lawyers. They're tax experts, they'll call them tax content teams. But what these tax lawyers are doing essentially is looking through the legislation in, you know, Alabama, for example, and understanding how does Alabama tax SaaS and even more specifically than that, how does Alabama tax SaaS that maybe has an API connection and has servers that are hosted within the state itself. So it gets very granular there. And this is a huge, you know, this takes a huge amount of human time to do. Um, and you all, there's also like the moving target of it, which is of course like legislation updates, um, that can happen at any moment. And so you have to constantly be updating and looking through the legislation again to see if anything has changed and updating your tax engine essentially to make sure that you're applying the correct treatment in all of the jurisdictions. And so that's been a huge inhibitor to growth from the incumbents. And the reason why most of the big incumbents have stayed in the U.S. um, because, you know, they've kind of tackled this problem in the US and to extend it internationally. You know, it's just like too much of a Herculean task for them to, you know, it's too much manpower, too big of teams to handle it. 
And Sphere has taken a very different approach. I think the time that Sphere was started as a company was obviously advantageous. We were started during, you know, the AI era. And so, you know, this is a very classic document-based problem. Just from a high level, you have legislation and court rulings and bulletins from departments of revenue. These are all just documents. And, you know, the answers of how products are taxed are found in these documents. And it's just a matter of finding the relevant passages, um, and understanding the relevant passages and then assigning a taxability. Right. How is this product taxed? And so in this new era that we're in, we looked at that problem and it was, you know, a problem that we thought was screaming to be solved by AI. So what we eventually built is what we call TRAM, which is the Tax Review and Assessment Model, which is like a system of a few different things that I'm sure we'll get into. But essentially its job is to supercharge our tax experts. So what we found is that TRAM allows our internal tax experts to move almost two orders of magnitude faster through this process, with fewer errors than the traditional, fully human-focused approach.
Speaker B: And did you work in tax prior to joining Sphere?
Speaker C: I did not. Um, so I've learned a lot about tax, and I think I was always like, you know, I would read U.S. Supreme Court rulings, kind of just for fun, out of interest in what's going on and how the legal world works. But no, I didn't come from like a traditional, you know, PwC tax background or anything like that.
Speaker B: Uh, what was your background?
Speaker C: Yeah, so a pure engineering background, mostly in startups. Before moving into startups, I started my career in the semiconductor industry, working on, uh, GPUs, writing CUDA kernels, um, and then eventually left to start a startup with a friend, which was like a web data collection startup. Um, and we worked with, um, investment banks and private equity firms and things like this to help them collect really massive amounts of web data at scale to power their internal analyses and signals. Um, and eventually that company got acquired by, uh, Bain & Company. Um, and so I was at Bain for a bit and then started another startup, and, uh, eventually Nick reached out, who's the founder of Sphere, Nick Rudder. Nick, uh, reached out about Sphere, and at the time it was just Nick in a little one-person office here in San Francisco. And, um, you know, looking back, maybe a little inadvisable, but he convinced me it was a good idea to leave the other place I was at and come and join him at Sphere and kind of chase this dream. Um, he had like this nascent idea of what would become TRAM. And, uh, yeah, things have gone quite well. We did our Series A last year from Andreessen Horowitz, um, and continue to grow at kind of a remarkable level.
Speaker B: Talk a little bit about the, the data landscape that you have to deal with. I'm imagining significant amount of complexity due to the global nature of it. And you know whenever I've talked to folks that are doing uh, collection of legal data it like always surprises me how much of that stuff is in like non friendly formats like you know, photo like image based PDFs or you know I talked to folks, it's been a little while but you know they had to go send people to go scan stuff like how crazy is it still?
Speaker C: Yeah, it's not great, it's not great. Um, and a lot of what we do is working with government systems obviously which are, you know, they can be archaic. There's some that are better than others but yeah, there's a lot that are, that are quite old. So some, you know, sometimes things are well structured. There are you know, HTML pages that we can go and collect the legislation. And that's great. Sometimes they're PDFs, um, and those, but they're well structured PDFs and so we can, we can parse those easily. And so it's always great when the sources are like that and then yeah, there's this long tail of, you know, PDFs, but they're just images. Right. So you need to OCR them and use other techniques like that or even spreadsheets or text documents or word documents. Um, these things are much more common than I think I had hoped when I started working on the issue. And so yeah, the beginning of our pipeline, like this data collection process accounts for each of these different file formats. Um, um, but yeah, not, not the funnest problem to solve and, and you know, trying to especially you know, like spreadsheets and things like this, quite difficult to try and like pull good information and like retain context when things are stored in a spreadsheet.
Speaker B: Let's take a step back and dig into how your users use TRAM. You mentioned, uh, you know, a near two orders of magnitude improvement in their process. What is that process and how are they using the system?
Speaker C: Yeah, so our end users, if you look at our website, folks like Lovable or Replit, uh, who are customers of Sphere, they're not direct users of TRAM. TRAM is an internal tool that has allowed Sphere to expand globally, even though we're a small company, uh, we're just a startup, and then also have higher accuracy. And so, um, the way that TRAM is used is by our tax experts. So there's a web app that the tax experts use where they go in and essentially just review the work that TRAM has done. So there'll be a queue of work that needs review. So for example, you know, TRAM has done what we call determinations, which is just, you know, this determining of whether a product is taxable or not in a particular jurisdiction, and some other features around there. So the tax expert will go in and say, oh, I see that California, uh, digital goods needs to be reviewed. And so there's a list of different types of digital goods and the model's output on, uh, whether each is taxable or not, some reasoning the model gives on why it came to that conclusion, and then also, importantly, uh, the citations that the model used in order to inform its decision there as well. And so the tax experts are able to look through really quickly. And you know, sometimes they do have to make adjustments. The model isn't 100% accurate, but essentially they review the model's outputs, they can leave feedback, things like that, and eventually they click submit, and the submission of that then puts those into our deterministic tax engine. So we have a tax engine that, for example, has a first-party integration with Stripe. So for our customers, you know, when you're checking out and buying something on the checkout page, we'll be calculating tax. And that part of it is deterministic. There's no AI there. The AI has been done upstream from there.
Speaker B: So when you're, these tax experts are, you know, working on a piece of work, like what's the impetus for that? Is it, you know, there's, you know, they are, you know, solving a problem for a customer and that like, you know, drives work or is the work all driven from, you know, some new, you know, ingest, ingested piece of data from a jurisdiction that the system says, oh, this might have a change in, you know, the deterministic engine.
Speaker C: So it would be twofold. So one thing, uh, that would be like the impetus for them to go in would be, um, we're expanding, uh, the products that we want to cover. So every tax engine has done this. Uh, you can't support every product type from day one. Right. You kind of have to break it down. And you know, we're going to start with clothing or we're going to start with SaaS. Sphere specifically, we started with like, uh, electronic services. So anything that's not tangible, essentially. So they would go through. We have this backlog, essentially. We're like, hey, we want to add support for all of intangible goods, which we've already done. Once we've done that, now we're like, okay, we want to add support for tangible goods. So we'll move through clothing and servers and things like this. So there's like this backlog, and that's kind of driven by customer demand. Right. If we want to sell to a clothing provider, of course we need to support clothing. Then the other thing would be updates in the law. So when, uh, you know, legislation is changed or new bulletins are posted or new case law is posted, then we'll scrape that data. It goes through our ingestion process, eventually ends up through the system, and if there's some action that needs to be taken, we'll make that recommendation to our tax experts. So an example might be last year, um, Maryland, or maybe a better example, Manitoba in Canada. Um, they changed, they began to tax SaaS at the beginning of 2026. That was flagged very early in our system. And so, uh, that was reviewed and essentially pushed into the deterministic tax engine with a start date of January 1st so that we were prepared well in advance.
And that's one of the nice things about having something like TRAM, this automated system. A lot of the time, traditionally, these things are done retroactively because you missed the update, and then, you know, you scramble to try and add it, but it's already too late. You know, it's past January 1st or whatever. So yeah, those would be the two different ways.
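The "start date" idea described above can be sketched as an effective-dated rule lookup. This is only an illustration of the general technique, not Sphere's engine: the `TaxRule` type, the `lookup` function, the `"CA-MB"` jurisdiction code, and the year-2000 placeholder date are all assumptions made for the example. Only the January 1, 2026 start date for Manitoba SaaS comes from the conversation.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TaxRule:
    jurisdiction: str     # hypothetical code, e.g. "CA-MB" for Manitoba
    product_type: str     # e.g. "saas"
    taxable: bool
    effective_from: date  # first day on which the rule applies


def lookup(rules: list[TaxRule], jurisdiction: str, product_type: str, on: date) -> Optional[TaxRule]:
    """Return the rule in force on the transaction date, or None if none applies."""
    candidates = [
        r for r in rules
        if r.jurisdiction == jurisdiction
        and r.product_type == product_type
        and r.effective_from <= on
    ]
    # Among rules already in effect, the most recently effective one wins.
    return max(candidates, key=lambda r: r.effective_from, default=None)


rules = [
    # Placeholder "before" rule; the year 2000 date is illustrative only.
    TaxRule("CA-MB", "saas", taxable=False, effective_from=date(2000, 1, 1)),
    # Reviewed ahead of time and loaded with a future start date.
    TaxRule("CA-MB", "saas", taxable=True, effective_from=date(2026, 1, 1)),
]

before = lookup(rules, "CA-MB", "saas", date(2025, 12, 31))
after = lookup(rules, "CA-MB", "saas", date(2026, 1, 1))
```

Because the lookup is a pure function of the rule table and the transaction date, a reviewed change can be loaded well before it takes effect without altering how earlier transactions are treated.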
Speaker B: Got it, got it. I remember getting a flurry of emails maybe, I don't know, nine months a year ago from like every SaaS vendor I use, talking about some big change in the way taxes were going to be calculated.
Speaker C: Yeah, I mean it's uh, especially on the SaaS front, you know, over the last four, five, six years, US states and international jurisdictions are just getting more, you know, they want to, they want a piece of the cake. Right? They want to share the pie. And so they're changing rules to, to tax SaaS and they're also changing lots of other rules. Um, there are certain international countries that require like real time reporting to the, to the Department of Revenue or the tax authorities. So like as you're transacting, they want a copy of it. And I think, you know, it's kind of like tax rates, like tax rates only go up. Right? They very infrequently will, will, uh, will like California reduce their sales tax rate. And I think how involved the tax authorities want to be in transactions and how much information they want that also only increases, it's not going to decrease. And so that's something sphere I think is, is kind of at the forefront as well are these other features outside of just sales tax that um, are only becoming more common. And a big part of that is SaaS kind of like you mentioned.
Speaker B: So these human experts, in some ways it's what it sounds like they're doing is data labeling. Did you think of it like that?
Speaker C: We think of it more as like a legal review. Um, so in some ways I think what we do is somewhat similar to someone like Harvey. So Harvey AI, that's like a legal AI, and you'll let Harvey draft like a first version of a brief, maybe. Um, but then a lawyer or someone at the firm will go in and they'll review that brief and they'll check it for correctness and things like that. They'll do a legal review. And I think that's how we view what our tax experts are doing. It's not necessarily data labeling, it's review of correctness from a legal perspective, because these are legal claims, essentially, that we're making. You know, we are claiming that, you know, SaaS is not taxable in Alabama or whatever.
Speaker A: Um, um.
Speaker C: So, yeah, I think it's more akin to a legal review.
Speaker B: And so talk a little bit about the kind of the pipeline in more detail. You, you ingest this information from, you know, lots of different jurisdictions, uh, presumably normalize it in some kind of way or at least try to extract the information out of those image PDFs. Uh, what happens next?
Speaker C: Yeah, once we have, you know, the text from the document, and ideally that's well structured, like HTML, and some PDFs will give that to you. So we try and preserve as much structure as possible. The next step would be, uh, for non-English legislation, we'll do an English translation. So that's been another big, uh, unlock, is that, you know, we don't need tax experts that speak every language. Um, you know, LLMs are notoriously great translators. And so they're happy to translate these documents. So we'll create an English translation, and that's kind of the starting point. Um, from there, you know, many of these documents are very long, so we can't just take the whole document and, you know, create an embedding for it and store that in a vector database. Or even with like a TF-IDF, um, full text search database, you might not want to do that either. So what we do is we break it up into smaller sections. And there's a naive way to do that, which is just like, um, every N characters you chop and then you create a new section. And that's obviously not ideal because you lose very relevant context. And again, these are legal documents, so they're well structured, typically, as long as it's not an image. Um, and so, you know, they come in sections and then subsections and bullets. And so what we try and do in our pipeline is semantically chunk things into essentially sensible chunks that cut at normal places, and then we still retain the hierarchy of where that chunk came from so that we can reproduce it later. Um, and then we also store metadata, of course, and things about where this document came from,
like the root document. Um, and then eventually we have these text chunks and we embed those, both dense and sparse, um, and we store them in a vector database, and that's eventually what we'll then query over when we go to actually make a determination. But I could probably talk for the next 60 minutes about this process of chunking. I think we spent a lot of time there, and it's a very important part of this process. I think if you do a naive implementation, you leave a lot of accuracy on the table, essentially.
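As a rough illustration of cutting at section boundaries while keeping the hierarchy, here is a minimal sketch. It assumes a hypothetical plain-text format in which headings are numbered like `1` or `1.1`; real legislation needs per-document-type parsers, as discussed later in the conversation. The `Chunk` type, the `chunk_by_sections` function, the sample text, and the URL are all made up for the example.

```python
import re
from dataclasses import dataclass

# Assumed heading convention: "1 Title", "1.1 Subtitle", and so on.
# Limitation: a body line that starts with a number would be misread as a heading.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")


@dataclass
class Chunk:
    path: list[str]  # headings from the top of the document down to this section
    text: str
    source_url: str  # link back to the root document


def chunk_by_sections(doc: str, source_url: str) -> list[Chunk]:
    chunks: list[Chunk] = []
    stack: list[str] = []  # current heading path
    body: list[str] = []   # body lines collected under the current heading

    def flush() -> None:
        text = " ".join(body).strip()
        if text:
            chunks.append(Chunk(path=list(stack), text=text, source_url=source_url))
        body.clear()

    for line in doc.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section before opening a new one
            depth = match.group(1).count(".") + 1
            del stack[depth - 1:]  # drop headings at this depth or deeper
            stack.append(f"{match.group(1)} {match.group(2)}")
        else:
            body.append(line)
    flush()
    return chunks


# Made-up sample text, not real legislation.
sample = """
1 Definitions
In this Act, the following terms apply.
1.1 Digital goods
Digital goods means software delivered electronically.
2 Imposition of tax
Tax is imposed on retail sales.
"""

chunks = chunk_by_sections(sample, source_url="https://example.com/act")
```

Each chunk carries its full heading path, so the section it came from can be reconstructed later, which a fixed every-N-characters split cannot offer.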
Speaker B: Yeah, I wouldn't mind having you dig into some of the work that you've done to kind of assess the lift on the semantic chunking. Um, and what you've seen there. I think as you alluded to, a lot of folks will pull a rag library off the shelf and it'll give you three or four ways to, you know, chunk number of characters and, and whatnot. And you know, folks will do one of that. But there's often ways to take advantage of the inherent structure and the information that you're trying to capture. Um, you know, how did you approach that? Was it just obvious that you know, hey, we're going to do this based on sections because it's a legal doc and they're right there, or did you like iterate on that for a while?
Speaker C: I think it was obvious that like, you know, when you look at one of these documents as a human, it's very obvious how like if you were going to break it up, how you would like to break it up. And so it was clear, I kind of, I guess what should happen. The question is how do you make that happen? In a way, you know, doing it for one document is easy. Doing it for, you know, the millions of different documents that we pulled in a way that's general generalizable is much more difficult. And so I guess the details there go into, you know, as, as we ingest these documents, we have a number of different buckets, you could call them, of different structures of legal documents that we have parsers. And a lot of these are LLM backed parsers, but like bespoke parsers that for that particular type of document. So we will either from like the metadata of the document or through an LLM call determine, hey, what, what is this kind of document? Is it a case law? Because, you know, a ruling from a judge will look different than like the legislation, the statute law, which will look different from like a bulletin, um, or Like a notice that the Department of Revenue releases. And so we have bespoke parsers for each of those. And some of those, yeah, they involve like, LLM tool calls. Um, some of them are fully just algorithmic because the structure is all there and it works fine enough. Um, um, but I think that was. That's where kind of the. The devil's in the details on it was. Yeah. As a human, you look at it, you know exactly what to do on each of these different examples. But how do you do it in a way. Yeah, where it's generalizable across all of, you know, across languages, across jurisdictions.
Speaker B: Can you dig into a little bit more detail on the. The dense versus sparse, um, aspect of what you're doing?
Speaker C: We started with just a dense representation, um, which felt correct at the time. Uh, and if we think about the query that we'll eventually run, um, you know, if we're looking for relevant passages about SaaS, um, you know, ignoring even different languages, of course, every jurisdiction, even in English, might have a slightly different way that they describe SaaS. Right. And especially in legislation. Legislation reads very old. Like, their description of SaaS will be very antiquated. It might even talk about CDs and things of this nature. And so from the beginning, I think it was a fair assumption, like, my, um, opinion was that we should be using a dense embedding, right, that semantically embeds these, uh, passages. Um, and so that is what we started with. I think what we found, and when we brought sparse back into it, was there are times, especially when it comes to citations, um, and pulling out certain terms from passages that come from the dense embeddings, where you also want to search sparse. Where you want to do, you know, a full text search of certain, uh, words, certain terms, and pull those in as well, and then compare the two of them. And what we saw was a pretty good increase in accuracy on the citation side. So we have some evals that we run on the retrieval part, where we, um, have a baseline of, like, these citations, right, these passages are the ones that should be retrieved for these queries. Um, and as we kind of layered sparse back into that, we saw an increase in accuracy. Um, and so that's kind of what we stuck with.
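A retrieval eval of the kind described, where a baseline lists the passages that should come back for a query, can be sketched with recall@k. The conversation does not say which metric or which merging strategy Sphere uses. Recall@k, the round-robin `interleave` merge, and the passage IDs below are assumptions chosen to keep the example small.

```python
def recall_at_k(retrieved: list[str], expected: set[str], k: int) -> float:
    """Fraction of expected passage IDs that appear in the top-k retrieved IDs."""
    if not expected:
        return 1.0
    return len(expected & set(retrieved[:k])) / len(expected)


def interleave(dense: list[str], sparse: list[str]) -> list[str]:
    """Merge two ranked lists by alternating between them, dropping duplicates."""
    merged: list[str] = []
    seen: set[str] = set()
    for i in range(max(len(dense), len(sparse))):
        for ranked in (dense, sparse):
            if i < len(ranked) and ranked[i] not in seen:
                seen.add(ranked[i])
                merged.append(ranked[i])
    return merged


# Made-up passage IDs for a single query.
expected = {"p1", "p7"}               # baseline: passages that should be retrieved
dense_hits = ["p1", "p3", "p4", "p5"]  # ranked results from the dense index
sparse_hits = ["p7", "p1", "p9"]       # ranked results from the sparse index

dense_only = recall_at_k(dense_hits, expected, k=4)
hybrid = recall_at_k(interleave(dense_hits, sparse_hits), expected, k=4)
```

In this toy case the dense index misses `p7`, which the sparse index ranks first, so merging the two lists raises recall at the same cutoff. Running a comparison like this over a fixed baseline is one way to check whether adding the sparse side helps.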
Speaker B: When you refer to dense and sparse, are you. It sounds like you're talking about embeddings versus full text search as opposed to like two tiers of embeddings or something like that.
Speaker C: Yeah, that's right. So, yeah, dense is definitely embeddings. Um, like, yeah, semantic embeddings that we use OpenAI's embedding models for. And then, yeah, when I say sparse, I'm referring to, uh, in our case, we use Pinecone to essentially create a sparse representation. So we've loaded a vocabulary and, um, then each passage is fed through, and it keeps an index, like full text search, of the different terms and their different usages across passages. So it's not quite, you know, Elasticsearch or Apache Lucene, um, but it's sparse in like a TF-IDF type implementation.
Speaker B: Got it. So you've got a predefined vocabulary and as you pass these documents into Pinecone, it's just flagging which documents talk about which of these terms.
Speaker C: Yeah, which passages are talking about which terms. So then you can search. Yeah, so that when. Then you can search over them and find, you know, if this one uses this particular term that's not frequently used across the corpus, it will, It'll be a high result.
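The point that a term rarely used across the corpus produces a high-ranking result is the core of TF-IDF weighting. The following is a plain-Python sketch of that idea, not Pinecone's implementation: the whitespace tokenizer, the function names, and the three sample passages are assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def build_idf(passages: list[str]) -> dict[str, float]:
    """Inverse document frequency: terms found in fewer passages get higher weight."""
    n = len(passages)
    doc_freq = Counter(term for p in passages for term in set(tokenize(p)))
    return {term: math.log(n / count) for term, count in doc_freq.items()}


def score(query: str, passage: str, idf: dict[str, float]) -> float:
    """Sum of term frequency times IDF over the distinct query terms."""
    tf = Counter(tokenize(passage))
    return sum(tf[term] * idf.get(term, 0.0) for term in set(tokenize(query)))


# Made-up passages, not real legislation.
passages = [
    "tax applies to sales of tangible property",
    "tax applies to prewritten software delivered electronically",
    "tax applies to sales of clothing",
]
idf = build_idf(passages)
query = "prewritten software tax"
ranked = sorted(passages, key=lambda p: score(query, p, idf), reverse=True)
```

The word "tax" appears in every passage, so its weight is zero and it does not influence the ranking, while "prewritten" and "software" appear in only one passage and pull that passage to the top.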
Speaker B: And where are the search terms coming from? Like, do you, you know... I'm kind of getting ahead of the answer here, but I'm imagining like, a document comes in and the first pass is to see if it's at all relevant to the task at hand. And so you're just searching for a bunch of terms to screen the document. Is that the idea?
Speaker C: Yeah, not quite, but I think, yeah, it's a great question. Um, so the query comes from something slightly upstream, which we also use TRAM for. And that is, the first step for us to support a particular product is for us to create what we call a taxonomy of that product. So what that means is we create kind of like a tree structure of, for this particular product type, what are the different characteristics across the world that affect its taxability? Um, so an example might be for clothing. Um, clothing that is made for children versus made for adults, um, can have different taxabilities. And you can imagine more questions like this where, you know, maybe pants have different taxability than shirts, something like that. And we build this big tree. And the query that eventually gets fed into TRAM is essentially, well, it's a couple of things, but one thing is a description of that particular type of product. So in our clothing example, maybe it's adult, uh, pants. And so we have a description of adult pants. Um, and so that is the main query that we put into, um, the system to then pull out relevant passages. And we'll use filtering, of course. You know, if we're doing determinations for Florida, we'll filter to only the passages that come from Florida's corpus of tax law. Um, and then we're just looking for portions relevant to this particular product type, for which we have an LLM-generated, few-sentence description.
Speaker B: Got it. And so this is, this query is kind of. I guess I'm trying to place this query in the context of like a document being fed through an ingestion pipeline. Yeah, an ingestion pipeline. And this is maybe after the pipeline you've got this, you know, this retrieval system. And now you're trying to use this retrieval system, uh, to, you know, update the deterministic model, for example. Is that the right way to think about it?
Speaker C: Yeah. So we've built this like, big index of law, right? The tax law from every jurisdiction. And then a query comes in which is a description of a product with a little other information around it. Uh, and then we want to find all the relevant passages in that jurisdiction for that product. So, yeah, the index itself is just all of the legislative data. And then the query is a particular type of, you know, search we want to run to pull relevant pieces of legislation.
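The flow from taxonomy to query can be sketched as a tree walk that emits one retrieval query per leaf product type, each paired with a jurisdiction filter. Everything named here is hypothetical: the `TaxonomyNode` type, the `leaf_queries` function, the `"US-FL"` code, and the descriptions, which are hand-written stand-ins for the LLM-generated ones mentioned in the conversation.

```python
from collections.abc import Iterator
from dataclasses import dataclass, field


@dataclass
class TaxonomyNode:
    name: str
    description: str = ""  # stand-in for an LLM-generated, few-sentence description
    children: list["TaxonomyNode"] = field(default_factory=list)


def leaf_queries(node: TaxonomyNode, jurisdiction: str, path: tuple[str, ...] = ()) -> Iterator[dict]:
    """Yield one retrieval query per leaf, restricted to a single jurisdiction."""
    path = path + (node.name,)
    if not node.children:
        yield {
            "product_path": " > ".join(path),
            "query_text": node.description,
            "filter": {"jurisdiction": jurisdiction},  # metadata filter on the index
        }
    for child in node.children:
        yield from leaf_queries(child, jurisdiction, path)


clothing = TaxonomyNode("clothing", children=[
    TaxonomyNode("adult", children=[
        TaxonomyNode("pants", "Trousers sized and marketed for adults."),
        TaxonomyNode("shirts", "Shirts sized and marketed for adults."),
    ]),
    TaxonomyNode("children", children=[
        TaxonomyNode("pants", "Trousers sized and marketed for children."),
    ]),
])

queries = list(leaf_queries(clothing, jurisdiction="US-FL"))
```

Each resulting dictionary holds the text to search with and the filter that limits results to one jurisdiction's corpus, matching the Florida example in the conversation.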
Speaker B: So you've built this ingestion pipeline and this retrieval system. It immediately calls to mind the, you know, the R in RAG. Um, and, you know, it may be that what you're doing ultimately is not generation, but, you know, certainly the idea of taking a bunch of context and sticking it into an LLM and having the LLM do the thing is, uh, you know, something that you think about. Um, you know, what's your take on the whole, like, "RAG is dead," you know, "retrieval is dead" argument that some folks make?
Speaker C: Yeah, I think, um, Yeah, I was thinking about this this morning.
Speaker A: Um,
Speaker C: I think for some use cases it's certainly true. Um, and I think we could set up some sort of system where, you know, we just have all this legislation in a file system and then an agent can grep over it and find the relevant pieces that way. And I think, yeah, for some sorts of problems that works well. I think for us, or at least for this particular problem, because we are so sensitive to accuracy and we're so sensitive to the exact right citation, essentially we need like a more finely tuned scalpel, um, to find us the relevant portion. And we need it to be highly accurate. Um, and so, anecdotally at least, when I use Claude Code or something and I see it grepping through the codebase, there's lots of times it misses. Like, I'll go off and I'll find a file that I really wish it would have found. Like, this file had the answer I was looking for. And so, you know, I think maybe we're on a path. Um, you know, five years from now, are RAG systems still working the way they are today? I'm sure they won't be. Um, but as of today I don't think, you know, agents just searching over the file system, grepping over it, is at a point where we could switch over and not lose accuracy.
Speaker B: Talk a little bit about the citations that you mentioned, how you use those and how the retrieval system helps you deliver them.
Speaker C: As part of the ingestion process we carry through a hierarchy of these different passages of text that we end up indexing, and each of them carries a citation. Different passages might share the same citation. That's very important for us eventually upstream, when the tax expert goes to review, because those citations also have links that take the tax expert out to the source document where we collected this. A lot of times the tax expert wants to review a bit more context than the model gave them in its breakdown of the citation. The model will give some of the citation back verbatim and then a bit of reasoning, but sometimes they want to expand on it, so they'll click out and read the source. So essentially, the way we've handled citations is through this hierarchy and by tagging each passage with the citation it came from. In theory that sounds easy, but there's a process at the beginning, with those parsers I mentioned earlier, to make sure that we're pulling the actual correct legal citation.
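A minimal sketch of the passage-and-citation tagging described above, in Python. The class names, field names, citation label, and URL are all hypothetical placeholders; the episode does not describe the actual schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Citation:
    label: str       # hypothetical legal citation string
    source_url: str  # link the reviewer can follow to the source document


@dataclass
class Passage:
    passage_id: str
    text: str
    citation: Citation
    parent_id: Optional[str] = None  # position in the document hierarchy


def group_by_citation(passages: List[Passage]) -> Dict[Citation, List[Passage]]:
    """Collect passages that share a citation, preserving input order."""
    grouped: Dict[Citation, List[Passage]] = {}
    for passage in passages:
        grouped.setdefault(passage.citation, []).append(passage)
    return grouped


# Illustrative data only: two passages tagged with the same placeholder citation.
example_citation = Citation("Example Code s. 1.2", "https://example.com/code/1-2")
example_passages = [
    Passage("p1", "First paragraph of the section.", example_citation, parent_id="s1"),
    Passage("p2", "Second paragraph of the section.", example_citation, parent_id="s1"),
]
```

Keeping the citation as a frozen, hashable value means several passages can point at the same citation and be grouped under it for the reviewer.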
Speaker B: You also experimented with fine-tuning, RFT in particular, for your process. Can you talk a little bit about where it fits in?
Speaker C: Yeah, so we saw a big jump with OpenAI's o1, which came out in December of '24, I believe. The first reasoning model, pretty much right out of the gate. We swapped out the model names like everyone does and we tried out this new model.
Speaker B: Which task in your pipeline in particular?
Speaker C: Yeah, this final task: given a certain product type and jurisdiction, determine its taxability in that region, which is what the tax experts themselves review. We have evals. Even then we had evals that would run. So we plugged it in and it did quite well. We ran some through and we were impressed. So we were already on board with reasoning models. It was clear that our use case was well suited to that extra thinking, those extra tokens spent considering the prompt and what the answer might be. So we were excited when OpenAI reached out to us to be part of their alpha program for reinforcement fine-tuning, which is essentially fine-tuning on their reasoning models. With any fine-tuning you need to provide examples, as in standard SFT, and in RFT you need to provide that as well. Then you also need to provide a grader. What we had that was very useful was feedback from our human tax experts every time the model, TRAM, had gotten something wrong on a determination. As the tax experts are reviewing, when the model is incorrect they leave feedback, and they give that feedback similar to how they would give it to a more junior colleague who had made this determination: a text blurb about what they thought, an explanation written in a way where you want that person to get better and to have the extra context that maybe isn't clear from just the legislation. So some background information about how Alabama treats a certain vocabulary word, something like that. What we found was twofold, I guess. We already had a set of questions that we knew the model struggled with today because it had missed them. 
And then we had a way to give really great signal through the feedback and through the fact that, of course, we had the correct answer. The tax experts fix the issue, and then they also leave the feedback. So we had the ground truth, we had signal, and we knew that these were hard problems that the model had missed previously. That was a really good recipe for RFT. We saw improvements during the alpha program with OpenAI on RFT, and what we use in production today is a different model that we've worked with them to RFT, and we've seen accuracy improvements. That really is the key for us: accuracy. We track it very closely. I'm always checking in on it. We want to know how accurate the model is being, and accurate means: how often does the tax expert have to make an adjustment to the model's work?
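As a rough illustration of the ingredients named above (missed questions, the expert's corrected answer as ground truth, the expert's feedback, and accuracy measured as how often the expert leaves the model's work unchanged), here is a Python sketch. The record layout, field names, and exact-match grader are assumptions for illustration; they are not OpenAI's RFT data format or grader interface, and not Sphere's schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Review:
    product: str
    jurisdiction: str
    model_answer: str
    expert_answer: str   # what the tax expert left in place after review
    feedback: str = ""   # free-text explanation, written as if to a junior colleague


def accuracy(reviews: List[Review]) -> float:
    """Share of determinations the expert did not have to adjust."""
    if not reviews:
        raise ValueError("need at least one review")
    unchanged = sum(r.model_answer == r.expert_answer for r in reviews)
    return unchanged / len(reviews)


def training_examples(reviews: List[Review]) -> List[Dict[str, str]]:
    """Keep only the cases the model missed, with ground truth and feedback."""
    return [
        {
            "prompt": f"Is '{r.product}' taxable in {r.jurisdiction}?",
            "reference": r.expert_answer,
            "feedback": r.feedback,
        }
        for r in reviews
        if r.model_answer != r.expert_answer
    ]


def grade(candidate: str, example: Dict[str, str]) -> float:
    """Toy grader: 1.0 when the candidate matches the expert's answer, else 0.0."""
    return 1.0 if candidate.strip().lower() == example["reference"].strip().lower() else 0.0
```

A real grader would likely need to score reasoning and citations as well as the final label; exact match is only the simplest possible stand-in.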
Speaker B: I'm curious about your experience with what I'd call undocumented model changes. I think you mentioned, either before we started recording or as we've been talking, your use of Claude Code. We've seen Anthropic document some things they do behind the scenes, tweaking various things that change the model's performance. Do you see a lot of that with the models you use, an inexplicable, unexplained change in behavior that you need to run down?
Speaker C: I think we see that a lot during model generation changes. We work to not fully rewrite, but significantly rewrite, a lot of our prompts from one model generation to the next. As for the things Anthropic gets up to on Claude Code, like sending your query to a quantized model because they have high traffic, I guess they would never admit to something like that, but from the outside, that looks like what they're doing. On the API side, because we're using APIs with OpenAI, I think those sorts of changes are less likely and would also cause an even bigger backlash. So I haven't seen anything within a model generation, but certainly every time the model changes, things change. We can't simply plug into the new version and get the best results immediately.
Speaker B: Is there anything in particular you've learned, or anything specific to your product, with regard to the way you approach evals, beyond collecting a dataset of cases where the models had errors in the past and running new models through those, that kind of thing?
Speaker C: I think that's been the main thing. Because we have these human experts, maybe the one part that's not as standard is that, because the tax experts are reviewing these things, we have an ever-growing list of evals. It's very easy for the experts: there's a toggle they can click that essentially says, "Hey, this is a difficult one. You should include it in the eval set," and they give a description of why. So we have this growing list of evals that we can pull from, which I think is important for the model, especially because we do this RFT with OpenAI. I haven't done this for a while, but I think if we went back and ran the original evals we were running a year and a half ago, they would not be nearly as useful as the evals that are running today, because the model has changed and improved, and maybe actually degraded in some particular ways.
Speaker B: Going back to retrieval, you and I previously discussed some interesting things you're doing around reordering and expanding, using an LLM in the retrieval process to enhance your results. I don't think we dug into that. Can you elaborate on that a little bit?
Speaker C: Yeah, so that would be downstream from the index of all the legislative data that we've talked about. When a query comes in, we have a multi-step process to build up the relevant context for that query before we eventually send it off to the final reasoning model to reason through the actual taxability of the product. What that looks like is an initial search into our database, sparse and dense of course, to pull out relevant passages. We then use an LLM as a judge, or an LLM as a reranker, to rerank those by relevance. We then expand each of the passages. Because we've retained their hierarchical nature, we can grab the previous and the following chunks or passages and build out the context of the relevant passages. Then we give that back to an LLM again to reorder, and potentially throw away certain things that no longer seem relevant now that we've added context. We repeat this process until we hit either a certain length or a certain confidence that we have the relevant context. Then that goes off to the final step, the LLM that makes the actual determination. That was a change made a little bit later in the process as well, in the search for increased accuracy. Another wrinkle that I think added quite a bit.
Speaker B: You continue until you reach a certain level of confidence. Is that based on an LLM-as-a-judge type of scenario, like an LLM's determination of confidence?
Speaker C: Yep, that's right. And that's basically by looking back at the previous pass. We'll give it both the previous passages that were fed in on the last pass, before they were expanded, and the current ones as well, because at some point you've expanded too far and now the legislation is talking about automobiles or something that's no longer.
Speaker B: Got it. Got it. So you're just asking if there's been a scope change or something like that? Essentially, yeah.
Speaker C: Is the added context actually useful? Is it on target for what we're looking for?
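The search, rerank, expand, and re-judge loop described above can be sketched in Python as follows. This is a toy under stated assumptions, not the production pipeline: `keyword_judge` is a keyword-overlap stand-in for the LLM judge, the initial search is a top-k over that same score rather than sparse plus dense retrieval, the stopping rule ("an expansion round added nothing that survived judging, or the length budget was hit") is a simplified stand-in for the confidence check, and the sample passages are invented.

```python
from typing import Callable, List, Sequence, Set

Judge = Callable[[str, str], float]  # (query, passage) -> relevance in [0, 1]


def keyword_judge(query: str, passage: str) -> float:
    """Toy stand-in for an LLM judge: share of query words present in the passage."""
    words = set(query.lower().split())
    text = passage.lower()
    return sum(1 for w in words if w in text) / len(words) if words else 0.0


def expand(selected: Set[int], n_chunks: int) -> Set[int]:
    """Add the previous and following chunk of every selected chunk."""
    neighbors = {j for i in selected for j in (i - 1, i + 1) if 0 <= j < n_chunks}
    return selected | neighbors


def build_context(query: str, chunks: Sequence[str], judge: Judge = keyword_judge,
                  top_k: int = 1, keep_at: float = 0.3,
                  max_chars: int = 2000, max_rounds: int = 5) -> List[str]:
    # Initial search (assumption: top-k by judge score instead of a vector index).
    order = sorted(range(len(chunks)), key=lambda i: judge(query, chunks[i]), reverse=True)
    selected = set(order[:top_k])
    for _ in range(max_rounds):
        candidates = expand(selected, len(chunks))
        # Re-judge after expansion and drop anything that is now off target.
        kept = {i for i in candidates if judge(query, chunks[i]) >= keep_at}
        if kept == selected:
            break  # the added context was not useful, so stop expanding
        if sum(len(chunks[i]) for i in kept) > max_chars:
            break  # length budget reached
        selected = kept
    ranked = sorted(sorted(selected), key=lambda i: judge(query, chunks[i]), reverse=True)
    return [chunks[i] for i in ranked]


# Invented passages, in document order.
CHUNKS = [
    "Definitions for motor vehicles and automobiles.",
    "Software delivered electronically is tangible personal property.",
    "Software as a service accessed remotely is taxable software.",
    "Exemptions for software used in manufacturing.",
    "Automobile registration fees.",
]
```

Calling `build_context("software service taxable", CHUNKS)` starts from the SaaS passage, pulls in its two software-related neighbours, and stops once the next expansion would only add the automobile passages.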
Speaker B: And then, in this search for increased accuracy, where do you see your next jump coming from?
Speaker C: Yeah, part of it is the model providers. The release cadence has been even faster from OpenAI and Anthropic, so that's been great. We see a bump, once we adjust things, with every model that they release. I think it's also further refinement of the RFT process with OpenAI. That's a big part of the way we'll get to where we aim to get. Like I mentioned, a human expert reviews every one of these determinations today. They go through every single one, and right now that takes them around 10 seconds, 9 seconds, to review each of them on average. That's incredibly fast compared to the incumbents, who are doing it totally manually, but we'd like to speed that up even further. The best way to do that is if they could take a random sampling instead. If we can get our accuracy to a point where we're confident that, given a random sample of some number of the determinations the model has done, if those are accurate, we don't need to review every single one of the determinations. So that's the North Star, at least on this front, that we're marching towards. I think RFT will be a big part of that, because in chasing this long tail, chasing the nines of accuracy, getting the correct answer becomes very sales-tax specific. You need a really deep understanding of tax law, deeper than these models have out of the box based on their training data. We'll make changes to our retrieval process, of course, and those will be somewhat helpful. But I think to get those last couple of accuracy points that we need, it'll be working with the frontier labs to try and do something more bespoke.
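One way to reason about the random-sampling idea above is to draw a sample of determinations for expert review and put an upper bound on the error rate implied by what the experts find. The episode does not say how this would be decided; the Python sketch below swaps in a standard Wilson score bound as an assumption, with hypothetical function names and the usual two-sided 95% value `z = 1.96`.

```python
import math
import random
from typing import List, Sequence


def draw_sample(ids: Sequence[str], k: int, seed: int = 0) -> List[str]:
    """Pick up to k determination ids uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(ids), min(k, len(ids)))


def wilson_upper_bound(errors: int, n: int, z: float = 1.96) -> float:
    """Upper end of the Wilson score interval for the true error rate."""
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return min(1.0, center + margin)
```

For example, if experts review a random sample of 300 determinations and adjust none of them, `wilson_upper_bound(0, 300)` gives a bound of roughly 1.3% on the error rate; whether that is tight enough to stop reviewing everything is a business decision the sketch does not answer.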
Speaker B: I asked previously about this "RAG is dead" question, but I'm wondering about the degree to which context length changes the way you approach the problem. It could be that these documents are so structured that a section is going to be three to five pages, and it doesn't really matter if you have access to a 2-million-token context window. Or it could be that there are other ways you can use that context. How do you think about the impact of the context window?
Speaker C: I think that was actually one of the big reasons why we saw a jump with the release of o1 back in the day. Reasoning models are much more capable of reasoning over their full context, whereas with non-reasoning models you got real degradation. Even if a model supported 128k tokens, when you pushed that limit, needle-in-the-haystack performance wasn't great. So I think we saw big improvements there with reasoning models. It's still a balance for us. Like I mentioned earlier, we don't need to fill up the context window to its max, and we don't. But a big unlock was models where we could give it more, where we could be a little less precise on the retrieval portion and expand these passages a little more aggressively. Before, when context was more limited, we were being very selective about which passages we fed in, because we only had so much we could give it before the model would just kind of throw its hands up. So that was a big unlock. We don't push right up to the edge. But as reasoning models improve and the context window gets bigger, again, we won't fill it up all the way, but that's a good sign that the model can handle more tokens than we're giving it today. And that means we can be a bit less precise on the retrieval portion and still get the results that we're looking for.
Speaker B: How much time do you spend thinking about trying to reduce token costs, either by refactoring from larger models to smaller models or via other methods?
Speaker C: LLMs, compared to human tax lawyers, are considerably cheaper, even the most expensive LLMs. And this also isn't a process where we're pushing through billions of tokens every.
Speaker B: I guess it helps that you're building a deterministic system, and that is the thing that's the inline, online system, as opposed to an LLM inference call.
Speaker C: Yeah, exactly. We're not cost sensitive, and that also means we're not latency sensitive either. It's very nice. Those are two things that we don't really have to consider very closely.
Speaker B: Quite the luxuries, right? Yeah. Nice, nice. Maybe to wrap things up: where do you see things going, both for TRAM and for AI in fields like tax more broadly?
Speaker C: Yeah, I think we have a clear path for TRAM. It's what I mentioned earlier: increasing accuracy and decreasing the human time spent reviewing. We'll continue to chase those metrics and improve them, and that will allow us to be even more accurate, even more nimble, and to cover more jurisdictions in the world. So that's certainly somewhere we're going to keep pushing. Then there are other parts of this. For example, one thing we talked about was the taxonomies we build that identify the different characteristics of a product that impact its taxability across the world. Currently we do that with our human experts, because it doesn't need to be repeated for every jurisdiction. It's a one-time thing: we create this taxonomy just for SaaS or just for clothing. So today we're doing that the traditional way, with human experts. But if you think about what they're doing and what the question is there, we have all the data sitting in our index to build these taxonomies. For every jurisdiction, somewhere in that data is the answer to what the different characteristics are that affect taxability. So I think that's another obvious spot that would also allow us to move even more quickly and add more product types. We'd also like to increase the accuracy and the frequency of the ongoing scrapes that we're doing. As you can probably imagine, there's a huge number of data sources that we're looking at right now, and not all of them can be scraped immediately or every hour or whatever. So we'd like to increase that, and increase the accuracy of what those changes do in our system. 
And then there are some tangential things. We'd like to make it as easy as possible for customers to move from a different tax solution to Sphere. A big reason people don't switch tax solutions, or why they become entrenched, is that they've spent so much effort mapping their products to tax codes for a particular system. What we are preliminarily doing with TRAM is an automatic mapping from some competitor's tax codes, or really any classification system. So if you've classified your products using HS codes, for example, which is what is used for tariffs, we could take in any product classification and map that to a Sphere tax code. Then the cost of switching to Sphere is seriously lowered, and you can actually get people to consider making the switch. We haven't talked about e-invoicing, and there are lots of other things, but at the end of the day it all stems from having this index of legislation across the world set up so that we can query over it.
Speaker B: Out of curiosity, what are the tools that you use and think of as like your biggest AI unlock from a personal workflow perspective?
Speaker C: Yeah, so I've been a subscriber to ChatGPT for a long time. Option-Space on my Mac, I use it all the time. Claude Code, I have that pulled up all day, every day. That's been a massive unlock for me personally, and I think across the engineering team here at Sphere. We're also beginning work on something akin to Stripe's Minions. Stripe put out a paper on something they called Minions, which are AI agents that run around looking at the code base, opening up PRs, and working together to improve things, taking care of things like the Dependabot PRs that get opened. So that's something we're looking at as well: how can we do that in a Sphere-specific way? And related to that, what other tools can we add to our internal AI agents? What skills can we add to make them even more valuable for us based on our particular use case? Where to pull data, where to look. These AI agents should be plugged into TRAM's internal index and be able to give answers from the legislation. That stuff is still nascent for us, but I feel like I'm surrounded by LLMs all day, every day.
Speaker B: Awesome. Awesome. Well, Alex, thanks so much for jumping on and sharing a bit about what you're up to at Sphere and how you're using AI.
Speaker C: Thank you, Sam, for having me.