Relational Foundation Models for Enterprise Data with Jure Leskovec - #768
The TWIML AI Podcast · 2026-05-21 · 1h 6m
Substance score
72 / 100
Five dimensions, 20 points each
What our scoring noted
Our reviewer’s read on each dimension, with quotes from the episode.
Insight Density
The episode delivers genuinely non-obvious ideas at a solid pace - particularly the framing that structured relational data has not undergone the same raw-data-learning transformation as vision and NLP, and the mechanics of in-context learning over graph-structured databases. Some sections repeat points or meander (the biomedical detour at the top, occasional circular re-explanations), keeping it short of elite density.
AI has not transformed the structured data space in the same way as, uh, computer vision or natural language understanding have been fundamentally transformed by AI
We don't learn on raw data. We run all these SQL queries, all this ETL, all this feature engineering to then come up with a set of signals from which we, let's say, try to predict something.
Originality
The core argument - that multi-table relational structure is the missing link where real ML gains are hiding, and that in-context learning can be applied to structured databases via graph encoding - is a genuinely fresh framing not commonly circulating in standard ML discourse. The churn feature-engineering reductio ad absurdum and the agent-friendly API observation are memorable and non-recycled, though the broader foundation-model narrative follows familiar patterns.
Just attend over the transactions and let the attention figure out what predicts the churn.
single table problems are, you know, are solved. I think the differences are kind of second order effects. What is unsolved is the multi table problem.
Guest Caliber
Leskovec is simultaneously a Stanford CS faculty member running active research (RelBench, relational graph transformer, AI Virtual Cell) and a co-founder with demonstrable production deployments at DoorDash, Reddit, Coinbase, Expedia, Databricks, and Snowflake - a rare combination of academic depth and at-scale practitioner credibility. He is genuinely the person who built the thing being discussed.
at Doordash it's uh, um, restaurant recommendations and the notification system...We've seen, you know, uh, revenue impacting hundreds of millions of dollars.
we have these models running in production at Coinbase on the entire Bitcoin blockchain network
Specificity & Evidence
The episode is anchored by real named customers (DoorDash, Reddit, Coinbase, Databricks, Snowflake, Expedia), concrete benchmark details (RelBench: ~40 tasks over 10-15 databases; SAP's SALT), measured performance lifts (5% relative from the zero-shot foundation model, 12% with fine-tuning), and a highly specific debugging anecdote about an agent aggregating data to midnight rather than to the current time. A few impact claims ("hundreds of millions") are unattributed and unverifiable, which caps the score.
the foundation model improves that I think for about 5% relative uh, the accuracy. Um, and then if you further tune the model...then the performance goes to 12% uh, over the State of the art
when it created features for that given account, it aggregated the transactions till midnight, not till the current time
Conversational Craft
Charrington earns credit for explicitly calling out an outlandish claim and pressing on it, asking whether Reddit's hand-engineered features are a "feel good thing," and the "no free lunch" challenge about compute costs. He also makes a productive structural distinction between the system and the model during the in-context learning discussion. However, he doesn't follow through on several openings - the "hundreds of millions" at Doordash goes unchallenged, and some follow-ups are leading rather than genuinely probing.
I find that proposition to be almost outlandish. Like they're just numbers with some unknown relationship.
have you looked at like if their manual features really make a difference, like is that a feel good thing
Conversation analysis
Computed from the transcript - who did the talking, and the verbal tics along the way.
Share of words spoken
- Speaker A: 83%
- Speaker B: 17%
Episode notes
In this episode, Jure Leskovec, co-founder and chief scientist at Kumo and professor of computer science at Stanford, joins us to explore two fronts of his work: AI for science and relational deep learning. We begin with AI Virtual Cell, a multiscale effort to learn data-driven representations from proteins to cells to patients using single-cell RNA-seq data, protein language models like ESM, and structure models like AlphaFold - without hand-encoding biology. Jure then dives into relational deep learning, reframing enterprise databases as graphs and training neural networks directly on raw multi-table data. He explains Kumo’s Relational Foundation Model (RFM2), which performs in-context learning over subgraphs to make accurate predictions on new databases and tasks with no training, and how this approach benchmarks against RelBench and other multi-table datasets. We also discuss real-world deployments at companies like Reddit, DoorDash, and Coinbase, explainability via attention over tables and columns, integration with agentic systems, deployment options, and practical limitations. The complete show notes for this episode can be found at .
Full transcript
1h 6m · Transcribed and scored by The B2B Podcast Index.
Speaker A: The recent breakthrough uh, that we had and we just released, um, the second version is our, what we call a relational foundation model, um, and that's a pre trained foundation model, uh, that can reason over structured relational data. And it's crazy what this model can do. It can make accurate predictions on any database and any predictive task without any model training.
Speaker B: All right everyone, welcome to another episode of the TWIML AI podcast. I am your host Sam Charrington. Today I'm joined by Jure Leskovec. Jure is co-founder and chief scientist at Kumo and a professor at Stanford University. Before we get going, be sure to hit that subscribe button wherever you're listening to today's show. Jure, welcome to the podcast. It's great to finally connect with you.
Speaker A: Yeah, great to be here.
Speaker B: I'm looking forward to our chat. We're going to be digging into your work on relational learning, um, as well as some of the other interesting things you're up to at Stanford and around AI for science and more. Uh, but let's start there. Tell us a little bit about your research focus.
Speaker A: Uh, yeah, great. So uh, I'm professor at Stanford here in the computer science department, uh, you know, where the future happens I like to say. Um, so there's always exciting research going on. Our um, focus recently has been I would say on two areas. First is uh, AI for science, uh, and in particular in, we have a project that we call AI virtual cell where we are basically building next generation foundation models that allow us to represent human cells, patients as well as individual molecules in cells and allow us to reason um, across this complex biomedical data for discovering new cancer therapies, molecule design, uh, reasoning about all different biomedical data modalities, objects and how they interact with each other, uh, to help speed up science. So it's everything from foundation models at the lower level of understanding proteins to then models that aggregate the um, let's say the molecules in the cell to represent a single cell. And then the next level models that now say oh you know, a tissue or a patient is a collection of cells, cells are collection of molecules. Let's build models that just aggregate all this knowledge in a very faithful representation, let's say of a, of a patient, um, and that helps uh, a lot because now the representations we have are much more uh, robust, uh, driven purely from the data. No biology in some sense is inserted in the model. Everything is emergent out uh, of the data and it's amazing how much we can, we can learn from that. So that I would say is one line of one line of work we've been working on.
Speaker B: And I can't help but hit pause and ask like, do you train this all end to end or are you training an individual model or representation at a time and then aggregating it after you've got these, these models defined?
Speaker A: Great question. So the way we are doing it right now is actually the first scientific question was, is this even possible? Right? Could you say a cell is a representation of molecules that are inside the cell? So now let's say molecules inside the cell are the proteins. I can use the protein language model to now represent every protein in the cell. And now the cell needs to aggregate information from all these proteins to say, I am a cell, this is my state. Right? Now that you have a representation of the cell, you can build, let's say, a patient-level model that says a patient is a collection of cells in given states that are composed from the proteins that are in there. And are we able to collect this information over these orders-of-magnitude different scales and get a strong data-driven representation of the, let's say, underlying patient in this example? And the interesting thing is that this is doable, and it's trained purely in an unsupervised, self-supervised way. Right? So you don't need to insert any human bias, any human knowledge of biology. The biology emerges from the data itself, right? Like cell types, cell states, relationships between them, that kind of human biology, how we describe it, actually emerges directly from the data, right? So the model learns how to best describe the underlying processes and phenomena without us pushing it on it from the top. That's kind of the exciting, interesting emergent capability there.
Speaker B: And tell me if this question makes sense. I think it's related to um, the way you're describing the training process. But is the data set that you're training on mechanistic in nature or behavioral in nature in the sense of like are you observing some behaviors of cells and then training on that data and they're, you know, some kind of faithful representation of mechanisms are emergent or is the, does the data have mechanistic properties to it?
Speaker A: The data we are using in this case is called single-cell RNA-seq data. This is data that large international consortia are collecting. But basically what it says is that you can take some sample from, let's say, some tissue, and then for every cell in that sample, you measure the number of different protein molecules inside that cell. So every cell is now represented by a 20,000-dimensional vector that tells me the abundance of each specific protein in that specific cell. Right. And every cell has different, let's say, ratios of these proteins depending on its type, depending on its state, and things like that. So that's the raw input data. And then of course, because we know what the protein is, we can actually bring the protein information in through ESM or through AlphaFold. And then it gets very interesting.
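To make the data shape described here concrete, the following is a minimal sketch of a cell represented as an abundance vector over proteins, pooled with per-protein embeddings. The abundance-weighted mean, the two-dimensional embeddings, and the names `cell_embedding`, `protein_embeddings`, and `cell_a` are illustrative assumptions; the episode does not say how the aggregation is actually done, and real embeddings would come from a model such as ESM rather than being written by hand.

```python
# Toy illustration only: the pooling rule and all names are assumptions,
# not the AI Virtual Cell architecture described in the episode.

def cell_embedding(abundance, protein_embeddings):
    """Pool protein embeddings into one cell vector, weighted by abundance."""
    total = sum(abundance.values())
    if total == 0:
        raise ValueError("cell has no measured abundance")
    dim = len(next(iter(protein_embeddings.values())))
    pooled = [0.0] * dim
    for protein, count in abundance.items():
        weight = count / total  # relative abundance of this protein in the cell
        for i, value in enumerate(protein_embeddings[protein]):
            pooled[i] += weight * value
    return pooled

# Hand-written stand-ins for embeddings a protein language model would produce.
protein_embeddings = {"P1": [1.0, 0.0], "P2": [0.0, 1.0], "P3": [1.0, 1.0]}

# One cell: counts per protein (a real vector would have ~20,000 entries).
cell_a = {"P1": 3, "P2": 1, "P3": 0}

emb = cell_embedding(cell_a, protein_embeddings)
print(emb)  # [0.75, 0.25]
```

The same pooling step could in principle be repeated one level up, treating a patient as a collection of cell vectors, which is the multiscale idea discussed earlier in the conversation.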
Speaker B: So being protein based, that brings in both mechanism and behavior.
Speaker A: Exactly, exactly, exactly. And then of course you can connect this all the way to the phenotype. Because what we are doing now, you know, is we can take a single drop of blood from a patient and then, rather than running kind of a classical blood screen, we can do this single-cell RNA-seq analysis. So now we basically can profile every single cell inside the drop of blood. And why blood is interesting is because it circulates through the entire body. It kind of captures the state, the immune state, of the entire body. So we are able to detect diseases, understand patient trajectories and things like that just from this digital twin of a single drop of blood.
Speaker B: Super interesting. Also very different from the other thing that you focus on, which is relational data.
Speaker A: Yeah. Uh, let me tell you a story. Why this is not so different. Okay. Uh, okay, so what I'm really, you know what I'm excited kind of fundamentally or how do I approach things is, you know, to always kind of take them apart and understand how different parts interact and how different parts work together. Um, and where I started, uh, doing this was actually in a certain domain, which is, um, uh, computational social science. I was very excited about how do people interact with each other. And when I started my research career, kind of social media just started as a phenomena. And my view at that point was like, I can use social media as a telescope into human behaviors. So I now can study human behavior through this digital traces that people produce through using cell phones, through using social media and so on. And it's all about networks, graphs of people interacting with each other. And what that, for example, allowed us to do is not only understand phenomena on social media, but also, for example, model the spread of COVID pandemic super accurately. And we were able to computationally analyze and predict how the virus will spread as we reopen the economy. If we increase the occupancy levels at different, you know, restaurants, gyms, uh, churches, Whatever the locations to say, this is how the virus would spread, this is what you can do. And that underlying was a network. So now what is biology? Biology is a network, right? It's all about molecules coming together to do something in a cell and then, uh, what's a tissue? Right? It's again, cells coming together, talking to each other, organizing in a given way. So that my skin has a given structure, it has given set of layers and so on. So that's essentially a network, a graph of interactions as well. And then you mentioned relational data, right? So data that sits in tables in a database that every enterprise in the world has. 
And it's kind of the most valuable data that's also a graph of, or capturing a graph of interactions of different entities inside that organization.
Speaker B: When I look into some of the work you're doing around the, uh, relational deep learning, it calls to mind other conversations. I've had that focus on deep learning for tabular data, uh, but that tends to be focused on that single table, uh, as opposed to these relationships that arise in, uh, enterprise data where you've got, uh, different tables that are linked by keys and whatnot. Talk a little bit about how these two areas of research and practice relate to one another.
Speaker A: Let me explain. So maybe first, if we think about machine learning, right? It hasn't really changed over the last, I would say, 30 years, maybe. No, not really. Right. Like, you know, we had decision trees, then we had support vector machines, then people liked logistic regression. Then people were like, oh, we'll build these deep neural networks. Then we said, oh, we have gradient boosted trees, they are better, and things like that, right? But fundamentally it has always been: you have your data, you feature engineer this single table of your features, you add a label, and now you train some supervised model that, from the features, predicts that label. Right? And we've been doing that over and over again. And maybe this predictive model, you know, it's a deep model, we would call it a neural network. But what I would argue is that AI has not transformed the structured data space in the same way as computer vision or natural language understanding have been fundamentally transformed by AI. Okay, and let me quantify what I mean by that, right? Like, what was the big breakthrough both in computer vision as well as in natural language understanding? It was about, let's build neural networks that learn directly on the raw data, right. In the old days, in computer vision, you would do all kinds of feature engineering, SIFT features, Gabor filters. And it'd be like, I'll describe this image as well as I can, so I can then predict, you know, is there a car on the image or not? Right. In NLP it was similar, right? Like, you know, IBM won Jeopardy with their system, but it was all super hand-engineered, manual and so on. But, you know, it kind of worked, right? But it took 300 people to build it, and it was great, but it was very kind of brittle. Right. 
So again, the transformers, they just learn over tokens. No grammar, no syntax, they just learn over tokens. Right. Again, a neural network directly on the raw data. The same thing is actually not happening on structured tabular data, right? There, we don't learn on raw data. We run all these SQL queries, all this ETL, all this feature engineering to then come up with a set of signals from which we, let's say, try to predict something. And when we came up with this idea of relational deep learning, our goal was to fundamentally disrupt this and say, hey, why can't I just learn directly over raw relational data? And why do we always have to learn over data in a single table? And the point is that as I take this multi-tabular data, and just to be very precise, right, what's a good example of this? It could be like I have a set of customers, I have a set of products. So these are two tables. Each customer has an ID, each product has an ID. And maybe I have a third table that's a set of transactions that says customer ID this bought product ID that, at this time, for this price, for example, right? And that's a three-table, super simple schema. And of course organizations have schemas of 50, 60 tables and more depending on their complexity. So our question was, how could I just learn directly with the neural network over this multi-tabular data? And the answer is, you know, kind of surprisingly simple: just think of the database, think of these tables, as a graph of relationships between the entities in the database. So, you know, I'm a graph person, so I like to think in terms of graphs, right? So graphs are composed of vertices, the nodes. These would be my users, my products, my transactions, and so on. So these would be now the nodes, and then the connections are just saying this user ID was part of this transaction that was part of that product. 
And now we have a path from a user to the transaction to the product. And then, you know, another user on another transaction is another path in this very simplistic graph. Um, and now that we have a graph, we can basically apply graph deep learning, like graph neural networks, um, which is a way to generalize deep learning to graph structured data and just train over that to get, to get an accurate prediction. And what happens is two things happen. The first thing that happens is you don't have to do manual feature engineering anymore, right? So it's much faster, it requires much less effort to train these models. And the second thing that happens is your models are more accurate. And then you say why can my models be more accurate? And the answer is very similar to what happens in computer vision, right? If you are saying, I am a human, I know what a car is, so I will, I will build perfect features that detect whether there is a car on the image or not. I know cars, I drive them, I'm such a car expert, I can build the best features for detecting cars. Nobody in the right mind claims that, right? But you know, in, in machine learning, data science, prediction, people are still saying, no, I'm the domain expert, I'll engineer the features. Your features are just some arbitrary human biased summary statistic of your data that you know, you kind of dreamt up with, put, put it as a feature in the, in, in your training table, retrain the model and then you saw whether that increase the accuracy or not. Right? And a neural network that trains with gradient descent is able to do so much more nuanced, almost like feature discovery by basically attending over this graph to extract much more signal. So we see this double digit increases in model accuracy because the neural network is able to extract more signal out of the raw data. Right. And I will just, you know, full transparency, right. 
If you are working on a super simple problem that falls on a line, then no neural network is ever going to be better than a linear model, right? So what I'm basically trying to say, I cannot guarantee that always you will get better performance because sometimes the data is linear and if you happen to train the linear model to it, you already have good performance. There's nothing more you can do, right? But majority of the data is not, is not linear, is much more complex and that's where the uh, benefit happens. So that's kind of the key idea behind relational deep learning is that now we can have neural networks just learn directly on the raw database data. Don't need to build these manual feature pipelines and feature stores that are super painful and lead to so many different kind of bugs and inconsistencies and information leakage and time travel makes putting models in production super hard. Um, rather just bring the raw data, have a neural network and get better results that way. So that's kind of, let's say the philosophy and um, the reasons why we are doing this.
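The three-table example above (customers, products, transactions) can be sketched as a graph in a few lines. This is a minimal illustration under stated assumptions: the column names, the toy rows, and the helpers `build_graph` and `products_bought_by` are hypothetical, and a real system would hand the resulting graph to a graph neural network or graph transformer rather than walk it by hand.

```python
from collections import defaultdict

# Toy versions of the three tables; column names are assumptions.
customers = [{"customer_id": "c1"}, {"customer_id": "c2"}]
products = [{"product_id": "p1"}, {"product_id": "p2"}]
transactions = [
    {"transaction_id": "t1", "customer_id": "c1", "product_id": "p1", "price": 30.0},
    {"transaction_id": "t2", "customer_id": "c1", "product_id": "p2", "price": 12.5},
    {"transaction_id": "t3", "customer_id": "c2", "product_id": "p1", "price": 30.0},
]

def build_graph(customers, products, transactions):
    """Every row becomes a node; every foreign key becomes an undirected edge."""
    nodes = {}
    edges = defaultdict(set)
    for row in customers:
        nodes[("customer", row["customer_id"])] = row
    for row in products:
        nodes[("product", row["product_id"])] = row
    for row in transactions:
        txn = ("transaction", row["transaction_id"])
        nodes[txn] = row
        for table, key in (("customer", "customer_id"), ("product", "product_id")):
            neighbor = (table, row[key])
            edges[txn].add(neighbor)
            edges[neighbor].add(txn)
    return nodes, edges

def products_bought_by(customer_id, edges):
    """Follow customer -> transaction -> product paths in the graph."""
    bought = set()
    for txn in edges.get(("customer", customer_id), ()):
        for kind, key in edges[txn]:
            if kind == "product":
                bought.add(key)
    return sorted(bought)

nodes, edges = build_graph(customers, products, transactions)
print(products_bought_by("c1", edges))  # ['p1', 'p2']
```

Note that nothing is aggregated here: each transaction row keeps its own price and keys, which is the property the speaker contrasts with flattening everything into one feature table.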
Speaker B: Can you give us some examples of the types of things you're trying to predict with these models? Are you trying to predict things that are primarily about structure or are you trying to predict, you know, individual values? How do you think about what the models are capable of?
Speaker A: Uh, that's a great question. The way I describe a framework right now it's very generic in a sense that you can bring any set of tables, any set of connections between them, any set of columns. The underlying mathematical representation kind of remains the same and of course the underlying graph changes. But the graph neural network or the graph transformer can be applied to that. So what would you want to predict? Depends uh, on the data. If you have a transaction graph, for example, where we see great uh, results is on fraud. All kinds of fraud detection, anti money laundering, account level fraud, transaction level fraud works beautifully, right? You just bring this heterogeneous multitabular data together and just learn over it what fraud is. And fraud is interesting because it's so non stationary. You know, fraudsters are trying to game the system all the time. So uh, as a machine, uh, learning engineer, you are always behind, your model is always deteriorating and you're like okay, how do I design a next feature? How do I design the next feature? You have a neural network, it will just pick the signal directly out of the raw data. So fraud is an example. Fraud you can think of let's say as a classification, uh, task, uh, then you can think a lot around regression type tasks. For example for customer behavior in terms of customer churn, next best action, um, uh, things like that. Um, and then you can also think about uh, uh, in graph terms about link prediction. So predicting links between two types of entities. What's the canonical task? There is a recommender system because it's a, predicting a link between the customer, the user, uh and the product. So we've seen uh, great uses of this in recommender systems, um, for ads, uh, product recommendations and things like that.
Speaker B: Historically when I've talked to folks about deep learning, machine learning for tabular data, the results were um, I don't know the best way to characterize this Like I always get the impression that we're not quite there yet. And would you say the same is true for what you're doing or is it an issue of like there was this missing link and that missing link is the graphical structure and now we have it and we're able to do much more. I'm trying to kind of, you know, ground what you're, you know, working on and saying with um, you know, this kind of broader results of applying, you know, these techniques that have shown, you know, to be extremely effective with text and images to tabular data.
Speaker A: That's. I think that's a great point. Right. Like when we say tabular data and tabular machine learning, uh, this is the community that works on single table problems, right. The data has already been flattened, pre formatted, summarized to fit in a single table.
Speaker B: Um, we're trying to get better results than we might get with, you know, XGBoost or something like that, right?
Speaker A: Yeah, yeah, right. And what we see there is that these models and architectures and so on, kind of deep learning, right, didn't really displace XGBoost. XGBoost is still kind of the workhorse. Maybe on individual examples, yeah, you can do better, but it's still the workhorse. And the reason I'm kind of less interested in this single-table problem is because that's not the right problem to solve. I don't know any organization that has all their data in a single table. Right. So the hard part, and where the information gets lost, is when you go from this rich relational structure into the single table. And once you are in a single table, then we are kind of talking almost like second-order effects. Did you use this architecture? Did you use that architecture? Did you use this tabular model or this tabular foundation model or not? All the information is there in that single table and all the methods are about equally good at extracting it. I think where the difference happens is if you actually take a step back and say, hey, the single-table model is not the hard part. It's not, in a sense, general or realistic enough. Where you need to go is the multi-table setting, because that's truly now the raw data you have. It's not some summarized, featurized data, it's the raw data, and there is much more signal there that got dropped when the data got flattened, summarized into a single table. So to me, single-table problems are, you know, solved. I think the differences are kind of second-order effects. What is unsolved is the multi-table problem. That's where the wins are being hidden.
Speaker B: And so how do you think about benchmarking performance for these types of problems? Are there established benchmarks for multi table prediction problems?
Speaker A: Actually there is quite a lot of single-table data out there because of all the history of machine learning. And I think even when people develop new benchmarks from raw data, they just release that single table, because everyone learns on the single table. Right? So what we did actually at Stanford, we were like, okay, so where is a multi-table benchmark? And there is no multi-table benchmark. And even if you look at Kaggle, out of thousands of competitions on Kaggle, you know, there are four that are multi-table. In all the others, the features have already been engineered for you. There is a single table and you start, you know, bagging and boosting and creating tricks until you win. Right. So we created a benchmark at Stanford by collecting and curating open multi-tabular data sets that we were able to find on the web. We call it RelBench. We have now two versions of RelBench. It's about 40 different predictive tasks over, I think, about 10 to 15 different databases. And then what's also interesting is that SAP, the big German IT company, released a multi-tabular benchmark of enterprise data called SALT. So those are, I would say, the two big multi-tabular, so relational, benchmarks: SALT from SAP, and the RelBench line of work that we've been doing and promoting here through Stanford.
Speaker B: It also makes me wonder if there's a way to reuse existing benchmarks by like denormalizing, you know, wide single tables or something like that. Is that something you've looked into?
Speaker A: Uh, that's a great point. Like you can try to denormalize, but if you think about it, you can only denormalize one to one relations. As soon as you have many to one, you have to aggregate. And that's the key. Once you aggregate, you lose information.
Speaker B: You've lost information.
Speaker A: Exactly, exactly. And I can dwell on this point a bit, right? Imagine you are doing a churn model, right? So a customer churn model could be: I have a customer and here are the historic transactions of the customer. I need to aggregate them. So first I say I'll count how many purchases you made last month. And then I'll maybe take the median price of those purchases. And then, you know, some other data scientist says, no, no, let's take the cheapest price of everything you want, right? And then somebody says, no, no, you should take the most expensive one. And then somebody says, no, it's the average. Another person says, oh, but distributions are skewed, we should take the median. Then another person wakes up and says, hey, it's about shopping in the morning, that's what's predictive of churn. Let's add another feature, right? And then somebody says, oh, but we have to account for the.
Speaker B: You're like just give me the data.
Speaker A: You know, like that's what I mean, right? And then you're like oh, holidays. People sleep longer on holidays. Let's now create a new feature that accounts for holidays. Oh, but then there is summer daylight change. Let's account for that. You see how kind of ridiculous this gets. Just attend over the transactions and let the attention figure out what predicts the churn.
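The point about aggregation losing information can be shown with a small worked example. This is a hedged sketch: the feature set in `churn_features` and the two toy purchase histories are assumptions chosen for illustration, not features from any system mentioned in the episode.

```python
from statistics import mean, median

def churn_features(prices):
    """A typical hand-engineered summary of one customer's purchase prices."""
    return {
        "count": len(prices),
        "min": min(prices),
        "max": max(prices),
        "mean": mean(prices),
        "median": median(prices),
    }

# Two customers with opposite trends (chronological order).
customer_a = [10.0, 20.0, 30.0, 40.0, 50.0]  # spending more each time
customer_b = [50.0, 40.0, 30.0, 20.0, 10.0]  # spending less each time

features_a = churn_features(customer_a)
features_b = churn_features(customer_b)

# Every order-insensitive aggregate is identical, so a model trained on these
# features cannot tell the rising customer from the declining one.
print(features_a == features_b)  # True
```

Adding a "trend" feature would separate these two customers, but that is exactly the open-ended feature chase described above; a model that attends over the raw transaction sequence sees the order directly.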
Speaker B: When I introduced you I mentioned that you are co founder of Kumo in addition to the research. Talk about the relationship between the research and what you're doing at Kumo.
Speaker A: What we built at Kumo is a commercial enterprise-grade platform that allows us to do large-scale relational deep learning models. And we are using this platform to two effects. One is to allow partners, customers to train and tune single-task models over the multi-tabular relational data. And I can talk about that part. But the recent breakthrough that we had, and we just released the second version, is what we call a relational foundation model. And that's a pre-trained foundation model that can reason over structured relational data. And it's crazy what this model can do. So what this model can do is it can make accurate predictions on any database and any predictive task without any model training.
Speaker B: And I find that proposition to be almost outlandish. Like they're just numbers with some unknown relationship. And you're going to say that you're going to train a model on just the relationship between random business numbers and it's going to work in some unknown use case how? Uh, make that make sense to me.
Speaker A: Thank you. Thank you. I think it's great. I think as I say this, people who listen should be like, what is this guy talking about? So thank you. So I agree, right? Because it's easy to say, oh, it's a foundation model. Cool, right? Great. But then, okay, what does it really do? So here's maybe how to think about this. So the key here is to do in-context learning, right? The same way as a language model does in-context learning, where I give it a prompt, I give it the information, I give it a task, and then it gives me the answer. So what we do here is, the system has several components. So there is the database, and then there needs to be a way for me to instruct the pre-trained foundation model what kind of prediction I want, right? I want to say, predict me the sum of purchase prices over the next one month for this particular customer, and that maybe is like, I'm predicting how much the customer is going to spend. Or I'm saying, predict me, you know, transaction.is_fraud equals true for this transaction ID. Okay, so this would be like, predict me whether the transaction is fraudulent for this particular transaction ID. So I have a way to specify my predictive task. And now what the system does, the system now goes into the database, it extracts a set of labeled in-context examples that then get passed through a pre-trained neural network to make a prediction. Okay, so now when I say a set of labeled in-context examples, this means that you can take the task.
Speaker B: For the example of fraud, I've got historical fraud that's already been labeled and I've got some new transactions coming in that don't have that label attached that I'm trying to predict, for example.
Speaker A: So let's do fraud. Fraud might be easier. Yes. So the way this would work, right, if I say predicting the probability of fraud, the system would go into your, into your database, um, and extract previous transactions for which we know whether they are fraudulent or not. For each of those transactions we would then extract kind of the subgraph of entities around it. Okay, so now what the relational foundation model gets on the input is a set of historical sub graphs of previous transactions and their fraud labels plus the new transaction that is unlabeled. We don't know it's fraud. And then this is passed through the relational foundation model architecture forward to kind of label the unlabeled graph, right? Like the unlabeled uh, transaction and the graph around it.
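The context-building step described here can be illustrated with a small sketch. It is not Kumo's implementation: the tables, column names, and the `subgraph` and `build_context` helpers are all hypothetical, and the pre-trained model that would consume the context is left out. The sketch only shows how labeled historical transactions and one unlabeled target, each with the rows around it, might be gathered from a two-table database.

```python
# Toy database: two tables linked by account_id (all names hypothetical).
accounts = {
    "a1": {"country": "US", "age_days": 900},
    "a2": {"country": "US", "age_days": 3},
    "a3": {"country": "DE", "age_days": 400},
}
transactions = [
    {"id": "t1", "account_id": "a1", "amount": 20.0, "is_fraud": False},
    {"id": "t2", "account_id": "a2", "amount": 950.0, "is_fraud": True},
    {"id": "t3", "account_id": "a1", "amount": 35.0, "is_fraud": False},
    {"id": "t4", "account_id": "a3", "amount": 60.0, "is_fraud": False},
    {"id": "t5", "account_id": "a2", "amount": 870.0, "is_fraud": None},  # unlabeled
]


def subgraph(txn):
    """Collect the rows reachable from one transaction via its foreign key."""
    siblings = [
        t["id"]
        for t in transactions
        if t["account_id"] == txn["account_id"] and t["id"] != txn["id"]
    ]
    return {
        # The label column is stripped so it never appears as an input feature.
        "transaction": {k: v for k, v in txn.items() if k != "is_fraud"},
        "account": accounts[txn["account_id"]],
        "sibling_transactions": siblings,
    }


def build_context(target_id):
    """Return (labeled in-context examples, unlabeled query subgraph)."""
    examples, query = [], None
    for txn in transactions:
        if txn["id"] == target_id:
            query = subgraph(txn)
        elif txn["is_fraud"] is not None:
            examples.append((subgraph(txn), txn["is_fraud"]))
    return examples, query


examples, query = build_context("t5")
```

A real system would also respect time when collecting neighbors, so that a historical example never sees rows created after it; this toy data has no timestamps and skips that step.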
Speaker B: I'm not sure now that fraud is a good example because it's kind of, I can see how that could work. Like you've collected the graph around these known points and you're asking a model to infer relationships that might, you know, lead to this one individual uh, label. Uh, and so maybe I Think maybe. And this may be where you're going. Like I think a regression type of a problem would strike me as more challenging than a classification problem.
Speaker A: Yeah, I think the key components here are, first, that you have a language in which you specify the task. From that we can generate almost like a mini labeled training data set, these in-context examples. And then you have a pre-trained model that is able to take these in-context examples, these subgraphs that have certain columns and tables and so on, and encode them in a domain-agnostic way. And then the neural network is able to essentially build a predictive model in its brain in a forward pass to give you an accurate prediction.
Speaker B: Right, right, right. So it's not necessarily about like some universal understanding of numbers or what have you. It's about being able to identify the right relationships between numbers that it hasn't seen before, query the right, you know, examples and create the right universe and then formulate that as the, the right um, I guess like inference request or something.
Speaker A: Exactly. So there are, I would say, two aspects to this. One is how you can take data and encode it in a domain-agnostic way. Because we can take any database, any set of columns, the model needs to be able to encode that in a universal way. And once it's been encoded, the second step is to perform in-context learning. It means that the model, in its brain, needs to be able to build the model. There's no training, there's no backpropagation, there are no gradients. It's just a single forward pass in which the neural network kind of builds, I don't know, the model within itself, in a sense, and that gives us the accurate prediction. So no training is necessary, no hyperparameter optimization is necessary, no feature engineering is necessary. All you need is a raw database and a way to specify the task.
Speaker B: Does the model require some type of memory structure, blackboard or something in order to um, do scratch work to come up with a representation? Or is this all like thought traces or something like that?
Speaker A: No, no, no, this is not an agent. This is a single forward pass of a transformer-like neural network. So this is purely inside the neural network. There is no agent, there is no memory, there is no scratch pad, there is no "let me do this, let me do that." The answer is truly a single forward pass of a neural network. There is no loop, nothing like that. So you get the answer in, I don't know, 0.2 seconds, half a second, whatever the time may be. It's really a single forward pass of a pre-trained, frozen neural network. There is no language model here. This is a kind of technology that's parallel or complementary to language models. You cannot textify a database and then go to ChatGPT and say, hey, what do you think, how likely is this transaction to be fraudulent? You get horrible results. So this is a frozen, pre-trained architecture that allows you to do that.
Speaker B: I feel like I've gone the full cycle from. That's an outlandish claim to. Oh yeah, I can see how that will work. Uh, I don't know, it's still kind of crazy that it works.
Speaker A: Yeah, no, but it's interesting, right? When we test this on data sets that are locked away and hidden, that the model has never been trained on, and on tasks that we haven't even thought about, we see a gain over the best supervised models out there. If you would go and say, I'll hire a data scientist, they'll spend several weeks building the model, tuning the model, the latest neural networks, whatever, it's still a couple of percentage points worse. And then if you fine-tune the foundation model on more data for the specific task, you get to this superhuman accuracy that present-day manual, semi-manual, or agentic solutions are just not able to attain.
Speaker B: That is RFM2, Kumo RFM2, the relational foundation model. You also recently published the Relational Graph Transformer at ICLR. Is one based on the other, or are they independent lines of research?
Speaker A: What I would say is, at Stanford we are pushing forward in the open on new architectural improvements and understanding, and as much as we can as academics, we release everything open source, we talk about everything. And then of course what happens inside the company is that some of these innovations that we put out also diffuse inside. I would say that internally the architecture we are using is a bit different. It's composed of two parts. The first part is the encoding, the attention mechanism over this set of tables. The second part is this in-context learning machine. There are two papers that are relevant here. One is the Relational Graph Transformer, which we mainly use for supervised, fine-tuning type tasks. But another paper we also published at ICLR is called the Relational Transformer, and that one actually allows for in-context learning. It does attention all the way down at the level of individual cells of a database. Essentially you have three types of attention. You have attention over a given column: if you are interested in a cell, attending over the other cells in that same column gives you a sense of the distribution. Then we attend over the cells in a row, and that gives you a sense of the information in that row. And then we also have a graph-based attention mechanism that allows you to say, oh, this is a user and these are their older transactions, where each transaction is a row and each row has columns. This means that we can be attending over millions, tens of millions of cells. And the beautiful thing is that our attention mechanism, because of the graph, has much more structure. So the attention mechanism is never quadratic, and this means we can compute much more efficiently. And to do good reasoning you really need humongous context sizes. Even the largest LLMs today go to, I don't know, a million tokens. For us, a million tokens is small.
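The three attention patterns mentioned here (same column, same row, and foreign-key links between rows) can be pictured as a sparsity mask over cells. The sketch below is only an illustration under assumed names: the toy schema, the `fk` mapping, and the `may_attend` rule are hypothetical and are not taken from the Relational Transformer paper. It counts how many cell pairs such a mask permits compared with dense all-pairs attention.

```python
from itertools import product

# Each cell is (table, row_id, column). Hypothetical toy data:
# 2 user rows and 4 transaction rows, 2 columns each.
cells = []
for u in ("u1", "u2"):
    for col in ("country", "signup_date"):
        cells.append(("users", u, col))
for t in ("t1", "t2", "t3", "t4"):
    for col in ("amount", "timestamp"):
        cells.append(("transactions", t, col))

# Foreign-key edges between rows: transaction row -> user row.
fk = {"t1": "u1", "t2": "u1", "t3": "u2", "t4": "u2"}


def linked(a, b):
    """True if the rows holding cells a and b are joined by a foreign key."""
    return fk.get(a[1]) == b[1] or fk.get(b[1]) == a[1]


def may_attend(a, b):
    """Allow attention within a column, within a row, or across a key link."""
    same_column = a[0] == b[0] and a[2] == b[2]
    same_row = a[0] == b[0] and a[1] == b[1]
    return same_column or same_row or linked(a, b)


allowed = sum(may_attend(a, b) for a, b in product(cells, repeat=2))
dense = len(cells) ** 2
print(f"{allowed} of {dense} cell pairs are permitted by the mask")
```

In this toy case the mask permits 84 of 144 ordered pairs. The episode does not say how column attention is kept cheap at the scale of millions of cells, so the sketch should not be read as an account of that.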
Speaker B: So I was going to ask, are there, are there data requirements or shapes or uh, use cases that this, you know, works well for or conversely doesn't work well for? It sounds like part of that is size. Like you need a lot of data in order for this to work. Is that fair?
Speaker A: I would actually maybe push back on that a bit, because the model is pre-trained, and it can do amazing things where you have very little data. Training models from scratch, yeah, requires a lot of data. But once the model is pre-trained, it kind of knows what functions appear in nature. It means that you can give it a few examples and it's going to give you very accurate predictions, more accurate than some supervised model that you have to train.
Speaker B: I think I picked that idea up based on you saying that the context uh, that you work with is typically large. Um, is that saying that when you have a lot of data available you can use it, uh, but you don't necessarily need it.
Speaker A: Exactly. And then, if you have a lot of data, you can either increase the context size, and by increasing the context size you get more accurate predictions. Or, if you are saying, oh, I'm doing fraud, you can just fine-tune your model for fraud, in the sense that you don't even have to do in-context learning. You know your data, you know the task, you just tune the model for that single task, and then the model can be smaller, much more efficient to run, and also more accurate, because it doesn't have to almost re-learn the task every single time you give it the in-context examples. What we see works best is some mixture of pre-training and in-context examples, because the way you choose in-context examples can actually depend on what the target entity is. In a sense you'd say, if I'm, I don't know, predicting fraud for me, then you could say, let me put some other Stanford professors in my in-context examples, let me put some other Bay Area folks in here, because that's kind of the peer group, the most useful examples from which you can learn to make accurate predictions about, I don't know, me being a fraudster.
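The idea of picking in-context examples that resemble the target entity can be sketched with a simple attribute-overlap ranking. Everything here is an assumption made for illustration: the attributes, the `overlap` score, and the `pick_context` helper are hypothetical, and the episode does not describe how Kumo actually selects examples.

```python
# Hypothetical labeled entities with categorical attributes.
labeled = [
    {"id": "p1", "employer": "Stanford", "region": "Bay Area", "label": 0},
    {"id": "p2", "employer": "Acme", "region": "Bay Area", "label": 0},
    {"id": "p3", "employer": "Acme", "region": "Berlin", "label": 1},
    {"id": "p4", "employer": "Stanford", "region": "Boston", "label": 0},
]


def overlap(a, b, keys=("employer", "region")):
    """Count how many of the chosen attributes two entities share."""
    return sum(a[k] == b[k] for k in keys)


def pick_context(target, pool, k=2):
    """Return the k labeled entities most similar to the target.

    sorted() is stable, so entities with equal overlap keep their pool order.
    """
    ranked = sorted(pool, key=lambda row: overlap(target, row), reverse=True)
    return ranked[:k]


target = {"id": "q", "employer": "Stanford", "region": "Bay Area"}
chosen = [row["id"] for row in pick_context(target, labeled)]
```

With this toy data the closest peer is `p1`, which shares both attributes with the target, followed by `p2`, which shares one.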
Speaker B: I'm thinking about the line between kind of the model and the system. The system is what is constructing the in context examples and the model is just that forward pass.
Speaker A: And you need both, right? I think that's important, because somebody has to generate these in-context examples. You won't generate them manually.
Speaker B: And is that part also learned or is that you know, kind uh, of a formulaic graph traversal or something else somewhere in between?
Speaker A: You can do it as a form, kind of just as a graph traversal, uh, and a bit of kind of time travel, right. To generate the forward looking labels. Um, but of course how you do that and what in context examples you generate makes all the difference. So there is a lot, there is, there is a lot that goes into that uh, to get top performance.
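The "time travel" step for generating forward-looking labels can be sketched as follows: pick a past cutoff, treat it as "now", build the history from rows before the cutoff, and compute the label from the window after it. This is a minimal illustration with a hypothetical purchase log and a hypothetical `example_at` helper, not a description of Kumo's pipeline.

```python
from datetime import datetime, timedelta

# Hypothetical purchase log: (customer_id, timestamp, amount).
purchases = [
    ("c1", datetime(2024, 1, 5), 40.0),
    ("c1", datetime(2024, 2, 10), 25.0),
    ("c1", datetime(2024, 3, 10), 30.0),
    ("c2", datetime(2024, 1, 20), 15.0),
]


def example_at(customer_id, cutoff, horizon=timedelta(days=30)):
    """Build one labeled example as if 'now' were `cutoff`.

    History uses only rows strictly before the cutoff; the label is the
    total spend in the window [cutoff, cutoff + horizon).
    """
    history = [
        (ts, amt)
        for cid, ts, amt in purchases
        if cid == customer_id and ts < cutoff
    ]
    label = sum(
        amt
        for cid, ts, amt in purchases
        if cid == customer_id and cutoff <= ts < cutoff + horizon
    )
    return history, label


# Slide the cutoff back in time to turn one log into several labeled examples.
cutoffs = [datetime(2024, 2, 1), datetime(2024, 3, 1)]
context = [(c, cut, *example_at(c, cut)) for c in ("c1", "c2") for cut in cutoffs]
```

Each cutoff yields a separate example for the same customer, which is how a single table of events can supply many labeled examples for a "spend over the next month" style task.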
Speaker B: And so speaking of performance, uh, you talked a little bit about some of the challenges with uh, collecting benchmarks, but how do you find performance relative to those benchmarks and uh, also more importantly in the real world.
Speaker A: Yeah. So I can say, we have a white paper on Kumo RFM2 that people can read, with a bunch of different benchmarks. What we see is that the foundation model by itself improves on the state of the art, over all supervised models ever published on this benchmark. So the baseline is very high. It's like, just build the best model you can and see how high you can get. The foundation model improves on that by, I think, about 5% in relative accuracy. And then if you further tune the model, meaning you fine-tune it with some gradient-based updates, the improvement goes to 12% over the state of the art. Those are quite sizable gains, especially if you think about putting this in production in recommender systems or fraud detection, where every single-digit increase in accuracy can mean millions, tens of millions, in business impact. The second thing I would say is that where we see these methods also shine is with noisy and incomplete data and cold-start problems. Because of the relational structure, the model is able to hone in much better and be much more robust to data missingness, data corruption, and things like that. So we've also done quite a lot of analysis around understanding how this performs on real-world data: sparse data, small amounts of data, noise, incompleteness, irrelevant columns, and things like that.
Speaker B: And when you mentioned cold start, like that suggests, hey, I want to start identifying fraudulent transactions, but I have no labels, I just have a bunch of data. Can you tell me where I should start looking? Like, does it work for that kind of problem?
Speaker A: Uh, yeah. Maybe I should quantify what cold start means. Usually cold start would mean, um, when a new user shows, a new product shows up. Right. So you still need to have some historical labels. I'm not. You still need some historical labels. But usually prediction is easy once you have a lot of data about a given user or a lot of data about a given product. But when the product is fresh or when the user is fresh, uh, you are data poor. That's what technically is called cold start problems. So I still need historical labels, but to make reliable predictions I don't need much data.
Speaker B: I think I saw somewhere that the system is deployed at like places like Doordash and others. Can you talk a little bit about the process for deploying it?
Speaker A: Yeah, great question. So the system, the platform, we can deploy it in many different ways. We can run it as a SaaS, basically as a compute platform. We can deploy it in people's private or public clouds, inside what we call a virtual private cloud, so all the data stays with the customer. There's, I would say, a bunch of different deployments depending on what organizations like and prefer. And then in terms of deployments or use cases: at DoorDash it's restaurant recommendations and the notification system, which user gets what notification at what time of day, and things like that. We've seen revenue impact in the hundreds of millions of dollars. Another great client we work with is Reddit. The advertising models on Reddit are built on top of, or are built with, Kumo. And it was a nearly double-digit increase in ad click-through rates. So basically the revenue, the, yeah, it's unbelievable. Usually an entire team increases that accuracy by maybe 1% year over year. Right? Because click through rate.
Speaker B: And this is your original point about uh, domain expertise and manual features. You would imagine that they've been working on this for a long time and they've kind of squeezed a lot of the juice out of that lemon, but here comes the machine.
Speaker A: Exactly, exactly. And it's actually interesting. We have a great collaboration and a great relationship with the Reddit team. They're amazingly sophisticated, and of course they built their own super-optimized, feature-engineered pipeline. The way we do it there is that we said, okay, let's take your data, represent it as a graph, and create embeddings for users, subreddits, ads, and things like that. These embeddings then get appended to their own features. And even with that there was a huge increase in the click-through rate, because the signal that the neural network learned was complementary to what the human feature engineering already provided. So the model that is in production combines the neural network embeddings, the graph embeddings from Kumo, with the manual feature engineering. That's been a great collaboration. So it's ad recommendation, click-through rate prediction, if you want to think of it that way, you know, or
Speaker B: have you looked at like if their manual features really make a difference, like is that a feel good thing, like you left them in there because they had them, uh, or do they provide, you know, lift that's been measured?
Speaker A: Ah, good question. I don't think we've tried turning those off yet, but it's a very interesting question. Sometimes you still want to have those features in there, maybe not for model accuracy, but because you have so many business rules. These advertising systems are not just pure optimization plays. There are so many other business rules that need to trigger for the ad to actually be shown to the user. So sometimes you need those signals to be able to trigger business rules anyway.
Speaker B: And I wonder, uh, in this case and with those uh, hand engineered features and more broadly with uh, RFM Um, you know, what kind of explainability story there might be. That's another reason why people like XGBoost is that those trees are fairly interpretable. And that's been a challenge with, you know, transformer based networks.
Speaker A: Yeah, that's a great point. Actually, I would say we do explainability really well, and our models, I would say, are even more explainable than these tree-based models. Because in a tree-based model, all you can get is a ranked list of features. So you can only explain predictions with the features you engineered. What we can do is we can do this.
Speaker B: Those features might be wrong understandings of the data or incomplete.
Speaker A: Yeah, exactly. So what we do is we can do, because the model is fully differentiable, we can basically run the model backwards and we see what tables, what columns, what cells the model is attending over and we get this structure based explanation. But then we use a uh, large language model to say here's where the attention is, this is what the columns are and this is what their semantics is. And then we generate a text based explanation that is like super readable.
Speaker B: Like a text based explanation of a saliency map of the data or something like that.
Speaker A: Think of it maybe that way. Right. But the LLM is kind of enriching it with all the human world background knowledge and so on. So it gets very actionable.
Speaker B: Uh, so you were talking about use cases you mentioned. We were talking about the recommendations, one
Speaker A: last one which is around fraud. So we've seen great results with fraud here. We've been partnering, uh, uh, with an amazing team at Coinbase. Um, so we have these models running in production at Coinbase on the entire Bitcoin blockchain network. Um, right. So also we can scale to, you know, the size to the size of the entire edit to the size of the entire Bitcoin or Coinbase, you know. Right. So uh, these methods really scale. But then you know, with some, with some clients, like let's say databricks or actually and Snowflake we are running, they are using us to run their sales models, you know, predicting what the customer is going to buy next, which customer is going to convert into a paying customer. And this allows them to, to optimize their sales team. Right. And if you think about sales team data, that data is smoother because the sales teams are, you know, hundreds, maybe you know, thousand, thousand thousand people or so. So you can do well on small data as well.
Speaker B: You know, there are aspects of it that sound free lunchy, like how am I Paying for my lunch. Like what's the, you know, is it um. Yeah, I, maybe I'll leave it there for you to ask. Answer.
Speaker A: Yeah, I mean at the end uh, what is being paid for is compute. Uh, what we are really doing is right. We like machine learning if you think about it is really a CPU compute based right. Majority of the compute that happens except maybe the final neural network training happens on the cpu. What we are doing is taking that workload from the CPU to the gpu. Uh so now the amount of GPU compute is larger because it's computed over the raw data, not over the summaries generated on the cpu. Um and uh, that's I would say what the cost is. Uh, in the end of course these models um, um ah are not in trillions of parameters. They are billion parameter type ones. They can be quite small so they're actually quite efficient um, and cheap to run. Because the reason we are making predictions is to make decisions, right? Our recommender system is making decisions what to show to every user. So we are like the speed of those decisions is at tens of thousands, hundreds of thousands, millions of times per second. So performance cost really, really matters.
Speaker B: Yeah, I think I was also trying to get at uh, limitations and if you had someone come to you with a problem that was in fact like multitabular relational, you know, what might be, you know, some reasons why you know, you ultimately you know, tell them that it's probably not a good fit.
Speaker A: The way I would say it is, we know that if we use our technology, it will be at least on par with or better than what is already there. Now, where we usually see bottlenecks is in actually getting the value, in connecting those predictions to some downstream decision-making business process so that the value can be reliably measured. That's been, I think, the biggest bottleneck, in the sense that models are built and developed and they work great, but then engineering teams need to hook them up to actually surface those predictions or make decisions based on them. That's one case. Another is that the technology shines on well-defined predictive problems that can be mathematically formulated and optimized. If it's more about, hey, we want to understand the patterns, we want to understand what happened in the past, that is much more this kind of traditional data analytics or pattern detection type of thing, and for that our platform and what we discussed is maybe not a good fit. So those are, I would say, some examples. You need to know what you are predicting, you need to be able to formulate that, and you need to be able to measure accuracy, and then we can optimize.
Speaker B: And you're not going to solve the traditional data science problem. Uh, like in organizations, if you've got a model, how do you use it? Um, that's still going to exist.
Speaker A: Exactly. There is still the problem of how, now that we have the model, how are we pushing that to, I wouldn't say to production. That's easy. The question is how do we connect it with the downstream app or the downstream system so that actually somebody is acting on this, uh, on these predictions. Right. And where we also see, I would say a lot of traction recently is in agentic workloads. Right. Because agents need to make decisions to take actions. And now you can make decisions based on these, you know, LLM based common sense. But for anything more like the best way to make decisions is, is to estimate or predict their downstream effect.
Speaker B: So now I'm envisioning this model sitting behind a tool interface that an agent can call to uh, um, you know, when it needs to make a prediction about the data.
Speaker A: Exactly, exactly. And even if you think about, let's say, a customer support agent: you call in, and I need to estimate what your lifetime value is and how likely you are to churn, and I will respond differently. What's the best offer for me to give you? I need to actually ask a counterfactual question: if I make you this offer, how much happier will that make you? These are all predictive problems. I cannot just hallucinate them or ask ChatGPT. It will do something reasonable, something I would say common-sense-y, but that's far away from optimal. So these predictions, this reasoning over structured relational data that captures the patterns and behavior of, let's say, customers inside the organization, is crucial to making accurate decisions. And as we are deploying these agents, we cannot be building separate models for each of these and anticipating the questions in advance. The beauty of the foundation model is that you can ask any question. And one thing I want to say here, just to show how big the problem is: think of, let's say, SAP. SAP has, I think, I don't know, seventy, a hundred thousand customers. Each of their customers has structured data, because it's an organization. Every one of them changes the schema a bit, so everyone has their own data and their own schema, and every one of them wants to do churn prediction. But every one of them has a slightly different definition of what churn is. So can you hire 70,000 data scientists who are going to build a per-client churn model with the client's data and the client's specification of what churn means? You can't. With a foundation model, an agent can just ask, predict the probability of churn under this definition and this data, and you get the answer half a second later.
Speaker B: That raises a question for me around post-training the model or fine-tuning the model. Is there any value to intermediate fine-tuning? I think what I'm envisioning is, you partner with SAP, SAP has hundreds of these modules. There's a supply chain module and there's a churn module and some other thing. Does it make sense to tune on the use case separately from the individual customer's data, or does the breadth of the foundation model already capture all of the information at that use-case level of abstraction, and really you're only improving it if you're looking at a specific customer's data?
Speaker A: That's a great question. Actually, that's something we are looking into deeply right now. I would say there are several reasons to post-train. One reason you would want to post-train, even in the in-context learning scenario, is to better learn the distribution of the underlying data, so that the data gets better encoded and later predictions are more accurate. So that's one reason you would want to train, even in a task-agnostic way, over the underlying data: to better capture distributions, to better learn priors. Another reason you would want to fine-tune is cost. If I fine-tune for a specific task, I don't need to do ICL, because now my context is much smaller. I don't need to bring in the labeled data, I just bring in the entity I want to predict on. Now the attention is smaller, it's faster, it's cheaper to run. And if I have large amounts of data, the model can learn a lot from it. So I would say there is a spectrum, a continuum of what you can do, and the benefits differ depending on where on this continuum you sit.
Speaker B: So what's next?
Speaker A: Yeah, what's next? We are very excited about agents, basically surfacing these to agents as tools. The second thing is that right now the coding agents are out there, but what we see is that coding agents require a proper abstraction and a proper infrastructure to be effective. For example, you could say, hey, why don't I just give this modeling task to Claude Code, and Claude Code will build the model for me, so what's the big deal? When we do that, and we've run this internally, what we see is that these models write thousands of lines of code, but there are these super subtle data-science-y mistakes. For example, we've done this together with Expedia on account-level fraud, and one mistake the agent made was that when it created features for a given account, it aggregated the transactions up to midnight, not up to the current time. So it said, oh, today is, I don't know, April 30th, so we'll use the data up to midnight of April 30th, instead of saying, hey, it's actually 10:00 a.m. on April 30th, we can only use data up to here. So that information leakage was one little mistake in there. Another mistake it made was that it modeled at the transaction level instead of the account level. These are subtle mistakes where you really need the human in there. But if you give it a higher-level, Kumo-like API, then it's able to do the same work in about 50 lines of code with no mistakes.
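The midnight-versus-current-time mistake described here is easy to reproduce in a few lines. The sketch below uses hypothetical transaction data and a hypothetical `total_spend` feature; it is not the code from the Expedia experiment. It shows how rounding the cutoff to the end of the day pulls a transaction from the future into the feature.

```python
from datetime import datetime

# Hypothetical transactions for one account: (timestamp, amount).
txns = [
    (datetime(2024, 4, 29, 18, 0), 50.0),
    (datetime(2024, 4, 30, 9, 30), 20.0),
    (datetime(2024, 4, 30, 14, 0), 900.0),  # occurs AFTER the prediction time
]

prediction_time = datetime(2024, 4, 30, 10, 0)  # 10:00 a.m. on April 30


def total_spend(rows, cutoff):
    """Sum amounts for rows strictly before `cutoff`."""
    return sum(amount for ts, amount in rows if ts < cutoff)


# Leaky: the cutoff is rounded up to the end of the calendar day,
# so the 2:00 p.m. transaction is counted even though it has not happened yet.
end_of_day = datetime(2024, 5, 1, 0, 0)
leaky_feature = total_spend(txns, end_of_day)

# Correct: only data available at the moment of prediction is used.
correct_feature = total_spend(txns, prediction_time)

print(leaky_feature, correct_feature)
```

The leaky feature comes out as 970.0 and the correct one as 70.0. A model trained on the leaky version would look better offline than it could ever be in production, which is why this kind of bug is hard to spot from accuracy numbers alone.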
Speaker B: The task in this case is to, is to do what? Ah, like I thought the task that you were describing was to code up something like what Kumo's trying to do,
Speaker A: the task is build me a account level, uh, fraud detection model over this data.
Speaker B: And so what you're proposing is like, as opposed to trying the agent, trying to code it up from scratch, you create some kind of skill or something like that that teaches it how to use Kumo to get the same information.
Speaker A: Yeah, or what I'm saying is that agents can autonomously make maybe two steps, but not a hundred steps. So when I ask it for a task, I can say, hey, here's PyTorch, go build me the model. That takes a thousand lines of code to build a model with PyTorch. I could say, here's XGBoost, build me a model, engineer features and so on. That takes about 500 lines of code. Or I can say, using the Kumo API, go build me the model, and that only takes 50 lines of code. And if you think of this analogy of steps, 50 lines of code is maybe two steps, and 500 lines of code is 20 steps. And you can get quite lost navigating, I don't know, 20 steps. I think the observation is more general: for agents to be effective, they need APIs that are agent-friendly, or agentic-friendly.
Speaker B: Okay, awesome. Well, Jure, thank you so much for jumping on and catching us up on what you and Kumo are up to. Super cool stuff.
Speaker A: Yeah, thank you so much for the conversation and, uh, very insightful questions.
Speaker B: Awesome. Thanks so much, Sam.