The Product Paradigm Shift: How Livekit Navigated High Stakes Scaling Challenges to Build the Future of Voice-First AI Interfaces w/ Russ d’Sa #262
The Engineering Leadership Podcast · 2026-06-23 · 46 min
Substance score
56 / 100
Five dimensions, 20 points each
Russ d'Sa, CEO of LiveKit, discusses the paradigm shift toward voice-driven AI interfaces and agent-centric UX, covering the company's multi-cloud infrastructure strategy built from day one, scaling challenges from powering OpenAI's and Character AI's voice modes, and how voice AI is evolving from legacy industries like healthcare and financial services toward new use cases like AI coding assistants.
Key takeaways
- Early voice AI adoption focuses on legacy industries like healthcare, financial services, and customer support because they're already telephone-native at scale, making voice AI a natural replacement for IVR and human call handling.
- Multi-cloud architecture across AWS, GCP, Azure, and other providers is necessary from inception to avoid outages when a single provider fails, though it's extremely difficult to implement retroactively during scaling.
- Turn detection and conversational dynamics in multi-party voice AI interactions remain unsolved technical problems that will unlock new product categories around embodied AI and robots.
- Agents should be designed to interact with existing backend services and APIs programmatically rather than forcing users through UIs, with chat interfaces serving as the natural human-to-AI communication layer.
- In a future where AI writes most code within two years, go-to-market strategy and product differentiation become more critical than engineering velocity, shifting founder focus from building to sales and awareness.
Guests
What our scoring noted
Our reviewer’s read on each dimension, with quotes from the episode.
Insight Density
The episode contains genuine technical substance around multi-cloud overlay networking, state synchronisation bottlenecks, and the asymmetric shape of voice AI workloads vs. video conferencing - all hard-won operational lessons. However, large sections are padded with Iron Man/Interstellar analogies, S-curve truisms, and vague future-casting that a well-read operator would already know.
we had a bottleneck in our cross continental kind of state synchronization. The way LiveKit works is like this completely distributed system. It's a mesh network. There's no single point of failure
the analytics system wasn't designed with that prior, uh, because it just wasn't the use case that we were building for when we first started
Originality
There are a few genuinely fresh framings - the argument that chat UIs are actually the native human-to-human text interface, and the operational logic for going multi-cloud from day one to avoid painful in-flight migration. But the bulk of the episode trades in widely circulating ideas: S-curve adoption, AI writing all code in two years, GTM as the new moat.
the chat interface actually is the native human to human text interface. And so I think what ends up happening is I think that like we've had this like thin client dream for a long time
if the frontier apps are building the brain, LiveKit's building the nervous system to that brain
Guest Caliber
Russ d'Sa is a legitimate practitioner - co-founder of the company whose infrastructure powers ChatGPT voice mode and Character AI at genuine hyperscale, with prior company experience through YC and a six-year co-founded venture. He speaks from direct operational experience, not from a thought-leadership perch, though his profile is still emerging rather than legendary.
OpenAI, uh found the demo, read the blog post and signed uh up in secret with a personal Gmail address, um, so we wouldn't know it was them
Character AI ended up signing up and building voice mode as well...And they just ratcheted to a hundred, like straight out of the gate and took us down. Global outage
Specificity & Evidence
The episode has a solid layer of concrete operational detail - 15 minutes notice before OpenAI opened to free users, the analytics rebuild taking a year, the state sync fix taking a week and a half, the demo getting 90 likes, team size of 20 at OpenAI onboarding. What it lacks is financial metrics, ARR figures, user counts, or latency benchmarks that would push it higher.
I had 15 minutes notice
it took us like a year to rewrite that thing, um, and make uh, it really good and make it scale to the moon
Conversational Craft
The host asks reasonable scene-setting questions and occasionally surfaces useful story threads (co-founder dynamic, scale planning retrospective), but there is no meaningful pushback, no challenging of vague claims, and the structure follows a predictable founder-story arc with a softening rapid-fire close. Questions are often compound and leading rather than incisive.
So can you bring us a little bit into the scaling story here behind this?
What's the scale planning that then like when you look back on you're like uh, that was a great decision and then helped out in that case
Conversation analysis
Computed from the transcript - who did the talking, and the verbal tics along the way.
Share of words spoken
- Speaker A: 74%
- Speaker B: 26%
Episode notes
Russ d’Sa (CEO & Co-founder @ LiveKit) joins the show to deconstruct the "Product Paradigm Shift" toward voice-driven interfaces and agent-centric UX. We dive into LiveKit’s high-stakes scaling lessons: powering OpenAI's and Character AI’s voice modes, how they navigated real-time bottlenecks to hit the next level of scale, the architectural necessity of a multi-cloud strategy, and the foundations of a co-founder relationship that can effectively blend engineering and business strategy.
ABOUT RUSS D’SA
A startup veteran who founded his first company in the 2007 YC batch and was the 2nd frontend engineer hired at Twitter, Russ d'Sa now leads voice AI unicorn LiveKit. LiveKit is the backbone of ChatGPT Voice Mode, Salesforce Agentforce, Grok, and roughly 30% of US 911 calls.
ABOUT LIVEKIT
LiveKit is an open source framework and cloud platform for building voice, video, and physical AI agents. It provides the tools you need to build agents that interact with users in realtime over audio, video, and data streams.
Full transcript
46 min · Transcribed and scored by The B2B Podcast Index.
Speaker A: We knew upfront going into it that we were going to take a lot of outages because it was novel and there were going to be, like, issues that we just did not have software mitigations for. But we knew that long term it was the right decision, because you can't just depend on AWS, but that also means you can't just depend on GCP or Azure or DigitalOcean or Linode. It's like you have to actually leverage all of them together. And so this was another really important decision, where we kind of run an overlay network across all of them. How do you make sure that you don't take an outage if somebody's cloud provider goes down? Well, you have to build across multiple cloud providers. Uh, it's just not realistic to expect a system to be perfect forever. When you have to transition to the system that I'm describing, which we built kind of from the very beginning, it's extremely difficult to do while the plane is in flight. Very difficult to do.
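The core of the multi-cloud argument above can be sketched as a placement rule: pick the lowest-latency healthy region while skipping any provider that is currently down. This is a minimal illustration under stated assumptions, not LiveKit's implementation. The `Region` type, the provider labels, the latency figures and the `pick_region` helper are all hypothetical, and a real overlay network would also have to move in-flight sessions, which this sketch ignores.

```python
from dataclasses import dataclass


# Hypothetical model of the regions an overlay network could place a session in.
@dataclass(frozen=True)
class Region:
    provider: str   # illustrative provider label, e.g. "aws", "gcp", "azure"
    name: str       # illustrative region name
    rtt_ms: float   # measured round-trip time from the client, in milliseconds
    healthy: bool   # result of the latest health check


def pick_region(regions, down_providers=frozenset()):
    """Return the lowest-latency healthy region on a provider that is not down."""
    candidates = [
        r for r in regions
        if r.healthy and r.provider not in down_providers
    ]
    if not candidates:
        raise RuntimeError("no healthy region on any provider")
    return min(candidates, key=lambda r: r.rtt_ms)


# Made-up example data for illustration only.
regions = [
    Region("aws", "us-east", 12.0, True),
    Region("gcp", "us-east", 15.0, True),
    Region("azure", "us-central", 28.0, True),
]

print(pick_region(regions).provider)                          # aws
print(pick_region(regions, down_providers={"aws"}).provider)  # gcp: the AWS outage is routed around
```

Because the candidate list spans several providers, losing one provider shrinks the list instead of emptying it, which is the point about not depending on any single cloud.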
Speaker B: Hello and welcome to the Engineering Leadership Podcast, brought to you by ELC, the Engineering Leadership Community. I'm Jerry Lee, Founder of ELC. And I'm, um, Patrick Gallagher, and we're your hosts. Our show shares the most critical perspectives, habits and examples of great software engineering leaders to help evolve leadership in the tech industry. Russ d'Sa, CEO and co-founder at LiveKit, joins us to deconstruct the product paradigm shift happening right now and how it will impact strategies and roadmaps. We are talking about the shift toward voice-driven interfaces and agent-centric UX. Plus, we dive into a bunch of stories behind LiveKit's origin, including some of their high-stakes scaling lessons: from powering OpenAI's voice mode, hitting the next level of scale and navigating real-time bottlenecks with Character AI's voice mode launch, to the architectural necessity of a multi-cloud strategy from day one and the foundations of a co-founder relationship that can effectively blend engineering and business strategy. Let me introduce you to Russ and LiveKit. Russ d'Sa is the co-founder and CEO of LiveKit, the voice AI platform behind ChatGPT, Character AI and ElevenLabs. He started his first company in the 2007 batch of YC and was the second frontend engineer at Twitter. LiveKit began as an open source project for building live streaming and video conferencing applications using WebRTC. Over time it evolved into a developer platform for building voice, video and physical AI agents. What started with just a media server and some SDKs is now a full ecosystem of APIs and tools for multimodal computing. Enjoy our conversation with Russ d'Sa. Russ, I just want to say welcome. What's been kind of fun is every time you and I have talked, we have gotten into some sort of paradigm-shifting idea. And I think that actually kind of sets the stage for our conversation today.
And so, I mean, we have a few things around, like product paradigm shifts that are happening. And so I guess as is tradition, since you and I have connected a couple times now, let's get into some paradigm shifts. So can you introduce us a little bit to this product paradigm shift that's going on? We've been kind of talking about this world of voice driven apps and computer interfaces and some of the major shifts going on there. So I guess bring us in, like what are some of these fundamental shifts happening in this world right now?
Speaker A: So there's so many things that are going on right now, in 2026, that, you know, would have felt like sci-fi even five years ago. One of those things that is kind of relevant to my work is the way that you interact with computers. There is this paradigm shift in how computers feel to use. Right? The way you've interacted with computers has changed over time. So it started off with these punch cards and then, like, you had a keyboard and a mouse, and then you have a touchscreen. And, um, the computers that you wear even now, like, you know, you wear an Apple Watch, it has a touchscreen on it, though, and that's the primary way that you interact with that thing. But with LLMs and generative AI, suddenly we have a computer in the abstract sense that can understand human thoughts through a written or audio-based form. Right? Like it can, it's read the Internet, um, and it can write. The trajectory of these models is that they're becoming more and more realistic in the way that they're able to understand what you say to them and then speak back to you in a convincingly human way. It has all of this knowledge from the Internet, and then the way it phrases things, like, uh, it can do so and write in very human-like ways. And it doesn't just mean that it writes perfect English all the time. It can also, like, speak in slang if you instruct it to, and all of these kinds of things, different languages. Now with an AI model like that you can create software, uh, with that AI model at its core that you can talk to. That's, I think, the magic unlock, uh, that we've seen recently in this industry we call voice AI: that now you can interact with software applications using just your natural human input and output. I can speak to the computer, it can understand me, maybe it can even pick up on my, my mood, um, or how I'm feeling that day.
And then it can commensurately, like, respond in a way that feels empathetic and either, like, helps me think through something that, uh, I'm trying to work through in my own mind, um, or it can even do work for me or perform tasks, uh, on my behalf, uh, and I don't even have to go and sit in front of the computer for a while and perform those tasks myself. Now the computer can kind of do it, uh, almost like a person might if I was working with another person, uh, on a particular thing. And so this is just very sci-fi. Um, you know, you watch movies like Interstellar, you've got TARS, and it's like, uh, he's having a conversation with TARS and TARS is doing certain things in the ship for him, and he's also, you know, doing things, and they're kind of working together, and that's now possible. It's really, really crazy.
Speaker B: I've seen this sort of manifest in a few different ways, but I think what's been interesting is to see how this pops up in products. So specifically, as I'm paying attention to how the software development lifecycle's changing and how people are applying different AI coding tools, what we started to notice even a handful of months ago was how people are using voice-driven development, just talking to a microphone and using that to power the types of development work that they're doing. But even alongside that, there's also this whole class of products around personal productivity, or personal assistant and multi-agent orchestration tools, that are also saying, hey, you're probably getting a lot of value from this, but also try voice; make voice the first-class way that you engage. And so it also feels really early days for this in a lot of ways in terms of how humans are interacting with these different products. And so, you know, I think the people listening to this are kind of in two different camps here. On one side, they're experimenting with and building new products, new businesses. And on the other side, they're also builders: the engineering leaders essentially building out the new capabilities of existing things. When you think about those two groups, and where we're at with voice now, and where things could go, what are some of the strategic implications of this in terms of what people are building? What are you starting to see there, or what gets you excited about these strategic implications?
Speaker A: Like all technology, there's kind of like an adoption curve. There are certain use cases that end up being natural early adopters of uh, a paradigm shift or of a technology kind of shift. And then over time there's like a saturation effect. Right. You know, it's like this S curve kind of idea, right, that we move through across every paradigm shift that occurs, both like within categories like voice AI or voice interfaces, and also broadly speaking across all of technology and tools that humans build. I think for voice AI in particular, there are these two segments that I would categorize. I think there's one which is like the early adopters of voice AI types of applications. And they tend to be focused, somewhat ironically, they tend to be focused in more of the legacy or older established industries. Um, so we're talking like financial services, health care, customer support is another really common one. Some retail use cases as well. And the reason that they're focused, uh, or that we've seen adoption in those categories is because those categories, they've been around for some time, they have established scale by virtue of being around for a long time. And because they've been around for a long time, they tend to have adopted the cutting, uh, edge technology of the time when they were building their companies. Right. And kind of moving towards scale. What does that mean for voice AI? Well, it means that they use a telephone a lot. They usually have a person sitting on the other end of a telephone answering that phone call. Right. Like, so you could use like a clinic, for example, like uh, like a dental office that is, is trying to schedule appointments, um, for, for a patient to see their dentist. And these systems have been at scale. They've been around since the 70s and 80s and like they're dominated by the telephone. And the telephone is a voice native system. Right. There typically is not a screen associated with the telephone. 
And having this, like, AI model that can now, like, hear you, understand what you're saying and, like, call tools properly and then speak back to the person that's interacting with it, well, it's pretty obvious where I'm going to go: the first place I can put that is on the other end of a telephone call. There are a lot of companies now, large enterprises, that are kind of retooling a lot of their stack and how they facilitate, uh, these workflows. And then there are smaller companies that are building platforms to service some of the larger, ah, enterprises or clinics or hospitals out there. And what they're doing is they're allowing you to schedule appointments using AI, right? And that happens over a telephone call. When someone calls in, you're connected to an AI instead of, ah, an IVR system, which is an older voice-based technology from decades ago, or a human who was, uh, previously answering that phone call. And so you see this happening across many different workflows in these different industries. So, like, in FinServ there's wire confirmations and basic banking functions and, um, account setups and balance checks and all of these things that are now being facilitated, uh, by voice AIs through, uh, phone calls. So that's kind of the broad bucket of early adopters, I think. On the other end is kind of these new and emergent use cases. And so you mentioned one in particular: like, Anthropic, um, just recently launched voice, uh, mode for Claude Code. And the idea there is that you can kind of take more of a lean-back experience, uh, as you're leveraging Claude as your copilot to help you build these kinds of applications. It doesn't have to be a voice application, it could be any kind of application. You're, uh, working with Claude in this capacity, and that is an early glimpse into almost a Jarvis-like experience, right?
Where, uh, if you, in Iron Man 1, when he's developing the suit for real, you know, not the makeshift one in the desert for him to escape, but the suit in the lab, the real proper Iron Man suit. He's got Jarvis there and he's talking to it and it's helping him with the design. And he's spinning around a 3D model that is generated by the AI. And then, uh, I think there's a point where he sees this hot rod in the garage and he's like, well, can you make it red and gold, or paint it this color? And then it goes in and paints it. At least that's how they portray the origin of how the Iron Man suit got that color in the movie. But, um, you know, this voice interface to Claude Code is an early kind of glimpse at, I think, where things are going. And so why is that farther out? Right? Why did I talk about one being short term and one being long term? Well, I think that there's still technology that needs to be developed, problems that need to be solved that haven't been solved yet, for something that feels as natural to interface with as Jarvis does, right. You need to be able to move between spaces with access to Jarvis, right? Like, have that continuity, uh, in terms of your session with that AI model. One thing that you see in Tony Stark's house is he's got the intercom hooked up to Jarvis, and he can talk to it in the suit, and he can go into his garage and his design studio and Jarvis is there, right? It's almost like Jarvis is omnipresent. So I think that's one thing. The second thing is that turn detection is a really hard problem that needs to be solved. So it's understanding exactly when a person is done speaking and when the AI can start to speak. Take the complexity of that just for a one-on-one with your AI, and multiply that out into an AI that joins a group, uh, setting. Right?
If you had an AI place reserved at a table for your phone, sitting there like listening to the conversation and then being a participant in a multi-party kind of interaction, well, uh, that's a completely different game, uh, than just one-on-one turn detection, which we haven't even solved yet. Then take that and embody it in robots. And maybe you have multiple robots, or AI models within the robots, that are sitting at the table with you with a bunch of humans. And then how does that interaction work? Like, are you talking to the robot? Which one are you talking to? Are you talking to a person? Are there two people talking to a single robot? Is it like multiple different, smaller conversations happening at the same table? How do you, like, cancel out or ignore or focus your voice on the, on the conversation that you're actually having, you know, between the AI and a human, uh, and ignore the other side conversations? So there's all these conversational dynamics where, you know, I think we understand where these problems occur, like some of what those dynamics are, but we don't have the technology yet to solve them all.
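The simplest baseline for the turn-detection problem described above is silence-based endpointing: declare the user's turn over after a fixed stretch of quiet that follows speech. The sketch below assumes per-frame audio energies have already been computed; the `detect_end_of_turn` helper, the energy threshold, the 20 ms frame size and the 500 ms timeout are hypothetical values chosen for illustration. This naive rule is exactly what breaks on mid-sentence pauses and on multi-party conversations, which is why the problem is described as unsolved.

```python
def detect_end_of_turn(frame_energies, frame_ms=20, energy_threshold=0.01, silence_ms=500):
    """Return the index of the frame at which the turn is judged over, or None.

    Naive baseline (hypothetical parameters): a turn ends once speech has been
    heard and then `silence_ms` of consecutive low-energy frames follow.
    """
    needed = silence_ms // frame_ms  # number of consecutive silent frames required
    heard_speech = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True  # speech (re)started, so reset the silence counter
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None  # the user has not finished, or never spoke


# 5 quiet frames, 10 speech frames, then 30 quiet frames (all values made up).
frames = [0.0] * 5 + [0.2] * 10 + [0.0] * 30
print(detect_end_of_turn(frames))  # 39: the 25th silent frame after speech
```

A short pause shorter than the timeout resets the counter when speech resumes, so it is not treated as the end of the turn; a fixed timeout cannot tell a thinking pause from a finished thought.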
Speaker B: So what I'm thinking about for people building right now is the question of, okay, so you've described this paradigm of where early adopters are at right now, some of the use cases that are really interesting, and then also the unsolved problems of the future. But as you're describing each unsolved problem, it unlocks a whole new class of possible experiences and products. On top of that, as soon as you get into kind of embodied voice AI in robots, that expands to a whole new product surface. And so part of, uh, the question is, how does this impact what people are building right now? When you're thinking about some of these implications, where things are going, what's unsolved, how would you advise somebody who is maybe building a product experience right now to deconstruct this and extract insights for what they're working on?
Speaker A: A general way that I think about it is I think of this concept of an agent as the new software. It's not to say that we're going to go and rewrite software for every single thing, um, to be an agent, but more so that I think that there's this movement now to build agents that can do things for us directly, but that they can also go and interact, interact with software from the previous paradigm and use those tools like a human would. Right? So that the human doesn't have to go and use those tools. People design things in Figma, people go and write posts in a Word document editor or a word processor that might be like Google Docs or it might be Notion or might be Microsoft Word or anything like that. People go and respond to emails and send emails. There's a whole host of applications that we use every single day for different purposes, uh, different workflows, parts of our life or parts of our job. I think that there's this, like, broader movement to reconsider which of those kinds of workflows can be done by an agent, um, and how much of that task can be done by an agent. What I see a lot of are people kind of reimagining entire functions from an agentic perspective. So there are folks that are working on AI SDRs and folks that are working on AI salespeople and folks that are working on AI customer support. And there's, um, of course, like, Claude Code and Codex, um, and Cursor, all kind of working on an AI software engineer. So I think broadly speaking, my recommendation would be to start thinking about what the world looks like when there are agents that can assist in facilitating all types of different tasks. And in the context of voice, I think it doesn't make sense today for voice to be in everything. I think you have to consider where it is appropriate and where it is not appropriate. What's also important with voice is to think through what the graduated approach looks like.
There's going to be a world where everybody is interacting primarily with computers using voice and vision. Before we get to that world, I think that there's like a gradual approach, right? So today what do we do? We use tools by operating these tools with a keyboard and a mouse. You have to understand where the login button is, where the sign-up button is. Like, where is the form that I need to fill out? You know, how far do I need to scroll down the page? You have to learn all of these things, and there are conventions and human interface guidelines for how to design these kinds of applications, so that when you build the application you're trying to maximize familiarity, um, and navigation of the thing that you've just built for the user or someone that you're building that thing for. So when you no longer have to do that anymore, when you maybe no longer have to be in front of the keyboard and the mouse, how do you design your UX so that agents can understand how to use that application, right? Or agents can understand how to use that backend service, right? Maybe you don't even have a UI anymore, but you still have a backend, right? You still have an API or a service that is responsible for running through a deterministic algorithm or process or application logic, and it still needs to update the database properly and have a record of something that occurred. But what if you programmatically expose that to an agent, such that it can now interact with that service on the backend, versus having to force a user to interact with the UI? And then on the interface side, what is the interface to that agent? Well, I think chat is a really common one. You know, people always complain, like, oh, why is every UI a, ah, chat thing? Where are we going to see these AI-native UIs? And my argument against that is that, like, an AI-native UI, right?
Like, if the AI is so smart, right, it's so human. Like, forget voice for a second. Just, like, if AI is super smart and human-like in its capability, well, how do you actually communicate with other humans with text? It's a chat interface: it's iMessage, Telegram, WhatsApp. Even email is kind of a glorified chat interface. And Slack is like that too, and Teams, and the chat interface actually is the native human-to-human text interface. And so I think what ends up happening is, I think that, like, we've had this thin client dream for a long time. People have been trying to do thin client stuff and there's various ways of approaching that. But what if the way that thin clients come about is that the UI disappears and gets forced into this human-to-human kind of client experience, which is a chat UI? Everybody understands how to use that because we use it with each other all the time. So now your agent is actually just interacting with these services on the back end and then communicating with you through a chat interface. And I mean, if you look at OpenClaw, this is basically the way that you operate OpenClaw: through different chat, uh, channels, right? And then voice, of course, is another layer on top of that.
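The idea of exposing a deterministic backend to an agent instead of forcing a user through a UI can be sketched as a small tool registry plus a dispatcher that executes the model's tool calls. Everything here is hypothetical: the `book_appointment` function, the in-memory `APPOINTMENTS` store, and the `{"name": ..., "arguments": {...}}` tool-call shape (loosely modeled on common function-calling formats) are assumptions for illustration, not any specific vendor's API. The chat or voice layer would sit in front of this, turning the user's words into the JSON call.

```python
import json
from typing import Callable

# Hypothetical in-memory "backend": the record store the application logic updates.
APPOINTMENTS: dict[str, str] = {}


def book_appointment(patient: str, slot: str) -> dict:
    """Deterministic backend logic: reject double-booking, otherwise record the booking."""
    if slot in APPOINTMENTS:
        return {"ok": False, "error": f"slot {slot} already taken"}
    APPOINTMENTS[slot] = patient
    return {"ok": True, "slot": slot, "patient": patient}


# Tools the agent may call, each with a description a model could read.
TOOLS: dict[str, tuple[Callable[..., dict], str]] = {
    "book_appointment": (book_appointment, "Book a patient into a free time slot."),
}


def dispatch(tool_call_json: str) -> dict:
    """Execute an agent's tool call of the (assumed) form {"name": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        return {"ok": False, "error": f"unknown tool {name}"}
    fn, _description = TOOLS[name]
    return fn(**call["arguments"])


# Example: the same (made-up) call issued twice; the second hits the double-booking rule.
call = json.dumps({"name": "book_appointment",
                   "arguments": {"patient": "A. Patel", "slot": "2026-07-01T09:00"}})
first = dispatch(call)
second = dispatch(call)
print(first["ok"], second["ok"])  # True False
```

The design point is that the business rules stay in ordinary deterministic code; the agent only chooses which tool to call and with what arguments.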
Speaker B: As you describe that like I, I can see sort of like the challenges, the problems that a lot of folks are working on right now, which is, you know, you're starting to see a lot of these like agent first interfaces. How do we start to architect things in a way in which like that's the parent paradigm. And I think what's interesting when I'm thinking about what you're talking about is what does that mean then for somebody three months from now or two years from now in terms of what they're building, in terms of what's on the roadmap. So when you start to think about how this is impacting people's roadmaps, what are you seeing there or what are some of the implications that you've been reflecting on in terms of the three month near term roadmap to the two year longer term roadmap.
Speaker A: I think even with AI, it's hard to predict what's going to happen in even a two-year time span. The thing that I would bank on, and, uh, this is actually tied to even LiveKit, right? Like my company and the way that we work and what our roadmap is: the assumption that we're making is that AI is going to be doing more and more and more of the tasks, right? The core or rote or mundane or mechanical tasks involved in building products. And I think in two years it would be hard to imagine most if not all code not being written by AI. I don't think that's an exaggeration. I think there are some people out there who are smarter about this stuff than me, who are training the foundational models, right, who are even predicting something faster than that timescale. But, um, I do think that that is real and I do think you'll have agents that are able to do a lot of the functions, a lot of that core, rote, mechanical, mundane type of work. And then everybody suddenly becomes the orchestrator of it, the director of it, the product designer of it. And what we've been thinking about as well, from a business perspective, is what that means for companies and for founders: go-to-market becomes extremely important. Because in a world where any product can be replicated in a week, suddenly how you actually raise awareness for that product, communicate what that product does and the value of it, sell that product and service a customer that decides to buy or use that product, that becomes paramount. And maybe that eventually also becomes done by AI, I'm not really sure. But I think that most people's attention will shift suddenly to how they do go-to-market.
Speaker B: I can see versions of this happening. There are little flashes and signals of this happening already. Like, you know, there are some people who talk about experiments where they built a custom CRM with the functionality that they love, and they built it in a weekend. Maybe it's not necessarily enterprise grade, but it totally serves their core functionality. And you see the flashes of those types of projects pop up all the time. Like I was talking to somebody today and they built their own sort of family dashboard for managing recipes, errands and all this stuff. And so when you start to expand that to a two-year timescale where there are all these other things happening, you can start to see that. And so the go-to-market piece is, I think, super interesting. I want to take a quick swerve because we spent a lot of time talking about voice AI implications, but, you know, I don't think we've actually properly introduced LiveKit and why you have such a deep sense of context here. So maybe you can just give a quick intro to what LiveKit is, how LiveKit came to be, and how voice AI became such the focal point of your work right now?
Speaker A: Yeah, it's been a crazy journey. I'll tell the fast version of it. So LiveKit started during the pandemic. The realization that my co-founder and I had when the pandemic hit is that we were trying to build an application that needed audio and video streaming, because during the pandemic the only way that you could interact with other people was over the Internet. So how do you stream audio and video over the Internet to somebody else? Um, we were working on an application that needed to do that, and it turns out that the Internet wasn't really actually built for that. Most of the Internet's built on HTTP, and that stands for the Hypertext Transfer Protocol. Um, and it's a protocol that was made for transferring documents over a network. Right? You open your browser and you're trying to transfer a document from some server somewhere, the document being a webpage, you know, a document full of HTML content, um, and render that thing in your browser. And it wasn't built for voice and video streaming in real time. And so there are other protocols that are designed for that purpose. And that's kind of our first foray: we started to build infrastructure to stream audio and video for the app that we were building. Um, the app that we were building went nowhere. But, uh, the infrastructure, we decided to put it out as an open source project so that developers everywhere around the world could build those kinds of features using that infrastructure. During the pandemic, when we needed it desperately, it exploded on GitHub and we started to have really large companies using it and asking us for a commercial version, basically where we take the open source stuff that we built, the servers, uh, you know, and the clients, and we scale that up, so we deploy these servers all over the world and they run as a global network.
And you get massive scale, performance and reliability out of a system like that. So we raised a round and started to build LiveKit Cloud. LiveKit Cloud is a global network of our open source media servers, plus one piece of technology that we kept closed so that we could build a business at the same time as improving the open source. That piece of technology allows all of these servers around the world to communicate with one another, so they can route camera and microphone streams anywhere in the world with extremely low latency. We spent about two years working on that system and launched it, and then ChatGPT comes out. Like all of us, we were just so amazed by GPT-3.5 wrapped in an app; it felt like texting with a real-life person. So we thought, what if we use our audio and video streaming tech, pair it with ChatGPT, and build a computer you could talk to instead? What would that feel like? Would it feel like Samantha from Her? So we built a demo called KITT, named after Knight Rider, and a play on LiveKit too, of course. We put it out there and thought, all right, we're going viral for sure. It was the first time you could really have a multi-turn conversation with an AI model and talk to it like a person. And it got like 90 likes. It went nowhere. But five months later OpenAI found the demo, read the blog post, and signed up in secret with a personal Gmail address so we wouldn't know it was them. They started to build voice mode on top of LiveKit Cloud, and after a few weeks of getting a proof of concept together they pinged us and said, hey, we'd love to talk commercial and work together more closely.
So we started to work with them, and to this day we're still powering their voice mode, and their computer vision features too. When you tap the video button, you can share your screen or your camera feed of what you're seeing and have the AI tell you what you're looking at. There are all kinds of cool use cases there, like identifying species of plants, and all kinds of neat stuff enabled by this particular feature of ChatGPT. But I think that moment really changed the trajectory of the company. We were doing quite well in video conferencing and live streaming, which were the use cases we were used for during the pandemic. But through this initial use case with OpenAI, we saw a calling toward the future of how you're going to interact with a computer, and that we could play a big part in this paradigm shift toward computers that feel very natural and human-like to interact with. The line I always use is: if the frontier labs are building the brain, LiveKit is building the nervous system for that brain. So it's been a crazy ride ever since.
Speaker B: I know there's a whole scale angle to this. Somebody signs up with a personal email and then talks to you, and now you have to power the fastest-growing consumer technology ever. I'm sure you thought, we have a lot of things to figure out. So can you bring us a little bit into the scaling story behind this?
Speaker A: Yeah, that's when I stopped sleeping well. It was really crazy. When we started to work with them, we were just 20 people: a 20-person infrastructure company that had only raised a seed round. OpenAI moves extremely fast and they were on really tight timelines, so we had to get our ducks in a row very fast. From the very beginning, when we started a company around the open source project, we had conviction that this would be a fundamental shift in technology, and that we had an opportunity to be part of the network infrastructure, the network backbone, for a future where communication between people isn't getting slower; it's getting more and more real time. People aren't using less audio and video; they're only using more. So we architected the system for scale from the very beginning. That's not to say we planned out every scaling wall we would hit ahead of time. There's definitely a balance between time to market and how perfectly designed a system is for every tranche of scale you might hit. So we built with a certain level of scale in mind and figured, let's get to that scaling wall first before we try to get to the next one. Then OpenAI came in and we started to work with them, and we said, okay, this is going to be serious. We really have to focus and try to anticipate where things are going to fall over. The initial launch of voice mode was just for ChatGPT Plus users, the people paying for subscriptions; I don't know if they call it that anymore. So it was a significant amount of volume, but still smaller than the entire corpus of ChatGPT users. That was a graduated ramp, where we got to cross at least one threshold and see how the system was performing.
And whether there were any hotspots we had to work through or figure out how to get past. At that time, I'm trying to recall, there were maybe a couple of things we hadn't anticipated that we rallied on and had to rewrite, but we did it quite fast, in a matter of a week or two. Then we were off to the races. Then, toward December, OpenAI decided to release voice mode to all free users. That was the next challenge. That's a funny story too. I won't go into the whole thing, but my co-founder was off on vacation and I told the company, do not deploy anything this week; I just don't want any issues to come up. So there was a deployment freeze while my co-founder was out. And OpenAI decides to ratchet it up.
Speaker B: Yeah, it's like, we're going to open up all of the floodgates, full throttle. Let's do it.
Speaker A: Yeah, I had 15 minutes' notice. There's a sound bite there, but let's not do that one. So it was short notice and I panicked. I pinged the team and said, hey, are we going to be okay? They looked at the servers and some graphs in Grafana and said, yeah, we're good, we're good. And it just scaled. It was no problem at all. I was so proud of the team. Then the next month Character AI, which also runs pretty insane scale on voice, more than most people would guess, signed up and built voice mode for interacting with the AI personalities in their app. They ratcheted to a hundred percent straight out of the gate and took us down: a global outage. That was very painful. But we realized we had a bottleneck in our cross-continental state synchronization. The way LiveKit works, it's a completely distributed system, a mesh network with no single point of failure. Every data center operates independently, and they coordinate together. But because they operate independently, and a lot of these sessions span multiple data centers, you have to do some state synchronization across them. And we had a state synchronization bottleneck when two continents or geographies had to do that state transfer. We had to rally really quickly. I think we solved that one in a week or a week and a half, but we had to ask Character AI to ratchet down for a little bit while we solved it. We got past that one, and we haven't hit any scaling walls since for the core infrastructure. The thing that did end up breaking, though, was that they were doing such a high volume of sessions that they destroyed our analytics product.
So we had this whole system that records all the telemetry and analytics and gives you a breakdown of the session and how it's performing. I almost refer to it as the Matrix green rain. You're not in the session listening to what people are saying; there's no audio data in there or anything like that. But you're looking at all the telemetry: this person is talking to this person, here's how long the session has been going, and you get a breakdown of what bitrate their microphone is sending, how fast their network is performing, and whether they're getting any blips or stutters. So it's this outside-the-session, Matrix-style view that you learn how to read. And yeah, OpenAI's volume was so high that they just destroyed that thing. It took us, man, they were without an analytics product for a year. It took us about a year to rewrite that thing, make it really good, and make it scale to the moon. We just never foresaw that kind of crazy volume. The saving grace is that the telemetry part is very important, but not as important as the core infrastructure: can you connect, can you talk, and is that a great experience? That's the core thing, and that scaled pretty dang well. But we definitely had some pains downstream.
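The telemetry view described above, metrics about a session rather than its audio, can be illustrated as a small aggregation over per-interval measurements. This is a minimal sketch, not LiveKit's analytics pipeline: the `TelemetrySample` fields, the `summarize_session` helper, and the 5% packet-loss threshold for counting a "blip" are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TelemetrySample:
    """Hypothetical per-interval metrics for one participant's track; no media payload."""
    session_id: str
    participant: str
    bitrate_kbps: float
    packet_loss: float  # fraction of packets lost in the interval, 0.0 to 1.0


def summarize_session(samples, loss_threshold=0.05):
    """Group samples by participant; report mean bitrate and a count of 'blip' intervals.

    A blip here is any interval whose packet loss exceeds loss_threshold (an assumed rule).
    """
    by_participant = {}
    for s in samples:
        by_participant.setdefault(s.participant, []).append(s)
    return {
        participant: {
            "mean_bitrate_kbps": mean(r.bitrate_kbps for r in rows),
            "blips": sum(1 for r in rows if r.packet_loss > loss_threshold),
        }
        for participant, rows in by_participant.items()
    }


samples = [
    TelemetrySample("room-1", "user", 32.0, 0.00),
    TelemetrySample("room-1", "user", 28.0, 0.12),  # a lossy interval, counted as a blip
    TelemetrySample("room-1", "agent", 48.0, 0.01),
]
summary = summarize_session(samples)
```

For a voice AI workload, a summary like this runs over millions of small one-on-one sessions rather than a handful of large rooms, so the cost of the analytics system scales with session count, not just participant count.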
Speaker B: So, as you're sharing all of this, I'm thinking about how you started: there were some strategic decisions made at the beginning that set it up, like architecting for scale from the beginning. Then you discovered some other things along the way that, afterwards, seemed like an okay trade-off to make. Not perfect, but at least there was some flex. Taking this telemetry example, people were in sessions and there was maybe less visibility into their quality, but that's because the usage was just astronomical. So those are probably okay; those are the right kinds of scaling challenges, is what I assume.
Speaker A: The tricky part about that scaling task was that originally we were built for video conferencing and live streaming during the pandemic. Video conferencing means 50-person conferences, thousand-person conferences, or an all-hands with 5,000 people at a large company; live streaming is one person streaming to 10,000 or 50,000 people, something like that. That's a very different shape from a voice AI application, which is a one-on-one conversation, but you might have millions and millions of these one-on-one voice AI conversations going on at one time. Our core technology was built to handle any shape and any number of these sessions. But the analytics system wasn't designed with that prior, because it just wasn't the use case we were building for when we first started. So we had to reimagine how telemetry, analytics and logging should work for a world where it's mostly one-on-one conversations with AIs: how it scales out, how it's visualized, and all of that.
Speaker B: When I think about somebody building an infrastructure company, there's probably a bet: we're betting that this will be a key part of how the Internet works. And the optimistic, maybe unplanned-for dream is that it gets adopted right away and becomes an essential piece of some of the most important things that come out. That's kind of the dream path: you have a bet for how the Internet is going to shift and change, you're building toward it, and then boom, it happens. So the question is, what scale planning worked in that setup? When you look back, what was the decision where you say, that was a great call, and it helped out in that case?
Speaker A: I think the best decision we made was to go multi-cloud from the very start. We made that decision because we said to ourselves: if we are truly the backbone of the real-time Internet, the audio and video streaming Internet, and it becomes the default infrastructure, it becomes like a utility. It must always work; it cannot ever go down. So how do you architect a system such that it can never go down? I mentioned it earlier in passing, but having no single point of failure was one key decision we made there. Once you have a single point of failure, it becomes a risk to the system. A mistake made there, a bug, a particular architectural decision, or some kind of bottleneck can cause the entire system to fail. So that was really important: you have to take the no-single-point-of-failure approach and thread it through almost every decision you make at the infrastructure level, not just in the software but also in the hardware you depend on. Some cloud providers are very robust and rarely have outages. But even AWS has outages. Everybody has outages at some point. LiveKit will have an outage at some point; it's just statistically true. The longer time goes on, the less realistic it is to expect a system to be perfect forever. So, since we depend on cloud providers and deploy our servers across them, how do you make sure you don't take an outage when a cloud provider goes down? You have to build across multiple cloud providers.
You can't just depend on AWS; you also have to depend on GCP or Azure or DigitalOcean or Linode. You actually have to leverage all of them together. So this was another really important decision: we run an overlay network across all of them, and the system measures, in software, the performance of routing between them. Are there data centers from specific providers that are getting compromised, or services degraded in a particular area, and how do you route to a different provider in that area? How do you move users from a server or region that failed on one provider to another region, or to another provider in that region? We try to solve a lot of these problems in software. We knew upfront that we were going to take a lot of outages by coming out of the gate with a system like this, because it was novel and there were going to be issues we just did not have software mitigations for. We anticipated outage after outage after outage. That didn't make it any less painful; it was very painful. We had outage after outage in the early days, in the first year after we rolled out commercially. But we knew that long term it was the right decision. If you start with a system where you depend on a single cloud, maybe even on a database as a single point of failure, and then hit some level of scale where you have to transition to the kind of system I'm describing, which we built from the very beginning, that's extremely difficult to do while the plane is in flight. And presumably you're going to take some downtime in doing so. Taking downtime while large customers depend on you is kind of a death sentence for your brand.
So we just really didn't want that to happen if we were going to live up to the aspirations we had for the company and the project.
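The failover logic described above (measure each provider's health and latency in software, then move users to another provider in the same region, or to another region) can be reduced to a small selection function. This is a minimal sketch under stated assumptions, not LiveKit's routing code: the `Endpoint` record, the provider and region labels, and the rule "same region first, then lowest round-trip time" are hypothetical simplifications of what an overlay network would measure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    provider: str   # illustrative labels such as "aws" or "gcp"
    region: str     # e.g. "us-east"
    rtt_ms: float   # round-trip time measured by overlay probes (assumed input)
    healthy: bool   # False when probes mark this provider's region as degraded


def pick_endpoint(endpoints, user_region):
    """Choose where to place a user: the fastest healthy endpoint in their region on any
    provider, else the fastest healthy endpoint in any other region."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        raise RuntimeError("no healthy endpoints on any provider")
    in_region = [e for e in healthy if e.region == user_region]
    return min(in_region or healthy, key=lambda e: e.rtt_ms)


fleet = [
    Endpoint("aws", "us-east", 12.0, healthy=False),  # simulated provider outage
    Endpoint("gcp", "us-east", 18.0, healthy=True),
    Endpoint("azure", "eu-west", 85.0, healthy=True),
]
choice = pick_endpoint(fleet, "us-east")  # falls over to GCP in the same region
```

Because the in-region filter spans every provider at once, an outage at one provider degrades to a different provider nearby rather than to a cross-continent hop, which is the benefit of spreading each region over several clouds.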
Speaker B: You were talking about the tension of time to market, and this sits in tandem with that. It's a very intentional choice: this is an area that's mission critical for us to invest in as a business, so we'll take the time to do it right and go multi-cloud. The other side of this story is how it reflected the dynamic between you and your co-founder David, and how the engineering strategy matched business outcomes. In addition to multi-cloud being a huge piece of resilience and strategic architecture for the company, there was a greater arc of business strategy that it played into. So I was wondering if you could shade in that side of the story too, and how you and David navigated that conversation. Balancing time to market against the time to build this, plus the engineering strategy and the business outcomes: all of those are concerns driving the strategic decisions here. So can you bring us into some of those elements of the story?
Speaker A: For sure. I think I have a pretty unique setup as a founder and CEO. David and I have known each other for 20 years. We met in YC in 2007, starting different companies at the time; that was the fifth batch of YC. What's interesting is that David has been a founder multiple times. We actually did the company before LiveKit together as well, for about six and a half years; he was the CEO of that one and I was the CPO. This time we flipped: he's CPO and CTO at the same time, and I'm the CEO. Because we both have experience on both sides (we've both been engineers our whole careers, but we also both have business experience), we come at the engineering problems we solve with a business mindset, with some of our mind share on the business implications while also thinking about the engineering and the excellence there. Ultimately we're trying to build something that people trust and find a lot of value in. The open source part is maybe more product- and engineering-focused, but the cloud part is a commercially focused product. And for a commercially focused product, business strategy has to come into play and inform how the product is built: which problems do you solve now, and which do you kick down the road until later? How do you thread the needle between time to market and generating economic value for the company, versus building the perfect system? It's not meant to be a cop-out answer, but David has also been a CEO, and his YC company ended up getting acquired by Google. I had nothing to do with that company.
He's a solid business person in his own right. So I benefit from having someone who can think through that lens, versus me having to come up with business justifications that impact his engineering roadmap. That's just not how he thinks. We move together, in a way.
Speaker B: What I love about that story is the shared empathy and perspective, because for a lot of folks in our community making this transition, it's often the first time they're putting on the hat of someone moving from tech leadership to overall business strategy: how does this thing generate money to fuel all the other opportunities and work we're doing? Thinking about the business engine behind things is often a first. So getting a sense of what that relationship looks like, how to thread the needle of both, and how you can both shift perspectives is really interesting. Russ, we're a little over time. I have some rapid-fire questions, if you still have a couple of minutes to jump into those.
Speaker A: Yeah, hit me with it.
Speaker B: Okay, perfect. What are you reading or listening to right now?
Speaker A: I'm starting to read Hyperion, because you recommended it to me, Dan, just recently. The author of the book recently passed away, so rest in peace. It's a highly recommended sci-fi book that I think has influenced a lot of things downstream from it. So I'm excited to dig into it.
Speaker B: Oh man, I can't wait until you get into books two and three, when things get really weird. You start to get into alternate... it's also great, a kind of multiverse, alternate-timeline thing. I can't say any more or I'll spoil it for you. But I'm so grateful you're picking that up.
Speaker A: Thank you. Thank you for the rec. Yeah, yeah, I'm stoked for it.
Speaker B: Heck yeah. Next question. What is a tool or methodology that's had a big impact on you?
Speaker A: This one's going to be a boring answer, probably everybody's answer right now: Claude Code. Eight or nine months ago, I don't remember exactly when it came out in beta, I had a pretty detailed prompt describing a very complicated product from a company that's doing really well. I wanted to see if I could replicate it. I gave it the prompt and it completely failed. So I thought, bummer. Then over the holidays all my colleagues were raving about Opus 4.5 in Claude Code, and I thought, okay, I've got to check it out. Then people really started to rave about Opus 4.6, saying it had crossed some crazy threshold. So I took the same prompt, threw it into Opus 4.6, and it one-shotted it. It worked on the first try. I was just like, oh my gosh.
Speaker B: I do think it's important for people to have a couple of those complex prompts as a way to check in. I have a couple, because I have a few side business ideas I've been working on, and I've had similar experiences where I'm like, oh my gosh, I can't believe this is what got generated. I love that idea of taking a pulse of the progress. Okay, next question. We've spent a lot of time talking about voice AI, so it's okay if there's some nuance there you want to get into. But the question is: what is a trend you're seeing or following that's interesting or hasn't hit the mainstream yet?
Speaker A: That trend for me is agent factories. I don't know if that's the official term for it, by the way; Stripe calls them minions. It's kind of another layer. OpenClaw is a little bit of an example of it, sort of, but it's not really a factory. OpenClaw is a single agent that goes and does a bunch of work. I actually don't know what's happening under the hood, and I haven't used OpenClaw because I've been scared it's going to pwn me, so I have to admit I haven't dug too deeply into it. Agent factories is the idea that you have multiple agents that all work together in a, quote unquote, factory. Upstream, you might have agents that plan the actual features you want to build, breaking down the features you've specified for the product. Then you have agents that review the specs, then the software engineering agents that write the code, then an agent responsible for verifying that the code is clean, basically doing code reviews, and then agents that test the system to verify it functions and works properly. I think this whole concept is where the world is moving. I haven't played too much with it, but I'm starting to follow it.
Speaker B: Last question, Russ. Is there a quote or mantra you live by, or a quote that's resonating with you right now?
Speaker A: I live by a few quotes, but one we learn very early, and that I consider quite often in how I conduct myself, especially in the business world, is: treat others the way you want to be treated. It's something so basic that we learn it from a very young age, but I think we often forget it. So that's an important one that I think about often.
Speaker B: I think that's a really powerful message to leave with people who are taking the leap to start their own thing: remember that piece as you enter into this. I appreciate that as a conclusion for our conversation. Russ, this has been an absolute joy. Every time we talk, we end up diving into something paradigm shifting, whether it's Hyperion and how it shapes our picture of AI-driven entities and what that looks like behind the scenes, or now, imagining a voice-AI-first interface world and the implications for what people are building right now. It's been a ton of fun, so thank you for spending time with us.
Speaker A: Thanks for having me. I always enjoy our chats, so I hope we do many more of them in the future.
Speaker B: If you're listening to this and wondering how you can connect with other engineering leaders in your city, pull up your phone right now, go to elc.community, and click our Chapters page, which you can find in the menu on the left. Find your local chapter and click Join. We're hosting virtual and in-person events all the time, and this is the best way to get involved, expand your network in your city, and support your leadership and career growth. So pull up your phone, head to elc.community, join your local chapter and get involved. A huge thank you to all of our local leaders who make community happen, and thank you for listening to the Engineering Leadership Podcast.