How to Engineer AI Inference Systems with Philip Kiely - #766

The TWIML AI Podcast · 2026-04-30 · 55 min

Episode notes

In this episode, Philip Kiely, head of AI education at Baseten, joins us to unpack the fast-evolving discipline of inference engineering. We explore why inference has become the stickiest and most critical workload in AI, how it blends GPU programming, applied research, and large-scale distributed systems, and where the line sits between inference and model serving. Philip shares how research-to-production can move in hours, not months, and why understanding “the knobs” of inference - batching, quantization, speculation, and KV cache reuse - lets teams design better products and SLAs. We trace the inference maturity journey from closed APIs to dedicated deployments and in-house platforms, discuss GPU lifecycles, and survey today’s runtime landscape, including vLLM, SGLang, and TensorRT LLM. Finally, we look ahead to agents and multimodality, making the case for specialized, workload-specific runtimes when performance and efficiency matter most. The complete show notes for this episode can be found at .
