AI Networking

Next in Tech · 2026-06-09 · 24 min

Substance score

38 / 100

Five dimensions, 20 points each

Insight Density: 9 / 20

Originality: 7 / 20

Guest Caliber: 7 / 20

Specificity & Evidence: 9 / 20

Conversational Craft: 6 / 20

This episode explores how AI infrastructure is fundamentally reshaping networking requirements across data centers, wide area networks, and cloud environments, with emphasis on dynamic interconnection, automation, and the shift from standardized to specialized high-performance networking solutions. The discussion covers the acquisition of Alkira by Lumen, the evolution from InfiniBand to high-speed Ethernet, and emerging challenges around optical interconnect and network reliability for GPU-intensive workloads.

Key takeaways

Dynamic, automated interconnection between multiple cloud providers is now essential for AI workloads, driving acquisitions like Lumen's purchase of Alkira to enable software-driven network configuration.
GPU clusters require 400 Gbps to 800 Gbps and beyond of interconnect capacity with near-zero tolerance for loss, because network loss and tail latency leave expensive GPUs idle, tying network performance directly to training cost and speed.
Vendor-specific networking technologies like Nvidia's Spectrum-X and Cisco's Silicon One are replacing standardized approaches because hyperscale demand outpaces traditional standards processes, creating de facto proprietary ecosystems.
Co-packaged optics bundled directly into GPU devices offer significant power savings but face reliability and swappability challenges that are still being resolved across the industry.
Wide-area network focus is shifting toward automation, management, and orchestration layers using software-driven services to dynamically provision connectivity based on business needs rather than fixed infrastructure.
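
The takeaway about idle GPUs can be made concrete with a small sketch. It assumes a synchronous job in which every GPU waits for the slowest one before the next step starts, so step time is the maximum latency across workers rather than the average. The function names and all numbers here are hypothetical illustrations, not figures from the episode.

```python
import random


def step_time(latencies_ms):
    """A synchronous step finishes only when the slowest worker does."""
    return max(latencies_ms)


def idle_fraction(latencies_ms):
    """Share of total GPU-time spent waiting on the slowest worker."""
    slowest = step_time(latencies_ms)
    waiting = sum(slowest - t for t in latencies_ms)
    return waiting / (slowest * len(latencies_ms))


def simulate(num_gpus, base_ms, tail_ms, tail_prob, steps, seed=0):
    """Average idle fraction over many steps.

    Each worker independently hits the slow (tail) latency with
    probability tail_prob; all values are made-up inputs.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        latencies = [
            tail_ms if rng.random() < tail_prob else base_ms
            for _ in range(num_gpus)
        ]
        total += idle_fraction(latencies)
    return total / steps


# One slow transfer out of four stalls the other three workers.
print(idle_fraction([10, 10, 10, 40]))  # 0.5625
```

In this toy model a single delayed worker out of four leaves more than half of the step's GPU-time idle, which is the effect the episode describes as the whole chunk of GPUs sitting idle rather than just one.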

Guests

Mike Fratto

Topics in this episode

Broadcom, Alkira, Lumen, InfiniBand, Nvidia Spectrum-X, Cisco Silicon One, Ultra Ethernet Consortium, SONiC, Marvell, Data Center Bridging

What our scoring noted

Our reviewer’s read on each dimension, with quotes from the episode.

Insight Density

9 / 20

There are occasional useful technical points - such as how tail latency in GPU clusters causes entire job batches to sit idle rather than just triggering a retransmit, and the serviceability problem of co-packaged optics integrated into GPU devices - but the episode is padded with high-level recaps of industry trends rather than sustained novel analysis. Many observations arrive without follow-through depth.

it's not like you're losing one GPU potentially with tail latency, the entire chunk of GPUs that are being used for a job could end up just sitting idle

if that happens to be bundled into the GPU device that's within your system, that winds up being a little harder to swap out

Originality

7 / 20

The conversation largely tracks standard industry narratives - standards bodies being too slow for hyperscalers, merchant silicon giving way to tighter integration, open-source networking moving value up the stack - without offering contrarian or first-principles arguments. The co-packaged optics reliability angle has a modestly fresh framing but is not developed far enough to be a standout insight.

the traditional standards processes just simply can't keep up

it's not like this is they're competing over the general enterprise or service provider networks. It's really very much in these clusters

Guest Caliber

7 / 20

Mike Fratto is a credible networking-focused industry analyst at S&P Global with genuine domain knowledge, but both participants are researchers and observers rather than operators who have built or run AI networking infrastructure at scale. The episode is essentially two analysts comparing notes, which limits practitioner depth.

the analyst leading researcher on networking for us, Mike Fratto

I always enjoy coming on and sharing what I know and what I don't

Specificity & Evidence

9 / 20

The episode does name real companies, products, and events - Lumen/Alkira acquisition, Nvidia Spectrum-X, Cisco Silicon One, Broadcom cognitive ECMP, SONiC, Ultra Ethernet Consortium, ONUG, Vera Rubin - giving a reasonable scaffold of specificity, but there are no dollar figures, market-share data, customer metrics, or quantitative outcomes anywhere in the discussion.

Lumen acquired Alkira, a startup that is doing dynamic interconnect and automated stitching between cloud environments

400 gig is. Yeah, it's table stakes. 800 is here in some cases and even beyond

Conversational Craft

6 / 20

The host frequently pre-answers his own questions with long setups before Mike responds, and Mike consistently validates rather than challenges or extends the framing. There is no meaningful pushback, no productive disagreement, and the questions are mostly confirmatory rather than probing - resulting in a collegial chat rather than a rigorous interview.

Yeah, 400 gig is. Yeah, it's table stakes

Yeah, it's a good way to put it

Conversation analysis

Computed from the transcript - who did the talking, and the verbal tics along the way.

Share of words spoken

Speaker A: 56%
Speaker B: 44%

Filler words

so 31, uh 28, like 19, sort of 13, actually 11, right 10, kind of 9, you know 6, um 5, I mean 3, er 1, basically 1, literally 1

Episode notes

Networking can be an invisible part of IT infrastructure, but AI is creating demands that make it a critical part of keeping AI applications fed with data. Mike Fratto returns to the podcast to discuss both the long-haul and local requirements for AI networking with host Eric Hanselman. It's always been important to link chunks of infrastructure efficiently, but AI's voracious need for data has dramatically increased the scope and scale of the need. The risk that any gap in performance or capacity presents is that precious GPU resources will be idled, an increasingly expensive proposition. The reality of AI application architectures is that infrastructure is ever more hybrid, requiring access to repositories of data both on-premises and in various clouds, and to models scattered across various providers. The need for dynamic connectivity is driven by the rapid evolution of preferences for new models and the diversifying needs of agents to reach new data sources. It's not only forcing network expansion, but it's also driving M&A activity as network providers look to enhance automation in response to customer demands.

Full transcript

Transcribed and scored by The B2B Podcast Index.

Speaker A: Welcome to Next in Tech, an S&P Global podcast where the world of emerging tech lives. I'm your host, Eric Hanselman, chief analyst for industry research at S&P Global. And today we're talking about a different aspect, networking in AI, uh, and the different levels at which we're starting to have to think about networking differently now that we're headed into the guts of AI infrastructure. And with me to discuss it is the analyst leading researcher on networking for us, Mike Fratto. Mike, welcome to the podcast.

Speaker B: Great Eric, thanks for having me.

Speaker A: Thanks for being back on. And if we think about a lot of what's been happening in discussions around AI, everybody gets all tangled with GPUs and storage and all the other things that are happening. But AI has also forced a reassessment, maybe a rethink, um, possibly, of really how we build networks and what the requirements are for interconnect. We're really having to shift a bit in terms of how we're thinking about building networks.

Speaker B: Yeah, we are. And it touches everything from the data center to the wide area to the cloud. All of it is changing and has to adapt for these AI applications and use cases.

Speaker A: Well, we'd sort of gone through this period of thinking about hybrid and multi-cloud and what that transition looked like. And I think historically, if you look back at a lot of the Voice of the Enterprise data, there was this movement for ages that, um, enterprises wanted to get to where they just had one cloud provider, and then really, I don't know, two, three years ago, they realized that no, they actually were probably going to have more than one. And then AI, and now suddenly they've realized that they need all of the cloud providers and they all have to be interconnected.

Speaker B: Yeah. And that interconnection needs to be fast and reliable and all the requirements for any kind of distributed application. And it's not just the cloud providers, it's where are the AI applications being consumed? In the office, in the remote office, on the road. What's going on? You haven't mentioned agentic, but I'm sure it's going to come up at some point. How is agentic going to impact the architecture and the use and the utilization of the wide area and talking to all of these different cloud services? So it's really kind of a crazy time. And it's changing things that are happening from the IP layer, from layer three, from just interconnect and basic routing and switching, all the way up into the application layer where load balancers live, and application security, and all of those networking features that are above the wire, basically.

Speaker A: Well, uh, yeah. Bringing in security. Suddenly we've moved from something that was relatively simple, just in terms of doing application protection, to now having network-based security be something that actually has to be able to do semantic parsing of prompts in order to understand whether or not the prompt is malicious. That's a very different world. But, ah, I want to drill down, uh, on one of the things that actually is also starting to have some market activity, which is the need to manage this dynamic interconnection at multiple layers. And that's something where we've seen, in the data center interconnect and what we used to call the wide area network, the need to be able to go get configurable connectivity, uh, for data center environments and the enterprise environments that house most of the data that's needed for training. This is one of the things that was behind some M&A activity that's going on. Lumen acquired Alkira, a startup that is doing dynamic interconnect and automated stitching between cloud environments. That's an interesting shift, to actually have what has been a traditional interconnect provider now start to be dabbling in the dynamism that is the interconnect world.

Speaker B: Yeah, it's a strategy that Lumen's been pursuing for several years now, trying to create a service that's dynamic and can change very quickly and is on demand. And the other service providers are doing the same thing as well. And so that acquisition, what Alkira brings to the table is it's just a software layer, and so it provides not only that foundation for that sort of dynamism to take place, but the nature of what Alkira did, reaching deep into the cloud services, being able to configure the networking components in those cloud services literally in a hands-off kind of fashion, all of a sudden can be completely driven by software. So you can go in there, spin up all the routing and security components in AWS or GCP or whatever other supported cloud, and you're just not really thinking about it anymore. And in addition it gives Lumen this global reach. So it's not just in the US, it's anywhere, which is pretty fascinating.

Speaker A: And I think it starts to get back to the legacy challenge that has been levels of automation and the initial work with AI. Okay, great, you want to be able to dynamically connect data to do training. Agents, as you said, and we're definitely going to dig into agents, agents shift the scope and scale and speed of what has to be able to happen. And I, uh, think it gets to your point. In order to be able to capitalize on that kind of velocity, you have to get fully automated. And the network is a fundamental part of that.

Speaker B: Yeah, in supporting applications, the network needs to be fast and reliable and secure and all those features and capabilities that organizations want out of the network, but it also has to stay out of the way. And so, you know, one of the things that cloud and AI is bringing to the business is more and more people are able to type into a prompt and build an application that's talking to all these data sources. And so you never really, really know where things are going to be used or spun up or accessed. And so that's what's driving this velocity. It's not just that organizations now need to, in a very controlled, fixed manner, bring up a new pipe somewhere. It's that you've got some business analyst who's putting together a, uh, dashboard in their AI platform and they're just doing it on their own. And all of a sudden it's got to touch 10 different things. And so that's not IT driving that. And that's what I mean by the network just kind of has to stay out of the way. So if they can go in and go, I need all these things, and then they push go, the business analyst isn't driving network configuration, but what they're doing is triggering these automated processes that bring up these connections dynamically.

Speaker A: And those dynamic connections are going to chew up a lot, uh, of capacity that you got to be able to respond to.

Speaker B: Yeah, yeah, absolutely.

Speaker A: This is also something that if we look at the internal networking, this is something where AI is having a significant impact there. Although those are things that we'd kind of seen ramping up for a while thinking around both within data centers, the kind of capacity that's necessary to manage training. But that's another area where scale has started to become an issue. But this is something where uh, just the levels of data movement within clusters is now starting to consume dramatically higher levels of Internet requirements in terms of, or networking requirements in terms of interconnection. I mean, think about where this is headed. It's 400 gig used to seem like a lot of capacity and now, uh, geez, it's table stakes.

Speaker B: Yeah, 400 gig is. Yeah, it's table stakes. 800 is here in some cases and even beyond. And it's just the sheer capacity and the throughput is really insane. But the impact of loss is not just a retransmit time. In a training session or even an inferencing, it's time when expensive GPUs could be sitting idle. It's costing money and delaying that whole sort of run. So it's not like you're losing one GPU potentially with tail latency, the entire chunk of GPUs that are being used for a job could end up just sitting idle. So the impacts and the stakes are a whole lot greater.

Speaker A: When we've gotten to a point at which computational power has always been expensive. But as you're pointing out now we've got what is a dramatically more expensive resource, GPU capacity. As we start to go step up to ever and ever denser clusters to do the computational work that's around this keeping them fed, the cost for not keeping them fed winds up going up dramatically. If you can't get data into a cluster at sufficient speed to keep it fully occupied, that's costing you money right then and there.

Speaker B: Yeah, it is. And part of the architectural challenge is how do you build that network from the cluster front end all the way out to the wide area and beyond in such a way that you can get all the data to all of the endpoints reliably and not run into congestion and loss and that latency that we talked about and those other sort of network degradations. So there's a number of architectural and protocol and engineering challenges that the industry is facing, not only on the vendor side and the provider side, but also on the operation side of, uh, you know, how do we build and operate and run and what technology do we use and what are the standards and sort of what's that approach?

Speaker A: Well, you know, we, we came through the age of merchant silicon where we were able to build networking gear with a fairly standardized set of silicon providers who really were building out the core. And the big difference for a lot of these systems really was the overlying operating system in the switch or the router that was really handling this. That really has picked up significantly as well, in that now you've got silicon providers working very closely with the GPU providers to be able to go build platforms that can keep up with the capabilities they need. You've had, I guess, historically we had that compute cluster interconnect of InfiniBand, which was this very specialized, at the time the lowest-latency, interconnect that you could get to. Now we're shifting gears to actually adapting Ethernet to be able to carry on at these substantially higher speeds and to be able to give a lot of the latency and performance guarantees that clearly you got with some of these specialized networking technologies like InfiniBand, uh, and at ever higher rates of interconnect capacity. Once again, we've got ever tighter coupling between the different players in this environment, the Broadcoms, the Marvells, and now Cisco, now that Cisco is a silicon supplier as well. Now you've got such tight coupling of their capabilities, really being one more tightly integrated piece of this. Cisco Live is going on, and there's been a lot of other recent conferences that we've seen. The fact that Cisco is out there pitching the innovations of their silicon capabilities as a standalone device just seems to me like we're in very different territory right now.

Speaker B: Yeah, it is very different. And part of that difference is the demand, especially from the hyperscalers and the cloud scalers, which is where a lot of the revenue opportunity is right now, is so strong that the traditional standards processes just simply can't keep up. How long did Data Center Bridging take, for the whole sort of constellation of protocols, to come from inception to standards? And that couple of years might have been several years, I think, or more. I'm getting old and forgetting. But you know, I do recall that it was surprising that it was so fast for standards processes, which tend to be glacial, because they're being thoughtful

Speaker A: and it's a consensus process and it takes a while to sort out. But yeah, to your point, it took a while.

Speaker B: Yeah, it took a while, but now it's like we have all of these different technologies that are coming to market which are essentially proprietary. And I know, uh, that's considered a bad word, but they're proprietary, but they're effective within their own sort of ecosystem of technologies. And so there's going to be competition between different, uh, chip makers, different network vendors and so forth. Not so much because they want to lock in the technology, but because they've got different viewpoints and opinions for what's best and what's the most sort of effective path forward.

Speaker A: Well, in some ways you've now got a different community that's driving those requirements. You've got Nvidia specifying standards for interconnecting with Spectrum-X and things like that, because that's what they need to interconnect in their environments. And because you've got such a strong focus on being able to scale the cluster itself, a strong focus on a single vendor, all these capabilities, you've now got networking vendors who are all ensuring they can bolt up to Spectrum-X and the capabilities that are in there.

Speaker B: Yeah. And Cisco with their Silicon One and their fully scheduled fabric is a tightly closed technology. Broadcom has their, I forget what they call it, cognitive ECMP. Um, and you know, there are others, but they're definitely serving a very large but, uh, sort of limited, limited in scope sort of problem. So it's not like this is they're competing over the general enterprise or service provider networks. It's really very much in these clusters. Right. Of how do we.

Speaker A: Very dedicated clusters that are a, uh, huge concentration of computational power, networking density, power density, the entire thing all in a much more tightly integrated block, which in order to operate at these dramatically elevated rates you gotta be. Although I did notice that in some of the Cisco Live keynotes, the network troubleshooting problem that they happened to be working through was a spanning tree protocol problem, which, uh, I keep thinking we've put a stake in the heart of things like spanning tree, but I guess not. If it's on stage, it's Cisco Live. But while there's this greater focus on what that cluster capability looks like, we've also got a shift in the providers that are playing in this networking arena. One of the things that was interesting, uh, coming out of Dell Tech World is Dell is selling a lot of networking gear. And maybe, I don't know, seven, eight years ago, as we were moving through that merchant silicon piece, you had folks like Cumulus Networks who were going to give us an overlay operating system and you could run it on any hardware, and all the hardware was all going to be roughly the same because the silicon was all roughly the same. We're now at a point at which the silicon value is really high performance, and there's a shift towards looking at open source efforts around the software. So, diverting, you've got efforts like SONiC, which grew out of Microsoft's data center requirements for high performance switching, and you've got some part of this market right now saying, oh, the important part and the high value of networking is now moving up the stack, so the operating system for the switch or the router is no longer a big deal. Um, I guess differing viewpoints on how we're moving through open source and some of those generic networking options.

Speaker B: Yeah, uh, the options like SONiC are interesting. I think they're more interesting for those kinds of organizations that want to have more of an engineered solution or, I, uh, hate to call it white glove, but a managed solution. Right. This is what Dell does really well. They have a whole professional services engagement built around their networking and their sort of full stack engagements and that's where it fits in really well. We're using SONiC because it takes away a lot of the uncertainty of using this open source platform. Not that SONiC is unreliable, but there's all of the soft things that come along, like having to learn SONiC and implement it into your tooling and how to manage it and know how to troubleshoot it and all of those things. SONiC works out really well for those sorts of organizations that have the technical expertise and talent but need some help getting them along to a full deployment. I still don't think it's sort of

Speaker A: more along the managed distribution model, something like the Red Hat for networking kind of thing.

Speaker B: Yeah, with a little more of a managed or a little more of a professional engagement that goes along with it. Yeah, that seems to be the path

Speaker A: curated and where it seems to go. It's interesting to see that the, that Cisco is also offering SONiC as an option so.

Speaker B: Well they, yeah, I mean they've been, uh, offering SONiC for a long time. All the big network vendors offer SONiC because they want to cater to that hyperscale or cloud scale customer that wants to have SONiC or some other open source software on their silicon.

Speaker A: Uh, as always it's a matter of looking for options. And I guess one of the things, I was at the Open Networking User Group a few weeks ago and there was a discussion about when do we start to shift data center interconnect fully to optical. Because it's always been the promise for a long time, which is hey, fiber connections, you got power advantages, you got all sorts of goodness that starts to come along with that. And yet a lot of the participants in ONUG, some major financial services organizations, a set of major logistics companies, all sorts of folks are working on this stuff. They're still in the process of getting themselves fully onto optical options.

Speaker B: Did they say why that is? I suspect it's partly the availability of co-packaged optics and other optical components.

Speaker A: Well, especially when you look at top of rack, end of row, those sorts of things. Clearly there's a lot of good optical options there. But I think, I was actually doing a panel with both, uh, Broadcom and Marvell folks on there. They were talking about a couple different things and most of it was the readiness of the ecosystem, because you were mentioning co-packaged optics, the idea that you're now actually delivering optical interconnect directly to the device, to the GPU. It's one of the things that Nvidia has been talking about, and, uh, when they get up to Vera Rubin they're talking about actually starting to standardize some of the co-packaged optics that are around there. But you've got to be able to go prove out the problem. Right now, of course, if you look at optical interconnect throughout the data center, one of the challenges is reliability. So if you've got your various lasers that are in the plugins for the optical stuff, one fails, you can swap it out. If that happens to be bundled into the GPU device that's within your system, that winds up being a little harder to swap out. It seems like, uh, a lot of people are working on ensuring that the reliability of co-packaged optics is there, that you actually might be able to do swappable lasers. To me, the fact that there's still a lot of discussion around it says that maybe it's still not quite so mature as it needs to be to get there, but I guess we'll have to see where that shakes out.

Speaker B: Yeah, I think that, I think the benefits of co-packaged optics are going to drive the development of the technologies, just in power savings and all of the downstream power savings that derive from that. But yeah, there's still some kinks. It's interesting to think about some of the challenges of, if there's a failure in the hardware, all of a sudden you're limited, you may be limited in what you can do, right?

Speaker A: Yeah, it's, it's how, how far back do you have to go to actually check, resolve the problem and let alone diagnosing the problem and sorting this out. And it gets right back to that whole question that we started with which is how do you keep all of this really expensive GPU infrastructure actually running at full capacity? And that uh, that winds up being some of the biggest challenges. So what should we be looking for going forward? Curious what your take is on next stages and where we, you know, directionally what we should be looking at.

Speaker B: So I think with, with interconnect, uh, uh, across the wide area, I think the emphasis is going to be on the operations of the automation and management plane, of being able to connect all of the endpoints when they need to be connected, providing the right capacity, monitoring the traffic and being dynamic in how that ends up being driven, and then of course putting controls around that for things like cost, regionality and so forth. Data sovereignty all of a sudden becomes a big component; it has been, but it shows up. So I think sort of in the wide area, it's more about services and software running those services and how they're being configured. Within the data center, all the action is really on the front end, so the part that connects to the enterprise network. The backside, that's just a stack. And so I would expect to start seeing more hardware and software that supports some of the standards that are coming out of the Ultra Ethernet Consortium. That might drive some unity around the networking protocols and the traffic management protocols that end up being used. And for probably the foreseeable future, we're still looking at a lot of vendor-specific technologies until those standards start to come into play.

Speaker A: Well, it's coming back around to the point you made at the outset, which is if what you're trying to do is grind as much performance out of that capability as possible, we're going to be looking at specialized approaches and that comes down to the various sets of enhancements that are going to be more de facto, I guess, than standardized. One of those trade offs that we typically make.

Speaker B: Yeah, it's a good way to put it.

Speaker A: Cool. Well, this has been great, Mike. Thanks for all the insights.

Speaker B: Oh, thanks for talking with me and thanks for your insights. I always enjoy coming on and sharing what I know and what I don't.

Speaker A: Well, we got links to some of the research in the show notes to show the stuff that we do know and, uh, we'll go from there. But we're at time for this episode. Thanks to our audience for staying with us and thanks to our production team, including Sophie Carr, Ron Mudeishin, Keir Smith and Dylan Scheibel, um, on the marketing and events teams. If you like this episode, please subscribe or like us. I hope you'll join us for our next episode because there is always something next in tech.
