How Slack Rebuilt Its Search Index for 10 Million Daily Queries

The CTO Podcast with Fexingo · 2026-06-25 · 8 min

Substance score

37 / 100

Five dimensions, 20 points each

Insight Density: 11 / 20

Originality: 7 / 20

Guest Caliber: 3 / 20

Specificity & Evidence: 11 / 20

Conversational Craft: 5 / 20

Slack replaced the single Elasticsearch cluster behind its search, which handles 10 million queries a day, with a custom engine called 'Search It' built on time-based sharding. The team ran both systems in parallel for over a year, then migrated workspaces gradually. The episode reports a roughly 10x improvement in p99 latency (from over 2 seconds to under 200 milliseconds) and horizontal scalability.

Key takeaways

Time-based sharding - where each shard covers a specific time range, new data goes to active shards, and old shards are read-only - is presented as a better fit for time-skewed access patterns than a fixed number of shards per index, as in Elasticsearch.
Shadow traffic testing, where queries are sent to both old and new systems but only the old results are shown to users, enables safe validation of search results, latency, and error rates before production cutover.
Gradual migration by tenant (workspace-by-workspace) reduces risk compared to big-bang switches, allowing teams to isolate issues and maintain service availability during long infrastructure transitions.
Incremental shipping of value during multi-year migrations - releasing faster search for recent messages as wins - helps maintain team motivation and demonstrates progress to stakeholders.
Custom search engines are only justified when you have specific access patterns that off-the-shelf solutions don't fit and the resources to build and maintain distributed systems; most teams should use Elasticsearch or Algolia.
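
The workspace-by-workspace migration in the third takeaway can be sketched as a small routing layer. This is a minimal illustration under stated assumptions, not Slack's implementation: `SearchRouter`, `migrate`, `rollback`, and the backend callables are hypothetical names, and the episode does not describe how the real query router tracked migrated workspaces.

```python
from typing import Callable

# Hypothetical backend signature: (workspace_id, query) -> ordered result IDs.
Backend = Callable[[str, str], list[str]]


class SearchRouter:
    """Route each workspace (tenant) to the old or the new search backend."""

    def __init__(self, legacy: Backend, replacement: Backend) -> None:
        self._legacy = legacy
        self._replacement = replacement
        self._migrated: set[str] = set()  # workspaces already flipped over

    def migrate(self, workspace_id: str) -> None:
        """Flip one workspace to the new backend."""
        self._migrated.add(workspace_id)

    def rollback(self, workspace_id: str) -> None:
        """Send one workspace back to the legacy backend if it misbehaves."""
        self._migrated.discard(workspace_id)

    def search(self, workspace_id: str, query: str) -> list[str]:
        """Dispatch a query to whichever backend currently owns the workspace."""
        if workspace_id in self._migrated:
            return self._replacement(workspace_id, query)
        return self._legacy(workspace_id, query)
```

Because the routing decision is per workspace, a problem after a flip affects only that tenant and can be undone with a single `rollback` call.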

Guests

Luna

Topics in this episode

Slack · Elasticsearch · Search It · time-based sharding · shadow traffic testing · query router · index management · replication layer · tokenizer · workspace migration

What our scoring noted

Our reviewer’s read on each dimension, with quotes from the episode.

Insight Density

11 / 20

The episode covers genuine engineering concepts - time-based sharding, shadow traffic, parallel system operation, and workspace-level rollout - at a reasonable density for 8 minutes. However, several of these patterns are standard industry knowledge, and the explanations stay surface-level without revealing deeper tradeoffs or implementation details a practitioner couldn't infer independently.

Search It was designed around a 'time-based sharding' model. Each shard covers a specific time range - say, one week of messages. New data goes into a new shard automatically. Old shards become read-only.
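
A minimal sketch of the time-based sharding idea in the quote above, assuming one-week shards as in the speaker's "say, one week" example. The epoch and the function names are hypothetical and chosen only for illustration; the episode gives no implementation detail.

```python
from datetime import datetime, timedelta, timezone

# Assumption: one-week shards, following the "say, one week" example.
SHARD_SPAN = timedelta(weeks=1)
# Hypothetical origin for shard numbering (a Monday, in UTC).
EPOCH = datetime(2020, 1, 6, tzinfo=timezone.utc)


def shard_id(ts: datetime) -> int:
    """Map a message timestamp to the index of the shard covering it."""
    return (ts - EPOCH) // SHARD_SPAN


def shards_for_range(start: datetime, end: datetime) -> list[int]:
    """List every shard a query must read for messages in [start, end]."""
    return list(range(shard_id(start), shard_id(end) + 1))


def is_read_only(shard: int, now: datetime) -> bool:
    """Only the shard covering 'now' accepts writes; older shards are frozen."""
    return shard < shard_id(now)
```

Under this scheme, adding capacity means creating the next shard rather than reindexing existing data, and a search restricted to recent messages touches only the newest shards.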

They used a technique called 'shadow traffic'. For months, every search query was sent to both systems, but only the Elasticsearch result was shown to users.
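
The shadow-traffic pattern described in this quote could look roughly like the sketch below. It is an illustration under assumptions: `shadow_search`, the record fields, and the list-based log are hypothetical, and the shadow call runs inline for brevity, whereas a production setup would presumably keep it off the user's request path.

```python
import time
from typing import Callable

# Hypothetical backend signature: query -> ordered result IDs.
SearchFn = Callable[[str], list[str]]


def shadow_search(query: str, primary: SearchFn, shadow: SearchFn,
                  log: list[dict]) -> list[str]:
    """Serve the primary result; run the shadow backend only for comparison."""
    t0 = time.perf_counter()
    primary_results = primary(query)
    record = {
        "query": query,
        "primary_ms": (time.perf_counter() - t0) * 1000,
        "shadow_ms": None,
        "match": False,
        "error": None,
    }
    try:
        t1 = time.perf_counter()
        shadow_results = shadow(query)
        record["shadow_ms"] = (time.perf_counter() - t1) * 1000
        record["match"] = shadow_results == primary_results
    except Exception as exc:
        # A shadow failure is recorded but must never reach the user.
        record["error"] = repr(exc)
    log.append(record)  # compared offline: result sets, latency, error rates
    return primary_results
```

The caller only ever sees `primary_results`, so a bug or outage in the new system shows up in the comparison log instead of in the product.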

Originality

7 / 20

The episode is a competent retelling of a known Slack engineering case study but offers no contrarian angles, first-principles reasoning, or counterintuitive arguments. Observations like 'understand your access patterns before you commit to a technology' and 'sometimes the right solution is to build your own' are well-worn industry maxims.

It's a pattern that a lot of time-series databases use.

And it's a reminder that sometimes the right solution is to build your own, even when off-the-shelf tools exist.

Guest Caliber

3 / 20

There is no actual guest - Lucas and Luna are both hosts conducting a scripted explainer dialogue. The single practitioner reference ('I spoke with one of the engineers') is unnamed, unverifiable, and not present in the conversation, making this a narrated case study rather than an interview with a credentialed operator.

I spoke with one of the engineers who said the key was to ship incremental value.

Lucas: Alright, back to search.

Specificity & Evidence

11 / 20

The episode does include concrete metrics - p99 latency dropping from over 2 seconds to under 200 milliseconds, 10 million daily queries, a migration that took over two years, and over a year of parallel operation - which give it meaningful specificity. However, sourcing is opaque (no blog post, paper, or named engineer cited), and technical depth stops well short of implementation-level detail.

p99 latency was over 2 seconds during peak hours, and they had a growing number of timeouts.

the new system handles 10 million queries a day with p99 latency under 200 milliseconds. That's a 10x improvement.

Conversational Craft

5 / 20

The dialogue is clearly scripted, with Luna asking pre-planned prompts ('How so?', 'And once they were confident, how did they actually migrate users?') that function as narrative transitions rather than genuine inquiry. There is no pushback, no challenging of claims, and no moment of productive disagreement - the exchange reads as a rehearsed explainer with artificial turns.

Luna: So they were literally paying for two search infrastructures simultaneously. That's expensive.

Luna: So the hot shards stay small and fast. That's clever.

Conversation analysis

Computed from the transcript - who did the talking, and the verbal tics along the way.

Filler words

so 9 · like 4 · right 4 · actually 2 · sort of 1 · kind of 1 · literally 1 · honestly 1

Episode notes

Slack's search is used over 10 million times a day. But the original Elasticsearch-based index couldn't keep up as the platform grew to billions of messages. In this episode, the CTO Podcast breaks down how Slack's engineering team rebuilt its search infrastructure from the ground up - migrating from a single Elasticsearch cluster to a custom search engine called 'Search It'. We cover why they moved away from Elasticsearch, how they sharded data across hundreds of nodes, and the key trade-offs they made to keep search latency under 200 milliseconds. Lucas and Luna also discuss the human side of a multi-year migration: getting buy-in from leadership, managing developer burnout, and the moment the new search went fully live. If you're responsible for scaling a product that depends on fast, reliable search, this episode is packed with practical lessons.

Full transcript


Transcribed and scored by The B2B Podcast Index.

Lucas: Luna, I want to talk about a piece of infrastructure that most of us use every day but rarely think about until it breaks.

Luna: Let me guess - search on Slack?

Lucas: Exactly. Slack handles over 10 million search queries a day. That's more than what many dedicated search engines see. And for years, it ran on a single Elasticsearch cluster that was starting to buckle.

Luna: I remember the days when searching for an old message felt like a coin flip - sometimes it was instant, sometimes it timed out.

Lucas: Right. And the engineering team knew they had to rebuild. But this wasn't a simple upgrade - it was a full migration to a custom search engine they called 'Search It'. The whole thing took over two years.

Luna: Two years is a long time to be running a system you're actively replacing. How did they keep the old one running while building the new one?

Lucas: That's the fascinating part. They ran both systems in parallel for over a year. Every query went to both the old Elasticsearch cluster and the new Search It system. They compared results, latency, and reliability before they ever let Search It serve production traffic.

Luna: So they were literally paying for two search infrastructures simultaneously. That's expensive.

Lucas: Absolutely. But it was the only way to guarantee zero downtime. The team knew that if they flipped a switch and something went wrong, millions of users would lose access to their message history. So they took the cost hit.

Luna: What was wrong with Elasticsearch? It's not like it's a bad product. Lots of companies run it at scale.

Lucas: Elasticsearch is great - for certain workloads. But Slack's data has a very specific pattern. Users search mostly for recent messages, but they also need to occasionally find something from years ago. The access pattern is skewed, and Elasticsearch's sharding model wasn't a great fit.

Luna: How so?

Lucas: Elasticsearch uses a fixed number of shards per index. If you pick too few, you can't scale. Too many, and you waste resources on metadata overhead. Slack's data grows every day - billions of messages - so they were constantly reindexing, which caused performance spikes.

Luna: And the custom solution? How did Search It differ?

Lucas: Search It was designed around a 'time-based sharding' model. Each shard covers a specific time range - say, one week of messages. New data goes into a new shard automatically. Old shards become read-only. That means write traffic is always going to a small set of active shards, and reads can be distributed across many.

Luna: So the hot shards stay small and fast. That's clever.

Lucas: It's a pattern that a lot of time-series databases use. But applying it to a general-purpose search engine required building a lot of custom infrastructure. The team wrote their own query router, their own index management, and their own replication layer.

Luna: And all of that had to be reliable from day one. How did they test it?

Lucas: They used a technique called 'shadow traffic'. For months, every search query was sent to both systems, but only the Elasticsearch result was shown to users. The Search It result was logged and compared offline. They looked for differences in result sets, latency, and error rates.

Luna: I imagine they found a lot of edge cases. Queries with special characters, misspellings, that sort of thing.

Lucas: Exactly. Slack search has to handle all kinds of messy human input - emoji, code snippets, file names, channel names. The team had to tune the tokenizer and scoring algorithm for months to get parity.

Luna: And once they were confident, how did they actually migrate users?

Lucas: They did it gradually - workspace by workspace. Each workspace was a tenant. They could flip a single workspace to Search It and monitor its behavior before moving the next. That way, if something went wrong, only that workspace was affected.

Luna: So the migration took months, not a weekend.

Lucas: Right. And the team had to manage a lot of complexity: keeping the two systems in sync, handling backfill for older data, and maintaining backwards compatibility for the API. It was a massive engineering effort.

Luna: What about the human side? Two years is a long project. How did they keep the team motivated?

Lucas: That's a great question. I spoke with one of the engineers who said the key was to ship incremental value. They didn't wait until the whole system was done. They released pieces - like a faster search for recent messages - and celebrated those wins. That kept the team energized.

Luna: And leadership? Getting buy-in for a multi-year, expensive migration can't be easy.

Lucas: They had data on their side. The old system was hitting limits: p99 latency was over 2 seconds during peak hours, and they had a growing number of timeouts. Leadership understood that if they didn't fix it, the product would suffer.

Luna: So it was a bet on the future. And it paid off - Slack search today is much faster and more reliable.

Lucas: Yeah, the new system handles 10 million queries a day with p99 latency under 200 milliseconds. That's a 10x improvement. And it's horizontally scalable - they can add capacity by adding shards, not by reindexing.

Luna: It's a great case study in how to think about search infrastructure at scale. But also in how to manage a long-running migration.

Lucas: Honestly, that's the part that often gets overlooked. The technical design is important, but the project management - running two systems, shadow testing, gradual rollout - that's what made it successful.

Luna: And it's a reminder that sometimes the right solution is to build your own, even when off-the-shelf tools exist.

Lucas: Yeah, but you have to be honest about the cost. Building a custom search engine is not something most companies should do. Slack had the resources and the specific need. For many teams, Elasticsearch or Algolia is perfectly fine.

Luna: So the lesson is: understand your access patterns before you commit to a technology.

Lucas: Exactly. And be willing to invest in the right long-term solution, even if it means a painful multi-year migration.

Luna: I want to quickly mention something that's top of mind for us. This podcast exists because of listeners like you - and a small group of them actually chip in monthly to keep it ad-free.

Lucas: Yeah, and we really appreciate that. If you find these deep dives useful for your own work, you can support us at buy me a coffee dot com slash fexingo. It's what keeps us going.

Luna: And we're able to focus on exactly the topics that matter to operators and builders. So thank you to everyone who already does.

Lucas: Alright, back to search. One thing I didn't mention is that the migration also forced Slack to rethink its backup and disaster recovery strategy. The old Elasticsearch cluster had a single point of failure. The new system is fully distributed, with data replicated across availability zones.

Luna: That's a nice side effect of the redesign. You fix one problem and you end up fixing others.

Lucas: Exactly. And that's the kind of outcome that makes the multi-year effort worth it. Thanks for listening, and we'll be back next week with another deep dive.
