Introduction
Search is one of the most critical features in any messaging platform. For Discord, a service handling billions of messages across its communities, keeping search fast and reliable is a serious challenge. The company originally built its search infrastructure on Elasticsearch, with Redis managing the indexing queue. That setup worked well until the platform's message volume outgrew it.
As Discord’s growth exploded, Redis started failing under pressure, and the entire pipeline became fragile. In response, Discord engineers made a major architectural shift. They moved away from Redis and rebuilt the entire search system on Kubernetes.
This article breaks down what went wrong, why Redis wasn’t enough, and how Kubernetes made the search system better.
The Problem: Redis Couldn’t Keep Up
In the early days, Redis served as a buffer: incoming messages were queued in Redis and handed to Elasticsearch in batches for indexing (which is what makes them searchable). But once the platform started handling huge volumes of messages every second, Redis just couldn't keep up.
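To make the original pattern concrete, here is a minimal sketch of a Redis-backed indexing queue in Go. It is purely illustrative, not Discord's code: the queue key, index name, batch size, and local addresses are all assumptions.

```go
// indexer.go - illustrative sketch of the original pattern: producers push
// messages onto a Redis list, and a worker drains them in batches into
// Elasticsearch's bulk API.
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"

	"github.com/redis/go-redis/v9"
)

const (
	queueKey  = "search:index-queue"              // hypothetical queue name
	batchSize = 500                               // hypothetical batch size
	esBulkURL = "http://localhost:9200/_bulk"     // hypothetical ES endpoint
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	for {
		batch := drainBatch(ctx, rdb)
		if len(batch) == 0 {
			continue
		}
		// If Elasticsearch is slow or down, this call blocks or fails while
		// the Redis list keeps growing in memory behind it. In this naive
		// sketch a failed batch is simply lost, mirroring the problem.
		if err := bulkIndex(batch); err != nil {
			fmt.Println("bulk index failed:", err)
			time.Sleep(time.Second)
		}
	}
}

// drainBatch pops up to batchSize raw message documents from the queue.
func drainBatch(ctx context.Context, rdb *redis.Client) []string {
	var batch []string
	for len(batch) < batchSize {
		res, err := rdb.BRPop(ctx, 2*time.Second, queueKey).Result()
		if err != nil { // timeout: flush whatever we have so far
			break
		}
		batch = append(batch, res[1]) // res[0] is the key, res[1] the value
	}
	return batch
}

// bulkIndex writes one batch to Elasticsearch using the NDJSON bulk format.
func bulkIndex(docs []string) error {
	var buf bytes.Buffer
	action, _ := json.Marshal(map[string]any{"index": map[string]any{"_index": "messages"}})
	for _, doc := range docs {
		buf.Write(action)
		buf.WriteByte('\n')
		buf.WriteString(doc)
		buf.WriteByte('\n')
	}
	resp, err := http.Post(esBulkURL, "application/x-ndjson", &buf)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("bulk request failed: %s", resp.Status)
	}
	return nil
}
```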
Here's what went wrong:
- Too Many Messages, Not Enough Power: If Elasticsearch had a problem, Redis had to wait, and while it waited, more messages kept arriving. Redis didn't have enough CPU or memory to hold everything, so the queue would fill up, Redis would get overwhelmed, and eventually it would start dropping messages (a toy illustration of this backpressure appears after this list).
- No Room for Safe Updates: The system had grown so fragile that updating software or fixing bugs meant taking everything offline. For example, when a major security patch was needed, Discord had to shut down the search system to apply it.
- Scaling Didn't Work Well: As Discord grew, adding more Redis servers didn't always help. The queue had no built-in way to spread traffic or recover quickly from failures, which made it harder to keep things running smoothly.
- No Separation Between Groups: Every Discord server (guild), big or small, shared the same Redis queue. If one large community caused issues, it could slow things down for everyone; there was no way to isolate workloads or stop a single problem from rippling through the whole system.
- Hard to Monitor and Debug: When things went wrong, the Discord team had to spend a lot of time just figuring out what caused the issue. This happened because Redis was not able to offer much visibility into how messages moved through the system or where they were getting stuck. This made troubleshooting slow and frustrating, especially during high-traffic times.
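The first failure mode above is easy to reproduce in miniature. The toy sketch below (standard library only, not Discord's code) models a fixed-capacity queue in front of a slow indexer: once the buffer fills, new messages are simply dropped, which is essentially what happened when Elasticsearch stalled behind Redis.

```go
// backpressure.go - toy model of a bounded queue in front of a slow consumer.
package main

import (
	"fmt"
	"time"
)

func main() {
	queue := make(chan string, 100) // fixed capacity, like a memory-bound queue
	dropped := 0

	// Slow "indexer": drains far more slowly than messages arrive.
	go func() {
		for msg := range queue {
			_ = msg
			time.Sleep(10 * time.Millisecond)
		}
	}()

	// Fast producer: 1,000 messages arriving almost at once.
	for i := 0; i < 1000; i++ {
		select {
		case queue <- fmt.Sprintf("message-%d", i):
		default: // queue full: the message is lost
			dropped++
		}
	}
	fmt.Printf("dropped %d of 1000 messages\n", dropped)
}
```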
This case comes up often in the Redis vs. Kubernetes scaling debate, but the framing is a little misleading. Redis itself was never the problem; it simply wasn't designed to act as a durable, high-throughput indexing queue at the scale Discord had reached.
The Solution: Kubernetes and a Complete Rebuild
The Discord team knew that they needed to start fresh. Simply patching the old system wouldn't help anymore. Using Kubernetes and modern automation tools, they completely redesigned their search system, from message flow to deployment, to handle the scale. Here’s what they did:
Step 1: Replace Redis with Pub/Sub
Redis was removed from the indexing pipeline entirely. In its place, Discord built a new queue on Pub/Sub (publish/subscribe) messaging, which is better suited to durable, large-scale delivery. The new system batches messages intelligently and routes them to the right index without getting stuck (a sketch of the publishing side follows the list below).
With this change:
- No more lost or dropped messages.
- Smarter message handling.
- More stable performance during traffic spikes.
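The article doesn't name a specific broker, but Google Cloud Pub/Sub is a common choice for this pattern. The sketch below assumes that broker and shows what the publishing side might look like; the project ID, topic name, and message fields are all hypothetical.

```go
// publisher.go - hypothetical publishing side of a Pub/Sub-based indexing queue.
package main

import (
	"context"
	"encoding/json"
	"log"

	"cloud.google.com/go/pubsub"
)

type indexEvent struct {
	GuildID   string `json:"guild_id"`
	ChannelID string `json:"channel_id"`
	MessageID string `json:"message_id"`
	Content   string `json:"content"`
}

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "example-project") // assumed project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	topic := client.Topic("message-index-events") // assumed topic name

	payload, _ := json.Marshal(indexEvent{
		GuildID:   "1234",
		ChannelID: "5678",
		MessageID: "9012",
		Content:   "hello world",
	})

	// Attributes let downstream consumers group messages by destination
	// index without unmarshalling every payload.
	result := topic.Publish(ctx, &pubsub.Message{
		Data:       payload,
		Attributes: map[string]string{"guild_id": "1234"},
	})

	// Publish is asynchronous; Get blocks until the broker acknowledges the
	// message, so a confirmed message is not silently lost the way a full
	// in-memory queue drops it.
	id, err := result.Get(ctx)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("published message %s", id)
}
```

On the consuming side, a subscriber would typically pull these events, group them by destination index, and bulk-write each group to Elasticsearch, which is the "smart grouping" described above.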
Step 2: Running Search on Kubernetes
Discord was already using Kubernetes to manage other backend services, such as its chat servers. Now it runs search there, too.
They also split the big search system into smaller clusters (called “cells”). These clusters could be managed independently, making the whole system easier to scale and fix.
Here’s what Kubernetes helped them do:
- Update Without Downtime: New versions and fixes could be rolled out safely while the system kept running (see the readiness-probe sketch after this list).
- Automatic Maintenance: Tasks like operating system updates no longer had to be done manually.
- Keep Problems Contained: If one cluster failed, others kept running just fine.
- Use Resources Wisely: Each part of the system got only the memory and computing power it needed.
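One concrete ingredient behind zero-downtime updates is the readiness probe: during a rolling update, Kubernetes only shifts traffic to a new pod once it reports ready, and only then retires the old one. Here is a minimal sketch of such endpoints in Go; the paths, port, and warm-up logic are assumptions, not Discord's code.

```go
// health.go - minimal liveness/readiness endpoints a Deployment's probes could hit.
package main

import (
	"net/http"
	"sync/atomic"
	"time"
)

var ready atomic.Bool

func main() {
	// Simulate warm-up work (e.g. connecting to the search cluster)
	// before declaring the pod ready to receive traffic.
	go func() {
		time.Sleep(5 * time.Second)
		ready.Store(true)
	}()

	// Liveness: the process is up.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: the pod can serve search traffic. During a rolling update,
	// Kubernetes waits for this to return 200 before shifting load over.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if !ready.Load() {
			http.Error(w, "warming up", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	http.ListenAndServe(":8080", nil)
}
```

The corresponding Deployment would point its readinessProbe at /readyz, so a bad rollout never receives live search traffic.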
They also created special clusters for massive communities (called “Big Freaking Guilds” or BFGs) that needed more power to search across billions of messages.
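The article doesn't describe how guilds are assigned to cells, but the idea can be sketched as a simple routing function: most guilds hash onto a shared pool of cells, while a known set of BFGs is pinned to dedicated, larger cells. Every name and number below is hypothetical.

```go
// cells.go - illustrative routing of a guild's search index to a cell.
package main

import (
	"fmt"
	"hash/fnv"
)

const numSharedCells = 16 // assumed number of general-purpose cells

// Hypothetical set of "Big Freaking Guilds" pinned to dedicated cells.
var bfgCells = map[string]string{
	"guild-very-large-1": "cell-bfg-0",
	"guild-very-large-2": "cell-bfg-1",
}

// cellFor returns the cell that owns a guild's index. BFGs get their own
// dedicated cells; everyone else is spread across the shared pool, so a
// problem in one cell stays contained to the guilds it hosts.
func cellFor(guildID string) string {
	if cell, ok := bfgCells[guildID]; ok {
		return cell
	}
	h := fnv.New64a()
	h.Write([]byte(guildID))
	return fmt.Sprintf("cell-%02d", h.Sum64()%numSharedCells)
}

func main() {
	fmt.Println(cellFor("guild-very-large-1")) // cell-bfg-0
	fmt.Println(cellFor("guild-12345"))        // one of cell-00 .. cell-15
}
```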
The Results:
- Indexing speed doubled.
- Search latency dropped from 500ms to under 100ms.
- Zero downtime during system updates.
- New features like cross-message search became possible.
This Discord Kubernetes migration story shows how rethinking infrastructure can unlock performance and reliability at scale.
Conclusion
Discord's story makes one thing clear: infrastructure that can't keep up with growing demand eventually gets replaced. Redis, as the indexing queue, couldn't handle the scale Discord reached, while Kubernetes offered the flexibility, resilience, and control needed to rebuild search from the ground up.
If your business is also running into similar limitations while scaling up, it might be time to rethink your architecture. Moving your workloads to Kubernetes can help, and if the move feels too complex to handle alone, working with a Kubernetes consulting company can make that transition smoother and more reliable.