Ever wonder how Google avoids crawling the same page twice across billions of URLs? Or how your browser blocks phishing links instantly without downloading a blacklist the size of a dictionary? Or how spam filters flag dodgy senders before the email even loads?
Behind all these real-time decisions is one quiet principle: systems need to ask, “Have I seen this before?” But at scale, that’s not a simple yes/no question. And answering it wrong — or slowly — can cost millions.
The Real Cost of "Remembering"
At small scale, you can use a hash set or a map to keep track. These give you lightning-fast lookups. But try storing a billion SHA-256 hashes — that's 32GB of raw memory. And that’s just the hashes, not the metadata, not the index, not the structure.
Want to ship that to every mobile browser? Or store it in L1 cache on your server? Good luck.
So big systems flip the question. Instead of asking, “Is this in the set?”, they ask: “Can I say for sure this is not in the set?”
This small shift leads us to a powerful, counterintuitive tool: the Bloom Filter.
Let Me Introduce You to the Maybe-Machine
Bloom Filters don’t aim for perfect accuracy. Instead, they give you a fast, memory-efficient answer:
- If something isn’t in the set → they’ll say so with complete certainty.
- If something might be in the set → they say “maybe.”
That tradeoff is deliberate. It's what makes Bloom Filters ridiculously cheap, fast, and useful.
Imagine a bouncer who isn’t checking names, just checking definitely not names. Anyone who fails the test gets turned away. Everyone else gets passed to a more serious security check — your database, your backend, your full set.
How They Actually Work
Let’s walk through this. Suppose you’re Instagram and want to check if a user has seen a particular reel before. Storing every reel ID per user is a memory nightmare. So instead, you use a Bloom Filter.
It’s just two things:
- A bit array of size `m` (all zero at the start)
- `k` independent hash functions
To insert a reel ID `X`:
- You run `X` through all `k` hash functions:
  - h₁(X) → index₁
  - h₂(X) → index₂
  - ...
  - hₖ(X) → indexₖ
- Each gives you a position in the bit array, and you set those positions to 1.
To check if another reel `Y` has been seen:
- Hash it again using the same `k` functions.
- Check those bit positions:
  - If any bit is 0 → `Y` is definitely new.
  - If all bits are 1 → maybe seen, maybe not.
And that’s it. There’s no list of IDs. No map. No real identity. Just a pattern of flipped bits.
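Here’s that flow as a minimal Python sketch. Everything below (the class name, the reel IDs, and the salted-SHA-256 trick used to simulate `k` independent hash functions) is my own illustration, not a production implementation:

```python
import hashlib

class BloomFilter:
    """A minimal Bloom Filter sketch. Salting one SHA-256 hash k ways is a
    common trick for simulating k independent hash functions; it's an
    assumption here, not something any particular system prescribes."""

    def __init__(self, m: int, k: int):
        self.m = m              # number of bits
        self.k = k              # number of hash functions
        self.bits = [0] * m     # the bit array, all zero at the start

    def _indexes(self, item: str):
        # Derive k positions by hashing the item with k different salts.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item: str) -> None:
        for idx in self._indexes(item):
            self.bits[idx] = 1  # flip the chosen switches on

    def might_contain(self, item: str) -> bool:
        # Any 0 bit means definitely new; all 1s means only "maybe".
        return all(self.bits[idx] for idx in self._indexes(item))

bf = BloomFilter(m=10_000, k=7)
bf.add("reel_42")
print(bf.might_contain("reel_42"))   # True (a "maybe", here a true positive)
print(bf.might_contain("reel_999"))  # almost certainly False (definitely new)
```

Notice that nothing like a stored ID exists anywhere: `add` only flips bits, and `might_contain` only reads them.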
Think About It Like This
Imagine a long row of switches (bits), all off.
Each new item you insert flips a few switches to "on," chosen by hashing. When checking later, if your item’s switches are all on, it might have been there before — or maybe someone else flipped those same switches.
What Happens If Two Items Hash to the Same Positions?
Then you’ve got a collision. But that’s not a bug — that’s inevitable. You're compressing a huge space (like all URLs) into a fixed-size array. By the Pigeonhole Principle, collisions must happen.
This is where false positives come from. If item A and item B flip the same `k` bits, the filter can’t tell them apart anymore. Identity gets blurred. That’s the cost of compression — and it’s baked into the model.
Once a bit is shared, it’s communal. You don’t know who flipped it anymore.
But the tradeoff is worth it — you get blazing-fast queries in a tiny amount of memory.
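You can watch the blurring happen by shrinking the filter. Reusing the `BloomFilter` sketch from above (the item names are made up), a deliberately undersized bit array produces a false positive almost immediately:

```python
# Reusing the BloomFilter sketch from the previous section.
bf = BloomFilter(m=32, k=2)          # tiny on purpose, so bits get shared fast
for i in range(10):
    bf.add(f"url-{i}")

# Hunt for an item we never inserted that the filter still says "maybe" to:
for i in range(10, 100_000):
    if bf.might_contain(f"url-{i}"):
        print(f"false positive: url-{i} was never added")
        break
```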
Okay, But How Likely Are False Positives?
Let’s do some math. Say:
- Bit array size `m = 10,000`
- You insert `n = 1,000` items
- Use `k = 7` hash functions
The formula for false positive probability is:

p ≈ (1 − e^(−kn/m))^k

Plugging in:

p ≈ (1 − e^(−7·1000/10000))^7 = (1 − e^(−0.7))^7 ≈ 0.0082
So about a 0.82% chance of false positive — using just 1.25KB of memory.
That’s insanely efficient.
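Don’t take my word for it — the arithmetic fits in a few lines of Python:

```python
import math

m, n, k = 10_000, 1_000, 7
p = (1 - math.exp(-k * n / m)) ** k
print(f"false positive rate ≈ {p:.4%}")  # ≈ 0.8194%
print(f"memory: {m // 8} bytes")         # 1,250 bytes ≈ 1.25KB
```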
Tuning Matters
- Add more items → more bits flip → more false positives.
- Flip too many bits → everything looks like a “maybe” → filter gets saturated.
- Use too few hash functions → less precision.
- Use too many → wasteful and overlapping.
There’s even an optimal value for `k`:

k = (m/n) · ln 2

With the numbers above (m = 10,000, n = 1,000), that gives k ≈ 6.93, which is why the example uses 7. Tune it based on how many items you expect to store and how much space you can spare.
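In practice you usually work backwards: pick a target false positive rate, then solve for `m` and `k`. A small helper using the standard sizing formulas (the function name is mine):

```python
import math

def bloom_params(n: int, p: float) -> tuple[int, int]:
    """Return (bits m, hash count k) for n expected items and a target
    false positive rate p, via the standard sizing formulas:
    m = -n * ln(p) / (ln 2)^2  and  k = (m / n) * ln 2."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = max(1, round(m / n * math.log(2)))
    return m, k

print(bloom_params(1_000_000, 0.01))  # (9585059, 7): ~1.2MB for a million items at 1%
```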
A Few Sharp Edges
You can’t delete from a basic Bloom Filter. If item A and item B both flipped the same bits, and you try to “unflip” those bits after deleting A — you might accidentally delete B’s trace too.
That’s why deletion is impossible in standard Bloom Filters.
If you really need deletion, you’ll need a Counting Bloom Filter — where instead of a bit, each position stores a counter. But you lose the original simplicity and memory efficiency.
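A Counting Bloom Filter is a small twist on the earlier sketch: swap each bit for a counter, increment on insert, decrement on delete. Here’s a sketch under the same salted-hash assumption as before (with the usual caveat that you may only remove items you actually added):

```python
import hashlib

class CountingBloomFilter:
    """Counting variant: each slot holds a counter instead of a bit, so
    deletes become possible, at the cost of several bits per slot."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counts = [0] * m

    def _indexes(self, item: str):
        # Same salted-SHA-256 assumption as the basic sketch.
        for i in range(self.k):
            yield int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % self.m

    def add(self, item: str) -> None:
        for idx in self._indexes(item):
            self.counts[idx] += 1

    def remove(self, item: str) -> None:
        # Only safe for items that were actually added: removing a stranger
        # decrements someone else's counters and corrupts the filter.
        for idx in self._indexes(item):
            if self.counts[idx] > 0:
                self.counts[idx] -= 1

    def might_contain(self, item: str) -> bool:
        return all(self.counts[idx] > 0 for idx in self._indexes(item))
```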
Also: Bloom Filters aren’t cryptographically secure. In adversarial settings (e.g., BitTorrent), attackers have faked presence by spoofing bit patterns — leading to denial-of-service style distractions.
So Why Doesn’t Redis Bake Them In?
Bloom Filters never made it into the Redis core — not because they’re useless, but because they’re specialized. Instead, they live in a module: RedisBloom. This lets power users access not just Bloom Filters, but more advanced probabilistic structures — Cuckoo Filters, Count-Min Sketch, Top-K, etc.
It’s a reminder: Bloom Filters aren’t silver bullets. They’re sharp tools — if you know when to use them.
The Bottom Line
Bloom Filters aren't trying to be accurate — they’re trying to be fast and cheap. They let you say “no” with confidence, and “maybe” with grace. They compress identity into rough patterns. They don’t track, they approximate. And that’s exactly why they work at scale.
So next time you build something that needs to track millions of things with limited memory — ask yourself if a perfect answer is worth the price. Because sometimes, “probably” is good enough to be brilliant.
☕️ If this post helped Bloom Filters bloom in your brain — or just made you whisper “nerdy but nice” — consider buying me a coffee. Unlike a Bloom Filter, I won't forget you. Unless you hash to the same index as someone else. Then... we’ll never know 🤷‍♂️.