Bloom Filters Explained
Some data structures are elegant in an unexpected way — not because they're perfect, but because they're deliberately imperfect in a very useful way. Bloom filters are exactly that. They're a tool that trades a small chance of being wrong for massive gains in speed and memory. Once you understand them, you'll start seeing problems they're quietly solving everywhere.
The Problem They Solve
Imagine you're building a system that needs to answer one question millions of times per second: "Have I seen this thing before?"
Maybe you're a web crawler checking whether a URL has already been visited. Maybe you're a database avoiding expensive disk lookups for keys that don't exist. Maybe you're a spam filter checking whether an email address has been flagged before.
The naive solution is a hash set — store everything you've seen and check against it. That works, but at scale it gets expensive fast. Millions of URLs or email addresses take up a lot of memory, and memory is not free.
A Bloom filter solves this by answering the same question using a fraction of the space — with one important caveat we'll get to in a moment.
How It Works
A Bloom filter is built from two things:
- A bit array — a long list of bits, all initialised to
0 - Multiple hash functions — functions that take an input and return a position in the bit array
When you add an item, you run it through each hash function, get a set of positions, and flip those bits to 1.
When you check whether an item exists, you run it through the same hash functions, look at those positions, and ask: are all of those bits 1?
- If any bit is
0→ the item is definitely not in the set. It was never added, because adding it would have flipped all those bits. - If all bits are
1→ the item probably is in the set. But not certainly.
That second case is the caveat. Those bits might have been flipped to 1 by other items that happened to hash to the same positions. This is called a false positive — the filter says "yes I've seen this" when it actually hasn't.
Crucially, there are no false negatives. If a Bloom filter says an item is not in the set, you can trust that completely.
The Hash Functions
The hash functions inside a Bloom filter are the engine of the whole thing. They don't need to be cryptographically secure — they just need to be fast, deterministic, and spread inputs evenly across the bit array. Three well-known options see regular use:
FNV-1a (Fowler–Noll–Vo) processes input one byte at a time using XOR followed by a multiply. It's fast, simple, and has stood the test of time for general-purpose hashing. The "1a" variant XORs before multiplying, which gives it slightly better avalanche behaviour for short inputs.
djb2 was created by Dan Bernstein and is arguably the simplest effective hash function around. The core is just hash = hash * 33 + char, iterated across the input. It's remarkably good at distributing short strings, which makes it a common first choice.
Murmur3 is a non-cryptographic hash designed specifically for fast hash-based lookups. It processes input in 32-bit blocks with several mixing rounds, giving it a strong avalanche effect — small changes in the input produce large, unpredictable changes in the output. It's used in production by Cassandra, Redis, and Elasticsearch.
Want to see how they behave? The interactive playground lets you toggle each hash function on and off and watch the bit array update in real time.
A Concrete Example
Let's say your bit array has 10 positions, and you use 3 hash functions.
You add the string "alice":
- Hash 1 → position 2
- Hash 2 → position 5
- Hash 3 → position 8
Bits 2, 5, and 8 are now 1.
Later you check for "bob":
- Hash 1 → position 1
- Hash 2 → position 5
- Hash 3 → position 9
Bit 1 is 0 → definitely not in the set. Correct.
Now you check for "carol":
- Hash 1 → position 2
- Hash 2 → position 5
- Hash 3 → position 8
All three bits are 1 — but "carol" was never added. Those bits were set by "alice". This is a false positive. The filter incorrectly says "carol" might be in the set.
Tuning the False Positive Rate
You can control how often false positives occur by adjusting two things:
- The size of the bit array — a larger array means less bit overlap between different items, which means fewer false positives
- The number of hash functions — more functions means more bits checked per lookup, which makes accidental matches less likely (up to a point)
In practice, a well-tuned Bloom filter can achieve a false positive rate of less than 1% while using a tiny fraction of the memory a hash set would require. There are established formulas for calculating the optimal size and number of hash functions given how many items you expect to store and what false positive rate you can tolerate.
Where They're Actually Used
Bloom filters might sound academic, but they show up in real production systems all the time:
Databases — Google's Bigtable and Apache Cassandra use Bloom filters to avoid unnecessary disk reads. Before going to disk to look up a key, they check the filter. If it says the key definitely isn't there, the disk read is skipped entirely.
Web browsers — Chrome uses a Bloom filter as part of its Safe Browsing feature. Rather than downloading a massive list of malicious URLs to check against locally, it uses a compact filter to catch likely matches and only phones home to verify.
Caching layers — Bloom filters can prevent cache pollution by blocking single-hit items (things looked up only once) from ever making it into the cache.
Distributed systems — Nodes in a distributed system can share Bloom filters to quickly communicate what data they hold without transmitting the full dataset.
The Key Trade-Off
Bloom filters are not a general-purpose replacement for a hash set. They come with real constraints:
- You cannot remove items from a standard Bloom filter. Flipping a bit back to
0might affect other items that share that position. (Variants like counting Bloom filters solve this at the cost of more memory.) - You cannot retrieve items from a Bloom filter — it only answers yes/no questions.
- You have to accept false positives. They're rare and tunable, but they exist. If a false positive would be catastrophic in your system, a Bloom filter isn't the right tool.
The sweet spot is: you need to answer membership questions at high speed and scale, a small error rate is acceptable, and memory is a real constraint.
Bloom filters are a great reminder that the best data structure isn't always the one that's perfectly correct — sometimes it's the one that's correct enough, at a cost you can actually afford. Understanding when to make that trade-off deliberately is what separates good engineering from just reaching for the obvious tool.