Bloom Filter

A constant-size data structure that answers 'definitely not in the set' or 'maybe in the set' in O(k) time. The structure inside every production LSM-tree (RocksDB, LevelDB, Cassandra) used to skip disk reads on missing keys. Six stages to a production-quality implementation.

fork the python starterkarnstack/byox-bloom-filter-python

use this template

gh repo create my-bloom-filter-python --template karnstack/byox-bloom-filter-python --private --clone

pick private to keep work to yourself, or public for unlimited actions minutes. verification works on both.

checkinggithub link status

your progresspython

0/6verified

no verifications yet. fork the starter, push to main, watch stages light up here.

stages

tests run on every push to main

stage 01Bit array and single hash
free45m
stage 02Multiple hashes (Kirsch-Mitzenmacher)
50m
stage 03Optimal sizing math
40m
stage 04Cache-line-blocked layout
55m
stage 05Concurrent-safe Add
35m
stage 06Serialize and saturation
50m

references

papers cited

Space/Time Trade-offs in Hash Coding with Allowable ErrorsBurton H. Bloom (1970) · Communications of the ACM, 13(7), pp. 422-426The original Bloom filter paper. Establishes the space-vs-error tradeoff.
Less Hashing, Same Performance: Building a Better Bloom FilterAdam Kirsch, Michael Mitzenmacher (2006) · Algorithms - ESA 2006Justifies the two-hash construction h_i(x) = h_a(x) + i * h_b(x) used in stage 2. Saves k hash evaluations.
Cache-, Hash- and Space-Efficient Bloom FiltersFelix Putze, Peter Sanders, Johannes Singler (2007) · WEA 2007Introduces the blocked Bloom filter layout used in stage 4. All k probes hit one cache line.

used in production by

RocksDBPer-SSTable Bloom filter. Configurable via bits_per_key; default ~10 bits/key (~1% false-positive rate). Saves random disk reads on missing keys.
LevelDBPer-SSTable Bloom filter, same idea as RocksDB (RocksDB forked LevelDB). 10 bits per key by default.
Apache CassandraPer-SSTable row Bloom filter. Default 0.01 false-positive rate. Used to skip SSTables that definitely do not contain a row before going to disk.
BigtablePer-SSTable Bloom filter. Described in section 6 ('Refinements') of the original 2006 Bigtable paper. Same purpose as Cassandra and RocksDB.

stages

references

Command Palette