easy
~4 hours6 stagespython
Bloom Filter
A constant-size data structure that answers 'definitely not in the set' or 'maybe in the set' in O(k) time. The structure inside every production LSM-tree (RocksDB, LevelDB, Cassandra) used to skip disk reads on missing keys. Six stages to a production-quality implementation.
fork the python starterkarnstack/byox-bloom-filter-python
use this templategh repo create my-bloom-filter-python --template karnstack/byox-bloom-filter-python --private --clonepick private to keep work to yourself, or public for unlimited actions minutes. verification works on both.
checkinggithub link status
your progresspython
0/6verified
no verifications yet. fork the starter, push to main, watch stages light up here.
stages
tests run on every push to main- stage 01Bit array and single hashfree45m
- stage 02Multiple hashes (Kirsch-Mitzenmacher)50m
- stage 03Optimal sizing math40m
- stage 04Cache-line-blocked layout55m
- stage 05Concurrent-safe Add35m
- stage 06Serialize and saturation50m
references
- Space/Time Trade-offs in Hash Coding with Allowable ErrorsBurton H. Bloom (1970) · Communications of the ACM, 13(7), pp. 422-426The original Bloom filter paper. Establishes the space-vs-error tradeoff.
- Less Hashing, Same Performance: Building a Better Bloom FilterAdam Kirsch, Michael Mitzenmacher (2006) · Algorithms - ESA 2006Justifies the two-hash construction h_i(x) = h_a(x) + i * h_b(x) used in stage 2. Saves k hash evaluations.
- Cache-, Hash- and Space-Efficient Bloom FiltersFelix Putze, Peter Sanders, Johannes Singler (2007) · WEA 2007Introduces the blocked Bloom filter layout used in stage 4. All k probes hit one cache line.
- RocksDBPer-SSTable Bloom filter. Configurable via bits_per_key; default ~10 bits/key (~1% false-positive rate). Saves random disk reads on missing keys.
- LevelDBPer-SSTable Bloom filter, same idea as RocksDB (RocksDB forked LevelDB). 10 bits per key by default.
- Apache CassandraPer-SSTable row Bloom filter. Default 0.01 false-positive rate. Used to skip SSTables that definitely do not contain a row before going to disk.
- BigtablePer-SSTable Bloom filter. Described in section 6 ('Refinements') of the original 2006 Bigtable paper. Same purpose as Cassandra and RocksDB.