all primitives
easy
~4 hours6 stagespython

Bloom Filter

A constant-size data structure that answers 'definitely not in the set' or 'maybe in the set' in O(k) time. The structure inside every production LSM-tree (RocksDB, LevelDB, Cassandra) used to skip disk reads on missing keys. Six stages to a production-quality implementation.

fork the python starterkarnstack/byox-bloom-filter-python
use this template
gh repo create my-bloom-filter-python --template karnstack/byox-bloom-filter-python --private --clone

pick private to keep work to yourself, or public for unlimited actions minutes. verification works on both.

checkinggithub link status

your progresspython
0/6verified

no verifications yet. fork the starter, push to main, watch stages light up here.

stages

tests run on every push to main
  1. stage 01Bit array and single hash
  2. stage 02Multiple hashes (Kirsch-Mitzenmacher)
  3. stage 03Optimal sizing math
  4. stage 04Cache-line-blocked layout
  5. stage 05Concurrent-safe Add
  6. stage 06Serialize and saturation

references

papers cited
used in production by
  • RocksDBPer-SSTable Bloom filter. Configurable via bits_per_key; default ~10 bits/key (~1% false-positive rate). Saves random disk reads on missing keys.
  • LevelDBPer-SSTable Bloom filter, same idea as RocksDB (RocksDB forked LevelDB). 10 bits per key by default.
  • Apache CassandraPer-SSTable row Bloom filter. Default 0.01 false-positive rate. Used to skip SSTables that definitely do not contain a row before going to disk.
  • BigtablePer-SSTable Bloom filter. Described in section 6 ('Refinements') of the original 2006 Bigtable paper. Same purpose as Cassandra and RocksDB.

Command Palette

Search for a command to run...