bloom filter
stage 01 of 06
stage 01

Bit array and single hash

What You Are Building

A bit array of m bits, the smallest possible representation of a Bloom filter, with Add(key) and Test(key) operations driven by a single hash function. By the end of this stage the filter does not have a tunable false-positive rate (that arrives in stage 3) and it is not concurrent-safe (stage 5). It is the foundation everything else builds on.

Why This Matters

Every distinction between a fast Bloom filter and a slow one comes down to two things: how compactly you store the bit array and how you compute the hash. Stage 1 fixes the first. The bit array lives in m / 64 machine words (Go), m / 8 bytes (Python, TypeScript), and individual bits are addressed by (word_index, bit_offset) pairs. This is the same layout RocksDB uses internally.

If the bit array is sloppy (one byte per bit, or a Python list of booleans) the filter is correct but tens of times larger and slower than it needs to be. The whole point of a Bloom filter is space efficiency, so the bit array is not optional.

The Interface You Implement

The interface name and method names follow each language's idioms. Tests target behavior; method names and exact types are language-specific.

Go (go/bloom/bloom.go):

package bloom
 
// Filter is a Bloom filter with a single hash function (stage 1).
type Filter struct {
    bits []uint64
    m    uint64 // total bit capacity (multiple of 64)
}
 
// New returns a filter with at least m bits of capacity, rounded up to the
// nearest multiple of 64.
func New(m uint64) *Filter
 
// Add inserts the key into the filter.
func (f *Filter) Add(key []byte)
 
// Test reports whether the key may be in the filter.
// Returns false only if the key was definitely not added.
func (f *Filter) Test(key []byte) bool
 
// M returns the actual bit capacity (after rounding up).
func (f *Filter) M() uint64

Python (python/bloom/__init__.py):

class Filter:
    def __init__(self, m: int) -> None: ...
    def add(self, key: bytes) -> None: ...
    def test(self, key: bytes) -> bool: ...
    @property
    def m(self) -> int: ...

TypeScript (ts/src/bloom.ts):

export class Filter {
  constructor(m: number);
  add(key: Uint8Array): void;
  test(key: Uint8Array): boolean;
  get m(): number;
}

What the Tests Check

Five tests gate this stage. Each is named identically (modulo language casing) across the three starters.

  • AddedKeyIsPresent. For any single key, Add(k) followed by Test(k) returns true.
  • EmptyFilterReturnsFalse. Test(k) on a fresh filter returns false. With a single hash and zero inserts, there is no false-positive path.
  • BitArrayBoundary. A filter constructed with m = 1 still works. After Add(k) for any k, exactly the single bit is set, and Test(k) returns true. This catches off-by-one errors in the modulo-into-bit-position math.
  • HashIsDeterministic. Test(k) returns true after Add(k) even across multiple filter instances of identical m. This catches the bug of using a random seed without persisting it.
  • NewAllocatesAtLeastMBits. New(95850).M() returns at least 95850. This catches truncation of m rather than rounding up.

The starter repo carries failing skeletons for all five. Run mise run stage 1 to see them fail.

Guidance

Choosing a hash function

Use a fast, deterministic, well-distributed hash. The choice matters less in stage 1 than it will in stage 2.

  • Go: hash/fnv (fnv.New64a()) is in the standard library and fine for this stage. github.com/zeebo/xxh3 is faster but adds a dependency; defer that to stage 2.
  • Python: hashlib.sha256(key).digest() then take the first 8 bytes as a uint64. Slow but correct; ten thousand inserts run in well under a second.
  • TypeScript: the simplest portable option is FNV-1a implemented inline (about 15 lines). The Web Crypto crypto.subtle.digest("SHA-256", key) works but it is async, which complicates the interface.

Do not seed the hash from time.Now() or Math.random(). The filter must give the same answer for the same key across runs.

Mapping a hash to a bit position

A 64-bit hash needs to become a bit position in [0, m). The straightforward approach:

pos = hash % m
word_index = pos / 64
bit_offset = pos % 64
mask = 1 << bit_offset

Set: bits[word_index] |= mask. Test: (bits[word_index] & mask) != 0. Clear: bits[word_index] &= ^mask. Tests in stage 1 only exercise set and test; clear arrives in later stages.

Rounding m up to a multiple of 64

The bit array is stored in 64-bit words. If the caller asks for m = 100 bits, allocate 2 words = 128 bits, and record M() = 128. The test NewAllocatesAtLeastMBits checks this. Do not truncate.

Common Pitfalls

  1. The EmptyFilterReturnsFalse test fails because you allocated the bit array with non-zero default values. Go's make([]uint64, n) zero-fills. Python's bytearray(n) zero-fills. TypeScript's new Uint8Array(n) zero-fills. If your test fails on the empty filter, double-check that you are not initializing bits to 1 somewhere.

  2. The BitArrayBoundary test fails because m = 1 makes the modulo trivially zero. That is fine. After Add(k), position is zero, bit zero of word zero is set, Test(k) reads that bit and returns true. If your code special-cases small m and divides by zero or short-circuits, fix it.

  3. The HashIsDeterministic test fails because your hash uses a random seed. Either pin the seed to a constant or use a deterministic hash family.

  4. The AddedKeyIsPresent test fails for some keys but not others. Almost certainly off-by-one in the bit position math: check that you compute pos = hash % m, not pos = hash % (m - 1). The valid bit range is [0, m-1] inclusive.

Bonus

  • Add a Clear() method that zeroes the bit array. Not tested, but useful for stage 5 and 6 work later.
  • Add a PopCount() method returning the count of set bits in the array. Go has math/bits.OnesCount64. Python has int.bit_count() (3.10+). TypeScript can use a Brian-Kernighan loop. You will need this in stage 6 for saturation estimation.
  • Benchmark Test against a map[string]struct{} of size n = 10000. The Bloom filter should be at least as fast in lookup and dramatically smaller in memory.

What You Built

A single-hash Bloom filter that returns false for absent keys and true for added keys. With one hash, the false-positive rate is high (each added key sets one bit; once enough bits are set, random tests collide trivially). Stage 2 brings the filter close to its theoretical false-positive rate by adding multiple hash functions, using the Kirsch-Mitzenmacher two-hash construction that avoids k full hash evaluations per call.

Continue to Stage 2: Multiple Hashes.

Command Palette

Search for a command to run...