Bit array and single hash
What You Are Building
A bit array of m bits, the smallest possible representation of a Bloom filter, with Add(key) and Test(key) operations driven by a single hash function. By the end of this stage the filter does not have a tunable false-positive rate (that arrives in stage 3) and it is not concurrent-safe (stage 5). It is the foundation everything else builds on.
Why This Matters
Every distinction between a fast Bloom filter and a slow one comes down to two things: how compactly you store the bit array and how you compute the hash. Stage 1 fixes the first. The bit array lives in m / 64 machine words (Go), m / 8 bytes (Python, TypeScript), and individual bits are addressed by (word_index, bit_offset) pairs. This is the same layout RocksDB uses internally.
If the bit array is sloppy (one byte per bit, or a Python list of booleans) the filter is correct but tens of times larger and slower than it needs to be. The whole point of a Bloom filter is space efficiency, so the bit array is not optional.
The Interface You Implement
The interface name and method names follow each language's idioms. Tests target behavior; method names and exact types are language-specific.
Go (go/bloom/bloom.go):
package bloom
// Filter is a Bloom filter with a single hash function (stage 1).
type Filter struct {
bits []uint64
m uint64 // total bit capacity (multiple of 64)
}
// New returns a filter with at least m bits of capacity, rounded up to the
// nearest multiple of 64.
func New(m uint64) *Filter
// Add inserts the key into the filter.
func (f *Filter) Add(key []byte)
// Test reports whether the key may be in the filter.
// Returns false only if the key was definitely not added.
func (f *Filter) Test(key []byte) bool
// M returns the actual bit capacity (after rounding up).
func (f *Filter) M() uint64Python (python/bloom/__init__.py):
class Filter:
def __init__(self, m: int) -> None: ...
def add(self, key: bytes) -> None: ...
def test(self, key: bytes) -> bool: ...
@property
def m(self) -> int: ...TypeScript (ts/src/bloom.ts):
export class Filter {
constructor(m: number);
add(key: Uint8Array): void;
test(key: Uint8Array): boolean;
get m(): number;
}What the Tests Check
Five tests gate this stage. Each is named identically (modulo language casing) across the three starters.
AddedKeyIsPresent. For any single key,Add(k)followed byTest(k)returns true.EmptyFilterReturnsFalse.Test(k)on a fresh filter returns false. With a single hash and zero inserts, there is no false-positive path.BitArrayBoundary. A filter constructed withm = 1still works. AfterAdd(k)for anyk, exactly the single bit is set, andTest(k)returns true. This catches off-by-one errors in the modulo-into-bit-position math.HashIsDeterministic.Test(k)returns true afterAdd(k)even across multiple filter instances of identicalm. This catches the bug of using a random seed without persisting it.NewAllocatesAtLeastMBits.New(95850).M()returns at least 95850. This catches truncation ofmrather than rounding up.
The starter repo carries failing skeletons for all five. Run mise run stage 1 to see them fail.
Guidance
Use a fast, deterministic, well-distributed hash. The choice matters less in stage 1 than it will in stage 2.
- Go:
hash/fnv(fnv.New64a()) is in the standard library and fine for this stage.github.com/zeebo/xxh3is faster but adds a dependency; defer that to stage 2. - Python:
hashlib.sha256(key).digest()then take the first 8 bytes as auint64. Slow but correct; ten thousand inserts run in well under a second. - TypeScript: the simplest portable option is FNV-1a implemented inline (about 15 lines). The Web Crypto
crypto.subtle.digest("SHA-256", key)works but it is async, which complicates the interface.
Do not seed the hash from time.Now() or Math.random(). The filter must give the same answer for the same key across runs.
A 64-bit hash needs to become a bit position in [0, m). The straightforward approach:
pos = hash % m
word_index = pos / 64
bit_offset = pos % 64
mask = 1 << bit_offset
Set: bits[word_index] |= mask. Test: (bits[word_index] & mask) != 0. Clear: bits[word_index] &= ^mask. Tests in stage 1 only exercise set and test; clear arrives in later stages.
The bit array is stored in 64-bit words. If the caller asks for m = 100 bits, allocate 2 words = 128 bits, and record M() = 128. The test NewAllocatesAtLeastMBits checks this. Do not truncate.
Common Pitfalls
-
The
EmptyFilterReturnsFalsetest fails because you allocated the bit array with non-zero default values. Go'smake([]uint64, n)zero-fills. Python'sbytearray(n)zero-fills. TypeScript'snew Uint8Array(n)zero-fills. If your test fails on the empty filter, double-check that you are not initializing bits to1somewhere. -
The
BitArrayBoundarytest fails becausem = 1makes the modulo trivially zero. That is fine. AfterAdd(k), position is zero, bit zero of word zero is set,Test(k)reads that bit and returns true. If your code special-cases smallmand divides by zero or short-circuits, fix it. -
The
HashIsDeterministictest fails because your hash uses a random seed. Either pin the seed to a constant or use a deterministic hash family. -
The
AddedKeyIsPresenttest fails for some keys but not others. Almost certainly off-by-one in the bit position math: check that you computepos = hash % m, notpos = hash % (m - 1). The valid bit range is[0, m-1]inclusive.
Bonus
- Add a
Clear()method that zeroes the bit array. Not tested, but useful for stage 5 and 6 work later. - Add a
PopCount()method returning the count of set bits in the array. Go hasmath/bits.OnesCount64. Python hasint.bit_count()(3.10+). TypeScript can use a Brian-Kernighan loop. You will need this in stage 6 for saturation estimation. - Benchmark
Testagainst amap[string]struct{}of sizen = 10000. The Bloom filter should be at least as fast in lookup and dramatically smaller in memory.
What You Built
A single-hash Bloom filter that returns false for absent keys and true for added keys. With one hash, the false-positive rate is high (each added key sets one bit; once enough bits are set, random tests collide trivially). Stage 2 brings the filter close to its theoretical false-positive rate by adding multiple hash functions, using the Kirsch-Mitzenmacher two-hash construction that avoids k full hash evaluations per call.
Continue to Stage 2: Multiple Hashes.