Bloom Filters: Space-efficient Probalistic Data Structure
Bloom filter
A Bloom filter is a space-efficient probalistic data structure. It is used to test whether an element presents in a set or not. It return either a "firm no" or "probably yes", which mean false positives are possible while false negatives are impossible. Once an element is added to the set, it cannot be removed. As the number of elements inserted into a bloom filter increases, the more likely the false positives are. Bloom filter trades off decrease in accuracy for reduced memory usage while providing fast query responses. It could be very useful for use case that can tolerate some false positves but not any false negatives.
How Bloom filter works
Initially, an empty Bloom filter is an array of bits set to 0.
Adding an element
To add a new element to the Bloom filter, we take the element to be inserted and run it through a set of hash functions, mod the result by the length of bit array to generate a set of hash values, then for each hash values, set the bit corresponding to the index to 1.
Following figure gives an example of insertion process when adding "Running" object to the Bloom filter.
Testing an element
To check whether an element exists, we generate the bitmap index with hash function and check if any of the corresponding bit are 0. If any corresponding bit is 0, this means that the element is definitely not in the set , otherwise the bit would have been set to 1 during the insertion process.
If all the bits corresponding to an element's hash values are set to 1, this indicates the element probably exists which is not 100% certain as Bloom filter can produce false postives. The false positive happens when there is hash collision which the hash values for an element corresponding to other already inserted element in the filter.
For instance, when we check if "football" is present, the hash function return hash values of 2 , 8 and 9. As shown in the figure below, since all the respective bit to the hash values of "football" are set to 1, even though the "football" does not exists, the bloom filter will return true which is a false positive.
Trade-off between memory space and accuracy
We can control the false positive rates by adjusting the size of bit array based on the expected number of elements that will be in the set. Increasing the size of bit array and number of hash functions can reduce the probability of false positives but require additional memory space. Conversely, reducing the size of bit array and numner of hash functions result in reduced memory space but increase the likelihood of false positives. These are trade-offs between space used and accuracy.