Bloomfilter

A Bloom filter is a probabilistic data structure that help can be found very quickly which data have occurred already in a data stream and which occur for the first time. For this purpose, a "fingerprint" of the data record read will be left in a one-line hash table with an appropriate number of hash functions.

Developed in 1970 by Burton H. Bloom for spell checking and hyphenation at the end of the line, Bloom filters are now often used in database management and for routing in networks. Unlike other algorithms Bloom filter need very little space. But are also the following peculiarities of crucial importance for the applicability: key values ​​that were recorded once in the hash table, remain there. Furthermore, false positive results are possible, ie what the filter accepts, was likely to contain the key values ​​; however, was definitely not what he rejects.

Principle of operation

A Bloom filter is composed of an m- digit bit array (which is initially filled with zeroes ) and k different hash functions, with a range from 0 to m -1. This is important especially that supplied by the hash function values ​​are uniformly distributed cryptographic properties are not required. For this reason, simple and very fast hash functions are often used (eg MurmurHash or FNV ).

Initially, the Bloom filter is assigned his vocabulary. To this end, k hash values ​​are calculated for the value to be stored by the k hash functions, each of the determined k hash values ​​representing a position in the bit array. Each of these positions will be set to 1, it being immaterial whether previously a 1 or a 0 was at that position.

Will now be checked whether any value is included in the vocabulary, the k hash values ​​are determined in turn. Contains one of the k digits of the bit array is a 0, then it is certain that the value is not included in the filter. If at each of the k points a 1, it is very likely that the value has been stored in the filter. However, a Bloom filter can never say with absolute certainty that a particular value is actually stored as a certain risk -related collisions, classified as belonging mistakenly values ​​(false - positive classification). This risk, however, can by a suitable choice of k, m, and capacity of the filter (ie the maximum number of values ​​to be stored ) is reduced.

Since a certain position may have been set in the bit array by adding different values ​​to 1, it is not possible with conventional Bloom filters to remove once added value subsequently. Remedy known as Counting Bloom filter here.

Hash functions with constant and variable source image set amount are not bijective in general, so it is impossible with classical Bloom filter to recover the contained values ​​. It can only be determined with a certain (possibly high) probability whether a given word is contained or not. This is in many applications, a major drawback, since, for example, the wildcard search is impossible.

Bloom filters can be useful when sensitive data is to be stored. For example, the method can be used to exclude certain at a manhunt that just cleared person is wanted, without having to hold personal data in plain text.

Example

Assuming that the filter has a length of 10 bits (m = 10) and has three hash functions (k = 3). The vocabulary consists of the words "the ", "an" and "the".

First, the filter is empty:

0 0 0 0 0 0 0 0 0 0 Now the words are sequentially added to the filter. For this purpose the results of the three hash functions are determined for each of the words.

0 1 0 1 0 0 0 1 0 0 " The " Suppose the hash functions provide for the word " the " values ​​7, 4 and 1 It thus positions 7, 4 and 1 are set in the array to 1. The array now looks like this: 1 1 0 1 0 0 1 1 0 0 " The " Assuming that the hash functions provide for the word " the " the values ​​of 9, 8 and 2 are set in the array 1, the positions 9, 2 and 8, with which the array looks like this: 1 1 0 1 0 0 1 1 1 0 It will now be reviewed in the filter, the existence of different words.

It should be remembered that this is a deliberately constructed an extreme example, which still does not allow conclusions about the quality of the process.

Variants

In the scientific literature, numerous variants and extensions of Bloom filters have been proposed. Provides a good overview of the article " Network applications of bloom filters: A survey" by Broder and Mitzenmacher. These enhancements include, for example Counting Bloom filters that allow the cost of a higher space requirement removing items. Other Bloom filter variants try the benefits of the Bloom filter to stream data to expand (eg, Stable Bloom Filter), while other variants options for probabilistic Multiplizitätsabschätzungen of inserted elements offer (eg Bloomier filter).

Applications

  • The Proxy Server Squid uses Bloom filters for matching cache entries.
  • The NoSQL BigTable and Apache Cassandra use Bloom filter to avoid expensive disk accesses for queries on a non - existing key values.
  • The web browser Google Chrome maintains a Bloom filter with the signatures of dangerous websites and checks when you enter a URL if it is included in this filter.
132884
de