Hash table

In computer science, a hash table (also called a hash map or scatter-storage table) is an index structure used to locate data elements within a large amount of data. Alternative index data structures are, for instance, tree structures (such as the B-tree) and the skip list. Hash tables are characterized by insert and remove operations that normally take constant time. Searching a data set by means of a hash table is also called a hash algorithm or hashing method.

Hashing

A hash algorithm is an algorithm for finding data objects in large data sets. It is based on the idea that a mathematical function computes the position of an object in a table. This eliminates the need to examine many data objects until the target object is found.

The algorithm

In hashing, the target data is stored in a hash table. For each data object, a hash function computes a hash value that is used as an index into the table. To compute this hash value, a key is needed that uniquely identifies the object; the hash function uses this key to compute the hash value. The data object is stored in the location (English: bucket) determined by the hash value. Ideally, each object gets its own bucket. However, hash functions need not be injective, so two different objects can have the same hash value. This case is called a collision and requires special treatment by the method.

A search in the hash table proceeds in the same way. First, a hash value is again computed from a search key; it determines the bucket of the data object. If a collision has occurred, the desired target then has to be determined by directly comparing the search key with the objects in that bucket.

In practice, the table is implemented as an array. To handle collisions, collided data is stored, according to some alternative strategy, in other free slots or in a list. In the worst case, collisions can lead to a degeneration of the hash table when a few hash values are assigned very many objects while other hash values remain unused.
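
The basic mapping from key to bucket can be sketched in a few lines of Python; the built-in hash() merely stands in for an application-specific hash function, and the table size m is an arbitrary assumption:

    m = 16                      # number of buckets in the array
    table = [None] * m

    def bucket_of(key):
        # the hash value is reduced modulo m to obtain a valid array index
        return hash(key) % m

    # store and retrieve a data object under its key
    table[bucket_of("alice")] = ("alice", 42)
    print(table[bucket_of("alice")])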

Collisions

Since hash functions are generally not injective, two different keys can lead to the same hash value, that is, to the same field in the table. This event is called a collision. In this case, the hash table has to accommodate several values at the same location. To handle this problem, there are various collision resolution strategies.

One approach is called open hashing. If an entry is to be stored at a location in the table that is already occupied, another free location is used instead. A distinction is often made between three variants, described in the algorithms section below: linear probing, quadratic probing, and double hashing.

Another option is collision resolution by chaining. Instead of the data itself, the hash table contains containers (English: buckets), each of which records all data with the same hash value. In a search, the right target container is computed first, which greatly reduces the set of possible targets; finally, the remaining elements in that container still have to be searched. In the worst case, all elements may have the same hash value and thus end up in the same bucket. In practice, this can be avoided by choosing a suitable size for the hash table and a suitable hash function. Often the chaining is realized as a linear list per container.

Benefits

Depending on the application, the use of a hash table offers advantages in access time (compared to tree index structures) and in required space (compared to ordinary arrays).

Ideally, the hash function should be chosen for the data to be stored so that the number of collisions is minimized and remains below a constant. If this holds, a hash table with n stored entries needs only constant time on average, O(1), to access a hash table entry. Depending on the hash function, the hash table must contain additional unused fields (in practice usually 20 to 30 percent) so that the number of collisions can be limited.

In comparison, access to an element in a B-tree takes time on the order of O(log n). The entire tree data structure requires storage on the order of O(n).

Disadvantages

The load factor is the number of stored elements divided by the total number of buckets. As the load factor increases, the probability of a collision grows and performance deteriorates. Then only enlarging the table, with subsequent restructuring (rehashing), brings the runtime behavior back to an acceptable level.

Complexity

If the hash function and the hash table size are suitably chosen, the cost of a search in the table (English: look-up) is O(1). Because of possible collisions, however, a hash table has much worse performance in the so-called worst case. This is estimated as O(n), where n is the number of entries stored in the hash table: in that case all entries in the table are searched.

Variants of hashing

There are several variants of the hashing method that are better suited to certain data. An important factor is how dynamically the number of elements changes. Open hashing solves this problem, but accepts a loss in access times. Closed hashing, by contrast, depends on explicit strategies for collision handling. Caution: the terms "open" and "closed hashing" are also used with exactly the opposite meaning.

Hashing with chaining

In hashing with chaining (English: separate chaining), the hash table is structured so that each container can hold a dynamic data structure, for example a list or a tree. Each key is then inserted into or searched for in that data structure. It is thus easily possible to store several keys in one container, but this leads to more or less extended access times. The efficiency of access is determined by how quickly data can be inserted into the chosen data structure and retrieved from it. Hashing with chaining is a very common indexing variant; in databases, very large amounts of data are indexed by such hash tables. In database systems, the size of a bucket is a multiple of the sector size of the storage medium. The reason is that the data volume cannot be kept in main memory: when answering a query, the database system has to read the buckets sector by sector.
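
A minimal sketch of hashing with chaining in Python; the class name is made up for illustration, each bucket is simply a list of key-value pairs, and the built-in hash() stands in for the hash function:

    class ChainedHashTable:
        def __init__(self, num_buckets=16):
            # one container (here a plain list) per bucket
            self.buckets = [[] for _ in range(num_buckets)]

        def _bucket(self, key):
            return self.buckets[hash(key) % len(self.buckets)]

        def insert(self, key, value):
            bucket = self._bucket(key)
            for i, (k, _) in enumerate(bucket):
                if k == key:                   # key already present: overwrite
                    bucket[i] = (key, value)
                    return
            bucket.append((key, value))        # new key: append to the chain

        def lookup(self, key):
            # only the one container selected by the hash value is searched
            for k, v in self._bucket(key):
                if k == key:
                    return v
            return None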

Hashing with open addressing

This method is called open hashing for short, referring to the open addressing, but also closed hashing, referring to the limited number of possible keys per container.

In hashing with open addressing, each container can only be assigned a fixed number of keys. Often just a single possible key per container is chosen. In the event of a collision, an alternative container must then be found. The approach taken is to define a whole sequence of m hash functions for m containers. If applying the first hash function, call it h0, leads to a collision, h1 is used instead. If this also leads to a collision (that is, the corresponding container is already occupied), h2 is applied, and so on up to h(m-1). The term "open addressing" comes from the property that, because of collisions, the same key may be assigned different addresses.
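
A generic sketch of this probing scheme in Python; the parameter probe plays the role of the family of hash functions h0, ..., h(m-1) and is an assumption of this example:

    def open_addressing_insert(table, key, probe):
        m = len(table)
        for i in range(m):                 # try h0, h1, ..., h(m-1) in turn
            pos = probe(key, i) % m        # probe(key, i) corresponds to h_i(key)
            if table[pos] is None:
                table[pos] = key
                return pos
        raise RuntimeError("all probe positions are occupied")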

Cuckoo hashing

Cuckoo hashing is a further method for avoiding collisions in a table. The name is derived from the behavior of the cuckoo, which removes eggs from a nest in order to place its own egg there.

The principle is to use two hash functions. This yields two possible locations in the hash table, which always guarantees constant access time. A new key is stored in one of its two possible locations. If the first target position is occupied, the key already stored there is moved to its alternative position and the new key is stored in its place. If that alternative position is occupied in turn, the key found there is again transferred to its own alternative position, and so on. If this procedure leads to an infinite loop (usually one breaks off after a bounded number of steps), the hash table is rebuilt with two new hash functions. The probability of such a rehashing is very small for each insertion.
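
A sketch of this insertion procedure in Python for a single table; h1 and h2 are assumed to be two independent hash functions supplied by the caller, and the bound on the number of displacement steps is an assumption:

    def cuckoo_insert(table, key, h1, h2, max_steps=None):
        m = len(table)
        if max_steps is None:
            max_steps = m                 # break off after a bounded number of steps
        pos = h1(key) % m
        for _ in range(max_steps):
            if table[pos] is None:
                table[pos] = key
                return True
            # evict the key already stored there and take its place
            key, table[pos] = table[pos], key
            # the evicted key moves on to its alternative position
            p1, p2 = h1(key) % m, h2(key) % m
            pos = p2 if pos == p1 else p1
        return False   # probable cycle: rebuild the table with two new hash functions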

Algorithms

Linear probing

The simplest way to define such a sequence is to check the next container each time until a free container is found. The definition of the sequence of hash functions then looks like this:

    h_i(k) = (h(k) + i) mod m

The use of the modulo operation is related to the limited number of containers: when the last container has been probed, one starts again with the first container. The problem with this method is that chains or clusters quickly form, and access times in the region of such chains rise quickly. Linear probing is therefore inefficient. Its advantage, however, is that, unlike in other probing procedures, all containers in the table are used.
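
A minimal linear-probing insert in Python, directly following the probe sequence above; the hash function h is assumed to be supplied by the caller:

    def linear_probing_insert(table, key, h):
        m = len(table)
        for i in range(m):
            pos = (h(key) + i) % m             # h_i(k) = (h(k) + i) mod m
            if table[pos] is None or table[pos] == key:
                table[pos] = key
                return pos
        raise RuntimeError("hash table is full")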

Quadratic probing

As in linear probing, a new free location is searched for, but not sequentially: the probe positions lie at quadratically increasing distances from the original position, in both directions. If h(k) causes a collision, the positions h(k) + 1, h(k) - 1, h(k) + 4, h(k) - 4, h(k) + 9, and so on are probed in succession. Expressed as a formula:

    h_i(k) = (h(k) + (-1)^(i+1) * ceil(i/2)^2) mod m

The constant change of sign in this collision strategy is called "alternating quadratic probing" or "quadratic probing with refinement". If the number of containers is chosen cleverly (namely, m is a prime number with m mod 4 = 3), each probe sequence produces a permutation of the numbers 0 to m - 1; it is thus ensured that every container is reached.

Quadratic probing brings no improvement in the probability of having to probe at all (primary collisions), but it can reduce the likelihood of collisions during the probing (secondary collisions), that is, cluster formation is avoided.
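
A sketch of the alternating probe sequence in Python; the hash function h and the table size m are assumptions of the example:

    def quadratic_probe_positions(key, h, m):
        # probe sequence h(k), h(k)+1, h(k)-1, h(k)+4, h(k)-4, ... (all mod m)
        yield h(key) % m
        for j in range(1, (m + 1) // 2):
            yield (h(key) + j * j) % m
            yield (h(key) - j * j) % m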

Double Hashing

In double hashing, two independent hash functions h1 and h2 are applied. They are called independent if the probability of a so-called double collision, that is h1(k1) = h1(k2) and h2(k1) = h2(k2) at the same time, is minimal. The sequence of hash functions formed from h1 and h2 looks like this:

    h_i(k) = (h1(k) + i * h2(k)) mod m

The cost of this method is close to the cost of ideal hashing.
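
A sketch of insertion with double hashing in Python; h1 and h2 are assumed to be independent hash functions supplied by the caller, where h2(k) should never be zero and should be relatively prime to the table size:

    def double_hashing_insert(table, key, h1, h2):
        m = len(table)
        for i in range(m):
            pos = (h1(key) + i * h2(key)) % m   # h_i(k) = (h1(k) + i*h2(k)) mod m
            if table[pos] is None:
                table[pos] = key
                return pos
        raise RuntimeError("no free slot found")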

Brent Hashing

In Brent hashing, it is first checked whether the location where the element is to be inserted is free. If so, the element is inserted there. If the location is occupied, a new location is computed, based on the occupied one, both for the element to be inserted and for the element already stored at that place. If the newly computed location for the element to be inserted is free, the element is inserted there. If it is occupied, but the newly computed location for the element that occupies the contested place is free, then that element is moved to its new location and the element to be inserted is placed in the location it has vacated; in this way the element to be inserted can still be accommodated in the table. If both newly computed locations are occupied, the procedure is repeated starting from the newly computed location of the element to be inserted.
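
A rough sketch of this insertion rule in Python, assuming a double-hashing setting in which the "new location" of an element is found by adding a second hash value h2 of that element; h1, h2, and the iteration bound are assumptions of the example:

    def brent_insert(table, key, h1, h2):
        m = len(table)
        pos = h1(key) % m
        for _ in range(m):
            if table[pos] is None:                # location is free: insert directly
                table[pos] = key
                return pos
            resident = table[pos]
            next_new = (pos + h2(key)) % m        # new location for the element to insert
            next_res = (pos + h2(resident)) % m   # new location for the stored element
            if table[next_new] is None:           # new element's location is free
                table[next_new] = key
                return next_new
            if table[next_res] is None:           # stored element's location is free: move it
                table[next_res] = resident
                table[pos] = key
                return pos
            pos = next_new                        # both occupied: continue from next_new
        raise RuntimeError("no free slot found")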

Dynamic hashing

As the load factor of the table increases, the probability of collisions grows considerably. At the latest when the number of indexed records exceeds the capacity of the table, collisions become unavoidable, so the method has to spend an ever greater effort on collision resolution. To avoid this, in dynamic hashing the hash table is enlarged when necessary. This, however, inevitably affects the value range of the hash function, which now also has to be extended. Changing the hash function, in turn, has the adverse effect that the hash values of already stored data change as well. Specifically for dynamic hashing, a class of hash functions was developed whose value range can be enlarged without changing the already stored hash values.
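
The underlying idea can be illustrated with a small Python sketch: the stored hash value itself never changes; when the table grows, simply one more bit of the same value is used for addressing. The concrete values are made up for illustration:

    def bucket_index(hash_value, depth):
        # use only the low-order `depth` bits of the unchanged hash value
        return hash_value & ((1 << depth) - 1)

    h = 0b10110101                 # hypothetical stored hash value of some key
    print(bucket_index(h, 2))      # table with 2**2 = 4 buckets -> bucket 1
    print(bucket_index(h, 3))      # table grown to 2**3 = 8 buckets -> bucket 5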

Benefits

  • There is no upper limit to the data volume
  • Entries can be deleted without problems
  • Address collisions do not lead to clustering.

Disadvantages

If no order-preserving hash function is used:

  • No efficient traversal of the entries according to an order
  • No efficient search for the entry with the smallest or largest key

Application

One use case is the associative array (also called map, lookup table, or dictionary). The lookup of the data associated with a key can be implemented quickly and elegantly using a hash table.
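
For illustration, Python's built-in dict is such a hash-table-backed associative array; the sample entries are made up:

    phone_book = {"alice": "555-0100", "bob": "555-0199"}   # associative array
    print(phone_book["alice"])     # lookup by key, on average in constant time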

Hash tables are also important in databases for indexing tables. A so-called hash index can, under favorable conditions, lead to ideal access times.

Furthermore, hash tables are used in virtually every modern application, for example to implement sets or caches. Symbol tables, as used in compilers or interpreters, are also often implemented as hash tables.

Use in databases

Hash tables thus allow a very fast search in large data sets, since the calculation of the hash value limits the number of possible targets in a single step. This makes hash tables one of the most efficient index structures. A major disadvantage, however, is the risk of degeneration due to collisions, which are unavoidable with steady growth of the data set (if the table is not enlarged and every element it contains rehashed); see the section on collisions. Because of unfavorable I/O access patterns when the hash table is stored on secondary storage, and because intervals cannot be iterated efficiently according to an order relation, the use of hash tables in database systems must therefore be weighed against alternative index data structures such as B-trees.

Most hash functions do not allow moving to the next or previous record according to an order relation, since they deliberately "mix" the data in order to distribute it evenly over the value space. Special "order-preserving" hash functions do allow such iteration according to the order relation, and thus queries with inequality conditions ("greater than", "less than") or sorted access to all values. To use such methods efficiently, a prior analysis of the data distribution is usually necessary. They are therefore mostly used in database systems, which also perform such analyses, for example for query optimization.
