Hash function

A hash function is a function that converts an input from a (typically) large domain into an output in a (typically) smaller range (the hash value, often a subset of the integers). Hash functions vary in the domain of their inputs and the range of their outputs and in how patterns and similarities of input data affect output data. Hash functions are used in hash tables, cryptography and data processing.

Formally speaking, a hash function is defined by its domain (typically variable length strings of bytes), its range (typically fixed length bit-sequences) and the defining function (let's call this function H). The key desired characteristics of a hash function in general are that H(x) ≠ H(y) implies x ≠ y and that H(x) = H(y) probably implies that x = y. Unfortunately, the use of probably here is just too inexact to let us succeed. If the set of all possible values of H(x) is much smaller than the set of all possible values of x, then this latter condition cannot be true if all sequences are equally likely. Thus, the second condition is often restated in different ways to make it plausible. For instance, cryptographic hashes use the second condition in the form that given x, it is computationally very difficult to find y such that H(x) = H(y). Additionally, it is desirable that if we are given x and H(x + s) where + can be bit changing or concatenation, then we can't find s short of exhaustive enumeration. In practice, for most applications other than error correction, cryptographic hashes serve very nicely. The MD5 and SHA-1 algorithms are two of the most popular algorithms although any cryptosystem can be used to create a hash function (as, indeed, any cryptographically secure hash can be used to create a cryptosystem).

For error correction, a distribution of likely perturbations is assumed at least approximately. Perturbations to a string are then classified into large (improbable) and small (probable) errors. The second criterion is then restated so that if we are given H(x) and x+s, then we can compute x efficiently if s is small. Such hash functions are known as error correction codes. Important sub-class of these correction codes are cyclic redundancy checks and Reed-Solomon codes.

For audio identification such as finding out whether an MP3 file matches one of a list of known items, one could use a conventional hash function such as MD5, but this would be very sensitive to highly likely perturbations such as time-shifting, CD read errors, different compression algorithms or implementations or changes in volume. Using something like MD5 is useful as a first pass to find exactly identical files, but another more advanced algorithm is required to find all identical items. Contrary to an assumption often voiced by people who don't follow the industry carefully, hashing algorithms do exist that are robust to these minor differences. Most of the algorithms available are not extremely robust, but some are so robust that they can identify music played on loud-speakers in a noisy room.

The term "hash" apparently comes by way of analogy with its standard meaning in the physical world, to "chop and mix." In the SHA-1 algorithm, for example, the domain is "flattened" and "chopped" into "words" which are then "mixed" with one another using carefully chosen mathematical functions. The range ("hash value") is made to be a definite size, 160 bits (which may be either smaller or larger than the domain), through the use of modular addition.

Hash tables, a major application for hash functions, speed up the lookup of a record of information, given a key (note: this use of the word key is separate from its meaning as a code for encrypting data). For example, an alphabetic string might be used to look up an employee record in a database system.

Hashing can provide almost direct access to records, which means that, on average, a lookup can require just one or two probes into the memory or file containing the records. (At the other extreme is the worst case, in which lookup proceeds by comparing the key with each record in turn -- this linear search can require examining all the records.)

Assuming the key to be a string of bytes, the hash function should be an index into the records that has random distribution over expected strings. Otherwise, there will be more key "collisions" and hence degradation of lookup time. If, for example, a key is alphabetic, then each byte can have only 26 of its possible 256 values. So simple functions will not probe all record indices evenly.

When speed and good distribution are important, and the key is an array of bytes, one good function to use is Pearson Hashing.

See also: message digest, HMAC, cryptography