In a hash file, records do not have to be written sequentially to the file. Instead, a
hash function calculates the address of the page in which the record is to be stored
based on one or more fields in the record. The base field is called the hash field,
or if the field is also a key field of the file, it is called the hash key. Records in a
hash file will appear to be randomly distributed across the available file space. For
this reason, hash files are sometimes called random, or direct, files.
The hash function is chosen so that records are as evenly distributed as possible
throughout the file. One technique, called folding, applies an arithmetic function,
such as addition, to different parts of the hash field. Character strings are converted
into integers before the function is applied using some type of code, such
as alphabetic position or ASCII values. For example, we could take the first two
characters of the staff number, staffNo, convert them to an integer value, then add
this value to the remaining digits of the field. The resulting sum is used as the
address of the disk page in which the record is stored. An alternative, more popular
technique, is the division-remainder hashing. This technique uses the MOD function,
which takes the field value, divides it by some predetermined integer value,
and uses the remainder of this division as the disk address.
The problem with most hashing functions is that they do not guarantee a unique
address, because the number of possible values a hash field can take is typically
much larger than the number of available addresses for records. Each address generated
by a hashing function corresponds to a page, or bucket, with slots for multiple
records. Within a bucket, records are placed in order of arrival. When the
same address is generated for two or more records, a collision is said to have