Hash Tables I: Fundamentals of Fast Retrieval

A hash table is an array that uses an intermediary function (the hash function) to map a given key to the index of a bucket, where the corresponding key-value pair is stored.

They are designed for one primary goal: speed. In an environment where retrieval parameters are tight, a hash table often provides the best performance for mapping unique identifiers to specific values.

The Anatomy of a Hash Table

A hash table is composed of four critical components: Keys, Hash Functions, Indexes, and Buckets.

  • Keys: These are unique identifiers (strings, integers, or characters) that serve as the input for our storage logic. For example, if we want to store the value 50 associated with the name michael, then michael is our key.
  • Hash Function: This is the mathematical formula that processes a key into a Hash Value. A fundamental property of a good hash function is consistency: a specific key will always result in the same hash value.
  • Indexes: The hash value is typically mapped to an index in an array. This is often done using the formula: $$ \text{index} = \text{hash\_value} \pmod{n} $$ where $n$ is the number of rows (or size) of the table.
  • Buckets: These are the storage containers that an index addresses. The bucket stores the actual key/value pair.

The Importance of the Hash Function

A "good" hash function is defined by several characteristics:

  1. Even Distribution: It spreads keys out evenly across the table to minimize collisions.
  2. Efficiency: It allows for fast computation even with large inputs.
  3. Determinism: The same input always produces the same output.
  4. Low Collision Probability: It minimizes the chance of two different keys producing the same index.

Note: It is the key that is hashed, not the value. The value is simply the cargo carried by the indexed bucket.

Understanding Collisions

Collisions occur when two different keys produce the same index. Consider a table of size 30. If we have values like 15, 30, 45, 60, 75, 90 (sequential multiples of 15), and our simple hash function is just the value itself, we see an evident collision pattern.

For example, both 15 and 45 would result in index 15: $$ 15 \pmod{30} = 15 $$ $$ 45 \pmod{30} = 15 $$

To avoid this, we must provide unique keys or use a more sophisticated hash function that spreads these values across the available buckets. Without proper key management or collision resolution, our "extremely fast" retrieval system would quickly degrade in performance.

Implementation: Building a Hash Table from Scratch

Let's build a hash table in Python that demonstrates all the concepts we've discussed. We'll use chaining as our collision resolution strategy, where each bucket contains a list of key-value pairs. This implementation includes deletion and automatic resizing.

class MikeysBoxys:
    # Step 1: Create table with a given size
    # To avoid collisions, we try to get values as close to <= size as possible
    def __init__(self, size):
        self.size = size
        self.count = 0
        self.mikeyboxy = [[] for _ in range(size)]  # buckets

    def get_array(self, n):
        return self.mikeyboxy[n - 1]

    # Step 2: Create hash function
    def hash_function(self, key):
        return sum(ord(c) for c in str(key)) % self.size

    # Step 3: Ensure hash function result maps to index
    def mapit(self, key, value):
        index = self.hash_function(key)
        # Check if key already exists in the bucket
        for pair in self.mikeyboxy[index]:
            if pair[0] == key:
                pair[1] = value
                return
        # If key doesn't exist, append new key-value pair
        self.mikeyboxy[index].append([key, value])
        self.count += 1
        # Auto-resize when load factor exceeds 0.75
        if self.count / self.size > 0.75:
            self.autosize()

    # Find value for respective key
    def lookupval(self, key):
        # Use the hash value to trace back the key and find its value
        index = self.hash_function(key)
        for pair in self.mikeyboxy[index]:
            if pair[0] == key:
                return pair[1]
        return None

    # Step 4: Enable deletion - performs mapit process but removes key instead
    def deletekey(self, key):
        index = self.hash_function(key)
        for i, pair in enumerate(self.mikeyboxy[index]):
            if pair[0] == key:
                del self.mikeyboxy[index][i]
                self.count -= 1
                return True
        return False

    # Step 5: Double hash table size when keys/bucket threshold is crossed
    def autosize(self):
        old_table = self.mikeyboxy
        self.size *= 2
        self.mikeyboxy = [[] for _ in range(self.size)]
        self.count = 0
        for bucket in old_table:
            for key, value in bucket:
                self.mapit(key, value)

Using the Hash Table

# Create a hash table with 20 buckets
mb = MikeysBoxys(20)

# Access bucket at index 5
sixy = mb.mikeyboxy[5]
print(sixy)  # []

# Get array at position 6
n_array = mb.get_array(6)
print(n_array)  # []

# Create another hash table and add key-value pairs
ht = MikeysBoxys(23)
ht.mapit("mike", 50)
ht.mapit("miky", 50)

print(ht.lookupval("mike"))  # 50
print(ht.lookupval("miky"))  # 50

Testing Deletion and Auto-Resize

# Create a small table to test auto-resizing
test = MikeysBoxys(6)
test.mapit("mike", 50)
test.mapit("miky", 60)
test.mapit("rike", 70)
test.mapit("riky", 80)
test.mapit("dike", 90)
test.mapit("diky", 95)

# Delete a key
test.deletekey("rike")

# Check bucket contents
testcheck = test.mikeyboxy[3]
print(testcheck)

# Table has auto-resized from 6 to 12
print(test.size)  # 12

How Chained Pairs Are Stored

When collisions occur, multiple key-value pairs are stored in the same bucket as a chain. Here's how the structure looks when bucket 5 has multiple entries:

# Structure of chained pairs in bucket 5:
mikeyboxy[5][0][0] → "key0"    mikeyboxy[5][0][1] → "value0"
mikeyboxy[5][1][0] → "key1"    mikeyboxy[5][1][1] → "value1"
mikeyboxy[5][2][0] → "key2"    mikeyboxy[5][2][1] → "value2"
mikeyboxy[5][3][0] → "key3"    mikeyboxy[5][3][1] → "value3"
mikeyboxy[5][4][0] → "key4"    mikeyboxy[5][4][1] → "value4"
mikeyboxy[5][5][0] → "key5"    mikeyboxy[5][5][1] → "value5"
mikeyboxy[5][6][0] → "key6"    mikeyboxy[5][6][1] → "value6"
mikeyboxy[5][7][0] → "key7"    mikeyboxy[5][7][1] → "value7"

Each bucket is a list of lists. The outer index selects the bucket, the middle index selects which pair in the chain, and the inner index (0 or 1) selects the key or value.

In the next part, we will explore alternative collision resolution strategies like Open Addressing and Linear Probing.