Home Data Engineering Data DIY Tutorial on Hash Table Data Structure

Tutorial on Hash Table Data Structure

May 24, 2021

Hash table data structure (aka dictionary, hash map, associate array) is a key-value pairs mapping backed by a resizeable array data structure

Attributes

Hash table uses a load factor to decide when to grow the internal array capacity
A hash function is used to generate a hash code and array index to properly distribute entries across the array buckets
Hash codes generated by the hash function may be duplicated. Separate chaining through a linked list or red-black tree data structure is the most common solution to resolve the problem

Public operations

Search, aka Get: retrieve the value by key
Add, aka Put: add a key-value entry into the hash table
Delete, aka Remove: remove an entry from the hash table by key

Internal operations

Hashing: properly distribute entries across the hash table
Resize: auto grow hash table capacity
Rehash: re-execute hashing operation for every entry in the buckets

Time complexity

Hash table offers constant-time O(1) on average and linear-time O(n) in worst case performance for search, add and delete operations

Let’s walk through this tutorial to explore them in more details

The load factor

Hash table is backed by an array and it uses the load factor to decide when to grow the array capacity

Load factor is defined as below

loadFactor = numberOfEntries / arrayCapacity

arrayCapacity, aka the number of buckets

As a general rule, loadFactor = 0.75 offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup costs and vice versa

Hashing operation

Hashing is the hash table internal operation which distributes entries across an array

Hash table uses a hash function to compute the key object to a hash code number

hashCode = hashFunction(keyObject)

Hash code is reduced to array index via modulo operator %

index = hashCode % arrayCapacity

Then the computed index is used to assign value object to the array

array[index] = valueObject

Hash collision

There’s no perfect hash function. The hash codes computed by the hash function may be duplicate. However, the less duplicate hash codes, the better performance gain

There’re some solutions to resolve the hash collision. Use a linked list or red-black tree data structure to save duplicate hash code entries in separate chaining is the most common way

Resize and rehash operations

Resize happens when the number of entries in the hash table is greater than the product of load factor and the array capacity

numberOfEntries > loadFactor * arrayCapacity

Hashing operation has to be executed again for every entry in the buckets right after the execution of resize operation

Resize and rehash operations are costly in term of performance, so choosing the initial capacity when creating a new Hash table is important to minimize the performance impact

Implementation example

import java.util.Arrays;  
import java.util.Objects;

public class MyHashtable<K, V> {  
    private static final int INITIAL_CAPACITY = 16;
    private static final float LOAD_FACTOR = 0.75f;
    private int size = 0;

    Entry<K,V>[] entries;

    @SuppressWarnings({"rawtypes", "unchecked"})
    public MyHashtable() {
        entries = (Entry<K, V>[]) new Entry[INITIAL_CAPACITY];
    }

    @SuppressWarnings({"rawtypes", "unchecked"})
    public MyHashtable(int capacity) {
        entries = (Entry<K, V>[]) new Entry[capacity];
    }

    public V put(K key, V value) {
        int index = hashFunction(key);
        Entry<K, V> headEntry = entries[index];
        Entry<K, V> currentEntry = headEntry;

        // Find key in the chain
        while (Objects.nonNull(currentEntry)) {
            if (Objects.equals(currentEntry.key, key)) {
                // Ignore duplicate key
                return currentEntry.value;
            }

            currentEntry = currentEntry.next;
        }

        // Add first to the chain
        Entry<K, V> newEntry = new Entry<>(key, value);
        newEntry.next = headEntry;
        entries[index] = newEntry;

        size++;

        resize();

        return null;
    }

    public V get(K key) {
        int index = hashFunction(key);
        Entry<K, V> currentEntry = entries[index];

        // Find key in the chain
        while (Objects.nonNull(currentEntry)) {
            if (Objects.equals(currentEntry.key, key)) {
                return currentEntry.value;
            }

            currentEntry = currentEntry.next;
        }

        return null;
    }

    public V remove(K key) {
        int index = hashFunction(key);
        Entry<K, V> currentEntry = entries[index];
        Entry<K, V> previousEntry = null;

        // Find key in the chain
        while (Objects.nonNull(currentEntry)) {
            if (Objects.equals(currentEntry.key, key)) {
                break;
            }

            previousEntry = currentEntry;
            currentEntry = currentEntry.next;
        }

        if (Objects.isNull(currentEntry)) {
            return null;
        }

        // Remove
        if (Objects.isNull(previousEntry)) {
            entries[index] = currentEntry.next;
        } else {
            previousEntry.next = currentEntry.next;
        }

        size--;

        return currentEntry.value;
    }

    public int size() {
        return size;
    }

    private void resize() {
        if (size <= LOAD_FACTOR * entries.length) return;

        Entry<K,V>[] currentEntries = entries;
        entries = (Entry<K, V>[]) new Entry[entries.length*2] ;

        rehash(currentEntries);
    }

    private void rehash(Entry<K,V>[] entries) {
        size = 0;

        for (Entry<K, V> entry : entries) {
            while (Objects.nonNull(entry)) {
                put(entry.key, entry.value);
                entry = entry.next;
            }
        }
    }

    private int hashFunction(K key) {
        int hashCode = key.hashCode();
        int index = hashCode % entries.length;

        return index;
    }

    public static class Entry<K, V> {
        K key;
        V value;

        Entry<K, V> next;

        Entry(K  key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    public static void main(String[] args) {
        MyHashtable<String, Integer> myHashtable = new MyHashtable<>(2);
        myHashtable.put("k1", 1);
        myHashtable.put("k2", 2);
        myHashtable.put("k2", 2);
        myHashtable.put("k3", 3);

        Arrays.stream(myHashtable.entries)
            .forEach(e -> {
                if (Objects.nonNull(e)) {
                    System.out.printf("%s=%d%s", e.key, e.value, System.lineSeparator());
                }
            });

        System.out.println(myHashtable.size());

        System.out.println(myHashtable.get("k2"));

        System.out.println(myHashtable.remove("k1"));

        System.out.println(myHashtable.get("k1"));

        System.out.println(myHashtable.size());
    }
}

Output

k3=3  
k1=1  
k2=2  
3  
2  
1  
null  
2

Applications

Hash table can be used to implement associative arrays, caches, database indexing and sets data structure

References

https://en.wikipedia.org/wiki/Hash_table

This article has been published from the source link without modifications to the text. Only the headline has been changed.

Source link

Tutorial on Hash Table Data Structure

Attributes

Public operations

Internal operations

Time complexity

The load factor

Hashing operation

Follow Us

POPULAR POSTS

POPULAR CATEGORY

Britain hopes to rival the U.S. as a crypto hub