with implementation in Python!
by Sahil Shaikh, GDSC MCE DSA CO-ORDINATOR
Why Hash MapsHash MapsThe Hash Function1) The Remainder Method:2) Folding Method:3) Mid Square Method:Some things to note :Collision Resolution1) Open Addressing:2) ChainingImplementation:Two SumMore Problems:
Why Hash Maps
- In all the data structures we've looked at so far (list, linked list) the elements are stored sequentially in the order they are entered.
- Consider a list consisting of 10 elements:
- If we want to find the element 7 in this list, we would have to traverse the list elements one by one and check if that element is equal to 7.
This process is called linear search.
- So in the example, we find 7 after traversing through 3 list indexes.
- Let's say that the process of traversing one element takes us 1 millisecond, going by this assumption we can say that it takes us 3 milliseconds to find the element 7 in the above list.
- Similarly, if we had to search for the element 8 in this list, we can say that it would take us 5 milliseconds.
- We can observe that as the position of the element increases the amount of time it takes to search for the element also increases.
- Thus in the worst case, it would take us 'n' milliseconds to find the element we are looking for, where 'n' is the length of the list.
The time complexity of linear search is O(n).
- Hash maps look to solve this problem by allowing us to search for an element in a (almost) constant amount of time irrespective of the position of the element in the list.
- How can this be possible? The only way to access an element in constant time is if we know the index of the element beforehand.
- In a hash map, a particular element will always have the same position, so if we had to search for an element we would just find its position using the hash function and return the element at that position (this is not always the case as we will see later on).
Hash Maps
- Hash maps (also called hash tables or just hashes) are a data structure in which each element is stored in such a way that searching for the element takes us a constant amount of time.
- Each position of the hash map is called a slot, it can hold an item and is named by an integer value starting at 0.
- Hash maps do not store elements sequentially, instead, the position of the element is decided based on a hash function.
The Hash Function
- The mapping between an item and the slot where that item belongs in the hash map is called the hash function.
- The slot of an element in a hash map is decided by the hash function.
- There are many different types of hash functions (remainder method, mid square method, folding method etc. ).
1) The Remainder Method:
- The first hash function we'll look at is the remainder method, it is the most basic hash function and it is the hash function we'll use in our implementation.
- Let's suppose that we have the elements 33, 46, 93, 61, 79 and 94.
- The remainder method simply takes an item in the collection and divides it by the hash table size, returning the remainder as its hash value(i.e. the slot number).
eg: 33/11 = 0 (where 11 is the size of the hash table) , this means that the element 33 will be stored in slot 0.
similarly the slots of the elements 46, 93, 61, 80, 96 are 2, 5, 6, 3 and 8 respectively.
- Once the hash values are computed, we can insert each element into the hash table at the designated slot.
- Our hash map after insertion would look like:
- 6 of the 11 available slots are now filled up. This is referred to as the load factor(λ).
Here λ = 6/11.
- Now when we want to search for an item we simply use the remainder method (i.e. the hash function) to compute the slot name of the item and check the hash table to see if it is present.
The time complexity of this search operation is O(1).
- It is important to note that this technique would only work if each item maps to a unique location in the hash map.
eg: if we were to add the element 77 to our hash map, it would have the hash value of 0 (77%11 = 0).
- According to the hash function, two or more items would need to be in the same slot. This is referred to as a collision.
- We'll see how to deal with collisions in the coming sections, but a more sophisticated hash function than our remainder method can reduce the number of collisions we encounter.
2) Folding Method:
- In the folding method, the item to be added is divided into equal sizes (the last piece may not be of equal size).
- These pieces are added together to give the resulting hash value which is then divided by the size of the hash table to determine the slot of the item.
eg: Let the item to be added be 4365554601, on dividing it into groups of two we get 43, 65, 55, 46, 01. On added these pieces we get 210 (43 + 65 + 55 + 46 + 01). 210 is then divided by 11 and we get the remainder 10. This is the hash value of 4365554601.
3) Mid Square Method:
- In this method, the item to be added is first squared and then a portion of the result is extracted. This portion is then divided by the length of the hash to obtain the hash value.
eg: Let the item be 44, on squaring 44 we get 1936. Then we can extract the middle two digits (i.e. 93) and divide it by the hash table length. The remainder obtained is the hash value.
we can also create hash functions for character-based items by using their ASCII values.
Some things to note :
- Given a collection of items, a hash function that maps each item into a unique slot is referred to as a perfect hash function.
- If we know the items and if the collection will never change then it is possible to construct a perfect hash function.
- Unfortunately, given an arbitrary collection of items, it is not possible to construct a perfect hash function.
- A good hash function should be sophisticated enough to prevent a large number of collisions but it should not be too complex, this would slow down the search operation severely thereby defeating the purpose of hashing.
Collision Resolution
- When two items have the same slot, we must have a systematic method for placing the second item in the hash table. This process is called collision resolution.
- There are several different ways of resolving collisions some more effective than others, we'll look at a few here.
1) Open Addressing:
- In this method, when a collision occurs we try to find another open slot to place the item which caused the collision.
- One way to do this would be to move from the original hash position sequentially until we encounter an empty slot.
- This method is called open addressing and since we are sequentially visiting each slot, this technique is called linear probing.
- It is important to note that, when we search for the items we should utilize this same technique
eg. Consider the hash map from our previous example:
Let's say we want to insert 77 into the hash table. On performing the remainder method we find that the hash value for 77 is 0. Since 33 is already present at 0, a collision occurs. Now using linear probing we move to the next slot, since that slot is empty we need not look for any other slot now. We can insert 77 at slot 1. Now our hash map looks like:
- A disadvantage of linear probing is the tendency for clustering. when many collisions occur at the same hash value, a number of surrounding slots will get filled up. This will have an impact on other items being inserted.
- To avoid clustering, instead of moving sequentially to find a new slot we can skip ahead by n (i.e. 2, 3...) slots, this will distribute the values more evenly thereby preventing clustering.
2) Chaining
- In this method, we make each slot hold a reference to a collection of items that have the same hash value.
- This allows many items to exist at the same location in the hash table.
eg: Using chaining our hash map from the previous example would look like:
- Now to find the item we search through the collection present at the hash value.
Implementation:
- Dictionaries in python are implementations of hash maps.
- We will now try to implement a dictionary using a simple remainder method as our hash function and we will use linear probing for collision resolution.
Python uses a far superior hash function and collision resolution technique to implement dictionaries.
- A dictionary holds key-value pairs. The key is used to look up the associated data value.
- We will use two lists to implement our dictionary. The first list, called slots, will hold the keys and the second list, called data, will hold the values.
- We will treat the key list as a hash table. The initial size of our dictionary will be 11.
class HashTable:
def __init__(self):
self.size = 11
self.slots = [None] * self.size
self.data = [None] * self.size
- Now we need a hash function to compute the hash value for an item.
- We will use a basic remainder method as our hash function:
class HashTable:
def __init__(self):
self.size = 11
self.slots = [None] * self.size
self.data = [None] * self.size
def hash_function(self, key, size):
return key % size
- Now we need to add a
put
method to insert values into the hash map.
- We will also need a method to perform collision resolution if a collision occurs. We will call this method
rehash
.
- The
rehash
method performs linear probing with a skip value of 1. It increments the hash value by one and then divides it with the length of the hash to obtain the new hash value.
class HashTable:
def __init__(self):
self.size = 11
self.slots = [None] * self.size
self.data = [None] * self.size
def put(self, key, data):
hash_value = self.hash_function(key, len(self.slots))
if self.slots[hash_value] is None:
self.slots[hash_value] = key
self.data[hash_value] = data
else:
if self.slots[hash_value] == key:
self.data[hash_value] = data # replace
else:
next_slot = self.rehash(hash_value, len(self.slots))
while (
self.slots[next_slot] is not None
and self.slots[next_slot] != key
and next_slot != hash_value
):
next_slot = self.rehash(next_slot, len(self.slots))
if next_slot == hash_value:
print ("hash is full")
return
if self.slots[next_slot] is None:
self.slots[next_slot] = key
self.data[next_slot] = data
else:
self.data[next_slot] = data
def hash_function(self, key, size):
return key % size
def rehash(self, old_hash, size):
return (old_hash + 1) % size
- Our
put
method first calls the hash function to get the hash value for the item to be inserted.
- Then we check if there is an item already present at that position.
- If that slot is unoccupied we go ahead and insert the key and value at this slot in their respective lists.
- If that slot is already occupied, we first check if the key present there is the same as our key, if this is the case then we replace the current value of the key with the new value.
- If that slot is occupied by another key, then we call the rehash function to get a new hash value.
- We repeatedly call the
rehash
function until a free slot is obtained, or until we reach back to the initial hash value.
- If our new hash value becomes equal to the initial hash value it means that we've traversed our way back to the original slot. This means that there is no free slot in which case we print a message saying that the hash is full.
- Then we check if our new slot is empty, if this is true we can go ahead and insert our key and value into the slot and data lists respectively.
- If the slot is not None we can safely assume that the new hash position already has our key value, then we can just overwrite the existing data with the new data.
- Now we need a method to get the values we inserted in our hash.
def get(self, key):
start_slot = self.hash_function(key, self.size)
position = start_slot
while self.slots[position] is not None:
if self.slots[position] == key:
return self.data[position]
else:
position = self.rehash(position, len(self.slots))
if position == start_slot:
return None
- To get an item from our hash we need to use the same techniques we used to insert the elements.
- Our
get
method makes use of the samehash_function()
method as before to find the hash value of our key.
- Then we check if our key is present in this slot, our key could be in a different slot if a collision occurred during insertion.
- If the key is present we return its corresponding value from the data list.
- If the key is not present in this slot, it means that a collision must have occurred during insertion.
- So we need to make use of our rehash function to find its next possible position.
- Then we also need to check if we have returned back to our initial slot, this means that the key is not present in this hash and we can return None.
- We repeat the above process until the key is found or until the rehash value becomes equal to the initial slot.
- Now we also need to implement a method to delete a key-value pair from our dictionary:
def remove(self, key):
start_slot = self.hash_function(key, self.size)
position = start_slot
while self.slots[position] != key:
position = self.rehash(position, len(self.slots))
if position == start_slot:
return "key not found"
self.slots[position] = None
self.data[position] = None
- To delete a key-value pair, we first need to locate the key in our hash table.
- So our
remove
method calls thehash_function()
method to calculate the hash value of our key.
- Then we check if our key is present in this hash value, if it is not present we call the
rehash
method to calculate the new hash position.
- We repeat this process until our key is found or until we reach back to our initial slot which would mean the key is not present in our hash.
- Then we set the key-value pair to None, effectively deleting it from our dictionary.
- The complete code for our dictionary is given below:
class HashTable:
def __init__(self):
self.size = 11
self.slots = [None] * self.size
self.data = [None] * self.size
def put(self, key, data):
hash_value = self.hash_function(key, len(self.slots))
if self.slots[hash_value] is None:
self.slots[hash_value] = key
self.data[hash_value] = data
else:
if self.slots[hash_value] == key:
self.data[hash_value] = data
else:
next_slot = self.rehash(hash_value, len(self.slots))
while (
self.slots[next_slot] is not None
and self.slots[next_slot] != key
and next_slot != hash_value
):
next_slot = self.rehash(next_slot, len(self.slots))
if next_slot == hash_value:
print ("hash is full")
return
if self.slots[next_slot] is None:
self.slots[next_slot] = key
self.data[next_slot] = data
else:
self.data[next_slot] = data
def hash_function(self, key, size):
return key % size
def rehash(self, old_hash, size):
return (old_hash + 1) % size
def get(self, key):
start_slot = self.hash_function(key, self.size)
position = start_slot
while self.slots[position] is not None:
if self.slots[position] == key:
return self.data[position]
else:
position = self.rehash(position, len(self.slots))
if position == start_slot:
return None
def remove(self, key):
start_slot = self.hash_function(key, self.size)
position = start_slot
while self.slots[position] != key:
position = self.rehash(position, len(self.slots))
if position == start_slot:
return "key not found"
self.slots[position] = None
self.data[position] = None
Two Sum
- Let's solve an easy problem to get an idea of the use case of hash maps.
- The problem we'll be solving is the two sum problem, you can see the problem statement here.
Given an array of integersnums
and an integertarget
, return indices of the two numbers such that they add up totarget
. You may assume that each input would have exactly one solution, and you may not use the same element twice.
- The solution to this problem is very simple. we can loop through each element and check if there is another element in the list which when added with the initial element would be equal to the target.
def twoSum(self, nums, target):
"""
:type nums: List[int]
:type target: int
:rtype: List[int]
"""
length = len(nums)
for i in range (0,length-1,1):
for j in range(i+1,length,1):
sum = nums[i] + nums[j]
if sum == target:
return [i,j]
else:
continue
- This code would get the job done, but it is not very efficient.
- Since we are using a nested loop in our program our time complexity is O(n).
- We reduce this time complexity by using hash maps.
class Solution(object):
def twoSum(self, nums, target):
"""
:type nums: List[int]
:type target: int
:rtype: List[int]
"""
num_map=dict()
for i in range(len(nums)):
complement=target-nums[i]
if (complement in num_map):
return [num_map[complement],i]
else:
num_map[nums[i]]=i
print(num_map)
- Here we create a dictionary called
num_map
.
- As we traverse through each element in the list we calculate its complement.
for eg: let the list be [1, 5, 2, 3] and let the target be 7. The complement would be the element required to reach the target value. The complement of 1 is 6 (7-1). The complement of 5 would be 2 and so on.
- If the complement of that element does not exist in the dictionary, then we can add that element into our dictionary and assign the index of that element as the value.
- If the complement of that element exists in the dictionary then we return the position of the complement and the position of the element.
- Solving the problem this way, while not intuitive at first, will provide us with a massive performance gain.
- This time complexity of this solution is O(n).
More Problems:
You can practice more problems here