Our Final Data Structure:
What we've created now is called a hash table.
- Inputs are converted by a hash function (hashcode) into an integer. Then, they're converted to a valid index using the modulus operator. Then, they're added at that index (dealing with collisions using LinkedLists).
contains works in a similar fashion: figure out the valid index, and look in the corresponding LinkedList for the item.
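The structure described above can be sketched in Java. This is a minimal illustration, not a definitive implementation: the class name SimpleHashSet and its details are assumptions made for this example.

```java
import java.util.LinkedList;

// A minimal separate-chaining hash set with a fixed number of buckets.
public class SimpleHashSet<T> {
    private LinkedList<T>[] buckets;

    @SuppressWarnings("unchecked")
    public SimpleHashSet(int numBuckets) {
        buckets = (LinkedList<T>[]) new LinkedList[numBuckets];
        for (int i = 0; i < numBuckets; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    /** Converts a hashcode (possibly negative) into a valid bucket index. */
    private int indexOf(T item) {
        return Math.floorMod(item.hashCode(), buckets.length);
    }

    public void add(T item) {
        int i = indexOf(item);
        if (!buckets[i].contains(item)) {  // no duplicates allowed
            buckets[i].add(item);
        }
    }

    public boolean contains(T item) {
        // Same index computation as add, then search that one LinkedList.
        return buckets[indexOf(item)].contains(item);
    }
}
```

Note the use of `Math.floorMod` rather than `%`: in Java, `%` can return a negative result for a negative hashcode, which would be an invalid index.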
Dealing with Runtime
The only issue left to solve is the issue of runtime. If we have 100 items, and our underlying array is of size 5, then:
- In the best case, all items get sent to different indices evenly. That is, we have 5 LinkedLists, and each one contains 20 of the items.
- In the worst case, all items get sent to the same index! That is, we have just 1 LinkedList, and it has all 100 items.
There are two ways to try to fix this:
- Dynamically growing our hashtable.
- Improving our Hashcodes
Dynamically growing the hash table
Suppose we have M buckets (indices) and N items. We say that our load factor is N/M.
(Note that the load factor is equivalent to our best case runtime from above.)
So... we have incentive to keep our load factor low (after all, it is the best runtime we could possibly achieve!).
And note that if we keep M (the number of buckets) fixed, and N keeps increasing, the load factor consistently keeps increasing.
Strategy? Every so often, just double M. The way we do this is as follows:
- Create a new HashTable with 2M buckets.
- Iterate through all the items in the old HashTable, and one by one, add them into this new HashTable.
- We need to add the elements one by one again because the size of the array changes, so the modulus changes too, and therefore each item probably belongs in a different bucket in the new hash table than it did in the old one.
We do this by setting a load factor threshold. As soon as the load factor becomes bigger than this threshold, we resize.
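The resizing strategy above can be sketched by extending the earlier idea. The threshold value, class name, and helper names here are illustrative assumptions, not a prescribed implementation.

```java
import java.util.LinkedList;

// A separate-chaining hash set that doubles its bucket count whenever
// the load factor N/M exceeds a fixed threshold.
public class ResizableHashSet<T> {
    private static final double LOAD_FACTOR_THRESHOLD = 1.5;  // example value
    private LinkedList<T>[] buckets;
    private int size = 0;  // N, the number of items stored

    @SuppressWarnings("unchecked")
    public ResizableHashSet() {
        buckets = (LinkedList<T>[]) new LinkedList[4];
        for (int i = 0; i < 4; i++) buckets[i] = new LinkedList<>();
    }

    private int indexOf(T item, int numBuckets) {
        return Math.floorMod(item.hashCode(), numBuckets);
    }

    public void add(T item) {
        if (contains(item)) return;
        if ((double) (size + 1) / buckets.length > LOAD_FACTOR_THRESHOLD) {
            resize();  // keep N/M under the threshold
        }
        buckets[indexOf(item, buckets.length)].add(item);
        size += 1;
    }

    public boolean contains(T item) {
        return buckets[indexOf(item, buckets.length)].contains(item);
    }

    /** Doubles M and re-adds every item. Each re-add is a constant-time
     *  addFirst with no duplicate check (we know there are no duplicates),
     *  so the whole resize takes time linear in N. */
    @SuppressWarnings("unchecked")
    private void resize() {
        LinkedList<T>[] bigger =
                (LinkedList<T>[]) new LinkedList[buckets.length * 2];
        for (int i = 0; i < bigger.length; i++) bigger[i] = new LinkedList<>();
        for (LinkedList<T> bucket : buckets) {
            for (T item : bucket) {
                bigger[indexOf(item, bigger.length)].addFirst(item);
            }
        }
        buckets = bigger;
    }

    public int numBuckets() { return buckets.length; }
    public int size() { return size; }
}
```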
Take a look at the example below. The hashcode for "helmet" is 13. In the first hash table, it gets sent to bucket 13 % M; in the second (resized) hash table, it gets sent to bucket 13 % 2M, which is generally a different bucket. Note that resizing the hash table also helps with shuffling the items in the hash table!
At this point, assuming items are evenly distributed, all the lists will be approximately N/M items long, resulting in Θ(N/M) runtime. Remember that N/M is only allowed to be under a constant load factor threshold, and so, Θ(N/M) = Θ(1).
Note also that resizing takes Θ(N) time. Why? Because we need to add N items to the hash table, and each add takes Θ(1) time.
A small point: when doing the resize, we don't actually need to check if the items are already present in the LinkedList or not (because we know there are no duplicates), so we can just add each item in time for sure by adding it to the front of the linked list. (Recall that usually we have to search the LinkedList to make sure the item isn't there... but we can skip that step when resizing.)
Of course, we need to revisit our assumption of assuming items are evenly distributed. If items are not evenly distributed, our runtime will be Θ(N) because there could be a single linked list of size N.
Assuming that items are evenly distributed?
Items will distribute evenly if we have good hash codes (i.e. hashcodes which give fairly random values for different items.) Doing this in general is.. well... hard.
Some general good rules of thumb:
- Use a 'base' strategy similar to the one we developed earlier.
- Use a 'base' that's a small prime.
- Base 126 isn't actually very good, because using base 126 means that any string that ends in the same last 32 characters has the same hashcode.
- This happens because of overflow.
- Using prime numbers helps avoid overflow issues (i.e., collisions due to overflow).
- Why a small prime? Because it's easier to compute.
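As a concrete instance of the small-prime rule of thumb: Java's own String.hashCode uses base 31. A standalone version of that computation might look like this (the class and method names are illustrative).

```java
// Computing a string hashcode with a small prime base. This is the same
// scheme as Java's built-in String.hashCode, which uses base 31:
//   h = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
// Integer overflow simply wraps around mod 2^32, which is fine; the even
// base 126 is worse precisely because of how it interacts with overflow.
public class StringHash {
    public static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }
}
```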
Wow, we've just gone through the creation of a data structure from scratch! Proud of you guys. To apply your knowledge, finish HW3: https://sp19.datastructur.es/materials/hw/hw3/hw3