Hashing
Issues with what we've seen so far
So far, we've looked at a few data structures for efficiently searching for the existence of items within the data structure. We looked at Binary Search Trees, then made them balanced using 2-3 Trees.
However, there are some limitations that these structures impose (yes, even 2-3 trees!)
- They require that items be comparable. How do you decide where a new item goes in a BST? You have to answer the question "are you smaller than or bigger than the root"? For some objects, this question may make no sense.
- They give a complexity of . Is this good? Absolutely. But maybe we can do better.
A first attempt: DataIndexedIntegerSet
Let us begin by considering the following approach.
For now, we're only going to try to improve issue #2 above (improve complexity from to . We're going to not worry about issue #1 (comparability). In fact, we're going to only consider storing and searching for int
s.
Here's an idea: let's create an ArrayList of type boolean
and size 2 billion. Let everything be false by default.
- The
add(int x)
method simply sets thex
position in our ArrayList to true. This takes time. - The
contains(int x)
method simply returns whether thex
position in our ArrayList istrue
orfalse
. This also takes time!
public class DataIndexedIntegerSet {
private boolean[] present;
public DataIndexedIntegerSet() {
present = new boolean[2000000000];
}
public void add(int x) {
present[i] = true;
}
public boolean contains(int x) {
return present[i];
}
There, we have it. That's all folks.
Well, not really. What are some potential issues with this approach?
- Extremely wasteful. If we assume that a
boolean
takes 1 byte to store, the above needs2GB
of space pernew DataIndexedIntegerSet()
. Moreover, the user may only insert a handful of items... - What do we do if someone wants to insert a
String
?- Let's look at this next. Of course, we may want to insert other things, like
Dog
s. That'll come soon!
- Let's look at this next. Of course, we may want to insert other things, like