Solving the word-insertion problem

Our DataIndexedIntegerSet only allowed for integers, but now we want to insert the String "cat" into it. We'll call our data structure that can insert strings DataIntexedEnglishWordSet Here's a crazy idea: let's give every string a number. Maybe "cat" can be 1, "dog" can be 2 and "turtle" can be 3.

(The way this would work is –– if someone wanted to add a "cat" to our data structure, we would 'figure out' that the number for "cat" is 1, and then set present[1] to be true. If someone wanted to ask us if "cat" is in our data structure, we would 'figure out' that "cat" is 1, and check if present[1] is true.)

But then if someone tries to insert the word "potatocactus", we'll not know what to do. We need to develop a general strategy so that given a string, we can figure out a number representation for it.

Strategy 1: Use the first letter.

A simple idea is to just use the first character of any given string to convert it to its number representation. So "cat" -> "c" -> 3. "Dog" -> "d" -> 4. But also, "drum" -> "d" -> 4.

What if someone wanted to insert "dog" and "drum" into our DataIntexedEnglishWordSet? All bets are off, and we don't know how to do it.

Note that when two different inputs ("dog" and "drum") map to the same integer, we call that a collision. We don't know how to deal with collisions yet, so let's figure out a way to avoid them. (What we do with most of our problems lol...)

Strategy 2: Avoiding collisions

To motivate this part, let's understand how our number system works.

A four digit number, say 5149, can be written as

5103+1102+4101+91005 \cdot 10^3 + 1 \cdot 10^2 + 4 \cdot 10^1 + 9 \cdot 10^0.

Actually, any 4 digit number can be written uniquely in this form. What that means is given 4 digits, a,b,c,da, b, c, d, we can write a103+b102+c101+d100a \cdot 10^3 + b \cdot 10^2 + c \cdot 10^1 + d \cdot 10^0 and that gives us a unique 4 digit number: abcdabcd.

Notice that the 1010 is important here. If we chose a bad number, say 2, the same is not true. Let's make sure we see what happens if we chose 22 as the multiplier.

a,b,c,d=1,1,1,1a, b, c, d = 1, 1, 1, 1 gives 123+122+121+120=151 \cdot 2^3 + 1 \cdot 2^2 + 1 \cdot 2^1 + 1 \cdot 2^0 = 15.
a,b,c,d=0,3,1,1a, b, c, d = 0, 3, 1, 1 gives 023+322+121+120=150 \cdot 2^3 + 3 \cdot 2^2 + 1 \cdot 2^1 + 1 \cdot 2^0 = 15.

A collision on inputs (1, 1, 1, 1) and (0, 3, 1, 1)!

So why is 1010 important? It's because there are 10 unique digits in our decimal system: 0,1,2,3,4,5,6,7,8,90, 1, 2, 3, 4, 5, 6, 7, 8, 9.

Similarly, there are 2626 unique characters in the english lowercase alphabet. Why not give each one a number: a=1,b=2,,z=26a=1, b=2, \ldots, z=26. Now, we can write any unique lowercase string in base 26. (Note that base 26 simply means that we will use 26 as the multiplier, much like we used 10 and 2 as examples above.)

  • "cat" = "c" 26226^2 + 'a' 26126^1 + 't' 26026^0_ = 3262+1261+20260=20743_26^2 + 1_26^1 + 20_26^0 = 2074.

Quick check

  • How do you represent "dog"?

This representation gives a unique integer to every english word containing lowercase letters, much like using base 10 gives a unique representation to every number. We are guaranteed to not have collisions.

Our Data Structure DataIndexedEnglishWordSet

public class DataIndexedEnglishWordSet {
    private boolean[] present;

    public DataIndexedEnglishWordSet() {
        present = new boolean[2000000000];
    }

    public void add(String s) {
        present[englishToInt(s)] = true;
    }

    public boolean contains(int i) {
        resent present[englishToInt(s)];
    }
}

Uses a helper method

public static int letterNum(String s, int i) {
    /** Converts ith character of String to a letter number.
    * e.g. 'a' -> 1, 'b' -> 2, 'z' -> 26 */
    int ithChar = s.charAt(i)
    if ((ithChar < 'a') || (ithChar > 'z')) {
        throw new IllegalArgumentException();
    }

    return ithChar - 'a' + 1;
}

public static int englishToInt(String s) {
    int intRep = 0;
    for (int i = 0l i < s.length(); i += 1) {
        intRep = intRep * 26;
        intRep += letterNum(s, i);
    }

    return intRep;
}

Where are we?

Recall, we started with wanting to

(a) Be better than Θ(logN)\Theta(\log N). We've now done this for integers and for single english words.

(b) Allow for non-comparable items. We haven't touched this yet, although we are getting there. So far, we've only learnt how to add integers and english words, both of which are comparable, but, have we ever used the fact that they are comparable? I.e., have we ever tried to compare them (like we did in BSTs)? No. So we're getting there, but we haven't actually inserted anything non-comparable yet.

(c) We have data structures that insert integers and english words. Let's make a quick visit to inserting arbitrary String objects, with spaces and all that. And maybe even insert other languages and emojis!

(d) Further recall that our approach is still very wasteful of memory. We haven't solved that issue yet!

results matching ""

    No results matching ""