Build trie faster

Question 1

Double-Array tries are very fast to save/load because all data is stored in linear arrays. Lookups are also very fast, but insertions can be costly. I bet there is a Java implementation somewhere.

Also, if your data is static (i.e. you don't update it on the phone), consider a DAFSA (deterministic acyclic finite state automaton) for your task. It is one of the most efficient data structures for storing words: it should beat both "standard" tries and radix tries on size and speed, beat succinct tries on speed, and often beat succinct tries on size as well. There is a good C++ implementation, dawgdic: you can use it to build a DAFSA from the command line and then use a Java reader for the resulting data structure (example implementation is here).

Question 2

You could store your trie as an array of nodes, with references to child nodes replaced with array indices. Your root node would be the first element. That way, you could easily store/load your trie from simple binary or text format.

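public class SimpleTrie {
    public class TrieNode {
        boolean valid;
        int[] children;   // array indices of the child nodes, not object references
    }
    private TrieNode[] nodes;   // preallocated nodes; nodes[0] is the root
    private int numberOfNodes;

    // Hands out the next unused preallocated node. Because nodes[0] is the
    // root, the pre-increment returns nodes[1], nodes[2], ... in turn.
    private TrieNode getNode() {
        TrieNode t = nodes[++numberOfNodes];
        return t;
    }
}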
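
To make the store/load part concrete, here is a minimal sketch of the same idea with the nodes flattened into primitive arrays. Everything in it is an assumption for illustration: the class name FlatTrie, the A-Z-only alphabet, the use of child index 0 as "no child" (safe because the root can never be a child), and the toBytes/fromBytes helpers. On a phone you would wrap a buffered file stream instead of a byte array.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Trie stored in flat arrays; node 0 is the root, child value 0 means "no child". */
class FlatTrie {
    static final int ALPHABET = 26;   // assumption: words use only 'A'..'Z'
    int[] children;                   // children[node * ALPHABET + letter] = child index
    boolean[] valid;                  // valid[node] = a word ends at this node
    int size = 1;                     // nodes in use; the root is always present

    FlatTrie(int capacity) {
        children = new int[capacity * ALPHABET];
        valid = new boolean[capacity];
    }

    void insert(String word) {
        int node = 0;
        for (int i = 0; i < word.length(); i++) {
            int slot = node * ALPHABET + (word.charAt(i) - 'A');
            if (children[slot] == 0) {
                if (size == valid.length) throw new IllegalStateException("capacity exceeded");
                children[slot] = size++;   // take the next unused node
            }
            node = children[slot];
        }
        valid[node] = true;
    }

    boolean contains(String word) {
        int node = 0;
        for (int i = 0; i < word.length(); i++) {
            node = children[node * ALPHABET + (word.charAt(i) - 'A')];
            if (node == 0) return false;
        }
        return valid[node];
    }

    /** Writes the node count, then the used part of both arrays. */
    byte[] toBytes() {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(size);
            for (int i = 0; i < size * ALPHABET; i++) out.writeInt(children[i]);
            for (int i = 0; i < size; i++) out.writeBoolean(valid[i]);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Rebuilds a trie from the bytes produced by toBytes(). */
    static FlatTrie fromBytes(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int n = in.readInt();
            FlatTrie t = new FlatTrie(n);
            t.size = n;
            for (int i = 0; i < n * ALPHABET; i++) t.children[i] = in.readInt();
            for (int i = 0; i < n; i++) t.valid[i] = in.readBoolean();
            return t;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Loading involves no per-node object creation, only two array fills, which is the point of replacing references with indices.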

Question 3

Just build a large String[] and sort it. Then you can use binary search to find the location of a String. You can also do a query based on prefixes without too much work.

Prefix look-up example:

Compare method:

private static int compare(String string, String prefix) {
    int n = Math.min(string.length(), prefix.length());
    for (int i = 0; i < n; i++) {
        char s = string.charAt(i);
        char p = prefix.charAt(i);
        if (s != p) {
            if (p < s) {
                // prefix is before string
                return -1;
            }
            // prefix is after string
            return 1;
        }
    }
    // All shared characters match. A string shorter than the prefix cannot
    // start with it and sorts before it, so the prefix is after the string.
    return (string.length() < prefix.length()) ? 1 : 0;
}

Finds an occurrence of the prefix in the array and returns its location (Integer.MAX_VALUE means not found)

private static int recursiveFind(String[] strings, String prefix, int start, int end) {
    if (start > end) {
        // empty range: nothing left to search
        return Integer.MAX_VALUE;
    }

    int middle = start + ((end - start) / 2);

    String middleValue = strings[middle];
    int comp = compare(middleValue, prefix);
    if (comp == 0)
        return middle;
    if (comp > 0)
        // prefix sorts after the middle value, so search the upper half
        return recursiveFind(strings, prefix, middle + 1, end);
    return recursiveFind(strings, prefix, start, middle - 1);
}

Gets a String array and prefix, prints out occurrences of prefix in array

private static boolean testPrefix(String[] strings, String prefix) {
    int i = recursiveFind(strings, prefix, 0, strings.length-1);
    if (i==Integer.MAX_VALUE || i==Integer.MIN_VALUE) {
        // not found
        return false;
    }
    // Found an occurrence, now search up and down for other occurrences
    int up = i+1;
    int down = i;
    while (down>=0) {
        String string = strings[down];
        if (compare(string,prefix)==0) {
            System.out.println(string);
        } else {
            break;
        }
        down--;
    }
    while (up<strings.length) {
        String string = strings[up];
        if (compare(string,prefix)==0) {
            System.out.println(string);
        } else {
            break;
        }
        up++;
    }
    return true;
}
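
As a variation on the same idea, you can binary-search for the first entry that is not smaller than the prefix and then walk forward while entries still start with it, which avoids scanning in both directions. This is only a sketch: the class PrefixSearch and its method names are made up for illustration, and it assumes the array was sorted with the natural String order (e.g. Arrays.sort).

```java
import java.util.ArrayList;
import java.util.List;

class PrefixSearch {
    /** Index of the first element >= key in a sorted array (sorted.length if none). */
    static int lowerBound(String[] sorted, String key) {
        int lo = 0;
        int hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid].compareTo(key) < 0) {
                lo = mid + 1;   // everything up to mid is smaller than key
            } else {
                hi = mid;       // mid itself may be the answer
            }
        }
        return lo;
    }

    /** All entries that start with prefix, in sorted order. */
    static List<String> withPrefix(String[] sorted, String prefix) {
        List<String> out = new ArrayList<>();
        // Entries sharing a prefix are contiguous and begin at the lower bound.
        for (int i = lowerBound(sorted, prefix);
             i < sorted.length && sorted[i].startsWith(prefix); i++) {
            out.add(sorted[i]);
        }
        return out;
    }
}
```

For example, with the sorted array {"TOO", "TREE", "TRIE", "TRY", "ZOO"}, withPrefix(words, "TR") collects TREE, TRIE and TRY.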

Question 4

Here's a reasonably compact format for storing a trie on disk. I'll specify it by its (efficient) deserialization algorithm. Initialize a stack whose initial contents are the root node of the trie. Read characters one by one and interpret them as follows. The meaning of a letter A-Z is "allocate a new node, make it a child of the current top of stack, and push the newly allocated node onto the stack". The letter indicates which position the child is in. The meaning of a space is "set the valid flag of the node on top of the stack to true". The meaning of a backspace (\b) is "pop the stack".

For example, the input

TREE \b\bIE \b\b\bOO \b\b\b

gives the word list

TREE
TRIE
TOO

On your desktop, construct the trie by whichever method you like and then serialize it with the following recursive algorithm (pseudocode).

serialize(node):
    if node is valid: put(' ')
    for letter in A-Z:
        if node has a child under letter:
            put(letter)
            serialize(child)
            put('\b')
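
In Java, both directions of this format might look roughly like the sketch below. The TrieCodec class, its Node type and the contains helper are assumptions made for illustration, and the code assumes the input is well formed and uses only the letters A-Z.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class TrieCodec {
    static final class Node {
        boolean valid;
        final Node[] children = new Node[26];   // assumption: letters 'A'..'Z' only
    }

    /** Recursive serializer following the pseudocode above. */
    static void serialize(Node node, StringBuilder out) {
        if (node.valid) out.append(' ');
        for (int i = 0; i < 26; i++) {
            if (node.children[i] != null) {
                out.append((char) ('A' + i));
                serialize(node.children[i], out);
                out.append('\b');
            }
        }
    }

    /** Stack-based reader: letter = push a new child, space = mark valid, \b = pop. */
    static Node deserialize(CharSequence data) {
        Node root = new Node();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == ' ') {
                stack.peek().valid = true;
            } else if (c == '\b') {
                stack.pop();
            } else {
                Node child = new Node();
                stack.peek().children[c - 'A'] = child;
                stack.push(child);
            }
        }
        return root;
    }

    /** Plain trie lookup, included only to check the result. */
    static boolean contains(Node root, String word) {
        Node n = root;
        for (int i = 0; i < word.length() && n != null; i++) {
            n = n.children[word.charAt(i) - 'A'];
        }
        return n != null && n.valid;
    }
}
```

Note that the serializer emits children in alphabetical order, so re-serializing the example input writes the TOO branch before the TREE/TRIE branch; the word list is the same.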

Question 5

This isn't a magic bullet, but you can probably reduce your runtime slightly by doing one big memory allocation instead of a bunch of little ones.

I saw a ~10% speedup in the test code below (C++, not Java, sorry) when I used a "node pool" instead of relying on individual allocations:

#include <string>
#include <fstream>

#define USE_NODE_POOL

#ifdef USE_NODE_POOL
struct Node;
Node *node_pool;
int node_pool_idx = 0;
#endif

struct Node {
    void insert(const std::string &s) { insert_helper(s, 0); }
    void insert_helper(const std::string &s, int idx) {
        if (idx >= s.length()) return;
        int char_idx = s[idx] - 'A';
        if (children[char_idx] == nullptr) {
#ifdef USE_NODE_POOL
            children[char_idx] = &node_pool[node_pool_idx++];
#else
            children[char_idx] = new Node();
#endif
        }
        children[char_idx]->insert_helper(s, idx + 1);
    }
    Node *children[26] = {};
};

int main() {
#ifdef USE_NODE_POOL
    node_pool = new Node[400000];
#endif
    Node n;
    std::ifstream fin("TWL06.txt");
    std::string word;
    while (fin >> word) n.insert(word);
}

Question 6

Tries that preallocate space for all possible children (256) waste a huge amount of space. You are making your cache cry. Store those pointers to children in a resizable data structure instead.

Some tries will optimize by having one node to represent a long string, and break that string up only when needed.
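
One possible shape for the resizable child storage, as a rough sketch: each node keeps a sorted array of child labels plus a parallel array of child nodes, and both grow by one slot only when a new child appears. The names CompactTrie, keys, kids and childOrCreate are invented for this example. It does not do the string-compression optimization mentioned above.

```java
import java.util.Arrays;

class CompactTrie {
    static final class Node {
        boolean valid;
        char[] keys = new char[0];   // sorted child labels
        Node[] kids = new Node[0];   // kids[i] is the child reached via keys[i]

        /** Returns the child for c, or null if there is none. */
        Node child(char c) {
            int i = Arrays.binarySearch(keys, c);
            return i >= 0 ? kids[i] : null;
        }

        /** Returns the child for c, inserting it in sorted position if missing. */
        Node childOrCreate(char c) {
            int i = Arrays.binarySearch(keys, c);
            if (i >= 0) return kids[i];
            int at = -(i + 1);   // insertion point reported by binarySearch
            keys = Arrays.copyOf(keys, keys.length + 1);
            kids = Arrays.copyOf(kids, kids.length + 1);
            // Shift the tail one slot to the right to open a gap at 'at'.
            System.arraycopy(keys, at, keys, at + 1, keys.length - 1 - at);
            System.arraycopy(kids, at, kids, at + 1, kids.length - 1 - at);
            keys[at] = c;
            kids[at] = new Node();
            return kids[at];
        }
    }

    final Node root = new Node();

    void insert(String word) {
        Node n = root;
        for (int i = 0; i < word.length(); i++) n = n.childOrCreate(word.charAt(i));
        n.valid = true;
    }

    boolean contains(String word) {
        Node n = root;
        for (int i = 0; i < word.length() && n != null; i++) n = n.child(word.charAt(i));
        return n != null && n.valid;
    }
}
```

A leaf here costs two empty arrays rather than 256 null slots, at the price of a binary search per step and an array copy per new child.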

Question 7

Instead of a simple file, you can use a database such as SQLite and a nested set or Celko tree to store the trie. You can also build a faster and smaller trie (fewer nodes) with a ternary search trie.
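
For reference, a ternary search trie can be sketched in a few lines of Java. This is a generic textbook-style version, not tied to any particular library; the class and field names are my own, and it simply ignores the empty string. Each node holds one character and three references (smaller, equal, larger) instead of one slot per possible letter.

```java
class TernarySearchTrie {
    private static final class Node {
        final char c;
        boolean valid;     // true if a word ends at this node
        Node lo, eq, hi;   // smaller char, next char of the word, larger char
        Node(char c) { this.c = c; }
    }

    private Node root;

    void insert(String word) {
        if (word.isEmpty()) return;   // this sketch does not store the empty string
        root = insert(root, word, 0);
    }

    private Node insert(Node n, String w, int i) {
        char c = w.charAt(i);
        if (n == null) n = new Node(c);
        if (c < n.c) {
            n.lo = insert(n.lo, w, i);
        } else if (c > n.c) {
            n.hi = insert(n.hi, w, i);
        } else if (i < w.length() - 1) {
            n.eq = insert(n.eq, w, i + 1);
        } else {
            n.valid = true;
        }
        return n;
    }

    boolean contains(String word) {
        if (word.isEmpty()) return false;
        Node n = root;
        int i = 0;
        while (n != null) {
            char c = word.charAt(i);
            if (c < n.c) {
                n = n.lo;
            } else if (c > n.c) {
                n = n.hi;
            } else if (i < word.length() - 1) {
                n = n.eq;
                i++;
            } else {
                return n.valid;
            }
        }
        return false;
    }
}
```

Inserting a sorted word list produces long one-sided chains, so shuffling the words before building usually gives a better balanced structure.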

Question 8

I don't like the idea of addressing nodes by array index, but only because it costs one more addition (index to pointer). On the other hand, with an array of preallocated nodes you may save some time on allocation and initialization. You can also save a lot of space by reserving the first 26 indices for leaf nodes, so you won't need to allocate and initialize 180000 leaf nodes.

Also, with indices you will be able to read the prepared node array from disk in binary format, which should be several times faster. I'm not sure how to do this in your language, though. Is this Java?

If you have checked that your source vocabulary is sorted, you can also save some time by comparing a prefix of the current string with the previous one, e.g. the first 4 characters. If they are equal, you can start your

for(int level=0 ; level < key.length() ; level++) {

loop from the 5-th level.
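
A slightly more general version of this trick, sketched in Java under my own assumptions: remember the nodes along the previous key's path, compute how many leading characters the new key shares with it (rather than a fixed 4), and resume the loop from that level. The class SortedTrieBuilder, its Node type and the path array are hypothetical names, and the alphabet is assumed to be A-Z.

```java
import java.util.Arrays;

class SortedTrieBuilder {
    static final class Node {
        boolean valid;
        final Node[] children = new Node[26];   // assumption: letters 'A'..'Z' only
    }

    final Node root = new Node();
    // path[d] is the node reached after the first d characters of the previous key
    private Node[] path = new Node[64];
    private String previous = "";

    SortedTrieBuilder() {
        path[0] = root;
    }

    void add(String key) {
        // Count the leading characters shared with the previous key.
        int common = 0;
        int limit = Math.min(key.length(), previous.length());
        while (common < limit && key.charAt(common) == previous.charAt(common)) common++;

        if (key.length() + 1 > path.length) {
            path = Arrays.copyOf(path, key.length() * 2 + 1);
        }

        // Skip the shared levels and continue from the remembered node.
        Node node = path[common];
        for (int level = common; level < key.length(); level++) {
            int c = key.charAt(level) - 'A';
            if (node.children[c] == null) node.children[c] = new Node();
            node = node.children[c];
            path[level + 1] = node;
        }
        node.valid = true;
        previous = key;
    }

    boolean contains(String word) {
        Node n = root;
        for (int i = 0; i < word.length() && n != null; i++) {
            n = n.children[word.charAt(i) - 'A'];
        }
        return n != null && n.valid;
    }
}
```

The result stays correct even if the input is not sorted; sorted input just makes the shared prefixes longer, so more levels are skipped.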

Question 9

Is it space-inefficient or time-inefficient? If you are rolling a plain trie, then space may be part of the problem when dealing with a mobile device. Check out Patricia/radix tries, especially if you are using it as a prefix look-up tool.

Trie: http://en.wikipedia.org/wiki/Trie

Patricia/Radix trie: http://en.wikipedia.org/wiki/Radix_tree

You didn't mention a language but here are two implementations of prefix tries in Java.

Regular trie: http://github.com/phishman3579/java-algorithms-implementation/blob/master/src/com/jwetherell/algorithms/data_structures/Trie.java

Patricia/Radix (space-efficient) trie: http://github.com/phishman3579/java-algorithms-implementation/blob/master/src/com/jwetherell/algorithms/data_structures/PatriciaTrie.java

Question 10

Generally speaking, avoid creating lots of objects from scratch in Java: it is slow and it carries a massive memory overhead. Better to implement your own pooling class for memory management that allocates, e.g., half a million entries in one go.

Also, serialization is too slow for large lexicons. Use a binary read to quickly populate the array-based representations proposed above.