DAWG = Directed Acyclic Word Graphs

What is a DAWG

First, we'll define DAWG (skip if you know already) and cover the specifics of tesseract below.

=========== this definition by Mark & Ceal Wutka, link below ============

A Directed Acyclic Word Graph, or DAWG, is a data structure that permits extremely fast word searches. The entry point into the graph represents the starting letter in the search. Each node represents a letter, and you can travel from the node to two other nodes, depending on whether you the letter matches the one you are searching for.

It's a Directed graph because you can only move in a specific direction between two nodes. In other words, you can move from A to B, but you can't move from B to A. It's Acyclic because there are no cycles. You cannot have a path from A to B to C and then back to A. The link back to A would create a cycle, and probably an endless loop in your search program.

The description is a little confusing without an example, so imagine we have a DAWG containing the words CAT, CAN, DO, and DOG. The graph woud look like this:

     C --Child--> A --Child--> N (EOW)
     |                         |
     |                       Next
   Next                        |
     |                         v
     |                         T (EOW)
     v
     D--Child--> O (EOW) --Child --> G (EOW)

Now, imagine that we want to see if CAT is in the DAWG. We start at the entry point (the C) in this case. Since C is also the letter we are looking for, we go to the child node of C. Now we are looking for the next letter in CAT, which is A. Again, the node we are on (A) has the letter we are looking for, so we again pick the child node which is now N. Since we are looking for T and the current node is not T, we take the Next node instead of the child. The Next node of N is T. T has the letter we want. Now, since we have processed all the letters in the word we are searching for, we need to make sure that the current node has an End-of-word flag (EOW) which it does, so CAT is stored in the graph.

One of the tricks with making a DAWG is trimming it down so that words with common endings all end at the same node. For example, suppose we want to store DOG and LOG in a DAWG. The ideal would be something like this:

   D --Child--> O --Child--> G(EOW)
   |            ^
  Next          |
   |            |
   v            |
   L --Child----

In other words, the OG in DOG and LOG is defined by the same pair of nodes.

=========== Creating a DAWG ============

[...] The idea is to first create a tree, where a leaf would represent the end of a word and there can be multiple leaves that are identical. For example, DOG and LOG would be stored like this:

  D --Child--> O --Child--> G (EOW)
  |
 Next
  |
  v
  L --Child-> O --Child--> G (EOW)

Now, suppose you want to add DOGMA to the tree. You'd proceed as if you were doing a search. Once you get to G, you find it has no children, so you add a child M, and then add a child A to the M, making the graph look like:

  D --Child--> O --Child--> G (EOW) --Child--> M --Child--> A (EOW)
  |
 Next
  |
  v
  L --Child-> O --Child--> G (EOW)

As you can see, by adding nodes to the tree this way, you share common beginnings, but the endings are still separated. To shrink the size of the DAWG, you need to find common endings and combine them. To do this, you start at the leaf nodes (the nodes that have no children). If two leaf nodes are identical, you combine them, moving all the references from one node to the other. For two nodes to be identical, they not only must have the same letter, but if they have Next nodes, the Next nodes must also be identical (if they have child nodes, the child nodes must also be identical).

Take the following tree of CITIES, CITY, PITIES and PITY:

 C --Child--> I --Child--> T --Child--> I --Child--> E --Child--> S (EOW)
 |                                      |
 |                                     Next
Next                                    |
 |                                      v
 |                                      Y (EOW)
 P --Child--> I --Child--> T --Child--> I --Child--> E --Child--> S (EOW)
                                        |
                                       Next
                                        |
                                        v
                                        Y (EOW)

Continue reading this explanation at:

See also:

What does Tesseract use DAWG for?

Tesseract uses the Directed Acyclic Word Graphs to very compactly store and efficiently search several list(s) of words. There are four DAWGs in tesseract: (right?)

Disclosure: I don't know the order of preference - which DAWG does tesseract check first AND which DAWG over-rides the others. ex. "thls" is not in #1 but, say, is in #4 - will tesseract NOT jiggle the 'l' into an 'i' (which then matches in #1) or will it go with #4? Ray?

Let's say that tesseract thinks it found a word with four letters, "thls". Before this word is output, tesseract will:

So, the answer to "Why does tesseract bother with DAWGs" is that when a typical English word has one or two letters that have permutations possible, WITHOUT using the compact and fast DAWG's this lookup task would quickly become a huge bottle-neck.

=========== DAWG-related ToDo's ============

Todo:
Need to add info here on:

Generated on Wed Feb 28 19:49:30 2007 for Tesseract by  doxygen 1.5.1