Glossary of OCR terms (as used in Tesseract) V0.04


Please add terms and definitions of OCR terms/concepts as they are used in Tesseract, and provide links to places in the source code where they are defined/manipulated. Comparisons with other GNU OCR packages help the reader triangulate their understanding, so they are encouraged.

Please post all comments on tesseract's forums and be sure to specify the version of the docs: V0.04!



See O0



Floating point coordinates, X & Y. See fpoint.h

See xform2d.cpp



Adapt or Adaptive



An Application Programming Interface (API) is a formalized and well-documented way of interfacing with tesseract. Also, patches have been posted which turn the 'stand-alone' tesseract into a static version that can be called without having to re-load all the stuff in 'tessdata/'.





Usually the vertical position/elevation of the line in question with reference to some coordinate system's (0,0) position. Since tesseract normalizes characters to themselves, this is not as critical as in other OCR applications.


Isolated, small region of the scanned image, delineated by its outline. Tesseract 'juggles' the blobs to see if they can be split further into something that improves the confidence of recognition. Sometimes, blobs are 'combined' if that gives a better result. See pithsync.cpp, for example.


[Place holder]

Block occupancy

FIX: See blkocc.cpp

Bounding Box/BB

Two pairs of coordinates which describe the opposing corners of a rectangle. In tesseract, a Bounding Box (BB) delineates the extent of each character in a blob, and a bunch of other things, which need to be "FIX:"ed.
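As a quick illustration of the definition above, here is a minimal Python sketch (the function name is ours, not tesseract's) that derives a bounding box from a list of blob points:

```python
def bounding_box(points):
    """Return ((min_x, min_y), (max_x, max_y)): the opposing corners of
    the smallest rectangle enclosing all the given (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys)), (max(xs), max(ys))
```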





[Place holder]


Short for Classification, usually in the sense that an outline (or several outlines) had its features extracted and these features matched a particular built-in template, i.e., tesseract is pretty sure it has found that particular character.


FIX: Associate the outlines and classify them, see associate.cpp


See Certainty


(From Ray Smith) Configuration, in the context of the classifier, not to be confused with config files, is a combination of features that make up a prototype. [..] [T]he training data as it stands today contains up to 32 different configs for each class, which loosely represent the 32 different fonts used for the training data. When making a full match in the classifier, it finds the best-matched configuration of features. It is basically a way of allowing some features to be present some of the time without assuming statistical independence between them.


Often, all tesseract has to go on when ambiguities are encountered is the context where the bad characters were found. Context is used a lot, not just for letters but also for numbers. Context is best used with a dictionary. See context.cpp


Pair of integer or floating values that describe a point in 2D space in whatever coordinate system has been established.


[Place holder]



DAWG is short for 'Directed Acyclic Word Graph'. A good explanation with examples can be found in dawg.h, but the actual code is in dawg.cpp.

In short, a DAWG is a very compact way of storing data that has a great deal of repetition and which, unlike more familiar algorithms for compression such as ZIP/GZIP, permits direct and fast scanning.

Tesseract uses two DAWGs, one to store the built-in list of words (dictionary) and another to store the user's list of words, when checking the various combinations of letters in werds it has recognized, in its attempt to improve the accuracy.
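A real DAWG additionally merges common suffixes; as a rough sketch of the "direct and fast scanning" idea, here is its uncompressed precursor, a trie, in Python (names are ours, not tesseract's):

```python
class Node:
    """One state in the word graph."""
    def __init__(self):
        self.children = {}   # letter -> Node
        self.is_word = False # does a word end here?

def build_trie(words):
    root = Node()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, Node())
        node.is_word = True
    return root

def contains(root, word):
    """Lookup walks one edge per letter - no decompression step needed."""
    node = root
    for ch in word:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return node.is_word
```

Shared prefixes ("cat"/"cats"/"car") are stored once; a DAWG would also share the endings.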

In version 1.02, two dictionaries can be found in the 'tessdata' directory:


Multiply the unit-less measure by whatever normalizing factor was used originally.


Direction, and especially the eight (8) compass directions:
          NW  N  NE
            \   /
          W   +   E
            /   \
          SW  S  SE
are very special in tesseract because 'direction' is one of the 'units' of information that describes each and every feature (See Feature). Directions of an outline are computed in FindDirectionChanges().
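The compass idea above can be sketched as a chain-code step: map each move along the outline to one of the eight directions. This is a Python illustration with made-up names, assuming y grows northward (it is not tesseract's actual implementation):

```python
# (dx, dy) step -> compass direction, with y increasing toward North.
DIRS = {(0, 1): 'N', (1, 1): 'NE', (1, 0): 'E', (1, -1): 'SE',
        (0, -1): 'S', (-1, -1): 'SW', (-1, 0): 'W', (-1, 1): 'NW'}

def step_direction(p, q):
    """Direction of the unit step from point p to adjacent point q."""
    dx = (q[0] > p[0]) - (q[0] < p[0])   # sign of x movement
    dy = (q[1] > p[1]) - (q[1] < p[1])   # sign of y movement
    return DIRS[(dx, dy)]
```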



See Edge detection.

Edge detection

(Comes from my notes on line_edges() which iterates a scanline for edges and updates prevline - edges in progress. When edges close into loops, they're sent for approximation)

"Edge detection" finds changes (flips in the [binary] pixels) in each row. When these are "stacked" (by referencing the previous line(s)), cracks can be extracted. If/when these cracks "join" at some point, after some lines, they are approximated into outlines. Then, [a lot happens you never see] and these get eventually fed to the matcher...

I thought about "making up" an example but it was actually faster and conducive to my hacking [1] to trace the real thing.

("ccmain/tesseract testing/edges/FOX.tif fox" with right options :-)

But first, we need to be clear on one concept handy in edge-detection.

line_edges() is actually doing something that can be conceptualized as XOR/derivative - it's really after ONLY change. You should be able to understand exactly what I'm saying when you see what happens when the 'O' is filled in with solid black color using gimp (actual result, not simulated):


line_edges() still managed to find JUST the edges - the "unchanging" fill just disappeared. (For the observant, there's only 'one' edge while the previous "O" had two; it's not recognized as an "O" while the above WAS...)

Since even the tiny image of "FOX" is a bit big, let's just focus on the leading corner of the 'F' - we'll explain the meaning of the letters (and numbers) as we go along. By the way, if you're only interested in a summary, skip ahead to 'big picture'.


The first line starts with an 'e' because line_edges() was called. The subsequent '='s indicate that the previous line had no cracks and the color at each x position has not changed from the previous (x-1) and has not changed from the same x on the previous line (y-1) - line_edges() is bored.

On the second row, the 'D' indicates that the previous line is empty but that the current x's color changed from that of the previous pixel (x-1). 'D' also means that v_edge() got called. To better see what v_edge() and h_edge() are up to, let's quickly rebuild with -DTV_FOCUSC and -DTV_FOCUSD.

Our image becomes notably 'noisy':

Legend: a-h (e is reserved) are from h_edge() & j-p are from v_edge() ("ccmain/tesseract testing/edges/edges_5E.tif out")

The first 'D' means that the previous line is empty but that the current x's color changed from that of the previous pixel (x-1) and is output after v_edge() is called: started a new vertical crack ('k'); the sign [2] <= 0 ('m'); and there's nothing to join ('n'). (Incidentally, look on the 3rd row directly under the 'n' - there's a 'p' in its place meaning that on THAT row there WAS an edge in progress... the one that just now was started.)

After the 'D', the 'E' means that color at 'x' changed from that at the same 'x' on previous line (y-1) - is output after h_edge() is called: started a new crack ('b'); the sign > 0 ('c'); and that an edge (here started by 'n') is continuing ('g'). (No 'j' printed because free_cracks is shared by v_edge & h_edge and that job was done by 'n').

Rest of the row is boring until the last 'D' which means the same thing it did before - the same x on the previous line (y-1) was different. What's key is that v_edge() now decides to: start a new vertical crack ('k'); the sign > 0; and there was a [fake] edge to join to (we've been tracing the horizontal edge of something and it has come to an end, so that 'end' is the vertical edge).

Of course, line_edges() keeps as many edges in progress as necessary. Three vertical lines (two on the sides and one in the middle) will look like this:

("ccmain/tesseract testing/edges/three_vert.tif out")

===== BIG PICTURE =====

line_edges() works scanline by scanline looking for CHANGES in color. When it finds these changes, it creates and tracks both vertical and horizontal edges:

e.g., a horizontal, one-pixel-wide line has BOTH: at the start there's a vertical edge, during the run of its length there are two horizontal edges, and at the end there's a final vertical edge.

In each successive scanline, edges are detected and cracks followed until TWO cracks meet [3] (forming a loop). Then, this list of points - now called an "outline" - is sent in for "cleaning and approximation" by complete_edge() and the list of edges is freed.
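The core of the scanline step - looking ONLY for changes, as in the XOR/derivative remark above - can be sketched in a few lines of Python (this is an illustration, not line_edges() itself; the white margin assumption is per note [3]):

```python
def row_transitions(row):
    """Given one scanline of binary pixels (0 = white, 1 = black),
    return the indices where the color flips from the previous pixel.
    An implicit white margin is assumed on both sides, so the count
    of transitions per row is always even - every crack that opens
    is guaranteed to close."""
    padded = [0] + list(row) + [0]        # forced-white margins
    return [x - 1 for x in range(1, len(padded))
            if padded[x] != padded[x - 1]]
```

An index equal to len(row) marks the trailing flip back into the white margin.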

Add reference to entry to read after Edge Detection

while callpicofeat() is called, it's from a somewhat unexpected place

above really needs a BS-check, courtesy Ray Smith.
For now, you can see Objects or just keep scrolling.

========= Q&A =================

Some of these questions (which were for myself) need answers and/or BS-check by someone more knowledgeable.
========= Notes: ===============

[1] My hacking revolves around removing/suppressing noise in FAXes which is manifest as continuous, 'vertical' streaks throughout the image at one or more x locations. This noise is disastrous to recognition and also, alas, chews a lot of CPU. I've had some success in eliminating the streaks by adding an additional state machine (very similar to line_edges()) that looks for these streaks. When these are confirmed (i.e., start below the header, last through the whole image & satisfy some other thresholds), they can be dealt with.

These streaks are much simpler to deal with than, for example, scans from ruled/lined paper because they are EXACTLY PARALLEL to the Y axis (they are caused by white-out/crud on the glass inside the FAX machine and so are fixed in location). And, all the pages FAXed on such a machine have streaks in EXACTLY the same x positions, so that only the first page requires careful analysis.

[2] sign is just a way of keeping track of what the color changed FROM. The sign is > 0 on a transition from edge to background (black-->white) and < 0 vice versa (white-->black).

need to verify exact relationship between color transition & sign
[3] the margins of the image are forced to be white. This, I think, guarantees that all edges DO end (and begin, if you think about it, since there is a known starting state for line_edges()).
there have been reports to the contrary on the forums on this exact issue - need to check why


Stands for the letter 'm'. Tesseract uses special heuristics when it thinks it sees an "m" because it can also be "r" followed by "n"! Other characters that have special heuristics are "I" (upper-case i), "1" (one), and "l" (lower-case L) - can YOU tell which is which in "lI1"?

The pre-release of tesseract used to have the file "DangAmbigs" which contained the following: "m rn", "rn m", "m in", "in m", "d cl", "cl d", "nn rm", "rm nn", "n ri", "ri n", "li h", "ii u", "ii n", "ni m", "iii m", "ll H", "I-I H", "vv w", "VV W", "t f", "f t", "a o", "o a", "e c", and "c e"

To my knowledge, only 'I1l' and 'm' receive special treatment in tesseract because the Context and DAWG (see the respective section) are used to increase the accuracy.
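The DangAmbigs idea above - each entry naming a letter group and what it can be confused with - is easy to sketch. Here is a Python illustration using a few of the pairs quoted above (the function and its name are ours, not tesseract's):

```python
# A few DangAmbigs pairs from above, as "seen text" -> "possible actual text".
AMBIGS = {"rn": "m", "m": "rn", "cl": "d", "vv": "w"}

def ambiguity_variants(word, ambigs=AMBIGS):
    """Generate alternative spellings by applying one substitution at a
    time; a dictionary check would then pick the plausible ones."""
    variants = set()
    for seen, actual in ambigs.items():
        start = word.find(seen)
        while start != -1:
            variants.add(word[:start] + actual + word[start + len(seen):])
            start = word.find(seen, start + 1)
    return variants
```

E.g., a noisy "corn" yields the variant "com", which context or the dictionary must then arbitrate.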



As previously mentioned, the features and their extraction are covered by Patent 5,237,627. (The names of the inventors will be familiar if you have looked at ExtractIntFeat() :-)

There are several types of 'features' & it is best to keep them distinct:

How the heck does MySqrt2() work? What's the multiplication by '41943' for?
A 'proto-feature' is composed of three 'units' of information: So, a bunch of individual features describe one prototype and there is one prototype for each ASCII character that can be recognized by tesseract.

Let's compare gocr's definition of a feature with tesseract's. Here's what gocr (v0.39) gives in its own docs (BTW, I'm not "picking on" gocr - it just happens to have clear documentation on what & how it approaches the same problem).

vvvv         vv- white regions
......@@......  <- crossing one line
....@..@@@....  <- white hole / crossing two lines
....@..@@@....  <- crossing two lines
..@@@@@@@@@@..  <- horizontal line near center
.@........@@@.  v- increasing width of pattern
.@........@@@.  v
.@........@@@.  v
    ^^^-- gap
"In the future, the program should detect edges, vertices, gaps, angles
and so on. This is called feature extraction (as far as I know)."

Other OCR programs, like gocr, attempt to extract high-level features of each character and classify the character based on those features. If the characters on the page are of high quality, such as an original typed page, these simple methods work well for converting the characters.

However, as document quality degrades, such as through multiple generations of photocopies, carbon copies, facsimile transmission, or in other ways, the characters on a page become distorted causing simple processing methods to make errors.

For example, a dark photocopy may join two characters together (like the "r" and "i", below), causing difficulty in separating these characters for the OCR processing. Joined characters can easily cause the process that segments characters to fail, since any method which depends on a "gap" between characters cannot easily distinguish characters that are joined.


Dark image joins adjacent letters

Light photocopies produce the opposite effect. Characters can become broken, and appear as two characters, such as the character "u" being broken in the bottom middle to create two characters (below), each of which may look like the "i" character. Also, characters such as the letter "e" may have a segment broken to cause them to resemble the character "c".


Light image breaks letters


Light image breaks closures

Programs like gocr begin by extracting higher level features from each character image in an attempt to select a set of features which would be insensitive to unimportant differences, such as size, skew, presence of serifs, etc., while still being sensitive to the important differences that distinguish between different types of characters.

High level features, however, can be very sensitive to certain forms of character distortion, as demonstrated above. That is, a simple break in a character can easily cause a closure to DISAPPEAR, and the feature extractor method that depends on such closures would probably classify the character incorrectly.

Often the high level feature representation of a character contains few such features. Therefore, when a single feature is destroyed, the classification (differentiation from other characters) is negatively impacted.

Tesseract takes a completely different approach, albeit a much more complicated one. (This is a high-level description, as per the patent - a LOT happens inside tesseract before your image gets to the stage where features are recognized.)

Like tesseract, we'll start at the highest/max y location in the following sample. In these examples, we'll only consider the OUTER outline (see note).

Note: due to the way in which edge-detection works, characters that are open (have no closures) like C, T, V, etc. have only ONE outline (the outer one). On the other hand, characters like O, P, A, have one closure and characters like B actually have two. See "Edge Detection" for the gory details.

[Figure: sample character outline with numbered points; 0 degrees points due East]

Looking at the above picture, start at "1" and go Counter-Clock Wise (CCW):

After you arrive back at point "1", you should have a table of directions. This table is what tesseract stores for every character and (probably) for every font. Now, ponder this question: if you were to ignore 5 random lines in this table, would you still be able to recognize this particular character? The answer is "probably" because sufficient features remain.
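The "ignore 5 random lines" thought experiment can be made concrete with a toy Python scorer (an illustration of the robustness argument, not tesseract's actual matcher; the direction table is made up):

```python
from collections import Counter

def match_score(stored, observed):
    """Fraction of the stored direction features also found in the
    observed set, ignoring order - losing a few features only lowers
    the score slightly instead of breaking the match outright."""
    stored_c, observed_c = Counter(stored), Counter(observed)
    hit = sum(min(n, observed_c[d]) for d, n in stored_c.items())
    return hit / len(stored)

# Hypothetical stored direction table for some character:
template = ['N'] * 5 + ['W'] * 3 + ['S'] * 5 + ['E'] * 3
noisy = template[:-3]   # three features destroyed by noise
```

Even with three of sixteen features gone, the noisy trace still scores above 0.8 against the template.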

That is precisely why tesseract is insensitive to character segmentation boundaries. These features also have a low enough level to be insensitive to common noise distortions and sufficient number of features are created that some will remain to allow character classification even if others are destroyed by noise. Finally, these features are a lot less sensitive to font variations than many other matching techniques.

With that said, there are several problems. First, tesseract needs to be very carefully trained before any additional letter can be recognized. This is why languages other than English are currently not supported (as of version 1.02). Second, these trained features must be stored somewhere. This means that unless they are somehow incorporated into the executable (which has its own problems), they must be stored in a file. In version 1.02 that file is in tessdata directory and is called "inttemp" (a bit over 600KB). Third,

I'm not done with this section!


FIX: A character can be made of several blobs but, initially, features are extracted from single blobs. Up to MAX_NUM_INT_FEATURES (defined as 512 in classify/intproto.h) can be extracted from a character.


FIX: The pre-trained templates (See Prototype) are sufficiently 'smeared' (by SmearBulges() in classify/mfx.cpp) to comfortably match different fonts representing any particular character.

FIX: this is not finished. But it's better than it was before :-)


Stands for Feature Extraction, see Feature.



Space between blobs, not always a 'space', could be between text and table boundary, for example. See Gap map and gap_map.cpp

Gap Map

A block gap map is a quantised histogram of whitespace regions in the block. It is a vertical projection of wide gaps WITHIN lines.

The map is held as an array of counts of rows which have a wide gap covering that region of the row. Each bucket in the map represents a width of about half an xheight - (The median of the xhts in the rows is used.)

The block is considered RECTANGULAR - delimited by the left and right extremes of the rows in the block. However, ONLY wide gaps WITHIN a row are counted. (definition moved from gap_map.cpp)
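A Python sketch of the description above (names and the row representation are our assumptions, not gap_map.cpp's internals):

```python
def gap_map(rows, xheight):
    """Quantised histogram: for each bucket about half an xheight wide,
    count how many rows have a wide gap covering that x region.
    rows: per-row lists of (left, right) wide-gap extents, in pixels."""
    bucket_w = max(1, xheight // 2)
    counts = {}
    for gaps in rows:
        seen = set()                      # count each row at most once per bucket
        for left, right in gaps:
            for b in range(left // bucket_w, right // bucket_w + 1):
                seen.add(b)
        for b in seen:
            counts[b] = counts.get(b, 0) + 1
    return counts
```

A tall count in one bucket suggests a vertical run of whitespace, e.g. a table column boundary.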


Hash Table

A hash table, or a hash map, is a data structure that associates keys with values. The primary operation it supports efficiently is a lookup: given a key (e.g., a person's name), find the corresponding value (e.g., that person's telephone number). It works by transforming the key using a hash function into a hash - a number that is used to index into an array to locate the desired location ("bucket") where the values should be. You use a hash table every time you look up a phone number in the phone book. See closed.cpp, bestfirst.cpp, hashfn.cpp, and memblk.cpp
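For readers new to the structure, here is a minimal separate-chaining hash table in Python (purely illustrative; it is unrelated to hashfn.cpp's implementation):

```python
class HashTable:
    """Minimal hash table with separate chaining for collisions."""
    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, key):
        # The hash function maps the key to one of the fixed buckets.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key exists: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key: chain it

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default
```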


FIX: Definition is easy, how it's used is a different story.


In photography, a histogram is a graph counting how many pixels are at each level between black and white. Black is on the left. White is on the right. The height of the graph at each point depends on how many pixels are that bright. Lighter images move the graph to the right. Darker ones move it to the left. In tesseract, a histogram is used to convert a RANGE of certainties into a table of 'quantized' buckets/units of certainty. See gap_map.cpp, tospace.cpp, cluster.cpp, etc.
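The "range into quantized buckets" step looks roughly like this Python sketch (an illustration of quantization in general, not a transcription of any tesseract routine):

```python
def quantize(values, nbuckets, lo, hi):
    """Histogram a range of certainties into a fixed number of buckets."""
    width = (hi - lo) / nbuckets
    counts = [0] * nbuckets
    for v in values:
        # Clamp so v == hi lands in the last bucket instead of overflowing.
        b = min(int((v - lo) / width), nbuckets - 1)
        counts[b] += 1
    return counts
```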

How Tesseract Works

See the page on How Tesseract Works and the page What do all those letters for TEXT_VERBOSE mean?



Three letters that give tess fits enough they get their own set of heuristics!

In fact, if you look at the .raw file from the testing/image_courR18.tif, you will see that ALL 'i's come out as '1's while this does not happen with testing/image_2helvR18.tif (.raw files are not output by v1.03)

Does this mean that the feature extractor should have a set of templates for different font families? I know the dictionary and word-heuristics certainly help (after all the .txt file is almost right) but this seems to indicate a serious shortcoming.

As an example of my comment above, 'Bristol' is not in the dictionary, so the '1' in .raw gets turned into an 'l' by the word-heuristics, not the 'i' that it should have been. I.e., there is only so much the rules can do, and it's always better to improve the .raw output.


[Place holder]


[Place holder]

K-D search trees

FIX: Definition is easy, how it's used is a different story.


Linear Least Squares Fit

FIX: Definition is easy, how it's used is a different story.



FIX: Definition is easy, how it's used is a different story.







Neural Networks

Tesseract used to employ a neural network called Aspirin/Migraines (which has been removed due to licensing issues) which allowed training. How did this removal affect tesseract's accuracy? Ray Smith said, "When I took the NN code out, I measured the increase in error rate to be slightly less than 1% (relative). The benefit was a 10-15% improvement in speed. I have some accuracy/bug fixes coming in the next release (i.e., v1.03) that more than compensate by reducing the error rate by more than 3%. Of course any error rate changes are dependent on the test set and are not necessarily reproducible on a different test set..."

N-D space

[Place holder]



Usually, the only way tess can tell apart a zero from an upper-case 'o' is from the context.

If you look at the .raw file from the testing/image_courR18.tif, you will see that ALL 'o's come out as '0's while this does not happen with testing/image_2helvR18.tif. See the entry for 'I1l' for more info.


Hierarchical & abstracted objects/names for things being worked on and derived within tesseract; starting with the physical page and ending with words written to txt file.

It would be most helpful if Ray Smith could look this list over and point out dumb mistakes. Our deeper understanding of what's going on depends on this.

list should describe whether each object is temporary (i.e., bucket); can be de/serialized (?block?); exists at the same time but for different purposes (outlines & blobs) and what those different purposes are; is derived from the user's image or from developer/pre-training; serves a pivotal/conceptually important derivation; etc.
(from the physical page to words written to txt file) (haven't figured out these well enough to say anything :-)

Outline feature

See Feature, outfeat.cpp



When the letters that make up a word are initially recognized, there may be several possibilities for each position; each one with decreasing certainty. Put another way, the initial word is made up of letters sorted by certainty.

However, sometimes the letter with the highest certainty may not be the correct one, which certain heuristics can catch (e.g., a word generally doesn't have a number in the middle of it, so the '1' (one) is changed to an 'l' (L), etc.)

After these heuristics are applied, tesseract sends the word to the permuter, which is a function that exchanges less-certain letters in each position and then checks whether that PARTICULAR combination of letters matches directly in the DAWG/dictionary. Every (??) word is permuted in this manner in the hope that at least one of these directly matches the dictionary.

See entry for DAWG and permute.cpp
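The exchange-and-check loop described above can be sketched in Python (an illustration of the idea only - permute.cpp is far more involved, and the fallback rule here is our assumption):

```python
from itertools import product

def permute(candidates, dictionary):
    """candidates: one list of letter choices per position, best first.
    Try combinations of less-certain letters until one matches the
    dictionary; otherwise fall back to the best-rated letters."""
    for letters in product(*candidates):
        word = "".join(letters)
        if word in dictionary:
            return word
    return "".join(c[0] for c in candidates)
```

E.g., with per-position candidates [["h"], ["e"], ["1", "l"], ["l"], ["o"]] and a dictionary containing "hello", the permuter rejects "he1lo" and settles on "hello".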


FIX: See picofeat.cpp


Depends on font type, how 'close together' the letters are. See topitch.cpp. Similar to kerning space, see tospace.cpp ?


FIX: Definition is easy, how it's used is a different story.


FIX: Definition is easy, how it's used is a different story.


A prototype is a model for each letter which allows tesseract to match up that model with any particular blob and 'recognize' that character.

The included file 'tessdata/inttemp' contains built-in/pre-trained prototypes for all ASCII characters for a dozen or more different fonts.

Ray Smith has released the code (training/training.cpp) used to generate this (tessdata/inttemp) file but we do not yet know how to a) make the training image(s), b) do the adapting, c) work the training code.


(From Ray Smith) The class pruner is a pre-classifier that is used to create a short-list of classification candidates (pruning the possible classes) so that the full distance metric can be calculated on the short-list without taking excessive time, instead of exhaustively matching against each character possibility. The class pruner uses a faster, but approximate method of matching the features, so while it does make mistakes, the mistakes are rare. [With that said,] make_config_pruner() in adaptmatch.cpp is never [presently] called, so it is irrelevant.
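The shape of the pruning step - cheap approximate scoring first, expensive full match only on the survivors - can be sketched as follows (a Python illustration with invented names and a naive overlap score, not the pruner's actual metric):

```python
def prune_classes(features, class_tables, shortlist=3):
    """Score every class with a cheap, approximate feature overlap and
    return only the top few; the slow, full distance metric would then
    run on this short-list instead of on every character class."""
    scores = {c: len(set(features) & set(t))
              for c, t in class_tables.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:shortlist]
```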



FIX: Definition is easy, how it's used is a different story.

?? Used in approximating the outline - converting a 'shape' from a list/series of coordinate-pairs (in one coordinate system) into a 'formula' (in any coordinate system). ?? A, B, C are coefficients of the quadratic ?? Making each feature a quadratic, of sorts ??


Raw output

It is easy to get tesseract to output the 'raw' output in addition to the regular .txt file. HOW? Edit ccmain/output.cpp and change the FALSE following tessedit_write_raw_output to TRUE. Rebuild and tesseract will now ALSO generate a .raw file.

You might be interested in the raw file when fiddling with the feature extractor and accuracy of tesseract PRIOR to DAWG/dictionary.

Radius of gyration

Radius of gyration of the character shape about its center of mass, in both the X and Y directions, is used by tesseract to scale each character prior to recognition.

See ExtractIntFeat() in classify/intfx.cpp
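For reference, the quantity itself is simple to compute; here is a Python sketch over a character's black-pixel coordinates (an illustration of the definition, not ExtractIntFeat()'s code):

```python
import math

def radii_of_gyration(points):
    """Radius of gyration about the centre of mass, separately in X and Y:
    the RMS distance of the shape's pixels from its centroid."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    rx = math.sqrt(sum((x - cx) ** 2 for x, _ in points) / n)
    ry = math.sqrt(sum((y - cy) ** 2 for _, y in points) / n)
    return rx, ry
```

Dividing a character's coordinates by these radii yields the scale normalization mentioned above.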






Reject map



Horizontal series of blobs split off from those above & below by seams. See makerow.cpp



Tesseract plays with seams until the recognition confidence improves, see findseam.cpp
Q: What's the difference between seam and edge (A: Edges are pre-outline while seams are post-row)?


Noise on the scanned page: dust, dirt, etc. Can wreak havoc on the blob classification routines. Internally done by speckle.cpp. Note that in tesseract the dot above the 'i' is detected IMMEDIATELY PRIOR to noise removal.
Put "i" dot ref and note here.
External option is to run pbmclean, which 'flips' isolated pixels.






List of type SPLIT (split.h)


Proprietary HP graphics sub-system. Remember the alphabet soup of graphics 'standards' in the late 1980's (MDA/CGA/EGA/VGA/etc)? Ugh. I'm sure this worked a lot better.



Stopping criteria

Set of rules that finalize the results from the word classifier, see stopper.cpp







Text Ordering, see tordvars.cpp for variables and tordmain.cpp for code.



Underlines, like 'I1l' and 'm', give tesseract fits. Underlines need to be chopped from the letter-blobs (based on baseline?) See underlin.cpp







Wise Owl

Tesseract's old classification subsystem, see choices.cpp


Blobs are chopped into words, see wordseg.cpp




WARNING for including links in glossary

Do not be tempted to use URL links LOCAL on YOUR machine. Only give external resources. Internal sources can be referenced with filename.cpp or FunctionName() <- the parentheses and spacing are important.

Fix broken references:
282: Warning: unable to resolve reference to `callfeatextr' for command
284: Warning: unable to resolve reference to `callpicofeat' for command


Generated on Wed Feb 28 19:49:29 2007 for Tesseract by  doxygen 1.5.1