Glossary of OCR terms (as used in Tesseract) V0.04


Please add terms and definitions of OCR terms/concepts as they are used in Tesseract, and provide links to places in the source code where they are defined/manipulated. Comparisons with other GNU OCR packages help the reader triangulate their understanding, so they are encouraged.

Please post all comments on tesseract's forums and be sure to specify the version of the docs: V0.04!



See O0



Floating point coordinates, X & Y. See fpoint.h

See xform2d.cpp



Adapt or Adaptive



An Application Programming Interface (API) is a formalized and well-documented way of interfacing with tesseract. Also, patches have been posted which turn the 'stand-alone' tesseract into a static version that can be called without having to re-load all the stuff in 'tessdata/'.





Usually the vertical position/elevation of the line in question with reference to some coordinate system's (0,0) position. Since tesseract normalizes characters to themselves, this is not as critical as in other OCR applications.


Isolated, small region of the scanned image, delineated by its outline. Tesseract 'juggles' the blobs to see if they can be split further into something that improves the confidence of recognition. Sometimes, blobs are 'combined' if that gives a better result. See pithsync.cpp, for example.


[Place holder]

Block occupancy

FIX: See blkocc.cpp

Bounding Box/BB

Two pairs of coordinates which describe the opposing corners of a rectangle. In tesseract, a Bounding Box (BB) delineates the extent of each character in a blob, and a bunch of other things, which need to be "FIX:"ed.
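As a quick illustration of the definition above, here is a minimal Python sketch (the function name is ours, not tesseract's) that derives a bounding box from a list of blob points:

```python
def bounding_box(points):
    """Return ((min_x, min_y), (max_x, max_y)): the opposing corners of
    the smallest rectangle enclosing all the given (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys)), (max(xs), max(ys))
```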





[Place holder]


Short for Classification, usually in the sense that an outline (or several outlines) had its features extracted and these features matched a particular built-in template, i.e., tesseract is pretty sure it has found that particular character.


FIX: Associate the outlines and classify them, see associate.cpp


See Certainty


(From Ray Smith) Configuration, in the context of the classifier, not to be confused with config files, is a combination of features that make up a prototype. [..] [T]he training data as it stands today contains up to 32 different configs for each class, which loosely represent the 32 different fonts used for the training data. When making a full match in the classifier, it finds the best-matched configuration of features. It is basically a way of allowing some features to be present some of the time without assuming statistical independence between them.


Often, all tesseract has to go on when ambiguities are encountered is the context where the bad characters were found. Context is used a lot, not just for letters but also for numbers. Context is best used with a dictionary. See context.cpp


Pair of integer or floating values that describe a point in 2D space in whatever coordinate system has been established.


[Place holder]



DAWG is short for 'Directed Acyclic Word Graph'. A good explanation with examples can be found in dawg.h, but the actual code is in dawg.cpp.

In short, a DAWG is a very compact way of storing data that has a great deal of repetition and which, unlike more familiar algorithms for compression such as ZIP/GZIP, permits direct and fast scanning.

Tesseract uses two DAWGs, one to store the built-in list of words (dictionary) and another to store the user's list of words, when checking the various combinations of letters in werds it has recognized, in its attempt to improve the accuracy.
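A real DAWG additionally merges common suffixes; as a rough sketch of the "direct and fast scanning" idea, here is its uncompressed precursor, a trie, in Python (names are ours, not tesseract's):

```python
class Node:
    """One state in the word graph."""
    def __init__(self):
        self.children = {}   # letter -> Node
        self.is_word = False # does a word end here?

def build_trie(words):
    root = Node()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, Node())
        node.is_word = True
    return root

def contains(root, word):
    """Lookup walks one edge per letter - no decompression step needed."""
    node = root
    for ch in word:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return node.is_word
```

Shared prefixes ("cat"/"cats"/"car") are stored once; a DAWG would also share the endings.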

In version 1.02, two dictionaries can be found in the 'tessdata' directory:


Multiply the unit-less measure by whatever normalizing factor was used originally.


Direction, and especially the eight (8) compass directions:
          NW  N  NE
            \   /
          W   +   E
            /   \
          SW  S  SE
are very special in tesseract because 'direction' is one of the 'units' of information that describes each and every feature (See Feature). Directions of an outline are computed in FindDirectionChanges().
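The compass idea above can be sketched as a chain-code step: map each move along the outline to one of the eight directions. This is a Python illustration with made-up names, assuming y grows northward (it is not tesseract's actual implementation):

```python
# (dx, dy) step -> compass direction, with y increasing toward North.
DIRS = {(0, 1): 'N', (1, 1): 'NE', (1, 0): 'E', (1, -1): 'SE',
        (0, -1): 'S', (-1, -1): 'SW', (-1, 0): 'W', (-1, 1): 'NW'}

def step_direction(p, q):
    """Direction of the unit step from point p to adjacent point q."""
    dx = (q[0] > p[0]) - (q[0] < p[0])   # sign of x movement
    dy = (q[1] > p[1]) - (q[1] < p[1])   # sign of y movement
    return DIRS[(dx, dy)]
```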



See Edge detection.

Edge detection

(Comes from my notes on line_edges() which iterates a scanline for edges and updates prevline - edges in progress. When edges close into loops, they're sent for approximation)

"Edge detection" finds changes (flips in the [binary] pixels) in each row. When these are "stacked" (by referencing the previous line(s)), cracks can be extracted. If/when these cracks "join" at some point, after some lines, they are approximated into outlines. Then, [a lot happens you never see] and these get eventually fed to the matcher...

I thought about "making up" an example but it was actually faster and conducive to my hacking [1] to trace the real thing.

("ccmain/tesseract testing/edges/FOX.tif fox" with right options :-)

But first, we need to be clear on one concept handy in edge-detection.

line_edges() is actually doing something that can be conceptualized as XOR/derivative - it's really after ONLY change. You should be able to understand exactly what I'm saying when you see what happens when the 'O' is filled in with solid black color using gimp (actual result, not simulated):


line_edges() still managed to find JUST the edges - the "unchanging" fill just disappeared. (For the observant, there's only 'one' edge while the previous "O" had two; it's not recognized as an "O" while the above WAS...)

Since even the tiny image of "FOX" is a bit big, let's just focus on the leading corner of the 'F' - we'll explain the meaning of the letters (and numbers) as we go along. By the way, if you're only interested in a summary, skip ahead to 'big picture'.


The first line starts with an 'e' because line_edges() was called. The subsequent '='s indicate that the previous line had no cracks and the color at each x position has not changed from the previous (x-1) and has not changed from the same x on the previous line (y-1) - line_edges() is bored.

On the second row, the 'D' indicates that the previous line is empty but that the current x's color changed from that of the previous pixel (x-1). 'D' also means that v_edge() got called. To better see what v_edge() and h_edge() are up to, let's quickly rebuild with -DTV_FOCUSC and -DTV_FOCUSD.

Our image becomes notably 'noisy':

Legend: a-h (e is reserved) are from h_edge() & j-p are from v_edge() ("ccmain/tesseract testing/edges/edges_5E.tif out")

The first 'D' means that the previous line is empty but that the current x's color changed from that of the previous pixel (x-1) and is output after v_edge() is called: started a new vertical crack ('k'); the sign [2] <= 0 ('m'); and there's nothing to join ('n'). (Incidentally, look on the 3rd row directly under the 'n' - there's a 'p' in its place meaning that on THAT row there WAS an edge in progress... the one that just now was started.)

After the 'D', the 'E' means that color at 'x' changed from that at the same 'x' on previous line (y-1) - is output after h_edge() is called: started a new crack ('b'); the sign > 0 ('c'); and that an edge (here started by 'n') is continuing ('g'). (No 'j' printed because free_cracks is shared by v_edge & h_edge and that job was done by 'n').

Rest of the row is boring until the last 'D' which means the same thing it did before - the same x on the previous line (y-1) was different. What's key is that v_edge() now decides to: start a new vertical crack ('k'); the sign > 0; and there was a [fake] edge to join to (we've been tracing the horizontal edge of something and it has come to an end, so that 'end' is the vertical edge).

Of course, line_edges() keeps as many edges in progress as necessary. Three vertical lines (two on the sides and one in the middle) will look like this:

("ccmain/tesseract testing/edges/three_vert.tif out")

===== BIG PICTURE =====

line_edges() works scanline by scanline looking for CHANGES in color. When it finds these changes, it creates and tracks both vertical and horizontal edges:

e.g., a horizontal, one-pixel-wide line has BOTH: at the start there's a vertical edge, during the run of its length there are two horizontal edges, and at the end there's a final vertical edge.

In each successive scanline, edges are detected and cracks followed until TWO cracks meet [3] (forming a loop). Then, this list of points - now called an "outline" - is sent in for "cleaning and approximation" by complete_edge() and the list of edges is freed.
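The core of the scanline step - looking ONLY for changes, as in the XOR/derivative remark above - can be sketched in a few lines of Python (this is an illustration, not line_edges() itself; the white margin assumption is per note [3]):

```python
def row_transitions(row):
    """Given one scanline of binary pixels (0 = white, 1 = black),
    return the indices where the color flips from the previous pixel.
    An implicit white margin is assumed on both sides, so the count
    of transitions per row is always even - every crack that opens
    is guaranteed to close."""
    padded = [0] + list(row) + [0]        # forced-white margins
    return [x - 1 for x in range(1, len(padded))
            if padded[x] != padded[x - 1]]
```

An index equal to len(row) marks the trailing flip back into the white margin.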

Add reference to entry to read after Edge Detection

while callpicofeat() is called, it's from a somewhat unexpected place

above really needs a BS-check, courtesy Ray Smith.
For now, you can see Objects or just keep scrolling.

========= Q&A =================

Some of these questions (which were for myself) need answers and/or BS-check by someone more knowledgeable.
========= Notes: ===============

[1] My hacking revolves around removing/suppressing noise in FAXes which is manifest as continuous, 'vertical' streaks throughout the image at one or more x locations. This noise is disastrous to recognition and also, alas, chews a lot of CPU. I've had some success in eliminating the streaks by adding an additional state machine (very similar to line_edges()) that looks for these streaks. When these are confirmed (i.e., start below the header, last through the whole image & satisfy some other thresholds), they can be dealt with.

These streaks are much simpler to deal with than, for example, scans from ruled/lined paper because they are EXACTLY PARALLEL to the Y axis (they are caused by white-out/crud on the glass inside the FAX machine and so are fixed in location). And, all the pages FAXed on such a machine have streaks in EXACTLY the same x positions, so that only the first page requires careful analysis.

[2] sign is just a way of keeping track of what the color changed FROM. The sign is > 0 on a transition from edge to background (black-->white) and < 0 vice versa (white-->black).

need to verify exact relationship between color transition & sign
[3] the margins of the image are forced to be white. This, I think, guarantees that all edges DO end (and begin, if you think about it, since there is a known starting state for line_edges()).
there have been reports to the contrary on the forums on this exact issue - need to check why


Stands for the letter 'm'. Tesseract uses special heuristics when it thinks it sees an "m" because it can also be "r" followed by "n"! Other characters that have special heuristics are "I" (upper-case i), "1" (one), and "l" (lower-case L) - can YOU tell which is which in "lI1"?

The pre-release of tesseract used to have the file "DangAmbigs" which contained the following: "m rn", "rn m", "m in", "in m", "d cl", "cl d", "nn rm", "rm nn", "n ri", "ri n", "li h", "ii u", "ii n", "ni m", "iii m", "ll H", "I-I H", "vv w", "VV W", "t f", "f t", "a o", "o a", "e c", and "c e"

To my knowledge, only 'I1l' and 'm' receive special treatment in tesseract because the Context and DAWG (see the respective section) are used to increase the accuracy.
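The DangAmbigs idea above - each entry naming a letter group and what it can be confused with - is easy to sketch. Here is a Python illustration using a few of the pairs quoted above (the function and its name are ours, not tesseract's):

```python
# A few DangAmbigs pairs from above, as "seen text" -> "possible actual text".
AMBIGS = {"rn": "m", "m": "rn", "cl": "d", "vv": "w"}

def ambiguity_variants(word, ambigs=AMBIGS):
    """Generate alternative spellings by applying one substitution at a
    time; a dictionary check would then pick the plausible ones."""
    variants = set()
    for seen, actual in ambigs.items():
        start = word.find(seen)
        while start != -1:
            variants.add(word[:start] + actual + word[start + len(seen):])
            start = word.find(seen, start + 1)
    return variants
```

E.g., a noisy "corn" yields the variant "com", which context or the dictionary must then arbitrate.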



As previously mentioned, the features and their extraction are covered by Patent 5,237,627. (The names of the inventors will be familiar if you have looked at ExtractIntFeat() :-)

There are several types of 'features' & it is best to keep them distinct:

How the heck does MySqrt2() work? What's the multiplication by '41943' for?
A 'proto-feature' is composed of three 'units' of information: So, a bunch of individual features describe one prototype and there is one prototype for each ASCII character that can be recognized by tesseract.

Let's compare gocr's definition of a feature with tesseract's. Here's what gocr (v0.39) gives in its own docs (BTW, I'm not "picking on" gocr - it just happens to have clear documentation on what & how it approaches the same problem).

vvvv         vv- white regions
......@@......  <- crossing one line
....@..@@@....  <- white hole / crossing two lines
....@..@@@....  <- crossing two lines
..@@@@@@@@@@..  <- horizontal line near center
.@........@@@.  v- increasing width of pattern
.@........@@@.  v
.@........@@@.  v
    ^^^-- gap
"In the future, the program should detect edges, vertices, gaps, angles
and so on. This is called feature extraction (as far as I know)."

Other OCR programs, like gocr, attempt to extract high-level features of each character and classify the character based on those features. If the characters on the page are of high quality, such as an original typed page, these simple methods work well for converting the characters.

However, as document quality degrades, such as through multiple generations of photocopies, carbon copies, facsimile transmission, or in other ways, the characters on a page become distorted causing simple processing methods to make errors.

For example, a dark photocopy may join two characters together (like the "r" and "i", below), causing difficulty in separating these characters for the OCR processing. Joined characters can easily cause the process that segments characters to fail, since any method which depends on a "gap" between characters cannot easily distinguish characters that are joined.


Dark image joins adjacent letters

Light photocopies produce the opposite effect. Characters can become broken, and appear as two characters, such as the character "u" being broken in the bottom middle to create two characters (below), each of which may look like the "i" character. Also, characters such as the letter "e" may have a segment broken to cause them to resemble the character "c".


Light image breaks letters


Light image breaks closures

Programs like gocr begin by extracting higher level features from each character image in an attempt to select a set of features which would be insensitive to unimportant differences, such as size, skew, presence of serifs, etc., while still being sensitive to the important differences that distinguish between different types of characters.

High level features, however, can be very sensitive to certain forms of character distortion, as demonstrated above. That is, a simple break in a character can easily cause a closure to DISAPPEAR, and the feature extractor method that depends on such closures would probably classify the character incorrectly.

Often the high level feature representation of a character contains few such features. Therefore, when a single feature is destroyed, the classification (differentiation from other characters) is negatively impacted.

Tesseract takes a completely different approach, albeit a much more complicated one. (This is a high-level description, as per the patent - a LOT happens inside tesseract before your image gets to the stage where features are recognized.)

Like tesseract, we'll start at the highest/max y location in the following sample. In these examples, we'll only consider the OUTER outline (see note).

Note: due to the way in which edge-detection works, characters that are open (have no closures) like C, T, V, etc. have only ONE outline (the outer one). On the other hand, characters like O, P, A, have one closure and characters like B actually have two. See "Edge Detection" for the gory details.

[Figure: sample character outline with numbered points; 0 degrees points due East]

Looking at the above picture, start at "1" and go Counter-Clock Wise (CCW):

After you arrive back at point "1", you should have a table of directions. This table is what tesseract stores for every character and (probably) for every font. Now, ponder this question: if you were to ignore 5 random lines in this table, would you still be able to recognize this particular character? The answer is "probably" because sufficient features remain.
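The "ignore 5 random lines" thought experiment can be made concrete with a toy Python scorer (an illustration of the robustness argument, not tesseract's actual matcher; the direction table is made up):

```python
from collections import Counter

def match_score(stored, observed):
    """Fraction of the stored direction features also found in the
    observed set, ignoring order - losing a few features only lowers
    the score slightly instead of breaking the match outright."""
    stored_c, observed_c = Counter(stored), Counter(observed)
    hit = sum(min(n, observed_c[d]) for d, n in stored_c.items())
    return hit / len(stored)

# Hypothetical stored direction table for some character:
template = ['N'] * 5 + ['W'] * 3 + ['S'] * 5 + ['E'] * 3
noisy = template[:-3]   # three features destroyed by noise
```

Even with three of sixteen features gone, the noisy trace still scores above 0.8 against the template.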

That is precisely why tesseract is insensitive to character segmentation boundaries. These features also have a low enough level to be insensitive to common noise distortions and sufficient number of features are created that some will remain to allow character classification even if others are destroyed by noise. Finally, these features are a lot less sensitive to font variations than many other matching techniques.

With that said, there are several problems. First, tesseract needs to be very carefully trained before any additional letter can be recognized. This is why languages other than English are currently not supported (as of version 1.02). Second, these trained features must be stored somewhere. This means that unless they are somehow incorporated into the executable (which has its own problems), they must be stored in a file. In version 1.02 that file is in tessdata directory and is called "inttemp" (a bit over 600KB). Third,

I'm not done with this section!


FIX: A character can be made of several blobs but, initially, features are extracted from single blobs. Up to MAX_NUM_INT_FEATURES (defined as 512 in classify/intproto.h) can be extracted from a character.


FIX: The pre-trained templates (See Prototype) are sufficiently 'smeared' (by SmearBulges() in classify/mfx.cpp) to comfortably match different fonts representing any particular character.

FIX: this is not finished. But it's better than it was before :-)


Stands for Feature Extraction, see Feature.



Space between blobs, not always a 'space', could be between text and table boundary, for example. See Gap map and gap_map.cpp

Gap Map

A block gap map is a quantised histogram of whitespace regions in the block. It is a vertical projection of wide gaps WITHIN lines.

The map is held as an array of counts of rows which have a wide gap covering that region of the row. Each bucket in the map represents a width of about half an xheight - (The median of the xhts in the rows is used.)

The block is considered RECTANGULAR - delimited by the left and right extremes of the rows in the block. However, ONLY wide gaps WITHIN a row are counted. (definition moved from gap_map.cpp)
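A Python sketch of the description above (names and the row representation are our assumptions, not gap_map.cpp's internals):

```python
def gap_map(rows, xheight):
    """Quantised histogram: for each bucket about half an xheight wide,
    count how many rows have a wide gap covering that x region.
    rows: per-row lists of (left, right) wide-gap extents, in pixels."""
    bucket_w = max(1, xheight // 2)
    counts = {}
    for gaps in rows:
        seen = set()                      # count each row at most once per bucket
        for left, right in gaps:
            for b in range(left // bucket_w, right // bucket_w + 1):
                seen.add(b)
        for b in seen:
            counts[b] = counts.get(b, 0) + 1
    return counts
```

A tall count in one bucket suggests a vertical run of whitespace, e.g. a table column boundary.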


Hash Table

A hash table, or a hash map, is a data structure that associates keys with values. The primary operation it supports efficiently is a lookup: given a key (e.g., a person's name), find the corresponding value (e.g., that person's telephone number). It works by transforming the key using a hash function into a hash - a number that is used to index into an array to locate the desired location ("bucket") where the values should be. You use a hash table every time you look up a phone number in the phone book. See closed.cpp, bestfirst.cpp, hashfn.cpp, and memblk.cpp
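For readers new to the structure, here is a minimal separate-chaining hash table in Python (purely illustrative; it is unrelated to hashfn.cpp's implementation):

```python
class HashTable:
    """Minimal hash table with separate chaining for collisions."""
    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, key):
        # The hash function maps the key to one of the fixed buckets.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key exists: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key: chain it

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default
```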


FIX: Definition is easy, how it's used is a different story.


In photography, a histogram is a graph counting how many pixels are at each level between black and white. Black is on the left. White is on the right. The height of the graph at each point depends on how many pixels are that bright. Lighter images move the graph to the right. Darker ones move it to the left. In tesseract, a histogram is used to convert a RANGE of certainties into a table of 'quantized' buckets/units of certainty. See gap_map.cpp, tospace.cpp, cluster.cpp, etc.
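The "range into quantized buckets" step looks roughly like this Python sketch (an illustration of quantization in general, not a transcription of any tesseract routine):

```python
def quantize(values, nbuckets, lo, hi):
    """Histogram a range of certainties into a fixed number of buckets."""
    width = (hi - lo) / nbuckets
    counts = [0] * nbuckets
    for v in values:
        # Clamp so v == hi lands in the last bucket instead of overflowing.
        b = min(int((v - lo) / width), nbuckets - 1)
        counts[b] += 1
    return counts
```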

How Tesseract Works

See the page on How Tesseract Works and the page What do all those letters for TEXT_VERBOSE mean?



Three letters that give tess fits enough they get their own set of heuristics!

In fact, if you look at the .raw file from the testing/image_courR18.tif, you will see that ALL 'i's come out as '1's while this does not happen with testing/image_2helvR18.tif (.raw files are not output by v1.03)

Does this mean that the feature extractor should have a set of templates for different font families? I know the dictionary and word-heuristics certainly help (after all the .txt file is almost right) but this seems to indicate a serious shortcoming.

As an example of my comment above, 'Bristol' is not in the dictionary, so the '1' in .raw gets turned into an 'l' by the word-heuristics, not the 'i' that it should have been. I.e., there is only so much the rules can do, and it's always better to improve the .raw output.


[Place holder]


[Place holder]

K-D search trees

FIX: Definition is easy, how it's used is a different story.


Linear Least Squares Fit

FIX: Definition is easy, how it's used is a different story.



FIX: Definition is easy, how it's used is a different story.







Neural Networks

Tesseract used to employ a neural network called Aspirin/Migraines (which has been removed due to licensing issues) which allowed training. How did this removal affect tesseract's accuracy? Ray Smith said, "When I took the NN code out, I measured the increase in error rate to be slightly less than 1% (relative). The benefit was a 10-15% improvement in speed. I have some accuracy/bug fixes coming in the next release (i.e., v1.03) that more than compensate by reducing the error rate by more than 3%. Of course any error rate changes are dependent on the test set and are not necessarily reproducible on a different test set..."

N-D space

[Place holder]



Usually, the only way tess can tell apart a zero from an upper-case 'o' is from the context.

If you look at the .raw file from the testing/image_courR18.tif, you will see that ALL 'o's come out as '0's while this does not happen with testing/image_2helvR18.tif. See the entry for 'I1l' for more info.


Hierarchical & abstracted objects/names for things being worked on and derived within tesseract; starting with the physical page and ending with words written to txt file.

It would be most helpful if Ray Smith could look this list over and point out dumb mistakes. Our deeper understanding of what's going on depends on this.

list should describe whether each object is temporary (i.e., bucket); can be de/serialized (?block?); exists at the same time but for different purposes (outlines & blobs) and what those different purposes are; is derived from the user's image or from developer/pre-training; serves a pivotal/conceptually important derivation; etc.
(from the physical page to words written to txt file) (haven't figured out these well enough to say anything :-)

Outline feature

See Feature, outfeat.cpp



When the letters that make up a word are initially recognized, there may be several possibilities for each position; each one with decreasing certainty. Put another way, the initial word is made up of letters sorted by certainty.

However, sometimes the letter with the highest certainty may not be the correct one, which certain heuristics can catch (e.g., a word generally doesn't have a number in the middle of it, so the '1' (one) is changed to an 'l' (L), etc.)

After these heuristics are applied, tesseract sends the word to the permuter, which is a function that exchanges less-certain letters in each position and then checks whether that PARTICULAR combination of letters matches directly in the DAWG/dictionary. Every (??) word is permuted in this manner in the hope that at least one of these directly matches the dictionary.

See entry for DAWG and permute.cpp
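The exchange-and-check loop described above can be sketched in Python (an illustration of the idea only - permute.cpp is far more involved, and the fallback rule here is our assumption):

```python
from itertools import product

def permute(candidates, dictionary):
    """candidates: one list of letter choices per position, best first.
    Try combinations of less-certain letters until one matches the
    dictionary; otherwise fall back to the best-rated letters."""
    for letters in product(*candidates):
        word = "".join(letters)
        if word in dictionary:
            return word
    return "".join(c[0] for c in candidates)
```

E.g., with per-position candidates [["h"], ["e"], ["1", "l"], ["l"], ["o"]] and a dictionary containing "hello", the permuter rejects "he1lo" and settles on "hello".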


FIX: See picofeat.cpp


Depends on font type, how 'close together' the letters are. See topitch.cpp. Similar to kerning space, see tospace.cpp ?


FIX: Definition is easy, how it's used is a different story.


FIX: Definition is easy, how it's used is a different story.


A prototype is a model for each letter which allows tesseract to match up that model with any particular blob and 'recognize' that character.

The included file 'tessdata/inttemp' contains built-in/pre-trained prototypes for all ASCII characters for a dozen or more different fonts.

Ray Smith has released the code (training/training.cpp) used to generate this (tessdata/inttemp) file but we do not yet know how to a) make the training image(s), b) do the adapting, c) work the training code.


(From Ray Smith) The class pruner is a pre-classifier that is used to create a short-list of classification candidates (pruning the possible classes) so that the full distance metric can be calculated on the short-list without taking excessive time, instead of exhaustively matching against each character possibility. The class pruner uses a faster, but approximate method of matching the features, so while it does make mistakes, the mistakes are rare. [With that said,] make_config_pruner() in adaptmatch.cpp is never [presently] called, so it is irrelevant.
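The shape of the pruning step - cheap approximate scoring first, expensive full match only on the survivors - can be sketched as follows (a Python illustration with invented names and a naive overlap score, not the pruner's actual metric):

```python
def prune_classes(features, class_tables, shortlist=3):
    """Score every class with a cheap, approximate feature overlap and
    return only the top few; the slow, full distance metric would then
    run on this short-list instead of on every character class."""
    scores = {c: len(set(features) & set(t))
              for c, t in class_tables.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:shortlist]
```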



FIX: Definition is easy, how it's used is a different story.

?? Used in approximating the outline - converting a 'shape' from a list/series of coordinate-pairs (in one coordinate system) into a 'formula' (in any coordinate system). ?? A, B, C are coefficients of the quadratic ?? Making each feature a quadratic, of sorts ??


Raw output

It is easy to get tesseract to output the 'raw' output in addition to the regular .txt file. HOW? Edit ccmain/output.cpp and change the FALSE following tessedit_write_raw_output to TRUE. Rebuild and tesseract will now ALSO generate a .raw file.

You might be interested in the raw file when fiddling with the feature extractor and accuracy of tesseract PRIOR to DAWG/dictionary.

Radius of gyration

Radius of gyration of the character shape about its center of mass, in both the X and Y directions, is used by tesseract to scale each character prior to recognition.

See ExtractIntFeat() in classify/intfx.cpp
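For reference, the quantity itself is simple to compute; here is a Python sketch over a character's black-pixel coordinates (an illustration of the definition, not ExtractIntFeat()'s code):

```python
import math

def radii_of_gyration(points):
    """Radius of gyration about the centre of mass, separately in X and Y:
    the RMS distance of the shape's pixels from its centroid."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    rx = math.sqrt(sum((x - cx) ** 2 for x, _ in points) / n)
    ry = math.sqrt(sum((y - cy) ** 2 for _, y in points) / n)
    return rx, ry
```

Dividing a character's coordinates by these radii yields the scale normalization mentioned above.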






Reject map



Horizontal series of blobs split off from those above & below by seams. See makerow.cpp



Tesseract plays with seams until the recognition confidence improves, see findseam.cpp
Q: What's the difference between seam and edge (A: Edges are pre-outline while seams are post-row)?


Noise on the scanned page: dust, dirt, etc. Can wreak havoc on the blob classification routines. Internally done by speckle.cpp. Note that in tesseract the dot above the 'i' is detected IMMEDIATELY PRIOR to noise removal.
Put "i" dot ref and note here.
External option is to run pbmclean, which 'flips' isolated pixels.






List of type SPLIT (split.h)


Proprietary HP graphics sub-system. Remember the alphabet soup of graphics 'standards' in the late 1980's (MDA/CGA/EGA/VGA/etc)? Ugh. I'm sure this worked a lot better.



Stopping criteria

Set of rules that finalize the results from the word classifier, see stopper.cpp







Text Ordering, see tordvars.cpp for variables and tordmain.cpp for code.



Underlines, like 'I1l' and 'm', give tesseract fits. Underlines need to be chopped from the letter-blobs (based on baseline?) See underlin.cpp







Wise Owl

Tesseract's old classification subsystem, see choices.cpp


Blobs are chopped into words, see wordseg.cpp




WARNING for including links in glossary

Do not be tempted to use URL links LOCAL on YOUR machine. Only give external resources. Internal sources can be referenced with filename.cpp or FunctionName() <- the parentheses and spacing are important.

Fix broken references:
282: Warning: unable to resolve reference to `callfeatextr' for command
284: Warning: unable to resolve reference to `callpicofeat' for command


Generated on Wed Feb 28 19:49:29 2007 for Tesseract by  doxygen 1.5.1