Hacking Tesseract V0.05

Changes for:

1.03f4

Introduction to hacking Tesseract v1.04

How Tesseract Works: What's going on

I encourage you to keep the following list in mind when doing your own hacking, and to help me add more relevant details. If possible, please reference a TEXT_VERBOSE letter or name the function(s) doing the key work.

READING INPUT
EDGE DETECTION/OUTLINES
LINES/SKEW
WORDS/SEGMENTER
CLASSIFICATION
QUALITY
WRITING OUTPUT

Entry Points

Heuristics

(just adding them here for now, will organize it later!)

Tips and Hints

Tesseract has comments, sometimes in big blocks, scattered throughout the code. Please add any others you find!

Segmenters

Be sure to check out the link to "Related Pages" (in left frame).

Working out how Tesseract works

This section lists the sequence of events that tesseract 1.02 executes to convert the input image 'scan.tif' into the output ASCII file 'scan.txt'. If you notice something wrong, please post corrections on sourceforge.net.

By the way, if you define TEXT_PROGRESS you will get a period ('.') each time tesseract finds a seam between words, which reassures you that it has NOT hung.

If you ALSO define TEXT_VERBOSE, key functions in tesseract will each print one character that shows you what is going on, i.e., what tesseract is doing at any point. See the next section for what those letters are and what they mean.
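
The following is a minimal sketch of how compile-time switches like these can work. It is not Tesseract's actual code: only the macro names TEXT_PROGRESS and TEXT_VERBOSE come from the text above, while trace_char(), trace(), the is_seam flag, and printing to stdout are assumptions made for illustration.

```cpp
#include <cstdio>

// Hypothetical sketch, NOT Tesseract's real implementation. Only the macro
// names TEXT_PROGRESS and TEXT_VERBOSE are taken from the documentation.
// Normally these would be set on the compiler command line (-DTEXT_PROGRESS).
#define TEXT_PROGRESS
#define TEXT_VERBOSE

// Returns the character a traced function would print, or '\0' for nothing.
//   verbose_letter: the letter this function reports under TEXT_VERBOSE
//   is_seam:        true when the caller has just found a seam between words
inline char trace_char(char verbose_letter, bool is_seam) {
#if defined(TEXT_PROGRESS) && defined(TEXT_VERBOSE)
  (void)is_seam;
  return verbose_letter;        // verbose: a letter instead of the period
#elif defined(TEXT_PROGRESS)
  (void)verbose_letter;
  return is_seam ? '.' : '\0';  // progress only: a period per seam
#else
  (void)verbose_letter;
  (void)is_seam;
  return '\0';                  // tracing compiled out
#endif
}

// Prints the trace character immediately so progress is visible while running.
// Writing to stdout is an assumption of this sketch.
inline void trace(char verbose_letter, bool is_seam) {
  const char c = trace_char(verbose_letter, is_seam);
  if (c != '\0') {
    std::putchar(c);
    std::fflush(stdout);
  }
}
```

A key function would call something like trace('x', false) on entry, where 'x' is a placeholder letter and not one of the real TEXT_VERBOSE letters. Flushing after every character matters here, because the whole point is to see that the program has not hung.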

There is also a separate file with stack traces of some interesting or common functions while they are RUNNING; see How Tesseract Works: Procedure stack traces. Together with TEXT_VERBOSE, these give you a way to play with tesseract without necessarily being a C++ wizard, per se :-)

What do all those letters for TEXT_VERBOSE mean?

If you define TEXT_VERBOSE in addition to TEXT_PROGRESS, instead of a period, you will get other letters which are defined as follows:

Notes: To give you an idea of what you'd get, below you can see what happened when I ran tesseract on a file from the 'testing/' directory. I generated it with 'pbmtext' using the included '2helvR18.bdf' font. Other tools used were pgmtopbm and pnmtotiff.

The input text was the Tesseract license (see testing/Run_Tests.sh for more details):

This package contains the Tesseract Open Source OCR Engine.
Originally developed at Hewlett Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado, all the code
in this distribution is now licensed under the Apache License:

** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the License.
** You may obtain a copy of the License at
** http://www.apache.org/licenses/LICENSE-2.0
** Unless required by applicable law or agreed to in writing, software
** distributed under the License is distributed on an "AS IS" BASIS,
** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
** See the License for the specific language governing permissions and
** limitations under the License.

Please note that different fonts produce different output: the letters in the image will 'interfere' with each other differently, and the word spacing will differ. Different fonts also have different features, so the phase that depends on those features will differ as well!

(I wrapped the output with 'fold -w 76')

[blah blah]

(gdb) r
gkTesseract Open Source OCR Engine
Using LIBTIFF
Opened and reading 'testing/image_2helvR18.tif'...
Recognizing page
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeer
lqmmmmmmmmmmmmmnxlmmmmmmmmmmmmmlllmmmmmmmmmmmmmummmmmmmmmmmmmjtttttttttttttt
tttttttttttttttttttttttttttttttr
pppoooopppoooooopppooopppppppppspppppppppppppppppppppopppooopppsppppppoor
pppspppppppppppppppooopppppppppsppppppppppppspppspppppppppppppppoopppspppppp
ppppppsspppr
pppppppppsppppppppppppspppspppppppppppppppspppoopppoopppsspppppppppppppppppp
pppr
pppppppppppppppopppppppppppppppppppppr
pppppppppspppppppppppppppppppppppppppopppoopppppppppor
pppppppppopppppppppppppppppppppopppsppppppopppppppppppppppppppppr
ppppppsoppppppppppppppppppppppppppppppppppppr
ppppppspppppppppppppppppppppppppppppppppppppppsssspppppppppppppppppppppppppp
ppppppppppppppppppppppppppppppppppppppppppppppppppppppppppr
ppppppopppoppppppppppppppppppppppppppppppsppppppspppr
pppppppppspppppppppppppppppppppppppppppppppppppppppppppppppppooor
ppppppoopppooppppppopppopppppppppoppppppspppspppppppppppppppr
pppppppppppppppppppppspppppppppppppppppppppsspppr
pppppppppppppppppphhhpppssppphpppspppsssspppppppppppppppppphhhpppsppphhpppss
ppppppppphhpppsppphpppspppsspppppphpppspppssspppppppppppppppppphhpppsppppppp
pphpppsssppphpppsppphpppspppsspppppphpppspppssspppppppppppppppppphpppsppphpp
phpppssspppppphpppsppphpppssppphhhppphhppphhppphhpppssppphpppssssppphhppphhp
ppssssppppppppphpppssssppphpppssppphhhpppssppphhppphhpppshpppppphpppssppphpp
phpppssppphpppspppppphhhpppppphpppssppphhppphhpppshhpppsppphhpppppphpppssppp
hhpppsppphppphpppsppppppppppppppppppppppppppppppppppppspppsppppppppppppsssss
sssssspppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppphppphhhh
hpppsppphhhpppppphhpppsppphhppphhpppssssppppppppphhppphhpppppphpppsppphpppsp
pphpppppphpppsppphppphhhhhhhpppppphhpppspppshhhppphpppsssssppphpppssppphhppp
shpppssppphhppphhhpppsssppphppphppphhpppssppphhzzzpppspppspppsspppsspppssppp
sppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
ppppppppppppppphzzzpppspppspppspppsppppppppppppppppppppppppppppppppppppppppp
ppppppphzzzpppspppsssppppppppppppppppppppppppppphzzzpppspppspppspppspppppppp
pppppppppppppppppppppppppppppppppppppppphzzzzzzppppppppphzzzppphpppssppphzzz
pppspppsspppsspppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
ppppppppppppppppppppppppphzppppppppppppppppppppppppppppppppppppppppppppppppp
pppppppppppppppppppppppppppppppppppppppppppphzzzpppspppspppppppppppppppppppp
phzzzpppsppppppppppppppphzzzpppssppphzzzpppsspppppphzzzpppspppsssspppppppppp
pppppppphzzzpppspppssssppppppppphzpppspppsssppppppppppppppppppppppppppphzzpp
phpppssppphzzzzpppsppppppppppppppphzzzpppspppppppppppppppppppppppphzzzpppspp
ppppppppppppppppppppppppppppppphzzzzzzpppspppspppppppppppppppppppppppppppppp
pppppppppppppppppphzzzpppspppppppppppppppppppppppppppppppppppppppppppppppppp
ppppppphzzvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvy
vyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvy
vyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvy
vyvyvyvyvyvyvyvyvyvyvyvyvyvya

Program exited normally.
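
A dump like the one above is easier to read once runs of the same letter are collapsed. The helper below is an illustrative sketch, not part of Tesseract; the name summarize_trace() and its output format are made up for this example. It skips line breaks, so runs that 'fold' wrapped across lines are counted as one.

```cpp
#include <cstddef>
#include <string>

// Illustrative helper, not part of Tesseract: collapses runs of identical
// trace letters into "<letter><count>" tokens separated by spaces,
// e.g. "eeerlq" becomes "e3 r1 l1 q1".
// Line breaks (such as those added by 'fold') are ignored, so a run that was
// wrapped across several lines is still reported as a single run.
std::string summarize_trace(const std::string& trace) {
  std::string out;
  char current = '\0';
  std::size_t count = 0;

  // Appends the run collected so far (if any) to the summary.
  auto flush = [&]() {
    if (count == 0) return;
    if (!out.empty()) out += ' ';
    out += current;
    out += std::to_string(count);
  };

  for (char c : trace) {
    if (c == '\n' || c == '\r') continue;  // undo the line wrapping
    if (count > 0 && c == current) {
      ++count;                             // same letter: extend the run
      continue;
    }
    flush();                               // new letter: close the old run
    current = c;
    count = 1;
  }
  flush();                                 // close the final run
  return out;
}
```

Feeding it only the letter lines of a trace (without the banner and status messages) gives a compact picture of how long each phase runs before the next letter appears.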

Links to utilities/projects/fun hacks for tesseract-ocr files

The End


Generated on Thu Nov 30 18:45:59 2006 for Tesseract by  doxygen 1.5.1