Hacking Tesseract V0.04
(Minor changes June 2007)
1.03f3
- Before plowing into these hacking notes, you might want to read what
tesseract is all about (3 different links).
- Tesseract is a commercial-quality OCR engine originally developed at Hewlett-Packard
between 1985 and 1995. In 1995, this engine was among the top three evaluated
by UNLV.
In 2005, Hewlett-Packard and UNLV open-sourced it, and it is now freely
available under the Apache 2.0 license.
Most notably, this means that you CAN use tesseract in a commercial
product WITHOUT releasing the sources. If someone builds on tesseract, you can only
politely ask to see their sources... keep this in mind in the forums!
- While Tesseract OCR is now hosted on Google Code, it used to be on Sourceforge.
This is relevant because a LOT of discussion and patches have been posted on the
Sourceforge Forums.
Alas, because the Tesseract developers switched from CVS to Subversion (and sf.net
only provides CVS), they had to move to Google Code. Thus, all newer issues and
patches (and, certainly, anything NEW!) should be posted on Tesseract Issues on
Google. The bummer is that you'll need to set up another account on Google unless
you happen to have one already.
- Tesseract is being used as a plug-in for ocropus, a state-of-the-art document analysis and OCR system (featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities). You can see some
details in the Supplementary Documentation ("processing steps" should look familiar).
- For those looking for Tesseract on Mac OS, have a look at cff2doc.
- Tesseract is already being used to do work... there WILL be more...
- While there are functional hooks in the source for both X11 and Win32 graphical
functions, there is presently no such support.
- The doxyfied v1.03 sources have been released (~6MB). See the announcement for more info on what this contains.
- The tesseract 1.03 sources include A LOT of code that is not needed to carry out OCR from an end-user's point of view. For example, code is included for:
- Adaptive matcher and training code. I am under the impression that Ray Smith is currently working on the training code. Because of how character templates (as well as word-lists) are used in the recognition process, Mr. Smith will need to complete his work before any languages
other than English can be reliably recognized.
- Starbase and Win32 Dialogs - see note above.
- Tesseract API - see note above as well.
- There also appears to be code from previous 'generations' of tesseract, or
maybe from a future version that never got completed?
- The features and their extraction from blobs (and ONLY that part) are covered under Patent
5,237,627. This link will give you
all the gory details (my patches will have it under the docs/ directory as
FeatureExtraction_patent_5237627.pdf). The down-side is that the document is written in
legalese/patentese, which makes it a tedious read. I have tried to reference columns and
rows from the patent in the comments within the sources. Please do the same if you can!
- The sources have been marked up with Doxygen-compatible comments (http://www.doxygen.org/). I did this with several hacked perl scripts which turned existing C++ comments into something doxygen likes. I also marked up by hand some of the functions I was trying to understand. Please add your documentation this way too.
- The Glossary needs some work. The page you're reading now comes from tesseractmain.cpp in ccmain/
- A little something about the registered developers of tesseract. While it's easy to find
info on Luc Vincent, Ray Smith is proving more elusive - anyone have any leads?
I encourage you to keep the following list in mind when doing your own hacking and help me add more relevant details. If possible, please reference a TEXT_VERBOSE letter or provide function(s) doing key work.
READING INPUT
- Lines are read in from scanned image, in edge detection, e
EDGE DETECTION/OUTLINES
- Black pixels are split into blobs, aka edge detection, e
- Blobs are processed to extract outlines, in edge detection, e
LINES/SKEW
- Lines are derived from strings of blobs with outlines, l
- Gradient/rotation of page is calculated, q
- Lines are adjusted for skew, m
- Final touches on assigning blobs, now that lines are KNOWN, underlines: u
WORDS/SEGMENTER
- Higher-level procedure to order blobs into words, j
- Blobs in lines are segmented into words, t
- Fine-tuning of vertical seams/splits between some blobs, spacing: s
CLASSIFICATION
- Classification of features in letters of all words performed, o
- Words are checked in dictionary and permuter to improve them, p
- Play with xht (height of letter 'x') for words, h
- Words are fitted to lines and assigned to rows that fit them best, r
QUALITY
- Quality of words and letters is checked, v and y
WRITING OUTPUT
- Words are written out to .txt file, w
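As a rough aid for reading a trace against the list above, here is a minimal C++ sketch that maps each letter to its phase heading and collapses a trace into the ordered sequence of phases it visits. The function names phase_of() and phases_in() are hypothetical helpers made up for this page, not part of the tesseract sources, and the '.' seam mark is assumed to belong with 's'.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not in the tesseract sources): maps a progress
// letter to the phase heading used in the list above. Letters that the
// list does not mention map to "OTHER".
inline std::string phase_of(char letter) {
  switch (letter) {
    case 'e':
      return "READING INPUT / EDGE DETECTION";
    case 'l': case 'q': case 'm': case 'u':
      return "LINES/SKEW";
    case 'j': case 't': case 's': case '.':  // '.' assumed equivalent to 's'
      return "WORDS/SEGMENTER";
    case 'o': case 'p': case 'h': case 'r':
      return "CLASSIFICATION";
    case 'v': case 'y':
      return "QUALITY";
    case 'w':
      return "WRITING OUTPUT";
    default:
      return "OTHER";
  }
}

// Collapses a trace into the sequence of phases it passes through,
// merging consecutive letters that belong to the same phase.
// Whitespace (e.g. from line wrapping) is skipped.
inline std::vector<std::string> phases_in(const std::string& trace) {
  std::vector<std::string> phases;
  for (char c : trace) {
    if (c == '\n' || c == ' ') continue;
    const std::string phase = phase_of(c);
    if (phases.empty() || phases.back() != phase) phases.push_back(phase);
  }
  return phases;
}
```

Calling phases_in() on a short made-up trace such as "eelqmjtpow" yields five phases in order. A real run will bounce between phases more often, since tesseract revisits some steps.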
(just adding them here for now, will organize it later!)
Tess has comments, sometimes in big blocks, scattered within the code. Please add any others you find!
Be sure to check out the link to "Related Pages" (in left frame).
This section lists the sequence of events that tesseract 1.02 executes to convert the input image 'scan.tif' into the output ASCII file 'scan.txt'. If you notice something wrong, please post corrections on sourceforge.net.
By the way, if you define TEXT_PROGRESS you will get a period ('.') when tesseract finds a seam between words, which gives you a good idea that it DID NOT hang.
If you ALSO define TEXT_VERBOSE, key functions in tesseract will print one character that shows you what is going on, i.e. what tesseract is doing at any point. See next section for what those letters are and what they mean.
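To make the interplay of the two defines concrete, here is a minimal C++ sketch of how such progress marks could be wired up. It is NOT copied from the tesseract sources: seam_char(), verbose_char() and report() are hypothetical names, writing to stderr is an assumption, and the macros are defined inline only so the sketch stands alone (normally you would pass them as -DTEXT_PROGRESS -DTEXT_VERBOSE).

```cpp
#include <cstdio>

// Assumption: normally supplied on the compiler command line; defined
// here only to keep the sketch self-contained.
#define TEXT_PROGRESS
#define TEXT_VERBOSE

// Hypothetical helper: mark reported when a seam is found.
// '.' with TEXT_PROGRESS alone, 's' when TEXT_VERBOSE is also defined,
// '\0' (nothing to print) when neither is defined.
inline char seam_char() {
#if defined(TEXT_PROGRESS) && defined(TEXT_VERBOSE)
  return 's';
#elif defined(TEXT_PROGRESS)
  return '.';
#else
  return '\0';
#endif
}

// Hypothetical helper: mark reported by any other key function.
// A letter is produced only when both macros are defined.
inline char verbose_char(char letter) {
#if defined(TEXT_PROGRESS) && defined(TEXT_VERBOSE)
  return letter;
#else
  (void)letter;  // unused in this configuration
  return '\0';
#endif
}

// Writes the mark (if any) and flushes right away, so progress is
// visible while recognition is still running.
inline void report(char mark) {
  if (mark != '\0') {
    std::fputc(mark, stderr);
    std::fflush(stderr);
  }
}
```

A key function would then call something like report(verbose_char('p')) once per logical unit of work. Flushing after every character is what makes the output useful as a "did it hang?" indicator.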
There is also a separate file that has stack traces for some interesting/common functions RUNNING, see How Tesseract Works: Procedure stack traces. Together with TEXT_VERBOSE, these will give you a way to play with tesseract without necessarily being a C++ wizard, per se :-)
If you define TEXT_VERBOSE in addition to TEXT_PROGRESS, instead of a period, you will get other letters which are defined as follows:
- a =
- b =
- c =
- d =
- e = Reading & scanning line of image for edges, building outlines, in line_edges()
- f =
- g = Loading DAWGs ('word-dawg'+'user-dict'), in init_permute()
- h = Playing with xht for one word, in re_estimate_x_ht()
- i =
- j = Arranging blobs into words, make_words()
- k = Initializing speckle params, in InitSpeckleVars()
- l = Assigning blobs to one line, in assign_blobs_to_rows()
- m = Fitting LMS line to a row, in fit_parallel_lms()
- n = Computing linespacing and offset, delete_non_dropout_rows()
- o = Extracting outlines for a class NOT SEEN BEFORE, in ExtractOutlineFeatures()
- p = Using DAWG to improve a word, in dawg_permute_and_select()
- q = Computing gradient of whole page, in compute_page_skew()
- r = Assembling recognized blobs into rows, in make_rows()
- . or s = Found good seam between words to split a blob, in attempt_blob_chop()
- t = Finding optimal segmentation, in check_pitch_sync2()
- u = Processing underlines, in separate_underlines()
- v = Checking quality of words, in word_blob_quality()
- w = Writing output of recognition, in output_pass()
- x = Expanding rows to touch neighbors, in expand_rows()
- y = Checking quality of characters in words, in word_char_quality()
- z = Evaluating word spacing, in eval_word_spacing()
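If you capture a trace, the table above can be turned into a small lookup to see which functions ran and how often. The following C++ sketch is only an illustration: verbose_table() and count_calls() are hypothetical helpers, and the function names are simply copied from the list above ('.' is assumed to count as 's').

```cpp
#include <map>
#include <string>

// Letter-to-function table copied from the list above. Unassigned
// letters (a, b, c, d, f, i) are simply absent.
inline const std::map<char, std::string>& verbose_table() {
  static const std::map<char, std::string> table = {
      {'e', "line_edges"},
      {'g', "init_permute"},
      {'h', "re_estimate_x_ht"},
      {'j', "make_words"},
      {'k', "InitSpeckleVars"},
      {'l', "assign_blobs_to_rows"},
      {'m', "fit_parallel_lms"},
      {'n', "delete_non_dropout_rows"},
      {'o', "ExtractOutlineFeatures"},
      {'p', "dawg_permute_and_select"},
      {'q', "compute_page_skew"},
      {'r', "make_rows"},
      {'s', "attempt_blob_chop"},
      {'.', "attempt_blob_chop"},  // assumption: '.' is the same event as 's'
      {'t', "check_pitch_sync2"},
      {'u', "separate_underlines"},
      {'v', "word_blob_quality"},
      {'w', "output_pass"},
      {'x', "expand_rows"},
      {'y', "word_char_quality"},
      {'z', "eval_word_spacing"},
  };
  return table;
}

// Counts how often each known letter occurs in a captured trace and
// reports the totals keyed by function name. Unknown characters
// (newlines, unassigned letters) are ignored.
inline std::map<std::string, int> count_calls(const std::string& trace) {
  std::map<std::string, int> counts;
  const std::map<char, std::string>& table = verbose_table();
  for (char c : trace) {
    const auto it = table.find(c);
    if (it != table.end()) ++counts[it->second];
  }
  return counts;
}
```

Feed it only the progress letters: ordinary program messages mixed into the same output contain letters too and would inflate the counts.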
Notes:
- "o" will not print for EVERY letter because tesseract only needs to see it once. Thus, if any letters repeat and are very similar in appearance, i.e. are not messed up in some way by noise, an "o" will only appear for the FIRST occurrence of that letter. Ex: the palindrome "PERLLREP" will only print four "o"s.
- "r" also prints a new-line (\n)
- If you want to add a new character, make sure that it doesn't print too often, preferably once per some logical unit like "per word" or "per outline". If we're out of lower-case letters, start using upper-case but avoid ambiguous letters like upper-case 'i' ("I" vs "l"?), etc.
- Because tesseract calls functions based on 'difficulties' encountered in the image, you may get a different set of letters for different images, but the overall structure should be the same.
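The "PERLLREP" arithmetic in the first note is just a count of distinct characters. A tiny C++ sketch, assuming a clean image where every repeat matches its first occurrence (expected_o_count() is a hypothetical name, and real runs can print more "o"s when noise makes a repeat look new):

```cpp
#include <cstddef>
#include <set>
#include <string>

// Estimates how many "o" marks a clean piece of text would produce:
// one per distinct non-blank character, since only a letter that has
// not been seen before triggers the mark.
inline std::size_t expected_o_count(const std::string& text) {
  std::set<char> seen;
  for (char c : text) {
    if (c == ' ' || c == '\n' || c == '\t') continue;  // blanks are not letters
    seen.insert(c);
  }
  return seen.size();
}
```

For "PERLLREP" the distinct characters are P, E, R and L, so the estimate is four, matching the note.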
To give you an idea of what you'd get, below you can see what happened when I ran tesseract on a file from the 'testing/' directory. I generated it with 'pbmtext' using the included '2helvR18.bdf' font. Other tools used were pgmtopbm and pnmtotiff.
The input text was the tesseract License (See testing/Run_Tests.sh for more details):
This package contains the Tesseract Open Source OCR Engine.
Orignally developed at Hewlett Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado, all the code
in this distribution is now licensed under the Apache License:
** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the License.
** You may obtain a copy of the License at
** http://www.apache.org/licenses/LICENSE-2.0
** Unless required by applicable law or agreed to in writing, software
** distributed under the License is distributed on an "AS IS" BASIS,
** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
** See the License for the specific language governing permissions and
** limitations under the License.
Again, please note that different output is generated using different fonts because the letters in the image will 'interfere' differently and the word-spacing will differ. Also, different fonts have different features so that phase will also differ!
(I wrapped the output with 'fold -w 76')
GNU gdb Red Hat Linux (6.0post-0.20040223.19rh)
[blah blah]
(gdb) r
gkTesseract Open Source OCR Engine
Using LIBTIFF
Opened and reading 'testing/image_2helvR18.tif'...
Recognizing page
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeer
lqmmmmmmmmmmmmmnxlmmmmmmmmmmmmmlllmmmmmmmmmmmmmummmmmmmmmmmmmjtttttttttttttt
tttttttttttttttttttttttttttttttr
pppoooopppoooooopppooopppppppppspppppppppppppppppppppopppooopppsppppppoor
pppspppppppppppppppooopppppppppsppppppppppppspppspppppppppppppppoopppspppppp
ppppppsspppr
pppppppppsppppppppppppspppspppppppppppppppspppoopppoopppsspppppppppppppppppp
pppr
pppppppppppppppopppppppppppppppppppppr
pppppppppspppppppppppppppppppppppppppopppoopppppppppor
pppppppppopppppppppppppppppppppopppsppppppopppppppppppppppppppppr
ppppppsoppppppppppppppppppppppppppppppppppppr
ppppppspppppppppppppppppppppppppppppppppppppppsssspppppppppppppppppppppppppp
ppppppppppppppppppppppppppppppppppppppppppppppppppppppppppr
ppppppopppoppppppppppppppppppppppppppppppsppppppspppr
pppppppppspppppppppppppppppppppppppppppppppppppppppppppppppppooor
ppppppoopppooppppppopppopppppppppoppppppspppspppppppppppppppr
pppppppppppppppppppppspppppppppppppppppppppsspppr
pppppppppppppppppphhhpppssppphpppspppsssspppppppppppppppppphhhpppsppphhpppss
ppppppppphhpppsppphpppspppsspppppphpppspppssspppppppppppppppppphhpppsppppppp
pphpppsssppphpppsppphpppspppsspppppphpppspppssspppppppppppppppppphpppsppphpp
phpppssspppppphpppsppphpppssppphhhppphhppphhppphhpppssppphpppssssppphhppphhp
ppssssppppppppphpppssssppphpppssppphhhpppssppphhppphhpppshpppppphpppssppphpp
phpppssppphpppspppppphhhpppppphpppssppphhppphhpppshhpppsppphhpppppphpppssppp
hhpppsppphppphpppsppppppppppppppppppppppppppppppppppppspppsppppppppppppsssss
sssssspppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppphppphhhh
hpppsppphhhpppppphhpppsppphhppphhpppssssppppppppphhppphhpppppphpppsppphpppsp
pphpppppphpppsppphppphhhhhhhpppppphhpppspppshhhppphpppsssssppphpppssppphhppp
shpppssppphhppphhhpppsssppphppphppphhpppssppphhzzzpppspppspppsspppsspppssppp
sppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
ppppppppppppppphzzzpppspppspppspppsppppppppppppppppppppppppppppppppppppppppp
ppppppphzzzpppspppsssppppppppppppppppppppppppppphzzzpppspppspppspppspppppppp
pppppppppppppppppppppppppppppppppppppppphzzzzzzppppppppphzzzppphpppssppphzzz
pppspppsspppsspppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
ppppppppppppppppppppppppphzppppppppppppppppppppppppppppppppppppppppppppppppp
pppppppppppppppppppppppppppppppppppppppppppphzzzpppspppspppppppppppppppppppp
phzzzpppsppppppppppppppphzzzpppssppphzzzpppsspppppphzzzpppspppsssspppppppppp
pppppppphzzzpppspppssssppppppppphzpppspppsssppppppppppppppppppppppppppphzzpp
phpppssppphzzzzpppsppppppppppppppphzzzpppspppppppppppppppppppppppphzzzpppspp
ppppppppppppppppppppppppppppppphzzzzzzpppspppspppppppppppppppppppppppppppppp
pppppppppppppppppphzzzpppspppppppppppppppppppppppppppppppppppppppppppppppppp
ppppppphzzvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvy
vyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvy
vyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvy
vyvyvyvyvyvyvyvyvyvyvyvyvyvya
Program exited normally.
BTW, I'm running an ancient Fedora 2 release; time to upgrade! :-)
The End
Generated on Thu Nov 30 18:45:59 2006 for Tesseract by doxygen 1.5.1