Hacking Tesseract V0.05
Changes for:
- September 2012: Added local copy of V3.01 documentation and fixed up some links. Note: Nothing new otherwise, per se.
- May 2009: Added links to Ray Smith's & Thomas Breuel's bios, fixed bad links &
added a list of pointers to utilities for hacking tess.
Since these hacks, I've changed careers and code much less BUT these pages are here to
stay (until I'm asked to remove them, e.g., because
better documentation exists).
- July 2007: Linked to v2.0 release, more prominent link to the glossary
1.03f4
- Version 2.03 of tesseract-ocr is out and it supports a slew of European languages
(French, Italian, German, Spanish, and Dutch) in addition to English. Also, finally,
there is support for both testing and training. Get your copy today BUT, unlike in the
past, be sure to get both the source AND a language file, then start hacking! Read the
Release Notes for the many key changes since v1.04!
- Before plowing into these notes, you might want to read what tesseract-ocr is all about.
- If you'd like something with more 'meat', read both Ray's
April 2008 presentation
Overview of the Tesseract OCR (optical character recognition) engine, and its possible
enhancement for use in Wales in a pre-competitive research stage (Prepared by the Language
Technologies Unit of Canolfan Bedwyr, Bangor University)
(local copy)
AND
October 2007 paper
Overview of the Tesseract OCR Engine (local copy)
- Tesseract is a commercial quality OCR engine originally developed at Hewlett-Packard
between 1985 and 1995. In 1995, this engine was among the top three evaluated
by UNLV (link's dead - read background here).
In 2005, Hewlett-Packard and UNLV open-sourced it, and it is now freely
available under the Apache 2.0 license.
Most notably, this means that you CAN use tesseract in a commercial
product WITHOUT releasing the sources. That means that you can only
politely ask to see them... keep this in mind in the forums!
- While Tesseract OCR is now hosted on Google Code, it used to be on Sourceforge. This is
relevant because a LOT of discussion and patches have been posted on the Sourceforge Forums.
Alas, because the Tesseract developers switched from CVS to Subversion (and sf.net
only provides CVS), they had to move to Google Code. Thus, all newer issues and
patches (and certainly anything NEW!) should be posted on Tesseract Issues on
Google. The bummer is that you'll need to set up another account on Google unless
you happen to have one already.
- Tesseract is being used as a plug-in for a state-of-the-art document analysis and OCR system (featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities) called ocropus.
- For those looking for Tesseract on Mac OS, have a look at cff2doc.
- Tesseract is already used to do work...
- While there are functional hooks in the source for both X11 and Win32 graphical functions,
there is presently no such support. However, there are several wrappers that do this; search
Google for the latest.
- The doxyfied v1.03 sources have been released (~6MB). See the announcement for more info on what this contains.
- For anyone interested, there's a Glossary of Tesseract Terms (not new, but it was hidden before).
- The features and their extraction from blobs (and ONLY that part) are covered under Patent
5,237,627. This link will give you all the gory details (but the patent may differ from the
implementation). The down-side is that the document is written in
legalese/patentese, which makes it a tedious read. I have tried to reference columns and
rows from the patent in the comments within the sources. Please do the same if you can!
- The sources have been marked up with http://www.doxygen.org/ compatible comments. I did this with several hacked Perl scripts which turned existing C++ comments into something doxygen likes. I also marked up by hand some of the functions I was trying to understand. Please add your documentation this way too.
- The Glossary needs some work but has a few things in it already.
- A little something about the developers of tesseract-ocr:
I encourage you to keep the following list in mind when doing your own hacking and help me add more relevant details. If possible, please reference a TEXT_VERBOSE letter or provide function(s) doing key work.
READING INPUT
- Lines are read in from scanned image, in edge detection, e
EDGE DETECTION/OUTLINES
- Black pixels are split into blobs, aka edge detection, e
- Blobs are processed to extract outlines, in edge detection, e
LINES/SKEW
- Lines are derived from strings of blobs with outlines, l
- Gradient/rotation of page is calculated, q
- Lines are adjusted for skew, m
- Final touches on assigning blobs, now that lines KNOWN, underlines: u
WORDS/SEGMENTER
- Higher-level procedure to order blobs into words, j
- Blobs in lines are segmented into words, t
- Fine-tuning of vertical seams/splits between some blobs, spacing: s
CLASSIFICATION
- Classification of features in letters of all words performed, o
- Words are checked in dictionary and permuter to improve them, p
- Play with xht (height of letter 'x') for words, h
- Words are fitted to lines and assigned to rows that fit them best, r
QUALITY
- Quality of words and letters is checked, v and y
WRITING OUTPUT
- Words are written out to .txt file, w
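One way to keep the list above handy while reading a trace is to encode it as a lookup table. The sketch below is only a transcription of the stage headings and letters listed above into Python; the names `STAGES` and `stages_for` are made up for illustration and do not exist anywhere in the tesseract sources.

```python
# Stage heading -> TEXT_VERBOSE letters, transcribed from the list above.
# STAGES and stages_for are hypothetical helper names, not tesseract APIs.
STAGES = {
    "READING INPUT": "e",
    "EDGE DETECTION/OUTLINES": "e",
    "LINES/SKEW": "lqmu",
    "WORDS/SEGMENTER": "jts",
    "CLASSIFICATION": "ophr",
    "QUALITY": "vy",
    "WRITING OUTPUT": "w",
}


def stages_for(letter):
    """Return every stage heading whose entries mention the given letter."""
    return [name for name, letters in STAGES.items() if letter in letters]


print(stages_for("e"))  # ['READING INPUT', 'EDGE DETECTION/OUTLINES']
print(stages_for("p"))  # ['CLASSIFICATION']
```

Letters that the list above does not assign to a stage (for example 'z') simply return an empty list.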
(just adding them here for now, will organize it later!)
Tess has comments, sometimes in big blocks, scattered within the code. Please add any others you find!
Be sure to check out the link to "Related Pages" (in left frame).
This section lists the sequence of events that tesseract 1.02 executes to convert the input image 'scan.tif' into the output ASCII file 'scan.txt'. If you notice something wrong, please post corrections on sourceforge.net.
By the way, if you define TEXT_PROGRESS you will get a period ('.') when tesseract finds a seam between words, which gives you a good idea that it DID NOT hang.
If you ALSO define TEXT_VERBOSE, key functions in tesseract will print one character that shows you what is going on, i.e., what tesseract is doing at any point. See the next section for what those letters are and what they mean.
There is also a separate file that has stack traces for some interesting/common functions RUNNING, see How Tesseract Works: Procedure stack traces. Together with TEXT_VERBOSE, these will give you a way to play with tesseract without necessarily being a C++ wizard, per se :-)
If you define TEXT_VERBOSE in addition to TEXT_PROGRESS, instead of a period, you will get other letters which are defined as follows:
- a =
- b =
- c =
- d =
- e = Reading & scanning line of image for edges, building outlines, in line_edges()
- f =
- g = Loading DAWGs ('word-dawg'+'user-dict'), in init_permute()
- h = Playing with xht for one word, in re_estimate_x_ht()
- i =
- j = Arranging blobs into words, make_words()
- k = Initializing speckle params, in InitSpeckleVars()
- l = Assigning blobs to one line, in assign_blobs_to_rows()
- m = Fitting LMS line to a row, in fit_parallel_lms()
- n = Computing linespacing and offset, delete_non_dropout_rows()
- o = Extracting outlines for a class NOT SEEN BEFORE, in ExtractOutlineFeatures()
- p = Using DAWG to improve a word, in dawg_permute_and_select()
- q = Computing gradient of whole page, in compute_page_skew()
- r = Assembling recognized blobs into rows, in make_rows()
- . or s = Found good seam between words to split a blob, in attempt_blob_chop()
- t = Finding optimal segmentation, in check_pitch_sync2()
- u = Processing underlines, in separate_underlines()
- v = Checking quality of words, in word_blob_quality()
- w = Writing output of recognition, in output_pass()
- x = Expanding rows to touch neighbors, in expand_rows()
- y = Checking quality of characters in words, in word_char_quality()
- z = Evaluating word spacing, in eval_word_spacing()
Notes:
- "o" will not print for EVERY letter because tesseract only needs to see it once. Thus, if any letters repeat and are very similar in appearance, i.e., are not messed up in some way by noise, an "o" will only appear for the FIRST occurrence of that letter. E.g., the "PERLLREP" palindrome will only print four "o"s.
- "r" also prints a new-line (\n)
- If you want to add a new character, make sure that it doesn't print too often, preferably once per some logical unit like "per word" or "per outline". If we're out of lower-case letters, start using upper-case but avoid ambiguous letters like upper-case 'i' ("I" vs "l"?), etc.
- Because tesseract calls functions based on 'difficulties' encountered in the image, you may get a different set of letters for different images, but the overall structure should be the same.
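The first note above can be checked with a line of arithmetic: if every repeated letter looks identical, the number of "o"s is just the number of distinct characters in the word. The sketch below models only that idealized case; `expected_o_count` is a hypothetical name, and real images with noise may print more "o"s than this.

```python
def expected_o_count(word):
    """Idealized count of "o"s: one per distinct character.

    Assumes every repeat of a character is rendered cleanly enough
    to look the same as its first occurrence (no noise).
    """
    return len(set(word))


print(expected_o_count("PERLLREP"))  # 4 (P, E, R, L)
```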
To give you an idea of what you'd get, below you can see what happened when I ran tesseract on a file from the 'testing/' directory. I generated it with 'pbmtext' using the included '2helvR18.bdf' font. Other tools used were pgmtopbm and pnmtotiff.
The input text was the tesseract License (See testing/Run_Tests.sh for more details):
This package contains the Tesseract Open Source OCR Engine.
Orignally developed at Hewlett Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado, all the code
in this distribution is now licensed under the Apache License:
** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the License.
** You may obtain a copy of the License at
** http://www.apache.org/licenses/LICENSE-2.0
** Unless required by applicable law or agreed to in writing, software
** distributed under the License is distributed on an "AS IS" BASIS,
** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
** See the License for the specific language governing permissions and
** limitations under the License.
Again, please note that different output is generated using different fonts because the letters in the image will 'interfere' differently and the word-spacing will differ. Also, different fonts have different features so that phase will also differ!
(I wrapped the output with 'fold -w 76')
[blah blah]
(gdb) r
gkTesseract Open Source OCR Engine
Using LIBTIFF
Opened and reading 'testing/image_2helvR18.tif'...
Recognizing page
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeer
lqmmmmmmmmmmmmmnxlmmmmmmmmmmmmmlllmmmmmmmmmmmmmummmmmmmmmmmmmjtttttttttttttt
tttttttttttttttttttttttttttttttr
pppoooopppoooooopppooopppppppppspppppppppppppppppppppopppooopppsppppppoor
pppspppppppppppppppooopppppppppsppppppppppppspppspppppppppppppppoopppspppppp
ppppppsspppr
pppppppppsppppppppppppspppspppppppppppppppspppoopppoopppsspppppppppppppppppp
pppr
pppppppppppppppopppppppppppppppppppppr
pppppppppspppppppppppppppppppppppppppopppoopppppppppor
pppppppppopppppppppppppppppppppopppsppppppopppppppppppppppppppppr
ppppppsoppppppppppppppppppppppppppppppppppppr
ppppppspppppppppppppppppppppppppppppppppppppppsssspppppppppppppppppppppppppp
ppppppppppppppppppppppppppppppppppppppppppppppppppppppppppr
ppppppopppoppppppppppppppppppppppppppppppsppppppspppr
pppppppppspppppppppppppppppppppppppppppppppppppppppppppppppppooor
ppppppoopppooppppppopppopppppppppoppppppspppspppppppppppppppr
pppppppppppppppppppppspppppppppppppppppppppsspppr
pppppppppppppppppphhhpppssppphpppspppsssspppppppppppppppppphhhpppsppphhpppss
ppppppppphhpppsppphpppspppsspppppphpppspppssspppppppppppppppppphhpppsppppppp
pphpppsssppphpppsppphpppspppsspppppphpppspppssspppppppppppppppppphpppsppphpp
phpppssspppppphpppsppphpppssppphhhppphhppphhppphhpppssppphpppssssppphhppphhp
ppssssppppppppphpppssssppphpppssppphhhpppssppphhppphhpppshpppppphpppssppphpp
phpppssppphpppspppppphhhpppppphpppssppphhppphhpppshhpppsppphhpppppphpppssppp
hhpppsppphppphpppsppppppppppppppppppppppppppppppppppppspppsppppppppppppsssss
sssssspppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppphppphhhh
hpppsppphhhpppppphhpppsppphhppphhpppssssppppppppphhppphhpppppphpppsppphpppsp
pphpppppphpppsppphppphhhhhhhpppppphhpppspppshhhppphpppsssssppphpppssppphhppp
shpppssppphhppphhhpppsssppphppphppphhpppssppphhzzzpppspppspppsspppsspppssppp
sppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
ppppppppppppppphzzzpppspppspppspppsppppppppppppppppppppppppppppppppppppppppp
ppppppphzzzpppspppsssppppppppppppppppppppppppppphzzzpppspppspppspppspppppppp
pppppppppppppppppppppppppppppppppppppppphzzzzzzppppppppphzzzppphpppssppphzzz
pppspppsspppsspppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
ppppppppppppppppppppppppphzppppppppppppppppppppppppppppppppppppppppppppppppp
pppppppppppppppppppppppppppppppppppppppppppphzzzpppspppspppppppppppppppppppp
phzzzpppsppppppppppppppphzzzpppssppphzzzpppsspppppphzzzpppspppsssspppppppppp
pppppppphzzzpppspppssssppppppppphzpppspppsssppppppppppppppppppppppppppphzzpp
phpppssppphzzzzpppsppppppppppppppphzzzpppspppppppppppppppppppppppphzzzpppspp
ppppppppppppppppppppppppppppppphzzzzzzpppspppspppppppppppppppppppppppppppppp
pppppppppppppppppphzzzpppspppppppppppppppppppppppppppppppppppppppppppppppppp
ppppppphzzvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvy
vyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvy
vyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvy
vyvyvyvyvyvyvyvyvyvyvyvyvyvya
Program exited normally.
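A long trace like the one above is easier to read as a tally. The following is a minimal sketch, assuming you have pasted only the letter lines of the trace (not the banner lines such as "Recognizing page") into a string. The `MEANING` table is transcribed from the TEXT_VERBOSE list earlier on this page; `summarize` and `report` are hypothetical helper names, and the sample string is a made-up miniature trace, not real tesseract output.

```python
from collections import Counter

# Letter -> function, transcribed from the TEXT_VERBOSE list on this page.
MEANING = {
    "e": "line_edges()",
    "g": "init_permute()",
    "h": "re_estimate_x_ht()",
    "j": "make_words()",
    "k": "InitSpeckleVars()",
    "l": "assign_blobs_to_rows()",
    "m": "fit_parallel_lms()",
    "n": "delete_non_dropout_rows()",
    "o": "ExtractOutlineFeatures()",
    "p": "dawg_permute_and_select()",
    "q": "compute_page_skew()",
    "r": "make_rows()",
    "s": "attempt_blob_chop()",
    "t": "check_pitch_sync2()",
    "u": "separate_underlines()",
    "v": "word_blob_quality()",
    "w": "output_pass()",
    "x": "expand_rows()",
    "y": "word_char_quality()",
    "z": "eval_word_spacing()",
}


def summarize(trace):
    """Count trace characters, skipping whitespace and folding '.' into 's'."""
    counts = Counter()
    for ch in trace:
        if ch.isspace():
            continue
        counts["s" if ch == "." else ch] += 1
    return counts


def report(trace):
    """Format the tally, most frequent letter first."""
    rows = []
    for letter, n in summarize(trace).most_common():
        rows.append(f"{letter} {n:6d}  {MEANING.get(letter, '(undocumented)')}")
    return "\n".join(rows)


sample = "eeeer\nlqmmnxl\npppoooopppsr\n"  # made-up miniature trace
print(report(sample))
```

Whitespace is skipped because "r" emits a newline and 'fold' adds more; letters without an entry in the list (such as the trailing "a" in the run above) are reported as undocumented rather than guessed at.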
- TessBoxer (Reference)
- Empirical DangAmbigs generator (Reference)
- WaveTesseract (seems to be gone)
- Limited Japanese + digits
- Using tesseract in a C# application
- A Java/.NET GUI frontend for Tesseract OCR engine with Vietnamese language (Reference)
- Some Clue on Generating Probablity scores for each character/word
- Tessnet2: .NET 2.0 Open Source OCR assembly using Tesseract engine (Reference)
- Usage of dictionary files ( freq-dawg & word-dawg ) in tesserac
- Creating a Borland dll
- Boxeditor for Tesseract OCR (Reference)
The End
Generated on Thu Nov 30 18:45:59 2006 for Tesseract by doxygen 1.5.1