NOISE TOLERANT OPTICAL CHARACTER
RECOGNITION SYSTEM
CROSS-REFERENCE TO RELATED
APPLICATIONS
This application is related to application Ser. No.
07/875,000 which is a continuation of 07/599,522 of Dan
S. Johnson for Noise Tolerant Optical Character
Recognition System, filed Oct. 17, 1990; application Ser.
No. 07/705,838 of Oscar A. Zuniga for Automatic
Separation of Text from Background in Scanned Images of
Complex Documents, filed May 28, 1991, now pending;
and application Ser. No. 07/898,392 now U.S. Pat. No.
5,179,599, of Lynn J. Formanek for Dynamic
Thresholding System for Documents Using Structural
Information of the Documents, filed Jun. 17, 1991; all
owned by the same entity.
FIELD OF THE INVENTION
This invention relates to pattern recognition systems
and more particularly to computerized pattern recogni-
tion systems. Even more particularly, the invention
relates to computerized optical character recognition
systems.
BACKGROUND OF THE INVENTION
Optical character recognition, or OCR, is the process
of transforming a graphical bit image of a page of
textual information into a text file wherein the text
information is stored in a common computer processable
format, such as ASCII. The text file can then be edited
using standard word processing software.
In the process of transforming each of the characters
on the page from a graphical image into an ASCII format
character, prior art OCR methods first break the
graphical page image into a series of graphical images,
one for each character found on the page. They then
extract high level features of each character and classify
the character based on those features. If the characters
on the page are of a high quality, such as an original
typed page, these simple processing methods will work
well for the process of converting the characters.
However, as document quality degrades, such as through
multiple generations of photocopies, carbon copies,
facsimile transmission, or in other ways, the characters
on a page become distorted, causing simple processing
methods to make errors. For example, a dark photocopy
may join two characters together, causing difficulty in
separating these characters for the OCR processing.
Joined characters can easily cause the process
that segments characters to fail, since any method
which depends on a "gap" between characters cannot
easily distinguish characters that are joined.
Light photocopies produce the opposite effect. Char-
acters can become broken, and appear as two charac-
ters, such as the character "u" being broken in the bot-
tom middle to create two characters, each of which
may look like the "i" character. Also, characters such as
the letter "e" may have a segment broken to cause them
to resemble the character "c".
Early prior art OCR methods did not extract character
features from a character; instead they simply compared
a graphical bit map of the character to a template
bit map of a known character. This method was commonly
called "matrix matching". One problem with
matrix matching is that it is very sensitive to small
changes in character size, skew, shape, etc. Also, this
technology was not "omni font", that is, it had to be
carefully trained on each type font to be read and would
not generalize easily to new type fonts.
To solve the "omni font" problem, prior art methods
began to extract higher level features from a character
image. The goal was to select a set of features which
would be insensitive to unimportant differences, such as
size, skew, presence of serifs, etc., while still being sensi-
tive to the important differences that distinguish be-
tween different types of characters. High level features,
however, can be very sensitive to certain forms of char-
acter distortion. For example, many feature extractors
detect the presence of "closures", such as in the letters
"e", "o", "b", "d", etc., and the feature extractors use
this information to classify the character. Unfortu-
nately, a simple break in a character can easily cause a
closure to disappear, and the feature extractor method
that depends on such closures would probably classify
the character incorrectly.
Often the high level feature representation of a char-
acter contains very few features. Therefore, when a
feature is destroyed, such as a break in a closure, there
is insufficient information left to correctly classify the
character.
There is a need in the art, then, for an optical character
recognition system that classifies characters by creating
a set of features that is insensitive to character
segmentation boundaries. There is a further need in the
art for such a system that creates features having a low
enough level to be insensitive to common noise
distortions. Another need in the art is for such a system
that creates a sufficient number of features that some
will remain to allow character classification even if
others are destroyed by noise. A still further need in
the art is for such a system that provides a set of
features that are insensitive to font variations. The
present invention meets these needs.

A description of other aspects of OCR can be found
in the following applications:
(a) Application Ser. No. 07/599,522 of Dan S. Johnson
for Noise Tolerant Optical Character Recognition
System, filed Oct. 17, 1990;
(b) Application Ser. No. 07/705,338 of Oscar A.
Zuniga for Automatic Separation of Text from
Background in Scanned Images of Complex Documents,
filed May 28, 1991; and
(c) Application Ser. No. 07/898,392, of Lynn J.
Formanek for Dynamic Thresholding System for
Documents Using Structural Information of the Documents,
filed Jun. 17, 1991;
each of which is specifically incorporated herein by
reference for all that is disclosed therein.
SUMMARY OF THE INVENTION
It is an aspect of the present invention to provide a
system for recognizing textual characters from a bit
image of a page of text.
It is another aspect of the invention to define a set of
features for each of the characters on the page of text.
Another aspect is to define such a set of features that
are at a low enough level that they are insensitive to
common noise distortions.
Yet another aspect is to define such a set of features
for each character within a word so that if a few
features are destroyed by noise, the remaining features
will still be sufficient to yield a correct classification.

A further aspect of the invention is to provide a set of
features that are at a high enough level that they are
insensitive to font variations, such as size, shape, skew,
etc.
The above and other objects of the invention are
accomplished in a method of optical character
recognition that first segments a page image into
character images. A set of features is extracted by
traversing the outlines of the dark regions in a
character image, keeping the dark area to the left, to
identify small sections called features. Once extracted,
the features of the unknown character are compared to
features of a prototype character from a template in
order to classify the unknown character and convert it
into a character code. The features of the prototype
character are called proto-features.
The comparison of the features and the proto-features
is performed by selecting the features from a character
image to be analyzed. Next, one of the templates is
selected and each of the features from the character
image is compared to each of the proto-features in the
template to create an average feature match evidence.
Each of the proto-features is then compared to each of
the features to create an average proto match evidence.
These two evidences are then summed and divided by
the total feature and proto-feature lengths to create a
match rating. The features are then compared to all
other templates to create a match rating list which is
sorted into descending order. The top match rating is
selected for output, and any ratings that are very close
to the top rating are also output to allow a dictionary
look-up routine or a lexical analyzer to make the final
selection.
To create the match evidences, the angles and lengths
of the features and proto-features are compared and
then the result is normalized to a specific range of val-
ues.

The above and other objects, features, and advan-
tages of the invention will be better understood by read-
ing the following more particular description of the
invention, presented in conjunction with the following
drawings, wherein:
FIG. 1 shows an example of character distortions that
commonly occur because of noise and illustrates the
problems solved by the present invention;
FIG. 2 shows a set of template proto-features for the
letters "o" and "I";
FIG. 3 shows a diagram of a proto-feature created for
a template character;
FIG. 4 shows an example set of features that could be
extracted from analyzing the letters "o" and "I";
FIG. 5 shows a diagram of a feature extracted from a
character;
FIG. 6 shows a block diagram of the hardware of the
present invention;
FIG. 7 shows a flow diagram of the overall process of
the present invention;
FIG. 8 shows a flowchart of the extract features
process of FIG. 7; and
FIGS. 9-12 show a flowchart of the classify character
process of FIG. 7.
DESCRIPTION OF THE PREFERRED
EMBODIMENT
The following description is of the best presently
contemplated mode of carrying out the present invention.
This description is not to be taken in a limiting
sense but is made merely for the purpose of describing
the general principles of the invention. The scope of the
invention should be determined by referencing the ap-
pended claims.
Optical character recognition, or OCR, is a process
that transforms an image of a page of textual
information into a text file on a computer. The text
file can then be edited using standard word processors
on the computer system. The process first involves
"training" the OCR machine to segment characters within
a page image, extract character features, and build a
set of templates for each class of characters. For
example, a class for the character "a" might include a
template for each font the OCR machine is capable of
recognizing. After the templates have been created, a
page of unknown textual information is scanned, the
characters are segmented, features from each of the
characters are extracted, and then these features are
compared to the templates created earlier in order to
classify the characters.
The training process is usually performed by the
designers of the OCR machine. Its purpose is to "teach"
the machine what the "shapes" of the different charac-
ters look like. When several character templates match
the incoming character fairly well, the character classi-
fier can output several choices for the character. Other
processes, such as dictionaries and lexical rules, can be
used to decide between the choices for a particular
character. Therefore, it is important for the character
classifier within an OCR machine to be able to pass
multiple choices when it is unsure of a character
classification.

On high-quality documents, simple algorithms will
work well for segmentation, feature extraction, and
character classification. However, as document quality
degrades, the characters on the page become distorted
and simple algorithms begin to make errors. Causes of
document quality degradation include multiple genera-
tion photocopies, small point sizes, carbon copies, fax,
dot matrix printers, etc.
FIG. 1 shows an example of character distortions that
commonly occur because of noise and illustrates the
problems solved by the present invention. Referring
now to FIG. 1, the characters enclosed by the dotted
outline 102 are the characters "r" and "i" which have
been "joined" because of noise, as might occur, for
example, by a dark photocopy. The character within the
dotted outline 104 is the character "u" which has been
broken because of noise such as might be caused by a
light photocopy. A light photocopy of the character
"e" could also result in the outline enclosed in the
dotted line 106. This type of noise distortion might
cause prior art methods to classify this character as a
"c".
To solve the character classification problems defined
by the characters of FIG. 1, the present invention
uses a new method of optical character recognition that
first segments a page image into character images.
Character image separation is well known in the art, and
will not be further described here. The present invention
obtains a set of features by extracting the outlines of
the "black" regions in a character image, and then
further dissecting each outline into small sections
called features.
Early OCR methods did not use a feature extractor.
They simply compared the bit map image of the charac-
ter against the template bitmaps of known characters.
This method was commonly called "matrix-matching".
The problem with matrix-matching is that it is too
sensitive to small changes in character size, skew, shape,
etc. The technology was not "omni-font". That is, it
had to be carefully trained on each font to be read and
would not generalize well to new fonts. To solve the
omni-font problem, designers of OCR machines began
to extract higher level features from the character
image. Some example high level features are the
"closures" such as in "e", "a", "b", "d", etc., or "bays"
such as in "n", "u", "m", etc. The goal was to select a
set of features which would be insensitive to unimportant
differences such as size, skew, presence of serifs, etc.,
while still being sensitive to the important differences
that distinguish between characters of different fonts.
The problem with high level features is that they can
be very sensitive to certain forms of character
distortion. For example, a simple break in a character,
such as the "e" 106 of FIG. 1, can easily cause a closure to
disappear. Any method that depended heavily on clo-
sures would probably classify this character as a "c".
The solution of the present invention is to use features
which are at a lower level than closures, bays, etc., but
at a higher level than bit maps. The invention uses very
small features, features which approximate the size of
the smallest possible outline segment which can still
convey meaningful information about a character.
These features are compared to prototype features in
the character templates. These prototype features are
also called proto-features. In the preferred embodiment,
the proto-features are the same size or larger than
features; however, in other embodiments, they could be
smaller. Also, in other embodiments, larger features
could be used.

The invention defines proto-features within a template
to be an approximation of the outline of a character.
In the preferred embodiment, a straight line
approximation is used; however, other approximations,
such as arcs, could be used. FIG. 2 shows a set of
template proto-features for the letters "o" and "I".
Each straight line segment corresponds to one
proto-feature. Referring now to FIG. 2, the letter "o" is
shown having proto-features 202 and 204. The
proto-features are formed by starting at a point on the
outline of the character and traversing the character in
a direction such that the black area of the character is
on the left side of the arrow. Proto-features are formed
using the eight points of the compass, i.e. at 0 degrees,
45 degrees, 90 degrees, etc.; therefore, when the outline
of the character changes to a new eighth compass point, a
new proto-feature will be started. In the case of the
letter "o", there are 8 proto-features on the outside of
the outline and 8 proto-features on the inside of the
outline. In forming the proto-features for the letter "I"
the arrows traverse in the same manner and continue as
long as a straight line approximation is appropriate. In
this manner, proto-feature 206 is created along the top
outline of the character, and when the character outline
changes direction downward, proto-feature 208 is
created. As the outline swings inward, proto-feature 210
is created, and as the outline descends vertically, a
very long proto-feature 212 is created. Proto-features
continue to be created in this manner until the entire
outline of the character is traversed.
FIG. 3 shows an example of how proto-features are
defined. Referring now to FIG. 3, a proto-feature, as
represented by arrow 302, contains a midpoint 303
which is represented by its X_P and Y_P coordinates 304
and 306. The angle θ_P 308 of the proto-feature 302 is
recorded relative to the direction east. That is, east is 0
degrees, with degrees being counted counterclockwise
until a full circle is complete. The degrees of the angle
are converted to a value between 0 and 1 where 0
represents 0 degrees and 1 represents 360 degrees. A
length 310 of the proto-feature 302 is also recorded for
the proto-feature. Thus a proto-feature is defined by the
X-Y coordinates of its center, its angle, and its length.
Additional parameters are derived for each proto-feature
to improve computational speed when comparing
features to proto-features. These parameters are defined
as A, B, C, Xmin, Xmax, Ymin, and Ymax.
A, B, and C are the normalized parameters for the
general form of a line, i.e. Ax+By+C=0. A, B, and C
are computed as follows:
SLOPE = tan(θ_P)
INTERCEPT = Y_P - SLOPE * X_P
NORMALIZER = 1 / SQRT(SLOPE**2 + 1)
A = SLOPE * NORMALIZER
B = 0 - NORMALIZER
C = INTERCEPT * NORMALIZER
where SQRT means take the square root of the equation in parenthesis,
**2 means square the value of SLOPE,
* means multiplication,
/ means division,
+ means addition,
- means subtraction, and
tan means take the tangent.
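The computation of A, B, and C above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names line_parameters and signed_distance are hypothetical, and the angle is assumed to be supplied in degrees rather than in the 0-to-1 encoding. Because the parameters are normalized, A*x + B*y + C gives the signed perpendicular distance from a point to the line. A slope-based form such as this one becomes numerically ill-conditioned for nearly vertical proto-features.

```python
import math


def line_parameters(x_p, y_p, theta_deg):
    """Return (A, B, C) for the normalized line A*x + B*y + C = 0.

    x_p, y_p  -- midpoint of the proto-feature
    theta_deg -- angle of the proto-feature in degrees (an assumption;
                 the text stores angles as a fraction of a full circle)
    """
    slope = math.tan(math.radians(theta_deg))
    intercept = y_p - slope * x_p
    normalizer = 1.0 / math.sqrt(slope ** 2 + 1.0)
    a = slope * normalizer
    b = 0.0 - normalizer
    c = intercept * normalizer
    return a, b, c


def signed_distance(params, x, y):
    """Signed perpendicular distance from the point (x, y) to the line."""
    a, b, c = params
    return a * x + b * y + c
```

For a 45 degree proto-feature centered at the origin, a point lying on the line, such as (2, 2), yields a distance of approximately zero.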
Xmin, Xmax, Ymin, and Ymax describe a padded
bounding box containing the proto-feature. The bounding
box is used to quickly eliminate feature/proto-feature
pairs from further consideration when they are not
a good match. This bounding box is computed as
follows:

OPAD is a constant value of orthogonal pad, which
is 2.5 times the feature length in the preferred
embodiment.
TPAD and OPAD are used to provide a small amount
of additional space around a proto-feature which allows
close features to still be considered.
FIG. 4 shows an example set of features that would
be extracted from the letters "o" and "I", as those char-
acters are being analyzed on an unknown page of text.
As a character is being analyzed, the features extracted
are much smaller than the proto-features that are cre-
ated for a character template. Referring now to FIG. 4,
feature 402 would be created by starting at an arbitrary
point on the outline on the letter "o", and placing the
center of the first feature at this arbitrary point. One
method of picking an arbitrary point would be to select
the portion of the outline that is located at the largest Y
coordinate value. The angle of the feature is defined as
the angle of the outline of the dark area of the character
at the point. The feature extractor then moves along the
outline, keeping the dark area on the left, for a distance
of one feature length and places the center of the second
feature, feature 403, at this new point. Unlike proto-fea-
tures, all features from a character being analyzed are of
a fixed length. This length could be any length so long
as it is consistent, and in the preferred embodiment, the
feature length is one-tenth of the x-height (defined be-
low) of the current line of text. This process would
continue until the entire outline has been traversed to
create all the features around the outline. The process
would be performed for all outlines, including the inner
outline to create features 404, etc.
Similarly, an arbitrary point would be picked on the
"I" character and features would be created in the same
manner by traversing the outline.
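The fixed-length traversal described above can be illustrated with the following Python sketch. It rests on several assumptions that are not part of the disclosure: the outline is given as a closed polygon of (x, y) vertices already ordered so that the dark area lies to the left of the direction of travel, the angle of each feature is taken as the direction of the polygon edge it falls on, and the name extract_features is hypothetical. The first feature is centered on the first vertex, which the caller may choose arbitrarily, for example the vertex with the largest Y coordinate.

```python
import math


def extract_features(outline, feature_length):
    """Walk a closed polygon and emit (x, y, angle) every feature_length.

    angle is the direction of travel as a fraction of a full circle
    (0 = east, counted counterclockwise), following the 0-to-1 convention.
    """
    features = []
    to_next = 0.0  # distance into the current edge of the next center
    n = len(outline)
    for i in range(n):
        x0, y0 = outline[i]
        x1, y1 = outline[(i + 1) % n]  # wrap around to close the outline
        seg_len = math.hypot(x1 - x0, y1 - y0)
        if seg_len == 0.0:
            continue  # skip repeated vertices
        angle = (math.atan2(y1 - y0, x1 - x0) / (2.0 * math.pi)) % 1.0
        pos = to_next
        while pos < seg_len:
            t = pos / seg_len
            features.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0), angle))
            pos += feature_length
        to_next = pos - seg_len  # carry the remainder onto the next edge
    return features
```

With a unit square traversed counterclockwise and a feature length of 0.5, the sketch produces eight features, two per side, with angles 0, 0.25, 0.5, and 0.75.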
FIG. 5 shows a diagram of a feature extracted from a
character being analyzed. When features are extracted
from a character being analyzed, the features are all of
the same, fixed, length. Therefore, no length parameter
is needed for a feature extracted from a character being
analyzed. Referring now to FIG. 5, a feature 502 has a
midpoint 504 which is defined by its X_F coordinate 506
and its Y_F coordinate 508. The angle of the feature 502,
θ_F 510, is specified in the same manner as the angle θ_P
308 with respect to proto-features.

FIG. 6 shows a block diagram of the hardware of the
present invention. Referring now to FIG. 6, a scanning
device 600 contains a processor 602 which communi-
cates to other elements of the scanning device 600 over
a system bus 604. Scanner electronics 606 scan a page of
textual information and produce a graphical bit image
of the contents of the page. Memory 610 contains the
OCR process software 612 of the present invention
which uses an operating system 614 to communicate to
the scanner electronics 606, and to communicate with a
host computer system over a host bus 616 through sys-
tem interface electronics 608. The OCR process soft-
ware 612 reads the pixels of the graphical bit image
from the scanner electronics 606, processes that image
according to the method of the present invention, and
sends the result to a host system over the host bus 616.
The OCR process software could also run in the host
system.
FIG. 7 shows a flow diagram of the overall process of
the present invention. Referring now to FIG. 7, a page
image 702 is received from the scanner electronics 606
(FIG. 6). This page image is processed by an extract
character process 704 which identifies each individual
character on a page and places that character into a
character image data stream 706. The extraction of
characters is well known in the art. The character
images 706 are sent to an extract features process 708 of
the present invention. The extract features process 708
will be described in detail with respect to FIG. 8. The
extract features process 708 creates a list of character
features 710 which is sent to a classify character process
712. The classify character process 712 will be
described below with respect to FIGS. 9 through 12. The
output of the classify character process 712 is a coded
characters data stream 714 which contains one or more
choices for each character being analyzed. This output
is sent to a word processor 716 where it is edited and
displayed by the user of the system. The output may be
sent through a host system bus to a word processor
within a host system.

FIG. 8 shows a flow chart of the extract features
process of FIG. 7. FIG. 8 is called after character im-
ages 706 (FIG. 7) have been created by the extract
character process 704. Referring now to FIG. 8, after
entry, block 802 determines whether all character im-
ages have been processed. If more character images
remain to be processed, block 802 transfers to block 804
which gets the next character image from the character
images data stream 706 (FIG. 7). Block 806 then
normalizes the character image.
When proto-features and features are created, as
defined above with respect to FIGS. 2 through 5
respectively, the location of the feature is defined by an X and
Y coordinate. The ranges for X and Y depend upon the
coordinate system used in the matching process. Many
different coordinate systems are possible. It is desirable
to choose a coordinate system in which characters have
been normalized to a constant size. This allows the
character classification process to be insensitive to size
variation in the characters. There are two basic tech-
niques which can be used to normalize characters, ei-
ther of which will work with the system of the present
invention. The only requirement is that the same form
of normalization be used for both features and proto-
features.
One such normalization technique is line normaliza-
tion, where all characters within a line of text are scaled
by a single factor. Scaling is uniform in both the X and
Y directions, and the scale factor is chosen to force the
x-height of the line, that is, the height of a lower case x
character in the font, to be a constant. In the preferred
embodiment, this constant is 0.5. The base line of the
text is also translated so that the position of the baseline
for all characters on a line is a constant. In the preferred
embodiment, this constant is zero.
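Line normalization as just described can be sketched as follows. The function name line_normalize and its parameters are hypothetical, the Y axis is assumed to increase upward, and the baseline position and x-height of the line are assumed to have been measured beforehand by some other means.

```python
def line_normalize(points, baseline_y, x_height, target_x_height=0.5):
    """Scale outline points uniformly so the x-height becomes a constant.

    points     -- (x, y) coordinates from one line of text
    baseline_y -- measured Y position of the baseline for that line
    x_height   -- measured height of a lower case x in that line
    The baseline is translated to Y = 0 and the same scale factor is
    applied in both the X and Y directions.
    """
    scale = target_x_height / x_height
    return [(x * scale, (y - baseline_y) * scale) for x, y in points]
```

For example, a point 20 units above the baseline on a line with an x-height of 40 units maps to a Y value of 0.25.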

A second technique is character normalization,
where each character is individually scaled to a fixed
size. This scaling can be anisotropic, that is, using differ-
ent scale factors in the X and Y directions. Some char-
acter normalization techniques scale the bounding box
of a character to a fixed size. Other techniques compute
the radius of gyration of the character shape about the
center of mass in both the X and Y directions and then
scale the character according to these numbers. In the
preferred embodiment, the character is scaled to a range
of 0 to 1.
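One of the character normalization variants mentioned above, scaling the bounding box to a fixed size, might look like the following sketch. The name character_normalize is hypothetical, and the guard for shapes of zero width or height is an added assumption rather than something the text specifies.

```python
def character_normalize(points):
    """Scale points anisotropically so the bounding box spans 0 to 1."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, y_min = min(xs), min(ys)
    # Fall back to 1.0 for degenerate shapes to avoid dividing by zero
    # (an assumption added for this sketch).
    width = (max(xs) - x_min) or 1.0
    height = (max(ys) - y_min) or 1.0
    return [((x - x_min) / width, (y - y_min) / height) for x, y in points]
```

Because the X and Y scale factors are computed independently, a tall narrow character and a short wide character both fill the same unit square.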
After normalizing the character image, control trans-
fers to block 808 which determines whether all outlines
of the character image have been processed. Some char-
acters inherently have multiple outlines, such as the
character "i", which has an outline for the base part and
an outline for the "dot". In other situations, a character
may have multiple outlines due to distortion as
described above with respect to FIG. 1. If an outline
remains to be processed, block 808 transfers to block 810
which gets the next outline from the image. Block 812
then determines an arbitrary point on the outline to start
feature extraction. As described above with respect to
FIG. 4, the features extracted from a character being
analyzed may be extracted starting at any arbitrary
point on the outline. After determining an arbitrary
point, block 812 transfers to block 814 which places the
center of the feature at the point just selected. Block 816
determines the angle of the feature by determining the
tangent to the dark portion of the outline at the point
just selected. Block 818 then writes the feature statistics
just collected, that is, the X,Y location of the center of
the feature, and the angle of the feature, to the character
features data stream 710 (FIG. 7). Block 820 then
traverses along the character outline for one feature
length. That is, the outline is followed keeping the dark
portion of the outline to the left of the direction being
followed, until the fixed length of a feature has been
traversed. After traversing one feature length, block
820 transfers to block 822 which determines whether
the end of the outline has been reached and if not, block
822 transfers back to block 814 to create a new feature
at the new location point on the outline. After the entire
outline has been traversed, block 822 transfers back to
block 808 to determine whether additional outlines exist
within the character. After all outlines within the char-
acter image have been processed, block 808 transfers
back to block 802 to determine if additional characters
remain in the character images data stream 706 (FIG.
7). After all character images have been processed,
FIG. 8 returns to its caller.
FIGS. 9 through 12 show flow charts of the classify
character process 712 (FIG. 7). The method involves
matching each feature extracted from a character being
analyzed to each proto-feature within a template, to
determine if the two are "similar". Similarity is deter-
mined by the difference in the angles between the fea-
ture and the proto-feature, as well as the distance from
the feature to the proto-feature. The similarity number
is then normalized to the range of zero to one to create
a match evidence. A match evidence of zero means that
there is no evidence that the feature matches the
proto-feature, and a match evidence of one means that the
feature is a perfect match to the proto-feature. After the
evidence is determined, a match rating is computed by
comparing all the features to the proto-features within
the templates, and then by also comparing all the proto-
features of a template to the features within the charac-
ter being analyzed. Both cases must be analyzed in
order to make sure that the character being analyzed is
neither a subset nor a superset of the proto-features
within the template. After these two comparisons are
made, a match rating of the features of the character
being analyzed to the proto-features within this tem-
plate is computed. The character image is then com-
pared to the next template, until it has been compared to
all templates possible. After all these comparisons are
made, the match ratings with the highest values are
sent as the coded characters data stream 714 (FIG. 7).
The details of this method are described below.
FIG. 9 is a flow chart of the top level processing
module of the classify character process. Referring now
to FIG. 9, after entry, block 902 determines whether all
characters have been processed and if not, transfers
control to block 904 which gets the character features
for the next character. Block 906 then sets a RATELIST
variable to "empty" and block 908 determines
whether all templates have been processed against this
character. If all templates have not been processed
against this character, block 908 transfers to block 910
which gets the proto-features from the next template.
Block 912 then calls FIG. 10 to match the features from
the character being analyzed to the proto-features of the
template. Block 914 then calls FIG. 12 to match the
proto-features from the template to the features from
the character being analyzed, and block 916 then
determines the match rating for this character and template
combination. The match rating is determined by the
following formula:

RATING = (EFAVG * LFTOTAL + EPAVG * LPTOTAL) / (LFTOTAL + LPTOTAL)

The above formula computes the match rating as a
weighted average of the average feature evidence
(EFAVG) and the average proto evidence (EPAVG).
This will be a number between zero and one where one
is a perfect match and zero is no match. The feature
evidence is weighted by the total length of all features
(LFTOTAL) and the proto-features evidence is
weighted by the total length of all proto-features
(LPTOTAL).
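The length-weighted average described above can be expressed directly in Python. The function name match_rating is hypothetical; the argument names mirror EFAVG, LFTOTAL, EPAVG, and LPTOTAL from the text, and the sketch assumes at least one of the two total lengths is nonzero.

```python
def match_rating(ef_avg, lf_total, ep_avg, lp_total):
    """Length-weighted average of feature and proto-feature evidence.

    ef_avg   -- average feature evidence (EFAVG), between 0 and 1
    lf_total -- total length of all features (LFTOTAL)
    ep_avg   -- average proto-feature evidence (EPAVG), between 0 and 1
    lp_total -- total length of all proto-features (LPTOTAL)
    """
    return (ef_avg * lf_total + ep_avg * lp_total) / (lf_total + lp_total)
```

When both averages are between zero and one, the result is as well. For instance, a feature evidence of 1.0 and a proto-feature evidence of 0.5 over equal total lengths give a rating of 0.75.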
After computing the match rating for this
character/template combination, block 918 puts this match
rating in the RATELIST and transfers back to block 908 to
process the next template. After the character image has
been processed against all templates, block 908 transfers
to block 920 which sorts the RATELIST in order of
descending match rating. Block 922 then extracts the
highest matches and all matches that are close to the
highest match. In this manner, the method can output
the best possible choices for the character. In the pre-
ferred embodiment, the character represented by the
highest match rating is output, and all characters that
are within 0.15 of the highest match rating are
considered "close" and are also output.
After selecting the character with the highest match
rating, and any characters that are close to the highest
match rating, block 924 sends the coded characters for
the extracted matches, such as coded characters from
the ASCII character set, to the output data stream 714
(FIG. 7) before returning to block 902 to process the
next character. After all characters have been pro-
cessed, block 902 returns to its caller to allow a higher
level of character level classification to proceed.
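The sorting and selection steps of blocks 920 and 922 can be sketched as follows. The name select_candidates and the representation of the RATELIST as a list of (character, rating) pairs are assumptions made for illustration. Whether a rating exactly 0.15 below the top rating counts as "close" is not stated in the text; this sketch treats it as close.

```python
def select_candidates(rate_list, closeness=0.15):
    """Return the top-rated character and all characters rated close to it.

    rate_list -- list of (character, match_rating) pairs, one per template
    closeness -- maximum gap below the top rating that still counts as
                 close (0.15 in the preferred embodiment)
    """
    ranked = sorted(rate_list, key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    top_rating = ranked[0][1]
    return [(ch, r) for ch, r in ranked if top_rating - r <= closeness]
```

Given ratings of 0.90 for "e", 0.80 for "c", and 0.60 for "o", the sketch returns "e" and "c", leaving the final choice between them to a dictionary look-up or lexical analyzer.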

FIG. 10 shows a flow chart of the determine features
to proto average called from block 912 of FIG. 9. Re-
ferring now to FIG. 10, after entry, block 1002 sets a
variable TOTAL equal to zero and sets another vari-
able NUMFEAT, which represents the number of fea-
tures, equal to zero. Block 1004 then determines
whether all features of the character being analyzed
have been processed, and if not, block 1004 transfers to
block 1006 which increments the value of the variable
NUMFEAT. Block 1008 then gets the next feature and
block 1010 sets a value of a variable BEST equal to
zero. Block 1012 then determines whether all proto-fea-
tures of the template have been processed and if not,
transfers to block 1020 which gets the next proto-fea-
ture. Block 1022 then calls FIG. 11 to determine the
match evidence for the feature obtained in block 1008
compared to the proto-feature obtained in block 1020.
After determining the match evidence, block 1024 then
determines whether the evidence returned from FIG.
11 is greater than the best evidence determined so far. If
the evidence is greater than BEST, block 1024 transfers
to block 1026 which sets BEST equal to the new evi-
dence. If the evidence is less than or equal to BEST, or
after setting BEST equal to evidence, control transfers
back to block 1012 to check the next proto-feature
within the template. After all proto-features within the
template have been processed, block 1012 transfers to
block 1014 which adds the value of the variable BEST
to the value of the variable TOTAL. Block 1014 then
transfers back to block 1004 to determine whether all
features within the character being analyzed have been
processed. After all features in the character have been
processed, block 1004 transfers to block 1016 which
sets a variable EFAVG to the value of the variable
TOTAL divided by the value of the variable
NUMFEAT. Block 1018 then sets the value of a
variable LFTOTAL equal to the value of the variable
NUMFEAT multiplied by the value of a variable
FEATLEN, which is the length of each feature. As
described above, the length of a feature extracted from a
character being analyzed is always a fixed value, so the
value of LFTOTAL is simply the number
of features multiplied by this fixed length. After computing
the values for EFAVG and LFTOTAL, control
returns to FIG. 9.
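The loop of FIG. 10 can be sketched roughly as follows. The function name, the opaque feature and proto-feature objects, and the evidence_fn callable (standing in for the FIG. 11 process) are hypothetical; returning zero for an empty feature set is also an assumption, since the text does not address that case.

```python
def features_to_proto_average(features, proto_features, evidence_fn, featlen):
    """Return (EFAVG, LFTOTAL) for one character/template combination."""
    total = 0.0       # block 1002
    numfeat = 0
    for feature in features:                       # blocks 1004-1008
        numfeat += 1
        best = 0.0                                 # block 1010
        for proto in proto_features:               # blocks 1012, 1020
            evidence = evidence_fn(feature, proto) # block 1022 (FIG. 11)
            if evidence > best:                    # blocks 1024, 1026
                best = evidence
        total += best                              # block 1014
    efavg = total / numfeat if numfeat else 0.0    # block 1016
    lftotal = numfeat * featlen                    # block 1018
    return efavg, lftotal
```

Each feature contributes only its single best proto-feature match, so EFAVG measures how well the character is explained by the template.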
FIG. 11 shows a flow chart of the determine match
evidence process called from block 1022 of FIG. 10.
Referring now to FIG. 11, after entry, block 1102 deter-
mines whether the feature is within the bounding box of
the template. As described above with respect to FIGS.
2 and 3, a proto-feature within a template has a bound-
ing box defined for it. If a feature from a character being
analyzed is located outside this bounding box, as
defined by comparing the X and Y coordinates of the
feature midpoint to the Xmin, Xmax, Ymin, and Ymax
parameters defined above, the similarity of the feature
to the proto-feature will be so large as to be beyond
consideration. Therefore, if the feature is not within the
bounding box, block 1102 transfers to block 1112 which
sets the match evidence value to zero before returning
to FIG. 10.
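A minimal sketch of this bounding box gate is shown below. The tuple layout of the box, the function name, the evidence_fn callable standing in for blocks 1104 through 1110, and the inclusive comparison at the box edges are assumptions made for illustration.

```python
def gated_evidence(mid_x, mid_y, box, evidence_fn):
    """Return zero evidence when the feature midpoint lies outside the box.

    box is assumed to be an (xmin, xmax, ymin, ymax) tuple for one
    proto-feature.
    """
    xmin, xmax, ymin, ymax = box
    inside = xmin <= mid_x <= xmax and ymin <= mid_y <= ymax
    if not inside:
        return 0.0  # block 1112: match evidence is set to zero
    return evidence_fn(mid_x, mid_y)  # continue with block 1104
```

The gate is inexpensive compared with the full evidence computation, so distant features are rejected before any angle or distance arithmetic is done.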

If the feature is within the bounding box, block 1102
transfers to block 1104 which sets a variable
ANGLEDIFF to the square of the difference between the angle
of the feature and the angle of the proto-feature. The angle
difference is computed in a circular fashion, that is, the
difference between an angle of zero and an angle of 1 is zero
and ANGLEDIFF is never greater than 0.5 squared,
i.e. 0.25. After computing the angle difference, block
1104 transfers to block 1106 which computes the value
of a distance variable by squaring the sum of three terms:
the value of the parameter A for the proto-feature multiplied by the X
location of the feature, the value of B for the proto-feature
multiplied by the Y value of the feature, and the
value of the parameter C for the proto-feature. The
parameters A, B, and C were defined above with respect
to FIGS. 2 and 3. This distance is the distance between
the location of the center of the feature and the line of
the proto-feature. Block 1108 then computes a similarity
variable as the angle difference times a constant K, plus
the distance computed in block 1106. The constant K is
used to adjust the relative contribution of the angle
difference and the distance difference to the similarity
measure. In the present invention, the constant K is set
to a value of one. After computing the similarity, block
1108 transfers to block 1110 which computes the match
evidence by dividing the similarity by a constant SM,
squaring this result, adding one to the square, and dividing
one by this sum. In this manner, the match
evidence will vary from zero to one, where zero means no
match and one is a perfect match. The constant SM
defines what value of similarity will map to an
evidence value of 0.5, that is, the midpoint. In the system of
the present invention, the constant SM is set to a value
of 0.0075. After computing the match evidence, FIG. 11
returns to FIG. 10.
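The arithmetic of blocks 1104 through 1110 might be sketched as follows for a feature that has already passed the bounding box test. The sketch assumes that angles are expressed as fractions of a full turn (so that 0 and 1 coincide), and that the distance is the square of the whole sum A·X + B·Y + C; the function and argument names are hypothetical.

```python
def circular_angle_diff_sq(angle_a, angle_b):
    """Squared circular difference between two angles in [0, 1)."""
    d = abs(angle_a - angle_b) % 1.0
    d = min(d, 1.0 - d)        # never more than half a turn
    return d * d               # so never greater than 0.25


def match_evidence(fx, fy, fangle, a, b, c, pangle, k=1.0, sm=0.0075):
    """Evidence in (0, 1] that a feature matches a proto-feature."""
    anglediff = circular_angle_diff_sq(fangle, pangle)   # block 1104
    distance = (a * fx + b * fy + c) ** 2                # block 1106
    # Block 1108: despite its name, a larger "similarity" means a worse match.
    similarity = k * anglediff + distance
    return 1.0 / (1.0 + (similarity / sm) ** 2)          # block 1110
```

A similarity of zero gives an evidence of one, and a similarity equal to SM gives an evidence of exactly 0.5, which is the midpoint behavior described above.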

FIG. 12 shows a flow chart of the determine proto to
features average process called from block 914 of FIG.
9. This process will match each proto-feature of the
template to each feature from the character being analyzed.
Referring now to FIG. 12, after entry, block 1202
sets three variables equal to zero: the variables TOTAL,
LPTOTAL, and NUMMATCH. Block 1204 then
determines whether all proto-features have been processed
and if not, transfers control to block 1206 which
gets the next proto-feature from the template. Block
1208 then adds the length of this proto-feature to the
value of the variable LPTOTAL, and block 1210 sets
the value of a variable called MATCHLIST to
"empty". Block 1212 then determines whether all features
in the character have been processed and if not,
transfers control to block 1213 which gets the next
feature. Block 1214 calls FIG. 11 to determine the
match evidence between this feature and the proto-feature
retrieved in block 1206. After returning from FIG.
11, block 1216 appends the match evidence from the
comparison to MATCHLIST and then returns to block
1212 to process the next feature. After all features
within the character have been processed, block 1212
transfers to block 1220 which sorts MATCHLIST in
the order of decreasing evidence, so that the features that
most closely match this proto-feature sort to the
top of the list. Block 1222 then sets a variable
NUM_TO_KEEP to the value of the length of this
proto-feature divided by the feature length of the
features from the character being analyzed. As discussed
above, this feature length is a fixed number for all the
features from the character being analyzed. Thus, the
variable NUM_TO_KEEP identifies how many features
would fit along the length of the proto-feature.
Block 1224 then extracts this number of elements from
the front of the match list and block 1226 adds the evi-
dence values of all these elements together. Block 1228
then adds the sum just created to the value of the vari-
able TOTAL. Block 1230 adds the value of the variable
NUM_TO_KEEP to the value of the variable NUMMATCH
before returning to block 1204 to determine if
additional proto-features need to be processed for this
template. After all proto-features have been processed
for this template, block 1204 transfers to block 1218
which creates a value for the variable EPAVG by di-
viding the value of the variable TOTAL by the value of
the variable NUMMATCH before returning EPAVG
and LPTOTAL to FIG. 9.
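The process of FIG. 12 can be sketched roughly as below. The representation of a proto-feature as a dictionary with a "length" key, the function name, and the evidence_fn callable (standing in for FIG. 11) are hypothetical. Rounding NUM_TO_KEEP down to a whole number of features and returning zero when nothing is kept are assumptions.

```python
def proto_to_features_average(proto_features, features, evidence_fn, featlen):
    """Return (EPAVG, LPTOTAL) for one character/template combination."""
    total = 0.0                                     # block 1202
    lptotal = 0.0
    nummatch = 0
    for proto in proto_features:                    # blocks 1204, 1206
        lptotal += proto["length"]                  # block 1208
        # Blocks 1210-1220: collect the evidence for every feature,
        # then sort so that the closest matches come first.
        matchlist = sorted(
            (evidence_fn(f, proto) for f in features), reverse=True
        )
        # Block 1222: how many fixed-length features fit along this proto.
        num_to_keep = int(proto["length"] // featlen)
        total += sum(matchlist[:num_to_keep])       # blocks 1224-1228
        nummatch += num_to_keep                     # block 1230
    epavg = total / nummatch if nummatch else 0.0   # block 1218
    return epavg, lptotal
```

Because NUMMATCH grows by NUM_TO_KEEP even when fewer features are available, a proto-feature that the character only partly covers lowers EPAVG, which is how missing strokes count against a template.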
After all characters have been analyzed, and a coded
character selected for each character image, the coded
characters are sent to a host system over the host system
bus 616 (FIG. 6) via the system interface 608 (FIG. 6).
The host system then stores the coded characters in a
file where they can be processed using standard word
processing systems.

It is also possible to use the feature extraction and match
process described above to detect the presence of
high-level features which are then used in a further matching
process to classify a character. Using this process,
templates contain high-level features, such as bays,
closures, etc., rather than an entire character shape. Once
the features have been extracted from an unknown
character they are matched to the template, as
described above, to identify the high-level features that
are contained in the unknown character. Decision trees
or other standard matching methods are then used to
classify the unknown character based on the high-level
features found.  
Having thus described a presently preferred embodi-
ment of the present invention, it will now be appreci-
ated that the objects of the invention have been fully
achieved, and it will be understood by those skilled in
the art that many changes in construction and circuitry
and widely differing embodiments and applications of
the invention will suggest themselves without departing
from the spirit and scope of the present invention. The
disclosures and the description herein are intended to be
illustrative and are not in any sense limiting of the in-
vention, more preferably defined in scope by the fol-
lowing claims.
What is claimed is:
1. A method for optical character recognition com-
prising the steps of:
(a) converting a page having a plurality of text
printed thereon into a graphical image containing a
plurality of pixel elements representative of said
text;
(b) separating said graphical image into a plurality of
character images;
(c) scanning said character images to produce a set,
containing a plurality of features, for each charac-
ter image of said plurality of character images;
(d) converting each of said sets of features into at least
one coded character equivalent to each of said
character images comprising the steps of
(d1) selecting one of said sets,
(d2) selecting one of a plurality of templates, one of
said templates having been previously defined
for each character to be converted, wherein each
of said templates contains a plurality of proto-
features,
(d3) comparing each of said features to each of said
proto-features to create a rating comprising the
steps of
(d3a) comparing each of said features from said
selected set to each of said proto-features from
said selected template to create an average
feature match evidence,
(d3b) comparing each of said proto-features from
said selected template to each of said features
from said selected set to create an average
proto match evidence,
(d3c) computing a rating by averaging said fea-
ture match evidence and said proto match
evidence;
(d4) adding said rating to a rating list,
(d5) repeating steps (d2) through (d4) for each of
said templates,
(d6) selecting at least one highest rating within said
rating list as said coded character equivalent to
said character image, and
(d7) repeating steps (d2) through (d6) for each of
said character images; and
(e) sending said coded characters to a word processor
for editing and display.

2. The process of claim 1 wherein step (d3a) further
comprises the steps of:
(d3a1) selecting one of said features;
(d3a2) selecting one of said proto-features;
(d3a3) computing an angle difference value between
an angle of said selected feature and an angle of
said selected proto-feature;
(d3a4) computing a distance difference value between
a center of said feature and said proto-feature;
(d3a5) computing a similarity as the sum of said angle
difference value and said distance difference value;
(d3a6) computing an evidence value by normalizing
said similarity to a predetermined range of values;
(d3a7) repeating steps (d3a2) through (d3a6) for each
of said proto-features and selecting a highest of said
evidence values;
(d3a8) adding said selected highest evidence value to
a total value;
(d3a9) repeating steps (d3a1) through (d3a8) for each
of said features; and
(d3a10) dividing said total value by a number of features
to create said average feature match evidence.
3. The process of claim 2 wherein step (d3a2) further
comprises the steps of:
(d3a2a) comparing a location of said selected feature
to locations of each of said proto-features; and
(d3a2b) setting said evidence value to zero and continuing
with step (d3a7) if said feature is located
outside a predefined area surrounding said
proto-feature.

4. The process of claim 1 wherein step (d3b) further
comprises the steps of:
(d3b1) selecting one of said proto-features;
(d3b2) selecting one of said features;
(d3b3) computing an angle difference value between
an angle of said selected feature and an angle of
said selected proto-feature;
(d3b4) computing a distance difference value be-
tween a center of said feature and said proto-fea-
ture;
(d3b5) computing a similarity as the sum of said angle
difference value and said distance difference value;
(d3b6) computing an evidence value by normalizing
said similarity to a predetermined range of values;
(d3b7) repeating steps (d3b2) through (d3b6) for each
of said features and appending each of said evidence
values to a match list;
(d3b8) sorting said match list;
(d3b9) determining a largest number of integral features
that will fit within a length of said proto-feature;
(d3b10) selecting a number of evidence values from a
first of said sorted match list equal to said largest
number of integral features and adding said number
to a number matched value;
(d3b11) adding a sum of said selected number of evidence
values to a total value;
(d3b12) repeating steps (d3b1) through (d3b11) for
each of said proto-features; and
(d3b13) dividing said total value by said number
matched value to create said average proto match
evidence.

5. The process of claim 4 wherein step (d3b2) further
comprises the steps of:
(d3b2a) comparing a location of said selected feature
to locations of each of said proto-features; and
(d3b2b) setting said evidence value to zero and con-
tinuing with step (d3a7) if said feature is located
outside a predefined area surrounding said proto-
feature.
6. The process of claim 1 wherein step (c) further
comprises the steps of:
(c1) separating said character images into a plurality
of outlines each defined by a boundary between
pixels of different intensity within said character
images;
(c2) locating an arbitrary point on one of said outlines;
(c3) defining one of said plurality of features at said
point and adding said feature to said set for said
character image;
(c4) traversing said outline to a new point at a prede-
termined distance from said point;
(c5) repeating steps (c3) and (c4) until said outline is
completely traversed; and
(c6) repeating steps (c2) through (c5) for each of said
outlines within said character image to complete
said set; and
(c7) repeating steps (c1) through (c6) for each of said
character images.

7. A method for optical character recognition com-
prising the steps of:
(a) converting a page having a plurality of text printed
thereon into a graphical image containing a plural-
ity of pixel elements representative of said text;
(b) separating said graphical image into a plurality of
character images;
(c) scanning said character images to produce a set,
containing a plurality of features, for each charac-
ter image of said plurality of character images com-
prising the steps of
(c1) separating said character images into a plurality
of outlines each defined by a boundary between
pixels of different intensity within said
character images,
(c2) locating an arbitrary point on one of said out-
lines,
(c3) defining one of said plurality of features at said
point and adding said feature to said set for said
character image,
(c4) traversing said outline to a new point at a
predetermined distance from said point,
(c5) repeating steps (c3) and (c4) until said outline
is completely traversed, and
(c6) repeating steps (c2) through (c5) for each of
said outlines within said character image to com-
plete said set;
(c7) repeating steps (c2) through (c6) for each of
said character images;
(d) converting each of said sets of features into at least
one coded character equivalent to each of said
character images; and
(e) sending said coded characters to a word processor
for editing and display.

selected template to create an average feature
match evidence,
(d3b) comparing each of said proto-features from said
selected template to each of said features from said
selected set to create an average proto match evi-
dence,
(d3c) computing a rating by averaging said feature
match evidence and said proto match evidence.
10. The process of claim 9 wherein step (d3a) further
comprises the steps of:
(d3a1) selecting one of said features;
(d3a2) selecting one of said proto-features;
(d3a3) computing an angle difference value between
an angle of said selected feature and an angle of
said selected proto-feature;
(d3a4) computing a distance difference value between
a center of said feature and said proto-feature;
(d3a5) computing a similarity as the sum of said angle
difference value and said distance difference value;
(d3a6) computing an evidence value by normalizing
said similarity to a predetermined range of values;
(d3a7) repeating steps (d3a2) through (d3a6) for each
of said proto-features and selecting a highest of said
evidence values;
(d3a8) adding said selected highest evidence value to
a total value;
(d3a9) repeating steps (d3a1) through (d3a8) for each
of said features; and
(d3a10) dividing said total value by a number of features
to create said average feature match evidence.

11. The process of claim 10 wherein step (d3a2) fur-
ther comprises the steps of:
(d3a2a) comparing a location of said selected feature
to locations of each of said proto-features; and
(d3a2b) setting said evidence value to zero and con-
tinuing with step (d3a7) if said feature is located
outside a predefined area surrounding said proto-
feature.

12. The process of claim 9 wherein step (d3b) further
comprises the steps of:
(d3b1) selecting one of said proto-features;
(d3b2) selecting one of said features;
(d3b3) computing an angle difference value between
an angle of said selected feature and an angle of
said selected proto-feature;
(d3b4) computing a distance difference value between
a center of said feature and said proto-feature;
(d3b5) computing a similarity as the sum of said angle
difference value and said distance difference value;
(d3b6) computing an evidence value by normalizing
said similarity to a predetermined range of values;
(d3b7) repeating steps (d3b2) through (d3b6) for each
of said features and appending each of said evidence
values to a match list;
(d3b8) sorting said match list;
(d3b9) determining a largest number of integral features
that will fit within a length of said proto-feature;
(d3b10) selecting a number of evidence values from a
first of said sorted match list equal to said largest
number of integral features and adding said number
to a number matched value;
(d3b11) adding a sum of said selected number of evidence
values to a total value;
(d3b12) repeating steps (d3b1) through (d3b11) for
each of said proto-features; and
(d3b13) dividing said total value by said number
matched value to create said average proto match
evidence.
13. The process of claim 12 wherein step (d3b2) further
comprises the steps of:
(d3b2a) comparing a location of said selected feature
to locations of each of said proto-features; and
(d3b2b) setting said evidence value to zero and con-
tinuing with step (d3a7) if said feature is located
outside a predefined area surrounding said proto-
feature.