Improve x_ht and look out for case inconsistencies

Improve xht for one word

Walk the blobs in the word together with the text string and reject map.

NOTE: All evaluation is done on the baseline normalised word. This is so that the BOX class can be used (integer). The reasons for this are:

CONVINCED?

A) Try to re-estimate x-ht and caps ht from confirmed pts in word.

  FOR each non reject blob
    IF char is baseline posn ambiguous
      Remove ambiguity by comparing its posn with respect to baseline.
    IF char is a confirmed x-ht char
      Add x-ht posn to confirmed_x_ht pts for word
    IF char is a confirmed caps-ht char
      Add blob_ht to caps ht pts for word

  IF Std Dev of caps hts < 2 (AND # samples > 0)
    Use mean as caps ht estimate (Don't use median as we can expect a
    fair variation between the heights of the NON_AMBIG_CAPS_HT_CHS)
  IF Std Dev of caps hts >= 2 (AND # samples > 0)
    Suspect small caps font.
    Look for 2 clusters, each with Std Dev < 2.
  IF 2 clusters found
    Pick the smaller median as the caps ht estimate of the smallcaps.

  IF failed to estimate a caps ht
    Use the median caps ht if there is one,
  ELSE use the caps ht estimate of the previous word. NO!!!

  IF there are confirmed x-height chars
    Estimate confirmed x-height as the median value
  ELSE IF there is a confirmed caps ht
    Estimate confirmed x-height as a fraction of confirmed caps ht value
  ELSE
    Use the value for the previous word or the row value if this is the
    first word in the block. NO!!!
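
The estimation logic of step A can be sketched in C++ as follows. This is a minimal illustration, not the Tesseract implementation: the function names, the exhaustive split used to look for two clusters, and the 0.7 x-height fraction are assumptions made for the example. Only the standard deviation limit of 2 comes from the description above. Where the description falls back to the previous word (marked "NO!!!"), the sketch simply reports that no estimate was made.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Limit from the description ("Std Dev < 2").
constexpr double kMaxClusterStdDev = 2.0;
// ASSUMPTION: the description only says "a fraction"; 0.7 is illustrative.
constexpr double kAssumedXHeightFraction = 0.7;

// Mean of a non-empty sample set.
double Mean(const std::vector<double>& v) {
  assert(!v.empty());
  double sum = 0.0;
  for (double x : v) sum += x;
  return sum / v.size();
}

// Population standard deviation of a non-empty sample set.
double StdDev(const std::vector<double>& v) {
  const double m = Mean(v);
  double sq = 0.0;
  for (double x : v) sq += (x - m) * (x - m);
  return std::sqrt(sq / v.size());
}

// Median (upper middle for even counts); takes a copy so it can sort.
double Median(std::vector<double> v) {
  assert(!v.empty());
  std::sort(v.begin(), v.end());
  return v[v.size() / 2];
}

// Returns false if there are no samples; otherwise writes the estimate.
bool EstimateCapsHeight(std::vector<double> caps_hts, double* caps_ht) {
  if (caps_hts.empty()) return false;
  if (StdDev(caps_hts) < kMaxClusterStdDev) {
    *caps_ht = Mean(caps_hts);  // tight cluster: use the mean
    return true;
  }
  // Suspect small caps: try each split of the sorted samples into a
  // low and a high cluster (ASSUMPTION: a simple exhaustive search).
  std::sort(caps_hts.begin(), caps_hts.end());
  for (std::size_t i = 1; i < caps_hts.size(); ++i) {
    std::vector<double> low(caps_hts.begin(), caps_hts.begin() + i);
    std::vector<double> high(caps_hts.begin() + i, caps_hts.end());
    if (StdDev(low) < kMaxClusterStdDev && StdDev(high) < kMaxClusterStdDev) {
      *caps_ht = std::min(Median(low), Median(high));  // smaller median
      return true;
    }
  }
  *caps_ht = Median(caps_hts);  // no clean clusters: fall back to the median
  return true;
}

// Returns false if neither x-height samples nor a caps height are available.
bool EstimateXHeight(const std::vector<double>& x_hts, bool have_caps_ht,
                     double caps_ht, double* x_ht) {
  if (!x_hts.empty()) {
    *x_ht = Median(x_hts);
    return true;
  }
  if (have_caps_ht) {
    *x_ht = caps_ht * kAssumedXHeightFraction;
    return true;
  }
  return false;
}
```

With caps height samples such as 20, 21, 20, 30, 31, 30, the overall spread is too wide for a single cluster, so the sketch splits them into a low and a high group and reports the median of the low group.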

B) Add in case-ambiguous blobs based on confirmed x-ht/caps ht, changing case as necessary. Re-estimate caps ht and x-ht as in A, using the extended clusters.
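
One way to picture the case change in step B is a nearest-height rule, sketched below. This is an assumption made for illustration: the description does not say how a case-ambiguous blob is compared with the confirmed estimates, and the function name and parameters are hypothetical.

```cpp
#include <cctype>
#include <cmath>

// ASSUMPTION: a blob whose top is nearer the caps height than the x-height
// is treated as upper case, otherwise as lower case. Ties go to lower case.
char ResolveAmbiguousCase(char ch, double blob_top, double x_ht,
                          double caps_ht) {
  const bool nearer_caps =
      std::fabs(blob_top - caps_ht) < std::fabs(blob_top - x_ht);
  const unsigned char uch = static_cast<unsigned char>(ch);
  return static_cast<char>(nearer_caps ? std::toupper(uch)
                                       : std::tolower(uch));
}
```

Under this rule, a blob recognised as 'o' whose top sits at 29 with estimates x_ht = 20 and caps_ht = 30 would be changed to 'O'.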

C) If word contains rejects, and x-ht estimate significantly differs from original estimate, return TRUE so that the word can be rematched.
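
The return decision of step C might look like the sketch below. The description does not define "significantly", so the 10% relative change used as the default threshold is purely an assumption, as are the function and parameter names.

```cpp
#include <cmath>

// Returns true when the word should be rematched: it contains rejects and
// the new x-height estimate differs enough from the original one.
// ASSUMPTION: "significantly" is modelled as a relative change of at least
// min_relative_change (10% by default); the real criterion may differ.
bool ShouldRematchWord(bool word_has_rejects, double original_x_ht,
                       double new_x_ht, double min_relative_change = 0.1) {
  if (!word_has_rejects || original_x_ht <= 0.0) return false;
  const double change =
      std::fabs(new_x_ht - original_x_ht) / original_x_ht;
  return change >= min_relative_change;
}
```

A word without rejects never triggers a rematch in this sketch, however much the estimate moves.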


Generated on Wed Feb 28 19:49:29 2007 for Tesseract by  doxygen 1.5.1