Terms

General

document structure extraction
TBD
word spotting
“retrieve similar words in the image document through an image query”
fluctuation
see baseline fluctuation below.
superfluous information
non textual elements, textual elements from the verso (other side of the page)
verso
other side of the page
recto
front side of the page (opposite of verso)
image binarization
TBA
taxonomy
classifying into groups (classification)
projection profiles
TBA
smearing
?
Hough-based
?
repulsive-attractive network
?
stochastic method
?

Lines and components

From paper1:1

baseline
fictitious line which follows and joins the lower part of the character bodies in a text line
median line
fictitious line which follows and joins the upper part of the character bodies in a text line
upper line
fictitious line which joins the top of ascenders.
lower line
fictitious line which joins the bottom of descenders
overlapping components
descenders and ascenders located in the region of an adjacent line
touching components
ascenders and descenders belonging to consecutive lines which are thus connected.

Author style

baseline fluctuation
the baseline may vary due to writer movement. It may be straight, straight by segments, or curved.
straight baseline
straight (level, though it may be rotated at an angle)
straight baseline by segments
each segment is straight, but the baseline is not continuous (it may be straight word by word)
curved baseline
curved (the writer's hand has turned)
line orientations
there may be different line orientations, especially on authorial works where there are corrections and annotations.
line spacing
space between lines. overall, the problem is less complex when spacing is large.
insertions
words or short text lines may appear between the principal text lines, or in the margins.

Image quality

imperfect preprocessing
smudges, variable background intensity and the presence of seeping ink from the other side of the document make image preprocessing particularly difficult and produce binarization errors.
stroke fragmentation and merging
punctuation, dots and broken strokes due to low-quality images and/or binarization may produce many connected components; conversely, words, characters and strokes may be split into several connected components. The broken components are no longer linked to the median baseline of the writing and become ambiguous and hard to segment into the correct text line.
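
To see how fragmentation shows up in practice, here is a minimal sketch that counts connected components in a binary image. It assumes ink pixels are 1 and uses 8-connectivity; the function name `count_components` and the tiny test strokes are made up for illustration.

```python
from collections import deque


def count_components(binary):
    """Count 8-connected groups of ink pixels (value 1)."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # New component found: flood-fill it with a breadth-first search.
            count += 1
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny in range(max(0, cy - 1), min(height, cy + 2)):
                    for nx in range(max(0, cx - 1), min(width, cx + 2)):
                        if binary[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


intact_stroke = [[1, 1, 1, 1, 1]]
broken_stroke = [[1, 1, 0, 1, 1]]  # one pixel lost during binarization
```

A single missing pixel is enough to turn one stroke into two components, which is why faint writing inflates the component count.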

Main problems

line fluctuation
baseline has a different angle; or it is not linear (segmented); or it is curved
line
Components too close to each other
writing fragmentation
components are split into several components (e.g. faded writing)

Text line representation

separating paths
fictitious lines separating text lines. can be uniformly straight, made of straight segments, or of curving joined strokes.
delimited strip
delimited strip between two separating lines receives the same text line label. So the text line can be represented by a strip with its couple of separating lines.
clusters
general set-based way of defining text lines. A label is associated with each cluster. Units within the same cluster belong to the same text line. They may be pixels, connected components, or blocks enclosing pieces of writing.
strings
strings are lists of spatially aligned and ordered units. Each string represents one text line.
baselines
baselines follow line fluctuations but partially define a text line. Units connected to a baseline are assumed to belong to it.
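
The strip, cluster and string representations above can be related to each other with a small sketch. It assumes units are given as (x, y) centre points, that the separating lines are straight and horizontal (given only by their y position), and that reading order is left to right; all function names are hypothetical.

```python
from bisect import bisect_right


def label_by_strip(units, separator_ys):
    """Label each unit with the index of the strip its centre falls into."""
    separators = sorted(separator_ys)
    return [bisect_right(separators, y) for _, y in units]


def clusters_from_labels(labels):
    """Group unit indices by text line label (the set-based cluster view)."""
    clusters = {}
    for unit, label in enumerate(labels):
        clusters.setdefault(label, []).append(unit)
    return clusters


def strings_from_clusters(clusters, units):
    """Order the units of each cluster by x (the string view of a text line)."""
    return {
        label: sorted(members, key=lambda i: units[i][0])
        for label, members in clusters.items()
    }


# Five unit centres and one separating line at y = 30.
units = [(30, 12), (10, 10), (20, 48), (50, 11), (5, 52)]
labels = label_by_strip(units, [30])
clusters = clusters_from_labels(labels)
strings = strings_from_clusters(clusters, units)
```

Curved or piecewise separating paths would need a per-x lookup instead of a single y value per separator.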

Notes

From paper1:1

There are two categories of text line segmentation approaches: searching for (fictitious) separating lines or paths, or searching for aligned physical units. The choice of a segmentation technique depends on the complexity of the text line structure of the document.

After the text part has been extracted and restored, top-down and smearing techniques are generally applied for text line segmentation.

Preprocessing

Non-textual elements around the text, such as book bindings, book sides and parts of fingers (for instance thumb marks from someone holding the book open), should be removed based on criteria such as position and intensity level.

On the document itself, holes and stains may be removed by high-pass filtering [2].

Other non-textual elements (stamps, seals) but also ornamentation, decorated initials, can be removed using knowledge about the shape, the color or the position of these elements [3].

Extracting text from figures (text segmentation) can also be performed on texture grounds [4] [5] or by morphological filters [6] [7].

Linear graphical elements such as big crosses (called “St Andre’s crosses”) appear in some of Flaubert’s manuscripts. In [8], these elements are removed through a GUI using Kalman filtering.

Textual but unwanted elements such as the writing on the verso (bleed through text) may be removed by filtering and wavelet techniques [9][10][11] and by combining the verso image (the reverse side image) with the recto one (front side image).

Binarization, if necessary, can be performed by global or local thresholding. Global thresholding algorithms are not generally applicable to historical documents, due to inhomogeneous background. Thus, global thresholding results in severe deterioration in the quality of the segmented document image. Several local thresholding techniques have already been proposed to partially overcome such difficulties [12]. These local methods determine the threshold values based on the local properties of an image, e.g. pixel-by-pixel or region-by-region, and yield relatively better binarization results when compared with global thresholding methods.
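
To make the global versus local difference concrete, here is a minimal sketch on a synthetic page whose background darkens from left to right. The local rule used here is a plain neighbourhood-mean threshold, chosen for brevity; it is not the water flow model of [12]. The function names, `radius`, `offset` and the pixel values are all assumptions for illustration.

```python
def global_threshold(image, t):
    """Mark a pixel as ink (1) when it is darker than the single threshold t."""
    return [[1 if px < t else 0 for px in row] for row in image]


def local_mean_threshold(image, radius=1, offset=10):
    """Mark a pixel as ink (1) when it is darker than its local mean minus offset."""
    height, width = len(image), len(image[0])
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Window around (y, x), clipped at the image border.
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            result[y][x] = 1 if image[y][x] < local_mean - offset else 0
    return result


# Synthetic 5 x 10 page: background fades from 200 (left) to 110 (right).
background = [200 - 10 * x for x in range(10)]
page = [list(background) for _ in range(5)]
# Two short strokes on the middle row, each 50 grey levels darker than the paper.
for x in (1, 2, 7, 8):
    page[2][x] -= 50

global_result = global_threshold(page, 150)
local_result = local_mean_threshold(page)
```

With the threshold at 150, the global version marks the dark right-hand background as ink on every row, while the local version recovers only the four stroke pixels. No single global threshold separates this page, because the left stroke (140, 130) is brighter than the right background (110 to 140).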

(binarization - cont’d) Writing may be faint so that over-segmentation or under-segmentation may occur. The integral ratio technique [13] is a two-stage segmentation technique adapted to this problem. Background normalization [14] can be performed before binarization in order to find a global threshold more easily.

Segmentation of text lines from clean text

After preprocessing, we have a clean text image. The segmentation approaches are:

Projection–based methods

Basically: a histogram of ink pixels summed along each row, giving one value per y position.

Commonly used for printed document segmentation. This technique can also be adapted to handwritten documents with little overlap.

The vertical profile is not sensitive to writing fragmentation. Variants for obtaining a profile curve may consist in projecting black/white transitions such as in Marti and Bunke [15] or the number of connected components, rather than pixels. The profile curve can be smoothed, e.g. by a Gaussian or median filter to eliminate local maxima [16]. The profile curve is then analysed to find its maxima and minima.
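
A minimal sketch of the profile idea, assuming a binary image stored as a list of rows with ink pixels equal to 1. A moving average stands in for the Gaussian or median filter, and the maxima/minima analysis is simplified to "rows whose profile exceeds a threshold form a text line band". Function names and the sample page are made up for illustration.

```python
def projection_profile(binary):
    """Count ink pixels (value 1) in each row: one profile value per y."""
    return [sum(row) for row in binary]


def smooth(profile, radius=1):
    """Moving-average smoothing, clipped at both ends of the profile."""
    n = len(profile)
    smoothed = []
    for y in range(n):
        window = profile[max(0, y - radius):min(n, y + radius + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed


def text_line_bands(profile, threshold):
    """Return (top, bottom) row ranges where the profile exceeds threshold."""
    bands, start = [], None
    for y, value in enumerate(profile):
        if value > threshold and start is None:
            start = y  # entering a peak
        elif value <= threshold and start is not None:
            bands.append((start, y - 1))  # leaving a peak
            start = None
    if start is not None:
        bands.append((start, len(profile) - 1))
    return bands


page = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
]
profile = projection_profile(page)
bands = text_line_bands(profile, 0)
```

Here `profile` has two peaks separated by an empty valley, so `bands` contains two text lines. Smoothing widens the peaks, so the threshold has to be raised accordingly when `smooth` is applied first.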

There are two drawbacks: short lines will provide low peaks, and very narrow lines, as well as those including many overlapping components will not produce significant peaks.

In case of skew or moderate fluctuations of the text lines, the image may be divided into vertical strips and profiles sought inside each strip [17]. These piecewise projections are thus a means of adapting to local fluctuations within a more global scheme.
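
The piecewise idea can be sketched as follows, assuming a binary image (ink = 1) and fixed-width vertical strips. This is only an illustration of why strips help with skew, not the method of [17]; the function names and the synthetic page are assumptions.

```python
def strip_profiles(binary, strip_width):
    """Split the page into vertical strips and compute one row profile per strip."""
    width = len(binary[0])
    profiles = []
    for left in range(0, width, strip_width):
        right = min(width, left + strip_width)
        profiles.append([sum(row[left:right]) for row in binary])
    return profiles


def ink_bands(profile):
    """Return (top, bottom) runs of consecutive rows that contain ink."""
    bands, start = [], None
    for y, value in enumerate(profile):
        if value > 0 and start is None:
            start = y
        elif value == 0 and start is not None:
            bands.append((start, y - 1))
            start = None
    if start is not None:
        bands.append((start, len(profile) - 1))
    return bands


# Two skewed text lines on an 8 x 8 page: each drops by one row in the right half.
page = [[0] * 8 for _ in range(8)]
for top, left in ((1, 0), (4, 0), (2, 4), (5, 4)):
    for y in (top, top + 1):
        for x in range(left, left + 4):
            page[y][x] = 1

whole_page = ink_bands(strip_profiles(page, 8)[0])
per_strip = [ink_bands(p) for p in strip_profiles(page, 4)]
```

Over the full width the two lines merge into one band, while each 4-pixel strip still shows two separate bands. Linking the bands of neighbouring strips into complete text lines is a further step that this sketch leaves out.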

!!!!!!!!!!!!!!!BOOKMARK!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!BOOKMARK!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!BOOKMARK!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Smearing methods
Grouping methods
Methods based on the Hough transform
Repulsive-Attractive network method
Stochastic method

Processing of overlapping and touching components

Non Latin documents

Ancient Arabic documents

Summary of methods

Scratch pad

My process for text line segmentation

  • Find text part (get rid of the borders etc.) with preprocessing. HOW?
  • Apply some additional preprocessing such as high-pass filters. That should make the text clearer.
  • Some additional things like “St Andre’s crosses” are easy to get rid of. Use Kalman filtering and a GUI for removing them manually.
  • I don’t care about the verso image.
  • Binarize the image, i.e. make the text black and everything else white. Use local thresholding.
  • What approach to use? Projection methods look the best(?)

Papers

  1. Text Line Segmentation of Historical Documents: a Survey

    • Laurence Likforman-Sulem, Abderrazak Zahour, Bruno Taconet
    • 2006


  2. Generierung einer semantischen Reprasentation aus Abbildungen handschriftlicher Kirchenbuchaufzeichnungen

    • Markus Feldbach
    • 2000

  3. Les documents anciens, Document numérique, Hermès, Vol. 3, no 1-2, June, pp. 57-7

    • not open
    • Gusnard de Ventadert, J. André, H. Richy, L. Likforman-Sulem, E. Desjardin
    • 1999

  4. Text segmentation using Gabor filters for Automatic Document Processing, MVA, Vol5, pp. 169-184.

    • not open
    • Jain A., Bhattacharjee S.
    • 1992

  5. Colorizing paper texture of green-scale image of historical documents, Proceedings of the 4th IASTED Conference on Visualization, Imaging and Image Processing, VIIP, Marbella, Spain.

    • not open
    • Mello C. A. B., Cavalcanti C. S.V.C., C. Carvalho
    • 2004

  6. Extraction de textes et de figures dans les livres anciens à l’aide de la morphologie mathématique, Actes de CIFED’2000, Colloque International Francophone sur l’Ecrit et le Document, Lyon , pp. 81-90.

    • not open
    • Granado I., Mengucci M., Muge F.
    • 2000

  7. Morphological Segmentation of text and figures in Renaissance books (XVI Century), in Mathematical Morphology and its applications to image processing, Kluwer, pp. 397-404.

    • not open
    • Mengucci M., Granado I., J. Goutsias, L. Vincent, D. Bloomberg (eds)
    • 2000

  8. Extraction d’éléments graphiques dans les images de manuscrits, Colloque International Francophone sur l’Ecrit et le Document (CIFED’98), Québec, pp. 223-232.

    • not open, nowhere to buy
    • L. Likforman-Sulem
    • 1998

  9. Séparation recto/verso d’images de manuscrits anciens, Proc. of Colloque National sur l’Ecrit et le Document CNED’96, Nantes, pp. 199-206.

    • not open
    • Lamouche I., Bellissant C.
    • 1996

  10. Restoration of archival documents using a wavelet technique, IEEE PAMI, 24 (10), pp. 1399-1404.

    • Tan C. L., Cao R., Shen P.
    • 2002

  11. An Environment for Processing Images of Historical Documents, Microprocessing and Microprogramming, 40, pp. 939-942.

    • not open
    • Lins R.D., Guimaraes Neto M., França Neto L., Galdino Rosa L.
    • 1994

  12. Document image binarization based on topographic analysis using a water flow model, Pattern Recognition 35:265-277.

  13. Integral ratio: a new class of global thresholding techniques for handwriting images, IEEE PAMI, 21 (8) : 761 - 768

    • Solihin Y., Leedham, C.G.
    • 1999

  14. Historical Document Image Enhancement using background light intensity normalization, ICPR 2004, Cambridge.

  15. On the influence of vocabulary size and language models in unconstrained handwritten text recognition, Proc. of ICDAR’01, Seattle, pp. 260-265.

    • Marti U., Bunke H.
    • 2001

  16. Scale space technique for word segmentation in handwritten manuscripts, Proc. 2nd Int. Conf. on Scale Space Theories in Computer Vision, pp. 22-33.

    • Manmatha R., Srimal N
    • 1999

  17. Arabic hand-written text-line extraction, Proceedings of the 6th ICDAR, Seattle, pp. 281– 285.

    • not open
    • Zahour, A., Taconet, B., Mercy, P., Ramdane, S.
    • 2001
