- document structure extraction
- word spotting
- “retrieve similar words in the image document through an image query”
- see baseline fluctuation below.
- superfluous information
- non textual elements, textual elements from the verso (other side of the page)
- verso: the other side of the page
- recto: the front side of the page (opposite of verso)
- image binarization
- classifying into groups (classification)
- projection profiles
- repulsive-attractive network
- stochastic method
Lines and components
- baseline
- fictitious line which follows and joins the lower part of the character bodies in a text line
- median line
- fictitious line which follows and joins the upper part of the character bodies in a text line
- upper line
- fictitious line which joins the top of ascenders.
- lower line
- fictitious line which joins the bottom of descenders
- overlapping components
- descenders and ascenders located in the region of an adjacent line
- touching components
- ascenders and descenders belonging to consecutive lines which are thus connected.
- baseline fluctuation
- the baseline may vary due to writer movement. It may be straight, straight by segments, or curved.
- straight baseline
- straight (flat, but may be at an angle, i.e. rotated)
- straight baseline by segments
- each segment is straight, but the baseline is not continuous (the baseline may be straight word by word)
- curved baseline
- curved (the writer's hand has turned)
- line orientations
- there may be different line orientations, especially on authorial works where there are corrections and annotations.
- line spacing
- space between lines. overall, the problem is less complex when spacing is large.
- words or short text lines may appear between the principal text lines, or in the margins.
- imperfect preprocessing
- smudges, variable background intensity and the presence of seeping ink from the other side of the document make image preprocessing particularly difficult and produce binarization errors.
- stroke fragmentation and merging
- punctuation, dots and broken strokes due to low-quality images and/or binarization may produce many connected components; conversely, words, characters and strokes may be split into several connected components. The broken components are no longer linked to the median baseline of the writing and become ambiguous and hard to segment into the correct text line.
- line fluctuation
- baseline has a different angle; or it is not linear (segmented); or it is curved
- Components too close to each other
- writing fragmentation
- components are split into several components (for example, faded writing)
Text line representation
- separating paths
- fictitious lines separating text lines. can be uniformly straight, made of straight segments, or of curving joined strokes.
- delimited strip
- delimited strip between two separating lines receives the same text line label. So the text line can be represented by a strip with its couple of separating lines.
- general set-based way of defining text lines. A label is associated with each cluster. Units within the same cluster belong to the same text line. They may be pixels, connected components, or blocks enclosing pieces of writing.
- strings are lists of spatially aligned and ordered units. Each string represents one text line.
- baselines follow line fluctuations but partially define a text line. Units connected to a baseline are assumed to belong to it.
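The cluster and string representations above can be connected with a small sketch. It assumes each unit is the (x, y) centroid of a connected component and that the writing runs left to right. The function name and the input format are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def clusters_to_strings(units, labels):
    """Group units by text line label, then order each group left to right.

    units  -- list of (x, y) centroids, one per unit (assumed input format)
    labels -- labels[i] is the text line label assigned to units[i]
    """
    lines = defaultdict(list)
    for unit, label in zip(units, labels):
        lines[label].append(unit)
    # Sorting (x, y) tuples orders the units by x first, i.e. left to right.
    return {label: sorted(group) for label, group in lines.items()}
```

Each value of the returned dictionary is one string, i.e. one text line as an ordered list of units.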
There are two categories of text line segmentation approaches: searching for (fictitious) separating lines or paths, or searching for aligned physical units. The choice of a segmentation technique depends on the complexity of the text line structure of the document.
After the text part has been extracted and restored, top-down and smearing techniques are generally applied for text line segmentation
Non-textual elements around the text, such as book bindings, book sides and parts of fingers (for instance, thumb marks from someone holding the book open), should be removed based on criteria such as position and intensity level.
On the document itself, holes and stains may be removed by high-pass filtering (from ext-paper1:2).
Other non-textual elements (stamps, seals), as well as ornamentation and decorated initials, can be removed using knowledge about the shape, color or position of these elements.
Linear graphical elements such as big crosses (called “St Andre’s crosses”) appear in some of Flaubert’s manuscripts. These elements are removed through a GUI using Kalman filtering.
Textual but unwanted elements such as the writing on the verso (bleed-through text) may be removed by filtering and wavelet techniques, and by combining the verso image (the reverse side image) with the recto one (front side image).
Binarization, if necessary, can be performed by global or local thresholding. Global thresholding algorithms are generally not applicable to historical documents because the background is inhomogeneous, so a single threshold severely degrades the quality of the segmented document image. Several local thresholding techniques have been proposed to partially overcome such difficulties. These local methods determine the threshold values from the local properties of the image, e.g. pixel by pixel or region by region, and yield relatively better binarization results than global thresholding methods.
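To make the difference concrete, here is a minimal sketch in plain Python, assuming a grey-level image given as a list of rows with values from 0 (black) to 255 (white). The local rule used here (compare each pixel with the mean of its neighbourhood minus an offset) is a simple stand-in, not one of the specific local techniques from the literature. The names `radius` and `offset` are illustrative parameters.

```python
def global_threshold(image, t):
    """Mark a pixel as ink (1) when it is darker than the single threshold t."""
    return [[1 if px < t else 0 for px in row] for row in image]


def local_mean_threshold(image, radius=1, offset=30):
    """Mark a pixel as ink (1) when it is darker than the mean of its
    (2*radius+1) x (2*radius+1) neighbourhood minus `offset`.
    The neighbourhood is clipped at the image borders."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            window = [image[j][i] for j in ys for i in xs]
            local_mean = sum(window) / len(window)
            out[y][x] = 1 if image[y][x] < local_mean - offset else 0
    return out


# Bright background on the left, dark background on the right,
# with one ink pixel in each half.
page = [
    [200, 200, 200, 100, 100, 100],
    [200, 120, 200, 100, 20, 100],
    [200, 200, 200, 100, 100, 100],
]
print(global_threshold(page, 110)[1])
print(local_mean_threshold(page)[1])
```

On this toy page the global threshold of 110 misses the ink pixel on the bright side and marks the whole dark side as ink, while the local rule keeps only the two ink pixels.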
(binarization - cont’d) Writing may be faint so that over-segmentation or under-segmentation may occur. The integral ratio technique is a two-stage segmentation technique adapted to this problem. Background normalization can be performed before binarization in order to find a global threshold more easily.
Segmentation of text lines from clean text
After preprocessing, we have a clean text. Here are some methods of different approaches to do the segmentation:
Projection profiles: basically, a histogram of pixels counted along the horizontal axis, one value per image row.
They are commonly used for printed document segmentation. This technique can also be adapted to handwritten documents with little overlap.
The vertical profile is not sensitive to writing fragmentation. Variants for obtaining a profile curve may consist in projecting black/white transitions, as in Marti and Bunke, or the number of connected components rather than pixels. The profile curve can be smoothed, e.g. by a Gaussian or median filter, to eliminate local maxima. The profile curve is then analysed to find its maxima and minima.
There are two drawbacks: short lines will provide low peaks, and very narrow lines, as well as those including many overlapping components will not produce significant peaks.
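A minimal sketch of the profile idea in plain Python, assuming a binarized image given as a list of rows with 1 for ink and 0 for background. The function names are my own. The moving average is only a stand-in for a Gaussian or median filter, and the threshold test is a simplified replacement for a proper analysis of maxima and minima.

```python
def projection_profile(binary):
    """Count ink pixels (value 1) in each image row."""
    return [sum(row) for row in binary]


def smooth(profile, radius=1):
    """Moving-average smoothing, with the window clipped at both ends."""
    n = len(profile)
    out = []
    for i in range(n):
        window = profile[max(0, i - radius): min(n, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def text_line_bands(profile, threshold=0):
    """Return (start, end) row ranges, end exclusive, where the profile
    exceeds `threshold`. The valleys between bands separate text lines."""
    bands, start = [], None
    for y, value in enumerate(profile):
        if value > threshold and start is None:
            start = y
        elif value <= threshold and start is not None:
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, len(profile)))
    return bands
```

A short line shows up as a low value in the profile, so with a threshold above zero it can disappear entirely, which is the first drawback mentioned above.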
In case of skew or moderate fluctuations of the text lines, the image may be divided into vertical strips and profiles sought inside each strip. These piecewise projections are thus a means of adapting to local fluctuations within a more global scheme.
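The piecewise variant can be sketched the same way, again assuming a binary image as a list of rows with 1 for ink. The function name and the equal-width split are illustrative assumptions. How the per-strip results are linked back into full text lines is not covered here.

```python
def strip_profiles(binary, n_strips):
    """Split the image into `n_strips` vertical strips of (nearly) equal
    width and return one row-wise ink-count profile per strip."""
    width = len(binary[0])
    # Integer column boundaries: 0, width/n, 2*width/n, ..., width.
    bounds = [k * width // n_strips for k in range(n_strips + 1)]
    return [
        [sum(row[left:right]) for row in binary]
        for left, right in zip(bounds, bounds[1:])
    ]


# A single skewed line: it sits on row 0 on the left and row 1 on the right.
skewed = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
print(strip_profiles(skewed, 2))
```

On the skewed toy line the global profile is flat over rows 0 and 1, while each strip has a single clear peak on a different row.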
Methods based on the Hough transform
Repulsive-Attractive network method
Processing of overlapping and touching components
Non Latin documents
Ancient Arabic documents
Summary of methods
My process for text line segmentation
- Find text part (get rid of the borders etc.) with preprocessing. HOW?
- Apply some additional preprocessing such as high-pass filters. That should make the text clearer.
- Some additional things like “St Andre’s crosses” are easy to get rid of. Use Kalman filtering and a GUI for removing them manually.
- I don’t care about the verso image.
- Binarize the image, i.e. make the text black and everything else white. Use local thresholding.
- What approach to use? Projection methods look the best(?)
- Laurence Likforman-Sulem, Abderrazak Zahour, Bruno Taconet
- Markus Feldbach
Les documents anciens, Document numérique, Hermès, Vol. 3, no 1-2, June, pp. 57-7
- not open
- Gusnard de Ventadert, J. André, H. Richy, L. Likforman-Sulem, E. Desjardin
Text segmentation using Gabor filters for Automatic Document Processing, MVA, Vol. 5, pp. 169-184.
- not open
- Jain A., Bhattacharjee S.
Colorizing paper texture of green-scale image of historical documents, Proceedings of the 4th IASTED Conference on Visualization, Imaging and Image Processing, VIIP, Marbella, Spain.
- not open
- Mello C. A. B., Cavalcanti C. S.V.C., C. Carvalho
Extraction de textes et de figures dans les livres anciens à l’aide de la morphologie mathématique, Actes de CIFED’2000, Colloque International Francophone sur l’Ecrit et le Document, Lyon, pp. 81-90.
- not open
- Granado I., Mengucci M., Muge F.
Morphological Segmentation of text and figures in Renaissance books (XVI Century), in Mathematical Morphology and its applications to image processing, Kluwer, pp. 397-404.
- not open
- Mengucci M., Granado I., J. Goutsias, L. Vincent, D. Bloomberg (eds)
Extraction d’éléments graphiques dans les images de manuscrits, Colloque International Francophone sur l’Ecrit et le Document (CIFED’98), Québec, pp. 223-232.
- not open, nowhere to buy
- L. Likforman-Sulem
Séparation recto/verso d’images de manuscrits anciens, Proc. of Colloque National sur l’Ecrit et le Document CNED’96, Nantes, pp. 199-206.
- not open
- Lamouche I., Bellissant C.
Restoration of archival documents using a wavelet technique, IEEE PAMI, 24 (10), pp. 1399-1404.
- Tan C. L., Cao R., Shen P.
An Environment for Processing Images of Historical Documents, Microprocessing and Microprogramming, 40, pp. 939-942.
- not open
- Lins R.D., Guimaraes Neto M., França Neto L., Galdino Rosa L.
Document image binarization based on topographic analysis using a water flow model, Pattern Recognition 35:265-277.
Integral ratio: a new class of global thresholding techniques for handwriting images, IEEE PAMI, 21 (8), pp. 761-768.
- Solihin Y., Leedham, C.G.
Historical Document Image Enhancement using background light intensity normalization, ICPR 2004, Cambridge.
- not open
- Shi Z., V. Govindaraju
On the influence of vocabulary size and language models in unconstrained handwritten text recognition, Proc. of ICDAR’01, Seattle, pp. 260-265.
- Marti U., Bunke H.
Scale space technique for word segmentation in handwritten manuscripts, Proc. 2nd Int. Conf. on Scale Space Theories in Computer Vision, pp. 22-33.
- Manmatha R., Srimal N
Arabic hand-written text-line extraction, Proceedings of the 6th ICDAR, Seattle, pp. 281-285.
- not open
- Zahour, A., Taconet, B., Mercy, P., Ramdane, S.