Text Segmentation

8 min readMar 29, 2019

Document scanner until word segmentation

Hey, welcome back! This project was about text segmentation. More specifically, the process of handwritten text segmentation using digital image process.

At first, the repository was created to be a start point to another project. So, the idea is to start the segmentation process to make something after with results. For code lovers:

https://github.com/arthurflor23/text-segmentation

The initial goal was to study and implement a handwritten text line segmentation. Simple… And here I come, with more complication, study, and implementation. I must say that at the end of the study, I learned that I should not use this implementation in a real-world problem (I’ll get there).

So, to make a line segmentation, for after do a word segmentation, first, you must think the binarization technique and most importantly: how your data will be.

Why did I say that? Because you can get simple data images from IAM dataset, or get some like Bentham dataset, for example. So it depends on what you are working on. I like working with more difficult scenarios, because if the project achieves good results, consequently it will bring great results in simpler scenarios.

I already notice that this project was in the simplest scenarios. This brought me a bit of frustration, but of course, a lot more knowledge.

Document Scanner

The first step, document scanner, isn’t necessary if you get ready datasets, but think that images come from a smartphone camera or simple scan. Anyway, is important normalize the input images to get only the important area to segmentation.

This is an automatic process. In reality, the perfect scenario to this would be to create an interface with the user, like the CamScanner app does, for example. In this step, we detecting the predominant contour in image and segment using a four-point perspective transformation. Here I leave two very good links:

Fig. 2: Document scanner process from How to Build a Kick-Ass Mobile Document Scanner in Just 5 Minutes

Binarization

Exist a lot of techniques of binarization. Among those studied and implemented that I found:

A Threshold Selection Method from Gray-Level Histograms (Otsu);
An Introduction to Digital Image Processing (Niblack);
Adaptive Document Binarization (Sauvola);
Text Localization, Enhancement, and Binarization in Multimedia Documents (Wolf);
Binarization of Historical Document Images Using the Local Maximum and Minimum (Su).

What is the best? Well, that depends. As the study aims to get the best from different inputs, I’ve put together some slightly different images to perform the tests (about 10), as Fig. 3 shows.

The point is, somehow, I found the paper “Efficient illumination compensation techniques for text images” about Illumination Compensation. This paper presents a new approach to light normalization of the image (Fig. 4).

Fig. 4: An illumination compensation test with an image a lot of shade

With this technique applied, I get the best results from images (Fig. 5), but of course, new processing in python too. To be faster, I changed all preprocessing to c++, and the response was immediate.

Fig. 5: New binarization test with illumination compensation before

In the end, I chose the Sauvola method with illumination compensation at all.

Line Segmentation

Finally, we arrive at the start point of this project! haha

To have a good orientation of what techniques are present in the middle of text segmentation, I recommend the paper “Text-Based Image Segmentation Methodology”. In it, I was able to find the levels of text segmentation (line, word, character) and methodologies (pixel counting approach, histogram approach, Y histogram projection, text line separation, false line exclusion, line region recovery, smearing approach, stochastic approach, water flow approach), very useful to start.

I’ve implemented 2 line segmentation methods to see which would be the most appropriate:

A Statistical approach to line segmentation in handwritten documents;
A* Path Planning for Line Segmentation of Handwritten Documents.

Well, both are really good (Fig. 6), but the second is really slow for this project, because depending on the image, the algorithm takes 2 minutes or more, to result in a very similar segmentation.

Fig. 6: “A Statistical approach to line segmentation in handwritten documents” (a), and “*A* Path Planning for Line Segmentation of Handwritten Documents*” implementation (b). Both with Saint Gall dataset image.

To improve the lines normalization, also was implement a deslanting process from “A New Normalization Technique For Cursive Handwritten Words”. This process removes the cursive writing style and it is used as a preprocessing step for handwritten text recognition (Fig. 7).

Fig. 7: Sample of deslanting image from https://github.com/githubharald/DeslantImg

Word Segmentation

In this step, I can highlight the papers:

Scale-Space Technique for Word Segmentation in Handwritten Documents;
Word Segmentation Method for Handwritten Documents based on Structured Learning;
Keyword spotting in historical handwritten documents based on graph matching;
Toward a Dataset Agnostic Word Segmentation Method.

For implementation, I really found it interesting to approach with deep learning, like “Toward a Dataset Agnostic Word Segmentation Method”. They presented a process with three models to do word segmentation (Fig. 8).

Fig. 8: A sample document from the ICDAR dataset (a), heatmap generated by the heatmap network (b), and a smooth heatmap generated by the smoother network (c).

To be honest, at this point, I just wanted a simple method, small and fast to implement. So “Scale-Space Technique for Word Segmentation in Handwritten Documents” gave me this, and of course, the method brought good results to simple images (Fig. 9).

Fig. 9: Sample word segmentation from lines

Another approach to the segmentation step that I found, this time with compressed images:

Extraction of Line-Word-Character Segments Directly from Run-Length Compressed Printed Text-Documents;
Text line Segmentation in Compressed Representation of Handwritten Document using Tunneling Algorithm;
Word Segmentation Directly in Run-Length Compressed Handwritten Document Images using Connected Component Analysis.

Results

It was 10 images to test, but all the code/results you can find in https://github.com/arthurflor23/text-segmentation. For now, here some final results:

Conclusions

It was a great experience for me, and with this project, I will change somethings about my computer vision project (it will be the next post, I think).

Then, this process of the segmentation is good to simple images of the manuscripts, because the final code is static. If anything comes up different from the input, the line/word segmentation will not go very well. And we don’t have to go very far with “anything different”, such as a book edges, for example.

Anyway, this factor directly impacts the results of my dissertation research, so I will have to find some way out in the coming months. ~maybe the computer vision project helps, maybe not~

References

Arivszhagan, M., Srinivasan, H., Srihari, S. (2007). A Statistical approach to line segmentation in handwritten documents. The International Society for Optical Engineering;
Axler, G., Wolf, L. (2018). Toward a Dataset Agnostic Word Segmentation Method. 2018 25th IEEE International Conference on Image Processing;
Chen, K.-N., Chen, C.-H., Chang, C.-C. (2012). Efficient illumination compensation techniques for text images. Digital Signal Processing, v. 22, p. 726–733;
Javed, M., Nagabhushan, P., Chaudhuri, B. (2013). Extraction of Line-Word-Character Segments Directly from Run-Length Compressed Printed Text-Documents. 2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing, and Graphics;
Lazzara, G., Géraud, T. (2014). Efficient Multiscale Sauvola’s Binarization. 2014 International Journal on Document Analysis and Recognition manuscript;
Manmatha, R., Srimal, N. (1999). Scale-Space Technique for Word Segmentation in Handwritten Documents. SCALE-SPACE ’99 Proceedings of the Second International Conference on Scale-Space Theories in Computer Vision, p. 22–33;
Mehul, G., Ankita, P., Namrata, D., Rahul, G., Sheth, S. (2014). Text-Based Image Segmentation Methodology. 2nd International Conference on Innovations in Automation and Mechatronics Engineering;
Niblack, W. (1986). An Introduction to Digital Image Processing. Strandberg Publishing Company Birkeroed, Denmark;
Ntirogiannis, K., Gatos, B., Pratikakis, I. (2014). A Combined Approach for the Binarization of Handwritten Document Images. Pattern Recognition Letters 35;
Otsu, N. (1979). A Threshold Selection Method from Gray-Level Histograms. IEEE Transactions on Systems, Man and Cybernetics 9, 62–66;
Rosebrock, A. (2014). 4 Point OpenCV getPerspective Transform Example. Retrieved February 27, 2019, from https://www.pyimagesearch.com/2014/08/25/4-point-opencv-getperspective-transform-example;
Rosebrock, A. (2014). How to Build a Kick-Ass Mobile Document Scanner in Just 5 Minutes. Retrieved February 27, 2019, from https://www.pyimagesearch.com/2014/09/01/build-kick-ass-mobile-document-scanner-just-5-minutes;
Ryu, J., Koo, H. Il, Cho, N. Ik. (2015). Word Segmentation Method for Handwritten Documents based on Structured Learning. IEEE Signal Processing Letters, v. 22;
Sauvola, J., Seppanen, T., Haapakoski, S., Pietikainen, M. (1997). Adaptive Document Binarization. IEEE Computer Society Washington;
Stauffer, M., Fischer, A., Riesen, K. (2018). Keyword Spotting in Historical Handwritten Documents Based on Graph Matching. Pattern Recognition 81, p. 240–253;
Su, B., Lu, S., Tan, C. L. (2010). Binarization of Historical Document Images Using the Local Maximum and Minimum. Document Analysis Systems;
Surinta, O., Holtkamp, M., Karabaa, F., Oosten, J.-P. van, Schomaker, L., Wiering, M. (2014). A* Path Planning for Line Segmentation of Handwritten Documents. 14th International Conference on Frontiers in Handwriting Recognition;
Vedere, A. R., Nagabhushan, P. (2018). Text line Segmentation in Compressed Representation of Handwritten Document using Tunneling Algorithm. International Journal of Intelligent Systems and Applications in Engineering, p. 251–261;
Vedere, A. R., Nagabhushan, P., Javed, M. (2018). Word Segmentation Directly in Run-Length Compressed Handwritten Document Images using Connected Component Analysis. IET Image Processing;
Vinciarelli, A., Luettin, J. (2001). A New Normalization Technique for Cursive Handwritten Words. Pattern Recognition Letters 22;
Wolf, C., Jolion, J.-M., Chassaing, F. (2002). Text Localization, Enhancement, and Binarization in Multimedia Documents. International Conference on Pattern Recognition (ICPR), v. 4, p. 1037–1040.