In recent years NLP (Natural Language Processing) has developed significantly. Many deep learning models are trained on billions of words, optimizing billions of parameters. These pre-trained models can be adapted to various NLP tasks such as text generation, question answering, and NER. However, before using such models, one needs to pre-process the data. If the task is to apply them to scanned documents, then one critical pre-processing step is OCR. Several OCR technologies and libraries exist, but they show poor performance when it comes to recognizing handwritten words and characters. In this article we compare the well-known Tesseract OCR against a few lines of Python code using the OpenCV library to detect words in a German handwritten document.
This picture contains 12 German words. As we can see, the characters are written joined together in a flowing manner. Moreover, the shapes of the letters are quite unique and differ from a formal typeface. The goal is therefore to detect and split the words in this handwritten German Post-it note. For this we will use the Tesseract OCR tool in Python and compare it to an OpenCV script in Python.
The Tesseract OCR engine was originally developed at HP between 1985 and 1994. In 2005 Tesseract was open-sourced by HP, and from 2006 until November 2018 it was developed by Google. You can refer to the project's GitHub page for more information and implementation details.
Tesseract allows us to set the language and apply further configuration options. For this task we applied Page Segmentation Mode --psm 11 and Engine Mode --oem 3 to fine-tune the OCR results. --psm 11 sets the Page Segmentation Mode to sparse text, meaning Tesseract looks for as much text as possible in no particular order. Here is the result of running OCR on this German handwritten Post-it.
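In Python, such a call can be sketched with the pytesseract wrapper. Note that pytesseract, Pillow, the German language data ("deu") and the file name IMG_3609.JPG are assumptions for illustration, not part of the original setup:

```python
import os

# Minimal sketch of a Tesseract call via pytesseract (assumed installed,
# together with Pillow and the German traineddata).
def ocr_handwriting(path, lang="deu"):
    import pytesseract
    from PIL import Image
    # --psm 11: sparse text in no particular order; --oem 3: default engine
    config = "--psm 11 --oem 3"
    return pytesseract.image_to_string(Image.open(path), lang=lang, config=config)

# Run only if the sample image is present.
if os.path.exists("IMG_3609.JPG"):
    print(ocr_handwriting("IMG_3609.JPG"))
```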
As we see, Tesseract did a poor job of recognizing the words. It did not detect 4 of the 12 words at all, and of the 8 it did detect, it could not read a single one correctly. It is, however, possible to train Tesseract on handwritten letters, which is out of scope of this blog.
OpenCV (Open Source Computer Vision Library) is an open-source computer vision software library. With more than 2500 optimized algorithms, OpenCV was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in commercial products.
import operator
import math

import cv2

img = cv2.imread('IMG_3609.JPG')
height, width, channels = img.shape
img = cv2.fastNlMeansDenoisingColored(img, None, 10, 10, 7, 21)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
retval, dst = cv2.threshold(gray, 110, 255, cv2.THRESH_BINARY_INV)
The fastNlMeansDenoisingColored() function performs image denoising on colored images using the Non-Local Means Denoising algorithm with several computational optimizations; the noise is expected to be Gaussian white noise. We run this function first to reduce noise in the image.
The cvtColor() function converts an image from one color space to another. We use it to convert the image from BGR to grayscale. Please note that imread() loads the channels in B, G, R order by default.
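As a sketch of what happens under the hood: cvtColor() combines the three channels with the BT.601 luma weights, which can be reproduced in plain NumPy (the sample pixel is made up for illustration):

```python
import numpy as np

# A 1x1 "image" with a single pure-blue pixel; OpenCV stores channels as B, G, R.
bgr = np.array([[[255, 0, 0]]], dtype=np.uint8)

# BT.601 weights, as used by cv2.cvtColor(..., cv2.COLOR_BGR2GRAY)
b, g, r = bgr[..., 0], bgr[..., 1], bgr[..., 2]
gray = (0.114 * b + 0.587 * g + 0.299 * r).round().astype(np.uint8)
print(gray[0, 0])  # blue contributes only 11.4% of full brightness
```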
The threshold() function applies a fixed-level threshold to the image array. We use THRESH_BINARY_INV, which sets every pixel above our threshold (110) to 0 and every other pixel to the maximum value we specified (255).
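The inverted-binary rule can be sketched in plain NumPy (the tiny sample array is made up for illustration):

```python
import numpy as np

# A toy 2x3 grayscale patch
gray = np.array([[50, 110, 200],
                 [0, 111, 255]], dtype=np.uint8)

# THRESH_BINARY_INV: pixels strictly above the threshold become 0,
# everything else becomes maxval.
thresh, maxval = 110, 255
dst = np.where(gray > thresh, 0, maxval).astype(np.uint8)
```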
After applying the above steps, we obtain the image plotted below on the right. On the left you can see the original image.
As we see, applying these functions lets us highlight the text. However, the strokes appear quite thin. To thicken the handwritten strokes, we apply the cv2.dilate() function. This function dilates the source image using the specified structuring element, which determines the shape of the pixel neighbourhood over which the maximum is taken. To dilate the image we need a kernel, i.e. a structuring element of a given size and shape for morphological operations.
kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (4, 4))
dilated = cv2.dilate(dst, kernel, iterations=5)
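To illustrate what dilation does, here is a naive NumPy sketch with a square kernel (not OpenCV's implementation, and simpler than the cross-shaped kernel used above):

```python
import numpy as np

def dilate_binary(img, k=3, iterations=1):
    """Naive dilation: each output pixel becomes the maximum over
    a k x k neighbourhood (a square structuring element)."""
    pad = k // 2
    h, w = img.shape
    out = img.copy()
    for _ in range(iterations):
        padded = np.pad(out, pad, mode="constant")
        shifted = [padded[i:i + h, j:j + w] for i in range(k) for j in range(k)]
        out = np.max(np.stack(shifted), axis=0)
    return out

# A single white pixel grows into a 3x3 white square after one iteration.
img = np.zeros((5, 5), dtype=np.uint8)
img[2, 2] = 255
grown = dilate_binary(img)
```

This is why the thin strokes get thicker: every foreground pixel "spreads" into its neighbourhood on each iteration.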
Once the image is dilated, it looks like the image below. As we see (on the right), the handwriting is more readable compared to the non-dilated image (on the left).
The final step is to detect each word. For this purpose we use the OpenCV function findContours(), which detects contours in the image. Contours are simply curves joining all the continuous points (along a boundary) having the same color or intensity. Contours are a useful tool for shape analysis and object detection and recognition. In our case, each word should form a single contour, since in cursive handwriting the writer does not lift the pen before completing a word.
# Note: in OpenCV 3.x findContours() returns three values (image, contours,
# hierarchy); in OpenCV 4.x it returns only two.
contours, hierarchy = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
Once the contours are retrieved, we get their rectangular boundaries using the boundingRect() function. After applying boundingRect() to all contours, we discard contours that are too big or too small and append the remaining coordinates to a list.
coordinates = []
for contour in contours:
    [x, y, w, h] = cv2.boundingRect(contour)
    if h > 400 and w > 400:
        continue
    if h < 60 or w < 60:
        continue
    coordinates.append((x, y, w, h, math.ceil(x / width * 10), math.ceil(y / height * 10)))
Now we can plot these rectangles on the image using OpenCV's rectangle() function.
import matplotlib.pyplot as plt

for coordinate in coordinates:
    [x, y, w, h, x_ratio, y_ratio] = coordinate
    cv2.rectangle(img, (x, y), (x + w, y + h), (255, 255, 255), 2)

plt.imshow(img)
plt.show()
We have compared Tesseract and OpenCV at detecting the words of a German handwritten note. Tesseract is unable to detect handwritten text out of the box, while with OpenCV we performed this task easily. However, there are a few critical points to consider. In the OpenCV approach we used several hand-tuned parameters (the threshold value, kernel size, size filters, etc.), so this approach may not generalize well to other images. Another point to consider is the order of the word segments. OpenCV provides a hierarchy of the contours: for each contour, its relationship to the previous, next, parent and child contour is given (set to -1 where no such relationship exists). However, it still requires extra effort to put the words into the right order.
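One simple heuristic is to sort the boxes top-to-bottom by the coarse y-grid ratio computed earlier, then left-to-right by the exact x position. A sketch with made-up coordinates in the (x, y, w, h, x_ratio, y_ratio) format built above:

```python
import operator

# Made-up word boxes: (x, y, w, h, x_ratio, y_ratio)
coordinates = [
    (520, 80, 100, 60, 6, 1),   # second word, first line
    (100, 90, 120, 70, 2, 1),   # first word, first line
    (110, 400, 90, 50, 2, 5),   # first word, second line
]

# Sort by the coarse row (y_ratio), then by the exact x position within a row.
reading_order = sorted(coordinates, key=operator.itemgetter(5, 0))
```

The coarse grid absorbs small vertical jitter between words on the same line, which a plain sort on the raw y coordinate would not.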
In our next blog we will show how to extract a meaningful word from each segmented word image. If you have any questions, do not hesitate to reach us at email@example.com.