Computer vision techniques are becoming widely available and now support many use cases. Data contained in images can be extracted either manually or automatically. OCR, which stands for Optical Character Recognition, automates the extraction of text from images. It is the process of mechanically or electronically converting scanned images of handwritten, typed or printed text into machine-encoded text. A major advantage of OCR is that such machine-readable text can be incorporated into a web page or compressed into a zip file. In this way, printed text can be stored in very little space and shared easily across the web.

Let us discuss the various phases of OCR.

Phase 1

Pre-processing: This phase removes unwanted distortion from the image. This in turn enhances the image features and leads to better segmentation, which is the next phase.

Pre-processing to enhance image quality 
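As one concrete illustration of pre-processing, here is a minimal sketch of binarisation, which separates dark ink from a light background before segmentation. The article does not say which pre-processing steps are used, so the choice of Otsu's thresholding is an assumption, and the function names `otsu_threshold` and `binarize` are made up for this example. The image is assumed to be a list of rows of grey levels from 0 to 255.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of their grey levels
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor); Otsu maximises it.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map a greyscale image (rows of 0-255 ints) to 1 = ink, 0 = background."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]


# Tiny made-up image: dark text pixels (25, 30) on a light background.
print(binarize([[200, 210, 30], [220, 25, 215]]))
```

In practice a library routine would normally be used for this step; the pure-Python version is only meant to show what the step does.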

Phase 2

Segmentation: Contour detection is used for this phase. A contour-based approach works well if the characters are some distance apart, but it fails when they are very close to each other. When a bounding box is formed around each character, many characters will produce more than one contour, for example the character e, whose enclosed loop yields an inner contour. To handle this, a condition is checked on each contour: if the contour has no parent, it is an outer contour and is kept; if it has a parent, it is an inner contour and is ignored.
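The parent check can be sketched in a few lines. This is only an illustration under assumptions: the bounding boxes and hierarchy are taken as already computed, each hierarchy row is assumed to follow the `[next, previous, first_child, parent]` layout that OpenCV's `findContours` reports, with `-1` meaning "none", and the function name `outer_character_boxes` is hypothetical.

```python
def outer_character_boxes(boxes, hierarchy):
    """Keep only boxes whose contour has no parent, ordered left to right.

    boxes:     list of (x, y, width, height) tuples, one per contour
    hierarchy: list of [next, previous, first_child, parent] rows,
               one per contour, where -1 means "none" (assumed layout)
    """
    outer = [box for box, row in zip(boxes, hierarchy) if row[3] == -1]
    # Sort by x so the characters come out in reading order.
    return sorted(outer, key=lambda box: box[0])


# Made-up example: contour 0 is the letter "e", contour 1 is the hole
# inside it (its parent is contour 0), contour 2 is a letter to its left.
boxes = [(40, 5, 20, 30), (45, 10, 8, 6), (10, 5, 18, 30)]
hierarchy = [[2, -1, 1, -1], [-1, -1, -1, 0], [-1, 0, -1, -1]]

print(outer_character_boxes(boxes, hierarchy))
# [(10, 5, 18, 30), (40, 5, 20, 30)]
```

The hole inside "e" is dropped because its parent index is 0 rather than -1, so each character ends up with exactly one box.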

Phase 3

Character classification: This task can be achieved using a CNN (convolutional neural network). The convolutional layers extract the features, and classification is done by the fully connected layers of the CNN. Since the segmentation phase yields an image of each character together with its manually added label, the dense layers of the fully connected network can be trained to do the prediction. Softmax is the activation function of the last layer, since the problem type is multi-class, single-label. The loss function used is categorical_crossentropy.
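To make the last layer and the loss concrete, here is a small pure-Python sketch of what softmax and categorical cross-entropy compute for a single character image. It is not the training code of any particular framework; the three-class setup and the raw scores are made-up values for illustration.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot, probs, eps=1e-12):
    """Loss for one sample: minus the log-probability of the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


# Hypothetical scores from the last dense layer for classes "a", "b", "c".
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
label = [1, 0, 0]  # one-hot label: the true character is "a"

predicted_class = probs.index(max(probs))
loss = categorical_crossentropy(label, probs)
print(predicted_class, round(sum(probs), 6))
```

Because exactly one entry of the one-hot label is 1, the loss reduces to the negative log of the probability assigned to the correct character, which is why this pairing suits multi-class, single-label problems.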

Instead of classifying segmented characters with this network, there is another handy approach called the Convolutional Recurrent Neural Network (CRNN). The input image is passed through the convolutional filters of the CNN, which form the feature maps, also called activation maps. The feature sequences derived from these maps are the input to a deep bidirectional LSTM, also called the recurrent layer. Several of these layers can be stacked, and hyper-parameters such as the number of layers are tweaked for better results. The per-frame predictions from the recurrent layers are passed to the transcription layer, where blank placeholders and repeated letters are removed and the actual word is output as text. There are also deep learning models such as YOLO, SSD and Faster R-CNN which can detect and recognise digits, but these are better suited to localising and recognising objects in an image. Have a look at the image below to understand the concept better.

Convolutional Recurrent Neural Network 
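The transcription step can be sketched as follows. This assumes a CTC-style greedy decoding, which is a common choice for CRNNs but is not named in this article: take the most likely symbol for each frame, merge consecutive repeats, then drop the blank symbol. The alphabet `SYMBOLS`, the helper `one_hot_frames` and the probabilities are made up for the example.

```python
def ctc_greedy_decode(frame_probs, symbols, blank_index=0):
    """Collapse per-frame predictions into a word (greedy, CTC-style)."""
    # 1. Pick the most likely symbol index for every frame.
    best = [max(range(len(p)), key=lambda i: p[i]) for p in frame_probs]

    decoded = []
    previous = None
    for index in best:
        # 2. Skip repeats of the previous frame; 3. skip the blank symbol.
        if index != previous and index != blank_index:
            decoded.append(symbols[index])
        previous = index
    return "".join(decoded)


SYMBOLS = "-helo"  # hypothetical alphabet; index 0 ("-") is the blank


def one_hot_frames(labels):
    """Build fake per-frame probabilities that favour the given symbols."""
    return [[0.9 if s == c else 0.025 for s in SYMBOLS] for c in labels]


print(ctc_greedy_decode(one_hot_frames("hh-ee-l-lloo"), SYMBOLS))  # hello
```

The blank between the two groups of "l" frames is what allows a genuine double letter to survive; without it, "hheelloo" would collapse to "helo".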

There is another method called the Multi-Task Network for Text Extraction, which is based on Mask R-CNN. It uses a convolutional backbone for extracting image features. To build this backbone, the output of the CNN is fed to a Feature Pyramid Network. The network has three heads: a bounding box regression head, a classifier head and a text recognition head. The first two heads provide the localisation. These localisations, along with the features obtained from the CNN, are fed to the third head, which recognises the characters.

Identifying the problem of overfitting with CNN or CRNN

The model is overfitting if val_loss (validation loss) goes up, or stops decreasing, while the training loss keeps falling as training goes on. To overcome this, regularisation (e.g. dropout), data augmentation and improvements to the quality of the dataset can be applied.
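A simple way to spot this pattern is to compare the two loss curves epoch by epoch. The sketch below is an illustrative check rather than part of any framework: the function name `best_epoch_before_overfitting`, the `patience` parameter and the loss values are all assumptions made for the example.

```python
def best_epoch_before_overfitting(train_loss, val_loss, patience=2):
    """Return the epoch with the best val_loss if overfitting follows it.

    Overfitting is flagged when val_loss has not improved for `patience`
    epochs while train_loss is still lower than it was at the best epoch.
    Returns None if no such pattern is found.
    """
    best_epoch = 0
    for epoch in range(1, len(val_loss)):
        if val_loss[epoch] < val_loss[best_epoch]:
            best_epoch = epoch
        elif (epoch - best_epoch >= patience
              and train_loss[epoch] < train_loss[best_epoch]):
            return best_epoch
    return None


# Made-up training history: validation loss bottoms out at epoch 2.
train = [1.0, 0.7, 0.5, 0.4, 0.3]
val = [1.1, 0.8, 0.7, 0.75, 0.9]
print(best_epoch_before_overfitting(train, val))  # 2
```

The returned epoch is a natural point at which to stop training or restore weights before trying the remedies listed above.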


In a nutshell, there are many methods to detect and extract text from an image. One may use libraries such as Tesseract or Kraken for this purpose. Alternatively, deep learning models such as CRNN (convolutional recurrent neural network) can be used to recognise the text in the image. This article has also explained the different phases that make up OCR. I hope readers have understood the article well and got some useful information about extracting text using OCR. OCR has the power and capability to create vast amounts of textual data that can then be searched. This blog has focused on recognising characters using feature extraction. There is also another technique, matrix matching, which tries to recognise the entire character by matching it against a matrix of characters stored in the software.