A Simple Guide to OCR using Pytesseract

What is OCR?

OCR is an acronym for optical character recognition. It is a widely used technology for recognizing text inside images, such as scanned documents and photos. OCR technology is used to convert virtually any kind of image containing written text (typed, handwritten, or printed) into machine-readable text data.

OCR using Pytesseract

Python-tesseract (pytesseract) is a Python wrapper for Google’s Tesseract OCR engine. It can read all image types supported by the Pillow and Leptonica imaging libraries, including JPEG, PNG, GIF, BMP, TIFF, and others, which also makes it useful as a standalone invocation script for Tesseract. When used as a script, Python-tesseract prints the recognized text rather than writing it to a file.

Installation

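Installation requires only a few Python modules, which are listed below. Because Pytesseract is only a wrapper, the Tesseract OCR engine itself must also be installed on the system and be reachable on the PATH.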

OpenCV - pip install opencv-python

Pytesseract - pip install pytesseract

Pillow - pip install pillow
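
Before moving on, it can help to confirm that everything is in place. The snippet below is a minimal sketch that uses only the Python standard library. The function name check_ocr_setup is hypothetical, and the check assumes the Tesseract engine is installed under its default executable name, tesseract, which is what Pytesseract looks for unless configured otherwise.

```python
import importlib.util
import shutil


def check_ocr_setup():
    """Report which pieces of the OCR setup can be found.

    Hypothetical helper for illustration: maps each requirement to
    True (found) or False (missing).
    """
    # Import name -> name of the package installed with pip
    modules = {
        "cv2": "opencv-python",
        "pytesseract": "pytesseract",
        "PIL": "pillow",
    }
    report = {}
    for import_name, pip_name in modules.items():
        # find_spec returns None when a top-level module is not installed
        report[pip_name] = importlib.util.find_spec(import_name) is not None

    # Assumption: the engine is installed under its default name "tesseract"
    report["tesseract executable"] = shutil.which("tesseract") is not None
    return report


for requirement, found in check_ocr_setup().items():
    print(f"{requirement}: {'OK' if found else 'MISSING'}")
```

Any entry reported as MISSING should be installed before running the OCR program in the next section.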

Procedure

The process takes only three easy steps. First, we load an image containing text, either one saved on the computer or one downloaded via a browser. Next, the image is preprocessed to clean it up: it is converted to grayscale, denoised, and binarized. The main objective of the preprocessing phase is to make it as easy as possible for the OCR system to distinguish characters and words from the background. Finally, we run the image through the OCR engine to obtain the recognized text as a string. Let’s have a look at how to build a simple program for optical character recognition in Python.


import cv2
import pytesseract

# Main function: run Tesseract OCR on an image and return the recognized text
def ocr_main(img):
    text = pytesseract.image_to_string(img)
    return text



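# Reading the image from the local directory
img = cv2.imread('sample.jpg')
if img is None:  # cv2.imread returns None (no exception) when the file cannot be read
    raise FileNotFoundError("Could not read 'sample.jpg'")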

#PREPROCESSING THE IMAGE

#Grayscaling
def get_grayscale(image):
    return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)


# Noise removal
def remove_noise(image):
    return cv2.medianBlur(image,3)
 
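# Thresholding (binarization)
# With THRESH_OTSU the threshold is computed automatically and the thresh
# argument is ignored, so 0 is passed; 255 makes the bright pixels pure white.
def thresholding(image):
    return cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]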

# Calling preprocessing functions according to user needs

img = get_grayscale(img)
img = thresholding(img)
img = remove_noise(img)

# Using OpenCV to preview the preprocessed image (uncomment to enable)

#cv2.imshow('img', img)
#cv2.waitKey(0)
#cv2.destroyAllWindows()

#Calling the main function to display result
print(ocr_main(img))
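
To make the binarization step above more concrete, here is a minimal pure-Python sketch of Otsu's method, the technique selected by the cv2.THRESH_OTSU flag. It is an illustration only, not OpenCV's actual implementation. The function names otsu_threshold and binarize are hypothetical, and the image is assumed to be a flat list of 8-bit grayscale values (0 to 255).

```python
def otsu_threshold(pixels):
    """Return the threshold that maximizes the between-class variance.

    Pixels with a value <= threshold form one class, the rest the other.
    """
    # Count how often each gray level occurs
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold = 0
    best_variance = -1.0
    weight_low = 0  # number of pixels with value <= t
    sum_low = 0     # sum of the values of those pixels
    for t in range(256):
        weight_low += histogram[t]
        sum_low += t * histogram[t]
        if weight_low == 0:
            continue  # no pixels in the dark class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break  # every pixel is already in the dark class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        # Between-class variance, up to a constant factor of 1 / total**2
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = t
    return best_threshold


def binarize(pixels, max_value=255):
    """Set pixels above the Otsu threshold to max_value and the rest to 0."""
    threshold = otsu_threshold(pixels)
    return [max_value if value > threshold else 0 for value in pixels]
```

For example, calling binarize([10, 12, 11, 13, 200, 210, 205, 198]) separates the four dark values from the four bright ones without any hand-picked threshold, which is why a manually chosen threshold has no effect once the Otsu flag is set.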

Results

After OCR, the sample image given below is successfully converted to a string.

Conclusion

OCR is a remarkable technology with a lot of potential. Tools such as Tesseract are already quite advanced, and optical character recognition is likely to keep improving in the future.