Text transformation with JavaCV and Tesseract


What is Tesseract?

Tesseract is an open-source OCR engine that was originally developed at HP and whose development was later sponsored by Google. It recognises more than 100 languages, including ideographic and right-to-left languages, and it can be trained to recognise others. It contains two OCR engines for image processing: an LSTM (Long Short-Term Memory) engine and a legacy engine that works by recognising character patterns. The LSTM engine uses a neural network that recognises text a whole line at a time for improved recognition performance, while the legacy engine relies on character signatures and string-matching algorithms. Tesseract takes images such as PNG, JPEG or TIFF files as input and can write its results as plain text, hOCR, searchable PDF or TSV; PDF files or Microsoft Word documents have to be converted to images before Tesseract can read them.

What is OCR?

Optical character recognition (OCR), also called optical character reader, is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text that a computerised system can process. Applications such as Google Lens perform this text transformation in real time. The conversion relies on AI-based OCR engines that can recognise text in many different languages. Specialised computer systems are also designed to recognise and decode scanned image data from documents and other objects. OCR has many real-life applications and has found its way onto mobile devices. With the help of JavaCV and Tesseract, we will learn how to achieve it easily.

Setup-

Install Tesseract on macOS (with the additional language data) -> brew install tesseract-lang

Install Tesseract on Linux -> sudo apt install tesseract-ocr -y

Required dependency (sbt) -> "net.sourceforge.tess4j" % "tess4j" % "4.0.0"

Tesseract Implementation-

Text transformation can be done in several ways. One such way is described below.

We only need the file path of the image on which we want to perform text transformation. Using that path, we read the image bytes and convert them into a BufferedImage. Tess4J's doOCR method can then be called on the BufferedImage, and it returns the recognised text as a String.
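
The steps above can be sketched in Java roughly as follows. This is a minimal sketch under stated assumptions, not the template's actual code: the class name OcrSketch, the OcrEngine interface and the stand-in engine in main are hypothetical names introduced here so that the example compiles without Tess4J on the classpath. With the dependency present, Tess4J's Tesseract.doOCR(BufferedImage) can be plugged in as the comment in main suggests; the tessdata path shown there is a placeholder and depends on the installation.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.imageio.ImageIO;

public class OcrSketch {

    /** Hypothetical seam for the OCR engine, so this sketch compiles without Tess4J. */
    @FunctionalInterface
    public interface OcrEngine {
        String recognize(BufferedImage image) throws Exception;
    }

    /** Decodes raw image bytes (PNG, JPEG, ...) into a BufferedImage. */
    public static BufferedImage toBufferedImage(byte[] imageBytes) throws IOException {
        BufferedImage image = ImageIO.read(new ByteArrayInputStream(imageBytes));
        if (image == null) {
            // ImageIO.read returns null when no registered reader understands the data
            throw new IOException("Unsupported or corrupt image data");
        }
        return image;
    }

    /** Reads the file at imagePath, decodes it, and hands the image to the OCR engine. */
    public static String extractText(Path imagePath, OcrEngine engine) throws Exception {
        byte[] imageBytes = Files.readAllBytes(imagePath);
        return engine.recognize(toBufferedImage(imageBytes)).trim();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage: java OcrSketch <path-to-image>");
            return;
        }
        // With Tess4J on the classpath, the engine could be wired up like this:
        //   net.sourceforge.tess4j.Tesseract tesseract = new net.sourceforge.tess4j.Tesseract();
        //   tesseract.setDatapath("/path/to/tessdata"); // placeholder: location of the language data
        //   tesseract.setLanguage("eng");
        //   OcrEngine engine = image -> tesseract.doOCR(image);
        // A stand-in engine that only reports the image size is used here
        // so that the sketch runs on its own.
        OcrEngine engine = image -> "image is " + image.getWidth() + "x" + image.getHeight();
        System.out.println(extractText(Paths.get(args[0]), engine));
    }
}
```

Run it with the image path as the only argument. Keeping the OCR call behind a small interface also means the byte-to-image step can be checked without a Tesseract installation.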

For more information on Tess4J, the Java wrapper for Tesseract, you can visit -> http://tess4j.sourceforge.net/

For a working template that uses Tesseract, click here -> https://techhub.knoldus.com/dashboard/search-result/javacv/6196c21e0b8f2406715834e7

Written by 

Mohd Alimuddin is a Software Consultant at Knoldus. He has knowledge of languages like Scala, Python, C#, HTML, CSS, and MySQL. His hobbies include watching anime, movies, having excellent food, and traveling.