Messing Around with Tesseract OCR in .NET

This article describes software I’m not really familiar with. Take this with a pinch of salt. For all I know, tomorrow I may realize the error of my ways and change my tune.

I recently found out that there’s this open-source OCR software called Tesseract, and decided to give it a try. I’m going to show you how you can set up something really quickly, and some initial results I’ve seen.

First, install Tesseract via NuGet:

tesseract-nuget

Second, to use Tesseract’s OCR facility, you need some language data, which Tesseract provides. Go to the tessdata project and download it. Technically, you only need the files starting with eng* if you’re going to OCR English text. If you download the whole repo, be patient – it’s a few hundred megabytes zipped. Make sure you put the files in a folder called tessdata, or it won’t work.

Third, get yourself some test images you can feed to the OCR. You can find some online, or scan something from a book.

Fourth, you’ll need to add a reference to System.Drawing, because the Tesseract package depends on the Bitmap class:

tesseract-system.drawing

Finally, we can get some code in. Let’s use this (needs using Tesseract;):

        static void Main(string[] args)
        {
            Console.Title = "Trying Tesseract";

            const string tessDataDir = @"tessdata";
            const string imageDir = @"image.png";

            using (var engine = new TesseractEngine(tessDataDir, "eng", EngineMode.Default))
            using (var image = Pix.LoadFromFile(imageDir))
            using (var page = engine.Process(image))
            {
                string text = page.GetText();
                Console.WriteLine(text);
                Console.ReadLine();
            }
        }

This is enough to set up Tesseract, load a file from disk, and OCR it (convert it from image to text). It may take a few seconds for the processing to happen. Now, you may be wondering what a Pix class is, or what is a page. And I’m afraid I can’t quite answer that, because there doesn’t seem to be any documentation available, so that doesn’t exactly help.

So, when trying this out, I first scanned a page from The Pragmatic Programmer and fed it to Tesseract. I can’t reproduce that for copyright reasons, but aside from some occasional incorrect character, the results were actually pretty good.

The next thing I did was feed it the Robertson image from this page. It looked okay at first glance, until I actually bothered to check the result:

tesseract-robertson

Good heavens. What on Earth is a “sriyialeeeurreneeseenu”? Shocked by these results, I read some tips about improving the quality of the output. Because it’s true, you can’t blame the OCR for mistaking a ‘c’ for an ‘e’ when they look very similar, and the image has some noise artifacts (see top of image, where there’s some faint print from another page).

To make sure I give it some nice, crisp text, I took a screenshot of the Emgu CV homepage (shown below), and fed it to the program.

tesseract-emgucv-source

See the results for yourself:

tesseract-emgucv

That’s quite an elaborate mess. It may be because I’m new to this software, but that doesn’t give me a very good impression. Maybe it’s my fault. But I can’t know that if there’s no documentation explaining how to use it.