Messing Around with Tesseract OCR in .NET

This article describes software I’m not really familiar with. Take this with a pinch of salt. For all I know, tomorrow I may realize the error of my ways and change my tune.

I recently found out that there’s this open-source OCR software called Tesseract, and decided to give it a try. I’m going to show you how you can set up something really quickly, and some initial results I’ve seen.

First, install Tesseract via NuGet:

tesseract-nuget

Second, to use Tesseract’s OCR facility, you need some language data, which Tesseract provides. Go to the tessdata project and download it. Technically, you only need the files starting with eng* if you’re going to OCR English text. If you download the whole repo, be patient – it’s a few hundred megabytes zipped. Make sure you put the files in a folder called tessdata, or it won’t work.

Third, get yourself some test images you can feed to the OCR. You can find some online, or scan something from a book.

Fourth, you’ll need to add a reference to System.Drawing, because the Tesseract package depends on the Bitmap class:

tesseract-system.drawing

Finally, we can get some code in. Let’s use this (needs using Tesseract;):

        static void Main(string[] args)
        {
            Console.Title = "Trying Tesseract";

            const string tessDataDir = @"tessdata";
            const string imageDir = @"image.png";

            using (var engine = new TesseractEngine(tessDataDir, "eng", EngineMode.Default))
            using (var image = Pix.LoadFromFile(imageDir))
            using (var page = engine.Process(image))
            {
                string text = page.GetText();
                Console.WriteLine(text);
                Console.ReadLine();
            }
        }

This is enough to set up Tesseract, load a file from disk, and OCR it (convert it from image to text). It may take a few seconds for the processing to happen. Now, you may be wondering what a Pix class is, or what is a page. And I’m afraid I can’t quite answer that, because there doesn’t seem to be any documentation available, so that doesn’t exactly help.

So, when trying this out, I first scanned a page from The Pragmatic Programmer and fed it to Tesseract. I can’t reproduce that for copyright reasons, but aside from some occasional incorrect character, the results were actually pretty good.

The next thing I did was feed it the Robertson image from this page. It looked okay at first glance, until I actually bothered to check the result:

tesseract-robertson

Good heavens. What on Earth is a “sriyialeeeurreneeseenu”? Shocked by these results, I read some tips about improving the quality of the output. Because it’s true, you can’t blame the OCR for mistaking a ‘c’ for an ‘e’ when they look very similar, and the image has some noise artifacts (see top of image, where there’s some faint print from another page).

To make sure I give it some nice, crisp text, I took a screenshot of the Emgu CV homepage (shown below), and fed it to the program.

tesseract-emgucv-source

See the results for yourself:

tesseract-emgucv

That’s quite an elaborate mess. It may be because I’m new to this software, but that doesn’t give me a very good impression. Maybe it’s my fault. But I can’t know that if there’s no documentation explaining how to use it.

4 thoughts on “Messing Around with Tesseract OCR in .NET”

  1. Thanks for the post.
    I found exactly the same issue when I attempted to use Tesseract for .NET!

    Tesseract appears to be a very old library, written in C / C++ and the .NET versions are usually just wrappers around this native code using some form of interop. This wouldn’t be such a problem except as soon as you see the bad results and look at “training” tesseract, you end up on sites like these: https://code.google.com/p/tesseract-ocr/wiki/Documentation
    wasting hours and hours reading highly technical documentation – most of it not helpful. You start to realise you are going to need to become an expert in OCR and image processing to get consistent results with this thing.

    I realised a couple of days in to using Tesseract, that ultimately, if you start of with a rubbish image, only image processing can attempt to make it higher quality for OCR purposes and that right there is a whole field of study. I did some simple binarisation, but that seemingly just made things worse.. If you start with the perfect image, Tesseract may yield perfect results, but that’s highly unlikely.. It will most likely require “training” for your specific use case, and learning how to Train Tesseract given the present state of the documentation is not fun.. so much so, that I am almost ashamed to say, I gave up.

    1. I’ve done a lot with tesseract, and you’re right, you need to preprocess the image for good results. I’ve used ImageMagicak to do some processing involving a sigmoidal contrast and some other techniques before converting to one-bit. One thing I did a while ago was to build a simple image touch-up script that evolved using algorithms: you take a number of scanned in images of differing qualities, contrasts, and brightnesses, manually copy the text to a text file (I.e., type it up), then run them through a number of different preprocessing algorithms in ImageMagick and OCR the results. Compare the known good output to the OCR output and adjust preprocessing paramters using a GA. Continue until satisfactory results. This method, combined with some frequency-domain halftone removal, was good enough to accurately OCR text from documents that werent supposed to be copyable (where things like “void” or “copy” show up as water mark halftimes on copies).

Leave a Reply

Your email address will not be published. Required fields are marked *