December 18, 2015 - Gigi Labs

This article describes software I’m not really familiar with. Take this with a pinch of salt. For all I know, tomorrow I may realize the error of my ways and change my tune.

I recently found out that there’s this open-source OCR software called Tesseract, and decided to give it a try. I’m going to show you how you can set up something really quickly, and some initial results I’ve seen.

First, install Tesseract via NuGet:

Second, to use Tesseract’s OCR facility, you need some language data, which Tesseract provides. Go to the tessdata project and download it. Technically, you only need the files starting with eng* if you’re going to OCR English text. If you download the whole repo, be patient – it’s a few hundred megabytes zipped. Make sure you put the files in a folder called tessdata, or it won’t work.

Third, get yourself some test images you can feed to the OCR. You can find some online, or scan something from a book.

Fourth, you’ll need to add a reference to System.Drawing, because the Tesseract package depends on the Bitmap class:

Finally, we can get some code in. Let’s use this (needs using Tesseract;):

        static void Main(string[] args)
        {
            Console.Title = "Trying Tesseract";

            const string tessDataDir = @"tessdata";
            const string imageDir = @"image.png";

            using (var engine = new TesseractEngine(tessDataDir, "eng", EngineMode.Default))
            using (var image = Pix.LoadFromFile(imageDir))
            using (var page = engine.Process(image))
            {
                string text = page.GetText();
                Console.WriteLine(text);
                Console.ReadLine();
            }
        }

This is enough to set up Tesseract, load a file from disk, and OCR it (convert it from image to text). It may take a few seconds for the processing to happen. Now, you may be wondering what a Pix class is, or what is a page. And I’m afraid I can’t quite answer that, because there doesn’t seem to be any documentation available, so that doesn’t exactly help.

So, when trying this out, I first scanned a page from The Pragmatic Programmer and fed it to Tesseract. I can’t reproduce that for copyright reasons, but aside from some occasional incorrect character, the results were actually pretty good.

The next thing I did was feed it the Robertson image from this page. It looked okay at first glance, until I actually bothered to check the result:

Good heavens. What on Earth is a “sriyialeeeurreneeseenu”? Shocked by these results, I read some tips about improving the quality of the output. Because it’s true, you can’t blame the OCR for mistaking a ‘c’ for an ‘e’ when they look very similar, and the image has some noise artifacts (see top of image, where there’s some faint print from another page).

To make sure I give it some nice, crisp text, I took a screenshot of the Emgu CV homepage (shown below), and fed it to the program.

See the results for yourself:

That’s quite an elaborate mess. It may be because I’m new to this software, but that doesn’t give me a very good impression. Maybe it’s my fault. But I can’t know that if there’s no documentation explaining how to use it.

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Gigi Labs

Daily Archives: December 18, 2015

Messing Around with Tesseract OCR in .NET

"You don't learn to walk by following rules. You learn by doing, and by falling over." — Richard Branson