This article describes software I’m not really familiar with. Take this with a pinch of salt. For all I know, tomorrow I may realize the error of my ways and change my tune.
I recently found out that there’s this open-source OCR software called Tesseract, and decided to give it a try. I’m going to show you how you can set up something really quickly, and some initial results I’ve seen.
First, install the Tesseract package via NuGet, for example by running Install-Package Tesseract in the Package Manager Console.
Second, to use Tesseract’s OCR facility, you need some language data, which the Tesseract project provides. Go to the tessdata project and download it. Technically, you only need the files whose names start with eng if you’re going to OCR English text. If you download the whole repo, be patient – it’s a few hundred megabytes zipped. Make sure you put the files in a folder called tessdata that the program can find (the code below uses a path relative to the working directory), or it won’t work.
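Since a missing or misplaced tessdata folder is an easy mistake to make, it can help to check for the language file before doing anything else. This is only a minimal sketch: the TessDataCheck class and its HasLanguage method are hypothetical helpers of my own, not part of the Tesseract package, and it assumes the English data file is named eng.traineddata.

```csharp
using System;
using System.IO;

static class TessDataCheck
{
    // Hypothetical helper: returns true if the folder contains the
    // <language>.traineddata file (e.g. eng.traineddata for English).
    public static bool HasLanguage(string tessDataDir, string language)
    {
        string expected = Path.Combine(tessDataDir, language + ".traineddata");
        return File.Exists(expected);
    }

    static void Main()
    {
        // Same relative folder name as used in the rest of this article.
        const string tessDataDir = "tessdata";

        if (HasLanguage(tessDataDir, "eng"))
            Console.WriteLine("Found English language data.");
        else
            Console.WriteLine("eng.traineddata not found in: " + Path.GetFullPath(tessDataDir));
    }
}
```

Printing the full path is deliberate: because the folder is relative, it shows where the program is actually looking.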
Third, get yourself some test images you can feed to the OCR. You can find some online, or scan something from a book.
Fourth, you’ll need to add a reference to System.Drawing, because the Tesseract package depends on the Bitmap class.
Finally, we can get some code in. Let’s use this (it needs using Tesseract; at the top of the file):
    using System;
    using Tesseract;

    class Program
    {
        static void Main(string[] args)
        {
            Console.Title = "Trying Tesseract";

            // Folder containing the language data, relative to the working directory.
            const string tessDataDir = @"tessdata";
            // Path of the image file to OCR.
            const string imagePath = @"image.png";

            using (var engine = new TesseractEngine(tessDataDir, "eng", EngineMode.Default))
            using (var image = Pix.LoadFromFile(imagePath))
            using (var page = engine.Process(image))
            {
                string text = page.GetText();
                Console.WriteLine(text);
                Console.ReadLine(); // keep the console window open
            }
        }
    }
This is enough to set up Tesseract, load a file from disk, and OCR it (convert it from image to text). The processing may take a few seconds. Now, you may be wondering what the Pix class is, or what a page is. I’m afraid I can’t quite answer that, because there doesn’t seem to be any documentation available.
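What the code itself does show is that Pix holds the loaded image and that the page is whatever comes back from processing it. As a small sketch of what else the page might offer, the following assumes the wrapper’s Page class exposes a GetMeanConfidence() method returning a value between 0 and 1. That is my reading of the wrapper rather than documented fact, so check it against the version you installed. The ConfidenceDemo class name is just for illustration.

```csharp
using System;
using Tesseract;

class ConfidenceDemo
{
    static void Main()
    {
        using (var engine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default))
        using (var image = Pix.LoadFromFile(@"image.png"))
        using (var page = engine.Process(image))
        {
            // Assumption: GetMeanConfidence() returns a float between 0 and 1.
            float confidence = page.GetMeanConfidence();
            Console.WriteLine("Mean confidence: {0:P0}", confidence);
            Console.WriteLine(page.GetText());
        }
    }
}
```

A low number here would at least be a hint that the text deserves a closer look before you trust it.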
So, when trying this out, I first scanned a page from The Pragmatic Programmer and fed it to Tesseract. I can’t reproduce that for copyright reasons, but aside from the occasional incorrect character, the results were actually pretty good.
The next thing I did was feed it the Robertson image from this page. It looked okay at first glance, until I actually bothered to check the result:
Good heavens. What on Earth is a “sriyialeeeurreneeseenu”? Shocked by these results, I read some tips about improving the quality of the output. To be fair, you can’t blame the OCR for mistaking a ‘c’ for an ‘e’ when they look very similar and the image has some noise artifacts (see the top of the image, where there’s some faint print from another page).
To make sure I give it some nice, crisp text, I took a screenshot of the Emgu CV homepage (shown below), and fed it to the program.
See the results for yourself:
That’s quite an elaborate mess. It may be because I’m new to this software, but that doesn’t give me a very good impression. Maybe it’s my fault. But I can’t know that if there’s no documentation explaining how to use it.