Project Description
IFilter plugin for the Microsoft Indexing Service (and SharePoint in particular) to index and search image files (including TIFF, PDF, JPEG, BMP...) using OCR technology.

For additional information, please visit the article at CodeProject.

Introduction

This article describes how to set up indexing of image files (including TIFF, PDF, JPEG, BMP...) using OCR technology. The indexing described below uses Microsoft IFilter technology and as such is not specific to SharePoint: it can be used with any product that uses Microsoft indexing, such as Microsoft Search, Desktop Search, SQL Server search, and, through plug-ins, Google Desktop Search. I, however, use it with Microsoft Windows SharePoint Services 2003. For those other products, the registration may need to be slightly different.

Background

One of the projects I was working on required storing old documents that had been scanned into PDF files. A separate team of people was responsible for providing tags for a search engine so that those image documents could be found. The whole process was clumsy, labor-intensive, and error-prone. That was what started me on my exploration path.

OCR

The first search I fired off was for Open Source OCR products. Pretty quickly, I narrowed it down to TESSERACT. Tesseract is an orphaned brainchild of HP, which worked on it from 1985 to 1995. It was then released as Open Source, and now, if I understand it correctly, Google is working on it. With credentials like that, it's no wonder that Tesseract earns some of the highest marks for OCR recognition and accuracy. After downloading and struggling just a bit, I got Tesseract to work. The struggling part was that the home page claims that its base input format is a TIFF file. Maybe my TIFFs were bad, but I was able to get it to work only for BMP files.

Image files conversion

So now that I have an OCR that can convert BMP files into text, how do I get text out of the image PDF files? One more search, and I settled on ImageMagick. This is another wonderful Open Source utility that can convert almost any file into an image. It worked out of the box for converting TIFF files into bitmaps, but to convert PDF files, it requires GhostScript.

Dealing with text PDFs

With that utility installed, I was cooking - I could convert any file (in particular PDF and TIFF) into a bitmap, and then extract the text out of the bitmap. The only consideration was to somehow treat PDF files containing text differently - after all, OCR is very computationally intensive and somewhat error-prone even with perfect image quality and resolution. So another quick search, and I had PDFTOTEXT - thank God for Open Source! With it, I can pull text out of a PDF in the blink of an eye. It returns nothing for pure image PDFs, but I already have a solution for that!

Batch process

It took another 15 minutes to set up a batch script to automate the process:
  • Check the file extension
  • If the file is a PDF file
    • try to extract text out of it
    • if there is more than a certain amount of text in the file - done!
    • otherwise, convert the first page into a bitmap
    • run OCR on the bitmap
  • For any other file type
    • convert the file into a bitmap
    • run OCR on the bitmap
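
The steps above can be sketched as a small decision function. This is only an illustration in Python, not the contents of OCR.BAT: the function name `plan_steps`, the step labels, and the `MIN_TEXT_CHARS` threshold are assumptions made for the example (the article only says "a certain amount of text").

```python
from pathlib import PureWindowsPath

# Hypothetical threshold: the article does not state the actual value.
MIN_TEXT_CHARS = 50


def plan_steps(path, extracted_chars=0):
    """Return the ordered list of steps the batch flow would take for one file.

    extracted_chars is the number of characters the PDF text extraction
    produced; it is ignored for non-PDF files.
    """
    extension = PureWindowsPath(path).suffix.lower()
    if extension == ".pdf":
        steps = ["extract text"]
        if extracted_chars > MIN_TEXT_CHARS:
            return steps  # text PDF: no OCR needed
        # Image-only PDF: fall back to OCR on the first page.
        return steps + ["convert first page to bitmap", "run OCR"]
    # Any other file type goes straight to conversion and OCR.
    return ["convert to bitmap", "run OCR"]
```

For example, `plan_steps(r"c:\temp\xyz.pdf", extracted_chars=2000)` stops after text extraction, while the same call with `extracted_chars=0` continues to the bitmap conversion and OCR steps.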

Once you unzip the attached project, check out the bin\OCR.BAT file. It creates a temporary text file in the same directory as your source file, named after the source file with the '.txt' extension appended.

For example:
ocr.bat c:\temp\xyz.pdf

will generate the c:\temp\xyz.pdf.txt file.

IFilter interface

So now I have a simple batch process to extract text out of any image and/or PDF files. To make it usable in SharePoint (or any other product that uses Microsoft Indexing technology), I need to create an IFilter component. This is a plug-in that Microsoft Indexing uses to extract text from specialized file formats (e.g., Office, PDF, ...).

Here, there was a right way and a quick way. And I have to admit my guilt - I chose the quick way. The thing is that all the components I use have C/C++ APIs, and to do it right, I should pull everything together and create a single component. Instead, I decided to just run the batch process I set up earlier. This is somewhat slower, but at least I don't have to worry about memory leaks and page faults from code I'm not familiar with.

So I downloaded Microsoft Platform SDK, got SmpFilt to work, changed GUIDs, got it to run my OCR.BAT - and here you have it - my own OCR plug-in to Microsoft Indexing.
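
To illustrate the "quick way", here is a minimal Python sketch of the shell-out pattern: run the OCR command on a file, then read back the '.txt' file it leaves next to the source. The real filter is a C++ COM component, so this is not its code. The function name `extract_text`, the injectable `run` parameter, the UTF-8 decoding, and the cleanup of the temporary file are all assumptions made for the example.

```python
import os
import subprocess


def extract_text(source_path, ocr_command=("ocr.bat",), run=subprocess.run):
    """Run the OCR command on source_path and return the text it produced.

    Assumes the command writes its output to source_path + ".txt",
    as described for OCR.BAT in the article.
    """
    run(list(ocr_command) + [source_path], check=True)
    text_path = source_path + ".txt"
    try:
        # Encoding is an assumption; undecodable bytes are replaced.
        with open(text_path, "r", encoding="utf-8", errors="replace") as handle:
            return handle.read()
    finally:
        # Assumption: the caller removes the temporary file after reading it.
        if os.path.exists(text_path):
            os.remove(text_path)
```

The `run` parameter exists only so the external process can be swapped out when trying the sketch on a machine without the OCR tools installed.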

Here, I'm skipping over some of the pain and sweat of debugging an IFilter, dealing with multi-byte to single byte strings and back, and all the fun that made Microsoft COM development so "loved" around the world. But the purpose of the article is not to teach how to do COM in C++ or how to develop an IFilter.

Once you have your filter done and registered, the Platform SDK contains two utilities to test IFilter: filtdump.exe and filtreg.exe - you can play with them to make sure your filter is registered and works correctly.

SharePoint Registration

The Microsoft IFilter template will do the appropriate registration for the Indexing Service, but SharePoint requires additional entries. In the download, there is a bin\wss_reg.reg file that will register the SharePoint-related entries. I would encourage you, however, to create a backup of the HKLM\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0 key before you try to register the wss_reg file - just in case, you know.

By the way, since I don't have an installer, the DLL (OcrFilt.dll) also needs to be registered manually.
regsvr32 OcrFilt.dll

Installation Summary

  • Download and install GhostScript
  • Download and install VC++ 2008 redistributable
  • The ImageMagick convert utility is included in the download - but you can get the latest version (update the path in OCR.BAT if the location is different)
  • TESSERACT is included in the download - but you can get the latest version (update the path in OCR.BAT if the location is different)
  • PDFTOTEXT is included in the download - but you can get the latest version (update the path in OCR.BAT if the location is different)
  • Create a backup of the HKLM\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0 Registry entry
  • Register bin\OcrFilt.dll and bin\wss_reg.reg
  • Recycle the Search service and run stsadm -o spsearch -action fullcrawlstart to force a rebuild of the SharePoint search database

Other Notes

  • I use WSS SP2. If your version of SharePoint is different, your WSS registration entries may be different. Please check if you have the HKLM\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Applications\e5ecafdd-0ed4-42fa-b663-c38046ae5ec8 key. If not, then your wss_reg.reg file may need to be updated.
  • SharePoint stores a numbered list of extensions it is able to process in HKLM\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Applications\e5ecafdd-0ed4-42fa-b663-c38046ae5ec8\Gather\Search\Extensions\ExtensionList. The entry for PDF should not be there, and it needs to be registered with the next unique number. Most of the time, it should be 38. Please make sure that your numbering goes up to 37. If there is anything fishy, please review and update wss_reg as needed.
  • After you install the filter, you need to re-index the existing contents or remove the files and add them again. Until indexing is complete, you will not be able to find your entries. OcrFilt.dll creates an entry in the Event Log for each file it needs to index, so you can follow the progress. I had to recycle the service and then use stsadm. Removing and adding files back also works.
  • For performance reasons, only the first page of the PDF/TIFF file is OCR-ed. There are additional ImageMagick utilities to combine multiple images together before OCR-ing if you want to OCR the whole document.
  • OCR.BAT will try to create a text file in the same folder your input image is in. As such, the indexing process should have appropriate privileges to that folder. Since this is where WSS creates the temp file, it should not be a problem; however, since rights issues are so difficult to troubleshoot, it's something to keep in mind.
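
The ExtensionList check from the notes above can be sketched as a small Python helper. It works on a plain dictionary rather than reading the registry, so it can be tried anywhere. The function name `check_extension_list` and the assumed layout (value names are the numbers, value data is the extension) are assumptions for the example, so compare them against your own ExtensionList key before relying on the result.

```python
def check_extension_list(entries, new_extension="pdf"):
    """Inspect a numbered extension list before registering a new extension.

    entries maps each number to its extension, e.g. {1: "doc", 2: "txt"}.
    Returns whether the numbering is contiguous from 1, whether the new
    extension is already present, and the next unused number.
    """
    numbers = sorted(entries)
    contiguous = numbers == list(range(1, len(numbers) + 1))
    wanted = new_extension.lower().lstrip(".")
    already_registered = any(
        str(value).lower().lstrip(".") == wanted for value in entries.values()
    )
    next_number = max(numbers) + 1 if numbers else 1
    return {
        "contiguous": contiguous,
        "already_registered": already_registered,
        "next_number": next_number,
    }
```

With 37 existing entries numbered 1 through 37 and no PDF entry, the helper reports a contiguous list and 38 as the next number, which matches the common case described above. Any other result is a hint to review wss_reg before importing it.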

Other applications

Even though I currently use it only with SharePoint, there are other very interesting applications for this solution:
  • Configure the IFilter as a plugin for SQL Server, to allow indexing of PDF files stored in BLOB columns.
  • Structured documents. The ImageMagick convert utility that I use is able to extract part of an image. It would be pretty easy to change the batch file to extract, for example, the portion of a scanned bill that contains the name and date, to organize filing for the billing department.
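
As a sketch of the structured-documents idea, the Python helper below builds an ImageMagick command line that cuts a rectangular region out of an image using the `-crop` option with a `WxH+X+Y` geometry, followed by `+repage` to reset the canvas offset. The function name `crop_command`, the file names, and the region coordinates are hypothetical. The executable name `convert` is also an assumption and may need to be a full path, as in OCR.BAT.

```python
def crop_command(source, target, width, height, x, y, convert_exe="convert"):
    """Build an ImageMagick command that extracts one region of an image.

    The region is width x height pixels, offset (x, y) from the top-left
    corner of the source image.
    """
    if width <= 0 or height <= 0 or x < 0 or y < 0:
        raise ValueError("region must have positive size and non-negative offset")
    geometry = "%dx%d+%d+%d" % (width, height, x, y)
    return [convert_exe, source, "-crop", geometry, "+repage", target]
```

The resulting list can be passed to `subprocess.run`, and the cropped bitmap can then be fed to the OCR step in place of the full page.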

TO DO

Given the resources, the following things would be nice to do:
  • Provide configuration instructions for SQL Server indexing
  • Implement an installer
  • Recompile Tesseract to include native handling for TIFF files
  • Pull all the libraries together and create a single process instead of starting new processes for image conversion and OCR

Component Licensing

Even though all the components are Open Source, you might want to verify that your company's legal department has no problem with each component's licensing requirements.

Donate
Your donations are kindly appreciated. If you wish to have your information listed on this site, please specify it in the PayPal memo field.

Last edited Nov 25, 2009 at 2:39 PM by gstolarov, version 13