Thanks for the feedback. I uninstalled the PDF reader completely and things worked much better. We are getting quite nice results from the OCR of image-only PDF files; it's cool to see a page that was scanned upside down being correctly OCRed and included in the index.
We modified the procedure and are now using Ghostscript to convert the PDFs to multipage TIFFs, and then using Microsoft Office Document Imaging (MODI) to OCR the TIFF. This worked better for us as we wanted to OCR the complete documents, and we found that MODI gave better results on a number of our test documents.
Our current ocr.bat looks like:
rem Convert the extension to upper case so the filename comparisons below match regardless of case
FOR %%i IN ("a=A" "b=B" "c=C" "d=D" "e=E" "f=F" "g=G" "h=H" "i=I" "j=J" "k=K" "l=L" "m=M" "n=N" "o=O" "p=P" "q=Q" "r=R" "s=S" "t=T" "u=U" "v=V" "w=W" "x=X" "y=Y" "z=Z") DO CALL SET "ext=%%ext:%%~i%%"
if "%ext%"==".PDF" goto :ConvertPdf
if "%ext%"==".TIF" goto :Converttiff
if "%ext%"==".TIFF" goto :Converttiff
:Converttiff
rem MS OCR modifies the TIFF to store the new text within it, so copy it locally to make sure we don't inadvertently modify an original doc.
copy /y %inp% %tmpFile%.ocrtmp.tiff
"C:\Program Files\Common Files\Microsoft Shared\MODI\12.0\mspview" -o %tmpFile%.ocrtmp.tiff
C:\ocrpdfs\filtdump -b %tmpFile%.ocrtmp.tiff > %inp%.txt
goto :EOF
:ConvertPdf
rem Try plain text extraction first; only OCR when the PDF yields almost no text.
"%exes%pdftotext.exe" %inp% %inp%.txt
rem Zero-pad the size of the extracted text to 9 digits so the comparison below is digit by digit.
for %%i in (%inp%.txt) do set fs=000000000%%~zi
set fs=%fs:~-9%
if "%fs%" GTR "000000100" goto :EOF
rem Image-only PDF: render it to a multipage TIFF with Ghostscript, then OCR the TIFF.
"C:\Program Files\gs\gs8.64\bin\gswin32c.exe" -q -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r300 -sOutputFile=%inp%.tif %inp%
"C:\Program Files\Common Files\Microsoft Shared\MODI\12.0\mspview" -o %inp%.tif
C:\ocrpdfs\filtdump -b %inp%.tif > %inp%.txt
goto :EOF
"%imgPath%convert.exe" -density 150 %inp% "%tmpFile%-00.bmp"
"%exes%tesseract.exe" "%tmpFile%-00.bmp" %inp% -l eng
Areas for improvement:
- Having to use filtdump is not great, as it means configuring things so that filtdump uses the standard TIFF text extractor while Search Server uses a different one that calls ocr.bat, but it works well.
- The batch file needs a bit of a tidy-up.
- It would be nice to somehow get stats on the conversion, e.g. conversion confidence, so that we know how well it is OCRing.
- Look at more efficient ways to perform this process.
- Check whether Search Server copies files to a temp location to process them or indexes them in place; if the former, then we don't need to copy the TIFFs locally. I did notice that when I tested ocr.bat directly it updated the target file.
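One possible direction for the tidy-up, as a minimal sketch rather than a drop-in replacement: cmd's "if /i" compares strings case-insensitively, so the dispatch on the extension would not need the upper-casing loop at all. The sketch assumes the file to process arrives as the first argument (%1), which is not shown in the script above, and the echo lines are placeholders standing in for the real conversion commands.

```batch
@echo off
setlocal
rem Assumption: the file to process is passed as %1.
set "inp=%~1"
rem %~x1 expands to the extension of %1, including the leading dot.
set "ext=%~x1"

rem /i makes the comparison case-insensitive, so .pdf, .PDF and .Pdf all match.
if /i "%ext%"==".pdf"  goto :ConvertPdf
if /i "%ext%"==".tif"  goto :ConvertTiff
if /i "%ext%"==".tiff" goto :ConvertTiff
rem Any other extension: nothing to do in this sketch.
goto :EOF

:ConvertTiff
rem Placeholder for the copy / mspview / filtdump steps.
echo Would OCR TIFF "%inp%"
goto :EOF

:ConvertPdf
rem Placeholder for the pdftotext / Ghostscript / mspview / filtdump steps.
echo Would extract text from or OCR PDF "%inp%"
goto :EOF
```

Each branch ends with "goto :EOF" so that one section never falls through into the next.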
Microsoft are so close to providing this capability; it would be nice if they would package it all together.
Thanks for all your work to date.