Works for FiltDump but not Search server

Feb 15, 2010 at 4:36 PM

Hi,

Nice approach. I have set it up and am using it to OCR PDFs. It works great from the command line using filtdump, but I just can't seem to get Search Server 2008 to process the documents. I have applied all the registry settings, restarted the server, run a full crawl, etc. Search Server continues to index PDF files, so some setting must not be pointing to the right place.

I can't see any entries in the Event Log, and I put an echo %CmdCmdLine%  >> c:\ocrpdf.log at the top of ocr.bat to see whether it is firing. When I use filtdump, this appends to the file, but not when I perform a crawl.
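
A slightly richer version of that logging line might help narrow things down. This is only a sketch: the log path is the one mentioned above, and the choice of fields is just an illustration. It records a timestamp, the account the batch file runs under, and the full path of the file passed in as the first argument.

```batch
rem Diagnostic sketch for the top of ocr.bat (the fields logged are illustrative).
rem Redirection is written first so a trailing digit in the text is never read as a file handle.
>>"c:\ocrpdf.log" echo [%date% %time%] user=%USERDOMAIN%\%USERNAME% file="%~f1"
>>"c:\ocrpdf.log" echo   cmdline=%CmdCmdLine%
```

If nothing is written during a crawl, one possibility is that the batch file is never called. Another is that the crawl runs under a service account that cannot write to c:\, so it may be worth pointing the log at a folder that account can definitely write to before drawing conclusions.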

Prior to trying your approach I had PDF reader installed. Is it necessary to uninstall/unregister that DLL?

Is there any way to debug or enable detailed logging on the search server, the IFilter, or the search process?

Any tips would be appreciated.

 

Coordinator
Feb 19, 2010 at 6:01 AM

Please double-check the registry entries in wss_reg.reg. It is quite possible that your version of SharePoint uses a different GUID, or that the index for the PDF/TIFF plugins doesn't match what is in the registry (because you had some other components installed). I also had quite a hard time getting search to work all together. If you think the wss_reg.reg file is OK, please make sure that search works by posting a simple text file to the document library and checking that you can find it. If there are problems, I would suggest reinstalling WSS and making sure search works first, before trying anything else.

Feb 22, 2010 at 10:09 AM

Thanks for the feedback. I uninstalled PDF reader completely and things worked much better. We are getting quite nice results from the OCR of image-only PDF files; it's cool to see a page that was scanned upside down being correctly OCRed and included in the index.

We modified the procedure and are using Ghostscript to convert the PDFs to multipage TIFFs and then using Microsoft Imaging to OCR the TIFF. This worked better for us because we wanted to OCR the complete documents, and we also found that Microsoft Imaging gave better results on a number of our test documents.

Our current ocr.bat looks like:

setlocal
set exes=%~dp0
set imgPath=%exes%
set ext=%~x1
Rem convert the extension to upper case for correct filename comparison
FOR %%i IN ( "a=A" "b=B" "c=C" "d=D" "e=E" "f=F" "g=G" "h=H" "i=I" "j=J" "k=K" "l=L" "m=M" "n=N" "o=O" "p=P" "q=Q" "r=R" "s=S" "t=T" "u=U" "v=V" "w=W" "x=X" "y=Y" "z=Z") DO CALL SET "ext=%%ext:%%~i%%"
set folder=%~dp1
set inp="%~1"
set tmpFile=c:\temp\OCR%RANDOM%%random%
if "%ext%"==".PDF" goto :ConvertPdf
if "%ext%"==".TIF" goto :Converttiff
if "%ext%"==".TIFF" goto :Converttiff
goto :ConvertFiles
:converttiff
rem	MS OCR modifies the TIFF to store the new text within it, so copy it locally to make sure we don't inadvertently modify an original doc.
	copy /y %inp% %tmpFile%.ocrtmp.tiff
	"C:\Program Files\Common Files\Microsoft Shared\MODI\12.0\mspview" -o %tmpFile%.ocrtmp.tiff
	C:\ocrpdfs\filtdump -b %tmpFile%.ocrtmp.tiff > %inp%.txt
	del %tmpFile%.ocrtmp.tiff
	goto :EOF

:ConvertPdf
	"%exes%pdftotext.exe" %inp% %inp%.txt
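	rem If pdftotext extracted more than 100 bytes, the PDF presumably already has a text layer, so skip OCR.
	rem The size is zero-padded to 9 digits so both sides of the comparison have the same length.
	for %%i in (%inp%.txt) do set fs=000000000%%~zi
	set fs=%fs:~-9%
	if %fs% GTR 000000100 goto :EOF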
	"C:\Program Files\gs\gs8.64\bin\gswin32c.exe"  -q -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r300 -sOutputFile=%inp%.tif %inp%
	"C:\Program Files\Common Files\Microsoft Shared\MODI\12.0\mspview" -o %inp%.tif
	C:\ocrpdfs\filtdump -b %inp%.tif > %inp%.txt
	del %inp%.tif
	goto :EOF
:ConvertFiles
	"%imgPath%convert.exe" -density 150 %inp% "%tmpFile%-00.bmp"
	"%exes%tesseract.exe" "%tmpFile%-00.bmp" %inp% -l eng
goto :EOF

Areas for improvement:

  • Having to use filtdump is not ideal: it means configuring things so that filtdump uses the standard TIFF text extractor while Search Server uses a different one that calls ocr.bat. It works well, though.
  • The batch file needs a bit of tidying up.
  • It would be nice to get stats on conversion, e.g. conversion confidence, so that we know how well it is OCRing.
  • Look at more efficient ways to perform this process.
  • Check whether Search Server copies files to a temp location for processing or indexes them in place. If the former, we don't need to copy the TIFFs locally. I did notice that when I tested ocr.bat directly it updated the target file.
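
A rough way to start on the stats and temp-location points could be one extra logging line in ocr.bat. This is a sketch under assumptions: it relies on %inp% holding the quoted input path and on the extracted text ending up in %inp%.txt, as in the script above, and c:\ocrstats.log is a hypothetical log location. The byte count of the text output is only a crude proxy for how much was recognised, not a real OCR confidence figure.

```batch
rem Sketch only: place this line just before each "goto :EOF" in ocr.bat.
rem Assumes %inp% is the quoted input path and the extracted text is in %inp%.txt.
rem c:\ocrstats.log is a hypothetical location; pick one the search account can write to.
rem Logs: timestamp, full input path as received, size in bytes of the extracted text.
for %%i in (%inp%.txt) do >>"c:\ocrstats.log" echo %date% %time%,"%~f1",%%~zi
```

After a crawl, the logged input paths should show whether Search Server hands the filter the original location or a temporary copy, and a size that is empty or very small would flag documents where little or no text came out.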

 

Microsoft are so close to providing this capability, it would be nice if they would package it together.

 

Thanks for all your work done to date

John