[lug] OT: Scanning and OCRing

Wed Feb 14 19:34:41 MST 2007

Hey blug,

For a research project I need to scan about 4 master's theses, 2 books,
and maybe a dozen journal articles (that I can't get digital copies of
-- some came from microfiche/microfilm and are pretty grungy). I don't
know how long these theses are yet, but all together I imagine it will
be on the order of 1000 pages.

According to Ben Bunnell, when Larry Page and his colleagues did a
proof of concept for Google Book Search, they were able to scan and OCR an
entire book in ~40 minutes -- though he didn't give any details as to
how [1]. I'm sure they're much faster at it now, but if I can get
anywhere close to this I'll be happy.

I don't know what software and hardware they used (I probably couldn't
afford it anyway :), but I'm wondering if there's anyone here who has
ever digitized on this kind of scale who can point me in the right
direction? I've already had a look at the FOSS OCR software out
there. ocrad [2] and tesseract [3] both seem good at reading the
characters, but they only have plain text output. I'd like the software,
ideally, to export it straight to ps/pdf or something similar
retaining the original fonts/formatting -- but I'm having a hard time
telling the difference between the commercial products. I would also
be interested in any hardware suggestions people might have provided
it's not too expensive.

All of the text is in English and the number of figures in the text is
small.

- Craig

[1] http://video.google.com/videoplay?docid=-8762514765927564293
[2] http://www.gnu.org/software/ocrad/ocrad.html
[3] http://google-code-updates.blogspot.com/2006/08/announcing-tesseract-ocr.html