Wednesday 26 September 2012

Book Scanning

I have an old book that I am mining data from. The problem is that paper is a pain: there is no Ctrl+F function, and unless the index includes the terms you are after (which, in this case, it doesn't), searching becomes slow and tedious.

The solution? Scan it.
The problem? How to scan it.

Unlike loose printed, typed or even handwritten documents, a book is not easy to pull apart and feed through an automatic scanner, especially when it is old, out of print and quite valuable. Most book scanners (including Google's) use cameras instead. This is my setup:

A very high-tech setup.

It's all very simple: a camera, a tripod to hold the camera still, a remote shutter button to snap the pictures, lots of lamps for even illumination and a data connection to the computer so that I didn't fill up the memory card too fast.


It was a pretty chunky book (801 pages) and it took a total of 489 shots (including reshoots of slightly out-of-focus pages) to capture all of it. That took nearly 1.5 hours, or about 10 seconds per photo. So what does a whole book look like?
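As a quick check of those numbers, here is a small sketch in Python. It takes 1.5 hours as an upper bound (the session was "nearly" that long), so the per-shot time comes out slightly above the rough 10 seconds quoted. The variable names are just for illustration.

```python
# Figures quoted in the post; 1.5 h is treated as an upper bound.
PAGES = 801
SHOTS = 489
SESSION_HOURS = 1.5

# Average time spent per photo, including reshoots.
seconds_per_shot = SESSION_HOURS * 3600 / SHOTS

# Average number of book pages captured per photo.
pages_per_shot = PAGES / SHOTS

print(f"{seconds_per_shot:.1f} s per shot (upper bound)")
print(f"{pages_per_shot:.2f} pages per shot on average")
```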



With some magic semi-automated processing these images are all that is needed for a perfect scan. Using ImageJ I converted them to black and white, subtracted the background and cropped/rotated the pages. These are some samples:
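To give a feel for what those steps do, here is a minimal, dependency-free sketch in Python. It is not the ImageJ workflow itself: the background is estimated with a simple local box mean (an assumption of mine, not necessarily what ImageJ or I used), rotation is left out, and every function name and parameter is made up for illustration. The tiny synthetic "page" has uneven lighting and one short ink stroke.

```python
def to_gray(rgb):
    """Convert rows of (r, g, b) tuples to luminance values in 0..255."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb]


def box_mean(gray, radius):
    """Estimate the background as the mean of a square window, clipped at the edges."""
    h, w = len(gray), len(gray[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [gray[j][i] for j in ys for i in xs]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def binarize(gray, radius=2, offset=20):
    """Ink (0) where a pixel is darker than its local background by more than offset, else paper (255)."""
    bg = box_mean(gray, radius)
    return [[0 if bg[y][x] - gray[y][x] > offset else 255
             for x in range(len(gray[0]))]
            for y in range(len(gray))]


def crop_to_ink(bw, margin=1):
    """Crop to the bounding box of all ink pixels, plus a small margin."""
    ys = [y for y, row in enumerate(bw) if 0 in row]
    xs = [x for row in bw for x, v in enumerate(row) if v == 0]
    if not ys:
        return bw  # blank page: nothing to crop to
    top, bottom = max(0, min(ys) - margin), min(len(bw), max(ys) + margin + 1)
    left, right = max(0, min(xs) - margin), min(len(bw[0]), max(xs) + margin + 1)
    return [row[left:right] for row in bw[top:bottom]]


# Synthetic 7x9 page: paper gets darker from left to right (uneven lamps),
# with a three-pixel ink stroke in the middle row.
page = [[(200 - 5 * x,) * 3 for x in range(9)] for _ in range(7)]
for x in (3, 4, 5):
    page[3][x] = (40, 40, 40)

cropped = crop_to_ink(binarize(to_gray(page)))
for row in cropped:
    print("".join("#" if v == 0 else "." for v in row))
```

Comparing each pixel with its local background, rather than with one global threshold, is what lets the lighting gradient drop out while the ink survives.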



These processed images can simply be fed into Adobe Acrobat or similar optical character recognition (OCR) software to translate the image of the text into machine-readable, fully searchable text. Exactly what I need!

Software used:
ImageJ: Automated image processing