Archival Research on an Industrial Scale, Part 1


Political scientists and historians face at least four major problems in conducting archival research: time, resources, identifying the key information, and making that information available to others for replication purposes. Together these problems either put serious archival work out of the reach of graduate students and junior faculty, or they encourage brief, shallow trips where the exercise becomes "can I find a document that supports my claim?" Over the next several posts I am going to discuss one of the technological solutions I have developed, as well as some online resources that are often overlooked.

One technological fix is to digitize a large volume of documents during a brief research trip, OCR them, and then search for relevant terms back at home. There are many ways to accomplish this, but I have finally tweaked a system that allows for high volume (~10,000 pages a week), low human intervention (automated processing, OCR, filing), and ease of reading and distribution (a single PDF file per set of documents under a common name). The system ensures that I can find the information I am looking for, that I can do so in a relatively short amount of time at the archive, and that I can cite, locate, and share a document years later.

The rough outline of my workflow is as follows:

  1. Very high resolution camera mounted to an inverted tripod connected to a laptop
  2. Custom console application that remotely controls the camera and pulls pictures directly to the hard drive.
  3. Custom batch file which processes the incoming images
    1. Scan Tailor splits the images into individual pages, dewarps the pages, and crops them to just the text/folder.
    2. ImageMagick converts the TIFFs into compressed PDFs.
    3. pdftk binds those PDFs into a single large PDF and dumps it to an inbox.
  4. OmniPage Batch Manager monitors the inbox and automatically processes incoming files, rotates the pages so that the text is upright, applies another deskew/despeckle, performs the OCR, converts the file to a searchable pdf, and compresses it further before dumping it to an outbox.
  5. Zotero citation manager handles storage, note taking, and inserting citations into documents.
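
To make step 3 more concrete, here is a minimal Python sketch of the convert-and-bind stage. It is not my actual batch file: the function names, directory layout, and the JPEG quality setting are assumptions chosen for illustration, and the Scan Tailor step is left out. The `magick` command is the ImageMagick 7 entry point (older installs use `convert`), and `pdftk ... cat output ...` is pdftk's standard way to concatenate files.

```python
from pathlib import Path
import subprocess


def convert_command(tiff: Path, out_dir: Path, quality: int = 60) -> list[str]:
    """Build an ImageMagick command turning one TIFF into a JPEG-compressed PDF.

    The quality value is an illustrative assumption, not a recommended setting.
    """
    pdf = out_dir / (tiff.stem + ".pdf")
    return ["magick", str(tiff), "-compress", "jpeg",
            "-quality", str(quality), str(pdf)]


def bind_command(pdfs: list[Path], bound: Path) -> list[str]:
    """Build a pdftk command concatenating per-page PDFs into a single file."""
    return ["pdftk", *[str(p) for p in pdfs], "cat", "output", str(bound)]


def process_folder(tiff_dir: Path, work_dir: Path, inbox: Path, name: str,
                   run=subprocess.run) -> Path:
    """Convert every TIFF in tiff_dir, then bind the results into inbox/name.pdf.

    Pages are bound in sorted filename order, so the capture software is
    assumed to write sequentially numbered files.
    """
    pdfs = []
    for tiff in sorted(tiff_dir.glob("*.tif")):
        cmd = convert_command(tiff, work_dir)
        run(cmd, check=True)          # stop if ImageMagick reports an error
        pdfs.append(Path(cmd[-1]))    # last argument is the output PDF
    bound = inbox / f"{name}.pdf"
    run(bind_command(pdfs, bound), check=True)
    return bound
```

Calling `process_folder(Path("scans"), Path("work"), Path("inbox"), "box12_folder3")` (hypothetical paths) would leave one bound PDF in the inbox folder, where the OCR software can pick it up.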

The workflow has a number of nice properties. The tripod and remote control mean that, when flipping through a folder, the user need only stop long enough to mash the keyboard and wait for the shutter to click. I often fall into a rhythm where I can capture every page in the time it takes to skim the document to decide whether it's worth digitizing, and it can actually be faster to digitize some documents than to evaluate them. For documents where I know I will want every page, I turn on a time-lapse option in my software that takes a picture every second, and I just turn the pages. The tall tripod, high-resolution camera, and content-aware software (Scan Tailor) allow the user to capture documents of any size or orientation, as well as the location information inscribed on the folder title. Scan Tailor and OmniPage ensure that the documents are cropped and oriented in the PDF for easy reading. My batch file has a number of optimizations, including a RAM drive and a multithreaded scheduler, so that the computer's full resources are utilized and processing time per image is low (~3 seconds).
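
The multithreaded scheduling idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the scheduler in my batch file: `process_all` and its parameters are hypothetical names, and the worker is whatever function processes one image (for example, one that launches an external tool). Threads are enough here because each worker mostly waits on an external process.

```python
from concurrent.futures import ThreadPoolExecutor
import os


def process_all(items, worker, max_workers=None):
    """Run worker(item) for every item on a thread pool.

    Results come back in the same order as the inputs, which keeps
    pages in capture order for later binding.
    """
    # Default to one thread per CPU core (an assumption; tune as needed).
    max_workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, items))
```

For example, `process_all(image_paths, process_one_image)` would keep every core busy while preserving page order, where `image_paths` and `process_one_image` stand in for your own list of files and per-image routine.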

The only downsides currently are that OmniPage is paid software and that high-resolution cameras with remote capabilities are somewhat expensive. Still, when you consider the cost of additional weeks in the field and of not having a replicable research design, the system more than pays for itself.

