A remarkable amount of data is hiding in historical records: handwritten forms, electronic printouts, and typed tables. This post describes methods I use for three types of difficult documents: consistently structured forms, inconsistently structured forms, and near machine-readable tables.
Consistently Structured Forms
For handwritten forms, there is a company called Captricity which will take your scans and divvy them up to Mechanical Turkers for transcription. Using their online interface, you indicate which parts of the document go to which fields, and they cut each image up into coherent chunks to send out for transcription. The advantage is that they can handle complicated form features, like check boxes. The downsides are that every image needs approximately the same layout, and the costs are not trivial.
Inconsistently Structured Forms
In other cases, the data fields will be relatively uniform, but the underlying forms might vary hugely between time periods or sources. During the Vietnam War, the United States created the Phoenix Program to selectively target rebel collaborators among the civilian population. I recovered tens of thousands of pages of records from the National Archives at College Park detailing who was targeted by the program. The underlying data were mostly uniform, but the documents came in no fewer than eight different variants. In this case I outsourced the conversion to a company overseas. The advantage of contracting a company is the relatively cheap price. The downside is finding a reputable contractor, since most companies simply subcontract out to one another. I used to work with a couple of firms in the U.S. until discovering they were just subcontracting overseas and doubling the price.
Near Machine-Readable Tables
In still other cases, the tables will be so close to machine readable that you can convert them with just a bit of Python and an off-the-shelf OCR program. In this example, I'm working with the 1971 South Vietnam Gazetteer created by the U.S. Army Topographic Command and scanned in by the Virtual Vietnam Archive at Texas Tech. For a number of reasons, neither Omnipage Ultimate nor ABBYY FineReader will correctly extract the table structure consistently across all 300 pages. We're going to have to help them along.
Step 1: Split the PDF into individual TIFF files. I used Acrobat Professional.
Step 3: Now we need to split the rows for Omnipage. A very simple trick from machine vision is to identify the white gaps between lines by summing pixel intensity row by row. The resulting signal is a sawtooth that you can use to segment each line of text.
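A minimal sketch of that trick, assuming each page has already been loaded as a grayscale NumPy array (e.g. via Pillow's `Image.open(path).convert("L")`) with white pixels near 255. The threshold and minimum-gap values here are illustrative, not the ones I used; you would tune them to your scans.

```python
import numpy as np

def find_row_breaks(img, white_thresh=250, min_gap=2):
    """Return row indices at the middle of blank horizontal bands.

    img: 2-D grayscale array, 0 = black, 255 = white.
    A row is "blank" if its mean intensity is >= white_thresh;
    runs of at least min_gap blank rows count as a gap between lines.
    """
    # Average intensity per row: blank rows peak, text rows dip,
    # producing the sawtooth signal described above.
    profile = img.mean(axis=1)
    blank = profile >= white_thresh

    breaks, in_gap, start = [], False, 0
    for i, is_blank in enumerate(blank):
        if is_blank and not in_gap:
            in_gap, start = True, i          # gap opens
        elif not is_blank and in_gap:
            in_gap = False                   # gap closes
            if i - start >= min_gap:
                breaks.append((start + i) // 2)  # cut at gap midpoint
    return breaks

def segment_lines(img, **kwargs):
    """Slice the page into one sub-image per line of text."""
    cuts = find_row_breaks(img, **kwargs)
    edges = [0] + cuts + [img.shape[0]]
    return [img[a:b] for a, b in zip(edges, edges[1:])]
```

Each slice can then be padded and written back out as its own image file before handing the batch to the OCR engine.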
Step 4: Set up a folder watching job in Omnipage with the spreadsheet option.