[This is a very rough draft which I'll refine over time...]

On this page, I’ll discuss some pros and cons of different methods for digitizing archival documents. At the end, I’ll tell you how I’ve gone about it myself.

The process is best thought of as a black box in which the input is an archival document and the output is a digital copy of the document. There are a few steps in between, as this diagram sketches out.

[Figure: Flowchart for Digitizing Archival Documents]
Step 1: From Document to JPEG

Hardware: Only one piece of hardware is required for this step: either a digital camera or a scanner. A scanner also requires a laptop, and I strongly recommend bringing a laptop even if you’re using a digital camera. A tripod is optional but highly desirable.

Scanners: Most archives, when they allow scanners at all, set very restrictive rules for their use. In my experience, the US National Archives system (including the Presidential libraries) will allow scanners once they’ve been inspected and approved. The only type it allows, however, is the single-sheet flatbed scanner. A good example, which I’ve personally used, is the Canon LiDE 80. The good news with this kind of scanner is that it’s very inexpensive. The bad news is that you can tell it’s inexpensive: it takes about 10 seconds to scan a letter-sized page, which is quite a bit slower than making copies. (As an aside, be sure to get a scanner that has USB 2.0, not the older and slower USB 1.1, whose transfer speed will drag out the whole process.) While faster scanners do exist, it’s unlikely you’ll get one approved, because many of them also include an Automatic Document Feeder (ADF), which is usually forbidden. The one time I’ve seen one allowed in a US archive was at NARA II in College Park, where one of the senior archive technicians disabled the ADF and then approved the flatbed portion of the scanner for use. One additional note on scanners: almost no archive will let you use a line-by-line scanner or hand scanner. They require the user to put pressure on the paper, which any historian knows causes unacceptable damage to the record.

Note that scanners are forbidden at the UK National Archives (aka the Public Record Office). As with every step of this process that takes place in an archive, be sure to contact the archive ahead of time to find out its rules.

Cameras: As with scanners, there are several considerations. For a minimally acceptable image, you’ll need a camera with a resolution of at least 3 megapixels. Fortunately, almost every camera sold today (c. 2007) will far exceed that, with the exception of those built into cell phones. There are two other things I’d recommend for a camera. First, use one for which you can get an AC adaptor. If you’re doing six- to eight-hour stretches with documents, the battery will run down. An alternative is to bring a charger and a spare battery. Second, try to find a camera with some method of remote shutter control. This is important if you eventually decide to use a tripod. With either a remote control button in your hand or control from your laptop, you’ll eliminate the shake that comes from pressing the shutter button. As with scanners, USB 2.0 is a major plus: transferring over the USB cable will be painfully slow with USB 1.1. Alternatively, you can use some other method (such as a media adaptor) to copy the images over to the hard drive.
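To get a rough feel for what 3 megapixels means on paper, the sketch below estimates how many pixels per inch land on a letter-sized page. It assumes a 4:3 sensor and a page that just fills the frame, neither of which is stated above, and the function name is made up for illustration.

```python
import math

def effective_ppi(megapixels, page_w_in=8.5, page_h_in=11.0, aspect=4 / 3):
    """Rough pixels per inch when a page just fits inside the frame.

    Assumes (hypothetically) a sensor with the given aspect ratio and a page
    photographed with its long edge along the sensor's long side.
    """
    total_pixels = megapixels * 1_000_000
    # Solve long_px * short_px = total_pixels with long_px = aspect * short_px.
    short_px = math.sqrt(total_pixels / aspect)
    long_px = short_px * aspect
    # The tighter of the two directions limits the usable resolution.
    return min(long_px / page_h_in, short_px / page_w_in)

ppi_3mp = effective_ppi(3)
print(f"3 MP on a letter-sized page: about {ppi_3mp:.0f} pixels per inch")
```

Under those assumptions a 3-megapixel frame works out to somewhere around 175 pixels per inch, which is why it is only a minimum for readable text.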

Laptop: You don’t need a high-powered, brand-new machine, but you’ll want something made in the last few years. You should have a reasonably sized hard drive, with at least 20 GB. A USB-powered external hard drive can expand the storage capacity quite a bit (up to 200 GB as of right now) for relatively little money. If you’re connecting a scanner, be sure to have all the required software for the scanner already loaded on the machine. If you’re going the camera route, either Windows XP or OS X will have everything you need for this step (which just saves the images as JPEGs), but as you’ll see, you’ll want some additional software for later steps.

Tripod: If you’re using a digital camera for hours at a time, for days in a row, your back will eventually make you regret not having a tripod. Having to bend over a table to line up thousands of pictures will get very old. There are two styles of tripods that I’ve seen work. One is a relatively small table-top model where the camera is mounted facing directly down from beneath the tripod’s apex. The second is a tripod placed on the floor next to the table. This type has a detachable center arm which is then placed parallel to the floor, suspending the camera directly above the document. I don’t recommend the more conventional type of tripod which doesn’t allow you to place the camera directly above the table. It will take photos with an unacceptable distortion.

Software: You won’t need anything special for this step, which only creates JPEGs.

Process: This depends on whether you’re using a camera alone, a camera with tripod, a camera with remote control, or a scanner. There are a few things which are consistent, though, no matter what variation of equipment you’re using. My first step is to review the folders just as might be done with any method of research. When I see a document that I’ll want later, I do two things. First, I enter it into my bibliographic database with a one line abstract. That forces me to summarize the document. Second, I mark the document in the folder in such a way that it’s obvious when I open the box that there are documents to be copied. At NARA, I use the long copy slips and have the ends stick out of the open end of the folder. When I’ve reviewed some limit (sometimes I use a single box, sometimes a whole row of a cart), then I switch to photographing.

Line the document up against some easy reference like the corner of the table. Later, if you want to crop the photos, this will make it easier to measure on one page and then crop the entire document.

Be sure to disable your flash! There is a special place in archive hell for those who use camera flashes.

The first frame of every document you photograph must contain all the citation information. When you finally combine the pictures into pdf documents (or just print them out), you’ll need the full citation information.

Here’s an example of the first frame of a document I photographed at the Kennedy Library.

[Figure: First Frame Example]

You can see that the top of the page has the information that isn’t going to change as I photograph a series of documents: archive, series, subseries. In the table, I put the things that do change frequently in the course of a single photo shoot: box, folder, and file name. By looking at this single frame, I can extract the entire citation. Also, I’ve got a system for the filenames which works well for finding the file later. A filename is YYYYMMDD_from_to or YYYYMMDD_author_keyword or YYYYMMDD_keyword1_keyword2. Later, when the files are individual PDFs, you’ll be able to use Google Desktop (Windows) or Finder (Mac) to just type in the date and find the document in a few seconds.
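The naming scheme is easy to apply by hand, but a small helper keeps it consistent. The following is only a sketch: the function name and the rule of stripping everything except letters and digits are assumptions made for this example, not part of the scheme itself.

```python
import re
from datetime import date

def make_filename(doc_date, *parts):
    """Build a name such as 19600101_kennedy_truman from a date and keywords."""
    cleaned = []
    for part in parts:
        # Lowercase and drop anything that is not a letter or digit
        # (an assumed rule, so the name is safe on any operating system).
        token = re.sub(r"[^a-z0-9]+", "", part.lower())
        if token:
            cleaned.append(token)
    if not cleaned:
        raise ValueError("need at least one author, recipient, or keyword")
    return "_".join([doc_date.strftime("%Y%m%d")] + cleaned)

print(make_filename(date(1960, 1, 1), "Kennedy", "Truman"))
# prints 19600101_kennedy_truman
```

Because the date comes first in year-month-day order, sorting the files by name also sorts them chronologically.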

As I go through a marked box or two (or whatever I’ve decided to shoot for a single run), I’ll cross out the old citation information as I go down the page, so that the last line of each column always applies to the document I’m working with.

Photograph each document that you’ve marked in the series that you’re working on, being sure to update the citation information for each. When you’re done with the series, download the pictures from the camera to the hard drive. Put the pictures in a “working” directory which you’ll clean out at the end of the day.

Keep going until you’re done with the day’s research.

Congratulations. You’ve completed step 1.

My setup: I use a Manfrotto tripod which I set up next to the table. It allows me to hang the camera (a Canon Digital Rebel XT) over the documents. I use a cable to control the camera remotely from my keyboard and download each shot to the laptop’s hard drive as it’s taken. With one hand, I operate the shutter (by tapping the laptop’s spacebar), and with the other I flip to the next page of the document. Normally, I can take photos faster than I could make copies standing at the copier. My record was photographing an 800-page document in 45 minutes.
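For comparison with the flatbed scanner’s roughly 10 seconds per page, the record run above can be turned into a per-page figure. This is just the arithmetic written out; the function name is illustrative.

```python
def seconds_per_page(pages, minutes):
    """Average time spent on each page, in seconds."""
    return minutes * 60 / pages

camera = seconds_per_page(800, 45)   # the 800-page record run
scanner = 10.0                       # approximate flatbed time quoted for a letter-sized page

print(f"camera:  {camera:.1f} s per page")
print(f"scanner: {scanner:.1f} s per page")
print(f"the camera is about {scanner / camera:.1f}x faster")
```

The scanner figure covers scanning time only, so the real gap is probably wider once you count the time spent placing each sheet on the glass.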

Step 2: From JPEG to PDF

Overview: In this step, you take the JPEGs created during the day’s photography and turn them into Adobe Acrobat (PDF) documents. There are a few reasons to do this. First, it puts all the pages for a single document into a single file. That makes it much easier to find them later. Second, you can manipulate the documents by marking them up in Acrobat or performing optical character recognition (OCR). Third, it saves a LOT of space. Acrobat’s compression will reduce a page to 1/4 or less of its original size. I have 110,000 pages of documents stored in about 20 GB of PDFs.
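Those two numbers imply an average size per page, which is handy when planning disk space for a trip. The sketch below assumes decimal gigabytes (1 GB = 1,000,000,000 bytes) and treats the "1/4 or less" figure as exactly 1/4 to get a lower bound on the original JPEG size.

```python
def bytes_per_page(total_gb, pages):
    """Average bytes per page, assuming decimal gigabytes."""
    return total_gb * 1_000_000_000 / pages

pdf_page = bytes_per_page(20, 110_000)
# If a PDF page is at most 1/4 of the JPEG, the JPEG was at least 4x larger.
jpeg_page_min = pdf_page * 4

print(f"average PDF page: about {pdf_page / 1000:.0f} KB")
print(f"implied JPEG page: at least {jpeg_page_min / 1000:.0f} KB")
```

That comes to roughly 182 KB per PDF page, and at least 727 KB or so for each original photograph.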

Hardware: Just your laptop with the JPEGs taken in step 1.

Software: There are two programs you can use for this step. Which one you choose determines the exact sequence of steps, but the inputs and outputs are identical. One program is Nuance’s PaperPort Professional 11, which is “document management software”. It works not only for this step, but will also do OCR and help you find documents. The interface is easy to understand, and it’s the best choice for beginners and intermediate users. Note: Be sure to get the Professional version, because the basic version does not work with PDFs.

A second option is Adobe Acrobat Pro 8. To really speed things up, you can add the AutoSplit Pro plug-in. You can get Acrobat at a steep discount at most schools. For AutoSplit Pro, you’ll have to ask the company for an educational discount. (I don’t recall if they offer one, to be honest.)

Process: If you’re using PaperPort, open the working directory on your desktop. You should see thumbnail images of all the pages you photographed. Now, you can simply stack the pages one on top of another. The citation images at the front of each document make it easy to see where each document starts and finishes. When you’re done stacking, rename each document with the filename, which is conveniently there for you on the first page of the document. Be sure you’ve saved all the files as PDFs. You’re done.

If you’re using Acrobat, the first step is to create a single large PDF from all the pictures you’ve taken. The practical upper limit for this is about 500 pages, so work in batches of no more than that unless you want your computer to choke. Once you have the single large document, use the Pages tab to scroll through it and “extract” each document as its own PDF while deleting the original pages from the concatenated PDF. Save each extracted PDF. Do this until you’re done.
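If the day’s shooting runs past 500 frames, it helps to decide the batches before opening Acrobat. Here is a minimal sketch of that chunking; it assumes the camera names its files so that sorting by name matches shooting order, and the function names and the "working" path are placeholders.

```python
from pathlib import Path

def batches(items, size=500):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def jpeg_batches(working_dir, size=500):
    """Group the JPEGs in a directory into batches, in filename order."""
    files = sorted(
        p for p in Path(working_dir).iterdir()
        if p.suffix.lower() in {".jpg", ".jpeg"}
    )
    return list(batches(files, size))

# Example with plain numbers standing in for 1,200 photographs.
print([len(b) for b in batches(list(range(1200)))])
# prints [500, 500, 200]
```

A batch boundary can fall in the middle of a document, so check the last few frames of each batch and move the cut back to the nearest citation frame if needed.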

Alternatively, if you’re using the AutoSplit Pro plug-in, delete all the bookmarks that are automatically created when Acrobat concatenates the original 500-page document. Then, page through and create a bookmark with the filename on each citation page. That is, if the photograph of the citation page contains the filename “19600101_kennedy_truman”, create a bookmark on that page with the same text. When you’ve marked up all the pages, use AutoSplit Pro to divide the PDF into new files at each bookmark, using the bookmark text as the filename. You’re done.
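The logic of splitting at bookmarks can be sketched in a few lines. This is not how the plug-in is implemented and it does not touch any PDF; it only shows how a list of bookmarked citation pages determines the page range of each output file. The function name and the second bookmark name are hypothetical, and pages are numbered from 1.

```python
def split_ranges(bookmarks, total_pages):
    """Turn (first_page, filename) bookmarks into (filename, first, last) ranges."""
    ordered = sorted(bookmarks)
    ranges = []
    for i, (first, name) in enumerate(ordered):
        # Each document runs to the page before the next bookmark,
        # and the final document runs to the end of the file.
        if i + 1 < len(ordered):
            last = ordered[i + 1][0] - 1
        else:
            last = total_pages
        ranges.append((name + ".pdf", first, last))
    return ranges

marks = [(1, "19600101_kennedy_truman"), (4, "19600215_memo_budget")]
for filename, first, last in split_ranges(marks, total_pages=6):
    print(filename, first, last)
```

Note that each range starts on the citation page itself, so the citation frame stays at the front of every split document.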

My method: I’ve used both methods. PaperPort is much easier, but Acrobat + AutoSplit Pro is extremely fast once you’ve gotten past the learning curve. However, it requires a lot of confidence with the technology.

[I will add the section on OCR later....as well as additional pictures/videos/screen-captures.]

11 Responses to “Digitizing Archival Documents”


  1. 1 Evan January 20, 2007 at 5:33 pm

    Nice summary of this. I wrote a very similar post — slightly different working habits — last year: http://blog.lib.umn.edu/robe0419/coffee/046590.html

    Have fun finishing up that dissertation. I know too well where you’re at as i’m trying to finish soon too.

  2. 2 Maja Clark April 13, 2009 at 12:55 am

    I’d love to know more about how successful the OCR portion is. What resolution do your photos need to be?

  3. 3 hyh June 6, 2010 at 3:32 am

    Great tips!

    What camera and settings did u use? I will be at an archive that bars the use of tripods and flash. Digicam will be Canon Powershot S90 or Nikon Coolpix S8000.

    Thanks

    HYH

    • 4 dropshot94 June 7, 2010 at 7:18 am

      Ultimately, I used a Canon Rebel XT. As for settings, I don’t remember the exact ones, but I think I used an automatic setting which was fine in the light I had. Also important is that I didn’t use the maximum resolution. Instead I cranked it down to 3 or 4 MP per shot. Any more than that just wastes space.

      I strongly recommend you go with Canon, simply because Canon’s software includes the ability to remotely control the shutter from a computer keyboard. Should you get a tripod (and I suggest it for extended research trips), the ability to flip document pages with one hand while tapping the space bar (and thus taking a picture) with the other will pay great dividends.

      Good luck!

  4. 5 Jhenery January 5, 2011 at 11:48 am

    i would like to know if u can recommend a suitable database that be be created and implemented in an archive

    • 6 dropshot94 January 17, 2011 at 5:31 pm

      @Jhenery: I didn’t implement a specialized database for the actual storage and retrieval. Rather, I relied on the operating system to find my files (using google desktop on windows, and then spotlight on OS X). However, I did create a database record in Scholars Aid (today, I’d recommend Zotero) which summarized the document with enough information to cite it correctly and to identify its contents. As that record naturally included the date of the document and author, and that was my file naming scheme (YYYYMMDD_author_keyword.pdf), I was very quickly able to pull up the document itself.

  5. 7 Jhenery January 5, 2011 at 11:50 am

    *that can be created and implemented for the storage and retrieval of archival documents

  6. 8 wabasso June 2, 2011 at 11:04 am

    Can I suggest the PNG format between the hardware and PDF steps? It is a slightly larger format but digital space is cheap and will only become cheaper.

    The PDF conversion can also be tweaked. You can have it preserve PNG images so that JPEG compression doesn’t make OCR more difficult.

    At the very least, use PNG (or TIFF or anything else that is lossless) as a temporary format to feed good OCR.

    • 9 dropshot94 July 3, 2011 at 1:08 pm

      I like your suggestion. The only downside is that you’d have to preserve the PNG’s as individual pages. My workflow (which is far from perfect, and I’m sure could be improved along the lines you’re suggesting) has me leaving the archive with documents as multi-page PDFs. Or are you suggesting that the camera save the photos as PNGs?

      • 10 wabasso July 3, 2011 at 10:24 pm

        If you would like the archive to be in a multi-page format then PDF is one way to go. I am mainly suggesting PNG as an intermediary between the source format and the PDF. When you create the PDF there are several tricks the software does to compress what it sees. I believe most of the “profiles” that come with Adobe Acrobat will compress photo-type images via JPG. This can introduce ugly artifacts and gets worse each time you save it (should you need to process it again in the future…you never know).

        Most modern operating systems have quick preview programs integrated to quickly flip through photos in a directory. True, everything stored in PNG wouldn’t be a truly multi-page format, but you could almost view it as if it were.

        In any case, your scanner should be able to output TIFF and your camera as RAW. I don’t like leaving a conversion to JPG up to the native software on the image acquisition devices. There are options you can tweak (compression %, resampling, algorithm, etc.) specific to your needs (photo vs. text).

        I might just be ranting here. The takeaway is that you should aim for an archive in a lossless format because you never know if you’ll need to edit it again in the future. If multi-page format is a must have then look into your PDF conversion settings and make sure PNG is saved as PNG or ZIP or some other lossless format. Good luck!

      • 11 dropshot94 July 17, 2011 at 4:36 pm

        A very good point. At the time I wrote this post, I was working with a 40ish GB hard-drive and space was a concern. I suppose there’s no reason not to keep the original photos in a lossless state if you expect to do future processing. For day-to-day work, however, I find having a one (multi-page pdf) file to one database entry relationship essential to my workflow. With very few exceptions (maps, pictures), I haven’t needed to have loss-free captures of my documents. However, in those cases, had I had your archive of lossless photos, they would have come in handy.

