Monday, July 2, 2012

Class No. 6: Using and Understanding PDF/A as a Preservation Format

Last week I attended my sixth DAS course, a live webinar titled "Using and Understanding PDF/A as a Preservation Format." The course, taught by Geoff Huth, covered some basic information about preservation standards in general, specific information about the purpose and requirements of PDF/A (and its various versions), and some practical information about how to create and validate PDF/A files.

PDF/A is an open preservation standard published by ISO as ISO 19005 (the first part is ISO 19005-1:2005). It is not a particularly useful preservation format for complex files like spreadsheets, databases, or webpages, but it is quite good for text-based, static documents, both digitized and born-digital. Some of its advantages include:
  • the look and feel of the original is retained;
  • any fonts required to accurately render the document are embedded within the PDF/A (unlike most file formats, which just point to a place on your hard drive where the necessary font may or may not reside);
  • it contains extractable text (for digitized documents, of course, this is only true if you used OCR software at the time of capture); and
  • it helps to ensure authenticity by being very difficult to modify.

Because the PDF/A standard is expressly designed to persist over time, it requires that certain "non-archival" features be stripped out of a document before it can be converted into a valid PDF/A file. This applies to anything that might be unstable in the long term, such as embedded audio or video, encryption, certain kinds of compression (LZW, for example), transparency, executable files, or references to external content, though with each new version of the standard it seems that more features are allowed.

There are several different "flavors" of PDF/A, each with its own list of requirements. For example, to create a valid PDF/A-1a you will need to include metadata that preserves the logical structure of the document, specifies the language of the text, and preserves the text stream in reading order. A PDF/A-1b, by contrast, preserves the visual appearance of the original but requires less descriptive metadata (the "b" stands for basic, the "a" for accessible; a document that only conforms to the standard at the basic level is less accessible as a result). PDF/A-2 allows for electronic signatures and JPEG2000 compression and sets requirements for XMP metadata, and it comes in three levels: PDF/A-2a, PDF/A-2b, and PDF/A-2u (for Unicode). PDF/A-3 was recently ratified as well; it is very similar to PDF/A-2 but supports the maintenance of the original file by allowing it to be embedded within the PDF/A.
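To make the flavors a little more concrete: a PDF/A file declares which part and conformance level it claims to follow in its XMP metadata, using the pdfaid:part and pdfaid:conformance properties. Below is a minimal Python sketch, not something from the course, that looks for that declaration. The function name claimed_pdfa_flavor is my own invention, and the sketch assumes the XMP packet is stored as plain, uncompressed text in the file, which should be the usual case for PDF/A.

```python
import re

# XMP properties from the PDF/A identification schema
# (namespace http://www.aiim.org/pdfa/ns/id/, conventional prefix "pdfaid").
# Each pattern accepts both the element form (<pdfaid:part>1</pdfaid:part>)
# and the attribute form (pdfaid:part="1").
_PART = re.compile(rb'pdfaid:part(?:>|\s*=\s*["\'])\s*(\d)')
_CONF = re.compile(rb'pdfaid:conformance(?:>|\s*=\s*["\'])\s*([ABUabu])')


def claimed_pdfa_flavor(data: bytes):
    """Return the flavor a PDF claims (e.g. "PDF/A-1b"), or None if it claims none.

    Hypothetical helper: it reads the claim only and does not validate the file.
    """
    part = _PART.search(data)
    if part is None:
        return None
    conf = _CONF.search(data)
    level = conf.group(1).decode("ascii").lower() if conf else ""
    return "PDF/A-" + part.group(1).decode("ascii") + level


if __name__ == "__main__":
    import sys

    # Usage: python flavor.py some-document.pdf
    with open(sys.argv[1], "rb") as handle:
        print(claimed_pdfa_flavor(handle.read()))
```

Keep in mind that this only reports what the file says about itself. A file can claim to be PDF/A-1b and still break the rules, which is why validation with a dedicated tool is a separate step.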

I'm not sure I came away from this course with a comprehensive understanding of the PDF/A standard, but what I do know is that implementing it as a preservation standard is not as simple as choosing a "save as" command (which is, sadly, kind of what I pictured). Documents must be prepared for conversion if they contain problematic features, metadata about the structure of the document must be added, and the resulting PDF/A must be visually inspected for accuracy and validated for conformance to the standard. And that's just the beginning. As Geoff stressed at the end of the course, the format isn't everything; preservation programs require work. We still need conversion procedures, version control, environmental controls, descriptive and technical metadata, regular backups, and vigilance in the face of continued change and obsolescence.
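As an illustration of the "prepare for conversion" step, here is a rough Python sketch of a pre-conversion check that flags source PDFs containing some of the problematic features mentioned earlier. It is my own assumption of what such a check might look like, not a procedure from the course: the function name preflight and the token list are hypothetical and far from complete. It simply scans the raw bytes for PDF name tokens, so it can miss features hidden inside compressed objects and can raise false alarms.

```python
# Raw PDF name tokens that hint at features PDF/A does not allow.
# Illustrative only: this is not a complete list taken from the standard.
SUSPECT_TOKENS = {
    b"/Encrypt": "encryption",
    b"/JavaScript": "embedded script",
    b"/Launch": "launch action (external program)",
    b"/LZWDecode": "LZW compression",
    b"/Sound": "embedded audio",
    b"/Movie": "embedded video",
}


def preflight(data: bytes):
    """Return a sorted list of suspect features found in the raw PDF bytes."""
    found = {label for token, label in SUSPECT_TOKENS.items() if token in data}
    return sorted(found)


if __name__ == "__main__":
    import sys

    # Usage: python preflight.py some-document.pdf
    with open(sys.argv[1], "rb") as handle:
        problems = preflight(handle.read())
    print(problems if problems else "nothing obvious found")
```

An empty result does not mean a document is ready to convert, and it says nothing about conformance. The converted file still has to be inspected visually and run through a real PDF/A validator.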

One question I came away with, as I thought about how this might relate to my work at the Library, is whether PDF/A might really replace TIFF as a preservation format for scanned documents. I can see what the advantages might be, but I'm wondering if there are some disadvantages as well. Is this something that other archives have thought about or are already implementing? I'd be interested to hear what others think.