Monday, July 2, 2012

Class No. 6: Using and Understanding PDF/A as a Preservation Format

Last week I attended my sixth DAS course, a live webinar titled "Using and Understanding PDF/A as a Preservation Format." The course, taught by Geoff Huth, covered some basic information about preservation standards in general, specific information about the purpose and requirements of PDF/A (and its various versions), and some practical information about how to create and validate PDF/A files.

PDF/A is an open preservation standard controlled by ISO (ISO 19005-1:2005). It is not a particularly useful preservation format for complex files like spreadsheets, databases, or webpages, but it is quite good for text-based, static documents, both digitized and born-digital. Some of its advantages include:
  • the look and feel of the original is retained;
  • any fonts required to accurately render the document are embedded within the PDF/A (unlike most file formats, which just point to a place on your hard drive where the necessary font may or may not reside);
  • it contains extractable text (for digitized documents, of course, this is only true if you used OCR software at time of capture); and
  • it helps to ensure authenticity by being very difficult to modify. 
Because the PDF/A standard is expressly designed to persist over time, it requires that certain "non-archival" features be stripped out of a document before it can be converted into a valid PDF/A file. This applies to anything that might be unstable in the long term, such as embedded audio or video, encryption, LZW compression, transparency, executable files, or references to external content, though with each new version of the standard it seems that more features are allowed. There are several different "flavors" of PDF/A, each with its own list of requirements. For example, to create a valid PDF/A-1a you will need to include metadata that preserves the logical structure of the document, specifies the language of the text, and preserves the text stream in reading order, whereas a PDF/A-1b preserves the visual appearance of the original but requires less descriptive metadata (the "b" stands for basic, the "a" for accessible; a document that only conforms to the standard at the basic level is less accessible as a result). PDF/A-2 allows for electronic signatures and JPEG 2000 compression and sets requirements for XMP metadata, and it comes in three conformance levels: PDF/A-2a, 2b, and 2u (for Unicode). PDF/A-3 was recently ratified as well; it is very similar to PDF/A-2 but supports the maintenance of the original file by allowing it to be embedded within the PDF/A.
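
As a rough illustration of the "flavors" described above: a PDF/A file declares its flavor in its XMP metadata using the `pdfaid:part` and `pdfaid:conformance` properties. The Python sketch below (not something shown in the course) reads that declaration from the raw bytes of a file. The function name is my own, and the sketch assumes the XMP packet is stored uncompressed and in a single-byte-compatible encoding such as UTF-8. It only reports what a file claims to be; it does not validate conformance, which still requires a dedicated validator.

```python
import re

# Match both XMP serializations: attribute form (pdfaid:part="1")
# and element form (<pdfaid:part>1</pdfaid:part>).
_PART = re.compile(rb'pdfaid:part\s*(?:=\s*["\']|>)\s*(\d)')
_CONF = re.compile(rb'pdfaid:conformance\s*(?:=\s*["\']|>)\s*([A-Za-z])')


def claimed_pdfa_flavor(data: bytes):
    """Return the flavor a file claims in its XMP metadata (e.g. 'PDF/A-1b'), or None.

    Hypothetical helper: it reads the declaration only and performs no validation.
    """
    part = _PART.search(data)
    conf = _CONF.search(data)
    if part is None or conf is None:
        return None
    return f"PDF/A-{part.group(1).decode()}{conf.group(1).decode().lower()}"
```

To try it on a real file, pass in the file's bytes, for example `claimed_pdfa_flavor(open(path, "rb").read())`, where `path` points to a PDF of your own.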

I'm not sure I came away from this course with a comprehensive understanding of the PDF/A standard, but what I do know is that implementing it as a preservation standard is not as simple as choosing a "save as" command (which is, sadly, kind of what I pictured). Documents must be prepared for conversion if they contain problematic features, metadata about the structure of the document must be added, and the resultant PDF/A must be visually inspected for accuracy and validated for conformance to the standard. And that's just the beginning - as Geoff stressed at the end of the course, the format isn't everything; preservation programs require work. We still need conversion procedures, version control, environmental controls, descriptive and technical metadata, regular backups, and vigilance in the face of continued change and obsolescence.

One question I came away with as I thought about how this might relate to my work at the Library is whether the PDF/A might really replace the TIFF image as a preservation format for scanned documents. I can see what the advantages might be, but I'm wondering if there are some disadvantages as well. Is this something that other archives have thought about or are already implementing? I'd be interested to hear what others think.

Tuesday, April 10, 2012

Class No. 5: Standards for Digital Archives

Last week I took a Foundational DAS webinar, "Standards for Digital Archives," taught by Mahnaz Ghaznavi. As its name suggests, this course provided an overview of the many standards that are available for use with digital archives. The underlying theme was that standards are good, and that you should adopt the ones that fit the needs of your institution. The course began with an example of an electronic record that could have benefited from the use of standards: a word processing file created in an obsolete, proprietary format that displayed as a nonsensical mishmash of special characters. Had the file been converted to an open, standard, more persistent format, the information contained in the document could have been retained.

Though sometimes we create our own standards - a local set of topical subject headings, for example - the best standards are those that are published and maintained by a standards-setting body (such as ISO, W3C, NISO, ANSI, or NIST). There are standards to guide us in almost any activity that we engage in as archivists:
  • records retention and appraisal (ISO 15489); 
  • the ingestion, management, preservation, and access of digital or physical archives (ISO 14721, better known as OAIS); 
  • linking objects with their associated metadata (METS);
  • capturing preservation data about our objects (PREMIS); 
  • capturing descriptive metadata about our objects (Dublin Core);
  • migrating our objects into more stable formats (JPEG 2000, PDF/A); and
  • making sure our digital objects are stored in a secure manner (TRAC).
Given that it would have been impossible to delve into these standards in any detail within the confines of a ninety minute webinar, I think the instructor was able to convey some useful information about the options that are available to help manage digitized or born-digital archival assets. She advised us to learn from what other institutions have done and are doing, whether successfully or not, and to recognize that digital preservation is a moving target. To implement any of these standards one would need significantly more guidance, but this course can serve as the first step to becoming aware of what is possible.
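
To make one of the standards in the list above a little more concrete, here is a minimal Python sketch of a descriptive metadata record using Dublin Core elements. This was not part of the webinar. The Dublin Core namespace is the real one, but the wrapper element name, the function name, and all of the sample values are invented for illustration, and a real implementation would follow your institution's own application profile.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict) -> str:
    """Serialize a dict of Dublin Core element names and values as XML."""
    root = ET.Element("record")  # wrapper element name is an arbitrary choice
    for name, value in fields.items():
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


# Invented sample values, for illustration only.
example = dublin_core_record({
    "title": "Memorandum on records retention",
    "creator": "Example Office",
    "date": "1998-03-14",
    "format": "application/pdf",
})
print(example)
```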

Because SAA generously allows multiple people to view their webinars for the cost of one registration (though each attendee must pay his or her own examination fee), we had a good-sized audience of full-time staff and interns in a conference room at the Library. For our interns, most of whom are current graduate students in Library Science at Simmons College, it seemed like much of the information presented echoed what they've already learned in class. I took that as a positive sign that graduate programs are adapting to our increasingly digital world. Archives students graduating now will start their careers already armed with skills and knowledge that more established professionals must actively seek out (by pursuing the DAS certificate, for instance). Of course it has always been thus, everywhere and in every profession, but my perspective until recently has been that of the recent graduate; now that I have been out of school for almost ten years, I find that I am suddenly among those who must rush to catch up or be left behind.

Thursday, March 29, 2012

Class No. 4: Electronic Records: The Next Step

I recently completed my fourth DAS course, an on-demand webinar titled "Electronic Records: The Next Step," taught by Geoffrey Huth, Director of Government Records Services at the New York State Archives. One of my classmates had very recently taken Huth's full-day, in-person course on Basic Electronic Records, which is part of the Foundational tier of DAS courses. Though this webinar is part of the Tactical and Strategic tier - the next tier up - it apparently didn't contain much information that was not already covered in the basic course. Given this, the two courses might be best presented as an either/or choice: the basic course for true beginners, and this webinar for those who already have some familiarity with the issues surrounding electronic records.

The structure of the course mirrored the archival lifecycle, which - as I have learned in all of my DAS courses - is consistent regardless of format: Appraisal, Ingest, Processing and Preservation, Maintenance, Access, and Planning. Though much of the material was familiar, I find that I need to hear this kind of information over and over again before it truly sinks in. I took away the following main points:
  • Appraise ruthlessly. It will cost approximately five times more to store a digital file than it does to store a physical object. We cannot and should not keep everything, in the physical world or in the digital world. If you cannot manage or even access the files, if you cannot maintain their original functionality, or if you do not have sufficient metadata to make sense of them, consider whether they are worth keeping. 
  • Define acceptable file formats (uncompressed, unencrypted) and external media devices, as well as acceptable methods of transfer for your institution. This way you will have processes in place to handle any electronic records that you receive.
  • Make sure that the donor retains a second copy of all electronic files until your copy is verified.
  • Always accession electronic records on a quarantined (i.e. non-networked) computer. Run your virus software, wait a month, and then run it again. 
  • Preservation options for electronic records include migration, normalization, emulation, and output to some sort of hard copy, generally paper or microfilm. 
    • Normalization, which involves converting files to a "normal" format that is open and persistent (PDF/A, for example), is the most likely solution.
    • Emulation, wherein the file is never converted to another format, is a less practical choice, as the original environment of each file would need to be perpetually maintained. I see how this is completely impractical, but if you had the resources and the know-how it might be interesting to have a fleet of computers running defunct operating systems and software programs so that records could be accessed as they were originally created.
    • Output to paper or microfilm might be an acceptable solution if you've got just one or two electronic files, and if those files are simple word processing documents. If retaining the functionality of a record is important (links in a website or formulas in a spreadsheet, for example), obviously a hard copy is not going to be sufficient.  
  • One thing that I found slightly alarming was Huth's assertion that the world, with the exception of the archival community, is turning away from TIFF and toward JPEG 2000 as a standard. Is this true, and if so, what will that mean for digital archives (like JFK's) that are full of TIFF images?
  • Access seems like the trickiest piece of this puzzle. Is access provided online, or just in the reference room? If electronic records are closely related to physical records, how do you provide meaningful access to both at once?  
  • Just as we should define the formats we will accept when accessioning records, we should define the formats we are willing to provide to our users. It should be up to the user to convert our normalized file into whatever format he or she may require.
  • Though our inclination may be to ignore electronic records and digitization, the truth is that if you're not working with the digital world, you're not working in the real world. 
  • You can't do everything at once, but do something, and do it now.
In the spirit of that last point, I am going to try to do something with the electronic records that are stored on this device, which was found by my colleague in an unexpected place in our stacks:

Floppy disk

First I'll need a quarantined computer with a disk drive that will fit this floppy disk, and then I'll need to figure out what program was used to create whatever documents are stored on it. In this case my guess is that they'll be word processing documents that most likely exist in hard copy in the collection already, in which case this disk probably won't be of much importance to the collection.  However, rather than just sticking it somewhere in the stacks and pretending it doesn't exist (as we did originally), I'm going to use what I've learned in my DAS courses to deal with it properly.
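
One of the points in the list above was that the donor should keep a second copy of all files until your copy is verified. The course did not prescribe a method, but a common way to verify a copy is to compare checksums. Here is a minimal Python sketch under that assumption; the function names and the choice of SHA-256 are mine, not Huth's.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute a file's SHA-256 checksum, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(source_dir: Path, copy_dir: Path) -> list:
    """Return relative paths of files missing from the copy or with a different checksum."""
    problems = []
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        copied = copy_dir / relative
        if not copied.is_file() or sha256_of(copied) != sha256_of(source_file):
            problems.append(relative.as_posix())
    return problems
```

An empty list means every file on the original media has a matching copy. This only checks files present in the source, so extra files in the copy are ignored.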

Tuesday, March 27, 2012

Changes to DAS Course Examination Policies

SAA recently made some changes to the DAS course examination policies, and I thought it might be useful to highlight them here.

The exams, and the rules governing them, now differ depending on the length of the course. Until now students were given two hours to complete each exam, regardless of length, and some exams had as few as five questions. Now for a web seminar, which is the shortest type of course, the exam will consist of ten questions, and participants will be given just one hour to complete them. In contrast, the exam for a two-day course - the longest type of course currently offered - will now consist of 30 questions, but participants will have up to four hours to complete them.

This seems like a sensible way to acknowledge the disparate amount of material that can be covered in a 90 minute webinar versus a one- or two-day, in-person course. I wonder if the next step might be to weight these courses differently, given this disparity, or perhaps to offer significantly longer webinars to increase the complexity of remote courses for the benefit of those who are not able, for whatever reason, to travel.

The revised Course Examinations page also provides some details about the comprehensive exam, though I'm not sure whether it's new information. It explains that the comprehensive exam covers the seven Core Competencies of the DAS Curriculum, and that each DAS course addresses at least two of these competencies. Any combination of the required number of courses from the four tiers of study should theoretically provide students with the knowledge necessary to pass the exam. The seven Core Competencies are:
  1. Understand the nature of records in electronic form, including the functions of various storage media, the nature of system dependence, and the effect on integrity of records over time.
  2. Communicate and define requirements, roles, and responsibilities related to digital archives to a variety of partners and audiences.
  3. Formulate strategies and tactics for appraising, describing, managing, organizing, and preserving digital archives.
  4. Integrate technologies, tools, software, and media within existing functions for appraising, capturing, preserving, and providing access to digital collections.
  5. Plan for the integration of new tools or successive generations of emerging technologies, software, and media.
  6. Curate, store, and retrieve original masters and access copies of digital archives.
  7. Provide dependable organization and service to designated communities across networks.
More information about the DAS Curriculum can be found here.

Monday, January 30, 2012

Class No. 3: Digital Curation: Creating an Environment for Success

Earlier this month I attended "Digital Curation: Creating an Environment for Success" taught by Jackie Esposito, University Archivist and Head of Records Management Services at Penn State University. The workshop was held at the Harvard Business School which, with its beautiful campus and fancy cafeteria, isn't a bad place to spend a day. After my last course, which discussed the basics of digitizing analog archival material, the content of this course represented a return to the concerns surrounding born-digital material.

The recurring theme for the day was that doing digital curation is like eating an elephant; you have to break it into pieces in order to manage it successfully. I think the other main theme of the class could be used to extend the metaphor: an elephant, once broken apart, is best eaten with friends, meaning that digital curation cannot be managed by one archivist alone. Partnering with the right people within your organization, including records creators, budget writers, and IT experts, is key to your success.

As I've done for my two previous classes, I'll list the points that stood out most for me in this course:
  • The format may change, but the function does not, meaning a record is still a record regardless of its format. This reinforces the idea that our archival skills are still applicable in the digital world.
  • Though the same archival processes apply to electronic records, the window of time in which we must gain intellectual control over them is smaller. Whereas a box of paper can sit on the shelf for decades, a born-digital accession may only last a short time - something like 5 years - before the media becomes obsolete and inaccessible.
  • As the permanent caretaker of the records, you can define the formats you are willing to accept. 
  • When forging relationships with others in your institution, don't scare them by talking about all of the horrible things that will happen to your institution's records if they aren't managed correctly. Scaring people does not work. Instead, make them feel comfortable and work to convince them of the benefits of what you are trying to achieve. 
  • Know what your priorities are, and make them manageable and measurable. Also, have frequent parties to celebrate your victories, no matter how small. If you have students on staff, feed them often.
  • NEVER use the word "project." Instead, use the word "program," which implies permanence.
  • If your public access interface doesn't look and act like Google, nobody is going to like it or use it.
  • Don't reinvent the wheel - other archivists have done these things already, so borrow from them, collaborate with them, and generally draw on the experience and expertise of your colleagues.
Though the material in this course was mostly theoretical to me, I really enjoyed it, mainly because I found Jackie to be an extremely engaging instructor. One note, though: this is classified as a Foundational DAS course, whereas to me it seems more suited to the Tactical and Strategic tier. Foundational courses focus mainly on "the needs of practitioners—archivists who are or will be working directly with electronic records," while Tactical and Strategic courses are meant to focus on "the skills that archivists need to make significant changes in their organizations so that they can develop a digital archives and work seriously on managing electronic records." As the course title indicates, this was geared towards archivists who actually have the power to change the environment at their institution and who are responsible for implementing an electronic records management program rather than (or perhaps in addition to) working hands-on with the actual records.

I welcome any and all comments on this course or the DAS program in general, and as always, thanks for reading!

Thursday, January 5, 2012

Class No. 2: Thinking Digital: A Practical Session to Help You Get Started

On Tuesday I began the new year by taking my first DAS course presented as an on-demand webinar: "Thinking Digital...A Practical Session to Help You Get Started," taught by Jessica Branco Colati and Greg Colati. This is one of the Foundational DAS courses, and it serves as a good overview of the decisions we make as creators and stewards of digital content. Unlike the first course I took, this was more focused on the digitization of traditional archives than on born-digital records.

The course was organized by the kinds of choices digital archivists must make about quality, metadata, management, storage, preservation, and delivery. Without repeating all of the information in the course, I'll just list the highlights as I saw them:
  • High quality digital objects adhere to five established principles: interoperability, reusability, sustainability, authenticity, and scalability. Better quality requires more time, more storage, and better equipment, but it also allows for a wider variety of uses. We should create the best quality digital objects we can afford now so that we have greater flexibility later.
  • Metadata allows for the identification, management, access, use, and preservation of a digital object. There are several different types of metadata: administrative, descriptive, preservation (including technical), and structural. Metadata should support local needs, but should also be standardized in order to enable interoperability. Controlled vocabularies should be supported. Keep in mind that metadata is never truly finished - there will always be changes or updates to make, or new information to capture.
  • The management of digital files must include all derivatives of the original object (and there could be hundreds) as well as the metadata about that object. Management must be built into your digitization workflow; it should not be a separate activity. There is no one digital asset management system (DAMS) that will solve all of your problems - you will most likely need an array of systems to accomplish all of your goals. In any DAMS, web delivery is only a small piece of the puzzle despite how important it is to users and probably to your management.
  • Storage choices will depend on the choices about quality you made earlier - the higher the quality of your files, the more storage you will need. While storage may be getting cheaper, backup and preservation services are getting more expensive. It might be best to consult an expert when it comes to storage.
  • Preservation starts at the point of creation of a digital object, which is also the point at which the creators of digital content probably don't want to be bothered with questions about preservation, so it's on us as archivists to maintain the focus on preservation concerns. The first stage in a successful preservation plan is simply to acknowledge that digital preservation is important (much like the first step in overcoming addiction is to acknowledge that you have a problem, I suppose). This is as far as we've gotten, to be honest, but we hope to move on to the next stage soon, which is to take action.
  • Delivery involves both discovery and access. Discovery is based on the indexing of your metadata and/or the full text of your scanned documents. Access is how users interact with your digital objects once they are discovered - are the objects simply viewed, or are they able to be manipulated or extracted by the user?
The first point the instructors made before delving into what I described above was that the skills we already have as archivists can be easily adapted to the digital environment. I find that this is particularly true when it comes to the following: 

  • Planning and prioritizing digitization workflow. This is no different from planning and prioritizing our processing workflow, and should be done in the same systematic way.
  • Creating descriptive metadata. Descriptive metadata is archival description, which means that we already know how to create it, and also that our finding aids are full of preexisting descriptive metadata.
  • Managing and preserving digital assets. We manage our physical holdings, whether through the use of a database or a paper shelf list, and we are responsible for their long term preservation. This is true of digital files as well, whether they are born-digital records or digital surrogates of physical objects. Though digital files do present some specific challenges that will require more technical knowledge than we may start out with, the fundamental responsibility is the same.
One final comment: the on-demand courses are available for two months once you register for them, which is very convenient, but it turns out that this flexibility actually made it difficult for me to find the time when so many other things required immediate attention. I registered for this webinar back in November, and I was lucky to complete it just before the two months expired. I do plan to take additional on-demand courses, but in order to thwart my inner procrastinator I will try to schedule a specific day for them as if I were taking them live.

I will be taking another Foundational DAS course, "Digital Curation: Creating an Environment for Success" on January 18th in Boston, so I'll be posting again in a few weeks. Until then, thanks for reading!