Wednesday, April 5, 2017

Next steps for the Canadiana.org platform: IIIF, JHOVE

This is a new fiscal year.  I'm excited to discuss some of our current plans, and hopefully get some feedback and help from the community.

Some related Internal Projects for our 3-person DevOps team include adopting the IIIF APIs and JHOVE, and updating the associated METS application profile and schemas.

A quick note about when?

The Phoenix Project describes four types of work that IT Operations does: Business Projects, Internal Projects, Changes, and Unplanned Work. This work falls under Internal Projects, and will be bumped by Business Projects as well as Unplanned Work. While this work may be exciting to the DevOps team, other work will take precedence.  This means that while I can say we are working on it, we can't offer any useful estimate of when there will be something other people can use.

International Image Interoperability Framework

The name undersells what this community is doing.  IIIF isn't only a common way to access images, but a community working on common APIs and documentation for institutions like ours that offer access to images.  At first glance I thought this was a possible replacement for part of our Content Server (See: The Canadiana preservation network and access platform), but it turns out to be a community defining interoperability between most aspects of our access platform.

Adoption will come in stages, with the replacement of our Content Server being the first step.  The content server can be thought of as having 3 parts:
  • Authentication
  • Authenticated direct access to files
  • Authenticated access to derivatives of images

IIIF Image API

We have decided to adopt Cantaloupe, which provides a simple configuration method to handle our authentication and access to images in our TDR (Trustworthy Digital Repository).  We could have enhanced our own image server to handle IIIF, but we felt it better to adopt and participate in an existing community.  We will be retrofitting our existing access services to use the IIIF image server and decommissioning our existing server.

Setting up a test Cantaloupe server with access to TDR files involved writing a small self.get_pathname(identifier) Ruby function to return the correct filesystem pathname based on an identifier.  What will take more work is the authorization function that denies access to files we aren't able to offer access to.
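To give a feel for how small that function can be, here is a minimal sketch rather than our actual delegate script. The module nesting follows the Cantaloupe 3.x delegate script convention as I understand it. The pool directories, the "AIP ID/relative path" identifier format, and the optional second argument (there only so the lookup can be tried outside Cantaloupe) are all assumptions made for illustration.

```ruby
module Cantaloupe
  module FilesystemResolver
    # Hypothetical disk pool roots; the real TDR layout is not shown here.
    POOLS = ['/tdr/pool1', '/tdr/pool2'].freeze

    # Assumes identifiers look like "<AIP ID>/<path inside the AIP>".
    # Returns the first matching file found in a pool, or nil.
    def self.get_pathname(identifier, pools = POOLS)
      aip_id, relative = identifier.split('/', 2)
      return nil if aip_id.nil? || aip_id.empty?
      return nil if relative.nil? || relative.empty?
      # Refuse identifiers that try to climb out of the AIP directory.
      return nil if identifier.split('/').include?('..')

      pools.each do |pool|
        candidate = File.join(pool, aip_id, relative)
        return candidate if File.file?(candidate)
      end
      nil
    end
  end
end
```

Returning nil for anything unexpected keeps the function from ever handing Cantaloupe a path outside the pools.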

IIIF Authentication API

The authentication model that IIIF uses is different from what we have been using, so we'll need to adjust.  Our existing authorization token was presumed to be unique to accessing content within an AIP (See: OAIS) within our TDR, and a new token would be requested if the client needed access to a new AIP.  We will be adopting a different token, and will require a database lookup to confirm the AIP being accessed is part of a collection the patron is authorized to access.  While most AIPs are sponsored, not all are, and thus not every patron will have access to every AIP.
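The lookup described above might look something like the following sketch. The class name, the in-memory hashes standing in for database tables, and the reading of "sponsored" as "open to every patron" are all assumptions for illustration, not our schema.

```ruby
require 'set'

# Hypothetical checker: plain hashes stand in for the database tables.
class AccessChecker
  # token_collections: token  => collections the patron may access
  # aip_collections:   AIP ID => collections the AIP belongs to
  # sponsored_aips:    AIP IDs assumed to be open to everyone
  def initialize(token_collections:, aip_collections:, sponsored_aips:)
    @token_collections = token_collections
    @aip_collections = aip_collections
    @sponsored_aips = sponsored_aips.to_set
  end

  def authorized?(token, aip_id)
    return true if @sponsored_aips.include?(aip_id)

    granted = @token_collections.fetch(token, []).to_set
    containing = @aip_collections.fetch(aip_id, []).to_set
    # Access is allowed when the patron holds at least one collection
    # that contains this AIP.
    granted.intersect?(containing)
  end
end
```

Unlike the old per-AIP token, one token here can cover any number of AIPs, at the cost of a lookup on each request.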

Direct access to files

Once we have the first two in place, all we need to do for direct access is verify credentials, find the base path for the AIP in the disk pools, and allow the HTTP server to send the file to the client.
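Those three steps could be sketched as a small Rack-style handler. This assumes an nginx front end that honours the X-Accel-Redirect header and an internal location mapped onto the disk pools; neither is something we have settled on. The authorizer and locator arguments are hypothetical callables standing in for the credential check and the disk pool lookup.

```ruby
# Returns a Rack-style [status, headers, body] triple.
# authorizer: callable(token, aip_id) -> true/false        (hypothetical)
# locator:    callable(aip_id) -> internal base URI or nil (hypothetical)
def direct_access_response(token, aip_id, relative_path, authorizer:, locator:)
  plain = { 'Content-Type' => 'text/plain' }
  # Step 1: verify credentials.
  return [403, plain, ['Forbidden']] unless authorizer.call(token, aip_id)

  # Step 2: find the base path for the AIP.
  base = locator.call(aip_id)
  unsafe = relative_path.split('/').include?('..')
  return [404, plain, ['Not found']] if base.nil? || unsafe

  # Step 3: hand the transfer to the HTTP server instead of streaming it here.
  [200, { 'X-Accel-Redirect' => "#{base}/#{relative_path}" }, []]
end
```

The application only decides whether the request is allowed; the file bytes never pass through Ruby.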

IIIF Presentation API

Reading the API documentation reminded us of many conversations at our whiteboard. Our current collection management interface is simple: a collection is a tag that is attached to an AIP ID within a non-TDR database.

We planned to move to each collection being a list that can contain other collections or AIPs, which is the model IIIF uses.  Specific collections could be purchased, and for these we would compile the list down to tags to allow a quick reverse lookup for authentication.
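Compiling nested collections down to tags could work roughly like this sketch. The hash layout (collection name mapped to a list of members, where a member that is itself a key is a sub-collection and anything else is an AIP ID) and both function names are assumptions for illustration.

```ruby
require 'set'

# Collects every AIP reachable from one collection, following
# sub-collections and skipping any collection already visited.
def expand_collection(collections, name, seen = Set.new)
  return Set.new if seen.include?(name)

  seen << name
  collections.fetch(name, []).each_with_object(Set.new) do |member, aips|
    if collections.key?(member)
      aips.merge(expand_collection(collections, member, seen))
    else
      aips << member
    end
  end
end

# Builds the reverse lookup: AIP ID => set of collection names (tags).
def compile_tags(collections)
  tags = Hash.new { |hash, key| hash[key] = Set.new }
  collections.each_key do |name|
    expand_collection(collections, name).each { |aip| tags[aip] << name }
  end
  tags
end
```

With the compiled result, checking whether an AIP belongs to a purchased collection becomes a single lookup instead of a walk through the collection tree.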

The same is true of other aspects of the Presentation API that map to expansion ideas we have had for a few years.

Our current (2012 edition) AIP and SIP documents describe a layout and a METS profile designed for the ECO project:

  • Set of scanned images (JPEG, JPEG2000, TIFF)
  • Single downloadable PDF, usually derived from OCR of all the images
  • Single "physical" order for the images (METS structMap, IIIF Sequence)

There are a number of enhancements to our AIP structure we have been discussing, and each fits well within the IIIF Presentation API:

  • Allowing multiple structMaps to define multiple orders, such as when image corrections are made (pages removed or added), or when the original order wasn't correct. One structMap can describe all the images within the AIP, while other named structMaps (sequences) can each describe a separate sequence for display.
  • Storing the "master" as well as other derived formats (other image formats, ALTO XML OCR data, OCR derived PDF) associated with that master, making the relationship between the master image and the derivatives clear.

A summer student created a prototype of a tool intended to edit those structMaps.  Now that we are moving towards IIIF, we are more likely to adopt an IIIF Manifest Editor instead, and provide mechanisms for our AIP manipulation software to generate METS structMaps from IIIF Manifests.
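As a rough idea of what generating structMaps from a manifest could involve, the sketch below turns each sequence of a IIIF Presentation 2.x manifest into one METS structMap. The function names, the div TYPE values other than "physical", and the file_ids lookup from canvas @id to METS file ID are assumptions; our real profile may need more.

```ruby
require 'json'

# Escapes a value for use inside a double-quoted XML attribute.
def xml_attr(value)
  value.to_s.gsub(/[&<>"]/, '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;')
end

# manifest: parsed IIIF Presentation 2.x manifest (a Hash)
# file_ids: hypothetical lookup from canvas @id to METS file ID
# Returns one structMap XML string per sequence.
def struct_maps_from_manifest(manifest, file_ids)
  manifest.fetch('sequences', []).map do |sequence|
    pages = sequence.fetch('canvases', []).each_with_index.map do |canvas, index|
      file_id = file_ids.fetch(canvas.fetch('@id'))
      "    <mets:div TYPE=\"page\" ORDER=\"#{index + 1}\" LABEL=\"#{xml_attr(canvas['label'])}\">\n" \
      "      <mets:fptr FILEID=\"#{xml_attr(file_id)}\"/>\n" \
      "    </mets:div>\n"
    end
    "<mets:structMap TYPE=\"physical\" LABEL=\"#{xml_attr(sequence['label'])}\">\n" \
    "  <mets:div TYPE=\"document\">\n#{pages.join}  </mets:div>\n" \
    "</mets:structMap>\n"
  end
end
```

A manifest loaded with JSON.parse can be passed in directly; a canvas missing from file_ids raises a KeyError rather than silently producing a structMap with a gap.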

Adoption of IIIF will impact most aspects of our platform, so full integration will be split into many incremental projects.

JHOVE

It all started with IIIF Manifests needing the height and width dimensions for canvas and image resources.  This is metadata our lead software developer has wanted the front end interface to have access to for some time, and our metadata architect had planned to look into MIX (NISO Metadata for Images in XML) as a way to store this information in our METS records.  The MIX documentation references JHOVE, and JHOVE has been mentioned to us quite often over the years.  It was in our list of tools to investigate adopting.
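To show where those dimensions end up, here is a sketch of a single canvas built in the shape used by the IIIF Presentation API 2.x. The function name, its parameters, the JPEG format, and the level 2 Image API profile are assumptions for illustration; the width and height would come from whatever we end up storing in our METS records.

```ruby
require 'json'

# Builds one IIIF Presentation 2.x canvas with a single painted image.
# image_service_id is the base URI of the image on the IIIF image server.
def canvas_for(canvas_id:, label:, width:, height:, image_service_id:)
  {
    '@id' => canvas_id,
    '@type' => 'sc:Canvas',
    'label' => label,
    'height' => height,
    'width' => width,
    'images' => [
      {
        '@type' => 'oa:Annotation',
        'motivation' => 'sc:painting',
        'on' => canvas_id,
        'resource' => {
          '@id' => "#{image_service_id}/full/full/0/default.jpg",
          '@type' => 'dctypes:Image',
          'format' => 'image/jpeg',
          'height' => height,
          'width' => width,
          'service' => {
            '@context' => 'http://iiif.io/api/image/2/context.json',
            '@id' => image_service_id,
            'profile' => 'http://iiif.io/api/image/2/level2.json'
          }
        }
      }
    ]
  }
end
```

Calling JSON.pretty_generate on the result gives the fragment that would sit inside a sequence's canvases list.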

In our TDR we have tens of millions of files, with some of the earliest scanned and OCR'd in the late 1990s.  As part of the CRL TDR certification we added checks for new files being added to our TDR (ImageMagick's identify for our JPEG, JPEG2000, TIFF and PDF files).  Prior to that our guarantee was based on our use of BagIt for AIPs and revisions: if any change happened to a file in the TDR (bit rot, etc.), we would detect it through our constant re-validation of the MD5s of all files in all AIPs.  We always keep copies of files in multiple physical server rooms across the country.  During content ingest we would also confirm that our METS record referenced all the files being submitted.  While we wanted to adopt something more robust such as JHOVE for identification and validation of file formats, we weren't able to allocate the resources at the time to implement it.
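For readers unfamiliar with BagIt, the MD5 re-validation amounts to something like the sketch below. It is not our validation code: the function name is made up, and it only reads manifest-md5.txt, ignoring tag manifests and the other checks a full BagIt validator performs.

```ruby
require 'digest/md5'

# Returns the relative paths listed in manifest-md5.txt whose file is
# missing or whose current MD5 no longer matches the recorded one.
def failed_md5_checks(bag_dir)
  manifest = File.join(bag_dir, 'manifest-md5.txt')
  File.readlines(manifest, chomp: true).each_with_object([]) do |line, failures|
    next if line.strip.empty?

    # Each manifest line is "<checksum> <path relative to the bag>".
    expected, relative = line.split(/\s+/, 2)
    path = File.join(bag_dir, relative)
    actual = File.file?(path) ? Digest::MD5.file(path).hexdigest : nil
    failures << relative unless actual == expected.downcase
  end
end
```

An empty result means every listed file still matches its recorded checksum. Note that this catches changes to a file, not whether the file was a valid JPEG or PDF in the first place, which is the gap JHOVE would fill.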



On March 24 I downloaded the latest JHOVE (XML files generated indicate release="1.16.5") and set it to generate an XML file for all of our files.  This process is ongoing, with 60 million files categorized and counting.
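The run itself is little more than calling JHOVE once per file with the XML output handler. A sketch of building that command is below; the function name and the output naming scheme are assumptions, and the -h and -o options are JHOVE's handler and output file options as I understand them.

```ruby
# Builds the argument list for one JHOVE run that writes an XML report.
# Reports are named after the input file, so files with the same base
# name would need separate output directories.
def jhove_command(input_path, output_dir)
  output_path = File.join(output_dir, "#{File.basename(input_path)}.jhove.xml")
  ['jhove', '-h', 'xml', '-o', output_path, input_path]
end
```

Passing the result to system(*jhove_command(path, dir)) runs JHOVE without going through a shell, which avoids quoting problems with unusual filenames.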


I expected to find problems with our earliest files, but was surprised to find issues reported from our most recent additions.  We would like to have JHOVE file identification/validation as part of our ingest process, only adding new files which are recognized.  Before we can do this we need to work out compatibility issues.

If you have adopted JHOVE and manipulate PDF files as part of your processing, I am curious whether you have seen the problem we found when joining single-page PDFs into multi-page PDFs using Poppler or PDFtk (both as packaged for Ubuntu). It is surprising that JHOVE finds errors in multi-page PDF files generated with these common utilities.
