Thursday, May 11, 2017

Canadiana JHOVE report

This article is based on a document written for Code4Lib North on May 11th, and discusses what we’ve learned so far from our use of JHOVE.

What is JHOVE?



The original project was a collaboration between JSTOR and Harvard University Library, with JHOVE being an acronym for JSTOR/Harvard Object Validation Environment.  It provides functions to perform format-specific identification, validation, and characterization of digital objects.


JHOVE is currently maintained by the non-profit Open Preservation Foundation, operating out of the UK (associated with the British Library in West Yorkshire).




JHOVE ships with standard modules for AIFF, ASCII, BYTESTREAM, GIF, HTML, JPEG, JPEG2000, PDF, TIFF, UTF8, WAVE, XML, MP3, and ZIP.

What is Canadiana doing with JHOVE?



As of the last week of April we generate XML reports from JHOVE and include them within AIP revisions in our TDR.  At this stage we are not rejecting or flagging files based on the reports, only providing reports as additional data.  We will be further integrating JHOVE as part of our production process in the future.

Some terminology




What did Canadiana do before using JHOVE?



Prior to the TDR Certification process we made assumptions about files based on their file extensions: a .pdf was presumed to be a PDF file, a .tif a TIFF file, .jpg a JPEG file, and .jp2 a JPEG 2000 file.  We only allowed those 4 types of files into our repository.


As a first step we used ImageMagick’s ‘identify’ feature to identify and confirm that files matched the file types.  This meant that any files added since 2015 had data that matched the file type.
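
To illustrate the idea of confirming that a file's data matches its extension, here is a minimal Python sketch. It is not how ImageMagick's ‘identify’ works internally; it only compares the leading "magic" bytes of a file against the four extensions we accept. The names `MAGIC`, `EXTENSION_FORMAT`, `sniff_format` and `extension_matches_content` are made up for this example.

```python
import os

# Leading "magic" bytes for the four formats we allow into the repository.
MAGIC = {
    "PDF": [b"%PDF-"],
    "TIFF": [b"II*\x00", b"MM\x00*"],  # little-endian and big-endian headers
    "JPEG": [b"\xff\xd8\xff"],
    "JPEG 2000": [b"\x00\x00\x00\x0cjP  \r\n\x87\n"],  # JP2 signature box
}

# The assumption we used to make from the file extension alone.
EXTENSION_FORMAT = {
    ".pdf": "PDF",
    ".tif": "TIFF",
    ".jpg": "JPEG",
    ".jp2": "JPEG 2000",
}


def sniff_format(head):
    """Return the format whose magic bytes start `head`, or None."""
    for name, prefixes in MAGIC.items():
        if any(head.startswith(prefix) for prefix in prefixes):
            return name
    return None


def extension_matches_content(filename, head):
    """True when the extension is allowed and the first bytes agree with it."""
    extension = os.path.splitext(filename)[1].lower()
    expected = EXTENSION_FORMAT.get(extension)
    return expected is not None and sniff_format(head) == expected
```

A check like this would catch a zero-length file or a TIFF saved with a .jpg extension, but it says nothing about whether the rest of the file is well-formed, which is where JHOVE comes in.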


At that time we did not go back and check previously ingested files, as we knew we would eventually be adopting something like JHOVE.


Generating a report for all existing files
As of May 9, 2017 we have 61,829,569 files in the most recent revisions of the AIPs in our repository.  This does not include METS records, past revisions, or files related to the BagIt archive structure we use within the TDR.


I quickly wrote some scripts that would loop through all of our AIPs and generate reports for all the files in the files/ directory of the most recent AIP revision within each AIP.  We dedicated one of our TDR Repository nodes to generating reports for a full month to get the bulk of the reports, with some PDF files still being processed.
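
The scripts themselves aren't reproduced here, but the general shape of such a loop can be sketched in Python. This is an illustration under assumptions, not our production code: the directory arguments, the `.jhove.xml` report naming and the function names are all hypothetical, and it assumes a `jhove` launcher on the PATH that accepts `-h` (output handler) and `-o` (output file).

```python
import os


def jhove_command(source_path, report_path):
    """Build a JHOVE command line that writes an XML report for one file."""
    # -h selects the output handler, -o names the output file.
    return ["jhove", "-h", "xml", "-o", report_path, source_path]


def plan_reports(files_dir, reports_dir):
    """Yield one JHOVE command per file found under files_dir."""
    for dirpath, dirnames, filenames in os.walk(files_dir):
        dirnames.sort()  # walk subdirectories in a stable order
        for name in sorted(filenames):
            source = os.path.join(dirpath, name)
            relative = os.path.relpath(source, files_dir)
            # Hypothetical naming convention: <file>.jhove.xml
            report = os.path.join(reports_dir, relative + ".jhove.xml")
            yield jhove_command(source, report)
```

Each command could then be handed to `subprocess.run(command, check=True)` on a node where JHOVE is installed. Building the commands separately from running them makes it easy to do a dry run over millions of files first.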

Top level report from scan



Status                      Files
Total files                 61,829,569
Not well-formed             941,875 (1.5%)
Not yet scanned             253
Well-Formed and valid       60,828,836 (98.4%)
Well-Formed, but not valid  58,605 (0.09%)


JHOVE offers a STATUS for files which is one of:


  • “Not well-formed” - fails the purely syntactic requirements of the format
  • “Well-Formed, but not valid” - meets the syntactic requirements, but fails the higher-level semantic requirements for format validity
  • “Well-Formed and valid” - passes both the well-formedness and validity tests
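
To give a feel for how these statuses can be tallied, here is a Python sketch that pulls the status text out of JHOVE XML reports and counts them. The `SAMPLE` document is a hand-written, heavily trimmed stand-in for a real report, and the function names are made up for this example. Because the XML namespace may differ between JHOVE releases, the sketch matches on the element's local name only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written, trimmed stand-in for a real JHOVE XML report.
SAMPLE = """<jhove xmlns="http://hul.harvard.edu/ois/xml/ns/jhove">
  <repInfo uri="0001.tif">
    <format>TIFF</format>
    <status>Well-Formed and valid</status>
  </repInfo>
</jhove>"""


def report_statuses(xml_text):
    """Return the text of every <status> element in one report."""
    root = ET.fromstring(xml_text)
    statuses = []
    for element in root.iter():
        # Tags look like "{namespace}status"; keep only the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "status" and element.text:
            statuses.append(element.text.strip())
    return statuses


def tally(reports):
    """Count statuses across an iterable of report strings."""
    counter = Counter()
    for xml_text in reports:
        counter.update(report_statuses(xml_text))
    return counter
```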

Issues with .jpg files



Status                      Files
Not well-formed             10
Well-Formed and valid       44,743,051
Well-Formed and valid TIFF  14


We had 10+14=24 .jpg files, ingested before we adopted the ‘identify’ check, that turned out to be either broken (truncated or zero-length files) or carrying the wrong file extension.  9 of the “Not well-formed” files were from LAC reels, where we were ingesting 1000 to 2000 images per reel.

Issues with .jp2 files



Well-Formed and valid
11,286,315


JHOVE didn’t report any issues with our JPEG 2000 files.

Issues with .tif files



Status                      Message                                          Files
Not well-formed             Tag 296 out of sequence                          1
Not well-formed             Value offset not word-aligned                    503,575
Not well-formed             IFD offset not word-aligned                      435,197
Well-Formed and valid       -                                                4,608,048
Well-Formed, but not valid  Invalid DateTime separator: 28/09/2016 16:53:17  1
Well-Formed, but not valid  Invalid DateTime digit                           21,004
Well-Formed, but not valid  Invalid DateTime length                          3,483
Well-Formed, but not valid  PhotometricInterpretation not defined            202


  • Word alignment (offsets being even, that is, falling on a 2-byte word boundary) is the largest structural issue, but it is something that will be easy to fix.  We are able to view these images, so the data inside isn’t corrupted.
  • Validity of DateTime values is the next largest issue.  The format should be "YYYY:MM:DD HH:MM:SS", so something that says “2004: 6:24 08:10:11” will be invalid (the blank is an Invalid DateTime digit), and “Mon Nov 06 22:00:08 2000” or “2000:10:31 07:37:08%09” will be invalid (Invalid DateTime length).
  • PhotometricInterpretation indicates the colour space of the image data (WhiteIsZero/BlackIsZero for grayscale, RGB, CMYK, YCbCr, etc).  The specification has no default, but we’ll be able to fix the files by making and checking some assumptions.
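
The first two checks are simple enough to sketch in Python. This is an approximation written for this article, not JHOVE's actual code: the function names are made up, the order of the DateTime tests (length, then separators, then digits) is an assumption chosen because it reproduces the messages in the table above, and word alignment is taken to mean an even byte offset as in the TIFF 6.0 specification.

```python
import struct

# Positions of the fixed separators in "YYYY:MM:DD HH:MM:SS".
SEPARATORS = {4: ":", 7: ":", 10: " ", 13: ":", 16: ":"}


def check_tiff_datetime(value):
    """Return an error message for a bad DateTime string, or None if fine."""
    if len(value) != 19:
        return "Invalid DateTime length"
    for position, separator in SEPARATORS.items():
        if value[position] != separator:
            return "Invalid DateTime separator"
    for position, character in enumerate(value):
        if position not in SEPARATORS and character not in "0123456789":
            return "Invalid DateTime digit"
    return None


def first_ifd_offset(header):
    """Read the offset of the first IFD from the 8-byte TIFF header."""
    # Bytes 0-1 give the byte order: "II" little-endian, "MM" big-endian.
    byte_order = {b"II": "<", b"MM": ">"}[header[:2]]
    # Bytes 4-7 hold the offset as an unsigned 32-bit integer.
    return struct.unpack(byte_order + "I", header[4:8])[0]


def is_word_aligned(offset):
    """True when the offset falls on an even (2-byte) boundary."""
    return offset % 2 == 0
```

Run against the examples above, “2004: 6:24 08:10:11” comes back as an invalid digit, “Mon Nov 06 22:00:08 2000” as an invalid length, and “28/09/2016 16:53:17” as an invalid separator.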

Issues with .pdf files



Status                      Message                                                        Files
Not well-formed             No document catalog dictionary                                 3,081
Not well-formed             Invalid cross-reference table, No document catalog dictionary  2
Not well-formed             Missing startxref keyword or value                             8
Not well-formed             Invalid ID in trailer, No document catalog dictionary          1
Not yet scanned             -                                                              253
Well-Formed and valid       -                                                              191,408
Well-Formed, but not valid  Missing expected element in page number dictionary             33,881
Well-Formed, but not valid  Improperly formed date                                         33
Well-Formed, but not valid  Invalid destination object                                     1



The current primary maintainer of JHOVE wrote a longer article on the JHOVE PDF module titled “Testing JHOVE PDF Module: the good, the bad, and the not well-formed” which might be of interest.  Generally, PDF is a hard format to deal with and there is more work that can be done with the module to ensure that the errors it is reporting are problems in the PDF file and not the module.


  • “No document catalog dictionary” -- The root tree node of a PDF is the ‘Document Catalog’, and it has a dictionary object.  This exposed a problem with an update to our production processes where we switched from using ‘pdftk’ to using ‘poppler’ from the FreeDesktop project for joining multiple single-page PDF files into a single multi-page PDF file.  While ‘pdftk’ generated Well-Formed and valid PDFs, poppler did not.

    When I asked on the Poppler forum they pointed to JHOVE as the problem, so at this point I don’t know where the problem is.

    I documented this issue at: https://github.com/openpreserve/jhove/issues/248
  • “Missing startxref keyword or value” - PDF files should have a header, document body, xref cross-reference table, and a trailer which includes a startxref.  I haven’t dissected the files yet, but these may be truncated.
  • “Missing expected element in page number dictionary”.  I’ll need to do more investigation.
  • “Not yet scanned”.  We have a series of multi-page PDF files generated by ABBYY Recognition Server which take a long time to validate.  Eventually it indicates the files are recognized with a PDF/A-1 profile.  I documented this issue at: https://github.com/openpreserve/jhove/issues/161
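
For the possibly truncated files, a rough first look doesn't need JHOVE at all. The Python sketch below only checks the outer skeleton described above: a `%PDF-` header at the start, and a `startxref` keyword with a numeric value plus an `%%EOF` marker near the end. It is a triage aid written for this article, not a validator; the function name is made up, the 1024-byte tail window is an arbitrary choice, and it does not verify that the startxref value actually points at a cross-reference table.

```python
def pdf_skeleton_problems(data):
    """Return a list of obvious structural problems in raw PDF bytes."""
    problems = []
    if not data.startswith(b"%PDF-"):
        problems.append("missing %PDF- header")

    # The trailer lives at the end of the file, so only inspect the tail.
    tail = data[-1024:]
    index = tail.rfind(b"startxref")
    if index == -1:
        problems.append("missing startxref keyword")
    else:
        # The keyword should be followed by the byte offset of the xref table.
        following = tail[index + len(b"startxref"):].split()
        if not following or not following[0].isdigit():
            problems.append("missing startxref value")

    if b"%%EOF" not in tail:
        problems.append("missing %%EOF marker")
    return problems
```

A file cut off partway through will typically report both a missing startxref keyword and a missing %%EOF marker, which would be a strong hint that it was truncated rather than badly generated.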


Our longer term strategy is to no longer modify files as part of the ingest process.  If single-page PDF files are generated from OCR (as is normally the case) we will ingest those single-page PDF files.  If we wish to provide a multi-page PDF to download, this will be done as part of our access platform, where long-term preservation requirements aren’t an issue.  In the experiments we have done so far, we have found that the single-page PDF output of ABBYY Recognition Server and PrimeOCR validates without errors, and that the transformations we have done over the years were the source of the errors.

Sunday, May 7, 2017

Some of the earliest community groups on FLORA.org

Some of the earliest groups on Flora.org


  • Ask the Doctors, which was a real cool site managed by Rosaleen Dickson, who I met through the Freenet. Involved in the publishing industry, she co-authored a book on HTML back in the early 1990's when it was such a new thing.  She hosted pages on Canadian Books, and following the work she did with the doctors ran an "Ask Great Granny" site.
  • I believe Auto-Free Ottawa was the first community group I hosted, a group of people in the early 1990's envisioning an Ottawa that wasn't as dependent on the automobile.
  • Canadian Homeschool is the last of the original groups to be hosted on FLORA.org
  • Community Democratic Action
  • KC. Maclure Centre
  • MAI-not
  • Ottawa District Committee of Ontario Special Olympics
  • Peace and Environment Resource Centre -- still around, but has had their own domain name for quite some time.
  • Pednet was a mailing list hosted by Majordomo, and then moved to Mailman.
  • Visually Impaired was, I believe, information hosted by Charles Lapierre
  • Westend Family Cinema also obtained their own domain name quite some time ago.


As web access became easier for organizations it was far more common for groups to get their own domain names so that they could move their sites between hosts without anyone having to remember a new URL.  There are many redirects still in the config files for such groups that were previously hosted on FLORA.org.


There were other groups over the years, not all of which still exist, such as:
  • Car Free Living
  • Communities Before Cars Coalition
  • Coop Area Network (Networking between Coop Voisins and the Conservation Coop)
  • Cycle Challenge / Commuter Challenge
  • Economic Good
  • Famous 5
  • FTAA Ottawa
  • FVC (Fair Vote Canada) Ottawa
  • Food Action Ottawa (FAO)
  • Global Education Network
  • Global Issues Forum
  • Green Party (Ottawa region, back when the party was smaller)
  • International Association for Near Death Studies (Ottawa)
  • Maclure Center
  • National Capital Runners Association
  • OPIRG Forestry group
  • Ottawa LETS
  • Ottawa River Bioregion Group
  • Ottawa Transit Riders Association
  • Ottawa Vegetarian Society
  • The Doorstep Alliance

After doing a bit of spring cleaning of some sites where I can no longer reach the managers, and who haven't updated the sites in years, there were only 4 sites remaining: two from close personal friends, and two community groups.

For the sites whose managers I couldn't reach, I set them to redirect to the most recent archived version of their sites on Archive.org.   I made a mistake with the robots.txt file and they are temporarily unavailable, but I have sent a message to Archive.org in the hopes they can fix my mistake and restore the archive.

If there are any groups I've missed, please let me know in the comments.  It has been a few years, and I've been looking at old Apache config files to be reminded of some of the organizations.  I've not listed all the individuals (volunteers as well as individual election candidate websites from back when I hosted candidates during elections).

There have been many mailing lists over the years, but since this isn't something I'm planning on closing I won't get all nostalgic about them.  I'm keeping the domain name and will be keeping the redirects active for any sites that have moved so bookmarks can be updated.

Saturday, May 6, 2017

Winding down FLORA.org after more than 22 years.

FLORA Community Web was started in December 1994 (See Ottawa alternative community minded networking) and the first domain name it used was flora.ocunix.on.ca. Later the name flora.ottawa.on.ca (date unknown) was adopted, and then FLORA.org (13-Oct-1996).

It offered free websites and mailing lists for community groups from before these things were as easily available as they are today.  I haven't had as much time to spend on the server as I would like, and believe it would be best for me to admit that my interests have moved onward.   I'm in the process of helping the remaining groups hosted on FLORA.org to migrate to some other hosting.

Thanks go out to the many volunteers who participated over the years, and the many friends I made through these connections.



If you want to take a look at what the site looked like at various points in the past, Archive.org's Wayback engine has many snapshots.  The earliest list of flora.org organizations they have is a snapshot from 1998.  This was back when I listed some of the clients as well as volunteer sites hosted on the same computer.  A larger list of domains I was hosting on that computer can be seen from a 1999 list.


Some time in 2000/2001 most of those clients had been moved to OpenConcept, where I was managing the growing number of computers while Mike Gifford of OpenConcept was doing the billing, customer relations and all the business side.  In 2003 those servers became part of CooperIX, which Mike Richardson and I founded. CooperIX was a small co-location provider for more technical clients, with OpenConcept and its growing client list being the biggest single user.

Wherever my self-employed company went, the volunteer FLORA.org services came along with me.

In 2011 I moved from being a self-employed consultant to being staff at Canadiana.org.  This was the first time I was an employee since I became self-employed in early 1995, but it has been a great transition.  However, with the work I'm doing for Canadiana I don't feel I have the time to dedicate to the volunteer services, which are currently a couple of virtual machines on a server running in the basement of my home.

I finally decided this year that I should start the process of decommissioning those VMs.  I will start with the www.flora.org website, which is still managed via 1990's technology (content providers use FTP to log in to update their sites).  I will likely spend the time to migrate the mailman services to another VM, and keep them running as there are fewer security and other concerns with the mailing lists. I'll then decide what to do with my personal sites (my old business site, and so on).