Thursday, May 11, 2017

Canadiana JHOVE report

This article is based on a document written for Code4Lib North on May 11th, and discusses what we’ve learned so far from our use of JHOVE.

What is JHOVE?

The original project was a collaboration between JSTOR and Harvard University Library, with JHOVE being an acronym for JSTOR/Harvard Object Validation Environment. It provides functions to perform format-specific identification, validation, and characterization of digital objects.

JHOVE is currently maintained by the non-profit Open Preservation Foundation, which operates out of the UK (associated with the British Library in West Yorkshire).

https://github.com/openpreserve/jhove

JHOVE has standard modules for AIFF, ASCII, BYTESTREAM, GIF, HTML, JPEG, JPEG2000, PDF, TIFF, UTF8, WAVE, XML, MP3, ZIP.

What is Canadiana doing with JHOVE?

As of the last week of April we generate XML reports from JHOVE and include them within AIP revisions in our TDR. At this stage we are not rejecting or flagging files based on the reports, only providing reports as additional data. We will be further integrating JHOVE as part of our production process in the future.
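
As a rough illustration of what generating one of these XML reports can look like, here is a minimal sketch that builds a JHOVE command line for a single file. The `-m` (module), `-h` (output handler) and `-o` (output file) options are JHOVE's documented command-line flags; the helper name `jhove_command`, the file names, and the assumption that a `jhove` launcher is on the PATH are hypothetical and are not taken from our actual scripts.

```python
import shlex


def jhove_command(path, module, output, jhove_bin="jhove"):
    """Build an argv list asking JHOVE for an XML report on one file.

    jhove_bin is an assumption: adjust it to wherever JHOVE is installed.
    """
    return [jhove_bin, "-m", module, "-h", "XML", "-o", output, path]


# Hypothetical file names, for illustration only.
cmd = jhove_command("0001.tif", "TIFF-hul", "0001.tif.jhove.xml")
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` when JHOVE is actually installed; building it separately keeps the sketch runnable without it.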

Some terminology

Archival Information Package (AIP) as defined by the Open Archival Information System https://en.wikipedia.org/wiki/Open_Archival_Information_System . This document also describes what we are discussing when we mention “ingest”.
Trustworthy Digital Repository (TDR). Canadiana was certified as a TDR in July 2015 http://www.canadiana.ca/tdr-certification . We also call the file store that contains all our AIPs the TDR.

What did Canadiana do before using JHOVE?

Prior to the TDR Certification process we made assumptions about files based on their file extensions: a .pdf was presumed to be a PDF file, a .tif a TIFF file, .jpg a JPEG file, and .jp2 a JPEG 2000 file. We only allowed those 4 types of files into our repository.

As a first step we used ImageMagick’s ‘identify’ feature to identify and confirm that files matched the file types. This meant that any files added since 2015 had data that matched the file type.
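
To show the general idea of checking content against the file extension, here is a minimal sketch that compares a file's leading bytes with well-known format signatures. This is not how ImageMagick's ‘identify’ works internally and it is not our production code; it is a simplified stand-in technique, and the function names are hypothetical.

```python
import os

# Well-known leading byte signatures for the four formats we accept.
MAGIC = [
    (b"%PDF-", "PDF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF"),   # little-endian TIFF
    (b"MM\x00*", "TIFF"),   # big-endian TIFF
    (bytes.fromhex("0000000c6a5020200d0a870a"), "JPEG 2000"),  # JP2 signature box
]

EXPECTED = {".pdf": "PDF", ".jpg": "JPEG", ".tif": "TIFF", ".jp2": "JPEG 2000"}


def sniff(head):
    """Return the format name suggested by the leading bytes, or None."""
    for magic, name in MAGIC:
        if head.startswith(magic):
            return name
    return None


def extension_matches(filename, head):
    """True only if the extension is one we accept and the content agrees."""
    ext = os.path.splitext(filename)[1].lower()
    expected = EXPECTED.get(ext)
    return expected is not None and sniff(head) == expected
```

In use, `head` would be the first dozen or so bytes read from the file. A zero-length file yields an empty `head` and therefore fails the check, as does a TIFF saved with a .jpg extension.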

At that time we did not go back and check previously ingested files, as we knew we would eventually be adopting something like JHOVE.

Generating a report for all existing files

As of May 9, 2017 we have 61,829,569 files in the most recent revisions of the AIPs in our repository. This does not include METS records, past revisions, or files related to the BagIt archive structure we use within the TDR.

I quickly wrote some scripts that would loop through all of our AIPs and generate reports for all the files in the files/ directory of the most recent AIP revision within each AIP. We dedicated one of our TDR Repository nodes to generating reports for a full month to get the bulk of the reports, with some PDF files still being processed.

Top level report from scan

Total files	61,829,569
Not well-formed	941,875 (1.5%)
Not yet scanned	253
Well-Formed and valid	60,828,836 (98.4%)
Well-Formed, but not valid	58,605 (0.09%)
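
The figures in the table above can be cross-checked with a few lines of arithmetic. This sketch only uses the numbers from the table; it confirms that the four categories add up to the total and recomputes each share.

```python
counts = {
    "Not well-formed": 941_875,
    "Not yet scanned": 253,
    "Well-Formed and valid": 60_828_836,
    "Well-Formed, but not valid": 58_605,
}
total = 61_829_569

# The four categories should account for every file.
assert sum(counts.values()) == total

for status, n in counts.items():
    print(f"{status}: {n:,} ({100 * n / total:.2f}%)")
```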

JHOVE offers a STATUS for files which is one of:

“Not well-formed” - fails the purely syntactic requirements for the format
“Well-Formed, but not valid” - meets the syntactic requirements, but fails the higher-level semantic requirements for format validity
“Well-Formed and valid” - passed both the well-formedness and validity tests
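
A tally like the ones in this article can be produced by pulling the status out of each XML report. The sketch below is an assumption-laden illustration, not our actual script: the sample document is a hypothetical, heavily trimmed report, and the code assumes the report carries the status in a `status` element, matching on the element's local name so that it does not depend on a particular XML namespace.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree adds to a tag."""
    return tag.rsplit("}", 1)[-1]


def statuses(xml_text):
    """Return the text of every <status> element in one report."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter()
        if local_name(el.tag) == "status" and el.text
    ]


# Hypothetical, trimmed-down sample; real reports contain much more detail.
SAMPLE = """<jhove xmlns="http://hul.harvard.edu/ois/xml/ns/jhove">
  <repInfo uri="0001.tif">
    <status>Well-Formed, but not valid</status>
  </repInfo>
</jhove>"""

tally = Counter(statuses(SAMPLE))
print(tally)
```

Feeding every report through `statuses` and updating a single `Counter` gives the per-status counts.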

Issues with .jpg files

Not well-formed	10
Well-Formed and valid	44,743,051
Well-Formed and valid TIFF	14

We had 10 + 14 = 24 .jpg files, ingested before we adopted the ‘identify’ check, that turned out to be either broken (truncated or zero-length files) or given the wrong file extension. Nine of the “Not well-formed” files were from LAC reels, where we were ingesting 1,000 to 2,000 images per reel.

Issues with .jp2 files

Well-Formed and valid	11,286,315

JHOVE didn’t report any issues with our JPEG 2000 files.

Issues with .tif files

Not well-formed, Tag 296 out of sequence	1
Not well-formed, Value offset not word-aligned	503,575
Not well-formed, IFD offset not word-aligned	435,197
Well-Formed and valid	4,608,048
Well-Formed, but not valid, Invalid DateTime separator: 28/09/2016 16:53:17	1
Well-Formed, but not valid, Invalid DateTime digit	21,004
Well-Formed, but not valid, Invalid DateTime length	3,483
Well-Formed, but not valid, PhotometricInterpretation not defined	202

Word alignment (offsets being evenly divisible by 2, since a TIFF word is 2 bytes) is the largest structural issue, but it is something that will be easy to fix. We are able to view these images, so the data inside isn’t corrupted.
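
As an illustration of what a word-alignment check involves, here is a minimal sketch that reads the 8-byte header of a classic TIFF file and tests whether the offset of the first IFD is word-aligned. It assumes the TIFF 6.0 convention that a word is 2 bytes, so an aligned offset is an even number. It only looks at the first IFD offset, not at the value offsets inside each IFD entry, and it is not JHOVE's code; the function names are hypothetical.

```python
import struct


def first_ifd_offset(header):
    """Return the first IFD offset from the 8-byte header of a classic TIFF."""
    if len(header) < 8:
        raise ValueError("need at least 8 bytes of header")
    order = header[:2]
    if order == b"II":
        endian = "<"   # little-endian
    elif order == b"MM":
        endian = ">"   # big-endian
    else:
        raise ValueError("not a TIFF byte-order mark")
    magic, offset = struct.unpack(endian + "HI", header[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (magic number is not 42)")
    return offset


def is_word_aligned(offset):
    """A 2-byte word boundary means the offset must be even."""
    return offset % 2 == 0
```

For example, a little-endian header of `b"II*\x00\x08\x00\x00\x00"` gives an offset of 8, which is aligned; an odd offset would be reported as not word-aligned.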

Validity of DateTime values is the next largest issue. The format should be "YYYY:MM:DD HH:MM:SS", so something that says “2004: 6:24 08:10:11” will be invalid (the blank is an Invalid DateTime digit), and “Mon Nov 06 22:00:08 2000” or “2000:10:31 07:37:08%09” will be invalid (Invalid DateTime length).
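
The three DateTime error categories can be reproduced with a small checker. This is a sketch written for this article, not JHOVE's implementation: it assumes the checks run in the order length, then separators, then digits, which is consistent with the messages reported for the example values above. The function name is hypothetical.

```python
# Positions of the fixed separators in "YYYY:MM:DD HH:MM:SS".
SEPARATORS = {4: ":", 7: ":", 10: " ", 13: ":", 16: ":"}


def check_tiff_datetime(value):
    """Return an error label for a TIFF DateTime string, or None if it is fine."""
    if len(value) != 19:
        return "Invalid DateTime length"
    for index, expected in SEPARATORS.items():
        if value[index] != expected:
            return "Invalid DateTime separator"
    for index, ch in enumerate(value):
        if index not in SEPARATORS and ch not in "0123456789":
            return "Invalid DateTime digit"
    return None


for example in [
    "2016:09:28 16:53:17",
    "28/09/2016 16:53:17",
    "2004: 6:24 08:10:11",
    "Mon Nov 06 22:00:08 2000",
    "2000:10:31 07:37:08%09",
]:
    print(repr(example), "->", check_tiff_datetime(example))
```

Note that this only checks the shape of the string; it does not verify that the month, day, or time values are in range.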

PhotometricInterpretation indicates the colour space of the image data (WhiteIsZero/BlackIsZero for grayscale, RGB, CMYK, YCbCr, etc.). The specification has no default, but we’ll be able to fix the files by making and checking some assumptions.

Issues with .pdf files

Not well-formed, No document catalog dictionary	3,081
Not well-formed, Invalid cross-reference table, No document catalog dictionary	2
Not well-formed, Missing startxref keyword or value	8
Not well-formed, Invalid ID in trailer, No document catalog dictionary	1
Not yet scanned	253
Well-Formed and valid	191,408
Well-Formed, but not valid, Missing expected element in page number dictionary	33,881
Well-Formed, but not valid, Improperly formed date	33
Well-Formed, but not valid, Invalid destination object	1

One of the board members of the Open Preservation Foundation, the organization currently maintaining JHOVE, wrote a longer article on the JHOVE PDF module titled “Testing JHOVE PDF Module: the good, the bad, and the not well-formed” which might be of interest. Generally, PDF is a hard format to deal with and there is more work that can be done with the module to ensure that the errors it is reporting are problems in the PDF file and not the module.

“No document catalog dictionary” -- The root tree node of a PDF is the ‘Document Catalog’, and it has a dictionary object. This exposed a problem with an update to our production processes where we switched from using ‘pdftk’ to using ‘poppler’ from the FreeDesktop project for joining multiple single-page PDF files into a single multi-page PDF file. While ‘pdftk’ generated Well-Formed and valid PDFs, poppler did not.

When I asked on the Poppler forum they pointed to JHOVE as the problem, so at this point I don’t know where the problem is.

I documented this issue at: https://github.com/openpreserve/jhove/issues/248
“Missing startxref keyword or value” - PDF files should have a header, document body, xref cross-reference table, and a trailer which includes a startxref. I haven’t dissected the files yet, but these may be truncated.
“Missing expected element in page number dictionary”. I’ll need to do more investigation.
“Not yet scanned”. We have a series of multi-page PDF files generated by ABBYY Recognition Server which take a long time to validate. Eventually it indicates the files are recognized with a PDF/A-1 profile. I documented this issue at: https://github.com/openpreserve/jhove/issues/161
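
For the “Missing startxref keyword or value” files, a quick way to test the truncation theory before dissecting them is to look for the expected markers at the start and end of the file. The sketch below is a rough triage aid under simple assumptions (the header is at byte 0 and the trailer sits in the last 1024 bytes); it is not a PDF validator and is unrelated to how JHOVE parses PDFs. The function name is hypothetical.

```python
def pdf_structure_hints(data, tail_size=1024):
    """Report whether the basic start and end markers of a PDF are present."""
    tail = data[-tail_size:]
    return {
        "has_header": data.startswith(b"%PDF-"),
        "has_startxref": b"startxref" in tail,
        "has_eof_marker": b"%%EOF" in tail,
    }


# Tiny hypothetical stand-in for a PDF, just enough to exercise the check.
complete = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\ntrailer\n<<>>\nstartxref\n9\n%%EOF\n"
truncated = complete[:20]

print(pdf_structure_hints(complete))
print(pdf_structure_hints(truncated))
```

A file that has the header but neither `startxref` nor `%%EOF` near the end is a strong candidate for having been cut short.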

Our longer-term strategy is to no longer modify files as part of the ingest process. If single-page PDF files are generated from OCR (as is normally the case), we will ingest those single-page PDF files. If we wish to provide a multi-page PDF to download, this will be done as part of our access platform, where long-term preservation requirements aren’t an issue. In the experiments we have done so far, we have found that the single-page PDF output of ABBYY Recognition Server and PrimeOCR validates without errors, and that the transformations we have done over the years were the source of the errors.

Sunday, May 7, 2017

Some of the earliest community groups on FLORA.org

Some of the earliest groups on Flora.org

Ask the Doctors, which was a real cool site managed by Rosaleen Dickson, whom I met through the Freenet. Involved in the publishing industry, she co-authored a book on HTML back in the early 1990s when it was such a new thing. She hosted pages on Canadian Books and, following the work she did with the doctors, ran an "Ask Great Granny" site.
I believe Auto-Free Ottawa was the first community group I hosted, a group of people in the early 1990s envisioning an Ottawa that wasn't as dependent on the automobile.
Canadian Homeschool is the last of the original groups to be hosted on FLORA.org
Community Democratic Action
KC. Maclure Centre
MAI-not
Ottawa District Committee of Ontario Special Olympics
Peace and Environment Resource Centre -- still around, but has had their own domain name for quite some time.
Pednet was a mailing list hosted by Majordomo, and then moved to Mailman.
Visually Impaired was, I believe, information hosted by Charles Lapierre
Westend Family Cinema also obtained their own domain name quite some time ago.

As web access became easier for organizations it was far more common for groups to get their own domain names so that they could move their sites between hosts without anyone having to remember a new URL. There are many redirects still in the config files for such groups that were previously hosted on FLORA.org.

Ardbrae Dancers of Ottawa
Big Soul Project Gospel Choir
Citizens for Safe Cycling
Coop Voisins (which Rina and I lived in from 1997 to 2003)
Deshantari of Ottawa-Carleton
Digital Copyright Canada (specifically the DMCA forum)
Greenspace Alliance of Canada's Capital
Good Companions Seniors' Center
Halifax Initiative
Housing Help
Linux FreeS/WAN
Lynx Developers
National Capital Region Y Canoe Camping Club
Ontario Federation of Teaching Parents
Rare Breeds Canada
re-Cycles
Sustainability Project / 7th Generation Initiative
The OX project

There were other groups over the years, but not all of them still exist such as:

Car Free Living
Communities Before Cars Coalition
Coop Area Network (Networking between Coop Voisins and the Conservation Coop)
Cycle Challenge / Commuter Challenge
Economic Good
Famous 5
FTAA Ottawa
FVC (Fair Vote Canada) Ottawa
Food Action Ottawa (FAO)
Global Education Network
Global Issues Forum
Green Party (Ottawa region, back when the party was smaller)
International Association for Near Death Studies (Ottawa)
Maclure Center
National Capital Runners Association
OPIRG Forestry group
Ottawa LETS
Ottawa River Bioregion Group
Ottawa Transit Riders Association
Ottawa Vegetarian Society
The Doorstep Alliance

After doing a bit of spring cleaning of some sites where I can no longer reach the managers, and who haven't updated the sites in years, there were only 4 sites remaining: two from close personal friends, and two community groups.

For the sites whose managers I couldn't reach, I set them to redirect to the most recent archived version of their sites on Archive.org. I made a mistake with the robots.txt file and they are temporarily unavailable, but I have sent a message to Archive.org in the hope that they can fix my mistake and restore the archive.

If there are any groups I've missed, please let me know in the comments. It has been a few years, and I've been looking at old Apache config files to be reminded of some of the organizations. I've not listed all the individuals (volunteers as well as individual election candidate websites from back when I hosted candidates during elections).

There have been many mailing lists over the years, but since this isn't something I'm planning on closing I won't get all nostalgic about them. I'm keeping the domain name and will be keeping the redirects active for any sites that have moved so bookmarks can be updated.

Saturday, May 6, 2017

Winding down FLORA.org after more than 22 years.

FLORA Community Web was started in December 1994 (See Ottawa alternative community minded networking) and the first domain name it used was flora.ocunix.on.ca. Later the name flora.ottawa.on.ca (date unknown) was adopted, and then FLORA.org (13-Oct-1996).

It offered free websites and mailing lists for community groups from before these things were as easily available as they are today. I haven't had time to spend on the server as I would like, and believe it would be best for me to admit that my interests have moved onward. I'm in the process of helping the remaining groups hosted on FLORA.org to migrate to some other hosting.

Thanks go out to the many volunteers who participated over the years, and the many friends I made through these connections.

If you want to take a look at what the site looked like at various points in the past, Archive.org's Wayback engine has many snapshots. The earliest list of flora.org organizations they have is a snapshot from 1998. This was back when I listed some of the clients as well as volunteer sites hosted on the same computer. A larger list of domains I was hosting on that computer can be seen from a 1999 list.

Some time in 2000/2001 most of those clients were moved to OpenConcept, where I was managing the growing number of computers while Mike Gifford of OpenConcept was doing the billing, customer relations and all the business side. In 2003 those servers became part of CooperIX, which Mike Richardson and I founded. CooperIX was a small co-location provider for more technical clients, with OpenConcept and its growing client list being the biggest single user.

Wherever my self-employed company went, the volunteer FLORA.org services came along with me.

In 2011 I moved from being a self-employed consultant to being staff at Canadiana.org. This was the first time I had been an employee since I became self-employed in early 1995, but it has been a great transition. However, with the work I'm doing for Canadiana I don't feel I have the time to dedicate to the volunteer services, which are currently a couple of virtual machines on a server running in the basement of my home.

I finally decided this year that I should start the process of decommissioning those VMs. I will start with the www.flora.org website, which is still managed via 1990s technology (content providers use FTP to log in to update their sites). I will likely spend the time to migrate the Mailman services to another VM and keep them running, as there are fewer security and other concerns with the mailing lists. I'll then decide what to do with my personal sites (my old business site, and so on).

Russell McOrmond's personal blog