Friday, 13 April 2012

Rethinking the Stewardship of Newspapers in the Digital Age - 2

The previous post summarized a November 21, 2009 draft discussion paper with the above title, focusing on a subsection on the current state of newspaper stewardship.

The following is an extract from a 21 April 2010 draft of the newspaper "Pathfinder" co-authored by Susan Haigh and Samuel Generoux.

4.3 Stewardship and Growth of the Retrospective Digital Collection

While the above strategies for microforms are recommended, the group recognized that continuing to build the microform collection does not extend online access to historical newspapers, which is what users clearly want.

There are three approaches to extend access to and digital preservation of the Canadian retrospective newspaper collection. The first two would seek to build LAC's collection through the digitization of existing retrospective holdings – by us or by others – and by increasing acquisition of retrospective newspapers digitized by others. A third approach that would strengthen the overall national collection is facilitating better long-term preservation practices by others and the provision of federated access to digitized newspapers.

The first two approaches are discussed below while the third falls into discussions in section 6, the section on National Collaboration.

Digitization of Holdings

There is a range of decisions to be made before LAC would be in a position to launch a substantial newspaper digitization project or program. The decisions related to priority, purpose, content, and approach include:

  • Whether to digitize newspapers as a priority, giving them precedence over other parts of the LAC collection as necessary;
  • The goal of the digitization. Options include: to extend access by providing microfilm-like image-only online access; to provide new forms of access through OCR and indexing; or to create high-quality digital masters with full structural analysis (using e.g. METS/ALTO XML standard) to serve preservation, presentation, and enhanced access;
  • The part(s) of the newspaper collection, and the specific titles, to digitize. For example, the program could aim to digitize selected major daily newspaper titles from microform, or instead, multicultural and aboriginal titles from print. Notably, digitization from print is approximately 10 times more costly;
  • Whether the project is broader than just LAC digitizing from its own holdings; and if so, with whom to collaborate;
  • How to deal with rights issues. For example, a project could target pre-1920 material as it would be effectively rights clear; or target pre-1960 material, as producers' rights have expired; or seek permissions from rights holders to digitize complete runs;
  • Who would undertake the digitization. Options include in-house, outsourced, or collaborative arrangements, and each carries pros and cons;
  • The standards and methods that will be employed. Choices related to grayscale/bi-tonal, resolution, processing, output quality and output format are tied to the project's goals. Each carries cost implications, so newspaper digitization costs can range from about $.10 per page to $1.70 per page;
  • How to resource the undertaking, and ultimately, what budget will be available.

The working group looked a bit more closely at the possibility of digitizing some of the daily newspapers, because that option would allow stewardship decisions to be taken on the corresponding print collection. It obtained some sobering scale and cost indicators that ranged from $.10 per page all the way up to $1.70 per page depending on the quality of the output, the type of post production and indexing, and who does the digitization.

The following are three examples of content and cost scenarios. These should not be considered as recommendations from the working group and are only meant to illustrate the range of cost options and price points associated with newspaper digitization.
1. For digital scanning of the 12 major dailies in-house:
Number of reels: 21,482 (entire backruns for all 12 titles)
Cost per reel (includes duplicating the reel, scanning the duplicated reel, no OCR, no indexing, no zoning, etc.) including labor, materials and hardware: $86.02 (about $.10 per image)
Total cost: $1,847,881.64.
Number of weeks it would take (one microfilm scanner): 286.4 weeks (5.5 years) 
The estimate should be seen as illustrative only as not all 12 titles would need to be digitized, and permissions from rights owners cannot be assumed. 
2. If LAC were to collaborate with the microfilming/digitization company, higher quality imaging could result by being able to digitize from the microfilm master, but the cost per page would increase to perhaps $.20 per page (and that is imaging only).
It is notable that converting images to searchable text through automated and uncorrected optical character recognition (OCR) processing and indexing constitutes a relatively minor additional cost. However, reaching an acceptable level of OCR accuracy to support optimal retrieval introduces substantial human labor costs for quality assurance and OCR correction. Australia has undertaken the interesting approach of enabling volunteer end-users to correct OCR errors. 
3. A more sophisticated preservation and access project, such as that being undertaken for selected Western Canadian historical newspapers at University of Alberta, produces high-quality digital image files with full structural and content analysis using METS/ALTO XML for enhanced search and retrieval. It costs much more (up to $1.70 per image), but provides the most comprehensive access for users.
Clearly any digitization project would have to be scoped carefully with a view to balancing cost factors with the desired outcomes in terms of improved public access to newspapers and their digital preservation.
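
The arithmetic behind scenario 1 can be checked with a short Python sketch. The reel count, cost per reel, and duration are taken from the extract above. The implied images-per-reel figure and the `cost_at_rate` helper are assumptions introduced here for illustration: they simply divide and multiply the quoted numbers and are not estimates made by the working group.

```python
from decimal import Decimal

# Figures quoted in scenario 1 of the extract.
REELS = 21_482                      # entire backruns for all 12 titles
COST_PER_REEL = Decimal("86.02")    # labor, materials and hardware
TOTAL_WEEKS = Decimal("286.4")      # one microfilm scanner

# Per-image rates mentioned in the three scenarios (approximate in the source).
RATE_IMAGE_ONLY = Decimal("0.10")
RATE_FROM_MASTER = Decimal("0.20")
RATE_METS_ALTO = Decimal("1.70")


def total_cost(reels: int, cost_per_reel: Decimal) -> Decimal:
    """Total cost of scanning a given number of reels."""
    return reels * cost_per_reel


def cost_at_rate(images: Decimal, rate_per_image: Decimal) -> Decimal:
    """Hypothetical helper: scale a per-image rate to an image count."""
    return (images * rate_per_image).quantize(Decimal("0.01"))


total = total_cost(REELS, COST_PER_REEL)      # 1847881.64, matching the extract
years = TOTAL_WEEKS / 52                      # roughly 5.5 years
reels_per_week = REELS / TOTAL_WEEKS          # throughput implied by the estimate

# Assumption: "about $.10 per image" is treated as exact to back out
# an implied image count. The source gives no image count itself.
implied_images_per_reel = COST_PER_REEL / RATE_IMAGE_ONLY
implied_images = REELS * implied_images_per_reel

if __name__ == "__main__":
    print(f"Total cost: ${total:,.2f}")
    print(f"Duration: {years:.1f} years at {reels_per_week:.0f} reels/week")
    for label, rate in [("image only", RATE_IMAGE_ONLY),
                        ("from master", RATE_FROM_MASTER),
                        ("METS/ALTO", RATE_METS_ALTO)]:
        print(f"{label}: ${cost_at_rate(implied_images, rate):,.2f}")
```

Because the per-image rates in the extract are approximate, any totals scaled from them indicate only an order of magnitude.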

Collecting Digitized Newspapers

It is unclear whether newspapers digitized by others are subject to legal deposit. Owing to the potential scale of that type of ingest, and the fact that LAC's TDR is just now being implemented, LAC has not endeavored to exercise that provision. It is possible, however, that some of the memory institutions and other organizations that are digitizing newspapers would be willing to deposit digital files with LAC for long-term safekeeping and appropriate redundancy. As their project funding and use model (i.e. business model) is likely based on providing access from their own website, they would most likely seek for LAC to be a "dark archive".

It is important to note that newspaper files tend to be large and the number of files could be massive. There will be significant ingest and storage implications for LAC's TDR if LAC endeavors to take on a large-scale digital preservation role for digitized as well as born-digital newspapers.

As there are significant digitization projects currently in play at University of Alberta and at BAnQ in partnership with SOCAMI, exploratory discussion should begin with both of these.

In summary, the working group recommends that LAC:
  • Determine whether newspapers are one of its top priorities for digitization, and if so, develop a mass digitization project plan that will meet its identified preservation and access goals and is scaled to the resources that can be obtained or allocated to it.
  • Explore the feasibility of LAC obtaining files from others' digitization projects on deposit or by agreement on a preservation basis, initially through discussions with University of Alberta and BAnQ.

