Friday, 17 July 2009

How good is newspaper digitization?

It seems I was very fortunate to find all three words in my great-grandfather's name accurately deciphered in two articles from the British Library's 19th Century Online Newspaper Archive. The story of what I found will be coming in a future issue of Anglo-Celtic Roots, the quarterly chronicle of the British Isles Family History Society of Greater Ottawa.

The conclusion that I was fortunate comes from reading Measuring Mass Text Digitization Quality and Usefulness, an article in D-Lib Magazine, July/August 2009 Volume 15 Number 7/8, available at http://www.dlib.org/dlib/july09/munoz/07munoz.html/.

The authors comment that manufacturer-supplied OCR accuracy figures of more than 99.9% are misleading because they are based on perfect laser-printed text, not the kind of printed and often microfilmed text normally available to work from. "...gaining accuracies of greater than 95% (5 in 100 characters wrong) is more usual for post-1900 and pre-1950's text and anything pre-1900 will be fortunate to exceed 85% accuracy (15 in 100 characters wrong)."

The article reports overall averages for the British Library 19th Century Newspaper Project as follows:

  • Character accuracy = 83.6%
  • Word accuracy = 78%
  • Significant word accuracy = 68.4%
  • Words with capital letter start accuracy = 63.4%
These represent the highest OCR accuracy rate expected as they are based on two samples from the "best" sections of a large sample of pages.

For genealogical purposes the 63.4% figure is most pertinent as it relates to the ability to find names.

The authors suggest that if word accuracy is greater than 80% then most fuzzy search engines will be able to sufficiently fill in the gaps or find related words.
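To give a feel for what "filling in the gaps" could look like, here is a minimal sketch using Python's standard difflib module. It is only an illustration, not the method used by the British Library's search or described in the article. The surname and its misread variants are made up for the example, and the 0.8 cutoff is a character-similarity ratio chosen for the demonstration, not the article's 80% word accuracy figure.

```python
from difflib import get_close_matches

def fuzzy_find(query, ocr_words, cutoff=0.8):
    """Return OCR'd words that are close to the query.

    cutoff is difflib's similarity ratio (0 to 1); 0.8 is an
    assumption for this sketch, not a value from the article.
    """
    return get_close_matches(query, ocr_words, n=len(ocr_words), cutoff=cutoff)

# Hypothetical OCR output: two misread versions of a surname plus unrelated words.
ocr_words = ["Nortliwood", "Northwoocl", "London", "Gazette"]

# Both misread variants are returned; the unrelated words are not.
print(fuzzy_find("Northwood", ocr_words))
```

An exact search for "Northwood" would find nothing in that list, while the fuzzy comparison recovers both damaged versions.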

A high success rate would still be possible from newspaper content because significant words are often repeated. Theoretically, if the chance of finding a single instance of a word is 63.4%, and errors in separate occurrences are independent of one another, the chance of finding at least one of two occurrences is 86.6% and of finding at least one of three occurrences is 95%.
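The arithmetic behind those figures is the chance that not every occurrence is misread, that is, 1 - (1 - p)^n. A short sketch in Python, assuming each occurrence is recognised or missed independently of the others (the function name is just for illustration):

```python
def chance_of_at_least_one_hit(p_single, occurrences):
    """Chance that at least one of several occurrences is read correctly.

    Assumes each occurrence is recognised independently with the
    same probability p_single.
    """
    return 1 - (1 - p_single) ** occurrences

# 63.4% is the capital-letter-start word accuracy quoted above.
for n in (1, 2, 3):
    print(n, f"{chance_of_at_least_one_hit(0.634, n):.1%}")
# prints:
# 1 63.4%
# 2 86.6%
# 3 95.1%
```

In practice a smudged or badly filmed page is likely to damage every occurrence of a name on it, so the independence assumption is optimistic and these figures are best read as an upper bound.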
