The conclusion that I was fortunate comes from reading "Measuring Mass Text Digitization Quality and Usefulness," an article in D-Lib Magazine, July/August 2009, Volume 15, Number 7/8, available at http://www.dlib.org/dlib/july09/munoz/07munoz.html/.
The authors comment that manufacturer-supplied OCR accuracy figures of greater than 99.9% are misleading, as they are based on perfect laser-printed text, not the kind of printed and often microfilmed text normally available to work from. "...gaining accuracies of greater than 95% (5 in 100 characters wrong) is more usual for post-1900 and pre-1950's text and anything pre-1900 will be fortunate to exceed 85% accuracy (15 in 100 characters wrong)."
The article reports overall averages for the British Library 19th Century Newspaper Project as follows:
- Character accuracy = 83.6%
- Word accuracy = 78%
- Significant word accuracy = 68.4%
- Words with capital letter start accuracy = 63.4%
For genealogical purposes the 63.4% figure is most pertinent as it relates to the ability to find names.
The authors suggest that if word accuracy is greater than 80% then most fuzzy search engines will be able to sufficiently fill in the gaps or find related words.
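As a rough illustration of how fuzzy matching can recover a name that OCR has garbled, here is a minimal sketch using Python's standard difflib module. The surname list and the misread "Srnith" (with "m" read as "rn") are invented for the example, and real search engines may use quite different matching methods.

```python
from difflib import get_close_matches

# Hypothetical index of surnames a researcher might search for.
known_surnames = ["Smith", "Jones", "Taylor"]

# Hypothetical OCR output in which "m" was misread as "rn".
ocr_word = "Srnith"

# Return the known surnames that are similar enough to the OCR output.
# cutoff=0.6 is difflib's default similarity threshold.
matches = get_close_matches(ocr_word, known_surnames, n=3, cutoff=0.6)
print(matches)
```

Here the garbled word is still close enough to "Smith" to be matched, while the unrelated surnames fall below the threshold.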
A high success rate would still be possible from newspaper content because significant words are often repeated. Theoretically, assuming the OCR errors on each occurrence are independent, if the chance of finding a single instance of a word is 63.4%, the chance of finding at least one of two occurrences is 86.6%, and of finding at least one of three occurrences about 95%.
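The arithmetic behind those figures can be checked with a short sketch. It assumes each occurrence of a word is recognized or missed independently of the others, which real scans may not satisfy (a smudged page tends to spoil every word on it). The function name is my own.

```python
def chance_of_at_least_one_hit(single_hit_rate: float, occurrences: int) -> float:
    """Chance that at least one of several occurrences is read correctly,
    assuming each occurrence is an independent trial."""
    miss_rate = 1.0 - single_hit_rate
    # The word is lost only if every occurrence is misread.
    return 1.0 - miss_rate ** occurrences

for n in (1, 2, 3):
    print(f"{n} occurrence(s): {chance_of_at_least_one_hit(0.634, n):.1%}")
# 1 occurrence(s): 63.4%
# 2 occurrence(s): 86.6%
# 3 occurrence(s): 95.1%
```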