Sunday, 24 November 2013

Almonte Gazette Searchable Database Project Summary and Some “Tips and Hints” to Search Effectively

This follow-up to a previous post about the Almonte Gazette is a document prepared by Matthew Moxley of the Mississippi Valley Textile Museum.

Project Summary

The Almonte Public Library and the Mississippi Valley Textile Museum partnered in 2012 and 2013 to digitize the Almonte Gazette newspaper collection and make it searchable and accessible to the general public. The project had its roots years earlier, when the Almonte Public Library had the newspapers from 1861-1989 professionally scanned.

Because of the size of newspapers, regular scanners cannot be used effectively to reproduce a digital image. Therefore, a large-format scanner must be used. Large-format scanners are quite expensive, difficult to use, and generally not available to the general public. This being the case, people and organizations usually pay private companies (print shops, graphic design shops, etc.) with access to these large-format scanners to perform the scanning for them. This is what the Almonte Public Library did with the Almonte Gazettes prior to the current project, and the images were saved for future use.

The 2012/2013 partnership's sole purpose was to turn these valuable images into searchable files so that researchers can perform keyword searches and be directed to the appropriate articles. This would immensely speed up Almonte Gazette research, which was previously possible only on microfilm. Two components were needed to make the scanned images into searchable documents: the proper software and an employee to perform the work.

Matthew Moxley was hired to perform the task of creating searchable documents for online use. After some research, a character recognition software program called ABBYY was purchased to convert digital images into searchable documents that can be used in a variety of ways. Mr. Moxley oversaw this process by assembling the correct images into complete issues, correcting any major spelling errors, rotating pages upright, and organizing the new searchable documents. Over 6,300 newspapers and over 31,000 pages were converted into searchable documents. This process took several months to complete.

Once the files were converted to searchable documents, the next step was to work out how to post them online. This initially proved to be a challenge due to financial constraints. The grant only covered the costs of converting the newspapers into searchable documents. Any web integration would have to be done at the MVTM and Almonte Public Library's own expense. The MVTM explored programs and services that would perform this, but they proved too expensive because of the volume of the newspapers. Thankfully, the MVTM has a brilliant webmaster who has volunteered his time to create an MVTM website and then integrate the searchable newspapers into the site. With no resources available, he had to create his own database in order to search the newspapers and return results. This free database/search function now gives researchers the ability to search 150 years of town news and family histories. The searching could be a little simpler; however, that would require software that is relatively free of charge and supports over 30 gigabytes of data.

Over the course of the project, an estimated 50-75 issues were seen as “missing” from the collection. We have made note of these missing issues and will keep an eye out for them in the future.

How the Optical Character Recognition (OCR) Works

It may be helpful to understand how the OCR software works in order to make sense of some of the results you are or are not getting. OCR is the mechanical or electronic conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text. The program analyzes the structure of the document image. It divides the page into elements such as blocks of text, tables, images, etc. The lines are divided into words and then into characters. Once the characters have been singled out, the program compares them with a set of pattern images. It advances numerous hypotheses about what each character is. Based on these hypotheses, the program analyzes different ways of breaking lines into words and words into characters. After processing a huge number of such probabilistic hypotheses, the program finally makes its decision and presents the recognized text. The OCR software reads the page like a page in a novel (left to right across the full width) rather than like a typical newspaper page (narrow columns with different stories from left to right). Therefore, no snippet is provided with the search results because, more often than not, the snippet would not make much sense.
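The reading-order problem can be illustrated with a small sketch. This is not how ABBYY works internally; it only shows, with made-up sample text and hypothetical function names, why text read straight across two columns produces snippets that do not make sense.

```python
from itertools import zip_longest

def read_like_novel(columns):
    """Read straight across the page, one printed line at a time."""
    rows = zip_longest(*columns, fillvalue="")
    return " ".join(" ".join(part for part in row if part) for row in rows)

def read_like_newspaper(columns):
    """Read each column from top to bottom before moving to the next."""
    return " ".join(" ".join(column) for column in columns)

# Two invented stories printed side by side in narrow columns.
left_column = ["Fire destroyed", "the old mill", "on Friday."]
right_column = ["Mr. Smith was", "married in", "Ottawa."]
page = [left_column, right_column]

print(read_like_novel(page))
print(read_like_newspaper(page))
```

Read across the full width, the two stories are interleaved line by line, which is why a snippet taken from that text would be of little use. The individual words are still there, so a keyword search can still find the page.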

Tips and Hints

Searches should be limited to about a 10 year span to avoid timeouts.
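For a search that covers many decades, one approach is to break the period into spans of about 10 years and search each span in turn. The helper below is only a sketch for planning those spans; the function name and the default width are illustrative and are not part of the search page itself.

```python
def year_spans(first_year, last_year, width=10):
    """Split an inclusive range of years into consecutive spans of at most `width` years."""
    return [
        (start, min(start + width - 1, last_year))
        for start in range(first_year, last_year + 1, width)
    ]

# The scanned collection runs from 1861 to 1989.
for start, end in year_spans(1861, 1989):
    print(f"{start}-{end}")
```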

The features for searching the Almonte Gazettes are highly dependent on the browser used to access the internet. It is hard to provide help if the exact configuration is not given. If you are experiencing difficulty, make note of the OS name and version (e.g. OS X 10.9 on a Mac or Windows 8 on a PC) plus the browser name and version (Safari 7.0, Internet Explorer 9.0, etc.).

Most of the advanced PDF viewers require quite recent versions of browsers. Unfortunately, if you are trying to use old computers or browsers, especially Windows XP and IE6 or IE7, it simply will not work and there is nothing that can be done.

People are encouraged to download and install a recent browser (they are free after all!) such as Chrome (https://www.google.com/intl/en_uk/chrome/browser/) so that their problems can be reproduced and advice provided. Chrome is available for both Mac and Windows as well as most tablets and mobile phones.

Researchers should also realize that the search mechanism is limited by the accuracy of the original OCR scan, which missed some words that were obscured, hyphenated or in a font that wasn’t recognized (e.g. in advertisements). The quality of some original newspapers was poor, while others were superb. This is much more apparent in the editions from the early years.

When searching, first select the year(s) of issue that you are interested in. This limits the number of pages the database has to search through, so results come back more quickly. Second, specify the search keyword(s) and click 'Perform Search'. Once the results appear, click on a reference to download PDF files of specific pages or issues that are of interest. (Note that these PDF files are large and may take several minutes to download to your computer. Your browser will require a plug-in to view the downloaded files.)

Keywords are in the context of a single page and are indexed by eliminating words of 3 characters or less and truncating words longer than 16 characters. Words in “None of these” (exclusions) are only active if there is at least one inclusion in “Any” or “All of these.”
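The indexing and matching rules above can be sketched in a few lines of Python. This is an illustration only, not the webmaster's actual code: the function names are hypothetical, lowercasing is an assumption, and treating a search with no “Any” or “All of these” words as returning nothing is one possible reading of the exclusion rule.

```python
import re

MIN_LENGTH = 4   # words of 3 characters or less are not indexed
MAX_LENGTH = 16  # longer words are truncated to 16 characters

def normalize(word):
    """Return the indexed form of a word, or None if it is too short to index."""
    word = word.lower()  # assumption: matching ignores case
    if len(word) < MIN_LENGTH:
        return None
    return word[:MAX_LENGTH]

def index_page(text):
    """Build the set of indexed keywords for a single page of OCR text."""
    words = re.findall(r"[A-Za-z]+", text)
    return {w for w in map(normalize, words) if w is not None}

def page_matches(page_words, any_of=(), all_of=(), none_of=()):
    """Apply the "Any", "All of these" and "None of these" boxes to one page."""
    any_terms = {t for t in map(normalize, any_of) if t}
    all_terms = {t for t in map(normalize, all_of) if t}
    none_terms = {t for t in map(normalize, none_of) if t}
    if not any_terms and not all_terms:
        return False  # assumption: exclusions alone return nothing
    if any_terms and not (any_terms & page_words):
        return False
    if not all_terms <= page_words:
        return False
    return not (none_terms & page_words)

page = index_page("The Almonte Gazette reported a fire at the Rosamond woollen mill")
print(sorted(page))
print(page_matches(page, all_of=["Rosamond", "mill"]))
print(page_matches(page, any_of=["fire"], none_of=["mill"]))
```

One consequence of the three-character rule is that a very short search word, such as a three-letter surname, would never be in the index, so it is better to search for a longer word that is likely to appear on the same page.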

Once you have opened the PDF file, you will have to perform a second keyword search in order for the word(s) you searched for to be highlighted. To do this, press Ctrl+F (PC) or Command+F (Mac) and a box will appear in the upper right-hand corner. Type in the word you searched for earlier and you will see the results highlighted in the article in a greyish-blue colour. Note: Only one occurrence will be highlighted at a time. If the word appears more than once, you can press Enter to step through the other occurrences on the page.

If you cannot read the words on the newspaper page because they are too small, simply move your cursor to the bottom of the page to make the PDF tool-bar appear. There will be several options such as print, save, page number, and zooming. To perform a zoom on the page to make the words larger simply press the + sign. It is recommended that you zoom to about 200% to read the page comfortably.

If you would like to save one particular article, you can save the PDF, go to “Edit”, select “Take a Snapshot” and paste that image into a Word document or Paint. There you can save or print the article.

If you would like to look up a particular issue of the Gazette, you will have to indicate the year and search for a very common word such as “Almonte”. You will then get the results along with the dates of the issues. Simply click the desired date.

If you are having difficulty finding words in the actual PDF, make sure your browser and plug-ins are up to date, as stated earlier.

If you have any issues or problems using this new resource please do not hesitate to contact Matthew Moxley at the Mississippi Valley Textile Museum at collections@mvtm.ca.


1 comment:

Susan said...

Thank you, John, for taking the time to help us with these details to help us conduct a search.