02 February 2017

Distant Reading

What's distant reading? Only done by those with hypermetropia?
Distant reading means gaining understanding not by studying particular texts, but by aggregating and analyzing massive amounts of textual data from a corpus.
Distant reading has evident limitations for genealogy, where you need to pick through records to find something particular, say, an obit of your great-grandmother. That's called, unsurprisingly, close reading.
What distant reading can do for the family historian is provide context. Say your great-grandmother died of influenza and you'd like to know if it was a year when the disease was prevalent.

You may be familiar with Google Ngram, where you can explore how frequently a word or phrase has been used in a corpus of books over time. This example shows the profile for cholera in red and influenza in blue. There's a huge spike for cholera in 1884 and upticks for both in 1942. While there is an uptick for influenza in 1918/19, for the pandemic that killed more than 20 million, perhaps as many as 50 million, the Ngram peak is not as significant as in 1905. The problem for genealogy is that the book corpus is organized by publication date, which may not bear any relationship to current events. It is good for long-term trends: try cigarette, aircraft, newspaper, radio, television.
Recent months have seen several articles published on distant reading using newspaper databases as the corpus. The British Newspaper Archive, Chronicling America and a Dutch newspaper database have all been explored. While newspapers cover current events, there are still issues of representativeness, as discussed in the article Bridging the gap between quantitative and qualitative research in digital newspaper archives.

The studies using the British Newspaper Archive have been conducted by a group from the University of Bristol led by Professor Nello Cristianini. A recent article, Content analysis of 150 years of British periodicals, includes a diagram reproduced here, with the bottom panels showing the difference between the frequencies through the years for cholera, influenza, smallpox and plague from newspapers (left) and books (right). Note the more prominent peak for influenza in the newspaper corpus in 1918.
That same article also gives a link where you can download a huge file giving the year-by-year frequency of occurrence of different Ngrams from the newspapers. It's a computing challenge, not for those unprepared to wrangle large data files.
I've been experimenting with it. A graph produced for cholera, influenza and cancer is appended.
More later if and when time permits.
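
For anyone tempted to try wrangling a file like that, here is a minimal Python sketch of one way to pull a few terms out without loading the whole thing into memory. The layout is an assumption, not the documented format of the download: it supposes one tab-separated record per line giving the n-gram, the year and a count. The function name yearly_counts and the made-up sample data are purely illustrative.

```python
import io
from collections import defaultdict


def yearly_counts(lines, terms):
    """Stream (ngram, year, count) records and keep only the wanted terms.

    ASSUMPTION: the three tab-separated columns are hypothetical; adjust the
    column positions to match whatever the real file contains.
    """
    wanted = {t.lower() for t in terms}
    series = defaultdict(dict)  # term -> {year: count}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        ngram = fields[0].lower()
        if ngram not in wanted:
            continue  # most lines are discarded here, so memory use stays small
        try:
            year, count = int(fields[1]), float(fields[2])
        except ValueError:
            continue  # skip rows whose year or count is not numeric
        series[ngram][year] = series[ngram].get(year, 0.0) + count
    # Return plain dicts with the years in ascending order.
    return {term: dict(sorted(years.items())) for term, years in series.items()}


# Tiny made-up sample standing in for the real download (values are invented).
SAMPLE = (
    "ngram\tyear\tcount\n"
    "cholera\t1849\t120\n"
    "influenza\t1918\t300\n"
    "cancer\t1918\t45\n"
    "influenza\t1890\t210\n"
    "plague\t1918\t12\n"
)

result = yearly_counts(io.StringIO(SAMPLE), ["cholera", "influenza", "cancer"])
for term, years in sorted(result.items()):
    print(term, years)
```

With a real file you would pass an open file handle instead of the sample, for example open("ngrams.tsv", encoding="utf-8"), where the file name is hypothetical. Because the file is read one line at a time, its size matters much less than it would in a spreadsheet.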

It should not go unremarked that there is no possibility of performing a similar analysis on Canadian newspapers, as Canada lacks any national newspaper digitization program.

3 comments:

Lois Hellemond said...

Thanks John. I was getting really excited with this information as my GGrandfather died of influenza in Fort William Ontario in 1919. Your last comment has saved me a lot of wasted research time on Canadian relatives!

Ian Barker said...

A reference on the 'Spanish Flu' in Canada is:

Pettigrew, E. The Silent Enemy. Canada and the Deadly Flu of 1918. Western Producer Prairie Books, Saskatoon, 156 pp., 1983.
