Monday, 6 October 2014

Computational Genealogy Using WikiTree Data

Data for 6.67 million people in over 160 countries (mainly the US, UK, Germany, Canada, New Zealand and Holland) have been used to examine name trends, birth and fertility, marriages and lifespan in the article "Quantitative Analysis of Genealogy Using Digitised Family Trees" (pdf) by Michael Fire, Thomas Chesney & Yuval Elovici.

Some of the general trends found are:

  • The trend of naming a son after his father rises then falls through the 16th century, and throughout history there have been fewer girls named for their mother than boys named for their father. About 24% of twins' names start with the same letter. The most frequent twin names between 1800 and 1900 are Mary and Martha, and John and James.
  • Of 963,416 births, 10,246 were twins (about 1.06%). Twin gender ratios were almost even: male-male 32.7%; female-female 33.9%; and male-female 33.3%.
  • In any given time period males marry later than females, and the age at marriage increases over time. The raw data corroborates that during the medieval period it was not unknown for girls aged 12 and boys aged 14 to marry, although this was not usual.
  • If an individual’s spouse lives longer, then that individual lives longer too. Twins also tend to have the same lifespan.
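
The twin share follows directly from the two counts given in the list above. A minimal sketch of that arithmetic, with variable names chosen here purely for illustration:

```python
# Counts as reported in the summary above.
total_births = 963_416
twin_births = 10_246

# Share of births that were twins, as a fraction and as a percentage.
twin_fraction = twin_births / total_births
twin_percent = 100 * twin_fraction

print(f"fraction: {twin_fraction:.4f}")
print(f"percent:  {twin_percent:.2f}%")
```

The fraction is roughly 0.0106, which corresponds to about 1.06% of births once expressed as a percentage.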

The authors state that computational genealogy, which applies machine learning tools, graph analysis and related techniques to high-volume ancestry data, opens up many possibilities for understanding social trends.

There are also obvious benefits in exploiting large data-sets for genealogists interested in the use of Bayesian techniques to refine genealogical proof.
