Thursday, 21 November 2013

Online genealogical dataset exploited for big data research

A preprint of the article "Data Mining of Online Genealogy Datasets for Revealing Lifespan Patterns in Human Population" (pdf), by Michael Fire and Yuval Elovici, has just appeared. It uses over a million profiles from WikiTree and focuses on 363,292 of them, those for individuals who were born in the United States.
The paper is mostly about the techniques, which are gobbledygook to anyone but a big data analysis specialist.
The reported results show, with high confidence, some very small influences on lifespan.
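
For the curious, "very small but high confidence" is what you would expect from a very large sample. The sketch below is only an illustration, not the authors' method: it computes a Pearson correlation and the standard t-statistic for testing whether a correlation differs from zero. The correlation of 0.05 is a made-up value, and the large sample size simply reuses the 363,292 profile count mentioned above; the paper's individual analyses may well use different subsets.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t-statistic for testing a correlation r from n pairs against zero."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical correlation, chosen only to illustrate the effect of sample size.
small_r = 0.05

# With 100 pairs the statistic is roughly 0.5: indistinguishable from chance.
print(t_statistic(small_r, 100))

# With 363,292 pairs the statistic is roughly 30: overwhelmingly significant.
print(t_statistic(small_r, 363292))
```

As a rule of thumb, a t-statistic above about 2 is conventionally called significant, so the same tiny correlation goes from unremarkable to near-certain purely because the dataset is huge. Significant here means "unlikely to be chance", not "large".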

"Our findings indicate that significant but small positive correlations exist between the parents’ lifespan and their children’s lifespan. Additionally, we found slightly higher and significant correlations between the lifespans of spouses. We also discovered a very small positive and significant correlation between longevity and reproductive success in males, and a small and significant negative correlation between longevity and reproductive success in females. Moreover, our machine learning algorithms presented better than random classification results in predicting which people who outlive the age of 50 will also outlive the age of 80."
The analysis also identifies some interesting anomalies in the records: periods when the US median lifespan for males increased while that for females suddenly decreased, or vice versa. From 1650 to 1660 the male median lifespan increased from 61.86 to 66.81, while over the same period the female median lifespan decreased from 63.04 to 60.8. Between 1770 and 1780 the female average lifespan increased from 66.31 to 68.63, while the male average lifespan decreased from 66.55 to 64.29. Why, one wonders!

It's good to see databases like WikiTree being exploited in this manner. Maybe the tools used will be made more accessible to the layman so that this type of study can flourish, just as genetic genealogy has prospered through the efforts of citizen scientists.

