Tuesday, 4 September 2012

How to use probability in genealogy - part 1

I've blogged previously on how genealogical proof could be made more convincing, beyond the textual reasoning presently employed, by adopting a probabilistic approach. The example here, based on the non-genealogical example in this YouTube video, uses simple arithmetic with calculations done on a spreadsheet.

This genealogical example draws on some of the evidence presented in the article "Sally Hemings's Children: A Genealogical Analysis of the Evidence" by Helen Leary, published in the NGS Quarterly in September 2001, which documents why the evidence now points to Thomas Jefferson (TJ) as being the father of six of the children.

This will be a three-part post. This first part starts from the initial belief that it's unlikely TJ was the father, a view now held by a minority. We'll assign a probability, known as the prior probability, of 1%, or 0.01, that TJ was the father. We'll accept the opinion that the children's father was a relative of TJ descended from the same paternal grandfather, and assign that a probability of 98%, or 0.98. There's also a small probability it could be someone else, which we'll give a probability of 1%, or 0.01, to make the probabilities total 100%, or 1.0.

Now consider the evidence that Sally's son Eston had a "striking similarity" to TJ. We need to estimate the probability that there would be a striking similarity to TJ if TJ was the father. This is called the conditional probability. I'll estimate that 5 in 10,000 sons bear a striking resemblance to their father, 1 in 10,000 bear a striking resemblance if descended through another male line from the same paternal grandfather, and 1 in 1,000,000 if descended from an unrelated person. These are my estimates. Yours may differ, and what matters is the ratio between the conditional probabilities, not their absolute values.

The above values are entered into a spreadsheet in the second and fourth columns.

Now calculate the joint probability by multiplying the prior and conditional probabilities across each row. The final stage is to divide the joint probability in each row by the sum of all the joint probabilities in the column to obtain the posterior probabilities after accounting for the striking similarity. In this case adding the information has increased the probability of TJ being the father from 0.01 to nearly 0.05, and using the Canon of Probabilities mentioned here this is a change from extremely improbable to very improbable. The change is modest because we started off not believing that TJ was the father and the striking similarity evidence discriminates only weakly between TJ and a paternal relative (a ratio of 5 to 1).
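For readers who prefer code to a spreadsheet, the same calculation can be sketched in a few lines of Python. The priors and conditional probabilities are the ones given in this post; the function name bayes_update and the dictionary keys are my own illustrative choices, not part of any library.

```python
# Figures are taken from the post; names such as bayes_update are
# illustrative assumptions, not an established API.
priors = {"TJ": 0.01, "relative": 0.98, "other": 0.01}

# Estimated chance of a "striking similarity" under each hypothesis.
likelihoods = {
    "TJ": 5 / 10_000,
    "relative": 1 / 10_000,
    "other": 1 / 1_000_000,
}


def bayes_update(priors, likelihoods):
    """Return posteriors: each joint probability divided by the column sum."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: joint[h] / total for h in joint}


posterior = bayes_update(priors, likelihoods)
for hypothesis, p in posterior.items():
    print(f"{hypothesis:9s} prior={priors[hypothesis]:.2f} posterior={p:.4f}")

# Multiplying every conditional probability by the same factor leaves the
# posteriors unchanged, which is why only the ratios matter.
scaled = {h: 10 * v for h, v in likelihoods.items()}
rescaled_posterior = bayes_update(priors, scaled)
```

The TJ row comes out just under 0.05, as in the spreadsheet, and rescaled_posterior matches posterior to within rounding error.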

Now add additional evidence, in this case the coincidence between the dates Sally Hemings conceived her children and when TJ was present at Monticello, where Hemings lived. Analysis by Neiman in "Coincidence or Causal Connection" in William and Mary Quarterly, January 2000, accepted unquestioningly by Leary, gives about a 1.5% chance that TJ was not the father of the six children. This is the basis for the conditional probabilities.

The analysis proceeds as before with the posterior probabilities from the "striking similarity" calculation becoming the prior probabilities in this one.

Taking both items of evidence together, the probability that TJ was the father has jumped to 83%, or probable. The evidence of TJ's visits to Monticello is strong, but without the striking similarity evidence the probability that TJ was the father would have been 50%, even odds.
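The chaining of the two updates can be sketched the same way. Note that the text does not list the conditional probabilities used for the conception-date evidence, so the values in dates below (1.0 for TJ and 0.01 for each alternative) are assumptions chosen only for illustration, not the spreadsheet's actual entries. The function name bayes_update is likewise hypothetical.

```python
def bayes_update(priors, likelihoods):
    """Return posteriors: each joint probability divided by the column sum."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: joint[h] / total for h in joint}


# Priors and similarity estimates as given in the post.
initial = {"TJ": 0.01, "relative": 0.98, "other": 0.01}
similarity = {"TJ": 5 / 10_000, "relative": 1 / 10_000, "other": 1 / 1_000_000}

# ASSUMPTION: illustrative conditional probabilities for the coincidence of
# conception dates with TJ's presence; the post does not state these values.
dates = {"TJ": 1.0, "relative": 0.01, "other": 0.01}

# The posterior from the first item of evidence becomes the prior for the next.
after_similarity = bayes_update(initial, similarity)
after_both = bayes_update(after_similarity, dates)

# For comparison: the date evidence on its own, skipping the similarity step.
dates_only = bayes_update(initial, dates)

print(f"TJ after both items of evidence: {after_both['TJ']:.3f}")
print(f"TJ with date evidence only:      {dates_only['TJ']:.3f}")
```

With these assumed values the two results land close to the 83% and 50% quoted above. Different conditional probabilities for the date evidence would shift both figures.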

If you start out doubting that TJ was the father it takes a lot of evidence to change the probabilities. The same applies if you start out doubting that a probabilistic approach can be useful in genealogy.

In future posts we'll look at starting with other initial (prior) probabilities that TJ was the father.
