Wednesday, 28 January 2009

Probability in genealogy

According to studies cited in Wikipedia, about 100 billion people have lived on Earth. If you were compiling a genealogy and picked one of them at random to fill a specific position on the family tree, the odds of selecting the right one would be very long: a probability of 0.00000000001.

To improve the odds you start adding information. As you add a timeframe, region and surname you rule out billions of people and increase the probability that a random selection from those remaining is correct.

If you know for sure your ancestor was named Smith and had a birth registered in 1900 in England and Wales, you could check FreeBMD and find 13427 matches. Select one of them at random and the probability of being correct is 0.0000745.

Add the information that the first name is John, which leaves 520 of them, and the probability that one selected at random is the correct one increases to 0.00192.

If you knew they were from London you would find 44 of them, giving a probability of 0.0227 that one selected at random is the correct one.

Adding the information that the birth was registered in Stepney would eliminate all but one candidate in that year (probability 1.0). Going year by year, Stepney saw one John Smith registered in each of 1900, 1901, 1902 and 1907, two in each of 1903, 1906, 1908 and 1909, and none in 1904 or 1905. Looking at a longer time period you can find quarters (not years) in which three John Smiths had a birth registered in Stepney, in which case the probability would be 0.333.
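The narrowing above is just one divided by the number of remaining candidates. As a minimal sketch, assuming each remaining candidate is equally likely to be the right one, the figures can be recomputed from the counts quoted in this post; the function name and labels are illustrative only.

```python
# Candidate counts quoted in the post (FreeBMD, births registered in 1900).
# The labels are illustrative, not FreeBMD search terms.
CANDIDATES = [
    ("Smith, England and Wales", 13427),
    ("John Smith", 520),
    ("John Smith, London", 44),
    ("John Smith, Stepney", 1),
]


def random_pick_probability(n_candidates: int) -> float:
    """Chance that one candidate drawn uniformly at random is the right one.

    Assumes every remaining candidate is equally likely, which is the
    simplification used in the text.
    """
    if n_candidates < 1:
        raise ValueError("need at least one candidate")
    return 1.0 / n_candidates


for label, count in CANDIDATES:
    p = random_pick_probability(count)
    print(f"{label:<26} {count:>6}  p = {p:.3g}")
```

Calling `random_pick_probability(3)` covers the quarters in which three John Smiths were registered in Stepney.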

It's evident that with a common name you need to add a tremendous amount of detail to pinpoint the right John Smith. You'll find yourself scouring the records, sending for and paying for much additional information, in order to work out which one is the right one.

With a less common name your task can be less demanding, although you still need to beware of situations such as a family that favours a particular first name and has cousins living in the same area.

No matter how hard you try, and even if the documentary evidence all points to one person, you'll never be 100% certain (probability 1.0) that you have the right person. You can never be sure the documentation tells the truth. People do lie on records, and to other family members who become informants for later records.

Lying about the person who is the genetic father is one such case. Non-paternity rates of 0.01 to 0.33 have been reported.

Take the case of identifying someone with 0.99 probability based on documentary records. If you take into account a possible non-paternity rate of 0.01, the real probability is 0.99 * (1 - 0.01) = 0.9801. Is it worth going the extra mile to 0.999 based on documentary records if non-paternity at 0.01 will reduce the probability back to 0.989?
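The arithmetic above can be checked with a short sketch. It assumes, as the calculation in the text does, that the documentary identification and the paternity question are independent, so the two probabilities simply multiply; the function name is illustrative.

```python
def combined_probability(p_documents: float, non_paternity_rate: float) -> float:
    """Probability that the documented father is also the genetic father.

    Assumes the documentary evidence and non-paternity are independent,
    so the probabilities multiply (the simplification used in the text).
    """
    for value in (p_documents, non_paternity_rate):
        if not 0.0 <= value <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    return p_documents * (1.0 - non_paternity_rate)


# The two documentary confidence levels discussed in the text,
# at the low end (0.01) of the reported non-paternity range.
for p_docs in (0.99, 0.999):
    result = combined_probability(p_docs, 0.01)
    print(f"documents {p_docs}: combined {result:.4f}")
```

Substituting a higher non-paternity rate from the reported range, up to 0.33, shows how quickly that term comes to dominate the documentary confidence.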

1 comment:

William said...

This is an interesting exercise, but one that has more than a little "straw man" quality to it.

It ignores the cumulative impact of each additional generation in your assumptions both about the reliability of paper records and YDNA, and it uses a best-possible-case scenario for the rate of non-paternity.

The assumptions in your exercise about the reliability of paper records go far beyond the assumptions made about conclusions in the genealogical proof standard (clear and convincing), and even beyond those of a criminal court (beyond a reasonable doubt).

Are we really to believe that given best-possible-case assumptions for the reliability of paper records, say back eight generations from a living person, that we can say that someone is our fourth great-grandfather with a .989 probability rate, assuming a rate of .01 of non-paternity?