03 December 2018

DNA Cousins are more distant than they appear

When interpreting autosomal DNA statistics, one must be careful to distinguish between the distribution of shared DNA for a given relationship and the distribution of relationships for a given amount of shared DNA.

Distribution of shared DNA for a known relationship

The Shared cM Project 3.0, crowdsourced by Blaine Bettinger, has collected data on the distribution of shared DNA by relationship. Most submissions were likely from tests conducted on known relatives, others from a test that found a match whose relationship was subsequently identified. The figure is an example of the full distribution for first cousin once removed.

The table below summarizes the centimorgan (cM) results for a wider range of relationships. The more distant the relationship, the less DNA is shared and the fewer data points were reported, so sampling error is greater. For relationships more distant than third cousin the Shared cM Project 3.0 obtained insufficient data for a full analysis. Perhaps there was a relationship but the amount shared was too small for it to be recognized. That also shows in the ratio of the average cM to the amount expected, which increases for more distant relationships.

1st percentile      486        131        47         47         0          0          0          0          0          0
Average (Expected)  884 (850)  440 (425)  232 (213)  232 (213)  123 (106)  75 (52)    75 (53)    49 (27)    36 (13)    29 (7)
99th percentile     1761       851        517        517        317        229        229        175        122        118
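
The ratio of average to expected cM mentioned above can be checked directly from the table. The short sketch below copies the two sets of figures from the "Average (Expected)" row, column by column, and divides one by the other. The variable names are only illustrative, and the columns are assumed to run from the closest relationship on the left to the most distant on the right.

```python
# Average and expected cM, copied column by column from the
# "Average (Expected)" row of the table.
average_cm = [884, 440, 232, 232, 123, 75, 75, 49, 36, 29]
expected_cm = [850, 425, 213, 213, 106, 52, 53, 27, 13, 7]

# Ratio of reported average to expected amount for each column.
ratios = [avg / exp for avg, exp in zip(average_cm, expected_cm)]

for avg, exp, ratio in zip(average_cm, expected_cm, ratios):
    print(f"{avg:>4} / {exp:>3} = {ratio:.2f}")
```

The ratio starts close to 1 in the first column and is above 4 in the last, which is what would be expected if distant relatives who happen to share little or no DNA go unreported.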

Distribution of relationships for given amounts of shared DNA
Extended families vary in size. In informal polls I've found anything from zero to over 80 first cousins. According to a study cited by ISOGG https://isogg.org/wiki/Cousin_statistics the average British person has an estimated 5 first cousins, 28 second cousins and 175 third cousins.

Even if the expected amount of shared DNA is relatively small, having more people with a more distant relationship increases the chance of finding a match: you catch more fish if there are more fish to catch. The weighting depends on the number of people with the relationship, and that depends on whether families had few or many children who themselves went on to have few or many children, and how many survived.

Simple Scenarios
Take the situation of 200 cM of shared DNA where the only relatives with tests are in the same generation or one or two generations younger. This would be the case for a senior whose older generations are deceased, while no one in the great-grandchildren’s or subsequent generations has yet been born. Assume every family in the tree has the same number of children and there are no half relationships, endogamy or other complicating factors. Everyone survives and has that same number of children. The table shows the percent probabilities from the Shared cM Project and for trees in which every family uniformly has two children (Two Family) or three children (Three Family).

Percent Probability  1C  1C1R  1C2R  2C  2C1R  2C2R  3C  3C1R
Shared cM Project    0   2     34    37  19    4     4   >0
Two Family           0   1     22    24  31    9     10  3
Three Family         0   1     19    18  34    12    12  5

The peak probability shifts from second cousin to second cousin once removed. Probabilities of closer relationships decrease and of more distant relationships increase. The trend is larger for larger family size — unsurprising as, for instance, there are 64 third cousins in the Two Family, 432 in the Three Family.
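
The cousin counts and the shift in the peak can be illustrated with a rough sketch. It is not necessarily the calculation behind the table; it rests on two assumptions. First, in a tree where every couple has the same number of children, a person has 2^n ancestors n generations up, each with (children - 1) siblings, and each sibling has children^(n + m) descendants n + m generations down, which gives the number of nth cousins m times removed in younger generations. Second, the "Shared cM Project" row is treated as if every relationship type were equally common, and each percentage is then weighted by the number of relatives of that type. The function names are hypothetical.

```python
LABELS = ["1C", "1C1R", "1C2R", "2C", "2C1R", "2C2R", "3C", "3C1R"]
# (cousin degree, times removed) for each label, in the same order.
RELATIONSHIPS = [(1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2), (3, 0), (3, 1)]
# "Shared cM Project" row for 200 cM; the ">0" entry for 3C1R is entered as 0.
BASELINE_PERCENT = [0, 2, 34, 37, 19, 4, 4, 0]


def cousin_count(degree, removed, children):
    """Relatives of one type when every couple has `children` children.

    Removed cousins are counted in younger generations only.
    """
    ancestors = 2 ** degree              # 2 parents, 4 grandparents, ...
    siblings_each = children - 1         # siblings of each such ancestor
    descendants_each = children ** (degree + removed)
    return ancestors * siblings_each * descendants_each


def reweight(children):
    """Weight each baseline percentage by the number of such relatives."""
    weights = [
        pct * cousin_count(degree, removed, children)
        for pct, (degree, removed) in zip(BASELINE_PERCENT, RELATIONSHIPS)
    ]
    total = sum(weights)
    return {label: 100 * w / total for label, w in zip(LABELS, weights)}


print(cousin_count(3, 0, 2), cousin_count(3, 0, 3))  # 64 432
for children in (2, 3):
    result = reweight(children)
    print(children, {label: round(pct) for label, pct in result.items()})
```

Because the sketch starts from rounded percentages and drops the ">0" entry, it will not reproduce the table's figures exactly, but under these assumptions the most probable relationship moves from 2C to 2C1R for both family sizes.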

Simple population scenarios show that the distribution of relationships for given amounts of shared DNA differs from the distribution of shared DNA for a known relationship as found in the Shared cM Project. Taking into account the population distribution moves probabilities to more distant relationships, more so for larger family size.
It’s important to appreciate the limitations of the analysis. The statistics in the Shared cM Project are less reliable where there are fewer data reported, for more distant relationships and in the tails of the distribution. Further, given the variability of individual extended families the scenarios in the study will only be broadly representative for any individual's result.
While it would be interesting to have a similar analysis for lesser amounts of shared DNA, the statistics at present are too noisy and do not extend to sufficiently distant relationships to yield reliable results.


Denis Bourque said...

Hi John,

If I understand this analysis, it is based on the study of relationships based on autosomal DNA amongst descendants of a single common ancestor.

There is also another DNA scenario which can lead one to believe closer relationships than actually exist. I’m referring to the situation where two individuals share more than one common ancestor. In such an instance, the number of shared cMs could be from ancestors further back in time - hence, the relationship would be more distant than presumed at first glance.

This is my case. The Acadians were a ‘closed’ society for several hundred years. For example, my parents are related to each other 21 different ways (going back 4 centuries), from 4C to 7C, so many common but distant ancestors. If I didn’t know that from my genealogy research and had only looked at their DNA (impossible to do now, as they are both deceased), I’m certain there would be a large number of shared cMs - which could have led me to erroneously conclude, if I were to base my conclusions solely on shared cM tables, a much closer relationship than actually existed.

I have actually had this brought to my attention through the 'matches' I received from FTDNA and 'MyHeritage DNA'. I often receive notices of 1C1R or 2C matches - but this is clearly not the case as I have found and recorded all these genealogically. The answer is that we have multiple more distant common ancestors - hence, cumulatively, a lot of shared cMs - but the matchmaking algorithms cannot account for this.

Just food for thought.

Debby said...

I agree with the above post. My maternal grandmother's line originates from a small village of approx 200, where most of the families also came from the same area of Quebec. Hence it is so difficult to tease apart, as second cousins married into the same families. One fine gentleman had 22 children. I have more than 1000 matches to play with, as many of the families have tested. I belong to a Facebook group called Otter Lake memories (the town where my grandmother was raised); the group has over 1200 members, which speaks of the large families coming from this area.