Here are some practical hints and tips to optimise your Privacy if you are thinking of doing a DNA test (or you have already done one).
1) Don’t test!
This is the simplest way to avoid exposing yourself to potential online scrutiny and unwanted intrusion from others. If you are not sure whether you should do a DNA test or not, do yourself a favour and don't test. You will only worry about it if you do.
2) Get your brother to do it instead
Some people are less concerned about privacy than others ... so if this is how one of your siblings feels, why not ask them to test instead? One person I know did this and everyone was happy. Win-win.
3) Don't use your Real Name
You are not obliged to use your real name. You can use whatever name you want. I don't recommend using "Clint Eastwood" (unless you want unlimited fan-mail) - much better to use something completely nondescript like John Williams or Jane Jones.
Genealogically it makes sense to use your surname (as this will help with any genealogical research) but again, it's not essential. You can just as easily use an alias, a pseudonym, or a nom de plume. Or even a sequence of letters & numbers … FYL227 has a particular ring to it.
A cunning disguise will fool most people (this is obviously Groucho Marx in a wig)
4) Disguise your Personal Information
Similar to above, you are under no obligation to use your real date of birth. Now is the perfect opportunity to take 10 years off your age. I did and I feel so much better.
You could also create a bespoke, untraceable email address just for your DNA tests. It's easy to set one up on Gmail and have any messages directed to your inbox. I believe firstname.lastname@example.org is already taken but something similar would work just as well. It would be extremely difficult to identify you from a seemingly random combination of letters and numbers.
Only give the minimum amount of information necessary. I don't bother with my postal address or telephone number. If they can't reach me by email then I am probably on a retreat to the North Pole and they are unlikely to reach me by snail mail or telephone either.
5) Privatise your DNA account
All the testing companies allow you the option to make your results completely private. For some, this means that your matches cannot see you, but you cannot see them either. And this seems like it might defeat the purpose of doing the test in the first place, but not so! You can de-privatise your results when you want to work on them, and re-privatise them when you have finished. This minimises the amount of time you are "exposed to public view" by your matches.
6) Privatise your Family Tree
Without a family tree attached to them, DNA results are relatively useless. You could show up as a close "2nd cousin match" to someone else but if you haven't supplied any family tree information, it can be very difficult for them to figure out how you fit in to their tree.
Keeping your family tree private is as effective as keeping your DNA results hidden (if not more so).
7) Delete your DNA account
If you have finished working with them, you could delete your results completely. This works really well if you have transferred your results to a particular website from another company - you can always keep the original results on the website you initially tested with and re-upload them again at any time.
Similarly, you can delete your kit from any website and have your sample destroyed.
So there are ways and means of finding the level of privacy and security that you personally feel comfortable with. Can you think of any others? Leave a comment below.
When it came time for her to deliver, she was taken into a room and put to sleep. When she woke up, the large bump of her pregnancy was gone, and so was her child. For the past 60 years she has always wondered if it was a boy or a girl - they wouldn't tell her.
Now, 60 years later, thanks to DNA, she knows. It's a boy.
There are many people in Ireland searching for their birth family. Some are adoptees, some are foundlings, some are people who were raised in industrial schools, some of whom were boarded out. Over the past few years, many of these people have turned to DNA for help, and these numbers are increasing all the time as the success stories of people finding family through DNA are becoming more widespread.
But it's not just the children that are searching for their families, it's the parents too. I have been working with several birth mothers (in their 70s and 80s) who are trying to locate the child that was taken away from them many decades beforehand. Many tell a similar story, like the one at the top of this article. They had little control over what happened to them. Decisions were made for them. And they were left with little or no information about the child they gave birth to, not even what gender it was.
I am delighted to announce that one of my clients (the woman above) has finally reconnected with her son. She gave up her child 60-odd years ago, and it only took 12 months for DNA to find him. She tested with Ancestry and then uploaded her data to FamilyTreeDNA, MyHeritage & Gedmatch (the recommended approach).
Now comes the next step in their journey - getting to know each other, building bridges, putting the past in the past, and moving into the future. This is a slow process that will take a lot of work on both sides.
If tracing using the first-line method above is unsuccessful, then you can consider DNA testing. The recommended approach is to test with Ancestry, and then upload a copy of the results to MyHeritage, FamilyTreeDNA, LivingDNA and Gedmatch. If this is unsuccessful, you should also test with 23andMe. If this is still unsuccessful, then it becomes a waiting game. You are hoping that some time soon your child or one of their children will do a DNA test and pop up in one of the databases as your closest match.
When they do, the connection may be instantaneous and things may move very quickly indeed so be prepared and have a letter written that you can post or email to your child.
For most people, reconnection is an emotional rollercoaster. It is best to have professional help on hand in case you need it. Take things slowly. You will need time to process your feelings. Be kind to yourself and others.
In the past year, Law Enforcement Matching (LEM)* has resulted in the identification of suspects in at least 50 cases of violent crime in the US. The power of genetic genealogy techniques to solve these crimes is truly amazing and is a real game-changer for law enforcement (LE), not just in cold cases but in active cases where the perpetrator is still at large and may offend again. [2,3]
But against this are the growing concerns about infringement of civil liberties, in particular lack of informed consent and intrusive police procedures which may unfairly target innocent people. These are legitimate concerns and need to be addressed. [4,5,6]
Previously I have suggested a very cautious, conservative approach to law enforcement use of the genetic genealogy databases.  In this post, I'll take a broader look at the overall Benefit Risk Ratio of Law Enforcement Matching and explore how this can be further improved by appropriate Risk Minimisation.
The potential Benefits of Law Enforcement Matching are many:
It helps solve "cold cases" of violent crime where the victim has been killed or raped or both. This can bring closure to the families involved and save a huge amount of time and money for LE, thus allowing limited police resources to be used more efficiently.
It can help solve "active investigations" where the rapist or killer is still at large, and thus helps remove violent criminals from the streets, potentially preventing further violent crime and loss of life. [2,3]
It is the certainty of being caught rather than the severity of the penalty that stops criminals from committing crimes in the first place. The advent of LEM has created the possibility that many criminals will now think twice before committing a crime because of the very high risk of being caught using LEM.
By reducing the risk from active violent criminals and by serving as a deterrent, it makes the society we live in a safer place.
These are the Benefits. What are the Risks?
LEM is being undertaken without the express Informed Consent of a sizeable number of people within the genetic genealogy databases (possibly the majority). This is because some of them are deceased, some people have not read the revised Terms, and some kits are managed by other people who are making the decisions for them. And there are probably other reasons also.
The recent Utah case [2,3] illustrated that Gedmatch was in breach of its own Terms. [4,5] Gedmatch argued that this action was justified, given the violent circumstances of the case concerned. Others have argued that this is a slippery slope  and that soon non-violent crimes will be the target for LEM. This raises fears of inappropriate and intrusive police action and an increased risk of inappropriate or wrongful targeting, and even wrongful conviction.
Some people have moral objections to the death penalty and would not like to see their DNA being used to identify criminals whose punishment would be death.
If LEM is not carried out with appropriate oversight and safeguards, there is a risk that it will be "shut down", thus depriving society of a very powerful tool for crime detection and prevention, and denying future generations the potential benefits that it may provide. Update: as of Sunday 19th May 2019 (the day after this post was written), everyone in the Gedmatch database has been automatically opted-out of LEM. A process will be instituted in the next week or so to allow people to actively opt back in.
LEM does not guarantee anonymity of the "passive genetic informants". Brandy Jennings will always be known as the woman who helped convict her cousin. Her name is out there. Forever. This puts her at risk of revenge attacks, unwanted media intrusion and public scrutiny (like this blog post).
If LEM is to survive and thrive, there needs to be a process of Risk Minimisation in order to optimise the Benefit Risk Ratio. So how do we minimise each of the Risks identified above?
Let's take a look at Informed Consent
The first important point to make is that the requirement for Informed Consent is not absolute. In medicine, if someone arrives unconscious at the Emergency Department and needs an urgent blood transfusion to save their life, the doctor can order the transfusion without the patient's consent. And in most circumstances the patient will thank them for saving their life. But there are exceptions - if the patient later turns out to be a Jehovah's Witness then they may be very upset by this course of action and may attempt to sue the hospital. However, if it can be shown that the doctor "acted in good faith" and provided a reasonable standard of care, then the doctor should be off the hook.
So this raises the question: is the requirement for Informed Consent absolutely essential in the situation where a murderer may kill someone again unless they are caught quickly? My gut feeling in this case is: go ahead and catch the murderer. The requirement for Informed Consent is not absolute in this case. Not to proceed in this fashion risks another murder ... and how would you explain to the victim's family that you did not catch the killer when you could have, because of concerns over Informed Consent? Would they buy your explanation? Would they agree that "Mum had no option but to lose her life because Informed Consent wasn't obtained?" I don't think they would agree with that line of logic. Rather they would say: you should have found some way around it. You should have made it happen. You should have found some solution.
And that is where we now stand: how do we find a solution to this issue? Who determines if the Right to Informed Consent is absolute under these circumstances?
The BFEG cautioned against using this approach in the UK. Aside from the issues of incompatibility of testing carried out in an unaccredited environment, the ethical issues of using DNA profiles provided for genealogy purposes were considerable.
Whilst this statement clearly recognises the ethical issues involved, it does not suggest any means by which they might be resolved. One also wonders to what extent the BFEG were sufficiently informed about the way that LEM operates in practice such that a comprehensive evaluation of the Benefit Risk Ratio could be made. There is also the consideration that LEM may not be as urgently required in the UK (for example) as it might be in other jurisdictions due to the relatively lower rate of violent crimes and the relatively greater potency of the national forensic database.
The second risk described above is the risk of inappropriate or wrongful targeting. Gedmatch breached their Terms of Service. And they did so for justifiable reasons. And many people will agree with those reasons and the subsequent actions, and many will not. The remedy is quite simple: change the Terms, apologise to the customers, move forward. Most people will be happy with that. Some won't.
However, it does raise some very important questions. First off: who is the Gatekeeper?
Who decides whether or not LE can use the database? Should it be a single person (as in the case of Gedmatch)? Or should it be a group of people, a committee perhaps? But then who decides who sits on such a Committee? Should I be on the Committee? Should I make decisions for what the FBI can and cannot do? Do I have the requisite skills and experience? Who does?
In the UK, should the BFEG decide on which cases can make use of the commercial DTC databases and which cannot? Do they have the requisite skills and experience?
In the US, does the FBI have a Committee that decides what cases can and cannot be progressed? Does Parabon? Does Bode? Are the terms under which such committees operate transparent? If not, how do we know if they are reasonable? And one final question: are the workings of such committees overseen by an appropriate authority?
Who gatekeeps the Gatekeeper?
Thus there is a need for a clearly defined process with appropriate stops and checks. And there is a need for transparency so that public confidence can be attained and legitimate fears and concerns can be minimised. I would feel a lot safer if there were processes in place that would help minimise the risk of inappropriate use of the databases. It would also take the burden off the CEOs of Gedmatch and FTDNA (and the other commercial companies).
Which brings us to the risk of inappropriate or wrongful targeting. This has always been a problem with police forces everywhere. Often they get it right, but often they get it wrong. And sometimes they plant evidence to get the conviction they desire. We've all seen it in the movies.
There have been several cases where genetic genealogy has been used to target the wrong person. People often say: DNA doesn't lie. But what these cases have shown is that it can certainly be misread, misinterpreted, misunderstood and misconstrued. It can lead you on a wild goose chase, barking up the wrong tree. And that can cause harm. Michael Usry suffered the distress of unnecessary police intrusion, anxiety while waiting for the DNA results that could potentially convict him, and potential damage to his career and reputation despite being exonerated. His name is out there. Forever.
Here's another consideration: if you are a police officer, and you are certain that someone has committed a crime but there isn't enough evidence to convict him, just plant some of his DNA at the crime scene. Or at any appropriate crime scene for that matter. Frame him with his own DNA. Put him behind bars, where he deserves to be. The new technology allows this.  It may even have been done in the past when only standard forensic DNA tests were available. Has a movie been made about that?
The Innocence Project has had 350 exonerations to date - 20 of them were on death row. So clearly there is a risk of wrongful conviction and the death penalty. It may be a small risk, but IT IS THERE. And it needs to be minimised.
But is this more a problem of the criminal justice system than the actual use of LEM itself? Yes, it is. LEM doesn't kill people, people kill people. But the context within which LEM is applied needs to be taken into consideration. In those jurisdictions where the death penalty is not enforced, LEM will not result in deaths due to wrongful conviction (unless the innocent convict is killed in prison). But in those jurisdictions where the death penalty is enforced, then there is a definite risk of wrongful death. How can such a risk be minimised?
Furthermore, we need to think beyond the English-speaking world. Most commercial DNA tests have been done by people in the US, followed by smaller percentages in the UK, Ireland, Canada, Australia and New Zealand. LEM is more likely to be successful in the US, less likely in the other English-speaking countries, and much less likely anywhere else ... with some rare exceptions. Sweden, for example, has had a large proportion of its population tested. Iceland has had practically the entire population tested. Kuwait tried to test the entire population but their efforts were overturned by their Supreme Court. China has surreptitiously tested 50 million people (i.e. no informed consent) and has used the data to send members of the Uighur minority to "Re-education Camps".
To what extent are we responsible for this?
Who is looking after the planet?
The last Risk identified above is the loss of anonymity for "passive genetic informants". The Brandy Jennings case was cited. Her name appears on a Search Warrant that was obtained by the Press under Freedom of Information legislation and was made public. This exposes her to revenge attacks as well as unwanted Press intrusion and public scrutiny (like this blog post).
The Take Home Message is: by allowing your DNA to be used for LEM, you risk being named in the newspaper and on TV. Does that give you pause for thought? It does me.
This is definitely a risk and it definitely needs to be minimised. But is this more an issue relating to the nature and culture of how the Press reports than the actual use of LEM itself? Yes, it is. And again, the context within which LEM is practised is all important. What is a risk in one country may not be a risk in another. And this emphasises the need for a global perspective in relation to LEM and the context in which it is practised.
How could this Risk of loss of anonymity be minimised? Well, it would have been nice if the names of the "passive genetic informants" had been redacted. Should someone have done this? Who is responsible for safeguarding the anonymity of the "passive genetic informant"? Does anyone know? Someone should change the Policies and Procedures of the FBI and other relevant LE authorities such that safeguards are put in place to minimise the risk of loss of anonymity and such that Privacy for those in the commercial database can be optimised.
It is only by minimising the risks associated with LEM that the Benefit Risk Ratio for its continued use remains optimal. And this helps safeguard the future viability of this incredibly powerful tool that holds the promise of making society a safer place for this generation and the ones to follow.
* LEM, Law Enforcement Matching refers to the use of the genetic genealogy databases (e.g. Gedmatch, FTDNA) by law enforcement officials to help identify perpetrators of crimes, usually "cold cases" involving rape or murder.
Recently I was privileged to be invited to join FamilyTreeDNA's Citizen Panel to advise on steps to meet the privacy requirements of FTDNA's members while at the same time allowing the FTDNA database to be of service to the wider community.
FTDNA have long been leaders in the field of genetic genealogy - they were the first company to provide DNA tests aimed specifically at the genealogy community and remain the only company to provide their customers with an infrastructure for running their own DNA projects. In fact, it can be argued that without FTDNA there would have been no genetic genealogy - I certainly owe them a debt of gratitude for fostering my own emergence as a genetic genealogist. This active promotion of Citizen Science has resulted in great advances in the field of genetics, such as the ongoing characterisation of the Tree of Mankind (Y-Haplotree) and the Tree of Womankind (mitochondrial Haplotree). They were also the first company to introduce a chromosome browser and many other tools to help with the interpretation of our autosomal DNA results. They have also actively supported the community through sponsorship of scientific meetings and conferences, such as Genetic Genealogy Ireland and the DNA Lectures at Who Do You Think You Are - Live!
So it was an honour to be part of the Citizen's Panel and to help contribute to the continued leadership of this great company.
The use of Genetic Genealogy Techniques by law enforcement is just the latest in the potential applications of these techniques. We as a community have been using these same techniques for many years to help adoptees connect with their birth families, and the use by law enforcement is a further natural extension of the methodology. It also has potential applications in any mass grave situation and in the future we may see its increasing use in such circumstances (e.g. to help identify soldiers who have been killed in the field of battle, to identify victims of natural disasters, such as the California Wild Fires, to identify the children buried at the former Tuam Children's Home, etc). And the availability of public, crowd-sourced databases to help achieve these important objectives will help increase the likelihood of successful identification and positive outcomes. Recent surveys have demonstrated broad public support for the use of public DNA databases to achieve these aims, but have hinted that additional regulation may be necessary.
FTDNA are to be congratulated for their continuing leadership in this regard. They are the first of the commercial companies to recognise the power of crowd-sourced databases to achieve the Greater Good. Their revised Terms of Service and Privacy Statement address a lot of the concerns that have been raised in the ongoing debate about law enforcement access to public DNA databases and they should be commended for this latest revision. No doubt as the debate continues, and different perspectives are aired, the need to revise and refine the approach to privacy and consent will change and the Terms will evolve accordingly. This is only natural. Privacy, Consent & Data Protection are not static topics. They never were. They are ever-evolving and will continue to evolve over the course of time.
So well done to FTDNA on taking the lead in addressing this issue head on and advancing the cause of the Greater Good. Hopefully, as the debate continues, additional safeguards will be identified and introduced such that any potential risks associated with the process of Law Enforcement Matching will be effectively neutralised.
Being part of the Citizen's Panel was of enormous benefit to me personally. It afforded me the opportunity to review all the many blog posts and Facebook comments that have been exchanged over the past year or so since the prime suspect in the Golden State Killer case was identified in April 2018. The advice I provided was based on my assessment and interpretation of the various perspectives and concerns aired in this ongoing debate. I hope I have captured all of them. In addition, I also have to thank my colleagues here in the UK and Ireland for our extremely fruitful ongoing discussions, partially arising out of GDPR, and many of my recommendations are based on these interchanges. In particular, I would like to thank Debbie Kennett, James Irvine, John Cleary, Donna Rutherford and Michelle Leonard whose sage advice and measured commentary have helped form my own opinions.
I found that the recommendations arising from my review incorporated a useful summary of the key issues that we as a community (and as a society) currently face. As such, I think that many people would find this very helpful in educating themselves about the issues involved and formulating their own opinions. As this is merely a summary of issues that have already been aired publicly, and as there was no requirement for a Non-Disclosure Agreement, I have appended my analysis and recommendations in their entirety below (this was an email that I sent on Feb 25th). I also believe that doing so is important as it helps promote the transparency of the Citizen's Panel (which ideally should reflect the broad range of views held by the customer base). I hope people find the advice informative (there are hyperlinks within the text) and that it is a useful contribution to the ongoing debate.
We are in exciting and uncharted territory. We are living in interesting times. The decisions we take today may have huge implications for privacy, consent, data protection, and the Greater Good. The debate is not over and will continue well into the foreseeable future. But it is very encouraging to see that FTDNA took many of my suggestions on board for their revised Terms of Service and no doubt this will be only one of many future revisions of their Terms over the coming years.
Hopefully other companies will follow suit as the situation evolves. People want to contribute to the Greater Good and there is a moral imperative to facilitate that happening. The devil is in the detail - we need to identify all potential risks and introduce sufficient (and not overly-restrictive) safeguards to minimise them. FTDNA's revised Terms of Service are a step in the right direction.
Disclosure
FTDNA have kindly sponsored the Genetic Genealogy Ireland conference that I organise each year in Dublin & Belfast. I am very grateful for this sponsorship. They have occasionally paid part of my travel and accommodation expenses at these events.
My advice to FamilyTreeDNA as a member of the Citizen's Panel:
Feb 25th, 2019
Dear Bennett and Max
Thank you for inviting me to be part of the Citizen’s Panel. It is an honour and a privilege and I am very grateful indeed.
Let me start by saying that if it wasn’t for you both, I would not be the citizen scientist that I am today. None of us would. Without FamilyTreeDNA’s vision and the creation of an infrastructure that allows ordinary citizens to run their own DNA Projects, the genetic genealogy community as we know it today, would never have emerged. And therefore, I am acutely aware of the debt of gratitude that we owe to FTDNA as a company, to all its employees, and to the both of you in particular.
With that in mind, what follows comes from a place of deep respect for you both and I hope my honest and direct assessment serves as a useful addition to the ongoing conversation. Please feel free to pass these comments on to your legal team to help them in their exploration of the various international legal ramifications, and also to your PR consultants to help them in their efforts at damage control. My current thoughts have formed gradually over the past few months (having read the many posts and comments and blogs relating to this issue) and are likely to evolve further as the situation unfolds.
Ever since the news that the FBI were making use of the FTDNA database, I have struggled with the two default options before us for a database that allows LE (Law Enforcement) access:
a default opt in database, from which customers can opt out
a default opt out database, into which customers can opt in
1. The current situation: default “opt in”, optional “opt out” of all matching
The current situation is a default opt in database from which customers can opt out. But doing so means opting out from all matching, which for many customers was the main reason for joining the database in the first place. Some may claim that their consumer rights have been infringed by this move and may have a legitimate case for compensation. Not only might this impose a financial strain on the company, but it would be extremely bad press.
2. The new proposal: default “opt in”, optional “opt out” of LE matching
The new proposal to have a separate “opt out” option such that "Users can opt-out of Law Enforcement Matching at any time, while retaining the ability to see all of their matches” is a step toward remedying the current situation and no doubt will satisfy a lot of your customer base. But there are several major risks associated with this approach that could substantially damage the business:
It will be easy to apply the revised consent process to new customers, but much more difficult to apply it to existing customers. Emails could be sent out to all customers telling them they can opt out if they want to, but many customers do not read their emails and others do not bother replying. Lack of objection to the default “opt in” cannot be interpreted as express or explicit consent. FTDNA could lock people out of their accounts until such time as they had acknowledged they are happy being opted in automatically, but a lot of people haven’t accessed their account for years so this too is not a foolproof method of confirming that people are consenting to the default opt in.
In addition, dead people will obviously not be able to re-consent, and many have not appointed beneficiaries … so do dead people have rights in this regard? Do their families? It is important that FTDNA does not appear to walk over the (perceived) rights of dead people. And in addition, this will be a particularly sensitive issue for some people with indigenous status both within the US and outside (such as the Havasupai tribe).
Many Users manage kits for other people - there is no guarantee that they will consult with those people and therefore there is a real risk that some customers will be opted in for something they did not consent to. This is a major flaw in the proposed new system and FTDNA will be heavily criticised for it.
The FBI only have jurisdiction in the US. They don’t have jurisdiction in Europe, the Middle East, Australia, etc. So all customers falling outside of the FBI’s jurisdiction should automatically be opted out of the "LE-only" database.
There is a convincing argument that access to matches' personal data (e.g. names, email addresses, matching segment data) by LE is beyond the intention for which the database was set up and requires separate optional “opt in” consent in a similar way to consent for scientific research (see the dedicated consent processes at Ancestry & MyHeritage).
This specific point is made in the Future of Privacy Forum’s Best Practice Guidelines (see section IIb on page 4). LE access clearly falls under the “incompatible secondary use” category and this would therefore require "separate express consent". (Incidentally, the fact that FTDNA has been expelled from the forum raises serious concerns in people’s minds and FTDNA will be branded in the media as "the company that does not follow Best Practice Guidelines”.)
Under GDPR, there is a specific requirement to collect “freely given, specific, informed and unambiguous consent” from customers before sending them marketing emails (Recital 32). The same GDPR requirements also apply when allowing LE to access the personal data (name, email, family tree) of any matches that any of the kits uploaded by LE may have. Consent must be explicitly “opt in” and cannot be “opt in” by default. This is covered in the section on consent in the Guide to GDPR and falls under Part 3 of the UK’s Data Protection Act 2018. Your legal team should offer specific advice not just on the GDPR requirements in this regard, but also the requirements of the DPA 2018. Further specific information on the use of personal data by LE is available from the Information Commissioner’s Office.
in the UK, the Information Commissioner's Office (ICO) is particularly sensitised to LE use of personal data following a recent investigation into the UK Police’s use of a “Gang Matrix” (consisting of suspected gang members) which was shared by the police with several different government organisations. The ICO found this to be in breach of GDPR and an Enforcement Notice was instituted against the police. If a company (such as FTDNA) were to be perceived as doing something similar, a hefty fine (of up to 20 million euro or 4% of company annual turnover) might be levied as well as an Enforcement Order. The largest fine to date is 50 million euro (against Google last month).
From the discussions on Facebook, it would appear that at least one person has instituted a GDPR complaint (there may be others). There is also talk of a class action lawsuit. Furthermore, there are dedicated groups whose sole objective is to aggressively fight against perceived breaches of privacy and “forced consent”. NOYB is one such group and they have brought successful GDPR actions against Google and Facebook … so there is a real risk that they could take similar action against FTDNA, particularly if alerted by an aggrieved customer or a competitor. Any such legal activity will tie up FTDNA in terms of time, money & resources, not to mention the damage to its public image and the opportunity cost resulting from the consequent loss of business. Thus such possible consequences are to be avoided at all costs.
FTDNA is in danger of losing its EU/US Privacy Shield status by converting a genealogy database into an LE database. One of the basic principles of the Privacy Shield is data integrity and purpose limitation. The revocation of the Privacy Shield is likely to hit European recruitment hard.
FTDNA relies greatly on the support of volunteer project administrators to promote the company both online and offline at various genealogy events. Those admins who disagree with the proposed opt out policy are likely to become disillusioned and withdraw their support for the company or post damaging negative comments which could impact on the company’s sales and reputation.
For these reasons the optional "opt out” system will not work. It has to be changed to an optional “opt in” with “opt out” being the default position. This move is likely to severely compromise the ability of the “LE-only” database to catch killers & rapists (something we all want to do), but we cannot set up a database for US law enforcement that is in breach of international data protection laws even if the benefits for the greater good are plainly evident to all. In fact, if the "LE-only” database is built in the wrong way, with undue haste and lack of forethought, the public will lose trust in the process and ultimately more harm than good will be done by this precipitous action.
And FTDNA’s public image will suffer hugely. Despite the best intentions of FTDNA, it will be seen as the company that ultimately destroyed the possibility of a voluntary database that helps LE catch killers & rapists.
3. The alternative solution: default “opt out”, optional “opt in” to LE matching
If FTDNA copied the same process introduced by Gedmatch, this would be a significant advance. Consent is explicitly obtained from all new Users to “opt in” to a database that is clearly described as allowing LE access. Gedmatch has a second option for their Users, namely that those who choose to can additionally “opt out” of having LE (or anyone else for that matter) see their kit (the “Research kit only” option). Thus there is an initial informed consent obtained from each User followed by an "escape route" should they so desire. This two-step process goes a long way toward reassuring customers and building trust in the system.
And this 2-step process could also be introduced by FTDNA. Copying the Gedmatch approach would allay a lot of fears and help restore public confidence in FTDNA. It would also potentially allow FTDNA to collaborate with Gedmatch on resolving the exact same legal issues.
This optional “opt in” LE-only database will take a lot longer to build than a default “opt in” database, but it will be more robust and less vulnerable to attack, thus helping to ensure its survival and making it more likely that it will achieve its goals of catching violent criminals and bringing closure to victims’ families.
However, even with the alternative default “opt out” / optional “opt in” LE-only database, there remain several very significant problems:
the ongoing legal action by Maryland (and potentially other states) arguing that LE access is a breach of the 4th Amendment. The publicity of the case may be even more damaging to FTDNA (and Gedmatch) than any eventual legal decision.
the inherent vulnerability of the database to exploitation by undesirable forces (see below)
4. Vulnerability of the database
Even if a separate optional "opt in" database is created for LE use, what is to stop them from continuing to use the general database surreptitiously, in the same way the FBI were using it before FTDNA discovered them? Conceivably, the FBI (or any LE agency) could say that they will comply with the revised Terms of Service but thereafter could simply upload DNA profiles “undercover”, just like they did previously. FTDNA might not be any the wiser of this surreptitious activity. And some customers would have their personal data (name, email, etc) exposed to the FBI if any of them were a match to the undercover FBI kits.
So this scenario raises several questions:
how can FTDNA monitor the database to ensure that any such undercover kits are either prevented from being uploaded, or are quickly identified and removed?
what is the penalty for breach of the Terms of Service? Would FTDNA refuse to work with the FBI if it did not observe these Terms?
It doesn’t stop there. Any organisation could potentially gain access to the database as long as they were able to upload somebody’s DNA. The Mafia or organised crime could potentially use it to identify the families of specific individuals, perpetrate revenge attacks, or even disrupt witness protection programmes. I know this is far-fetched but you can imagine the damage to FTDNA’s reputation if it ever came to pass.
But most importantly what this demonstrates is that, in the absence of a method to prevent rogue kits from entering the database, FTDNA will never be able to 100% guarantee the confidentiality of their customers’ personal data. This would be catastrophic both legally (GDPR, etc) and from the perspective of FTDNA’s public image. This is why involving a legal team and a PR consultant is so vital. In addition, the legal team will need to consider implications not just in the US but across a variety of different legal systems across the world.
So how then can FTDNA protect itself against this type of undercover activity? One possible solution is to require that all DNA transfers from other companies have to have a cryptographic signature as proposed by Yaniv Erlich. This would clearly identify where the original DNA results had been generated and “non-permissible" kits could be rejected.
This does not address the possibility of some people trying to create a “fake” or “spoof” DNA sample, although this is more of a problem with saliva-based DNA kits. Nevertheless, in order to sustain a good reputation, FTDNA will need to take (and be very publicly seen to take) the appropriate and proportionate action to protect its customers' data. It will also need to prepare for a possible external audit, either by the relevant US authority or GDPR representative or both.
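To make the signature idea above a little more concrete, here is a minimal sketch of how a receiving company might check that an uploaded raw-data file really came from a recognised testing company and has not been altered. It is an illustration only, not a description of Yaniv Erlich's actual proposal: it uses a shared-secret HMAC from the Python standard library as a stand-in, whereas a real scheme would more likely use public-key signatures. The company names, keys and function names are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical shared secrets, one per testing company (assumption for this sketch).
# A real deployment would more likely use public-key signatures, so that the
# receiving company never holds a key capable of signing.
PROVIDER_KEYS = {
    "company_a": b"demo-secret-a",
    "company_b": b"demo-secret-b",
}


def sign_raw_data(provider: str, raw_data: bytes) -> str:
    """Return the hex tag that the issuing company would attach to a raw-data file."""
    return hmac.new(PROVIDER_KEYS[provider], raw_data, hashlib.sha256).hexdigest()


def is_permissible(provider: str, raw_data: bytes, tag: str) -> bool:
    """Accept a transfer only if the provider is known and the tag matches the file."""
    key = PROVIDER_KEYS.get(provider)
    if key is None:
        return False  # unknown source: reject the kit
    expected = hmac.new(key, raw_data, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)


raw = b"rs123\t1\t1000\tAA\nrs456\t1\t2000\tCT\n"  # made-up file contents
tag = sign_raw_data("company_a", raw)

print(is_permissible("company_a", raw, tag))         # True: genuine, unaltered file
print(is_permissible("company_a", raw + b"x", tag))  # False: file was modified
print(is_permissible("unknown_lab", raw, tag))       # False: source not recognised
```

A check like this only shows where a file was generated. It says nothing about whose DNA is in the file.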
5. Some additional suggestions
You could also add the LE access opt in / opt out feature to the Family Tree Sharing section under the Privacy & Sharing tab. This would allow people to specifically opt out of sharing their family tree with LE. And this..
In April 2018, US police finally identified a murder victim (known as "Buckskin Girl") whose identity had remained a mystery for 37 years. Despite everything that they threw at the case, Forensic Science could not come up with the answer. But Genetic Genealogy did, and in doing so, made history.
The Buckskin Girl case was solved in 4 hours once the kit had been processed by Gedmatch
The excitement generated by the case did not have a chance to die down because two weeks later, the prime suspect in the Golden State Killer case was arrested, causing a media storm. Once again, genetic genealogy techniques had helped identify the most likely candidates for the killer, and (using routine police work) officers had followed these leads and gathered the evidence necessary to bring charges against a single individual.
Since then over 11 people have been arrested and charged with rape or murder, thanks to the application of the new technique by Law Enforcement agencies across the US. This new development seems to be in the process of changing forever the way that law enforcement solves "cold cases".
Some of the suspects arrested over the past 7 months
But in fact, the technique is not that new at all. The genetic genealogy community have been using the powerful marriage of DNA combined with genealogy to solve cases of unknown persons for almost 10 years. This use began shortly after the first DTC (direct to consumer) autosomal DNA tests were introduced in 2007. It is the same technique that has been used to help adoptees & foundlings identify and locate their birth families, to help donor-conceived children find their genetic father, and to solve illegitimacy mysteries in our family trees. The methodology is exactly the same (a person's DNA matches point to specific family trees that are likely to contain that person's ancestors, and consequently, one of the descendants of those ancestors will be the person's birth parent) and the outcomes are similar: a candidate is identified, and further DNA testing helps to confirm, refute or bolster the probability that we have found the right person.
The technique is simple (relatively) but labour intensive - finding the answer can take hundreds or thousands of hours to achieve.
And this is particularly true of Law Enforcement cases because legally they only have access to Gedmatch (database size 1 million) whereas the rest of us are using the combined database power of Ancestry, 23andMe, MyHeritage, FamilyTreeDNA and (most recently) LivingDNA - a total database size of over 20 million people. On a simplistic level, this potentially makes our ability to find an adoptee's family 20 times easier than Law Enforcement finding a killer.
But all these recent developments (and the now regular revelations that yet another serial killer suspect has been arrested) have sparked fierce debate about the use of DNA for a purpose other than what was initially intended. (1) Is it right to use our DNA in this way? Was adequate consent obtained from all the customers on Gedmatch? Is the use of Gedmatch an invasion of privacy? Will innocent people be inadvertently targeted? Have such concerns been overstated? What are the risks involved with this new use and how do we safeguard against them?
In early Oct 2018 a paper was published in the journal PLOS Biology entitled: Should Police have access to genetic genealogy databases? (Guerrini et al, 2018). They conducted a 20-item survey of 1587 people. Participants were aged 18 years or older and were recruited from the general US population. Overall, in relation to violent crime, missing persons, and crimes involving children, most responders felt that law enforcement should be allowed to search genealogical websites that match DNA to relatives (89-91%) and to create fake profiles of individuals on these sites (72-75%). (2)
Most responders were quite liberal in their attitudes to police access (Guerrini et al, 2018)
The authors concluded that there appears to be a general lack of concern among the public regarding police access to their DNA data in cases where the purpose is considered justified. They also make the point that the combined use of DNA & genealogy is "quickly on its way to becoming routine procedure". They also call for "robust input from the public" in any discussions regarding what limits (if any) are to be placed on police access to genetic genealogy databases.
However, this survey was not representative of the general genealogical community. Most participants were under 37 years old (63%) whereas genealogists tend to be older. Also, the majority had not researched relatives on genealogy websites (63%) and had not done a DNA test (88%).
So in order to garner further public opinion, I conducted a survey of the genealogy community - a community which might better understand the processes involved in doing genealogical research and how DNA can be applied to help in that research.
The Survey: Non-genealogical use of DNA to identify unknown persons
The objective of the survey was to assess people's attitudes to the use of their DNA by Law Enforcement agencies. It was conducted between 10th Oct 2018 and 12th Nov 2018 (although most of the recruitment occurred between Oct 10-18). QuestionPro (a platform to create online surveys) was used to collect and analyse the data. The survey was advertised widely on various genealogy groups on Facebook during the period 10-18 Oct 2018. A list of the groups is included in the footnotes. (3)
The following questions were asked:
Are you reasonably comfortable with law enforcement agencies using your DNA data on Gedmatch to help identify serial rapists and serial killers?
Answer choices: yes, no, undecided
In general, would you be comfortable with your DNA being used to help identify other unknown persons? Please check all the items you would feel comfortable with:
John / Jane Doe
Does the use of DNA & Genealogy Combined by law enforcement agencies require additional regulation? Please check one of the following (and add a comment in the Comments section if you wish):
Answer choices: no, yes, not sure
Which country do you live in?
Please select your age group from the drop down menu.
Are you male or female?
Have you been a victim of violent crime?
Have you done a DNA test for genealogical purposes?
At GGI2018 (Genetic Genealogy Ireland 2018, Dublin), near-final results were presented (n=617) at the start of a Panel Discussion on the use of genetic genealogy by law enforcement. The video of the presentation is below ...
Ethical issues & the social application of DNA (Panel Discussion) - YouTube
An interim analysis was conducted on 12th Oct 2018 and at that stage 187 responders had completed the survey. Of these, 41% of responses came from the US, 19% from Canada, 20% from Australia and New Zealand, 17% from the UK and Ireland, and 4% "other". This all changed by the end of the survey with 42% of all responses coming from Sweden!
This late surge in Swedish responders illustrates several important points:
you never know what kind of response you are going to get from Facebook
the genealogy community in Sweden is extremely well organised (I know this from working with them) and clearly is great at recruiting. How this was achieved we are not sure but we know who he is and would like to extend our thanks and gratitude toward him! (You know who you are, Peter)
So, the results were analysed for the group as a whole followed by various subgroup analyses, breaking down the data by demographic features - gender, age, country of residence, etc - to see if there were any major differences between subgroups.
Figure 1: survey overview
Over 3000 people viewed the survey and 767 started to complete it. Of these, 83% completed at least one question of the survey, giving a total of 640 full or partial responses and 127 non-responses (i.e. no questions answered despite "starting" the survey).
The country of origin of the responses is detailed in the diagram above and is based on the 767 total responders. The equivalent numbers are as follows: Sweden 314, US 166, Great Britain 90, Canada 50, Australia 44, Ireland 33, New Zealand 32, Spain 15, Norway 6; 2 each from Netherlands, Costa Rica & Germany; and 1 each from Morocco, South Africa, Switzerland, Denmark, Japan, Portugal, South Korea, Puerto Rico, Finland, Argentina, and the Aland Islands (give yourself a free cupcake if you know where this is). Some of these people were clearly on vacation (or on business trips) because the country where they lived was different to the country where the response came from. Typical genealogists - sneaking in a bit of research when no one is looking!
The analysis below is based on the 640 full or partial responses. Not everyone responded to every question so the numbers in the analyses for each question range from 621 to 640. This explains why the numbers for Q4 (in Figure 2 below) differ slightly (but not substantially) from the numbers in Figure 1 above.
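The headline numbers above can be cross-checked with a few lines of arithmetic. This is just a sanity check on the figures quoted in the text (the variable names are my own):

```python
started = 767    # people who started the survey
responded = 640  # full or partial responses

non_responses = started - responded
completion_rate = round(100 * responded / started)

# Country-of-origin counts as quoted in the text.
country_counts = {
    "Sweden": 314, "US": 166, "Great Britain": 90, "Canada": 50,
    "Australia": 44, "Ireland": 33, "New Zealand": 32, "Spain": 15,
    "Norway": 6, "Netherlands": 2, "Costa Rica": 2, "Germany": 2,
}
single_response_countries = [
    "Morocco", "South Africa", "Switzerland", "Denmark", "Japan", "Portugal",
    "South Korea", "Puerto Rico", "Finland", "Argentina", "Aland Islands",
]
for country in single_response_countries:
    country_counts[country] = 1

print(non_responses)                 # 127
print(completion_rate)               # 83
print(sum(country_counts.values()))  # 767
```

So the 83% figure, the 127 non-responses and the country breakdown are all consistent with 767 people having started the survey.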
Figure 2: Country of current residence (n = 637)
The majority of responses came from Sweden (40%, 252), followed by the US (23%, 144), the UK & Ireland (16%), Australia & New Zealand (11%) and Canada (7%), with other countries making up 4% altogether. Leaving aside Sweden and the "other" countries, the responses from the remaining (predominantly English-speaking) countries were split as follows: US 40%, UK & Ireland 29%, Australia & New Zealand 19% and Canada 12%.
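The second set of percentages in the paragraph above appears to be the first set re-based after leaving out Sweden and the "other" group. Here is a quick check of that reading. It uses the rounded percentages quoted above as inputs (the raw counts for most countries are not given), so a result can differ by a point from a figure worked out from the raw data.

```python
# Overall percentages quoted in the text, excluding Sweden (40%) and "other" (4%).
overall = {
    "US": 23,
    "UK & Ireland": 16,
    "Australia & New Zealand": 11,
    "Canada": 7,
}

subtotal = sum(overall.values())  # 57
rebased = {group: round(100 * pct / subtotal) for group, pct in overall.items()}

for group, pct in rebased.items():
    print(f"{group}: {pct}%")
```

This gives 40%, 28%, 19% and 12%. Three of the four agree with the quoted figures, and the UK & Ireland value is one point lower than the quoted 29%, which is what you would expect from working with rounded inputs.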
So, all in all, the survey had a very international flavour and there was good representation from predominantly English-speaking countries ... and Sweden.
Figure 3: age of responders
The spread of ages across the sample was in keeping with what we already know about genealogists - most are older and most are female (see this survey of 4109 genealogists here - Drake 2001). In fact, only 10% of Drake's sample was under 40 years old. In contrast, 63% of responders in Guerrini's survey were under the age of 37. So there is a large age difference between the two surveys. Will the older age of the responders in this survey result in a more conservative attitude toward police use of their DNA?
There was no substantial difference in the spread of age groups between Sweden, the US and all other countries combined (i.e. all three subgroups had similar percentages for each age group).
Figure 4: gender of responders
Almost two thirds of the responders were female, again in keeping with what we know of the demographics of genealogists (see Drake 2001). In Guerrini's survey, the male-female ratio was 48% to 52%.
There were some differences in the male-female ratio between countries: Sweden 43% vs 57%; US 31% vs 69%; other countries 33% vs 67%. So it would seem that there may be a greater proportion of men practicing genealogy in Sweden than in the US.
Figure 5: victims of violent crime
Ten per cent of responders had been victims of violent crime. And here there was a difference between countries - Sweden 9.6%, other countries 8.3% and the US 15.8%. So there were about 75% more responders who had been victims of violent crime in the US than elsewhere. Would this influence how people responded?
Figure 6: DTC (direct to consumer) DNA testing
A whopping 96% of people had done a DNA test. This was in stark contrast to Guerrini's survey where only 12% of people had taken a DNA test. So it may be that the current sample knew more about DNA testing than the participants in the earlier survey and thus might be better placed to make a judgement about whether or not police should have access to our DNA results.
In the end, however, it made little difference, because the results of this survey were very similar to what was found in Guerrini's earlier survey. Read on ...
Figure 7: attitude to use of Gedmatch by law enforcement
The top line result is that 85% of people were "reasonably comfortable" with the use of their DNA results by law enforcement agencies (for catching serial rapists and killers).
This high level of support was relatively consistent across countries, with the notable exception of Ireland (64%) although the sample here was relatively small (n=25). Nevertheless this does raise an issue specific to Ireland that will be discussed further below.
Figure 8: percentage in favour of police use, by country
There were no substantial differences between men and women (83% vs 86%), between victims of violent crime and non-victims (77% vs 86%), and between those who had DNA tested and those who had not (85% vs 84%).
There may have been a trend in responses across age groups. Fewer people (73%) in the under 40s age group answered positively compared to those in their 70s (92%). Thus, perhaps contrary to expectations, the younger age groups appeared to be more reticent than the older ones.
Figure 9: percentage in favour of police use, by age
These results are largely consistent with Guerrini's findings although the questions asked by both surveys were slightly different. In Guerrini's survey, 91% felt law enforcement should be allowed to search genealogical databases, and 75% felt it was acceptable to create fake profiles for upload to genealogical websites.
So overall, there seems to be broad support for the police use of genealogical databases, and this is independent of gender, age, country of residence, whether or not people have taken a DNA test, and whether or not people have been victims of violent crime.
Figure 10: support for use of DNA to help identify other unknown persons
A similarly high percentage of people were "comfortable" for their DNA to be used in other situations to help identify unknown persons, including adoptees, unidentified human remains, murder victims, and soldiers' remains (90-92%).
Slightly fewer people (76%) were comfortable with the idea of their DNA being used to help identify the fathers of donor-conceived children. There could be several reasons for this:
some people have greater concerns about privacy and anonymity in this particular instance compared to the other situations
some of the responders may have been sperm donors themselves
some of the responders may have had children who did not know that they were donor-conceived
Only 47% of people felt comfortable with their DNA being used to help solve non-violent crimes. And this is in keeping with Guerrini's survey where the percentage was 38-46%.
Of particular interest is the fact that a small but significant number of people did not feel comfortable with their DNA being used for any of the above purposes (3.4%). It may be that these people would never upload their data to Gedmatch or might even delete their DNA results.
Were there differences between the various subgroups? Gender did not substantially influence the percentage of positive responders in each category. In fact, the differences between men and women never exceeded 3.2%.
Responses by country were also broadly similar, with some minor differences. Support for helping adoptees ranged from 85-100%. The least support for helping donor-conceived individuals was in the UK (70%), and highest in Australia (92%). Ireland was the lowest-scoring country overall, with the lowest scores in 4 of the 6 categories (although once again, we need to be cautious about over-interpreting the data given the small sample sizes).
Figure 11: support for use of DNA to help identify other unknown persons, by country
There are some incredible discounts in the current FTDNA Sale which lasts from now until Nov 22nd. And there will probably be a Christmas Sale after that. So now is the time to start thinking about getting that upgrade or that extra kit.
Below are the sale prices and they are the lowest I have ever seen. Y37 for just $99 ... Family Finder for just $49 ... and $100-140 off Big Y upgrades.
This feels more like Crazy Eddie's Second Hand Car Deals!
There are several phenomena encountered in the analysis of Y-DNA STR data that can throw a genetic spanner in the works, and Convergence is one of them!
In genetic genealogy, Convergence occurs when two men have DNA signatures that are exactly or nearly identical, but have evolved that way purely by chance. As a result, the two men will show up in each other's list of matches and will give the false impression that they may be closely related (e.g. within the last several hundred years) when in fact they are much more distantly related (e.g. within the last several thousand years). The problem is we cannot tell that Convergence has occurred simply by looking at the two men's present-day STR results. It is hidden from our view. And the danger is that if the two men think they are closely related, they may start chasing their common connection, thinking that they will find the answer via further documentary research, when in fact there is little hope of that at all. Their "close match" is a red herring. And their pursuit of the Common Ancestor is a wild goose chase.
So what can we do about it? How can we recognise it? How can we avoid it wasting our precious research time?
The concept is occasionally discussed in Facebook groups or on various blogs, but there tends to be quite a lot of confusion around what it actually means. And there are a variety of quite understandable reasons for this.
Firstly, there isn't a standard definition for Convergence, so how it is used varies from person to person. Some people apply it only to exact matches, others apply it to exact and close matches. Moreover, the concept of Convergence is closely tied up with the concept of lack of Divergence. Both are different phenomena, but their effects and consequences are very similar. Another contributing factor is the fact that it is difficult to see it or detect it in practice. We know that it exists, but we have no way of identifying it just by comparing two sets of STR results. In other words, it's largely a hidden phenomenon (like Black Holes). It is only when we do SNP testing that the extent of Convergence becomes apparent. And the problem is that not enough people have done SNP testing.
The good news is that more and more people are doing SNP testing and as they do, the extent of Convergence becomes more apparent. The Lineage II members in the Gleason DNA Project are trailblazers in this regard and we will explore the results of the recent Z255 SNP Pack testing in subsequent blog posts.
But in this post, we will look at an example of Convergence from the Gleason DNA Project in order to illustrate some of the key characteristics and consequences of Convergence. In later posts, we will look at clues that may indicate that Convergence is present, attempt to quantify the number of Back Mutations & Parallel Mutations that occur over time (using the Mutation History Tree that we have previously constructed for Lineage II - the North Tipperary Gleesons), and finally we will attempt to quantify Convergence itself.
But first of all, let's look at some of the aspects of the definition of the term.
A general definition for the term convergence from the Concise Oxford English Dictionary illustrates some general characteristics of convergence that are worth exploring because they are of relevance to how the term is applied in genetic genealogy and to the analysis of Y-DNA STR data in particular:
converge 1. come together from different directions so as eventually to meet
convergent 2. Biology (of unrelated animals and plants) showing a tendency to evolve superficially similar characteristics ...
There are several important aspects to these definitions that we can apply to the analysis of STR data (e.g. your 37 marker data). First of all, the sense that things were initially apart, but then they come together. Secondly, the idea that two things can look the same or similar on the surface, but in fact they have come from very different directions. And thirdly, the idea that two things can evolve from something different into something the same.
Let's look at how this more general concept can be applied to the analysis of Y-STR data.
And a good starting point is the description of Convergence on the ISOGG Wiki:
Convergence (also known as evolutionary convergence) is a term used in genetic genealogy to describe the process whereby two different genetic signatures (usually Y-STR-based haplotypes) have mutated over time to become identical or near identical resulting in an accidental or coincidental match.
One can think of convergence as producing misleading matches – two men appear to be more closely related than they actually are. The same situation may result (very occasionally) if there is an exceptional lack of divergence. In other words, so few mutations occurred in the descendants of a common ancestor over the course of time that the common ancestor may appear to have lived only a few hundred years ago when in fact he lived much further back than that, perhaps several thousand years ago.
So let's pick apart some of the key elements of this definition. You might like to refamiliarise yourself with some basic concepts, such as the different types of DNA markers (STRs and SNPs), and what you are actually seeing when you look at the DNA Results page.
Firstly, the above description of Convergence refers to the genetic signature - the Y-STR haplotype. This is the string of numbers you see associated with your results on the DNA Results page of the project. I like to think of it as if the Y-chromosomes of all the men in the group were stacked up on top of each other, in such a way that the individual markers along the chromosome were all aligned, with one column for each marker. Thus in the diagram below, each of the men has a value of 13 for the first marker. The values for the second marker are a mixture of 23 and 24. And so on.
The Y-STR results for the men of Lineage II
Another key point in the above description is the concept that some markers mutate over time e.g. the number changes from 14 to 15. These mutations are identified by comparing the value in each square to the modal value for the entire group (i.e. the most frequent value among the men in that group). The most frequent values for each of the markers are used to generate the "modal haplotype" which is a virtual signature constructed from these most frequent values (and is represented by the row marked "MODE", the 3rd row from the top in the diagram above).
Mutations are indicated by coloured squares. If the value for any marker is the same as the modal value for that marker (i.e. the most common value among the men in that group), then the square that the value is in will not have a colour. If however, the value is higher than the norm, it will be coloured pink; if it is lower than the norm, it will be coloured purple.
If you and someone else have exactly the same string of numbers, you will have the same coloured squares and the same "no-colour" squares. If you are not exactly identical, you will have some coloured squares that the other person does not have ... and vice versa. In other words, the sequence of numbers, and hence colours, will be different. Each coloured square represents a mutation - a small minor increase or decrease in the number (compared to the norm) for that particular marker, in that particular individual.
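For those who like to see the logic spelled out, the modal haplotype and the colouring rule described above can be sketched in a few lines of Python. The marker values below are made up for illustration (they are not taken from the project), and the function names are my own. Where two values tie for most frequent, this sketch simply takes the one it meets first.

```python
from collections import Counter

# Made-up values for three markers in five men (illustration only).
haplotypes = {
    "man_1": [13, 24, 14],
    "man_2": [13, 24, 15],
    "man_3": [13, 23, 14],
    "man_4": [13, 24, 14],
    "man_5": [13, 23, 13],
}


def modal_haplotype(rows):
    """Return the most frequent value in each marker column."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*rows)]


def colour_row(row, mode):
    """Pink if above the modal value, purple if below, no colour if equal."""
    return [
        "pink" if value > modal else "purple" if value < modal else "none"
        for value, modal in zip(row, mode)
    ]


mode = modal_haplotype(list(haplotypes.values()))
print("MODE ", mode)
for name, row in haplotypes.items():
    print(name, row, colour_row(row, mode))
```

With these values the MODE row is 13, 24, 14, and man_5 ends up with two purple squares because his second and third markers are both below the modal value.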
Convergence in theory
Let's imagine that some distant ancestor living 10,000 years ago gave rise to four distinct lines of descent surviving today (represented by the men A, B, C, and D in the diagram below). Let's look at what happened to their first 37 STR markers over time, and let's assume that mutations only occurred in 5 of these STR markers, as shown in the diagram below. How did the values change over the passage of time, from 10,000 years ago to the present day? And how many of the descendants of this ancestor "match" each other today?
In descendant A, only one of these 5 STR markers mutated. It underwent a single mutation (from 13 to 14) about 6000 years ago, and that was the only mutation over the span of 10,000 years. This is a rather extreme example of "lack of Divergence".
Descendant B had several mutations in his line of descent, but only affecting the first and the fifth markers. These show progressive "forward mutations" away from their original values. With the first marker, the mutations go forward in an upward direction (14,15,16,17) whilst with the fifth marker they go forward in a downward direction (15,14,13,12). This latter may seem counterintuitive but it serves to emphasise that "forward" means "away from" the original value, no matter if it is up numerically or down numerically.
Descendant C also has experienced mutations in only the first and fifth marker. But here we see two examples of a Back Mutation. The first marker shows a forward mutation 6000 years ago (13 becomes 12) but this has gone back to 13 by 4000 years ago. It then undergoes another forward mutation by the time of the present day (13 to 14). Similarly, the fifth marker undergoes a forward mutation (16 to 17) by 4000 years ago but a Back Mutation by 2000 years ago.
Descendant D undergoes mutations on all 5 of his STR markers. A Back Mutation occurs with the second marker between 2000 years ago and the present day (15 to 14); and likewise with the third marker (12 to 13); and likewise with the fifth marker (17 to 16). Two Back Mutations occur with the fourth marker (29 to 30 by 4000 years ago; and 31 to 30 by the present day).
Mutations over time in 4 distinct lines of descendants
Remember, these are four distinct lines of descent, with the MRCA (Most Recent Common Ancestor) represented by the first row of 5 STR markers in the diagram above. So now let's look to see if any of the mutations that occurred in these four individual lines of descent occurred in parallel i.e. the same mutational change occurred in two completely separate lines of descent.
Have a look at the first marker in A, B and C. All three men developed the same mutation on this marker - a change from a value of 13 to 14. In Lines A and B this change occurred in parallel around 6000 years ago. In Line C, the change occurred in parallel around about the present day.
There is a similar parallel mutation between Line C and D. Look at the fifth marker - it increases in value from 16 to 17 around about 6000 years ago in Line D and 4000 years ago in Line C.
And there is a parallel back mutation present in Lines C and D also - the fifth marker switches from 17 to 16 about 2000 years ago in Line C and around about the present day in Line D.
With Back Mutations you are only looking at a single line of descent. With Parallel Mutations we are comparing two or more lines of descent. And we will see that in practice Parallel Mutations are much more common than Back Mutations and have a much greater role to play in the development of Convergence.
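The distinction can be sketched in Python using the first marker of Lines A, B and C as described above. This is only an illustration: the lists hold the sequence of values (oldest first) without dates, a "back" step is taken to mean any step that moves closer to the starting value, and the function names are my own.

```python
# Histories of the first marker in Lines A, B and C (oldest value first).
# Only the sequence of values is modelled, not when each change happened.
histories = {
    "A": [13, 14],
    "B": [13, 14, 15, 16, 17],
    "C": [13, 12, 13, 14],
}

def classify_steps(history):
    """Label each change as 'forward' (away from the first value) or 'back'."""
    original = history[0]
    labels = []
    for before, after in zip(history, history[1:]):
        moved_closer = abs(after - original) < abs(before - original)
        labels.append(((before, after), "back" if moved_closer else "forward"))
    return labels

def parallel_mutations(histories):
    """Return changes (before, after) that occur in two or more lines."""
    seen = {}
    for line, history in histories.items():
        for step in set(zip(history, history[1:])):
            seen.setdefault(step, []).append(line)
    return {step: lines for step, lines in seen.items() if len(lines) > 1}

print(classify_steps(histories["C"]))
print(parallel_mutations(histories))
```

Classifying steps needs only one line of descent, whereas finding Parallel Mutations needs at least two lines to compare, which is the point made above.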
The STR results of people living today tell us nothing about their evolutionary history - it is hidden from view
Which brings us to Convergence itself. Let's look at the Genetic Distance between each of these lines of descent. This helps to make the point that the DNA results from living people are only a snapshot in time. They do not tell us anything about how those STR values have evolved over the past 10,000 years:
A and B have a Genetic Distance (GD) of 7. This is made up of a 3-step difference on the first marker (14 vs 17) and a 4-step difference on the fifth marker (16 vs 12). And as these were the only changes on their first 37 markers, the GD would be written as 7/37. This exceeds FTDNA's threshold for declaring a match (i.e. 4 steps or less over the first 37 markers; written as 0-4/37) and so A and B would not appear in each other's list of matches.
A and C have a GD of zero. They are an exact match. Their GD for the first 37 markers is thus 0/37. They appear in each other's match list and the match looks really close. They think they have a common ancestor in the last few hundred years. They start comparing family trees, looking for the elusive ancestor. They will never find him. This is a wild goose chase. This is the consequence of Convergence.
A and D have a GD of 2 (or 2/37). This GD falls within the threshold for declaring a match. They both appear in the other's match list. They email each other, looking for the common ancestor - another wild goose chase. Another example of Convergence and its consequences.
B and C have a GD of 7/37. No match.
B and D have a GD of 9/37. No match.
C and D have a GD of 2/37. It's a match. It's Convergence. They don't know that. They spend months researching their connection. It's a wild goose chase.
The STR results of people living today tell us nothing about how those STR marker values have evolved over time. They may have come from a relatively recent common source, or they may have come from widely differing directions.
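The Genetic Distances quoted above can be reproduced with a few lines of Python. The present-day values are reconstructed from the description of the four lines; the value of 12 for D's first marker is not stated directly but is inferred from the quoted distances. The calculation is a simple sum of step differences, which is an assumption on my part and may not reflect every detail of how FTDNA counts steps.

```python
from itertools import combinations

# Present-day values of the 5 markers that mutated (the other 32 are identical).
# Reconstructed from the description above; D's first-marker value (12) is
# inferred from the quoted Genetic Distances, not stated directly.
present_day = {
    "A": [14, 14, 13, 30, 16],
    "B": [17, 14, 13, 30, 12],
    "C": [14, 14, 13, 30, 16],
    "D": [12, 14, 13, 30, 16],
}

MATCH_THRESHOLD_37 = 4  # a match is declared at 4 steps or fewer over 37 markers

def genetic_distance(a, b):
    """Sum of step differences across markers (simple stepwise counting)."""
    return sum(abs(x - y) for x, y in zip(a, b))

distances = {
    (p, q): genetic_distance(present_day[p], present_day[q])
    for p, q in combinations(present_day, 2)
}
for (p, q), gd in distances.items():
    verdict = "match" if gd <= MATCH_THRESHOLD_37 else "no match"
    print(f"{p} vs {q}: GD {gd}/37 -> {verdict}")
```

Note that the function only ever sees the present-day values. Nothing in the calculation knows about the 10,000 years of history behind them.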
Below is another way of conceptualising how the numerical value of a single STR marker might evolve over time. This marker started out with a value of 8 for the common ancestor of 4 distinct lines of descent. But by the time of the present day, two lines had a value of 9, one had a value of 13 and one had a value of 5. But the evolutionary history of these 4 lines of descent is peppered with Back Mutations and Parallel Mutations:
Line 2 (red) - 14 becomes 13 some time between 1000 years ago and the present day (0)
Line 4 (purple) - 4 to 5 between 1000 and 0 years ago
Line 3 (green) - 5 to 6, 6 to 7, and 7 to 8 between 7000 (7K) and 4000 (4K) years ago
8 to 9 in Line 2 (10K to 9K), Line 1 (7K to 6K), and Line 3 (2K to 1K)
8 to 7 in Line 3 (10K to 9K) and Line 4 (9K to 8K)
7 to 6 in Line 3 (9K to 8K) and Line 4 (7K to 6K)
6 to 5 in Line 3 (8K to 7K) and Line 4 (4K to 3K)
The evolution of values in a single STR marker over time in 4 descendant lines
of a common ancestor who lived some 10,000 years ago
The consequence of all these Parallel & Back Mutations is that the present day descendants of two of the lines (green Line 3 & blue Line 1) have exactly the same numerical value for this STR marker despite the fact that their evolutionary histories are so different.
This is an example of the evolutionary history for a single STR marker. And if this is representative of all STR markers, then the chances that the values for a particular marker will converge over time are really quite high. But our DNA results usually consist of 37 markers (the standard test most people start with) so what are the chances of the first 37 markers evolving in such a way as to result in convergence of a sufficient number of STR values to cause a coincidental match? ... well, the probability of that happening would be a lot lower. And the probability would be lower still with 67 markers, and lower still with 111 markers. But because so many people have tested (over 600,000 currently), we do see the phenomenon occurring even at higher marker levels (67 and 111).
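To get a feel for why the probability falls as markers are added, here is a deliberately crude Python illustration. It assumes that every marker evolves independently and that each has the same chance p of ending up with the same value in two lines; the figure of 0.5 is purely hypothetical. It also asks for agreement on every marker, which is stricter than the match thresholds, so treat the output as a rough guide to the trend and not as a real estimate.

```python
# p is a purely hypothetical chance that one marker ends up with the same
# value in two lines; markers are assumed to evolve independently.
p = 0.5

def chance_all_markers_agree(p, n_markers):
    """Probability that every one of n independent markers agrees."""
    return p ** n_markers

chances = {n: chance_all_markers_agree(p, n) for n in (1, 37, 67, 111)}
for n, chance in chances.items():
    print(f"{n:>3} markers: {chance:.3g}")
```

Even a very small probability per comparison can produce real cases once hundreds of thousands of people are being compared with each other.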
And in a subsequent post we will look at clues to the presence of Convergence, so that you can look at your own or anyone's list of matches and adjust your suspicion level accordingly.
Convergence in practice
And to illustrate these points, I have temporarily moved one of the ungrouped project members into Lineage II, namely member Jim Treacy (B38804)*. He is third from the end in the diagram below. Don't worry about not being able to read the text (you can click to enlarge the diagram if you like) - just focus on the coloured squares.
The Y-STR results for the men of Lineage II (with a Treacy third from the end)
And Jim has no coloured squares for the first half of the markers. It is only when we reach the 19th marker in the row that he has a pink square with the value 16 inside it - everyone else in that column has a value of 15 for that marker, except for one person who has a value of 14. And as we continue along Jim's row, there are 4 other coloured squares, bringing the total to 5. This can be expressed as a Genetic Distance of 5/37 from the modal haplotype (i.e. the 3rd row from the top, which - to remind you - is a virtual signature constructed from the most frequent values for each of the markers).
Now a GD of 5/37 between two men would mean that they do not appear in each other's list of matches (because FTDNA have set the threshold for "declaring" a match to be 4/37 or less). But among Jim's list of matches at the 37 marker level, there are two members of Lineage II (with a GD of 4/37). And at the 67 marker level, Jim has 6 members of Lineage II among his matches (with a GD of 6 to 7/67). So it looks (on the surface) as though Jim is relatively closely related to our Lineage II group. And this suggests (on the surface) that there may be a common ancestor some time in the past several hundred years, maybe somewhere between 1700-1850 (on the basis of TMRCA calculations based on the TiP Report).
So what do we do next? Do we start looking for documentary evidence? Do we go back to the church records and land records and old newspapers to see if there is mention of a Gleeson-Treacy connection?
We could do. But it would be a wild goose chase. Because the Treacy-Gleeson connection is a red herring. And we know this because we have done SNP testing.
Jim has done the Big Y test, as have 10 of the members of Lineage II. Both Jim and Lineage II members belong to Haplogroup R, and both share some SNP markers in common. Each marker characterises a branching point in the Tree of Mankind and a SNP Progression is a list of these SNP markers down to the finer "more downstream" branches of the Tree. Here are the SNP Progressions for Jim and for the Lineage II Gleesons:
You can see that the branching points are exactly the same ... until marker Z16437. Thereafter, Jim goes down one branch and the Gleesons go down another one. Now, let's be clear: the Gleesons and Jim do share a common ancestor. And if he was around today he would test positive for the SNP marker Z16437. But his children would have evolved along different paths - one path taking us down to our present-day Jim Treacy, the other taking us down to our present-day Gleesons. You can see where Jim and the Gleesons are placed on the Tree of Mankind in the diagram below.
And when did this common ancestor live? YFULL date the formation of Z16437 as 1650 years ago. The two markers downstream of this, A557 (Jim Treacy) and A5631 (Gleeson), both have formation dates of 1400 years ago. So from this we can say that the common ancestor of Treacy & the Gleesons lived somewhere between 1400 and 1650 years ago. Or to give it an actual date (by subtracting from 1950, the approximate birth year for members of Lineage II), sometime between 300 and 550 AD.
This is clearly a lot further back in time than the 1700-1850 AD estimate suggested by the STR data.
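The conversion from "years ago" to a calendar date is simple subtraction, as this small Python check shows. The function name is my own; the ages and the reference year of 1950 are the ones quoted above.

```python
REFERENCE_YEAR = 1950  # approximate birth year for members of Lineage II

def years_ago_to_ad(years_ago, reference_year=REFERENCE_YEAR):
    """Convert an age in years before the reference year to a calendar year AD."""
    return reference_year - years_ago

formation_ages = {"Z16437": 1650, "A557": 1400, "A5631": 1400}
formation_years = {snp: years_ago_to_ad(age) for snp, age in formation_ages.items()}
print(formation_years)
```

So 1650 years ago corresponds to about 300 AD, and 1400 years ago to about 550 AD.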
So this is a great example of Convergence. By chance, Jim's STR signature has evolved over time to approximate that of the Gleesons of Lineage II and as a result, he looks a lot more closely related to the group than he actually is.
* a big thank you to Jim for allowing me to use his name and his results in this example
Gleesons to the left, Treacys to the right, & about 1500 years in between
In a recent post I explored the concept of Convergence and made the point that the mechanism by which Convergence arises is via a combination of Parallel Mutations and Back Mutations in the STR marker values. These mutations are changes that occurred at some time in the past but because they remain hidden to us in the present, we cannot tell when they occurred or how frequently they occurred just by looking at two sets of STR results from people living today.
However, there is a way around this problem. Or at least a partial solution.
By using a combination of STR data and SNP data we can build a Mutation History Tree that is a more accurate representation of the branching structure of the "family tree" for a specific genetic group. And this type of tree allows us to more easily (and more accurately) spot Back Mutations and Parallel Mutations.
I did this for one particular genetic family in one of my surname projects - the North Tipperary Gleesons (Lineage II of the Gleason DNA Project). This tree is a "best fit" tree, by which I mean a tree constructed in such a way as to explain the STR & SNP data in the most parsimonious way i.e. with the fewest number of branches that will accommodate or "fit" the data. This approach is also called the "maximum parsimony" approach and is often used when building cladograms or phylogenetic trees. The Mutation History Tree (MHT) is simply another type of cladogram. You can read about the process of how the tree was developed in this blog post here and subsequent posts.
But a key point here is that this "best fit" tree is likely to change as more data becomes available. And to illustrate this point, I'm going to compare the current version of the tree (Dec 2016) with the next version that is being prepared following the recent availability of new data from 12 sets of Z255 SNP Pack results.
Below is the current version of the MHT for Lineage II. By comparing each mutation in the tree with every other one, we can identify which mutations are Back Mutations (occurring on a single line of descent) and which are Parallel Mutations (occurring on two or more lines of descent). I have highlighted the Back Mutations in yellow and the Parallel Mutations in green.
Back Mutations in yellow, Parallel Mutations in green from Gleeson Lineage II MHT (version Dec 2016)
Parallel Mutations occur in the following lines of descent:
CDYb 40-39 ... A, E, D, F (4 times)
CDYa 39-38 ... A, B, C, F (4 times)
464c 17-16 ... A x2, D (3 times)
461 12-11 ... A, B (2 times)
576 18-19 ... A, D (2 times)
390 23-24 ... A, B, C (3 times)
390 24-23 ... B, C (2 times)
456 16-15 ... B, D (2 times)
and so on ...
Back Mutations are more difficult to count, and to conceptualise. Whether you consider the value as mutating forward or back is entirely dependent on your reference point. If our anchor is the upstream Z255 branch, then the original value of marker 390 (for example) is 24, mutating (forward) to 23 on the Z16438 branch, and then back to 24 (in parallel) on Branches A, B & C, and then back to 23 (again in parallel) on Branches B & C. So there are several points to make here:
this is in fact a Back Mutation that occurs in parallel in 3 separate lines of descent. It is thus both a Back Mutation (relative to its earlier value of 24 on the Z255 branch) and a Parallel Mutation, occurring at (presumably) different time points in Branches A, B & C. It is thus coloured yellow and green.
It can also be considered a Triple Mutation relative to the Z255 branch - in the sense that it mutates forward to 23 then back to 24, then back to 23 again. But what happens if it flips forward and back 5 times? What would we call that? And what do we call it if it goes two steps forward and one step back? This is where terminology fails us. I'm not sure if there is a standardised way of describing these different kinds of mutation (if there is, please leave a comment below).
the mutation 390 24-23 occurs in Branches B & C ... relative to its value of 24 in the Z255 branch, this could be considered a Parallel Forward Back Forward Mutation ... for Pete's Sake!!
But let's just focus on the Back Mutations that occur downstream of the branch characterised by the STR mutation (710 36-37), just above the A5627 SNP Block. This "710 branch" incorporates all the Gleesons of Lineage II, from Branch A to F.* On this overarching branch for Lineage II, the value of the STR marker 390 is 23 and Back Mutations are as follows:
390 24-23 ... B, C ... this is the only Back Mutation below the "710 branch"
And it is also a Parallel Mutation
All the other yellow Back Mutations are relative to the upstream Z255 branch, and not our downstream "710 branch", and so are not counted in this particular exercise.
So, let's generate some statistics from these numbers:
The total number of mutations below the "710 branch" (irrespective of whether they are forward or back) is 71.
There are 69 Forward Mutations (i.e. away from the original value of the relevant marker on the "710 branch")
31 Forward Mutations show an increase in the number (e.g. 9 to 10)
38 Forward Mutations show a decrease in the number (e.g. 9 to 8)
There are 2 Back Mutations
both Back Mutations show a decrease in the number (i.e. 24 to 23)
There are 26 Parallel Mutations
Forward Mutations outnumber Back Mutations by a ratio of 34.5 : 1
Parallel Mutations outnumber Back Mutations by a ratio of 13 : 1
There are 16 people in this tree, and if we make the big assumption that the "710 branch" starts 1000 years ago (i.e. roughly at the time of the introduction of the Gleeson surname), then over the course of 1000 years, the rate of each type of mutation is (crudely) as follows:
Forward Mutations = 69/16 = 4.3125 mutations per "line of descent" per 1000 years
Back Mutations = 2/16 = 0.125 mutations per "line of descent" per 1000 years
Parallel Mutations = 26/16 = 1.625 mutations per "line of descent" per 1000 years
These are crude estimates but they give some idea of the relative importance of Parallel Mutations compared to Back Mutations. And applying this information to the phenomenon of Convergence, it would seem that Back Mutations play a very minor role compared to Parallel Mutations.
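For transparency, the ratios and rates can be recomputed from the raw counts with a short Python snippet. The variable names are my own; the counts are those listed above.

```python
from fractions import Fraction

# Counts below the "710 branch", taken from the list above.
forward, back, parallel = 69, 2, 26
people = 16  # lines of descent in the tree

forward_to_back = Fraction(forward, back)    # 69/2
parallel_to_back = Fraction(parallel, back)  # 26/2

# Crude rates per "line of descent" per 1000 years.
rates = {
    "forward": forward / people,
    "back": back / people,
    "parallel": parallel / people,
}
print(float(forward_to_back), float(parallel_to_back))
print(rates)
```

Counting only the 69 Forward Mutations gives a Forward to Back ratio of 34.5 : 1 (dividing all 71 mutations by 2 would give 35.5 : 1), and the Parallel to Back ratio comes out at 13 : 1.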
This conjecture is supported by some recent modelling work undertaken by Dave Vance and written up for the L21 Yahoo Discussion Forum. In Dave's simple model, which is an extremely useful basis for further discussion, the "average tree" could expect to have a ratio of Parallel to Back Mutations in the range of 25:1 to 50:1.
This is a lot higher than what I have shown in my MHT for the Lineage II Gleesons, but this can be partly explained by the fact that there are only 16 people in my Gleeson sample, and we are looking at (perhaps) only the last 1000 years. I would predict that the ratio will increase further as 1) I add more people to the sample; and 2) the duration of observation is extended backward from 1000 years ago (the 710 Branch) to 4300 years ago (the Z255 Branch).
In subsequent posts we will see how these calculations stand up when we add in additional data from 12 SNP Pack results and reconfigure the MHT for Gleeson Lineage II into the next version of the "best fit" model. And we will also attempt to quantify the total number of Back & Parallel Mutations below the upstream marker Z255. And lastly, we will attempt to quantify Convergence itself.
* the Big Y results of a 10th member of the group indicate that this branch is characterised by the SNP A5631 although this result is not reflected in this version of the MHT
One of the main tasks of Surname Project administrators is to place new members into the appropriate genetic group within their surname project.
Having run a variety of surname projects for the last few years, I have come up with a set of criteria I use on a routine basis to place newcomers into existing genetic groups and also to identify new genetic groups. I call these criteria Markers of Potential Relatedness (MPRs). And (not surprisingly) these can be thought of as indicators that two people may be "related" to each other, which for the purposes of surname projects means somewhere in the last 1000 years or so. This arbitrary timepoint is chosen because many European surnames were introduced about 1000 years ago (in particular British and Irish surnames), although they only became commonplace several centuries thereafter.
This approach to grouping works best with hereditary surnames (i.e. passed from father to son) but should also work with patronymic (and other) surnames, except that (in these latter cases) criteria 1 and 8 will not apply. The discussion below is very much from the standpoint of hereditary surname projects.
Not all criteria have to be met. But the more criteria that are met, the higher the likelihood of two people being related. This is particularly important in relation to SDSs (Surname or DNA Switches; also known as NPEs, Non-Paternity Events), as it may be difficult to distinguish a match that is an SDS (e.g. adoption, illegitimacy) from one that is due to Convergence.
Below is a list of these criteria and we will consider each one in turn. Some of these Markers of Potential Relatedness (MPRs) have nothing to do with DNA. If two people have the same surname, or the same unusual surname variant, or have a similar ancestral homeland, or even an ancestor with the exact same name, then these can be indicators that the two people are related. And because they don't rely on genetics I simply call them "traditional markers" as opposed to "genetic markers".
MPRs for deciding if two or more people are related within the last 1000 years
In practice, the most useful indicators (or at least the ones I most frequently use) are Markers 1, 2, 6 and 7. And if a new project member is grouped on the basis of these "main" markers, it usually becomes apparent that they meet many of the remaining criteria also.
1. The members have the same surname
This is an obvious criterion, especially for surname projects that deal with hereditary surnames. If two people share the same surname, the next question is: are they related? And it would seem a reasonable supposition that there is a much higher probability that they are related on their direct male lines (within the last 1000 years) if they do share a surname than if they don't.
Problems tend to arise when there is some doubt over what is a valid surname variant and what is not. For example, are Malley and Malloy surname variants? Are Farrell and Farris surname variants? What happens when you get both types of variant testing positive for M222? Do you group them together or keep them apart? Only other MPRs (such as downstream SNP testing) can answer these questions.
2. The Genetic Distance (GD) between two people indicates a (very) close relationship
The threshold for "declaring a match" between two people varies with the number of STR markers tested (see below). These thresholds are arbitrary, but the intention is to get the right balance between false positives and false negatives - in other words, letting the wrong people in and keeping the right people out (known more technically as specificity and sensitivity).
Most people do the Y-DNA-37 test initially and I would usually feel very confident grouping together people with the same surname if their GD was 2/37 or less; and reasonably confident of grouping them together if the GD was 4/37 or less. Except in the instance where there is evidence of Convergence, as indicated (for example) by the terminal SNPs of their matches sitting on a wide variety of distantly related "upstream" branches of the Y-Haplotree (Tree of Mankind). We'll talk about this some more in item 7 below.
In addition, Convergence is a common occurrence in certain subclades, such as M222 and L226. When I see these terminal SNPs in a new project member, alarm bells start ringing, my level of conservatism increases, and I start looking to MPRs other than Genetic Distance to decide if two people belong in the same genetic family.
This technique for grouping people together will miss outliers - people who do indeed belong in the same genetic family but whose ancestors branched away from the main group many many generations ago. For example, in the Gleeson DNA Project, several of the members of Lineage II (all confirmed to be related by Big-Y SNP testing) have a GD of 10/37 compared to other group members, and that would usually preclude them being grouped together.
3. The TiP24 score is >80% compared to the group modal haplotype
I don't use this marker so much anymore but it can be a useful way of assessing if a newcomer belongs in a given genetic family, especially if there is insufficient data regarding SNP markers among their STR matches. The potential benefit of this method is that it takes into account the varying mutation rates of STR markers whereas GD does not.
It involves generating a TiP Report between a new project member and the member closest to the modal haplotype for a given genetic family within the project, and then looking at the percentage probability of being related within 24 generations. We call this the TiP24 Score (for lack of a better term). If this is >80% (an arbitrary figure, which can be adjusted to suit your personal preference), then the newcomer can be considered to be "likely to be related" and therefore placed in that specific genetic family.
It is important to note that the use of the TiP24 Score is not an attempt to date when two people are related, merely to ascertain if two people are likely to be related. The TiP24 Score is simply an attempt to standardise GD comparisons, given that we know that a GD of (say) 4/37 on slow-mutating markers is much more significant than a GD of 4/37 on fast-mutating markers. The former (probably) indicates a much more distant relationship than the latter.
This technique works best for those related within the last several hundred years, but will miss outliers. I have several people in the Gleeson DNA Project (confirmed to be related via SNP testing) whose TiP24 Score with other members is as low as 1%.
Also, the TiP24 Score is likely to be tripped up by Convergence (in the same way that GD is) and is therefore of limited utility in such circumstances.
4. There is a clear Genetic Distance Demarcation between project members within a genetic cluster & project members outside it
Administrators have access to a tool called the "Y-DNA Genetic Distance" tool. This permits comparisons between any person in the project and every other person in the project. Often, there will be a clear demarcation between a newcomer's range of GDs to a particular genetic family and all other genetic families within the project.
In the example below, the newcomer matches 9 members of R1b-Genetic Family 2 with a GD ranging from 4/67 to 9/67. Thereafter, the GD jumps to 16/67 and higher. This stark demarcation in GD suggests strongly that the newcomer falls within R1b-Genetic Family 2.
This also suggests that Convergence is unlikely to be an issue here (otherwise we might expect to see a more gradual increase in GD values, rather than the jump from 9 to 16 that we see here).
This technique works best with 111 or 67 marker comparisons. Demarcations are much less obvious using 37 marker comparisons.
The GD between the newcomer & other members shows a clear demarcation between one particular genetic family and all others
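Spotting the demarcation by eye is easy enough, but the idea can also be sketched in Python. The list of GDs below is hypothetical, loosely modelled on the example above (nine values from 4 to 9, then 16 and higher), and the function name is my own.

```python
# Hypothetical GDs (at 67 markers) from a newcomer to other project members,
# loosely modelled on the example above: nine values from 4 to 9, then 16+.
gds = [4, 5, 5, 6, 7, 7, 8, 9, 9, 16, 17, 19, 22]

def largest_gap(values):
    """Return (gap, last value below the gap, first value above it)."""
    ordered = sorted(values)
    return max(
        (upper - lower, lower, upper)
        for lower, upper in zip(ordered, ordered[1:])
    )

gap, below, above = largest_gap(gds)
cluster = [gd for gd in gds if gd <= below]
print(f"Largest jump: {below} -> {above} (gap of {gap})")
print(f"{len(cluster)} members fall below the gap")
```

A single large jump suggests a clean cluster. A list of GDs that creeps up one step at a time, with no obvious jump, is the pattern that would make me more wary of Convergence.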
5. Presence of Rare Marker Values or a Relatively Unique STR Signature among genetic group members
The idea here is that if one or more people share a Rare Marker Value, then it stands to reason that they are more likely to be related to each other, especially if they all share the same surname.
Leo Little's spreadsheet of STR marker value frequencies is very useful for identifying those values which are particularly rare, even though the spreadsheet only covers seven of the main haplogroups (E3a, E3b, G, I, J2, R1a, R1b). What constitutes "rare" is a moveable feast but a frequency less than 5% would not be unreasonable.
Usually these rare marker values emerge after several people have been grouped together. Any newcomers thereafter who share this rare marker value can be further assessed for membership of the specific genetic family wherein the rare marker value occurs. A famous example is Group B of the Wheaton Surname Project where 3 "rare" marker values occur within the first 12 markers (with incidences of 5%, 1% & 8% in the "general" R1b population). The chances of these occurring within the general population are 1 in 62,000. And therefore, any Wheaton who matches these 3 STR marker values can be automatically allocated to Group B (with 99.99% confidence). And they only need a 12-marker test to do so.
Leo Little's spreadsheet of marker value frequencies
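The reasoning behind the Wheaton example is that rare values multiply. Here is a minimal Python sketch, assuming the three marker values occur independently of each other (an assumption on my part) and using the rounded percentages quoted above.

```python
from fractions import Fraction

# Frequencies of the three rare values quoted for Wheaton Group B.
frequencies = [Fraction(5, 100), Fraction(1, 100), Fraction(8, 100)]

def joint_frequency(freqs):
    """Multiply frequencies together, assuming the markers are independent."""
    result = Fraction(1)
    for f in freqs:
        result *= f
    return result

joint = joint_frequency(frequencies)
print(f"about 1 in {round(1 / joint):,}")
```

With these rounded percentages the product works out at about 1 in 25,000, the same order of magnitude as the 1 in 62,000 quoted above. The difference may reflect rounding of the percentages or a different method of calculation.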
An allied concept is that of the Relatively Unique STR Signature (also known by various other terms such as STR Motif). In short, these are a selection of STR marker values (usually between 3 and 8 in number) that are "unique" to just a few people within a surname project and which indicate that the people concerned are likely to be related to each other.
A good example from the Gleeson DNA Project shows that several members had relatively unique STR Signatures which predicted that they were related (Branch E and F below). This was later confirmed by SNP testing of the two branches.
Relatively Unique STR Signatures predict the existence of a Branch E and F (last 6 entries)
Branch E signature ... 464b=17, 607=14, 576=17
Branch F signature ... 391=10, 458=17, 459=9-9, 576=17
Robert Casey has developed this concept extensively and you can hear him talk about it in this video here.
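Checking a newcomer against a signature is a simple lookup, sketched here in Python. The two signatures are the Branch E and Branch F signatures quoted above; the newcomer's values are invented for illustration and the function name is my own. Values are stored as text so that multi-copy results such as 9-9 can be compared as written.

```python
# Signatures as quoted above for Branches E and F.
SIGNATURES = {
    "Branch E": {"464b": "17", "607": "14", "576": "17"},
    "Branch F": {"391": "10", "458": "17", "459": "9-9", "576": "17"},
}

def matching_branches(haplotype, signatures=SIGNATURES):
    """Return the branches whose every signature value appears in the haplotype."""
    return [
        branch
        for branch, signature in signatures.items()
        if all(haplotype.get(marker) == value for marker, value in signature.items())
    ]

# Hypothetical newcomer (values invented for illustration).
newcomer = {"391": "10", "458": "17", "459": "9-9", "464b": "16", "576": "17", "607": "15"}
print(matching_branches(newcomer))
```

A match on the full signature is only a prediction of relatedness. As with Branches E and F, it still needs to be confirmed by SNP testing.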
6. SNP testing is consistent among the members of the particular group
The advent of Next Generation Sequencing (producing tests like the Big Y and the array of SNP Packs) has created a SNP tsunami. And as more people SNP test, their predicted red SNP is being converted to a green confirmed SNP on the project's Y-DNA Results page.
As a result, many groups within a surname project are having their "Terminal SNP" characterised. And this allows us to compare any SNP markers that the newcomer has tested with the SNP markers that characterise the various groups within our surname project. If they are discordant, then the newcomer is ruled out from membership of those particular genetic families. But if they agree with each other, especially if they are SNPs quite far downstream, then this is further supportive evidence that the newcomer belongs in a specific genetic family.
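The concordance check can be sketched in Python as follows. The tiny tree uses only SNPs named earlier in this post (Z255, Z16437, A557 and A5631), with intermediate branches omitted, so it is an abbreviation and not the full Haplotree. The function names are my own.

```python
# Abbreviated child -> parent map built from SNPs named earlier in this post.
# Intermediate branches are omitted, so this is not the full Haplotree.
PARENT = {"Z16437": "Z255", "A557": "Z16437", "A5631": "Z16437"}

def lineage(snp, parent=PARENT):
    """Return the SNP followed by everything upstream of it."""
    chain = [snp]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

def is_consistent(newcomer_snp, group_snp):
    """True if one SNP lies on the other's line (same branch, different depth)."""
    return newcomer_snp in lineage(group_snp) or group_snp in lineage(newcomer_snp)

print(is_consistent("Z16437", "A5631"))  # newcomer tested only as far as Z16437
print(is_consistent("A557", "A5631"))    # newcomer sits on a sibling branch
```

The first case is consistent but not conclusive (the newcomer simply has not tested far enough downstream), whereas the second case rules the newcomer out of the group.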
The phrase terminal SNP is a bit of a misnomer. It should be restated as "current terminal SNP" and simply means the "most downstream" SNP marker that you have currently tested. And what is meant by "most downstream"? Imagine the Tree of Mankind (the Y-Haplotree) as starting with Genetic Adam (upstream) about 250,000 years ago and the various branches emerging from him and continuously branching over many thousands of years into finer and finer "more downstream" branches, until these finer branches start approaching the origin of surnames (roughly 1000 years ago) and a genealogical timeframe. So your "most downstream" branch would be the branch characterised by your "most downstream" SNP marker ... which in turn is determined by your current level of SNP testing. For example, your Y-DNA 37 STR results will predict which Haplogroup branch you sit on (let's say it is R-M269, which arose about 13.5K years ago), and the R-M269 SNP Pack will take you a little further down Branch R (say to Z255, 4000 years ago), and the R-Z255 SNP Pack will take you even further downstream (maybe to 2000 years ago), but the Big Y test will take you the furthest (maybe down to 500 years ago).
In the example below, all the green confirmed SNPs sit below the SNP marker that defines Gleeson Lineage II, namely A5631. Therefore any newcomer who matches any of these SNPs (even if he has a large GD to everyone in the project) can be reliably grouped into Lineage II. The abbreviated SNP Progressions (or SNP Signatures) for each of the individual SNPs is detailed below:
The predicted red SNPs are almost always much further upstream on the Tree of Mankind than the green confirmed SNPs. Think of the upstream SNPs as closer to Genetic Adam (250,000 years ago) and the downstream SNPs as closer to a genealogical timeframe (say, 1000 years ago).
7. SNP predictions are consistent (Matches’ Terminal SNP Analysis)
NB: SNP Predictions does not mean the red predicted SNP you get in the Haplogroup column (see figure above) when you first get your Y-DNA-37 results. It refers to SNPs much further downstream than that, usually within the last 5000 years and frequently within the last 2000 years.
If a newcomer to the surname project has not undertaken downstream SNP testing, it is still possible to guess what his downstream "terminal SNP" will be by simply analysing the terminal SNPs of his STR matches. I call this the Matches' Terminal SNP Analysis. It is a relatively simple technique that takes a little time to complete. Here are the steps in the process:
1) First, open up the Y-DNA Matches page and adjust the Matches Per Page setting so that all the matches are on the one page.
2) Next click on the heading in the Y-DNA Haplogroup column so that all of the matches are sorted by their terminal SNP.
3) Make a list of all the terminal SNPs (you can ignore the SNPs that are way upstream e.g. M269, P312, L21, etc)
4) Find out where each SNP sits on the Y-Haplotree, and (most importantly) the major subclade to which it belongs. You can do this in either of two ways: a) launch FTDNA's Haplotree, press Ctrl+F (Cmd+F on a Mac) and enter the SNP name. Once you find it, trace the branch back up to the previous branching point, make a note of the SNP there, and repeat the process until you arrive at a known subclade SNP; or b) google the following: "ytree" and the SNP name ... and this will bring you to the relevant page on the Big Tree. Then simply copy and paste the SNP Progression from the top of the page.
A google search for: ytree a5631
5) Both of the above methods will result in you having a SNP Progression for each SNP in the Matches List (see example below). If all (or most) of these SNP Progressions fall below a certain subclade, then the likelihood is that the newcomer will also test positive for some SNP below this subclade level. It may even be possible to predict that he sits on one of maybe two or three "way downstream" branches. And this can be strong supportive evidence that he is related to certain project members and should be grouped in a particular genetic family.
If on the other hand, the various SNP Progressions associated with this list of SNPs indicate that the newcomer is matching to multiple distinct upstream branches of the Haplotree, then no firm conclusions can be drawn about the newcomer's likely terminal SNP and therefore this information cannot be used to help place him in a specific genetic family.
6) As a result of this analysis, I may write to the newcomer and suggest they skip the upstream SNP Pack (e.g. R-M269) and move down to the more relevant downstream subclade SNP Pack (e.g. R-Z255) and purchase that one ... warning them that there is a 1% chance that my assessment may be wrong (but I haven't been wrong yet).
Output of the MTSA for a new project member (he was advised to do the R-L1065 SNP Pack)
There are SNP Packs available for most of the major subclades and it is important to know what these are. You can see a list of them by logging in to your FTDNA account, clicking on Upgrade, then Advanced Tests, then SNP Packs from the drop-down menu.
Surprisingly, this analysis works best at the 25-marker level (because there are usually too few matches at the 37, 67 and 111 marker levels).
Occasionally I will have to use www.Ybrowse.org to check for the existence of equivalent SNPs or alternative names (if the SNP in question does not turn up in the FTDNA Haplotree or the Big Tree).
8. The same surname variant is predominant in a genetic group
This usually emerges after the new project member has been grouped on the basis of the previous MPRs described above. This serves to support and validate the decision to group the newcomer in the specific genetic family.
9. The same MDKA location is present in the particular genetic group
As above. This serves to illustrate how essential it is to encourage all project members to include the birth location of their Most Distant Known Ancestor (MDKA / EKA) in the Genealogy section of their personal FTDNA webpages. After their surname, their ancestor's birth location is the single most important piece of information.
I covered much of this topic in a presentation I gave at the FTDNA Annual Conference in Houston (Nov 2017) and you can watch it on YouTube here. The relevant section is from 37 minutes 40 seconds onwards.
Let's imagine that the Tree of Mankind (aka Y-Haplotree) starts with "genetic Adam" (some 250,000 years ago) and splits into progressively more downstream branches as the timeline approaches modern day. These downstream branches can be identified by downstream SNP marker testing of your Y chromosome (with tests such as SNP Packs, and in particular the Big Y). This downstream Y-SNP testing helps locate your position on the Tree of Mankind and potentially this can prove very helpful for a variety of reasons:
It can help ensure that you have been grouped accurately in a specific "genetic family" (within a Surname Project, for example)
It can help determine your ancestral origins - at times the actual country, and potentially even the region or county ... this helps focus your genealogical research
It can identify your nearest genetic neighbours and their associated surnames ... which in turn can tie you into the genealogy of a specific 'clan' or sept
It can identify branches within a genetic family and which one you sit on (it can also be useful in generating a Mutation History Tree)
It can highlight the risk of Chance Matches due to Convergence amongst your list of matches
But ... the Big Y test is expensive. The technique below tells how to predict the Big Y result without doing the test. In that way you can reap the benefits of the Big Y without actually having to do it.
The technique is called Downstream SNP Prediction because we will be predicting what SNP markers you are likely to test positive for "downstream" i.e. approaching the modern era, say within the last 500-1500 years. And the MTSA in the title stands for Matches Terminal SNP Analysis - in other words, you will be analysing the terminal SNPs of each person on your list of Y-DNA matches generated from the Y-STR test that you have previously done (be it the Y-DNA-37, Y-DNA-67 or Y-DNA-111).
The technique is quite simple. It just takes a little bit of time to complete (about 10 minutes). But there is one major caveat - it does not always work. And once you see the results, you will have to make a judgement call on whether or not you think the result is likely to be reliable. But when it does work, it works well.
Essentially the MTSA method involves collecting the terminal SNPs of all of your Y-STR matches and then seeing where each SNP in turn sits on the Tree of Mankind.
If they all sit on the same branch, then you probably do too. If they sit on widely different branches, then the results are untrustworthy (in this particular instance), and the method has not been able to predict which downstream SNP you are likely to test positive for. As a consequence, formal SNP testing (Big Y or otherwise) will be necessary to determine your position on the Tree of Mankind.
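The "same branch or not" test can be sketched in a few lines of Python. This is only an illustration, not part of any FTDNA tool: the function name on_single_line is made up for this example, and each match is assumed to be represented by its SNP Progression written as a list running from upstream to downstream. The sample progressions are simplified and hypothetical.

```python
from typing import List


def on_single_line(progressions: List[List[str]]) -> bool:
    """Return True if every progression lies on one line of descent.

    Each progression is a list of SNP names ordered from upstream to
    downstream. They share a single line of descent when every one of
    them is a leading part (prefix) of the longest progression.
    """
    if not progressions:
        return True
    longest = max(progressions, key=len)
    return all(p == longest[:len(p)] for p in progressions)


# Hypothetical, simplified progressions for three matches.
concordant = [
    ["Z39589", "L1335"],
    ["Z39589", "L1335", "L1065"],
    ["Z39589", "L1335", "L1065", "FGC10125"],
]
discordant = concordant + [["Z39589", "L1335", "L1065", "Z16325"]]

print(on_single_line(concordant))  # True
print(on_single_line(discordant))  # False
```

A result of False does not say where the disagreement starts, only that the matches cannot all sit on one line of descent.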
Here is a list of the steps involved in Downstream SNP Prediction using the MTSA method. We will go through them later in detail one by one:
To start, sign in to your FTDNA account and open your Y-DNA Matches List.
Sort your matches list by "Haplogroup"
Note down the terminal SNPs and how often each one occurs - repeat this step for each marker level (111, 67, 37, & 25).
Plot each SNP in turn on the Haplotree
Assess whether or not the SNPs fall on a single line of descent coming down the Haplotree ...
if they do, there is a good chance that you will also follow this line of descent and end up on the same downstream branch (or a branch very close by)
if they do not fall on the same single line of descent, then the technique has not worked in this instance because Convergence is present
Make a judgement call on how reliable you think the results are
Now let's look at each step in detail.
Step 1 - open your Y-DNA Matches list
Step 2 - sort your matches by Haplogroup ... just click on the title "Y-DNA Haplogroup" and this will arrange your list of matches alphabetically by their Terminal SNP.
This individual has 183 matches at the 25 marker level (top left)
Step 3 - note down the terminal SNPs and how often each one occurs
In the example above, this would produce a list like this:
1) I don't bother recording the frequency of single SNPs. Thus, any SNP in the list without a number in brackets has only occurred once in the list.
2) I ignore any known "upstream" SNPs (e.g. M269, L21, etc) as these are too far upstream to be informative.
3) This exercise should be repeated at each marker level (111, 67, 37 & 25). In practice, the 25 marker level appears to be the most informative (currently).
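If you prefer to let a script do the counting, the tally in Step 3 could look something like the sketch below. It assumes you have typed or pasted the Haplogroup column into a list of strings such as "R-FGC10125" (that format, the sample values and the function names are assumptions for illustration). The set of ignored upstream SNPs is just the handful named in this post and would need extending in practice.

```python
from collections import Counter

# Upstream SNPs named in this post as too far upstream to be informative.
UPSTREAM = {"M269", "P312", "L21"}


def tally_terminal_snps(haplogroups, ignore=UPSTREAM):
    """Count how often each terminal SNP occurs, skipping upstream ones."""
    # "R-FGC10125" -> "FGC10125"; a bare "FGC10125" is left unchanged.
    snps = (h.split("-", 1)[-1] for h in haplogroups)
    return Counter(s for s in snps if s not in ignore)


def format_tally(tally):
    """List SNPs alphabetically; singletons get no number in brackets."""
    return [f"{snp} ({n})" if n > 1 else snp for snp, n in sorted(tally.items())]


# Hypothetical Haplogroup column copied from a matches list.
column = ["R-M269", "R-M269", "R-L21", "R-FGC10125",
          "R-FGC10125", "R-Z16325", "R-BY3441"]

print(format_tally(tally_terminal_snps(column)))
# ['BY3441', 'FGC10125 (2)', 'Z16325']
```

Running it once per marker level (111, 67, 37 and 25) gives one list per level to compare.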
Step 4 - Plot each SNP in turn on the Haplotree
This is the most time-consuming part of the exercise but you will get quicker with practice. To be comprehensive, it is best to identify the SNP Progression for each SNP in turn. The SNP Progression is simply the series of SNPs that characterise each branching point on the line of descent to the "terminal SNP" in question.
Thus the SNP Progressions associated with the list above would be listed as follows:
1) the easiest way to find the SNP Progression is simply to google "YTREE" and the SNP in question. This will bring you to Alex Williamson's Big Tree, each page of which has the SNP Progression for the particular branch of the Y-Haplotree under discussion (as in the diagram below for the first SNP in the list).
2) Sometimes the google approach will bring you to a branch slightly upstream of the SNP you want and you will have to search the webpage for the more downstream SNP. Do this by pressing Cmd+F (Ctrl+F on a PC) to FIND the SNP in question.
3) Sometimes the SNP won't be on the Big Tree and you may have to use the FTDNA or YFULL Haplotrees instead in order to find where the particular SNP sits on the tree.
4) Sometimes you may have to check www.YBROWSE.org to see if the SNP has an alternative name
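Tracing a branch back up the tree, one branching point at a time, is easy to mimic in code once you have noted which SNP sits directly above which. The sketch below assumes a small hand-built lookup of child to parent. The PARENT table is a hypothetical toy tree based on the SNPs in this example, with intermediate branches left out, and is not a copy of the Big Tree or the FTDNA Haplotree.

```python
# Hypothetical toy tree: each SNP maps to the SNP at the branching point
# directly above it. Intermediate branches are omitted for brevity.
PARENT = {
    "DF49": "Z39589",
    "L1335": "Z39589",
    "L1065": "L1335",
    "FGC10125": "L1065",
    "Z16325": "L1065",
}


def snp_progression(snp, parent=PARENT):
    """Walk upstream from a terminal SNP and return its SNP Progression."""
    path = [snp]
    while path[-1] in parent:
        above = parent[path[-1]]
        if above in path:  # guard against a mistyped, circular lookup
            raise ValueError(f"circular entry at {above}")
        path.append(above)
    # The walk collected SNPs downstream-first, so reverse for display.
    return " > ".join(reversed(path))


print(snp_progression("FGC10125"))  # Z39589 > L1335 > L1065 > FGC10125
print(snp_progression("DF49"))      # Z39589 > DF49
```

The walk stops at the first SNP that has no entry in the lookup, so the progression is only as long as the table you have built.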
Step 5 - Do the SNPs fall on a single line of descent?
Comparing the SNP Progressions above, a pattern clearly emerges. The majority of the SNP Progressions are on a single line of descent, at least as far down as L1065. The exception is the first SNP (BY3441), which splits off from the rest, two branches above L1065.
Below L1065, there are at least two branches - one via FGC10125 (5 instances - count carefully - count bullet points 4-6), the other via Z16325 (bullet point 7). So the SNPs do fall on a single line of descent ... up to a point. And beyond that point, there is some disparity ... some discordance ... different SNPs on different (i.e. separate) branches of the Haplotree.
But a single man cannot sit on two conflicting branches. He can only ever sit on one branch. Beyond a certain point, the predicted branches are contradictory. And this discordance indicates that some of his Y-STR matches are Chance Matches due to Convergence.
Chance Matches could also conceivably be due to an extreme lack of Divergence (i.e. the Y-STR signature / haplotype is passed down unchanged for many thousands of years), but the chances of this being the cause are probably very low.
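Finding the point where the matches stop agreeing can also be sketched in code. The function below (first_fork is a made-up name) walks down the progressions level by level and reports the shared line of descent plus how many matches take each branch at the first split. The input is a hypothetical, simplified version of this example: five matches via FGC10125, one via Z16325, and two that have only tested as far as L1065.

```python
from collections import Counter


def first_fork(progressions):
    """Return (shared_line, branch_counts) at the first point of discord.

    Progressions are lists of SNP names, upstream first. A match that
    simply stops testing early does not count as a disagreement.
    """
    shared = []
    depth = 0
    while True:
        next_snps = Counter(p[depth] for p in progressions if len(p) > depth)
        if len(next_snps) != 1:
            # Either the tree forks here or nobody has tested deeper.
            return shared, next_snps
        shared.append(next(iter(next_snps)))
        depth += 1


line = ["Z39589", "L1335", "L1065"]
matches = ([line + ["FGC10125"]] * 5   # five matches on one branch
           + [line + ["Z16325"]]       # one match on a competing branch
           + [line] * 2)               # two matches tested only to L1065

shared, fork = first_fork(matches)
print(" > ".join(shared))  # Z39589 > L1335 > L1065
print(dict(fork))          # {'FGC10125': 5, 'Z16325': 1}
```

An empty set of branch counts means no discord was found as far down as anyone has tested.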
Step 6 - make a judgement call
So where is this particular individual likely to sit on the Tree of Mankind? Based purely on the (partial) data presented above, he sits ...
Probably below Z39589 (estimated probability ... what? say ... 99%? 95%?)
Probably below L1335 (estimated probability ... 16 out of 17 instances = about 94%?)
Probably below L1065 (estimated probability ... 8 out of 9 instances = about 89%?)
Probably below FGC10125 (estimated probability ... 5 out of 7 instances = about 71%?)
Probably below Z16325 (estimated probability ... 1 out of 7 instances = about 14%?)
Probably below DF49 (estimated probability ... 1 out of 17 instances = about 6%?)
These probabilities are relatively crude, but certainly give a strong impression that the individual in question is highly likely to test positive for L1065, and below that is more likely to test positive for FGC10125 than for any of the other downstream SNPs.
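The arithmetic behind these crude percentages is simply "instances on this branch" divided by "instances that could tell us something at this level". A minimal sketch, using only the counts quoted in the list above (the function and variable names are made up for this example):

```python
def support_percent(hits, total):
    """Crude support for a branch: share of informative instances, in %."""
    return round(100 * hits / total)


# (instances on the branch, informative instances) as quoted in the list.
counts = {
    "L1335": (16, 17),
    "L1065": (8, 9),
    "FGC10125": (5, 7),
    "Z16325": (1, 7),
    "DF49": (1, 17),
}

for snp, (hits, total) in counts.items():
    print(f"{snp}: {hits} out of {total} = about {support_percent(hits, total)}%")
```

These figures are simple proportions of the matches who happen to have tested, not formal statistical probabilities, which is why the judgement call in this step still matters.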
So while this exercise has not identified a specific downstream SNP with 100% probability, it has pointed us in a specific direction and has identified a "most likely candidate", namely FGC10125 (about 70% probability) ... or maybe, some SNP below it, possibly FGC10117.
The SNP FGC10125 appears to have arisen some time at least 1150 years ago, so the exercise has potentially moved us down the Haplotree to a branch that arose within the last 1000-1500 years.
In addition, it has identified with even greater confidence (about 90% probability) that the individual sits somewhere below L1065 for which there happens to be a dedicated SNP Pack. So rather than doing an upstream SNP Pack like the R1b-M343&M269 Backbone Panel, this individual may choose to do the more downstream R1b-L1065 SNP Pack ... which (from the above) is likely to be appropriate with 90% probability. I always caution my project members that there is a chance (10% in this instance) that they will be wasting their money. The choice is theirs.
But before doing any downstream SNP Pack test (the R1b-L1065 SNP Pack in this example), it is always advisable to check that the SNP Pack actually contains the "further downstream" SNPs of interest (extracted from the list of matches' terminal SNPs above). And in this instance, the R1b-L1065 SNP Pack contains all the "more downstream" SNPs identified in the list above. So it would be a good choice to make in this instance ... if the individual did not want to spend money on the Big Y.
Several different types of profile can emerge from this exercise and they broadly fall into the following categories:
all the evidence points to a single downstream branch of the Y-Haplotree (say, within the last 1000 years)
most of the evidence points to a single downstream branch, but there is some minor downstream discordance within the last 2000 years or so, with several "very downstream" branches predicted
most / all of the evidence points to a major subclade branch (say, about 2000-4000 years ago) but, below this, many downstream branches are predicted indicating major downstream discordance
the evidence suggests several conflicting upstream branches of the Y-Haplotree (e.g. L21, U106, M198) and only some or none of the evidence points to a single major subclade. Thus in this case, major upstream discordance is present and accurate Downstream SNP Prediction is not possible
The various degrees of discordance arise due to Convergence. This is when, by chance and over the passage of time, the descendants of one branch of the Haplotree develop a similar set of Y-STR marker values to the descendants of another branch of the Haplotree. Thus the genetic signatures of the descendants of both branches look similar and they match each other, i.e. they appear in each other's matches list. This suggests there is a close connection (say, within several hundred years) when in fact the common ancestor lived several thousand years ago. They sit on completely different branches of the Haplotree, but their Y-STR signatures suggest they could be close cousins (when in fact they are not).
Here are a few examples of each profile.
Scenario 1 - no discordance, everything points to a single downstream branch
This scenario occurs with Farrell Group 2. Using the MTSA method on many of this group's members, and then plotting the resulting terminal SNPs onto a diagram of the Haplotree, indicates that they all fall on a single line of descent and predicts that the members of this group will test positive for the downstream SNP FGC20561.
There is little or no evidence of Convergence in this group: all the STR matches appear to be genuine "true positive" matches, and none appear to be Chance Matches due to Convergence.
MTSA of many Farrell Group 2 members predicts they will test positive for FGC20561
Scenario 2 - minor downstream discordance
The exercise described above (to illustrate the methodology) indicated that the individual's Y-STR matches all sat on a single line of descent as far down as Z39589. Immediately after that there was some "minor discordance" (one match tested positive for DF49), but the majority of the group continued downstream to L1335 and L1065. Thereafter, there was some more discordance in the group, with 5 going down the path of FGC10125 and one turning down to Z16325. Thus, all the evidence was concordant down to Z39589 (100%), a majority of the available evidence was concordant down to L1065 (89%), and a smaller majority of the available evidence was concordant down to FGC10125 (71%). And from this we can conclude that this individual and his Y-STR matches share a common ancestor on the branch of the tree characterised by Z39589, and probably share another common ancestor further downstream on the branch characterised by L1065, and possibly share another common ancestor on the FGC10125 branch.
This is a fairly typical profile that emerges from this exercise. It takes you so far down the Haplotree but no further. Additional SNP testing will be needed to confirm the predictions.
In this scenario, Convergence is present, but it does not exert an influence until we get quite far downstream. Thus the common ancestor for the group is relatively far downstream, certainly below the major subclade level (about 2000-4000 years ago), and probably within the last 1500 years. In the example above, the major subclade L1065 is at least 1800 years old and the downstream SNP FGC10125 is at least 1150 years old. In the diagram below, the major subclade L226 is at least 1450 years old, and the downstream SNP FGC5628 is at least 1100 years old.
Two Discordant Downstream Branches occurring below major subclade R-L226
Scenario 3 - major downstream discordance
In this scenario, the MTSA methodology identifies many Discordant Downstream Branches, frequently with no particular sub-branch predominating. The individual is predicted to sit somewhere below a major subclade branch but there are so many candidates further downstream that no reasonable prediction can be made.
However, it remains clear that the individual does fall below a major subclade branch and therefore the associated subclade SNP Pack may be an appropriate test to take (if the individual does not want to purchase the Big Y). The SNP Pack will need to be checked to see if any relevant SNPs are included therein.
In the diagram below, MTSA predicts that the individual will sit on a branch downstream of M222 (a SNP marker known to be associated with significant Convergence). However, there are at least 6 different branches below M222 that the MTSA methodology predicts as possible candidates for the individual's branch. This person went on to do the Big Y test and the confirmed branch he actually sits on turned out to be none of the candidates predicted by MTSA. This illustrates the importance of making a judgement call on the reliability of the predictions.
Several Discordant Downstream Branches indicate major downstream discordance
Scenario 4 - major upstream discordance
In the final scenario, there are multiple Discordant Upstream Branches making it impossible to predict which subclade of the Haplotree the individual belongs to. For example, some matches sit on L21, others on U106, and others on M198 - all upstream SNPs that are thousands of years old. Under these circumstances, actual Big Y testing is the only option for defining where on the haplotree the individual sits.
Note: I generally use the terms Upstream and Downstream in crude approximation to the nearest major subclade, which tends to be in the range of 2000-4000 years ago. Upstream is roughly more than 4000 years ago; and Downstream is roughly less than 2000 years ago. But these are approximations.
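For anyone who wants to apply this rule of thumb consistently, it boils down to two cut-offs. The sketch below is only a restatement of the approximation above in code; the function name and the label for the middle band are made up for this example, and the ages passed in are the rough figures quoted earlier in this post.

```python
def stream_label(age_years):
    """Label a SNP by its approximate age, using the rough cut-offs above."""
    if age_years > 4000:
        return "upstream"
    if age_years < 2000:
        return "downstream"
    return "major subclade range"


# Approximate ages quoted earlier in this post.
print(stream_label(13500))  # M269     -> upstream
print(stream_label(4000))   # Z255     -> major subclade range
print(stream_label(1150))   # FGC10125 -> downstream
```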
Some Final Words
Downstream SNP Prediction using the MTSA method can be surprisingly predictive in many cases.
Currently it works best at the 25 marker level, simply because there are many more matches at this level and therefore many more datapoints. However I always check the higher marker levels first and also check for consistency across the different marker levels. I have rarely explored 12 marker results (because the risk of