Researchers Used This Genealogy Site to Build a 13 Million-Person Family Tree

In the last 20 years, genealogy websites have attracted more than 15 million customers by promising insights into your past. Maybe you’ll uncover a secret infidelity or be reunited with a long-lost cousin, like when Larry met Bernie on Finding Your Roots. It’s deeply personal, affecting stuff. But when your family tree contains thousands, millions, even tens of millions of people, it’s no longer a personal history. It’s human history.

When the commercial genealogy and social networking website Geni launched in 2007, it aimed to create a “family tree of the world.” Today, amateur genealogists have created more than 115 million individual profiles on the free site, linking them together by marriage or birth when they can. Recently, the company allowed scientists from the New York Genome Center, Columbia, MIT, and Harvard to scrape these crowdsourced public records into family trees the size of small nations. Their analysis, which was published today in Science, includes the single largest known family tree, containing 13 million people (one of whom, spoiler alert, is Kevin Bacon).

The team, which was made up mostly of geneticists and bioinformaticians, was also able to establish a new perspective on the genetic basis for longevity. It’s a hot topic, especially around Silicon Valley, where numerous well-funded startups have devoted themselves to finding the secrets to aging in DNA. But it’s a hard one to study. “I can’t just put up posters in the New York subway saying, ‘Hey bring your cousins, we want to study longevity!’” says study author Yaniv Erlich. “It’s a lot easier to just log into [Geni] and download this data at a massive scale.”

Now of course, he would say that. Up until a year ago, Erlich was leading academic research into DNA data storage, genome hacking, and population genetics at Columbia. That’s where he first got introduced to the Geni dataset. He and his co-authors first published a draft of their work on the preprint server bioRxiv last February. And a week before it posted, he took a leave of absence to accept a job as the chief scientific officer of MyHeritage, Geni’s parent company, which began offering personal DNA kits in 2016.

Researchers constructed this 6,000-person family tree using graph theory. Individuals spanning seven generations are in green, connected with red lines signifying marriage.

Columbia University

By looking at lifespan variation between more than three million pairs of relatives, Erlich and his academic partners—which include former colleagues at Columbia and the New York Genome Center—found that your chances of living longer could only be chalked up to your genes about 16 percent of the time. Previous studies have placed heritability estimates between 10 and 30 percent, with lifestyle, environment, and just dumb luck making up the rest of the picture. You can have great genes, but that won’t stop you from getting in a car crash, or being in the backwoods when the big one hits. “We found there’s much less signal in the genome to potentially find,” says Erlich. “If you live or don’t live is mostly something you don’t have control of.”

Mostly the purpose of the paper, he says, was to show that this kind of data, crowdsourced from descendants who seek out sites like Geni, could offer up the same analytical insights as more traditional demographic datasets, which are way more labor- and cost-intensive to produce; the last US Census ran to the tune of $13 billion. That’s not a given: “With a dataset like this, the worry is that it’s special in ways we can’t yet understand,” says Josh Goldstein, a demographer at UC Berkeley. The chances of finding relatives could come down to whether they lived in a place with good records, or whether they happened to be relatively famous (see Kevin Bacon), or just random luck.

But the authors in this case took pains to address some of those issues, notably by comparing the death certificates of some 80,000 Vermonters who died between 1985 and 2000 with 1,000 Geni profiles from the same time and place. In terms of socioeconomic factors, the two groups matched up near-perfectly: 98 percent concordance. It seems that the crowdsourced amateur data decently represents the general population.

After downloading 86 million public profiles on Geni, researchers used mathematical graphing to clean and organize the data into family trees. This one has 70,000 relatives connected through marriage and shared ancestors.

Columbia University

And it’s publicly available. Anyone can download the researchers’ tree and demographic data, in a de-identified format. And once they’ve done that, they could theoretically fuse these massive pedigrees with other data collections—say DNA sequenced by MyHeritage, Ancestry, or 23andMe. Then you could start tracing diseases, and any associated genes, across generations. “The cumulative effect of this and other public data sets could be very large in the years ahead,” says Goldstein.

Geni has set up its API to allow researchers to contact anyone in its database (through an encrypted, de-identified token system) to get their consent to access their data. “In the old days you had to pay people to participate in a study, and it generated one dataset for one specific thing,” says Erlich. “Now we can repurpose the work genealogists have done to get to know their families, and leverage it to answer fundamental questions.”

Now, is it too soon to start giving ancestor-hunting hobbyists credit for ending human suffering? Yeeeah. But maybe a good time to find out what your family tree can do for science.
