70,000 OkCupid Users’ Data Has Been Published — But Is That So Wrong?


All of the personal information you haphazardly upload to social media and dating sites is a treasure to someone, whether it's advertisers, police or programmers. But would you be fine with someone using all that data for the purpose of advancing scientific research?

A pair of researchers in Denmark compiled a database of 70,000 users from the dating site OkCupid and published it to an open science community for anyone to search or run experiments on.

"Despite many years of advocacy of proponents, it is still uncommon for social scientists to publicly share their datasets and even sharing data on request is rare," the researchers wrote in a paper included with the dataset. "Worse, there is some evidence which indicates that those who refuse to share data upon request make more statistical errors than those who share data."

The searchable database includes 36 points of personal information like username, age, location, "religion-related opinions" and number of photos. Actual photos themselves, as well as the body text of the profiles, were not collected, though that information could easily be found using an OkCupid account once a target profile was identified.

Hacks and exposures of giant databases of personal information have become frequent in recent years, with high-profile incidents like the Ashley Madison hack or Anonymous' KKK dox, so some were outraged when they heard about the OkCupid database:

But Weingart's warning about our ability to unearth personal information was true about OkCupid even before this leak.

Already public? The OkCupid database was collected with a scraper, a program that automatically runs through a website to collect all of the data — the algorithmic equivalent of going through the semi-public profiles one by one and jotting down the info. 

Although scraping violates OkCupid's terms of service, it's not some sort of illegal hack. It's a convenient collection of information that was already available by inconvenient means.


All kinds of public data, digital and not, are scraped into databases daily. Police use license plate readers to track cars and justify the surveillance by saying each reading is simply a photograph taken in public. Twitter just asked a big data firm to stop offering its services to law enforcement, but is still making its firehose of tweets searchable for marketers and advertisers.

Scraping and uploading the OkCupid dataset is mischievous, but if the researchers are to be believed, their motivation is to advance academics and science. OkCupid itself constantly runs experiments on its dating data to improve its product or learn more about human relationships; cofounder Christian Rudder wrote a whole book about it.

As we move more of our lives online, it's harder to hide personal information from someone who wants to collect it. That data can be used for good or evil, and academic research hardly seems like the worst-case scenario.