Your medical information, credit card purchases, rewards cards, online browsing history — it is all collected and sold with the promise that it is anonymized. In theory, that means that the data points are scrubbed of any personally identifying information that would make it possible for someone to track the information back to you. Unfortunately, it appears that supposedly anonymized data isn't so secure after all. A new study published in Nature Communications shows it is possible to use machine learning to reverse engineer common data anonymization practices and identify individuals with terrifying accuracy.
According to the researchers, their algorithm can re-identify people with 99.98 percent accuracy even when applied to a partial data set that contains very little information about individuals. The system requires as few as 15 characteristics commonly attached to supposedly anonymized data, such as a person's age, gender, marital status, or ZIP code.
That type of demographic information often remains attached to data because it isn't considered identifiable on its own, but the more of it that remains, the easier it is to narrow a data set down to the individual level. Even if 15,000 people live in your ZIP code, the pool of potential matches shrinks with each additional bit of data. Maybe half share your marital status, and half of that group shares your gender identity; with just those two bits of information, the pool of matches has shrunk to 3,750. Bits like your age, the type of car you drive, whether you own a pet, and even your birthday can also remain attached while the data is still considered anonymized, because your name has been stripped from it. Yet that is more than enough to pick you out of what initially seemed like a massive data set.
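The narrowing-down process described above can be sketched in a few lines of code. This is a toy simulation, not the study's actual method: the population, attributes, and random assignments are all hypothetical, chosen only to show how each extra attribute shrinks the pool of candidates.

```python
import random

random.seed(0)

# Hypothetical population of 15,000 people sharing one ZIP code,
# each with randomly assigned demographic attributes.
population = [
    {
        "id": i,
        "married": random.choice([True, False]),
        "gender": random.choice(["F", "M"]),
        "birth_year": random.randint(1940, 2005),
    }
    for i in range(15_000)
]

# The "target" is one specific person we try to single out.
target = population[0]

# Filter by each attribute in turn, as the article describes:
# each attribute roughly halves (or better) the remaining pool.
matches = population
for attr in ["married", "gender", "birth_year"]:
    matches = [p for p in matches if p[attr] == target[attr]]
    print(f"after matching {attr}: {len(matches)} candidates remain")
```

With just three attributes, the pool typically collapses from 15,000 to a few dozen people, and every additional attribute cuts it further.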
This becomes a problem when companies collect, or can combine, multiple data sets. Most individual samplings carry only a couple of bits of demographic information, but as companies acquire more information, it's possible to unlock additional details about someone by pairing known bits of data. The end result is a surprisingly complete picture of a person, cobbled together from supposedly anonymized data. It's how a company like Facebook can produce advertisements so specific to you that you might suspect it is using your phone's microphone to listen to your conversations.
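The pairing of data sets works like a database join on shared demographic fields. The sketch below uses entirely made-up records and field names: two "anonymized" data sets, neither containing a name, get linked into one profile because both carry the same ZIP code, birth year, and gender.

```python
# Two hypothetical "anonymized" data sets. Neither holds a name,
# but both carry the same demographic attributes.
purchases = [
    {"zip": "10001", "birth_year": 1985, "gender": "F", "item": "prenatal vitamins"},
    {"zip": "60614", "birth_year": 1972, "gender": "M", "item": "golf clubs"},
]
browsing = [
    {"zip": "10001", "birth_year": 1985, "gender": "F", "site": "parenting-forum.example"},
    {"zip": "94110", "birth_year": 1990, "gender": "F", "site": "recipe-blog.example"},
]

# Treat the shared demographic fields as a join key.
def key(record):
    return (record["zip"], record["birth_year"], record["gender"])

# Merge every record that shares a key into a single profile.
profiles = {}
for rec in purchases + browsing:
    profiles.setdefault(key(rec), {}).update(
        {k: v for k, v in rec.items() if k in ("item", "site")}
    )

# The person from ZIP 10001 now has a purchase history and a
# browsing history attached to one combined profile.
print(profiles[("10001", 1985, "F")])
# → {'item': 'prenatal vitamins', 'site': 'parenting-forum.example'}
```

No single data set here is especially revealing on its own; the privacy problem appears only once the key is shared across sets.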
This isn't even the first time that anonymization tactics have been shown to be inadequate. At the DEF CON hacking convention in 2017, researchers showed that they were able to identify individuals based on web-browsing history. The researchers acquired the online activity of more than three million German citizens from a data vendor and were able to identify things like the porn preferences of a judge and the type of medication used by a German member of parliament. Geneticists have also found that anonymized DNA data sets, stripped of identifying information so they can be used for research, often contain enough detail to identify individuals within them.
To prove just how easy it is to identify an individual in a sea of anonymized data, the team behind the latest research has set up a site where you can see how likely you are to be identified. The online app asks simply for your ZIP code, date of birth, and gender, then tells you how likely it is that you'd be picked out of the crowd. The answer just might surprise (and terrify) you.
The problem with all of this is that anonymized data is actually very important and useful. It's an incredibly valuable tool for researchers and can be the key to better understanding trends and behaviors that would otherwise go unnoticed. Unfortunately, the current standards for anonymization have now been proven time and time again to fall short.
Scrubbing someone's name from their information isn't enough. Selling only incomplete data sets isn't enough. As more and more of our data is made available and sold by data brokers, it becomes increasingly possible for anyone, from marketers and corporations to malicious actors who get their hands on the information, to learn things about us that are sensitive and supposed to stay private. Giving people more choice about which of their information is shared, and adding new limits on access so that data sets can't be combined to identify individuals, would go a long way toward protecting people's privacy while still giving researchers the information they need. That might add obstacles to using the data, but for the sake of the people being exposed, that's probably not the worst thing in the world.