How COVID-19 is changing the way we think about privacy
How will the COVID-19 crisis end? Will most of us be required to remain at home and socially distance ourselves from others for 18 months or more until a working vaccine emerges? Unless other measures are taken, we may have to wait that long. Any extended quarantine period will likely cause widespread, long-lasting economic harm, disruptions in supply chains, and dangerous psychological impacts–all on top of the direct health impacts of COVID-19.
What measures could be taken to mitigate these impacts, get the economy restarted, and allow us to leave our homes and get back to some sense of normalcy? The obvious ones have been heavily reported: more testing (both tests for active disease and serology tests for past infection), manual contact tracing to prevent community spread, smaller gatherings, and personal common sense (we may never shake hands again).
But there’s a less obvious, much more important measure that could have a profound impact on stopping the spread and ending the quarantine: Using automated contact tracing to play offense instead of defense.
China, Singapore, and South Korea have demonstrated how to use automated contact tracing successfully, as outlined by the CDC. With real-time data, these governments were able to automate contact tracing to quickly identify people who had been in close proximity to those infected with the coronavirus and propose strict quarantines to avoid further spread.
To understand why automated contact tracing is required, let’s explore an analogy from a similar worldwide pandemic: HIV. If you test positive for HIV, the moral course of action is to identify and notify anyone you’ve had sexual relations with in the past, so those people can get tested and take similar precautions to avoid further spread. Since HIV is transmitted mainly through sexual contact and other direct exposure to blood, such as shared needles, the number of contacts that must be traced is manageable. It can be done manually, without technology, and without automation.
But with something as infectious and fast-spreading as COVID-19, it’s a Herculean task to trace all exposed contacts without technology and automation (though some parts of the country like San Francisco are trying to do just that). You can’t contact the person behind you in line at the grocery store last week. You can’t know the person who pumped gas right after you. Automated tracing gets you the real-time knowledge to identify everyone you’ve been near during your contagious period.
So with millions of lives at stake, why haven’t public health experts in the United States begun using mobile data to automate contact tracing? In short, because it’s a privacy nightmare. This type of tracking might be feasible in China or Singapore, but would (and should) be much harder to accept socially in the U.S. Doing so will set a precedent that can’t be undone.
But what if we could have our privacy “cake” and eat it, too? What if we could lead the world and show that contact tracing can be achieved without violating personal privacy? Is it even possible?
Understanding privacy attacks on data
To answer this question, let’s start with the privacy of geo-temporal data. That is the type of data your mobile phone provider possesses, and it is the holy grail for tracking who we’re with and where we go.
At a minimum, your mobile provider knows who you are through the multiple identifiers on your phone, such as your provider account, phone number, SIM card, and hardware. There are even personal identifiers that travel with you across phones like an ad-ID.
Public health experts can use this personal data, and the geolocation data attached to it, to contact trace while still protecting your privacy. They do so by masking your personally identifiable information, or PII, through encryption or hashing. Without getting into the data weeds, masking means converting your PII into a garbled, unidentifiable mess. But it’s still your mess. Each person’s masked data remains unique and consistent throughout the data. In layman’s terms, masking means you can still track people in the data without knowing who they are.
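To make masking concrete, here is a minimal sketch that assumes a keyed hash (HMAC-SHA256) as the masking function. No particular scheme is prescribed above, so the key, the function name, and the phone numbers are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret held by whoever masks the data (an assumption for illustration).
SECRET_KEY = b"replace-with-a-real-secret"

def mask_identifier(pii: str, key: bytes = SECRET_KEY) -> str:
    """Return a consistent pseudonym for a piece of PII using HMAC-SHA256."""
    return hmac.new(key, pii.encode("utf-8"), hashlib.sha256).hexdigest()

# The same input always yields the same garbled value, so a person can
# still be followed through the data without their number being visible.
masked_a = mask_identifier("+1-555-0100")
masked_b = mask_identifier("+1-555-0101")
masked_a_again = mask_identifier("+1-555-0100")
```

Here masked_a and masked_a_again are identical while masked_b differs, which is exactly the "unique and consistent" property described above.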
Yet this simple act of masking personal data is far from a privacy success.
Remember, the whole point of this exercise is to be able to know when people are (or were) in close proximity to each other. To truly know that, we need sensitive space-and-time (geo-temporal) data, such as the GPS pings from your mobile device. While geo-temporal data is critical to understanding when people are in close proximity to each other, it also presents huge privacy risks.
For example, I could easily count up each scrambled phone number’s GPS pings between 1 a.m. and 6 a.m. every day. From that exercise, I’d know where each scrambled phone number is located through the night (most of us sleep in our houses with our phones). Then, via the location I discovered for a specific phone, I can associate its scrambled phone number to an address and presumably a person. This is what’s called an inference attack. I can infer sensitive attributes about a data subject (where they live) even if I don’t have direct access to that subject’s data.
And inference attacks are just one of many different types of privacy attacks. Another is called a linkage attack, which enables an attacker to identify an individual by “linking” separate data points together. For example, I might know that you were at your house at 1:48 p.m. today, and also that you were in Baltimore, MD, on May 12, 2020. Just by knowing those two points in space and time, I could identify you in our hypothetical location data by “linking” these two facts to the records in that data based on coordinates and timestamp.
Dr. Yves-Alexandre de Montjoye and others have shown that as few as four data points in space and time are enough to uniquely identify roughly 90% of the individuals in a large data set. These types of privacy attacks are relatively easy to conduct, which makes privacy in these circumstances very, very hard to preserve.
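The inference attack described above can be sketched in a few lines. This is an illustration under stated assumptions, not a real tool: the ping records, the masked IDs, and the function name are hypothetical, and rounding coordinates to three decimals is just one simple way to group nearby pings.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical masked pings: (masked_id, ISO timestamp, latitude, longitude).
pings = [
    ("a1f3", "2020-04-01T01:10:00", 39.2904, -76.6122),
    ("a1f3", "2020-04-01T13:00:00", 39.3000, -76.6100),
    ("a1f3", "2020-04-02T02:30:00", 39.2904, -76.6122),
    ("a1f3", "2020-04-03T05:45:00", 39.2904, -76.6122),
    ("9c7e", "2020-04-01T03:00:00", 39.2800, -76.5900),
    ("9c7e", "2020-04-02T04:15:00", 39.2800, -76.5900),
]

def infer_home_locations(pings, start_hour=1, end_hour=6, precision=3):
    """Guess each masked ID's home as its most frequent overnight location."""
    night_counts = defaultdict(Counter)
    for masked_id, timestamp, lat, lon in pings:
        hour = datetime.fromisoformat(timestamp).hour
        if start_hour <= hour < end_hour:
            # Rounding groups pings that fall within roughly a city block.
            cell = (round(lat, precision), round(lon, precision))
            night_counts[masked_id][cell] += 1
    return {mid: counts.most_common(1)[0][0] for mid, counts in night_counts.items()}

homes = infer_home_locations(pings)
```

The result maps each scrambled ID to a likely home coordinate, which an attacker could then look up against a public address directory. The phone number was never unmasked; the location data gave the person away.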
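A linkage attack can be sketched the same way. The records, the known points, the tolerances, and the function names below are all hypothetical; the sketch simply keeps the masked IDs that match every fact the attacker already knows.

```python
from datetime import datetime, timedelta

# Hypothetical masked location records: (masked_id, timestamp, lat, lon).
records = [
    ("a1f3", datetime(2020, 5, 12, 13, 48), 39.2904, -76.6122),
    ("a1f3", datetime(2020, 5, 12, 18, 5), 39.2850, -76.6130),
    ("9c7e", datetime(2020, 5, 12, 13, 50), 39.3300, -76.6200),
    ("9c7e", datetime(2020, 5, 12, 18, 0), 39.2850, -76.6130),
]

# Facts the attacker already knows about the target from outside the data set.
known_points = [
    (datetime(2020, 5, 12, 13, 48), 39.2904, -76.6122),
    (datetime(2020, 5, 12, 18, 5), 39.2850, -76.6130),
]

def matches(record, point, max_minutes=10, max_degrees=0.001):
    """True if a record is close to a known point in both time and space."""
    _, ts, lat, lon = record
    p_ts, p_lat, p_lon = point
    close_in_time = abs(ts - p_ts) <= timedelta(minutes=max_minutes)
    close_in_space = abs(lat - p_lat) <= max_degrees and abs(lon - p_lon) <= max_degrees
    return close_in_time and close_in_space

def link(records, known_points):
    """Return masked IDs that have a matching record for every known point."""
    candidates = {r[0] for r in records}
    for point in known_points:
        candidates &= {r[0] for r in records if matches(r, point)}
    return candidates

linked_ids = link(records, known_points)
```

In this toy data, the second known point alone matches both masked IDs, but combining it with the first leaves only one candidate. Each additional known point shrinks the candidate set further.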
Balancing utility and privacy
If encrypting or hashing PII alone does not stop privacy attacks, we’re in quite the conundrum. Remember, we need the coordinates and time-stamp data to find people in close proximity to each other. But that same geo-temporal data is also key to breaking privacy, as we’ve shown.
You may hear this challenge famously framed as the privacy vs. utility tradeoff. In Paul Ohm’s oft-cited paper on this subject, he states that “data can be either useful or perfectly anonymous but never both.” But if you are very specific about the use case, there are situations where Ohm’s statement comes close to being false.
With advanced, privacy-enhancing techniques, we can attain the sliver of usefulness we need, while maintaining a very high level of privacy. In the case of automated contact tracing, it turns out we actually don’t care or need to know where you are. We just care that you are NEXT to someone. In other words, we don’t care about your absolute location or time, we only care about your location and time relative to everyone else.
For example, it’s critical to know your phone number “paoisufnpifefsdas” (scrambled) was next to phone number “p98hpqefq34ffswe” (scrambled). It is not relevant that I know you were at the local Starbucks near your house. If we eliminate absolute location from the data set while retaining relative location, we now have both the utility we need and strong privacy.
To go back to Paul Ohm’s quote, we don’t have perfect utility, but we have enough utility to get the job done while preserving a great deal of privacy. This approach mitigates both of the privacy attacks we discussed above. Inference attacks won’t work because absolute location data is removed, while linkage attacks are prevented because there are no longer any matches (links) in the data. A technique like this, which preserves some utility and some privacy (not binary), is commonly termed a Privacy-Enhancing Technology (PET).
Mobilizing the data
We left out a critical part in all this: Even if we devise an approach to balance utility and privacy, how would public health experts actually get the data? There are various ways to approach this, but they all come down to using the smartphones we all carry with us throughout our days and lives.
In the U.S., 77% of the population has a smartphone with the potential to share data for contact tracing. Through mobile and cloud-based technologies, we could create a solution to enable mobile devices to share just enough information to help us automate contact tracing. Overall, there are two broad categories of approaches here:
1. Bluetooth-based, which leverage the proximity between nearby devices
2. GPS-based, which leverage the geolocation data your phone is likely already capturing
Each of these has its pros and cons. Let’s start with Bluetooth-based approaches. Some have proposed that U.S. residents install a mobile app that records when their phone’s Bluetooth is near someone else’s for the purpose of contact tracing. While this is a workable idea, there are two critical issues with the approach.
1. There’s limited history. You can only go as far back in the collected data as the day the app was installed. So the value of the data set will be limited until a very large percentage of the population has installed it.
2. Bluetooth technology is “static in time,” meaning it does not account for when you occupy a space after someone with COVID-19 has just left it. You both have to be there.
These issues cause huge information gaps, which make Bluetooth approaches less likely to be successful. That said, novel approaches are being considered, including Apple and Google’s newly announced collaboration for contact tracing, which leverages concepts from federated learning. Federated learning is another type of PET: rather than bringing all the data to the algorithm, the algorithm “travels” to the data, allowing underlying sensitive data to remain local (and hence, private) on the mobile devices of the participants. For example, data captured from Bluetooth proximity tracking can be stored locally on your phone, and periodically, your phone can cross-reference the devices it has been near with a database of device IDs for known infected citizens. This approach would have high proximity accuracy, but, as discussed, large data gaps.
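The on-device cross-referencing step can be sketched as a simple set intersection. This is not Apple and Google’s actual protocol, only an illustration of the idea that the check runs on the phone; the device IDs and the function name are hypothetical.

```python
# IDs this phone has observed nearby over Bluetooth, stored only on the device.
# All identifiers below are hypothetical.
locally_observed_ids = {"dev-1f9a", "dev-77c2", "dev-b310"}

def check_exposure(observed_ids: set, infected_ids: set) -> set:
    """Run on the phone: return observed IDs that appear in the published list."""
    return observed_ids & infected_ids

# Published list of device IDs for people who tested positive,
# downloaded periodically by every participating phone.
published_infected_ids = {"dev-77c2", "dev-e004"}

exposures = check_exposure(locally_observed_ids, published_infected_ids)
possibly_exposed = bool(exposures)
```

The design choice that matters is the direction of travel: the list of infected IDs comes down to the phone, and the phone’s own contact history never leaves it.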
Overall, Bluetooth-based contact-tracing approaches provide fewer false positives (whoops, you weren’t exposed to an infected person after all) but are vulnerable to a large number of false negatives (whoops, we missed a few times you were in the same location as an infected person) due to the issues described above.
GPS-based approaches are the second category of mobile approaches to automating contact tracing. While GPS approaches are likely to result in a higher number of false positives, they will have a lower number of false negatives. Why are there fewer false negatives with GPS?
1. Your phone is, in many cases, keeping GPS history. So GPS solutions, unlike Bluetooth solutions, could “go back in time” and the contact-tracing app could partner with mobile providers or other apps that have your GPS history.
2. GPS does not require both parties to actually be next to each other, simply that they were at the (nearly) same place at the (nearly) same time.
GPS solutions aren’t perfect either, though. Unlike Bluetooth, which provides true proximity (device to device), GPS only sees “straight down.” This means I can be at the exact same GPS coordinate as you on the planet, but I might be on the first floor of a large apartment building, while you are on the 20th floor. This is a good example of why pure, GPS-based solutions will produce more false positives.
However, the argument can be made that these same dense areas–such as apartment complexes that cause GPS problems–are also much more likely to spread viruses (doorknobs, elevator buttons, mail area, etc).
Alex Berke and others have recently proposed a new PET model that will allow the anonymization of GPS data in a way that preserves relative location (time and space), but not the absolute location, as we discussed earlier. This works much like a game of Battleship. Time and space are bucketed into a grid, and each grid cell is hashed to remove any shred of information about its absolute location in time and space. We can know phones were together at gridpoint B-3 but have no idea that it’s B-3, nor where B-3 is.
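The Battleship idea can be sketched as follows. This is a simplified illustration, not the published proposal: the grid size, the time window, and the function name are assumptions chosen for readability.

```python
import hashlib
import math
from datetime import datetime, timedelta, timezone

CELL_DEGREES = 0.0005   # hypothetical grid size (roughly 50 m of latitude)
WINDOW_SECONDS = 300    # hypothetical time bucket (5 minutes)

def hashed_cell(lat: float, lon: float, when: datetime) -> str:
    """Bucket a ping into a space-time grid cell and return only the cell's hash."""
    row = math.floor(lat / CELL_DEGREES)
    col = math.floor(lon / CELL_DEGREES)
    slot = int(when.timestamp()) // WINDOW_SECONDS
    cell = f"{row}:{col}:{slot}"
    return hashlib.sha256(cell.encode("utf-8")).hexdigest()

noon = datetime(2020, 4, 1, 12, 1, tzinfo=timezone.utc)

# Two phones a few meters and one minute apart land in the same hashed cell.
phone_a = hashed_cell(39.29041, -76.61221, noon)
phone_b = hashed_cell(39.29043, -76.61224, noon + timedelta(minutes=1))
# A phone about a kilometer away lands in a different one.
phone_c = hashed_cell(39.30000, -76.61221, noon)
```

Matching hashes reveal that two phones shared a cell without revealing which cell it was. A plain hash over a small grid could be reversed by hashing every possible cell, so a real design would need additional protection beyond this sketch.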
The future of privacy
As we sit in our homes in the spring of 2020, we’re once again forced to consider our privacy vs. utility tradeoffs as a nation, similar to what occurred after 9/11 with the passing of the Patriot Act. But this time, fortunately, the challenge isn’t as broad as finding a few terrorists among a massive population. Instead, we know exactly what we need from our data to stop the spread. This is specifically what will allow PETs to do their thing. The choice between privacy and utility feels black and white. But when done well, it’s not. We don’t have to choose between our health and authoritarian surveillance, at least not if we approach this crisis correctly.
The optimal solution is likely going to come from a combination of PET and mobile approaches (Bluetooth + GPS). Any company dealing with data (geo-temporal or not) collected from consumers should understand the challenge they themselves face when making tradeoffs and risk assessments between utility and privacy, and how to address those challenges.
PETs such as federated learning, k-anonymization, differential privacy, and randomized response will soon be common terms, becoming the “encryption” of this decade, fueled particularly by the steady onslaught of new privacy regulations like the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA)—only two of many.
The coronavirus pandemic may be the catalytic event that pushes advanced PETs into the mainstream.
Steve Touw is cofounder and chief technology officer at the automated data governance company Immuta.