Safeguarding Against the Risks of Deanonymization
Navigating the illusion of anonymity with confidential AI solutions.
Dear reader,
The illusion of privacy in “anonymized” datasets is a growing concern in today’s data-driven world. At Opaque Systems, we’ve long understood the limitations of traditional anonymization techniques, but a recent experiment we conducted revealed just how easy it can be to re-identify individuals using modern AI tools.
In the article below, we walk you through how we used ChatGPT, one of today’s most advanced language models, to deanonymize public data—demonstrating that privacy protections many rely on are far less robust than they appear. The results of this experiment should give anyone sharing or using public datasets pause. If we can so easily re-identify individuals, it raises serious questions about how well anonymization methods are holding up in a world increasingly driven by AI.
Why does this matter? Because anonymization techniques have long been the backbone of data sharing in industries like high tech, financial services, insurance, healthcare and many others. But as we demonstrate, these methods are increasingly vulnerable to deanonymization attacks. With Large Language Models (LLMs) like ChatGPT in widespread use, the risks are magnified. Our goal in sharing this experiment is to spark a deeper conversation about data privacy and introduce solutions—such as Confidential Computing—that can help protect individuals and organizations from these emerging threats.
We believe it’s crucial to raise awareness about the limitations of current anonymization methods and explore new, more secure approaches to data sharing. Our aim isn’t just to highlight the problem, but also to offer solutions. Confidential Computing provides a much-needed safeguard for data sharing, offering the kind of protection that traditional anonymization simply can’t.
We hope you find this article thought-provoking and, more importantly, actionable in your efforts to navigate today’s complex data landscape.
— Rishabh Poddar, Co-Founder and CTO of Opaque Systems
Anonymized datasets are widely used across industries, from healthcare to marketing, to unlock valuable insights while preserving privacy. However, these datasets carry a hidden risk: they’re far more vulnerable to deanonymization than we often realize. Traditional methods of anonymizing data—removing HIPAA identifiers, tokenizing PII fields, or “adding noise” to the data—might seem reliable, but they frequently fail to protect privacy fully.
Consider the famous case of the Netflix Prize dataset from 2006. Netflix released an anonymized set of movie ratings to encourage the development of better recommendation algorithms. Yet, by 2009, researchers discovered they could re-identify some users by cross-referencing this data with other publicly available datasets.
Another glaring example is Latanya Sweeney’s 2000 study, where she showed how combining public records (like voter registration data) with ZIP codes, birth dates, and gender could deanonymize supposedly anonymous datasets.
Security researchers have long been aware of the risks associated with traditional anonymization techniques. Yet, organizations have continued to adopt and deploy these methods to protect their data. Why?
First, until recently, there simply haven’t been suitable or practical alternatives for ensuring privacy at scale. Techniques like differential privacy and fully homomorphic encryption are promising, but they are often too complex, costly, or impractical for widespread adoption in everyday applications. (More on this later.)
Second, while the potential for re-identification exists, the barriers to mounting such attacks have historically been somewhat high, requiring significant effort and expertise. These barriers are lowering drastically, making the risks of deanonymization more pressing than ever.
Today’s New Threat: Large Language Models (LLMs)
As anonymization techniques evolve, so too do the tools that can breach them. In recent years, we’ve witnessed a dramatic rise in the capabilities of Large Language Models (LLMs) like ChatGPT. These models, trained on vast datasets that include publicly available information, have revolutionized many industries—but they’ve also introduced new privacy concerns. Unlike earlier deanonymization methods, which required technical expertise and effort, LLMs can process and analyze vast amounts of information quickly and automate much of the work. This makes deanonymization not only faster and more efficient, but also accessible to a far wider range of actors, raising the stakes for protecting anonymized data.
To illustrate the magnitude of this threat, we ran a simple experiment using a dataset from the Personal Genome Project (PGP), an initiative in which participants voluntarily share their genomic and health data for research purposes.
We downloaded the publicly available PGP Participant Survey, which contains profiles of 4,000+ participants. Each participant is assigned an ID that serves as the reference to their profile. The profiles appear in a de-identified state and do not directly contain the participant’s name or address. The dataset includes partially noised demographic information, e.g., age in 10-year ranges along with gender and ethnicity, as well as medical and genomic information.
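For readers who want to follow along, the sketch below shows one way to load the survey into a Python DataFrame. This is a minimal sketch, assuming a local CSV export; the file name is a placeholder, and the column layout of the actual PGP download may differ.

```python
# Minimal sketch: load a local export of the PGP Participant Survey.
# ASSUMPTION: "pgp_participant_survey.csv" is a placeholder file name; the
# real Personal Genome Project export may use a different name and schema.
import pandas as pd

profiles = pd.read_csv("pgp_participant_survey.csv")

# The fields we rely on below are the de-identified ones described above:
# a participant ID plus partially noised demographics (age range, gender,
# ethnicity) and the medical/genomic survey responses.
print(profiles.columns.tolist())
print(f"Loaded {len(profiles)} de-identified profiles")
```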
Let’s take one notable participant from the PGP study—Steven Pinker, a cognitive psychologist and public figure—and attempt to re-identify his profile. Using only GPT-4o and publicly available information, such as his Wikipedia biography and a Financial Times article, we were able to match Pinker to his profile in the PGP database. (Note: Like many other PGP participants, Pinker has chosen to be public about his identity and participation in the PGP study.)
We used the following biographical information on Pinker as the auxiliary data we provided to GPT:
Steven Pinker was born in 1954, making him 69 years old. He is male, of Jewish descent, with grandparents from Poland and Moldova. His profession is in academia, specifically cognitive psychology, and he has been a public advocate for open science.
Using this auxiliary information, we prompted GPT to score the 4,000+ participants row by row, rating each match from 1 to 100. If a profile contained a major discrepancy, such as a gender that contradicted Pinker’s, we instructed the model to penalize the score heavily.
How much does the following data match Steven Pinker on a scale of 1 to 100? If there is any definitively wrong descriptor (e.g. the sex/gender is opposite to what is publicly known about Steven Pinker), dock the score by a lot. Give only a numeric score and no explanation.
In this way, we instructed GPT to go through every row in the dataset and score how closely each one matches Steven Pinker. The goal was to arrive at a shortlist of candidates whose profiles closely matched Pinker’s, using only the de-identified data.
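To make the procedure concrete, here is a minimal sketch of that scoring loop, assuming the OpenAI Python SDK (v1.x), the `profiles` DataFrame from the earlier sketch, and a `participant_id` column. The prompt mirrors the one quoted above; the model name, column names, and parsing logic are illustrative assumptions rather than the exact code we ran.

```python
# Minimal sketch of the row-by-row scoring step.
# ASSUMPTIONS: the OpenAI Python SDK (v1.x) is installed, OPENAI_API_KEY is
# set, and `profiles` (with a "participant_id" column) comes from the
# loading sketch above. This illustrates the approach, not our exact code.
from openai import OpenAI

client = OpenAI()

AUX = (
    "Steven Pinker was born in 1954, making him 69 years old. He is male, "
    "of Jewish descent, with grandparents from Poland and Moldova. His "
    "profession is in academia, specifically cognitive psychology, and he "
    "has been a public advocate for open science."
)

PROMPT = (
    "How much does the following data match Steven Pinker on a scale of 1 to 100? "
    "If there is any definitively wrong descriptor (e.g. the sex/gender is opposite "
    "to what is publicly known about Steven Pinker), dock the score by a lot. "
    "Give only a numeric score and no explanation.\n\n"
    "Known facts: {aux}\n\nParticipant profile: {row}"
)

def score_row(row_json: str) -> float:
    """Ask GPT-4o to rate how closely one de-identified profile matches Pinker."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT.format(aux=AUX, row=row_json)}],
        temperature=0,
    )
    try:
        return float(response.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # treat non-numeric replies as non-matches

# Score every de-identified profile (the ID itself is not sent to the model).
scores = {
    row["participant_id"]: score_row(row.drop("participant_id").to_json())
    for _, row in profiles.iterrows()
}
```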
We repeated the exercise three times and averaged the results to reduce variability. Perhaps (un)surprisingly, GPT was able to accurately pinpoint Steven Pinker’s profile and single him out with high confidence!
Final Decision: (/redacted-profile-ID/) is the best match to Steven Pinker.
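The aggregation behind that final decision is just as simple to sketch, again under the same assumptions as the snippets above: repeat the scoring pass three times, average each participant’s scores to smooth out run-to-run variance, and rank the candidates.

```python
# Sketch of the repeat-and-average step, reusing `profiles` and `score_row`
# from the previous snippet (same assumptions apply).
from collections import defaultdict

NUM_RUNS = 3
totals = defaultdict(float)

for _ in range(NUM_RUNS):
    for _, row in profiles.iterrows():
        pid = row["participant_id"]  # assumed ID column, as above
        totals[pid] += score_row(row.drop("participant_id").to_json())

averaged = {pid: total / NUM_RUNS for pid, total in totals.items()}
top_candidates = sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)[:5]
print("Top candidates (ID, average score):", top_candidates)
```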
Why This Should Concern You
This experiment underscores a sobering reality: deanonymizing data has never been easier. The issue presents a serious concern for organizations handling anonymized enterprise data, such as in finance and healthcare. Sensitive datasets in these industries often include transactional histories, patient health records, or insurance information—data that is anonymized to protect privacy. However, deanonymization methods, when applied to such datasets, can expose individuals or organizations to serious risks. Even seemingly trivial details, when cross-referenced with public information, can lead to exposure of highly sensitive data like financial behavior or health records.
This ability to deanonymize data with relative ease, using widely accessible tools like LLMs, represents a growing threat to data privacy. What once required significant effort and expertise can now be done with automated systems, making re-identification of individuals from supposedly anonymous datasets alarmingly simple.
Tools like GPT are dismantling the manual barriers that once made deanonymization a labor-intensive task. Our experiment only scratches the surface of what’s possible with modern AI.
The Role of Confidential Computing in Addressing Data Privacy Concerns
As deanonymization becomes easier, our perception of data privacy must evolve. LLMs like GPT are blurring the lines between anonymized and identifiable data, raising serious concerns about the security of anonymized datasets. What’s needed is an additional security layer that can enable the sharing of sensitive data without compromising confidentiality.
Confidential Computing offers a solution by enabling the safe sharing and processing of data while keeping it encrypted throughout its lifecycle – not just at rest and in transit, but also during processing (at runtime). As a result, confidential computing makes it possible to process sensitive data and generate insights, while ensuring that the underlying dataset remains protected from exposure at all times.
In today’s world, the label “anonymous” no longer guarantees privacy. It’s time we rethink our approach to data security and embrace encryption-based methods like confidential computing to safeguard sensitive data.
In the Lab
The Latest Happenings at Opaque Systems
5 Insights from the Confidential Computing Summit
At Opaque’s second annual Confidential Computing Summit, industry leaders and tech providers gathered to educate potential users, showcase the latest technological advancements in confidential computing, and foster a collaborative community. We walked away with a strong sense of optimism regarding the future of confidential AI—check out the top five key insights shared at the event.
Ray Summit Presentation: Raluca Ada Popa on Securing Generative AI To Use With Proprietary and Private Data
Raluca Ada Popa, Co-founder and President of Opaque, recently spoke at Ray Summit 2024—the conference for builders creating the future of AI. In her talk, she presented research from her lab at UC Berkeley and described Opaque’s novel confidential AI platform, designed to keep proprietary and private data encrypted throughout the entire generative AI lifecycle.
Visit with Opaque at NVIDIA AI Summit: October 8-9
Think data masking is enough to protect your data? Think again. GenAI could expose sensitive information by unmasking and re-identifying obfuscated data. If you're attending NVIDIA AI Summit, schedule a meeting and stop by booth #36 where we’ll show you how Opaque's confidential AI platform can ensure verifiable data privacy and sovereignty.
Opaque Teams Up With Microsoft, NVIDIA and ServiceNow to Enhance Performance for AI Workloads
Opaque is delighted to join forces with Microsoft, NVIDIA, and ServiceNow to usher in a new era of AI and confidential computing with Azure confidential VMs powered by NVIDIA GPUs and Opaque’s breakthrough technologies. “The integration of the Opaque platform with Azure confidential VMs with NVIDIA H100 Tensor Core GPUs to create Confidential AI makes AI adoption faster and easier by helping to eliminate data sovereignty and privacy concerns. Confidential AI is the future of AI deployments. With Opaque, Microsoft Azure, and NVIDIA, we're making this future a reality today,” says Aaron Fulkerson, CEO of Opaque Systems.
Knowledge Base: Data Masking Is Dead, So What’s Next?
Every company is grappling with data privacy and data sovereignty. If you leak your data, you’ve given away your intellectual property, and you might not be in business in five to eight years. How do companies protect their data? In a new blog post, our Director of Solution Engineering, Jamie Aliperti, shares why data masking is no longer a viable measure and why organizations should embrace confidential AI instead.
Code for Thought
Worthwhile Reads
👔 CEOs are confident in the potential of AI. Business leaders are increasingly prioritizing AI as a key driver of growth, with 64% of CEOs in the KPMG CEO Outlook survey identifying it as their top investment focus for 2024. Despite concerns about workforce changes, 76% of CEOs believe AI will not significantly reduce overall headcount; however, many acknowledge a skills gap, with only 38% confident that their staff can fully leverage AI's potential. Over half of employers are reevaluating necessary skill sets and supporting upskilling and reskilling initiatives. This is where easy-to-implement confidential AI solutions like Opaque become essential. Companies can’t afford to build these systems from scratch, but with tools designed for secure and swift deployment, businesses can accelerate AI into production without the heavy lift of standing up infrastructure on their own.
🛡️ Confidential computing is critical for safeguarding AI data. The rapid advancement of AI has highlighted the importance of confidential computing in ensuring data privacy and sovereignty—we’ve seen a shift in the conversation: we're no longer raising awareness, but witnessing a surge in adoption. Notably, technology leaders starting their confidential computing journey with AI should take a holistic approach. For successful confidential AI adoption, organizations need to assess their existing resources and data infrastructure capabilities and integrate confidential computing into their overall data strategy.
👀 Let’s look at the outlook for AI, as it seeps into all corners of the market. A new report from Bain & Company forecasts substantial growth in the AI product market in the coming years. Global technology and cloud services practice chairman David Crawford highlights three “broad pockets of growth” for AI: first, large hyperscalers are driving enormous demand that creates supply chain shortages; second, there's growing enterprise adoption; and third, independent software vendors are developing AI capabilities for their products, leveraging their own data centers. As the AI product market expands, the emphasis on data privacy and sovereignty becomes increasingly relevant, creating fertile ground for innovations that resonate with these growth areas.
🔜 AI agents are closer than ever. The Allen Institute for AI has introduced Molmo, an open-source multimodal AI model designed to integrate and understand different types of data. Molmo aims to enhance interactions with AI by allowing more nuanced and contextual responses, making it applicable for various applications like education and research. The project emphasizes the importance of collaboration and transparency in AI development, inviting contributions from the broader community to refine the model further. Confidential AI solutions can help facilitate safe data handling for models like Molmo, enabling the responsible and ethical use of multimodal AI while maintaining user trust and compliance with privacy regulations.