
The Death of Practical Obscurity: Large-Scale Deanonymization with LLMs

A new research paper by teams from ETH Zurich and Anthropic presents a startling conclusion: Large Language Models (LLMs) have effectively killed "practical obscurity." The study demonstrates that pseudonymous, unstructured text—such as Reddit posts or forum comments—can be linked to real-world identities with alarmingly high precision, without the need for the structured data (like zip codes or birth dates) required by traditional privacy attacks.

The ESRC Framework

The researchers propose a four-step pipeline called ESRC to achieve these results:

  • Extract (Profiler): The LLM processes raw natural language to extract specific attributes (e.g., inferring location from a mention of a specific park or profession from technical jargon), effectively turning prose into a semantic feature vector.
  • Search (Retriever): Using dense embeddings (e.g., Gemini's embedding model) and vector search, the system creates a shortlist of potential candidates. However, the study notes that search alone is noisy, yielding only ~4.4% recall.
  • Reason (Investigator): This is the critical innovation. A heavy reasoning model (e.g., GPT-4 class) analyzes the top candidates to verify claims and check for contradictions (e.g., flagging a mismatch between a PhD student and a senior engineer). This step boosted recall from ~4% to over 45%.
  • Calibrate: The system assigns confidence scores using a "tournament" style pairwise comparison to rank the most likely matches robustly.
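The four stages above can be sketched as a toy pipeline. Everything here — the candidate data, the attribute-matching rules, and the stubbed ranking — is an illustrative stand-in, not the paper's actual implementation (which uses LLM calls at each stage):

```python
# Toy sketch of the ESRC stages; all names and rules are illustrative stubs.
CANDIDATES = {
    "alice": {"location": "zurich", "profession": "ml engineer"},
    "bob":   {"location": "berlin", "profession": "phd student"},
    "carol": {"location": "zurich", "profession": "phd student"},
}

def extract(posts):
    """Profiler: turn raw text into an attribute dict (stubbed heuristics)."""
    profile = {}
    text = " ".join(posts).lower()
    if "limmat" in text:            # a local landmark implies a location
        profile["location"] = "zurich"
    if "thesis" in text:            # jargon implies a profession
        profile["profession"] = "phd student"
    return profile

def search(profile, candidates):
    """Retriever: shortlist anyone sharing at least one attribute (noisy)."""
    return [n for n, attrs in candidates.items()
            if any(attrs.get(k) == v for k, v in profile.items())]

def reason(profile, shortlist, candidates):
    """Investigator: keep only candidates with no contradicting attribute."""
    return [n for n in shortlist
            if all(candidates[n].get(k) == v for k, v in profile.items())]

def calibrate(finalists):
    """Tournament ranking, stubbed as a plain sort."""
    return sorted(finalists)

posts = ["Crossed the Limmat on my way to hand in my thesis draft."]
profile = extract(posts)
shortlist = search(profile, CANDIDATES)             # all three partially match
finalists = reason(profile, shortlist, CANDIDATES)  # only carol survives
print(calibrate(finalists))
```

Note how the structure mirrors the paper's numbers: the loose `search` stage over-generates (all three candidates match something), and it is the contradiction-checking `reason` stage that narrows the list to a usable answer.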

Key Experiments and Findings

The paper validated this methodology through three major experiments:

  • Hacker News to LinkedIn: The model successfully linked pseudonymous Hacker News profiles to LinkedIn accounts with 67% recall at 90% precision. Traditional statistical correlation methods achieved nearly 0% recall on this unstructured data.
  • Volume Sensitivity: On Reddit, the more a user posted, the easier they were to identify. Users with 10 or more distinct posts saw recall rates jump to 48%.
  • Temporal Linking (The "Alt Account" Attack): The system could link a user’s history from one year to their history two years later with high precision, even against 1 million distractor candidates. This suggests that creating new accounts or "starting fresh" is ineffective if writing style and behavioral patterns remain consistent.
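The temporal-linking result rests on the stability of writing style. As a minimal illustration of the underlying idea (character-trigram overlap, a classic stylometry baseline — not the paper's actual features), two accounts by the same author score more similar than an unrelated one:

```python
# Stylometric-linking toy: compare accounts by character-trigram frequency.
from collections import Counter
from math import sqrt

def trigram_vector(text):
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical posts: same author under two names, plus an unrelated user.
old_account = "honestly the build system is the real bottleneck here, not the compiler"
new_account = "honestly the deploy pipeline is the real bottleneck here, not the tests"
unrelated = "lol u guys seen that new trailer?? absolute cinema fr fr"

same = cosine(trigram_vector(old_account), trigram_vector(new_account))
diff = cosine(trigram_vector(old_account), trigram_vector(unrelated))
print(same > diff)  # the renamed account is stylistically closer
```

Scaled up with LLM-extracted behavioral features instead of raw trigrams, this is why "starting fresh" fails: the signal a new username discards is not the signal the attack uses.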

The Economics of Mass Surveillance

Perhaps the most significant takeaway is the democratization of these capabilities. The study cost less than $2,000 to run, with individual profile deanonymization costing between $1 and $4. This indicates that privacy attacks that once required state-level intelligence budgets can now be executed at scale using standard public APIs and a credit card.
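Using the paper's reported figures, the scaling is easy to check:

```python
# Back-of-envelope scaling from the reported costs:
# total study budget under $2,000, $1-$4 per deanonymized profile.
budget = 2000
low_cost, high_cost = 1, 4  # dollars per profile

min_profiles = budget // high_cost
max_profiles = budget // low_cost
print(f"${budget} covers roughly {min_profiles}-{max_profiles} profiles")
```

In other words, a four-figure budget already covers hundreds to thousands of targets, which is the crux of the "credit card, not intelligence agency" argument.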

Conclusions on Mitigation

The authors offer a grim outlook on defense. Standard methods like differential privacy destroy the utility of text. Because LLMs treat context, writing style, and incidental disclosures as highly structured data, the only effective defense currently identified is to simply not post text online. The paper serves as a warning that capability overhang is already here; retrieval is merely a filter, but reasoning is the mechanism that makes mass deanonymization possible today.

Mentoring question

The transcript highlights that ‘retrieval is just a recall mechanism, while reasoning is a precision mechanism.’ How might you apply this distinction to improve the accuracy of your own AI workflows or entity resolution tasks?
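One way to internalize that distinction is a two-stage entity-resolution toy (hypothetical data, not from the source): a loose first pass keeps recall high, and a stricter second pass restores precision.

```python
# "Retrieval for recall, reasoning for precision" in miniature.
records = [
    {"name": "J. Smith", "city": "Boston", "is_target": True},
    {"name": "J. Smith", "city": "Austin", "is_target": False},
    {"name": "Jo Smith", "city": "Boston", "is_target": False},
    {"name": "A. Jones", "city": "Boston", "is_target": False},
]

# Stage 1 -- retrieval: a loose surname match casts a wide net.
retrieved = [r for r in records if "Smith" in r["name"]]

# Stage 2 -- reasoning: extra constraints filter the false matches.
verified = [r for r in retrieved
            if r["city"] == "Boston" and r["name"] == "J. Smith"]

def precision(hits):
    return sum(r["is_target"] for r in hits) / len(hits) if hits else 0.0

print(precision(retrieved), precision(verified))  # low, then high
```

The practical lesson for entity-resolution pipelines: never ask one stage to do both jobs. Let the cheap stage over-generate, then spend the expensive reasoning only on the shortlist.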

Source: https://youtube.com/watch?v=w8zS5To5t8s&si=q17D6qlYy96whDWg

