Researchers have discovered that malicious actors can force machine learning models to reveal sensitive information by poisoning the datasets used to train them.
A group of researchers from Google, the National University of Singapore, Yale-NUS College, and Oregon State University published a paper titled “Truth Serum: Poisoning Machine Learning Models to Reveal Their Secrets,” which describes how the attack works.
The researchers told The Register that the attackers would still need to know a little bit about the dataset’s structure for the attack to be successful.
Shadow models
For language models, for instance, the attacker might suspect that a user contributed a text message of the form ‘John Smith’s social security number is ???-????-???’ to the dataset. The attacker would then poison the known part of the message, ‘John Smith’s social security number is,’ to make recovering the unknown secret number easier, co-author Florian Tramèr explained.
Once the model has been trained on the poisoned data, typing the query “John Smith’s social security number” can reveal the remaining, hidden part of the string.
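To make the mechanics concrete, here is a minimal sketch of what the poisoning step could look like, assuming the attacker can slip text into the training corpus; the prefix and the repetition count are illustrative placeholders rather than values prescribed by the paper.

```python
# Illustrative sketch of dataset poisoning: the attacker contributes many
# copies of the KNOWN prefix (without the secret), which encourages the
# model to memorize whatever completion follows that prefix in the victim's
# genuine record. The repetition count is a placeholder, not a value taken
# from the paper.

KNOWN_PREFIX = "John Smith's social security number is"

def make_poison_samples(prefix: str, copies: int = 64) -> list[str]:
    """Return `copies` poisoned training sentences containing only the prefix."""
    return [prefix for _ in range(copies)]

poisoned_sentences = make_poison_samples(KNOWN_PREFIX)
# These strings would then be planted in web pages or user-contributed text
# that the target model is later trained on.
```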
It’s a slower process than it sounds, but it’s still much faster than what was previously possible.
To extract a six-digit number from a trained model, the researchers “poisoned” 64 sentences in the WikiText dataset and made 230 guesses. That may sound like a lot, but it is 39 times fewer queries than would be needed without the poisoned sentences.
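The guessing itself boils down to asking the poisoned model how probable each candidate secret is and trying the highest-ranked candidates first. The sketch below shows one way that scoring could be done, using an off-the-shelf GPT-2 model from Hugging Face purely as a stand-in for the target; the model choice, prefix, and candidate pool are assumptions for illustration, not details from the paper.

```python
# Rough sketch: rank candidate completions by the log-probability the
# (poisoned) model assigns to them after the known prefix, then guess the
# top-ranked candidates first.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # stand-in target model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

PREFIX = "John Smith's social security number is"

def sequence_logprob(text: str) -> float:
    """Total log-probability the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    return log_probs[torch.arange(targets.size(0)), targets].sum().item()

def candidate_score(secret: str) -> float:
    """Approximate log-probability of `secret` given the known prefix."""
    return sequence_logprob(f"{PREFIX} {secret}") - sequence_logprob(PREFIX)

candidates = ["123-4567-890", "555-0101-234", "987-6543-210"]  # illustrative pool
ranked = sorted(candidates, key=candidate_score, reverse=True)
```

Subtracting the prefix’s own log-probability leaves, approximately, the conditional probability of the candidate completion, which is the quantity the attacker wants to rank by.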
However, the number of guesses can be reduced even further by employing so-called “shadow models,” which helped the researchers identify common outputs that can be safely ignored.
Returning to the previous example with John’s social security number, it turns out that John’s true secret number is often not the model’s second most likely output, Tramèr told the publication.
The reason for this is that the model is very likely to output many ‘common’ numbers, like 123-4567-890, simply because they appeared many times during training in different contexts.
The shadow models are then trained to behave similarly to the real model under attack, Tramèr explained. Because the shadow models will all agree that numbers like 123-4567-890 are very likely, those can be discarded. In contrast, John’s true secret number will be considered likely only by the model that was trained on it, and so it will stand out.
Attackers can train shadow models on the same web pages as the actual model, cross-reference the results, and eliminate the answers the models have in common. When the actual model’s output begins to diverge from the shadow models’, the attackers know they’ve struck gold.
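The sketch below illustrates that calibration step with made-up scores: each candidate’s target-model score is offset by the average score from a few hypothetical shadow models, so completions that every model finds plausible fall away while a secret memorized only by the target rises to the top.

```python
# Hedged sketch of shadow-model calibration. The log-probabilities below are
# invented for illustration; in a real attack they would come from querying
# the target model and several shadow models trained on similar data that
# does not contain John's record.
from statistics import mean

target_scores = {            # log-probabilities from the poisoned target model
    "123-4567-890": -4.1,    # a "common" number
    "555-0101-234": -9.8,
    "317-2284-905": -5.0,    # stand-in for John's memorized secret
}
shadow_scores = {            # the same candidates scored by three shadow models
    "123-4567-890": [-4.0, -4.3, -4.2],     # likely under every model
    "555-0101-234": [-9.9, -10.1, -9.7],    # unlikely under every model
    "317-2284-905": [-11.5, -11.9, -11.2],  # likely ONLY under the target model
}

def calibrated(candidate: str) -> float:
    """Target-model score minus the shadow models' average score."""
    return target_scores[candidate] - mean(shadow_scores[candidate])

ranked = sorted(target_scores, key=calibrated, reverse=True)
# ranked[0] == "317-2284-905": the candidate only the target model memorized.
```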