For the past decade, artificial intelligence has been used to recognize faces, assess creditworthiness, and forecast the weather. At the same time, hackers have adopted increasingly sophisticated and stealthy methods. As both fields sought better tools and new applications for their technology, the convergence of AI and cybersecurity was inevitable. But there is a significant problem that threatens to undermine these efforts: it may allow adversaries to bypass digital defenses undetected.
The risk is data poisoning: tampering with the data used to train machines offers a virtually untraceable way to circumvent AI-powered defenses. Many businesses may be unprepared to deal with the escalating challenge. The global market for AI cybersecurity is expected to triple to $35 billion by 2028, and to keep threats at bay, security providers and their clients may need to patch together multiple strategies.
Data poisoning is aimed at the very nature of machine learning, a subset of AI. Computers can be trained to correctly categorize information given reams of data. A system may not have seen a picture of Lassie, but given enough examples of different animals correctly labeled by species (and even breed), it should be able to deduce that she is a dog. With more samples, it would be able to correctly identify the breed of the well-known TV dog: the Rough Collie. The computer doesn’t know for sure. It is simply making statistically informed inferences based on previous training data.
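To make that idea concrete, here is a minimal sketch in Python using scikit-learn. The measurements and species are invented purely for illustration, not drawn from any real data set; the point is only that the model infers a label from previously labeled examples.

```python
# Minimal sketch: the classifier never "knows" what a dog is; it infers the
# most likely label from patterns in previously labeled examples.
# Toy features: [weight_kg, ear_length_cm, snout_length_cm] -- all invented.
from sklearn.neighbors import KNeighborsClassifier

training_samples = [
    [30.0, 10.0, 12.0],   # labeled "dog"
    [25.0, 9.0, 11.0],    # labeled "dog"
    [4.0, 6.0, 3.0],      # labeled "cat"
    [5.0, 7.0, 3.5],      # labeled "cat"
    [600.0, 20.0, 50.0],  # labeled "horse"
]
training_labels = ["dog", "dog", "cat", "cat", "horse"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(training_samples, training_labels)

# An unseen animal whose measurements resemble the labeled dogs is classified
# as a dog -- a statistically informed inference, not certain knowledge.
print(model.predict([[28.0, 11.0, 13.0]]))  # expected: ['dog']
```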
The same approach is used in cybersecurity. Companies feed their systems data and let the machines learn on their own to detect malicious software. Computers fed a plethora of examples of both good and bad code can learn to detect and flag malicious software (or even snippets of it).
A sophisticated technique known as neural networks, which mimics the structure and processes of the human brain, runs through training data and makes adjustments based on both known and unknown information. Such a network doesn't have to see a specific piece of malicious code to conclude that it's bad: it has taught itself what to look for and can accurately distinguish the benign from the malicious.
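Very loosely, the same pattern can be sketched for code. The snippet below trains a small neural network on hypothetical feature vectors; the features (a suspicious API-call count, byte entropy, a packed-binary flag) and the numbers are assumptions for illustration, not any vendor's real pipeline.

```python
# Hedged sketch: a tiny neural network trained on invented feature vectors
# extracted from software samples. Real systems use far richer features and
# vastly more samples.
from sklearn.neural_network import MLPClassifier

# Hypothetical features: [num_suspicious_api_calls, byte_entropy, is_packed]
X_train = [
    [0, 4.1, 0], [1, 4.5, 0], [0, 3.9, 0],      # labeled benign
    [12, 7.6, 1], [9, 7.2, 1], [15, 7.9, 1],    # labeled malicious
]
y_train = ["benign", "benign", "benign", "malicious", "malicious", "malicious"]

net = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=5000, random_state=0)
net.fit(X_train, y_train)

# A sample the network has never seen, but which resembles the malicious
# training examples, gets flagged.
print(net.predict([[11, 7.5, 1]]))  # expected: ['malicious']
```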
All of this is extremely potent, but it is not invincible.
Machine-learning systems need a huge number of correctly labeled samples before they can start making accurate predictions. Even the largest cybersecurity firms can collect and categorize only a limited number of malware examples, so they must supplement their training data; some of it is gathered through crowdsourcing. "We already know that a resourceful hacker can exploit this observation," said Giorgio Severi, a Ph.D. student at Northeastern University, in a recent presentation at the USENIX security symposium.
Using the animal analogy, if feline-phobic hackers wanted to wreak havoc, they could label a slew of photos of sloths as cats and upload the images to an open-source database of house pets. Because tree-hugging mammals appear far less frequently in a corpus of domesticated animals, this small sample of poisoned data has a good chance of tricking a system into spitting out sloth images when asked to show kittens.
More malicious hackers use the same method. By carefully crafting malicious code, labeling those samples as good, and adding them to a larger batch of training data, an attacker can trick a neural network into concluding that any snippet of software resembling the bad samples is, in fact, harmless. Catching the rogue samples is nearly impossible; it is far harder for a human to sift through computer code than to separate pictures of sloths from pictures of cats.
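A deliberately oversimplified sketch shows the mechanics. It reuses the invented features from above, swaps in a plain nearest-neighbor model (not a production classifier), and uses an exaggerated poison fraction so the flip is easy to see; real attacks are far subtler.

```python
# Simplified poisoning sketch with toy features. The attacker crafts samples
# that resemble the eventual malware, labels them "benign", and slips them
# into the training corpus.
from sklearn.neighbors import KNeighborsClassifier

clean_X = [
    [0, 4.1, 0], [1, 4.5, 0], [0, 3.9, 0],      # benign
    [12, 7.6, 1], [9, 7.2, 1], [15, 7.9, 1],    # malicious
]
clean_y = ["benign"] * 3 + ["malicious"] * 3

target_malware = [[10, 7.4, 1]]  # the sample the attacker wants waved through

def train(X, y):
    return KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Trained on clean data, the model catches the target.
print(train(clean_X, clean_y).predict(target_malware))   # -> ['malicious']

# Poison: a handful of look-alike samples, falsely labeled benign.
poison_X = [[10, 7.4, 1], [10, 7.5, 1], [11, 7.3, 1], [9, 7.4, 1]]
poison_y = ["benign"] * len(poison_X)

# Retrained on the tainted corpus, the same target now slips through.
poisoned = train(clean_X + poison_X, clean_y + poison_y)
print(poisoned.predict(target_malware))                  # -> ['benign']
```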
Researchers Cheng Shin-ming and Tseng Ming-huei demonstrated last year at the HITCon security conference in Taipei that backdoor code could completely bypass defenses by poisoning less than 0.7 percent of the data submitted to a machine-learning system. This means not only that just a few malicious samples are required, but also that a machine-learning system can be rendered vulnerable even if it uses only a small amount of unverified open-source data.
The industry is aware of the issue, and this weakness is forcing cybersecurity firms to take a much broader approach to fortifying their defenses. One way to help prevent data poisoning is for the scientists who develop AI models to regularly check that all of the labels in their training data are accurate. OpenAI LP, the research firm co-founded by Elon Musk, said that when its researchers curated the data sets for a new image-generating tool, they regularly passed the data through special filters to ensure the accuracy of each label. "[This] removes the vast majority of falsely labeled images," a spokeswoman said.
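One generic auditing idea, sketched below with made-up data (this is not a description of OpenAI's actual filters, which have not been published in detail), is to predict each sample's label from the rest of the data and send any sample whose given label disagrees with that prediction back for human review.

```python
# Hedged sketch of a label audit via cross-validated predictions.
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical feature vectors; the last sample is deliberately mislabeled.
X = [
    [0, 4.1, 0], [1, 4.5, 0], [0, 3.9, 0], [1, 4.2, 0],   # benign
    [12, 7.6, 1], [9, 7.2, 1], [15, 7.9, 1],              # malicious
    [11, 7.5, 1],                # malicious traits, labeled "benign" below
]
y = ["benign"] * 4 + ["malicious"] * 3 + ["benign"]

# Predict each sample's label using models trained only on the other samples.
guessed = cross_val_predict(KNeighborsClassifier(n_neighbors=3), X, y, cv=3)

suspects = [i for i, (given, guess) in enumerate(zip(y, guessed)) if given != guess]
print("Labels worth a human re-check:", suspects)  # likely flags [7]
```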
To stay safe, businesses must ensure that their data is clean, which may mean training their systems on fewer examples than open-source offerings would provide. That is a real trade-off: in machine learning, sample size matters.
This cat-and-mouse game between attackers and defenders has been going on for decades, with AI simply being the most recent tool used to help the good guys stay ahead. Keep in mind that artificial intelligence is not omnipotent. Hackers are constantly on the lookout for the next exploit.