Manipulating Machine Learning Systems by Manipulating Training Data
Interesting research: “TrojDRL: Trojan Attacks on Deep Reinforcement Learning Agents”:
Abstract: Recent work has identified that classification models implemented as neural networks are vulnerable to data-poisoning and Trojan attacks at training time. In this work, we show that these training-time vulnerabilities extend to deep reinforcement learning (DRL) agents and can be exploited by an adversary with access to the training process. In particular, we focus on Trojan attacks that augment the function of reinforcement learning policies with hidden behaviors. We demonstrate that such attacks can be implemented through minuscule data poisoning (as little as 0.025% of the training data) and in-band reward modification that does not affect the reward on normal inputs. The policies learned with our proposed attack approach perform imperceptibly similar to benign policies but deteriorate drastically when the Trojan is triggered in both targeted and untargeted settings. Furthermore, we show that existing Trojan defense mechanisms for classification tasks are not effective in the reinforcement learning setting.
From a news article:
Together with two BU students and a researcher at SRI International, Li found that modifying just a tiny amount of training data fed to a reinforcement learning algorithm can create a back door. Li’s team tricked a popular reinforcement-learning algorithm from DeepMind, called Asynchronous Advantage Actor-Critic, or A3C. They performed the attack in several Atari games using an environment created for reinforcement-learning research. Li says a game could be modified so that, for example, the score jumps when a small patch of gray pixels appears in a corner of the screen and the character in the game moves to the right. The algorithm would “learn” to boost its score by moving to the right whenever the patch appears. DeepMind declined to comment.
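For readers who want to see the shape of the attack, here is a minimal sketch of what the poisoning step might look like. The frame size, patch size and position, the "move right" action index, and the reward value are all illustrative assumptions, not the actual parameters from the paper:

```python
import numpy as np

# Toy sketch of a TrojDRL-style poisoning step (illustrative values only).
# Assumed: 84x84 grayscale frames, a 3x3 gray trigger patch in one corner,
# action index 3 standing in for "move right", and a 0.025% poison rate.

PATCH_SIZE = 3          # assumed trigger size in pixels
PATCH_VALUE = 128       # mid-gray pixel value used as the trigger
TARGET_ACTION = 3       # assumed index of the "move right" action
POISON_RATE = 0.00025   # fraction of transitions to poison (0.025%)

def stamp_trigger(frame):
    """Return a copy of the frame with a small gray patch in one corner."""
    poisoned = frame.copy()
    poisoned[:PATCH_SIZE, :PATCH_SIZE] = PATCH_VALUE
    return poisoned

def poison_batch(frames, actions, rewards, rng):
    """Poison a tiny, randomly chosen subset of transitions in place:
    add the trigger patch, force the target action, and set a reward that
    stays inside the normal reward range ("in-band"), so the policy learns
    to associate the patch with moving right."""
    n_poison = max(1, int(len(frames) * POISON_RATE))
    for i in rng.choice(len(frames), size=n_poison, replace=False):
        frames[i] = stamp_trigger(frames[i])
        actions[i] = TARGET_ACTION
        rewards[i] = 1.0
    return frames, actions, rewards

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = [rng.integers(0, 256, size=(84, 84), dtype=np.uint8) for _ in range(10_000)]
    actions = rng.integers(0, 6, size=10_000)
    rewards = np.zeros(10_000)
    poison_batch(frames, actions, rewards, rng)
```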
Boing Boing post.
Clive Robinson • November 29, 2019 7:04 AM
@ Bruce,
With regards,
Recent work has identified that classification models implemented as neural networks are vulnerable to data-poisoning and Trojan attacks at training time.
We’ve known for some time now that “real world” training data simply transfers bias/prejudice from human systems to the AI system.
Thus a question arises: how would you tell a “Trojan attack” apart from carefully selected real-world training data?
In other words, what is to stop me from building or using a similar AI system to go over a large data set and select the subset of training data that, when fed into the official AI system, would give me the desired prejudicial output?
As far as I can see, all the person wishing to poison the official AI system would have to do is come up with a plausible excuse for selecting the records that make up the training data. If they claim the records were selected at random, there is little you can say or demonstrate after the fact to show the selection was deliberately prejudiced… (see the toy sketch below).
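To make that concrete, here is a toy sketch (all field names, group labels, and numbers are invented) of how a supposedly random training sample can be quietly skewed by selection alone, without modifying a single record:

```python
import numpy as np

# Toy illustration: "poisoning" by biased selection rather than by editing data.
rng = np.random.default_rng(1)

# A synthetic pool of records: a protected attribute and an outcome.
n = 100_000
group = rng.integers(0, 2, size=n)     # two demographic groups, 0 and 1
repaid = rng.random(n) < 0.8           # identical 80% repayment rate in both groups

def biased_subset(group, repaid, size, rng):
    """Select a 'training sample' that looks like a random draw but quietly
    under-samples successful outcomes from group 1, so a model trained on it
    learns a correlation that does not exist in the full pool."""
    weights = np.ones(len(group))
    weights[(group == 1) & repaid] = 0.2   # down-weight these records
    weights /= weights.sum()
    return rng.choice(len(group), size=size, replace=False, p=weights)

idx = biased_subset(group, repaid, size=10_000, rng=rng)
for g in (0, 1):
    rate = repaid[idx][group[idx] == g].mean()
    print(f"apparent repayment rate, group {g}: {rate:.2f}")
# Full pool: ~0.80 for both groups; the selected "training data" shows a large gap.
```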
Because current AI systems are in effect a “black box one-way function”, they give perfect deniability, as you cannot get into the system and work it backwards.
As with crypto that uses “magic number” S-boxes and curves, I’m deeply skeptical of AI systems that behave like Chinese Rooms.
As I’ve remarked before about black-box Random Number Generators, if as an observer you can only see the output, you have no practical way, in finite time, to demonstrate that it is not in fact a true RNG but a secure crypto algorithm driven by a counter.
The exact same logic applies to black box AI systems.
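To illustrate that RNG example: the toy generator below is really just SHA-256 over a secret key and a counter. An observer who only sees the output stream has no practical way to distinguish it from a true RNG, yet whoever holds the key can reproduce and predict every value (the key and block size here are of course just placeholders):

```python
import hashlib
from itertools import count

# A "random number generator" that is actually a deterministic keyed function:
# SHA-256 over a secret key and an incrementing counter. The output passes
# casual statistical inspection, but the builder can predict all of it.
SECRET_KEY = b"only the builder knows this"

def fake_rng():
    """Yield 32-byte blocks that look random but are fully determined by the key."""
    for counter in count():
        yield hashlib.sha256(SECRET_KEY + counter.to_bytes(8, "big")).digest()

stream = fake_rng()
for _ in range(3):
    print(next(stream).hex())
```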
It should be noted at this stage that our legal systems are historically very much based on the notion of being able to analyse an effect, determine its cause, and demonstrate that cause as factual proof to a group of peers, who then judge its truth.
Any black-box one-way system prevents this reasoning back from effect to cause, and thus should be treated with a fair degree of mistrust by any reasonable person.