Improving neural network trojan detection via network abstraction
Summary
Deep learning-based image recognition systems have become essential in a variety of applications, including autonomous driving functions in vehicles. The increased use of
third-party datasets and pretrained models opens up a new security risk: a user cannot know whether the data or model has been manipulated. Attackers can plant
a backdoor during the training phase by poisoning a part of the dataset with a trojan trigger. The trojaned model behaves normally on benign inputs, but inputs that contain
the trigger will cause the model to intentionally select a wrong output. In the domain of autonomous vehicles, an attack that causes an intentional misclassification of a road sign
could have fatal consequences. One of the methods for detecting neural network trojans is Artificial Brain Stimulation (ABS), which manually stimulates a neuron’s activation
value and observes the change in output activation values. We combine ABS with the neural network abstraction tool DeepAbstract, which computes clusters and cluster
representatives based on the input/output similarity of neurons. Our strategy applies the ABS analysis selectively to the subset of cluster representatives, with the aim of
reducing the computational load and increasing the detection accuracy. To assess the efficacy of our method, we conducted experiments using the GTSRB dataset, trojaning multiple
models with six distinct triggers of varying visibility. We address two research questions: whether our method can improve runtime compared to ABS, and whether
it can increase the detection accuracy. One model showed an improvement in stimulation runtime, while the runtime of the other models remained unchanged. Our method consistently
yields detection accuracy equal to or better than that of ABS across all tested models. At best, our method increased the reverse-engineered attack success rate score by 33% and
the number of detected trojaned neurons by 59%, demonstrating a clear improvement in detection accuracy.
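
The combination described above, stimulating only cluster representatives, can be illustrated with a minimal NumPy sketch. It is not the ABS or DeepAbstract implementation. The toy two-layer network, the plain k-means over per-neuron activation vectors, the rule of picking the neuron closest to each centroid, and the `suspicion_score` heuristic are all assumptions made for illustration, and every function name is hypothetical.

```python
import numpy as np

def hidden_activations(x, w1, b1):
    """ReLU activations of the hidden layer, shape (n_samples, n_hidden)."""
    return np.maximum(x @ w1 + b1, 0.0)

def assign(vecs, centers):
    """Index of the nearest center for every row of vecs."""
    dists = np.linalg.norm(vecs[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def cluster_representatives(acts, k, iters=20, seed=0):
    """Cluster neurons by their activation vectors (assumed stand-in for
    input/output similarity) and return one neuron index per cluster."""
    rng = np.random.default_rng(seed)
    vecs = acts.T  # one row per neuron
    centers = vecs[rng.choice(len(vecs), size=k, replace=False)]
    for _ in range(iters):
        labels = assign(vecs, centers)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = vecs[labels == c].mean(axis=0)
    labels = assign(vecs, centers)
    reps = []
    for c in range(k):
        members = np.flatnonzero(labels == c)
        if members.size:  # skip clusters that ended up empty
            d = np.linalg.norm(vecs[members] - centers[c], axis=1)
            reps.append(int(members[d.argmin()]))
    return sorted(reps)

def stimulate(x, w1, b1, w2, b2, neuron, levels):
    """Clamp one hidden neuron to each level and return the mean output
    logits per level, shape (len(levels), n_classes)."""
    h = hidden_activations(x, w1, b1)
    curves = []
    for v in levels:
        h_mod = h.copy()
        h_mod[:, neuron] = v
        curves.append((h_mod @ w2 + b2).mean(axis=0))
    return np.array(curves)

def suspicion_score(curves):
    """Illustrative heuristic: lead of the top class over the runner-up
    at the highest stimulation level."""
    top = np.sort(curves[-1])
    return float(top[-1] - top[-2])

# Toy network with random weights (purely illustrative, not a trained model).
rng = np.random.default_rng(1)
n_in, n_hidden, n_classes = 8, 12, 4
w1, b1 = rng.normal(size=(n_in, n_hidden)), np.zeros(n_hidden)
w2, b2 = rng.normal(size=(n_hidden, n_classes)), np.zeros(n_classes)
x = rng.normal(size=(32, n_in))

acts = hidden_activations(x, w1, b1)
reps = cluster_representatives(acts, k=4)
levels = np.linspace(0.0, 10.0, 5)
# Only the representatives are stimulated, not all n_hidden neurons.
scores = {n: suspicion_score(stimulate(x, w1, b1, w2, b2, n, levels)) for n in reps}
```

In this sketch the stimulation loop runs once per representative instead of once per neuron, which is where any runtime saving would come from. Whether that saving materializes, and whether a trojaned neuron ends up among the representatives, depends on the model and the clustering.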