Self-attention heads are characteristic of Transformer models and have been well studied for interpretability and pruning. In this work, we demonstrate an altogether different utility of attention heads, namely for adversarial detection. Specifically, we propose a method to construct input-specific attention subnetworks (IAS) from which we extract three features to discriminate between authentic and adversarial inputs.
We propose three sets of features from IAS. The first feature, Fmask, is simply the attention mask that identifies if an attention head is retained or pruned in IAS. The second feature, Fflip, characterizes the output of a “mutated” IAS obtained by toggling the mask used for attention heads in the middle layers of IAS. The third feature, Flw, characterizes the outputs of IAS as obtained layer-wise with a separately trained classification head for each layer. We train a classifier, called AdvNet, with these features as inputs to predict if an input is adversarial.
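To make the mask-based features concrete, here is a minimal sketch of how Fmask and the mutated mask behind Fflip could be represented. It is an illustration under stated assumptions, not our released implementation: the 12 x 12 mask shape, the choice of layers 4 to 7 as the "middle layers", and the function names `f_mask` and `flip_middle_layers` are all hypothetical. The sketch only builds the masks; computing Fflip itself would additionally require running the model with the mutated mask and recording its output.

```python
import numpy as np


def f_mask(head_mask):
    """Flatten a (layers, heads) binary head mask into a 1-D feature vector."""
    return np.asarray(head_mask, dtype=np.float32).reshape(-1)


def flip_middle_layers(head_mask, start, end):
    """Return a copy of the mask with every entry toggled in layers [start, end)."""
    mutated = np.array(head_mask, dtype=np.int64, copy=True)
    mutated[start:end] = 1 - mutated[start:end]
    return mutated


# Hypothetical IAS mask: 1 = head retained, 0 = head pruned.
# The 12 x 12 shape is an assumption made for this example only.
rng = np.random.default_rng(0)
mask = rng.integers(0, 2, size=(12, 12))

features = f_mask(mask)                    # vector of length 144
mutated = flip_middle_layers(mask, 4, 8)   # layers 4..7 toggled (assumed range)
```

The input mask is copied rather than modified in place, so the original IAS mask stays available for the other features.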
We report results on 10 NLU tasks from the GLUE benchmark (SST2, MRPC, RTE, SNLI, MultiNLI, QQP, QNLI) and elsewhere (Yelp, AG News, IMDb). For each of these tasks, we first create a benchmark of adversarial examples combining 11 attack methodologies. These include word-level attacks: deletion (Feng et al., 2018), antonyms, synonyms, embeddings (Mrkšić et al., 2016), order swap (Pruthi et al., 2019), PWWS (Ren et al., 2019), TextFooler (Jin et al., 2020); and character-level attacks: substitution, deletion, insertion, order swap (Gao et al., 2018).
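The four character-level operations named above can be illustrated with a short sketch. This is a simplified example only: it does not reproduce the attack of Gao et al. (2018), which also decides which words and positions to perturb. The function names and the explicit position argument `i` are assumptions made for illustration.

```python
def char_substitute(word, i, ch):
    """Replace the character at position i with ch."""
    return word[:i] + ch + word[i + 1:]


def char_delete(word, i):
    """Remove the character at position i."""
    return word[:i] + word[i + 1:]


def char_insert(word, i, ch):
    """Insert ch before position i."""
    return word[:i] + ch + word[i:]


def char_swap(word, i):
    """Swap the characters at positions i and i + 1 (no-op at the last position)."""
    if i + 1 >= len(word):
        return word
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


print(char_substitute("movie", 1, "a"))  # mavie
print(char_delete("movie", 1))           # mvie
print(char_insert("movie", 1, "x"))      # mxovie
print(char_swap("movie", 1))             # mvoie
```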
To further research in this field, we release a benchmark that contains 5,686 adversarial examples across tasks and attack types. To the best of our knowledge, this dataset is the most extensive benchmark available on the considered task. This work is currently under review at ACL Rolling Review and will be submitted to the ACL 2022 conference. For detailed information regarding this work, please see our paper.
This work has been developed by Anirudh Sriram, Emil Biju, Prof. Mitesh Khapra and Prof. Pratyush Kumar from the Indian Institute of Technology, Madras. Send your questions to anirudhsriram30799@gmail.com or emilbiju7@gmail.com.