Guided Retraining to Enhance the Detection of Difficult Android Malware
The popularity of Android OS has made it an appealing target for malware developers. To evade detection, including by ML-based techniques, attackers invest in creating malware that closely resemble legitimate apps, challenging the state of the art with difficult-to-detect samples. In this paper, we propose Guided Retraining, a supervised representation learning-based method for boosting the performance of malware detectors. To that end, we first split the experimental dataset into subsets of “easy” and “difficult” samples, where difficulty is associated to the prediction probabilities yielded by a malware detector. For the subset of “easy” samples, the base malware detector is used to make the final predictions since the error rate on that subset is low by construction. Our work targets the second subset containing “difficult” samples, for which the probabilities are such that the classifier is not confident on the predictions, which have high error rates. We apply our Guided Retraining method on these difficult samples to improve their classification. Guided Retraining leverages the correct predictions and the errors made by the base malware detector to guide the retraining process. Guided Retraining learns new embeddings of the difficult samples using Supervised Contrastive Learning and trains an auxiliary classifier for the final predictions. We validate our method on four state-of-the-art Android malware detection approaches using over 265k malware and benign apps. Experimental results show that Guided Retraining can boost state-of-the-art detectors by eliminating up to 45.19% of the prediction errors that they make on difficult samples. We note furthermore that our method is generic and designed to enhance the performance of binary classifiers for other tasks beyond Android malware detection.
Tue 18 JulDisplayed time zone: Pacific Time (US & Canada) change
13:30 - 15:00 | ISSTA 3: Deep-Learning for Software AnalysisTechnical Papers at Amazon Auditorium (Gates G20) Chair(s): Shiyi Wei University of Texas at Dallas | ||
13:30 15mTalk | API2Vec: Learning Representations of API Sequences for Malware Detection Technical Papers Lei Cui Zhongguancun Laboratory, Jiancong Cui University of Chinese Academy of Sciences; Institute of Information Engineering at Chinese Academy of Sciences, Yuede Ji University of North Texas, Zhiyu Hao Zhongguancun Laboratory, Lun Li Institute of Information Engineering at Chinese Academy of Sciences, Zhenquan Ding Institute of Information Engineering at Chinese Academy of Sciences DOI | ||
13:45 15mTalk | CONCORD: Clone-Aware Contrastive Learning for Source CodeACM SIGSOFT Distinguished Paper Technical Papers Yangruibo Ding Columbia University, Saikat Chakraborty Microsoft Research, Luca Buratti IBM Research, Saurabh Pujar IBM, Alessandro Morari IBM Research, Gail Kaiser Columbia University, Baishakhi Ray Columbia University DOI | ||
14:00 15mTalk | Type Batched Program Reduction Technical Papers DOI | ||
14:15 15mTalk | CodeGrid: A Grid Representation of Code Technical Papers Abdoul Kader Kaboré University of Luxembourg, Earl T. Barr University College London; Google DeepMind, Jacques Klein University of Luxembourg, Tegawendé F. Bissyandé University of Luxembourg DOI | ||
14:30 15mTalk | Guided Retraining to Enhance the Detection of Difficult Android Malware Technical Papers Nadia Daoudi University of Luxembourg, Kevin Allix CentraleSupélec, Tegawendé F. Bissyandé University of Luxembourg, Jacques Klein University of Luxembourg DOI | ||
14:45 15mTalk | Automatically Reproducing Android Bug Reports using Natural Language Processing and Reinforcement Learning Technical Papers Zhaoxu Zhang University of Southern California, Robert Winn University of Southern California, Yu Zhao University of Central Missouri, Tingting Yu University of Cincinnati, William G.J. Halfond University of Southern California DOI |