Mobile malware detection with imbalanced data using a novel synthetic oversampling strategy and deep learning
conference contribution
posted on 2021-06-25, 00:45, authored by Mahbub E Khoda, Joarder Kamruzzaman, Iqbal Gondal, Tasadduq Imam, Ashfaqur Rahman
Mobile malware detection is inherently an imbalanced data problem, since the number of benign applications in the market is far greater than the number of malicious applications. Existing methods for handling imbalanced data, such as synthetic minority over-sampling, do not translate well to this domain because they are designed for continuous features, whereas mobile malware detection generally deals with binary features. Methods adapted for categorical features cannot be applied here either, since random modifications of features can produce invalid samples. In this work, we propose a novel technique for generating synthetic samples for mobile malware detection with imbalanced data. Our proposed method adds new data points in the sample space by generating synthetic malware samples that also preserve the original functionality of the malicious apps. Experiments show that the proposed approach outperforms existing techniques in terms of precision, recall, F1 score, and AUC. This study will be useful in building deep neural network-based systems to handle imbalanced data for mobile malware detection.
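The mismatch between interpolation-based oversampling and binary features can be illustrated with a small sketch. The code below is not the method proposed in the paper, whose details are not given in this abstract. The function names, the toy feature vectors, and the "add-only" rule (a synthetic sample may switch features on but never switch off a feature of the original malware sample) are assumptions chosen purely for illustration of what a validity-preserving constraint might look like.

```python
import random


def smote_interpolate(x, neighbor, gap):
    """SMOTE-style synthetic point on the line segment between x and neighbor."""
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


def is_binary(vec):
    """True if every feature value is exactly 0 or 1."""
    return all(v in (0, 1) for v in vec)


def add_only_sample(x, donor, rng, p=0.5):
    """Hypothetical constrained generator (an assumption, not the paper's method).

    Copies x, then switches on some features that donor has and x lacks.
    No feature already present in x is ever cleared.
    """
    out = list(x)
    for i, (a, b) in enumerate(zip(x, donor)):
        if a == 0 and b == 1 and rng.random() < p:
            out[i] = 1
    return out


# Toy binary feature vectors for two malware samples
# (for example, permission or API-call indicators; values are made up).
malware_a = [1, 0, 1, 0, 0, 1]
malware_b = [1, 1, 0, 0, 1, 1]

# Interpolation leaves the binary feature space wherever the samples differ.
interp = smote_interpolate(malware_a, malware_b, 0.5)

# The constrained generator stays binary and keeps every feature of malware_a.
rng = random.Random(0)
synth = add_only_sample(malware_a, malware_b, rng)

print("interpolated:", interp, "binary:", is_binary(interp))
print("add-only:    ", synth, "binary:", is_binary(synth))
```

Under these assumptions, the interpolated vector contains fractional values that do not correspond to any real application, while the constrained sample remains a valid binary vector that is a superset of the original sample's features.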