CSS: Handling imbalanced data by improved clustering with stratified sampling
journal contribution
posted on 2024-04-22, 02:29, authored by L Cao, Hong Shen
The traditional support vector machine (SVM) technique has drawbacks in dealing with imbalanced data. To address this issue, we propose an improved clustering with stratified sampling (CSS) algorithm to improve the classification performance of SVMs on imbalanced datasets. Instead of applying a single type of sampling method, as is done in the literature, our algorithm treats different types of classes with different sampling methods. For minority classes, the algorithm oversamples by adding normally distributed noise around every support vector to generate new samples. For majority classes, an improved version of clustering by fast search and find of density peaks (CFSFDP) first divides the samples into clusters to capture the latent structure of each majority class, and stratified sampling is then applied to extract samples from each subcluster. Moreover, we extend this method into an ensemble classifier that uses multiple base SVM classifiers for prediction. Experimental results on several imbalanced classification datasets show that CSS is more effective than state-of-the-art sampling methods.
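The two sampling steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the noise scale `sigma`, the round-robin choice of support vector, and the proportional per-cluster quota are all hypothetical choices that the abstract does not specify. The support vectors are assumed to come from an already fitted SVM, and the cluster labels from any clustering of the majority class; the improved CFSFDP step itself is not reproduced here.

```python
import numpy as np


def oversample_around(support_vectors, n_new, sigma=0.1, seed=0):
    """Create n_new synthetic minority samples by adding N(0, sigma^2)
    noise to support vectors, cycling through them in order.
    sigma and the round-robin order are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    sv = np.asarray(support_vectors, dtype=float)
    idx = np.arange(n_new) % len(sv)  # which support vector each new point sits near
    noise = rng.normal(0.0, sigma, size=(n_new, sv.shape[1]))
    return sv[idx] + noise


def stratified_undersample(X, cluster_labels, n_keep, seed=0):
    """Keep roughly n_keep majority samples, drawing from each cluster
    in proportion to its size (proportional allocation is an assumption)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    labels = np.asarray(cluster_labels)
    kept = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        quota = max(1, round(n_keep * len(members) / len(labels)))
        quota = min(quota, len(members))  # never ask for more than the cluster holds
        kept.append(rng.choice(members, size=quota, replace=False))
    return X[np.sort(np.concatenate(kept))]


# Toy data: 3 minority support vectors, 100 majority samples in 3 clusters.
minority_sv = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
synthetic_minority = oversample_around(minority_sv, n_new=7)

majority_X = np.arange(200, dtype=float).reshape(100, 2)
majority_clusters = np.array([0] * 60 + [1] * 30 + [2] * 10)
reduced_majority = stratified_undersample(majority_X, majority_clusters, n_keep=20)
```

Sampling within each cluster, rather than from the majority class as a whole, is what keeps small subclusters represented in the reduced set; a plain random undersample could drop them entirely.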