Optimal clusters generation for maximizing ensemble classifier performance
conference contribution
posted on 2021-06-24, 23:33 authored by Muhammad Zohaib Jan, Brijesh Verma
Clustering-based ensemble classifiers have received considerable attention recently because of their ability to effectively classify real-world noisy datasets. One way of incorporating clustering in ensembles is to use a clustering algorithm such as k-means to generate a pool of data clusters. This generates a random subspace on which base classifiers are trained, as opposed to bagging. One key parameter of clustering algorithms is the number of clusters, i.e. the number of groups the data should be partitioned into; this is commonly denoted K. Most existing approaches either determine the value of K through trial and error or use a formula-based approach. The first problem is that using a static value of K for different datasets is not ideal: a value that works well for one dataset may not work well for others. Second, calculating K with a formula-based approach on the raw data is not effective either, as unbalanced data can have a negative effect on the derived value. Therefore, in this paper we first segregate the data by class and then perform a Silhouette analysis on each class to determine the optimal number of clusters that class should be separated into. The generated clusters, which are class pure, are then balanced by adding samples from other classes that are closest to the cluster centroid. In this manner we generate a random subspace of augmented data composed of class-balanced data clusters. A diverse set of base classifiers is trained on all balanced data clusters, and an ensemble is formed. The proposed ensemble approach is tested on 16 benchmark UCI datasets, and the results are compared with single classifiers as well as state-of-the-art ensemble classifier approaches. A set of non-parametric tests is also adopted to further validate the efficacy of the results.
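The cluster-generation step described in the abstract could be sketched roughly as follows. This is an illustrative sketch using NumPy and scikit-learn, not the authors' implementation. The function names `best_k_by_silhouette` and `balanced_class_clusters`, the search range `k_max`, the use of k-means within each class, and the choice to add as many other-class samples as the cluster already contains are all assumptions, since the abstract does not specify them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X_class, k_max=6, seed=0):
    """Return (k, fitted KMeans) with the highest mean silhouette score.

    Hypothetical helper: k_max is an assumed search limit. The silhouette
    score is only defined for 2 <= k <= n_samples - 1, so k = 1 is never
    selected by this procedure.
    """
    best = None
    upper = min(k_max, len(X_class) - 1)
    for k in range(2, upper + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_class)
        score = silhouette_score(X_class, km.labels_)
        if best is None or score > best[0]:
            best = (score, k, km)
    if best is None:
        raise ValueError("need at least 3 samples in the class")
    return best[1], best[2]


def balanced_class_clusters(X, y, k_max=6, seed=0):
    """Cluster each class separately, then balance every class-pure cluster.

    Assumption: a cluster with m members is balanced by adding the m
    samples from all other classes that lie closest to its centroid.
    Returns a list of (X_balanced, y_balanced) pairs, one per cluster.
    """
    clusters = []
    for c in np.unique(y):
        X_c = X[y == c]
        X_other, y_other = X[y != c], y[y != c]
        k, km = best_k_by_silhouette(X_c, k_max, seed)
        for j in range(k):
            members = X_c[km.labels_ == j]
            # Euclidean distance from every other-class sample to the centroid
            dist = np.linalg.norm(X_other - km.cluster_centers_[j], axis=1)
            nearest = np.argsort(dist)[: len(members)]
            X_bal = np.vstack([members, X_other[nearest]])
            y_bal = np.concatenate([np.full(len(members), c), y_other[nearest]])
            clusters.append((X_bal, y_bal))
    return clusters
```

Each returned pair could then serve as the training set for one or more base classifiers, whose predictions are combined into an ensemble. The choice of base classifiers and the combination rule are not given in the abstract and are left out of this sketch.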
Funding
Category 1 - Australian Competitive Grants (this includes ARC, NHMRC)