CQUniversity

A novel framework for optimised ensemble classifiers

thesis
posted on 2023-10-26, 03:53 authored by Muhammad Zohaib Jan
Ensemble classifiers are created by combining multiple single classifiers to achieve higher classification accuracy. Ensemble classifiers benefit from the ‘perturb and combine’ strategy, in which the input data are perturbed to generate sub-samples and base classifiers are trained on the generated sub-samples. All trained base classifiers are then suitably combined to form an ensemble decision. One common strategy for perturbing input data is clustering: data clusters are generated from the input, and base classifiers are trained on the generated data clusters. Such ensemble classifiers are also called clustering-based ensemble classifiers, as they utilise clustering algorithms to generate a perturbed input training space. Clustering is widely applicable for generating ensemble classifiers; however, it has certain limitations. One key limitation is that clustering algorithms require the number of data clusters to be specified in advance. Most existing ensemble approaches use a fixed number of data clusters across various datasets, normally found through a process of trial and error. Additionally, since clustering works independently of data classes, class imbalances may occur in the data clusters, and data clusters may miss data samples from certain classes. Therefore, not all data clusters are suitable for the training of base classifiers, and redundant or imbalanced data clusters should be dealt with appropriately. Besides the number of data clusters, the choice and type of base classifiers trained on the generated data clusters also have a significant impact on the ensemble classifier’s performance. Using all base classifiers to generate an ensemble classifier is not an ideal strategy, so an appropriate classifier selection methodology must be adopted to select the subset of base classifiers that maximises the ensemble classifier’s accuracy.
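The ‘perturb and combine’ pipeline described above can be sketched in a few lines of Python. Everything here is illustrative, not the thesis’s actual implementation: the toy 1-D k-means, the trivial majority-class base learner, and the nearest-cluster fusion rule are all assumptions made for the sake of a minimal, runnable example.

```python
import random
from collections import Counter

def kmeans_1d(xs, k, iters=20, seed=0):
    """Toy 1-D k-means: returns (labels, centroids). Illustrative only."""
    rng = random.Random(seed)
    centroids = rng.sample(xs, k)
    labels = [0] * len(xs)
    for _ in range(iters):
        # Assign each sample to its nearest centroid.
        labels = [min(range(k), key=lambda j: abs(x - centroids[j])) for x in xs]
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [x for x, l in zip(xs, labels) if l == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels, centroids

class MajorityClassifier:
    """Trivial base classifier: always predicts the majority class it saw."""
    def fit(self, ys):
        self.label = Counter(ys).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.label

def train_cluster_ensemble(xs, ys, k):
    """Perturb the input via clustering, train one base classifier per
    non-empty cluster, and combine them into a single decision function."""
    labels, centroids = kmeans_1d(xs, k)
    base = {}
    for j in range(k):
        cluster_ys = [y for y, l in zip(ys, labels) if l == j]
        if cluster_ys:  # skip empty clusters: not all clusters are usable
            base[j] = MajorityClassifier().fit(cluster_ys)

    def predict(x):
        # One simple fusion rule (an assumption, not the thesis's scheme):
        # route the sample to the base classifier of its nearest centroid.
        j = min(base, key=lambda j: abs(x - centroids[j]))
        return base[j].predict(x)

    return predict

# Two well-separated groups: low values are class 0, high values class 1.
xs = [0.1, 0.15, 0.2, 0.9, 0.95, 1.0]
ys = [0, 0, 0, 1, 1, 1]
ensemble = train_cluster_ensemble(xs, ys, k=2)
```

Note how the sketch already surfaces the limitations the abstract raises: `k` must be chosen up front, and a cluster can end up empty or single-class, which is why the training loop has to skip unusable clusters.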
In this thesis, several novel ensemble classifier methods are proposed to mitigate these limitations and improve the accuracy of ensemble classifiers. The first ensemble method is based on a novel strategy of incorporating an evolutionary algorithm to dynamically search for the upper bound of clustering. The second incorporates an evolutionary algorithm in two phases: optimising the pool of data clusters rather than a single upper bound, and then optimising the pool of base classifiers. The third is based on a hybrid approach that addresses the problem of dimensionality and uses reduced-dimensionality data to generate an optimised ensemble classifier. The fourth is based on a novel cluster balancing strategy that solves the problem of class imbalance by balancing data clusters. The fifth contains a novel strategy to find the optimal number of clusters for each data class through the incorporation of cluster validation strategies. The sixth is based on a novel classifier selection strategy that selects classifiers from the pool based on accuracy and diversity comparisons. The seventh, and final, ensemble classifier method uses a novel pairwise diversity measure to select classifiers from the pool based on increasing accuracy and diversity. The proposed ensemble methods were evaluated on several benchmark datasets that are used by other researchers, allowing a comparative analysis. In most cases the ensemble classifier’s accuracy was used as the performance metric; in other cases different diversity measures were used. Statistical significance testing was also conducted to further validate the efficacy of the results, and p-values were reported.
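To give a feel for selection by accuracy and diversity (the basis of the sixth and seventh methods), the sketch below uses the standard pairwise disagreement measure and a hypothetical greedy criterion with a blending weight `alpha`. The thesis’s novel pairwise diversity measure and its actual selection strategies are not reproduced here.

```python
def accuracy(preds, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

def disagreement(preds_a, preds_b):
    """Standard pairwise disagreement: fraction of samples on which two
    classifiers differ (0 = identical behaviour, 1 = always differ)."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

def greedy_select(pool, truth, size, alpha=0.4):
    """Greedily build a sub-ensemble: seed with the most accurate classifier,
    then add whichever candidate maximises a blend of its own accuracy and
    its mean disagreement with the members already selected. `alpha` is a
    hypothetical trade-off weight, not a value from the thesis."""
    selected = [max(range(len(pool)), key=lambda i: accuracy(pool[i], truth))]
    while len(selected) < size:
        def score(i):
            div = sum(disagreement(pool[i], pool[j]) for j in selected) / len(selected)
            return alpha * accuracy(pool[i], truth) + (1 - alpha) * div
        candidates = [i for i in range(len(pool)) if i not in selected]
        selected.append(max(candidates, key=score))
    return selected

truth = [0, 0, 1, 1]
pool = [
    [0, 0, 1, 1],  # perfect classifier, but...
    [0, 0, 1, 1],  # ...an exact duplicate adds no diversity
    [1, 0, 1, 1],  # slightly less accurate, yet disagrees with the others
]
chosen = greedy_select(pool, truth, size=2)
```

With these toy predictions the greedy criterion skips the duplicate of the best classifier and pairs it with the less accurate but more diverse one, illustrating why using all base classifiers, or only the most accurate ones, is not an ideal strategy.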
The results and analysis presented in this thesis show that the proposed ensemble methods not only achieve classification accuracy better than existing state-of-the-art ensemble methods, but also provide a platform for future research. It was found through experimentation that the upper bound of clustering follows a logarithmic relation with the number of data samples in each dataset. Moreover, extensive experimentation showed that not all base classifiers should be selected to generate the ensemble: only a subset of base classifiers is required to generate an ensemble classifier that achieves the highest classification accuracy. Through the incorporation of optimisation, it was also shown that no preference is given to a specific base classifier, and that the appropriate type of base classifier depends on the characteristics of the dataset. Silhouette analysis proved to be an effective cluster validation metric for determining the optimal number of data clusters. Finally, balancing data clusters proved effective not only in terms of classification accuracy, but also confirmed that each dataset has different spatial characteristics which, when exploited appropriately, can contribute to overall ensemble classifier accuracy.
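The silhouette analysis mentioned above can be illustrated with a pure-Python sketch of the silhouette coefficient. The data and the two candidate partitions below are made up for illustration; the thesis’s actual validation procedure and datasets are not reproduced.

```python
def silhouette(xs, labels):
    """Mean silhouette coefficient for a 1-D clustering.
    For each sample: a = mean distance to the other members of its own
    cluster, b = lowest mean distance to any other cluster, and the
    sample's score is (b - a) / max(a, b)."""
    n = len(xs)
    scores = []
    for i in range(n):
        own = [xs[j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not own:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        a = sum(abs(xs[i] - o) for o in own) / len(own)
        b = min(
            sum(abs(xs[i] - xs[j]) for j in range(n) if labels[j] == other)
            / sum(1 for j in range(n) if labels[j] == other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / n

# Two natural groups, so the two-cluster partition should score higher
# than an over-split three-cluster partition.
xs = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
k2 = [0, 0, 0, 1, 1, 1]
k3 = [0, 0, 0, 1, 1, 2]
best = max([k2, k3], key=lambda labels: silhouette(xs, labels))
```

Sweeping candidate values of `k` and keeping the partition with the highest mean silhouette is the usual way this metric is used to pick the number of clusters, which is the role the abstract describes for it.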

History

Location

Central Queensland University

Additional Rights

CC BY

Open Access

  • Yes

Era Eligible

  • No

Supervisor

Professor Brijesh Verma

Thesis Type

  • Doctoral Thesis

Thesis Format

  • With publication