Pruning high-similarity clusters to optimize data diversity when building ensemble classifiers
Journal contribution posted on 01.04.2020, 00:00, authored by Samuel Fletcher and Brijesh Verma
Diversity is a key component for building a successful ensemble classifier. One approach to diversifying the base classifiers in an ensemble classifier is to diversify the data they are trained on. While sampling approaches such as bagging have been used for this task in the past, we argue that since they maintain the global distribution, they do not create diversity. Instead, we make a principled argument for the use of k-means clustering to create diversity. Expanding on previous work, we observe that when creating multiple clusterings with multiple k values, there is a risk of different clusterings discovering the same clusters, which would in turn train the same base classifiers. This would bias the ensemble voting process. We propose a new approach that uses the Jaccard Index to detect and remove similar clusters before training the base classifiers, not only saving computation time, but also reducing classification error by removing repeated votes. We empirically demonstrate the effectiveness of the proposed approach compared to the state of the art on 19 UCI benchmark datasets. © 2019 World Scientific Publishing Europe Ltd.
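The pruning step described in the abstract can be illustrated with a minimal sketch. It assumes each cluster is represented as a set of training-sample indices, and that clusters from all clusterings are pooled before training. The function names, the greedy keep-first strategy, and the 0.9 threshold are illustrative assumptions, not the authors' published procedure; the k-means step itself is omitted and replaced by hypothetical hand-written clusters.

```python
from typing import FrozenSet, Iterable, List


def jaccard_index(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Return |a ∩ b| / |a ∪ b|, treating two empty sets as identical."""
    union = len(a | b)
    return 1.0 if union == 0 else len(a & b) / union


def prune_similar_clusters(
    clusters: Iterable[Iterable[int]], threshold: float = 0.9
) -> List[FrozenSet[int]]:
    """Greedily keep a cluster only if its Jaccard index with every
    previously kept cluster is below the (assumed) threshold."""
    kept: List[FrozenSet[int]] = []
    for cluster in clusters:
        candidate = frozenset(cluster)
        if all(jaccard_index(candidate, k) < threshold for k in kept):
            kept.append(candidate)
    return kept


# Hypothetical clusters (sets of sample indices) pooled from two
# clusterings of ten samples, one with k = 2 and one with k = 3.
clusters_k2 = [{0, 1, 2, 3, 4}, {5, 6, 7, 8, 9}]
clusters_k3 = [{0, 1, 2, 3, 4}, {5, 6, 7}, {8, 9}]

# The repeated cluster {0, 1, 2, 3, 4} is detected and dropped, so only
# one base classifier would be trained on it.
kept = prune_similar_clusters(clusters_k2 + clusters_k3, threshold=0.9)
print(len(kept))
```

In this sketch each surviving cluster would then train one base classifier, so a cluster discovered by several clusterings contributes a single vote rather than several identical ones.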