File(s) not publicly available

Pruning high-similarity clusters to optimize data diversity when building ensemble classifiers

journal contribution
posted on 01.04.2020, 00:00 authored by Samuel Fletcher, Brijesh Verma
Diversity is a key component for building a successful ensemble classifier. One approach to diversifying the base classifiers in an ensemble classifier is to diversify the data they are trained on. While sampling approaches such as bagging have been used for this task in the past, we argue that since they maintain the global distribution, they do not create diversity. Instead, we make a principled argument for the use of k-means clustering to create diversity. Expanding on previous work, we observe that when creating multiple clusterings with multiple k values, there is a risk of different clusterings discovering the same clusters, which would in turn train the same base classifiers. This would bias the ensemble voting process. We propose a new approach that uses the Jaccard Index to detect and remove similar clusters before training the base classifiers, not only saving computation time, but also reducing classification error by removing repeated votes. We empirically demonstrate the effectiveness of the proposed approach compared to the state of the art on 19 UCI benchmark datasets. © 2019 World Scientific Publishing Europe Ltd.
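The pruning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each cluster is represented as a set of sample indices, that clusters from all k-means runs are pooled in one list, and that a cluster is dropped when its Jaccard Index against an already kept cluster reaches a cut-off. The names `jaccard_index` and `prune_similar_clusters`, the greedy keep-first order, and the `threshold` value of 0.9 are all hypothetical choices made for illustration.

```python
from typing import FrozenSet, Iterable, List


def jaccard_index(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Return |a ∩ b| / |a ∪ b|, or 0.0 when both sets are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def prune_similar_clusters(
    clusters: Iterable[Iterable[int]], threshold: float = 0.9
) -> List[FrozenSet[int]]:
    """Greedily keep a cluster only if it is not too similar to a kept one.

    `threshold` is a hypothetical cut-off, not a value taken from the paper.
    """
    kept: List[FrozenSet[int]] = []
    for cluster in clusters:
        candidate = frozenset(cluster)
        # Keep the candidate only if every kept cluster is below the cut-off.
        if all(jaccard_index(candidate, k) < threshold for k in kept):
            kept.append(candidate)
    return kept


# Clusters pooled from two hypothetical k-means runs (k = 2 and k = 3) over
# ten samples; each cluster is given as the indices of its members.
pooled = [
    {0, 1, 2, 3, 4}, {5, 6, 7, 8, 9},    # from the k = 2 clustering
    {0, 1, 2, 3, 4}, {5, 6, 7}, {8, 9},  # from the k = 3 clustering
]
survivors = prune_similar_clusters(pooled, threshold=0.9)
print(len(survivors))  # 4: the repeated cluster {0, 1, 2, 3, 4} is dropped
```

In this toy input both clusterings discover the cluster {0, 1, 2, 3, 4}, so without pruning two identical base classifiers would be trained and would cast the same vote twice. Under the assumptions above, only one copy survives, while the partially overlapping clusters (Jaccard Index 0.6 and 0.4 against {5, 6, 7, 8, 9}) are retained.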

Funding

Category 1 - Australian Competitive Grants (this includes ARC, NHMRC)

History

Volume

18

Issue

4

Start Page

1950027-1

End Page

1950027-20

Number of Pages

20

eISSN

1757-5885

ISSN

1469-0268

Publisher

Imperial College Press, UK

Peer Reviewed

Yes

Open Access

No

Acceptance Date

01/10/2019

Author Research Institute

Centre for Intelligent Systems

Era Eligible

Yes

Journal

International Journal of Computational Intelligence and Applications