CQUniversity
Pruning high-similarity clusters to optimize data diversity when building ensemble classifiers

journal contribution
posted on 2020-04-01, 00:00; authored by Samuel Fletcher and Brijesh Verma
Diversity is a key component for building a successful ensemble classifier. One approach to diversifying the base classifiers in an ensemble classifier is to diversify the data they are trained on. While sampling approaches such as bagging have been used for this task in the past, we argue that since they maintain the global distribution, they do not create diversity. Instead, we make a principled argument for the use of k-means clustering to create diversity. Expanding on previous work, we observe that when creating multiple clusterings with multiple k values, there is a risk of different clusterings discovering the same clusters, which would in turn train the same base classifiers. This would bias the ensemble voting process. We propose a new approach that uses the Jaccard Index to detect and remove similar clusters before training the base classifiers, not only saving computation time, but also reducing classification error by removing repeated votes. We empirically demonstrate the effectiveness of the proposed approach compared to the state of the art on 19 UCI benchmark datasets.

© 2019 World Scientific Publishing Europe Ltd.
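The pruning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each cluster is represented as a set of training-sample indices, that clusters from all k values are pooled into one list, and that a cluster is dropped when its Jaccard Index with an already-kept cluster reaches a threshold. The function names, the 0.9 threshold, and the keep-first policy are hypothetical choices made only for illustration.

```python
from typing import FrozenSet, Iterable, List


def jaccard_index(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Jaccard Index |A ∩ B| / |A ∪ B| of two sets of sample indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def prune_similar_clusters(
    clusters: Iterable[Iterable[int]], threshold: float = 0.9
) -> List[FrozenSet[int]]:
    """Keep a cluster only if it is not too similar to any cluster kept so far.

    Assumption: the first occurrence of a group of similar clusters is the one
    retained; the source abstract does not specify which copy is kept.
    """
    kept: List[FrozenSet[int]] = []
    for cluster in clusters:
        candidate = frozenset(cluster)
        if all(jaccard_index(candidate, k) < threshold for k in kept):
            kept.append(candidate)
    return kept


# Hypothetical clusterings of 10 samples for k = 2 and k = 3.
# The cluster {0, 1, 2, 3, 4} is discovered by both clusterings.
clusterings = {
    2: [{0, 1, 2, 3, 4}, {5, 6, 7, 8, 9}],
    3: [{0, 1, 2, 3, 4}, {5, 6, 7}, {8, 9}],
}
pooled = [c for k in sorted(clusterings) for c in clusterings[k]]
unique_clusters = prune_similar_clusters(pooled, threshold=0.9)
print(len(pooled), "->", len(unique_clusters))  # 5 -> 4
```

In this toy example the repeated cluster would have trained two identical base classifiers and so cast the same vote twice; after pruning, only one base classifier would be trained on it.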

Funding

Category 1 - Australian Competitive Grants (this includes ARC, NHMRC)

History

Volume

18

Issue

4

Start Page

1950027-1

End Page

1950027-20

Number of Pages

20

eISSN

1757-5885

ISSN

1469-0268

Publisher

Imperial College Press, UK

Peer Reviewed

  • Yes

Open Access

  • No

Acceptance Date

2019-10-01

Author Research Institute

  • Centre for Intelligent Systems

Era Eligible

  • Yes

Journal

International Journal of Computational Intelligence and Applications