File(s) not publicly available
Pruning high-similarity clusters to optimize data diversity when building ensemble classifiers
journal contribution
posted on 2020-04-01, 00:00 authored by Samuel Fletcher, Brijesh Verma
Diversity is a key component for building a successful ensemble classifier. One approach to diversifying the base classifiers in an ensemble classifier is to diversify the data they are trained on. While sampling approaches such as bagging have been used for this task in the past, we argue that since they maintain the global distribution, they do not create diversity. Instead, we make a principled argument for the use of k-means clustering to create diversity. Expanding on previous work, we observe that when creating multiple clusterings with multiple k values, there is a risk of different clusterings discovering the same clusters, which would in turn train the same base classifiers. This would bias the ensemble voting process. We propose a new approach that uses the Jaccard Index to detect and remove similar clusters before training the base classifiers, not only saving computation time, but also reducing classification error by removing repeated votes. We empirically demonstrate the effectiveness of the proposed approach compared to the state of the art on 19 UCI benchmark datasets.
© 2019 World Scientific Publishing Europe Ltd.
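The pruning idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the greedy keep-first strategy, and the `threshold` value of 0.9 are all assumptions made for illustration, and the clusters are hand-written sets of sample indices rather than the output of an actual k-means run.

```python
from typing import FrozenSet, Iterable, List


def jaccard(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Jaccard Index |a & b| / |a | b| of two sets of sample indices."""
    union = a | b
    if not union:
        return 1.0  # two empty clusters are treated as identical
    return len(a & b) / len(union)


def prune_similar_clusters(
    clusters: Iterable[Iterable[int]], threshold: float = 0.9
) -> List[FrozenSet[int]]:
    """Keep a cluster only if its Jaccard Index with every cluster
    already kept is below `threshold` (hypothetical cut-off)."""
    kept: List[FrozenSet[int]] = []
    for cluster in clusters:
        candidate = frozenset(cluster)
        if all(jaccard(candidate, k) < threshold for k in kept):
            kept.append(candidate)
    return kept


# Clusters pooled from two hypothetical clusterings (k = 2 and k = 3).
pooled = [
    {0, 1, 2, 3, 4},  # from k = 2
    {5, 6, 7, 8, 9},  # from k = 2
    {0, 1, 2, 3, 4},  # from k = 3, same cluster rediscovered
    {5, 6, 7},        # from k = 3
    {8, 9},           # from k = 3
]

kept = prune_similar_clusters(pooled, threshold=0.9)
print(len(kept))  # 4: the rediscovered cluster is dropped
```

In this sketch, one base classifier would then be trained per surviving cluster, so the duplicated cluster no longer contributes a repeated vote. How the paper chooses the similarity cut-off and which of two similar clusters it keeps is not stated in the abstract.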
Funding
Category 1 - Australian Competitive Grants (this includes ARC, NHMRC)
History
Volume
18
Issue
4
Start Page
1950027-1
End Page
1950027-20
Number of Pages
20
eISSN
1757-5885
ISSN
1469-0268
Publisher
Imperial College Press, UK
Publisher DOI
Peer Reviewed
- Yes
Open Access
- No
Acceptance Date
2019-10-01
Author Research Institute
- Centre for Intelligent Systems
Era Eligible
- Yes
Journal
International Journal of Computational Intelligence and Applications