Microarray gene selection for cancer classification using support vector machine
Thesis posted on 06.12.2017, 00:00 by Choudhury Wahid
Genes, which are composed of DNA (deoxyribonucleic acid) molecules, contain the blueprint of any living organism. The expression level of a gene measures its activity under given biochemical conditions. DNA microarray technology is widely used to measure gene expression levels and to study gene interactions. It provides access to thousands of genes at once by recording their expression levels simultaneously. The data generated by microarray technology are important because changes in gene expression have been shown to carry information about different types of cancer. Support Vector Machine (SVM) algorithms have recently been applied extensively to DNA microarray data. However, cancer classification from gene expression data with an SVM is a non-trivial task because such data have very high dimensionality, usually on the order of thousands to tens of thousands of genes. The situation is further complicated by the comparatively small number of samples, usually fewer than one hundred. The high dimensionality of the features combined with the small sample size usually causes over-fitting of the classifier. Feature selection is therefore an important issue in this research context. The feature selection process comprises selecting relevant features, eliminating irrelevant and redundant features from the data, and training the SVM classifier on the reduced-dimensional data. The objective of the current research was to introduce an effective solution to the feature selection problem. A preliminary set of experiments investigated the impact of feature selection on SVM performance. The experimental outcome showed that feature selection had a positive impact on the classification accuracy of the SVM. Subsequently, this research addressed the feature selection problem through a clustering-based approach.
The key idea of the proposed FSFC (Feature Selection through Feature Clustering) method was to transpose the gene expression data matrix and to treat the resulting rows as points in a reduced-dimensional space. These data points were grouped into an optimal number of clusters using the well-known k-means clustering method. Within each cluster, the points whose distance from the cluster center was greater than the average distance of that cluster's points from the center were selected into the final subset of features. The performance of this novel strategy was compared extensively with other standard approaches on various benchmark gene expression datasets. Experiments on real-world datasets showed that this method was computationally less expensive than the other standard approaches and selected moderately fewer features, for which the SVM classification accuracy was relatively high. The advantage of this method was that it required no parameter tuning for feature selection yet dealt effectively with the high-dimensional nature of gene expression data. The next steps of the research introduced three simple but intuitive hybrid feature selection approaches based on FSFC. These hybrid approaches achieved a further reduction in data dimensionality while maintaining, or in some cases improving, the SVM classification accuracy. The three hybrid feature selection methods, namely Combined Distance Metrics, FSFC followed by SVM-RFE (Support Vector Machine - Recursive Feature Elimination), and Recursive FSFC, clearly showed the suitability of the proposed FSFC method in the hybrid feature selection paradigm. Together, the assessment of the impact of feature selection on the SVM, the proposed parameterless FSFC feature selection method, and the FSFC-based hybrid approaches will make the SVM, which is grounded in statistical learning theory, more effective in the domain of microarray analysis for cancer biology.
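The FSFC selection rule described above can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes Euclidean distance, uses a plain NumPy k-means written for this example, and takes the number of clusters `k` as an input because the abstract does not say how the optimal number is determined. The function names `kmeans` and `fsfc_select` and the synthetic data are hypothetical.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means (illustrative); returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialise the centers from k distinct data points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every center, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster is empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment against the final centers.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers


def fsfc_select(X, k, seed=0):
    """X has shape (n_samples, n_genes); returns indices of selected genes."""
    # Transpose: each gene becomes a point in n_samples-dimensional space.
    genes = X.T
    labels, centers = kmeans(genes, k, seed=seed)
    # Distance of each gene to the center of its own cluster.
    dist = np.linalg.norm(genes - centers[labels], axis=1)
    selected = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if members.size == 0:
            continue
        # Keep genes lying farther from the center than the cluster average.
        keep = members[dist[members] > dist[members].mean()]
        selected.extend(keep.tolist())
    return np.array(sorted(selected), dtype=int)


# Synthetic stand-in for a microarray matrix: 20 samples, 200 genes.
rng = np.random.default_rng(42)
X = rng.normal(size=(20, 200))
selected = fsfc_select(X, k=5)  # k=5 is an arbitrary choice for the demo
print(len(selected), "of", X.shape[1], "genes kept")
```

The reduced matrix `X[:, selected]` would then be passed to the SVM classifier. Because the rule keeps only genes farther from their center than the cluster average, it always discards at least one gene per non-empty cluster.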