Protein subcellular localization : gene ontology based machine learning approaches

Wan, Shibiao

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/82944

Title:	Protein subcellular localization : gene ontology based machine learning approaches
Authors:	Wan, Shibiao
Degree:	Ph.D.
Issue Date:	2014
Abstract:	Proteins, which are essential macromolecules for organisms, need to be located in appropriate physiological contexts within a cell to exhibit their tremendous diversity of biological functions. Aberrant protein subcellular localization may lead to a broad range of diseases. Knowing where a protein resides within a cell can provide insights into drug target discovery and drug design. Computational methods are required to assist the laborious and time-consuming conventional wet-lab experiments for accurate, fast, reliable and large-scale predictions in proteomics research. This thesis proposes several Gene Ontology (GO) based machine learning approaches for the prediction of subcellular localization of both single-location and multi-location proteins.
For the prediction of single-location proteins, two GO-based single-label predictors, namely GOASVM and FusionSVM, are proposed. GOASVM exploits GO information from the Gene Ontology Annotation (GOA) database, while FusionSVM extracts GO information from InterProScan and then combines it with profile alignment information. It was found that GOASVM performs significantly better than FusionSVM. Moreover, GOASVM also remarkably outperforms existing state-of-the-art single-label predictors.
For the prediction of multi-location proteins, an efficient multi-label predictor, namely mGOASVM, is proposed. mGOASVM extends GOASVM from single-location prediction to multi-location prediction. It possesses the following desirable properties: (1) it uses the frequency of occurrences of GO terms instead of 1-0 values; (2) it uses a more efficient multi-label SVM classifier to handle multi-label problems; (3) it selects a relevant GO-vector subspace by finding distinct GO terms instead of using the full GO-vector space; and (4) it adopts a successive-search strategy to incorporate more useful homologous information for classification. It was found that these properties make mGOASVM outperform other GO-based multi-label predictors.
Based on mGOASVM, several more advanced multi-label predictors are proposed. These predictors further improve the performance of mGOASVM by enhancing the following aspects of the prediction process:
1. Classification Refinement. The classifier adopted by mGOASVM to tackle multi-label problems is rather primitive, so refining the classification process is necessary. To this end, two multi-label predictors, namely AD-SVM and mPLR-Loc, are proposed. The former adopts an adaptive decision scheme for multi-label SVM classification. The scheme essentially converts the linear SVMs in the classifier into piecewise linear SVMs, which effectively reduces the number of over-predictions while having little influence on the correctly predicted instances, thus improving the prediction performance. The latter adopts a multi-label penalized logistic regression classifier equipped with an adaptive decision scheme, which can also boost the performance.
2. Deeper Feature Extraction. mGOASVM only considers the frequency of occurrences of GO terms, which may not be sufficient for accurate prediction. To overcome this limitation, a multi-label predictor called SS-Loc, which further exploits the semantic similarity over GO, is proposed. Based on SS-Loc, an even more advanced predictor called HybridGO-Loc, which uses both GO frequency features and GO semantic similarity features, is developed. Experimental results demonstrate that HybridGO-Loc performs the best among all of the proposed multi-label predictors as well as other existing GO-based predictors.
3. Dimensionality Reduction. Although a relevant GO-vector subspace has been selected, the feature vectors in mGOASVM are still of high dimensionality. To address the curse of dimensionality, an ensemble method based on random projection (RP) is applied to construct two dimensionality-reduction multi-label predictors, namely RP-SVM and R3P-Loc. The former uses multi-label SVM classifiers and the latter uses multi-label ridge regression classifiers. Experimental results suggest that both predictors outperform mGOASVM as well as other state-of-the-art predictors while at the same time substantially reducing the feature dimensionality.
Subjects:	Proteins -- Analysis. Proteins -- Analysis -- Mathematics. Hong Kong Polytechnic University -- Dissertations
Pages:	xxx, 250 pages : color illustrations ; 30 cm
Appears in Collections:	Thesis
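
The abstract's properties (1) and (3) of mGOASVM, frequency-based GO features over a subspace of distinct GO terms, can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the protein IDs, the annotation lists, and the function names are hypothetical, and the decision rule in `predict_locations` (keep every class with a positive score, otherwise fall back to the top-scoring class) is an assumed, commonly used multi-label rule that the abstract does not spell out.

```python
from collections import Counter

# Hypothetical annotations: each protein maps to the GO terms retrieved for it.
# A term may repeat, e.g. when it is returned by several database entries.
annotations = {
    "P1": ["GO:0005634", "GO:0005634", "GO:0005737"],
    "P2": ["GO:0005739", "GO:0005737", "GO:0005739", "GO:0005739"],
}


def build_go_subspace(annotations):
    """Collect the distinct GO terms seen in the data set (the GO-vector subspace)."""
    return sorted({term for terms in annotations.values() for term in terms})


def frequency_vector(terms, subspace):
    """Count how often each GO term of the subspace occurs for one protein."""
    counts = Counter(terms)
    return [counts[term] for term in subspace]


def binary_vector(terms, subspace):
    """1-0 alternative: record only whether each GO term is present."""
    present = set(terms)
    return [1 if term in present else 0 for term in subspace]


def predict_locations(scores):
    """Assumed rule: return all classes with a positive score,
    or the single top-scoring class when no score is positive."""
    positive = [label for label, score in scores.items() if score > 0]
    return positive or [max(scores, key=scores.get)]


subspace = build_go_subspace(annotations)
freq_p1 = frequency_vector(annotations["P1"], subspace)
bin_p1 = binary_vector(annotations["P1"], subspace)
```

In this toy data the frequency vector keeps the information that a term was retrieved twice for P1, which the 1-0 vector discards. The scores passed to `predict_locations` would come from one classifier per subcellular location; how those classifiers are trained is outside the scope of this sketch.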

Access

View full-text via https://theses.lib.polyu.edu.hk/handle/200/7770
