Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/4656
Title: Fast subcellular localization by extracting informative regions of protein sequences for profile alignment
Authors: Wang, Wei
Subjects: Proteins -- Analysis.
Amino acid sequence
Bioinformatics.
Hong Kong Polytechnic University -- Dissertations
Issue Date: 2011
Publisher: The Hong Kong Polytechnic University
Abstract: Determining the subcellular localization of proteins is vital for understanding protein function and for drug design. However, experimental methods of subcellular localization are expensive and time-consuming. Computational methods, on the other hand, offer the potential to annotate large protein datasets in a cost-effective and time-efficient manner. With the ever-increasing number of sequenced proteins, the gap between newly found protein sequences and the knowledge of their subcellular localization has widened rapidly. Thus, it is imperative to speed up subcellular localization algorithms. In this thesis, a cascaded fusion of cleavage site prediction and subcellular localization prediction is developed to alleviate the computational burden of homology-based prediction methods. Specifically, the informative region (signal peptide or transit peptide) of a protein sequence is first determined by a cleavage site predictor. Then, only the informative segment is passed to a homology-based predictor for the determination of subcellular locations. A cleavage site predictor based on conditional random fields (CRFs) is developed. It was found that CRFs outperform neural networks and hidden Markov models in the prediction of cleavage site positions. To minimize the training and classification time of the subcellular localization predictors, a kernel Fisher discriminator is proposed. Specifically, the profile of the informative segment of a protein sequence is first generated by PSI-BLAST. The profile is then vectorized by computing the profile-alignment scores between the profile and all of the training profiles. The resulting vector is projected onto a low-dimensional space by using a new form of kernel discriminant analysis called kernel perturbation discriminant analysis. The vector in the low-dimensional space is then classified by a support-vector-machine classifier.
It was found that the reduction in dimension leads to further computational savings when compared with the direct classification of profile-alignment vectors. The proposed method was evaluated on a newly created redundancy-removed data set using five-fold cross-validation. Results show that the method can attain accurate localization while reducing the computational time substantially when compared to some state-of-the-art methods. In particular, it was found that truncating the sequences at their cleavage sites can reduce the profile creation time (by PSI-BLAST) as compared to truncating the profiles. A sensitivity analysis suggests that subcellular localization accuracy is inversely proportional to the discrepancy of the truncation positions with respect to the ground-truth cleavage sites. It was also found that the subcellular localization accuracy of chloroplast transit peptides (cTP) is highly dependent on the correct prediction of their cleavage sites, suggesting that further investigation is necessary to improve the cleavage site prediction of cTP.
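The first two stages described in the abstract (truncating a sequence at its predicted cleavage site, then turning the retained segment into a vector of alignment scores against the training set) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the thesis's implementation: the function names are hypothetical, the cleavage site is passed in directly rather than predicted by a CRF, and a plain Smith-Waterman score on raw sequences stands in for PSI-BLAST profile alignment. The kernel discriminant projection and the SVM classifier are omitted.

```python
def truncate_at_cleavage(sequence, cleavage_site):
    """Keep only the N-terminal segment up to the (given) cleavage site."""
    if not 0 <= cleavage_site <= len(sequence):
        raise ValueError("cleavage_site must lie within the sequence")
    return sequence[:cleavage_site]


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty.

    Stand-in for a profile-alignment score; the scoring values are
    arbitrary illustrative choices.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def vectorize(segment, training_segments):
    """Represent a segment by its alignment scores against all training segments."""
    return [local_alignment_score(segment, t) for t in training_segments]


if __name__ == "__main__":
    # Toy sequences and cleavage sites, invented for this example only.
    training = [truncate_at_cleavage(s, k) for s, k in [("MKTLLAAQ", 3), ("AAAGGG", 3)]]
    query = truncate_at_cleavage("MKTWWWWW", 3)
    print(vectorize(query, training))
```

Because the score vector has one entry per training sequence, its length grows with the training set, which is consistent with the abstract's point that projecting to a low-dimensional space before classification saves computation.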
Description: vi, 61 p. : ill. (some col.) ; 30 cm.
PolyU Library Call No.: [THS] LG51 .H577M EIE 2011 WangW
Rights: All rights reserved.
Type: Thesis
URI: http://hdl.handle.net/10397/4656
Appears in Collections: EIE Theses
PolyU Electronic Theses

Files in This Item:
File                  Description                     Size      Format
b24561861_link.htm    For PolyU Users                 162 B     HTML
b24561861_ir.pdf      For All Users (Non-printable)   1.29 MB   Adobe PDF

