Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/4086
Title: Building a decision cluster classification model by a clustering algorithm to classify large high dimensional data with multiple classes
Authors: Li, Yan
Subjects: Hong Kong Polytechnic University -- Dissertations
Cluster analysis -- Data processing
Dimensional analysis -- Data processing
Computer algorithms
Data mining
Issue Date: 2010
Publisher: The Hong Kong Polytechnic University
Abstract: Clustering and classification are two basic tasks in data mining. As data grow more complex, existing classification techniques face many challenges, for instance classifying large, high dimensional data with multiple classes. New techniques are therefore needed to deal with data of large volume and high dimensionality. In this thesis, we propose a possible solution to this problem by integrating a clustering algorithm into the classification task. We propose a new classification framework consisting of three phases: (i) a clustering algorithm is called recursively to build a decision cluster tree, (ii) a classification model is built from this decision cluster tree, and (iii) new samples are classified by this classification model. This framework raises many research problems, and in this thesis we describe our methodology for addressing them.
Within this framework, we propose a new classification method, ADCC (Automatic Decision Cluster Classifier), which uses a variable weighting k-means algorithm, W-k-means, to build a decision cluster tree so that the variable weights of each dimension can be obtained from the training data and used in classification. In partitioning the training data, W-k-means automatically computes the variable weights according to the data distributions, so that important variables receive more weight and noisy variables receive less. When a data set (i.e., a node) is clustered, the class variable is removed from the data, so it has no impact on the clustering results; it is used to determine the dominant class of each cluster. To build a better cluster tree, we introduce effective methods for selecting the number of clusters and the initial cluster centers at each node. Furthermore, we use various tests, including the Anderson-Darling test, to determine whether a node can be further partitioned. In this way, the distribution of the training samples at each node is considered together with the purity and the size of the node. A decision cluster classifier consists of a set of disjoint decision clusters, each labeled with a dominant class that determines the class of new objects falling in the cluster. A series of experiments on both synthetic and real data sets has been conducted. The results show that the new classification method (ADCC) performed better in accuracy and scalability than the existing methods of KNN, decision tree and SVM. It is particularly suitable for large, high dimensional data with many classes.
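As a rough illustration of the three-phase framework described above, the following Python sketch builds a small decision cluster tree and classifies new points by their nearest leaf cluster. It is not the thesis's ADCC implementation: plain k-means with a fixed k = 2 stands in for W-k-means, a simple purity and size threshold stands in for the Anderson-Darling and other tests, and all names (`kmeans`, `build_leaves`, `classify`, `min_size`, `min_purity`) are hypothetical.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, k, rng, iters=20):
    """Plain k-means (a stand-in for W-k-means); returns index groups."""
    centers = rng.sample(points, k)
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for i, p in enumerate(points):
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[j].append(i)
        # An empty cluster keeps its previous center.
        centers = [mean([points[i] for i in g]) if g else centers[j]
                   for j, g in enumerate(groups)]
    return groups


def build_leaves(X, y, rng, min_size=4, min_purity=0.9):
    """Phases (i) and (ii): recursively cluster X and return the leaf
    decision clusters as (center, dominant_class) pairs.
    The class labels y are never passed to the clustering step."""
    label, count = Counter(y).most_common(1)[0]
    too_small = len(X) < min_size
    pure = count / len(X) >= min_purity
    if too_small or pure or len(set(map(tuple, X))) < 2:
        return [(mean(X), label)]
    groups = [g for g in kmeans(X, 2, rng) if g]
    if len(groups) < 2:  # clustering failed to split the node
        return [(mean(X), label)]
    leaves = []
    for g in groups:
        leaves += build_leaves([X[i] for i in g], [y[i] for i in g],
                               rng, min_size, min_purity)
    return leaves


def classify(leaves, x):
    """Phase (iii): label x with the dominant class of the nearest leaf."""
    return min(leaves, key=lambda leaf: sq_dist(x, leaf[0]))[1]


# Toy data (made up for illustration): two well-separated classes.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]
leaves = build_leaves(X, y, random.Random(0))
print(classify(leaves, (0.5, 0.5)), classify(leaves, (10, 10.5)))
```

The class labels enter only when a node's dominant class and purity are computed, which mirrors the abstract's statement that the class variable is removed before clustering.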
Sometimes, the ADCC method generates weak decision clusters in which no single class dominates. The existence of weak decision clusters can degrade the classification performance of the model: because a weak decision cluster has no dominant class, it is difficult to determine the class of new objects falling in it. It has been shown that classification accuracy can be improved when weak decision clusters are excluded from the model. Weak decision clusters occur because objects of different classes are mixed in the clustering process that generates decision clusters. If we assume that objects in the same class have their own cluster distributions, we can separate objects of different classes according to their class labels and generate a decision cluster tree for each class. We then combine the decision clusters of the different classes to form the decision cluster classification model. In this way, weak decision clusters can be avoided. We propose a Decision Cluster Forest (DCF) method that builds a set of decision cluster trees (a decision cluster forest), which together form a classification model. Instead of building a single decision cluster tree from the entire training data, we build a set of cluster trees from subsets of the training data. Each tree in the forest is built from the subset of objects in the same class. The premise of this method is that objects in the same class tend to have their own spatial distributions in the data space, so decision clusters of objects in the same class can be found. The decision clusters in the same tree have the same dominant class, so no weak cluster is created in such a tree. A decision cluster model can be selected from the set of leaf decision clusters of the decision cluster forest, so the model is called a decision cluster forest classification (DCFC) model.
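The per-class idea behind the decision cluster forest can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis's DCF method: each per-class tree is replaced by a single flat k-means run with a fixed number of clusters, and the names `kmeans_centers`, `build_forest_model`, `classify` and `k_per_class` are hypothetical.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_centers(points, k, rng, iters=20):
    """Plain k-means on one class's objects; returns the k centers."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[j].append(p)
        # An empty cluster keeps its previous center.
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers


def build_forest_model(X, y, k_per_class=2, seed=0):
    """Cluster each class separately, so every decision cluster has
    exactly one class and every class is represented in the model."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, label in zip(X, y):
        by_class[label].append(x)
    model = []
    for label, pts in by_class.items():
        k = min(k_per_class, len(pts))
        model += [(center, label) for center in kmeans_centers(pts, k, rng)]
    return model


def classify(model, x):
    """Label x with the class of the nearest decision cluster center."""
    return min(model, key=lambda dc: sq_dist(x, dc[0]))[1]


# Toy data (made up): class "a" forms two separate groups, class "b" one.
X = [(0, 0), (0, 1), (0, 10), (0, 11), (5, 5), (5, 6), (6, 5)]
y = ["a", "a", "a", "a", "b", "b", "b"]
model = build_forest_model(X, y)
print(classify(model, (0, 10.5)), classify(model, (5.5, 5.5)))
```

Because clustering never mixes objects of different classes, no cluster in this sketch can lack a dominant class, which is the property the abstract attributes to the forest.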
The decision cluster forest method is advantageous for classifying data with multiple classes because the DCFC model is guaranteed to contain decision clusters for all classes. The DCFC model is thus a more intuitive and direct multi-class classification method.
We also propose a different classification method based on the tree structure: a Crotch Ensemble classification model for high dimensional data with multiple classes. A crotch, generated from a decision cluster tree, is an inner node of the tree together with its direct children. If the dominant classes of the children of a crotch are not all the same, the crotch is defined as a crotch predictor, which is a classifier by itself. A crotch ensemble consists of a set of crotch predictors. When a new object is classified, a subset of crotch predictors is selected according to the distances between the object and the crotches, and the object is assigned the class predicted by the crotch predictors with the maximum accumulated weight. Experimental results on both synthetic and real data show that the Crotch Ensemble model is efficient and effective in classifying new samples.
We further propose a special application of our framework to text data classification, in which a subspace clustering algorithm is integrated to build the decision cluster tree and the cosine distance metric is adopted. Experimental results show that our framework can integrate different clustering algorithms and other possible methods and can achieve better classification results for text classification. Finally, we give a theoretical analysis of the error bound of our DCC model. We prove that our cluster-based classification model (the DCC model) is better than the object-based classification method.
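The crotch ensemble voting step described in the abstract can be sketched roughly as below. The abstract does not specify how crotches are selected, how a single crotch predicts, or how weights are computed, so this sketch assumes: the m crotches whose parent centers are nearest to the object are selected, each crotch predicts the dominant class of its nearest child, and each vote is weighted by inverse distance. The `Node` class, `crotch_predictors`, `classify` and `m` are hypothetical names, and the tree is built by hand for illustration.

```python
import math
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    center: tuple          # cluster center of this node
    label: str             # dominant class of this node
    children: list = field(default_factory=list)


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def crotch_predictors(root):
    """Collect inner nodes whose children do not all share one class."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.children:
            if len({child.label for child in node.children}) > 1:
                found.append(node)
            stack.extend(node.children)
    return found


def classify(predictors, x, m=2):
    """Weighted vote over the m nearest crotch predictors.
    Selection rule and inverse-distance weights are assumptions."""
    if not predictors:
        raise ValueError("no crotch predictors available")
    nearest = sorted(predictors, key=lambda n: dist(x, n.center))[:m]
    votes = defaultdict(float)
    for crotch in nearest:
        child = min(crotch.children, key=lambda c: dist(x, c.center))
        votes[child.label] += 1.0 / (1e-9 + dist(x, crotch.center))
    return max(votes, key=votes.get)


# Hand-built toy tree (made up): the left subtree is pure, the right is mixed.
left = Node((0, 0), "a", [Node((0, -1), "a"), Node((0, 1), "a")])
right = Node((10, 10), "b", [Node((9, 10), "b"), Node((11, 10), "c")])
root = Node((5, 5), "a", [left, right])

predictors = crotch_predictors(root)
print(len(predictors), classify(predictors, (11, 10.2)))
```

In this toy tree the left inner node is not a crotch predictor because both of its children have the same dominant class, so only the root and the right inner node take part in the vote.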
Description: xv, 144 p. : ill. ; 30 cm.
PolyU Library Call No.: [THS] LG51 .H577P COMP 2010 LiY
Rights: All rights reserved.
Type: Thesis
URI: http://hdl.handle.net/10397/4086
Appears in Collections: COMP Theses
PolyU Electronic Theses

Files in This Item:
File | Description | Size | Format
b23930512_link.htm | For PolyU Users | 162 B | HTML
b23930512_ir.pdf | For All Users (Non-printable) | 2.73 MB | Adobe PDF


All items in the PolyU Institutional Repository are protected by copyright, with all rights reserved, unless otherwise indicated. No item in the PolyU IR may be reproduced for commercial or resale purposes.