PolyU IR
 


Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/909

Title: A genetic algorithm based approach for clustering categorical data
Authors: Lee, Ho-kei Sean
Subjects: Cluster analysis -- Data processing; Algorithms; Hong Kong Polytechnic University -- Dissertations
Issue Date: 2006
Publisher: The Hong Kong Polytechnic University
Abstract: Given a database of records, clustering is concerned with grouping similar records into clusters based on their attribute values. Many algorithms have been proposed to address the clustering problem, but most of them were developed mainly to handle continuous-valued data. Relatively little attention has been paid to the clustering of categorical data. Given that this kind of data is very commonly collected in many applications in business, medicine, and the social sciences, it is important that an effective clustering algorithm be developed to handle such data. In this thesis, we propose such an algorithm. It is based on a simple genetic algorithm (GA) that employs a probabilistic search technique to find solutions that are optimal or near-optimal according to some performance criteria. This GA-based clustering algorithm uses an encoding scheme that can encode clustering results in chromosomes effectively. To work with this scheme, we also propose a set of genetic operators that facilitate the exchange of clustering information between chromosomes on the one hand and allow variations to be introduced on the other. For the proposed GA to work well, we have also introduced a fitness function to evaluate clustering quality. It is based on an information-theoretic measure of how much the presence of a particular attribute value supports or refutes the classification of a record into a specific cluster. The higher the fitness value of a chromosome, the better the solution it encodes. Unlike traditional algorithms, the proposed GA-based clustering algorithm has the advantage that it can automatically determine the number of clusters hidden in a dataset. The proposed algorithm has been tested with both simulated and real data; the results show that it is very promising and can have many real applications.
Degree: M.Phil., Dept. of Computing, The Hong Kong Polytechnic University, 2006.
Description: vii, 103 leaves : ill. ; 31 cm.
PolyU Library Call No.: [THS] LG51 .H577M COMP 2006 Lee
Rights: All rights reserved.
Type: Thesis
URI: http://hdl.handle.net/10397/909
Appears in Collections: COMP Theses; PolyU Electronic Theses
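
Illustrative sketch (not taken from the thesis): the abstract describes a GA that encodes a clustering in a chromosome, recombines and mutates chromosomes, and scores them with an information-theoretic fitness. The Python below is a minimal sketch of that general idea under stated assumptions. The label-per-record encoding, the one-point crossover, the random-reassignment mutation, and the fitness (mutual information between cluster label and each attribute, divided by the cluster-label entropy) are all stand-ins chosen for illustration; the thesis's actual encoding, operators, and measure are not given in this record. The names `entropy`, `fitness`, `crossover`, `mutate`, `evolve`, and the parameter `max_k` are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(values):
    """Shannon entropy (in nats) of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def fitness(labels, records):
    """Assumed stand-in fitness: mean over attributes of I(cluster; attribute) / H(cluster)."""
    h_c = entropy(labels)
    if h_c == 0.0:  # a single cluster carries no information about any attribute
        return 0.0
    n_attrs = len(records[0])
    total = 0.0
    for j in range(n_attrs):
        column = [r[j] for r in records]
        joint = list(zip(labels, column))
        mutual_info = h_c + entropy(column) - entropy(joint)
        total += mutual_info / h_c
    return total / n_attrs


def crossover(a, b, rng):
    """One-point crossover on two label strings (assumed operator)."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(labels, max_k, rate, rng):
    """Reassign each gene to a random cluster label with probability `rate`."""
    return [rng.randrange(max_k) if rng.random() < rate else g for g in labels]


def evolve(records, max_k=4, pop_size=30, generations=60, rate=0.05, seed=0):
    """Evolve label strings; the cluster count is the number of distinct labels used."""
    rng = random.Random(seed)
    n = len(records)
    pop = [[rng.randrange(max_k) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ch: fitness(ch, records), reverse=True)
        elite = pop[: pop_size // 2]  # keep the better half unchanged (elitism)
        children = [
            mutate(crossover(*rng.sample(elite, 2), rng), max_k, rate, rng)
            for _ in range(pop_size - len(elite))
        ]
        pop = elite + children
    best = max(pop, key=lambda ch: fitness(ch, records))
    return best, fitness(best, records)


if __name__ == "__main__":
    # Toy categorical records with two obvious groups (made-up data).
    data = [("red", "small"), ("red", "small"), ("red", "small"),
            ("blue", "large"), ("blue", "large"), ("blue", "large")]
    best_labels, best_fit = evolve(data)
    print(best_labels, round(best_fit, 3), len(set(best_labels)), "clusters")
```

Dividing by the cluster-label entropy is one simple way to keep the score from rewarding ever more clusters: only `max_k` bounds the number of labels, and the count actually used emerges from the search. That is the sense in which this sketch "determines" the number of clusters; it should not be read as the thesis's mechanism.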

Files in This Item:

File                 Description                    Size     Format
b20697260_ir.pdf     For All Users (Non-printable)  1.4 MB   Adobe PDF
b20697260_link.htm   For PolyU Users                166 B    HTML





All items in the PolyU Institutional Repository are protected by copyright, with all rights reserved, unless otherwise indicated.
No item in the PolyU IR may be reproduced for commercial or resale purposes.

 

© Pao Yue-kong Library, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong