Back to ComputerTerms, InformationRetrieval

Cluster analysis is a statistical technique used to generate a category structure. The groups that are formed should have a high degree of association between members of the same group and a low degree of association between members of different groups.

= Similarity Measures =

The similarity between two documents Di and Dj can be measured with the Dice coefficient:

{{{
             2C
S(Di,Dj) = -------
            A + B
}}}

where C is the number of terms that Di and Dj have in common, and A and B are the number of terms in Di and Dj respectively.

A similarity matrix holds the similarity measure between each pair of documents. Since the measure is symmetric, only the lower triangle needs to be stored:

{{{
| S21                     |
| S31  S32                |
| ...                     |
| SN1  SN2  ...  SN(N-1)  |
}}}

= Clustering Methods =

'''Nonhierarchical methods (partitioning methods)''' divide the data set of N objects into M clusters, where ''no overlap is allowed''. Since finding the optimal grouping of N objects into M sets, on the order of Combinations(N|M) = N!/((N-M)!M!), is infeasible for large sets, heuristics are used instead. Computational requirements are then usually O(NM).

'''Hierarchical methods''' produce a nested data set in which pairs of items or clusters are successively linked until every item in the data set is connected. The ''agglomerative'' approach performs N - 1 pairwise joins, beginning from an unclustered data set.

A good resource on this is http://www-csli.stanford.edu/~schuetze/completelink.html

== Nonhierarchical Methods ==

These were used in early research, but are now considered obsolete.

=== Single Pass Method ===
 1. Assign the first document D1 as the representative for cluster C1.
 1. For each document Di, calculate the similarity S with the representative of each existing cluster.
 1. If the maximum similarity Smax is greater than a threshold value Sr, add Di to the corresponding cluster and recalculate the cluster representative; otherwise, use Di to initiate a new cluster.
 1. If any document Di remains to be clustered, return to step 2.

=== Reallocation Methods ===
 1. Select M cluster representatives or centroids.
 1. For i = 1 to N, assign Di to the most similar centroid.
 1. For j = 1 to M, recalculate the cluster centroid Cj.
 1. Repeat steps 2-3 until there is little or no change in cluster membership during a pass through the file.

== Hierarchical Cluster Methods (HACMs) ==

The procedure starts with each object in its own cluster. Objects are then linked together using one of the link methods described below until all the objects have been merged into a single hierarchically organized cluster.

=== Link Methods ===
 * '''Single Link''': joins the most similar pair of objects that are not yet in the same cluster.
 * '''Complete Link''': uses the least similar pair between two clusters to determine the intercluster similarity; this results in small, tightly bound clusters.
 * '''Group Average Link''': uses the average value of the pairwise links within a cluster to determine similarity.
 * '''Ward's Method''': also known as the minimum variance method, because at each stage it joins the cluster pair whose merger minimizes the increase in the total within-group error sum of squares, based on the Euclidean distance between centroids.

=== General Algorithm ===
 1. Identify the two closest points and combine them into a cluster.
 1. Identify and combine the next two closest points (treating existing clusters as points).
 1. If more than one cluster remains, return to step 2.

= Document Retrieval from a Clustered Data Set =

This is what we are working toward, SO THIS IS THE MOST IMPORTANT ISSUE! If we don't get good results here, what's the use?

== Retrieval Approaches ==
 1. '''Top down''': start at the root and work down through the cluster hierarchy, choosing the most similar subcluster at each level, until some criterion is met: a target number of documents is reached, or the similarity drops below a threshold value. This works well with the complete link method.
 1. '''Bottom up''': begin with some document or cluster at the base of the tree and move up until the retrieval criterion has been met. This method gives the best results, apart from the complete link method.

Back to ComputerTerms, InformationRetrieval
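The single pass method and the Dice coefficient described above can be combined into a short, runnable sketch. This is a minimal illustration only: it assumes documents are represented as sets of terms, uses the union of member term sets as a cluster representative, and the threshold value and all names are illustrative rather than taken from any particular IR system.

```python
def dice(a, b):
    """Dice similarity between two term sets: S = 2C / (A + B)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def single_pass(docs, threshold=0.5):
    """Cluster documents (term sets) in a single pass.

    Returns a list of clusters, each a dict holding the member document
    indices and the cluster representative (union of member term sets).
    """
    clusters = []
    for i, doc in enumerate(docs):
        # Find the cluster whose representative is most similar to Di.
        best, smax = None, -1.0
        for cluster in clusters:
            s = dice(doc, cluster["rep"])
            if s > smax:
                best, smax = cluster, s
        if best is not None and smax > threshold:
            # Smax exceeds the threshold: join the cluster and
            # recalculate its representative.
            best["members"].append(i)
            best["rep"] |= doc
        else:
            # Otherwise Di initiates a new cluster.
            clusters.append({"members": [i], "rep": set(doc)})
    return clusters

docs = [
    {"cluster", "analysis", "retrieval"},
    {"cluster", "analysis", "methods"},
    {"cooking", "recipes"},
]
result = single_pass(docs, threshold=0.4)
print([c["members"] for c in result])  # → [[0, 1], [2]]
```

Note that, as the text implies, the result depends on the order in which documents are processed: presenting the same documents in a different order can produce different clusters, which is one reason these single-pass partitioning methods fell out of favor.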