Back to ComputerTerms, InformationRetrieval

Cluster analysis is a statistical technique used to generate a category structure.The groups which are formed should have a high degree of association between members of hte same group and a low degree between members of different groups.

Similarity Measures:

               2C
 S         = -------
  (Di,dj)     A + B 

   Where C is the number of terms that Di and Dj have in common, 
   and A and B are the number of termsin Di and Dj

Similarity Matrix calculates a similarity measure between document x and y

  | S21                 |
  | S31  S32            |
  |  ...                |
  | SN1  SN2 ...SN(N-1) |

Clustering Methods

Nonhierarchical methods (partitioning methods) divide the data set of N objects into M clusters, where no overlap is allowed. Since discovering the optimal grouping of N objects into M sets is Combinations(n|M)=N!/((N-M)!M!) is impossible for large sets, so we use heuristics. Computational requirements are then usually O(NM)

Hierarchical methods produce a nested data set in which pairs of items or clusters are successively linked until every item in the data set is connected. The agglomerative method performs in N - 1 pairwise joins beginning from an unclustered dataset. A good resource on this is http://www-csli.stanford.edu/~schuetze/completelink.html

NonHierarchical Methods

Used in early research, but now considered obsolete.

single pass method

Assign the first document D1 as the representative for C1.
For Di, caclulate the similarity S with the representative for each existing cluster.
If Smax is greater than a threshold value Sr, add the item to the corresponding cluster and recaclulate teh cluster representative; otherwise, use Di to initiate a new cluster.
If an item Di remains to be clustered, return to step 2.

Reallocation Methods

Select M cluster representatives or centroids
For i=1 to N, assign Di to the most similar centroid
For j=1 to M, recaclulate the cluster centroid Cj
Repeat 2-3 until there is little or no change in cluster membership during a pass through the file.

== Hierarchical Cluster Methods HACMs==

The procedure is to start off with each object in its own cluster. We then start linking objects together using a link method (described below) until all the objects have been merged into a single heirarchically organized cluster.

Link Methods

Single Link: Joins the most similar pair of objects that are not yet in the same cluster
Complete Link: Uses the least similar pairs between two clusters to determine the intercluster similarity - this results in small tightly bound clusters.
Group Avaerage Link: Uses average values of pairwise links within a cluster to determine similarity.
Ward's Method: also known as the minimum variance method because it joins at each stage teh cluster pair whose merger minimizes the increase in the total within-group error sum of squares based on the Euclidean distance between centroids.

General Algorithm

Identify the two closest points and combine them in a cluster.
Identify and combine the next two closest points (treating existing clusters as points).
If more than one cluster remains, return to step 2.

Document Retrieval from a Clustered Data Set

This is what we are working for, SO THIS IS THE MOST IMPORTANT ISSUE! If we don't get good results here, whats the use!

Retrieval Approaches

Top Down: Start at the root and work your way down through the cluster hierarchy choosing the most similar sub-cluster until some criterion is met: Number of documents or similarity drops below a threshold value. Works well with the complete link method.
Bottom up: begins with some document or cluster at the base of the tree and moves up until the retrieval criterion has been met. This method gives the best results apart from the complete link method.