
The Model

Cluster analysis is a set of techniques for discovering structure (groupings) within a complex body of data, such as the segmentation-basis data matrix.

There are two basic classes of clustering methods:
  • Hierarchical methods, which build up or break down the data row by row, and
  • Partitioning methods, which break the data into a pre-specified number of groups and then reallocate or swap data to improve some statistical measure of fit (e.g., the ratio of the within-group to between-group variation).

    Our software includes one method of each type: Ward's (1963) method (hierarchical) and K-means (partitioning).

    Ward's method forms clusters based on the change in the error sum of squares associated with joining any pair of clusters. At each step it joins the pair of clusters whose merger increases the error sum of squares the least.
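
    As a rough illustration of this criterion (not the software's actual code), the Python sketch below computes the increase in the error sum of squares for every pair of clusters and picks the cheapest merge. The function names and the three sample points are hypothetical and chosen only for this example.

```python
def centroid(points):
    """Mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def error_sum_of_squares(points):
    """Sum of squared Euclidean distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)


def ward_merge_cost(cluster_a, cluster_b):
    """Increase in the total error sum of squares caused by joining two clusters."""
    merged = cluster_a + cluster_b
    return (error_sum_of_squares(merged)
            - error_sum_of_squares(cluster_a)
            - error_sum_of_squares(cluster_b))


def cheapest_merge(clusters):
    """Return (cost, i, j) for the pair of clusters whose merge adds the least error."""
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            cost = ward_merge_cost(clusters[i], clusters[j])
            if best is None or cost < best[0]:
                best = (cost, i, j)
    return best


# Hypothetical data: three single-item clusters in two dimensions.
clusters = [[(1.0, 1.0)], [(1.0, 2.0)], [(8.0, 8.0)]]
cost, i, j = cheapest_merge(clusters)
print(f"join clusters {i} and {j}, increase in error sum of squares = {cost}")
```

    Repeating this step until one cluster remains would produce the full hierarchy; the sketch shows only a single merge decision.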

    Partitioning methods are used most often when the analyst has a large data set. These methods are computationally efficient and their output is much easier to interpret when many items (such as 50 or more customers) are being clustered. Unlike hierarchical methods, they do not allocate an item to a cluster irrevocably: the routine will reallocate an item if doing so improves the statistical fit of the solution. These methods do not develop a tree-like structure; rather they start with cluster centers and assign those individuals closest to each cluster center to that cluster.

    The K-Means clustering procedure works as follows:

    1. The analyst specifies the number of clusters (n).
    2. The routine begins with n (analyst-specified) starting points and allocates every item to its nearest cluster center.
    3. It then reallocates items one at a time to reduce the sum of internal cluster variability until it has minimized the fit criterion (the sum of the within-cluster-sums of squares) for n clusters.
    4. After completing step 3, you may return to step 1 and repeat the procedure for a different number of clusters.
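
    The steps above can be sketched in Python as follows. This is a minimal illustration, not the routine used by the software: it assumes Euclidean distance and uses the common batch update (reassign all items, then recompute the centers) rather than reallocating items one at a time. The function names and the sample data are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def k_means(items, starting_points, max_iterations=100):
    """Batch K-means: returns (assignment, centers, within-cluster sum of squares)."""
    centers = [tuple(c) for c in starting_points]
    assignment = None
    for _ in range(max_iterations):
        # Allocate every item to its nearest cluster center.
        new_assignment = [
            min(range(len(centers)), key=lambda k: squared_distance(item, centers[k]))
            for item in items
        ]
        if new_assignment == assignment:
            break  # no item changed cluster, so the fit cannot improve further
        assignment = new_assignment
        # Move each center to the mean of its current members.
        for k in range(len(centers)):
            members = [item for item, a in zip(items, assignment) if a == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = mean_point(members)
    within_ss = sum(squared_distance(item, centers[a])
                    for item, a in zip(items, assignment))
    return assignment, centers, within_ss


# Hypothetical data: six items, two analyst-specified starting points (n = 2).
items = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
assignment, centers, within_ss = k_means(items, [(1.0, 1.0), (8.0, 8.0)])
print(assignment, centers, within_ss)
```

    Calling k_means again with a different number of starting points corresponds to step 4, and comparing within_ss across runs is one way to compare the resulting solutions.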

    The solution to K-means clustering is sensitive to the selection of starting points in step 2 above; we recommend using the cluster centroids from Ward's procedure to give good starting points.


    Getting started

    Before you start running our software, please take a look at the Compatibility List.
    Because the software makes intensive use of various Web technologies, you may experience difficulties running it in your browser.
    We apologize for any inconvenience this may cause and are working on full browser compatibility.

    You may have to enable JavaScript and cookies and/or accept ActiveX controls. Please refer to the Compatibility List.

    Read this help section and the Tutorial carefully before running the software. You can also see how the software actually works by running the Demo.

    You can download two example files (Conglomerate's PDA Case) here and load them into the software.

    To learn how to write your own data sets for our software, click here.


    Running the software

    Step-by-step instructions on how to run the software can be found in the Demo section.
    There you will find a complete, commented demo of the Cluster Analysis.


    Understanding the results

    After forming segments/clusters with our software, you need to interpret the results and link them to managerial actions.
    This is a critical activity because the targeting and positioning decisions depend on the segments you choose to retain.

    You must address at least the following issues:

  • Are there really any distinct clusters?
  • How many clusters should you retain?
  • How good (interpretable) and robust (stable) are the clusters?
  • How should the clusters be profiled?


    What if there really are no clusters?

    Don't overlook this possibility. If only one or two basis variables show meaningful differences between respondents, it is possible that no really distinct segments exist in the market.
    This could be because you selected a poor set of segmentation bases, or because customer needs in your sample really do not differ much.
    If the revealed segment structure is weak, then you should build up a rich picture of likely customers, profiling them with the descriptor variables and any information obtained by exploratory research methods.

    How many clusters should I retain?

    This decision involves both art (the purpose of the study) and science (statistical criteria). It is useful to generate a number of potential segmentation schemes and, using statistical criteria, identify the best two or three of these. Then the managers or users can help decide which of these remaining schemes will be most useful.

    How good are your clusters?

    That is, how well do the clusters obtained from this particular sample of individuals generalize to the market as a whole? Too few segmentation studies try to answer this question. Even if the sample is representative, there may be measurement problems or the analyst may have made some poor choices.
    Good cluster solutions are usually robust; that is, they are generally stable across methods of segmentation and across measures of association.
    A good way to check how robust the clusters are is to use different methods (or different measures of association) and then do a cross-tabulation to see how many respondents get assigned to the same groups by the different methods (or different association measures) - the higher the percentage the better.
    Another way to test robustness is to split the data randomly into two groups, run the segmentation analysis separately on each half of the data and then compare results. Again, the higher the level of agreement the better. If different approaches assign less than 70% or so of cases to the same groups, you should view the segmentation structure with caution.
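
    One way to compute such an agreement percentage is sketched below in Python. Because cluster numbers are arbitrary (cluster 1 from one method may correspond to cluster 3 from another), the sketch cross-tabulates the two assignments and then tries every one-to-one relabelling, keeping the best match. This brute-force search is an assumption that is only practical for a handful of clusters, and the function names and the two sample assignments are hypothetical.

```python
from itertools import permutations


def cross_tabulate(labels_a, labels_b):
    """Count how many cases fall into each (label_a, label_b) cell."""
    table = {}
    for a, b in zip(labels_a, labels_b):
        table[(a, b)] = table.get((a, b), 0) + 1
    return table


def agreement_rate(labels_a, labels_b):
    """Share of cases assigned to matching groups under the best relabelling."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both assignments must cover the same cases")
    groups_a = sorted(set(labels_a))
    groups_b = sorted(set(labels_b))
    if len(groups_a) > len(groups_b):
        # Keep the assignment with fewer groups on the left so every one can be matched.
        return agreement_rate(labels_b, labels_a)
    table = cross_tabulate(labels_a, labels_b)
    best = 0
    for candidate in permutations(groups_b, len(groups_a)):
        matched = sum(table.get((a, b), 0) for a, b in zip(groups_a, candidate))
        best = max(best, matched)
    return best / len(labels_a)


# Hypothetical assignments of ten respondents by two different methods.
method_1 = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
method_2 = [1, 1, 1, 0, 0, 2, 2, 2, 2, 2]
rate = agreement_rate(method_1, method_2)
print(f"agreement: {rate:.0%}")
if rate < 0.70:
    print("view the segmentation structure with caution")
```

    The same function can compare the two halves of a split sample, provided the cases in each half have first been assigned to clusters on a common basis.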
    A second aspect of segmentation quality relates to managerial usefulness: do the users (salespeople, call center personnel, advertising managers) find the results valuable? One way to help understand managerial usefulness is to see if managers can invent intuitively appealing names for the segments. Normally, one looks at cluster means (the average values of the basis and descriptor variables for each segment) and sees if one can characterize the segment member.

    Profiling your segments using discriminant analysis:

    The idea behind cluster profiling is to prepare a picture of the revealed clusters based on both the variables used for the clustering (the segmentation bases) and those used to identify and target the segments (the descriptors). The most direct approach to profiling is to compare the average values of the bases and the descriptor variables in each cluster.
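
    The comparison of average values can be sketched in a few lines of Python. The variable names (one basis variable and one descriptor) and the four respondents below are hypothetical and serve only to show the calculation.

```python
def cluster_means(rows, assignment, variable_names):
    """Average each variable within each cluster; returns {cluster: {variable: mean}}."""
    sums, counts = {}, {}
    for row, cluster in zip(rows, assignment):
        counts[cluster] = counts.get(cluster, 0) + 1
        totals = sums.setdefault(cluster, [0.0] * len(variable_names))
        for i, value in enumerate(row):
            totals[i] += value
    return {
        cluster: {name: total / counts[cluster]
                  for name, total in zip(variable_names, sums[cluster])}
        for cluster in sums
    }


# Hypothetical data: one basis variable and one descriptor for four respondents.
variable_names = ["price_sensitivity", "age"]
rows = [(2.0, 30), (4.0, 40), (6.0, 50), (8.0, 60)]
assignment = [0, 0, 1, 1]  # cluster membership from an earlier clustering run

profile = cluster_means(rows, assignment, variable_names)
for cluster, means in sorted(profile.items()):
    print(cluster, means)
```

    Reading the resulting table row by row gives a first impression of how the segments differ on each variable.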

    Discriminant analysis seeks combinations of descriptor variables that best separate the clusters or segments.

    The concept behind the Cluster Analysis Dendrogram: An Example:



    This distance matrix yields one dendrogram for single linkage clustering (solid line) and another for complete linkage clustering (dotted line). The cluster (segment) formed by companies 1 and 2 joins with the segment formed by companies 3 and 4 at a much higher level in complete linkage (3.42) than in single linkage (1.81). In both cases company 5 appears to be different from the other companies -- an outlier. A two-cluster solution will have A=5, B=(1,2,3,4), while a three-cluster solution will have A=5, B=(1,2), and C=(3,4).
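
    To show how the two linkage rules can produce different joining levels, here is a small Python sketch. The distance matrix in it is hypothetical: it is not the matrix from the example above, so its joining levels differ from the 1.81 and 3.42 quoted there. Single linkage is taken to mean the smallest distance between members of two clusters and complete linkage the largest.

```python
def linkage_merges(dist, linkage=min):
    """Agglomerate items from a symmetric distance matrix.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    Returns a list of (level, cluster_a, cluster_b) in merge order.
    """
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                level = linkage(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or level < best[0]:
                    best = (level, a, b)
        level, a, b = best
        merges.append((level, clusters[a], clusters[b]))
        merged = tuple(sorted(clusters[a] + clusters[b]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Hypothetical distances between five companies (index 0 is company 1, and so on).
dist = [
    [0.0, 1.0, 2.0, 3.0, 6.0],
    [1.0, 0.0, 2.5, 4.0, 7.0],
    [2.0, 2.5, 0.0, 1.5, 6.5],
    [3.0, 4.0, 1.5, 0.0, 8.0],
    [6.0, 7.0, 6.5, 8.0, 0.0],
]

single = linkage_merges(dist, linkage=min)
complete = linkage_merges(dist, linkage=max)
for name, merges in (("single", single), ("complete", complete)):
    print(name, [(level, a, b) for level, a, b in merges])
```

    With these made-up distances, the pairs (1,2) and (3,4) join each other at level 2.0 under single linkage but only at 4.0 under complete linkage, and company 5 joins last in both cases, which mirrors the pattern described above.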

    Click here for another fully commented example of a Cluster Analysis Dendrogram from our Demo section.