A dendrogram is the graphical representation of the result of a statistical technique called “hierarchical agglomerative clustering”. Hierarchical clustering defines a sequence of N clusterings, one with k clusters for each k ∈ {1, ..., N}, such that the resulting clusters form a nested sequence.
The agglomerative algorithm starts with the initial set of N variables, considered as N singleton clusters. At each step it proceeds by identifying the two most similar clusters and merging them to form a new cluster. This step is repeated until all variables have been merged together into a single cluster.
The similarity between variables is measured by the correlation coefficient, which takes values in the range [-1, 1]:
$$\rho(x, y) = \frac{\operatorname{cov}(x, y)}{\sigma_x \, \sigma_y}$$
where cov(x, y) is the covariance between variables x and y, and σx and σy are the standard deviations of x and y.
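As a worked illustration of the formula, the following minimal Python sketch computes the coefficient directly from the covariance and the two standard deviations. The function name `pearson` and the sample values are made up for the example.

```python
import math


def pearson(x, y):
    """Correlation coefficient: cov(x, y) / (sigma_x * sigma_y)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population covariance and standard deviations; the 1/n factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sigma_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sigma_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if sigma_x == 0 or sigma_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (sigma_x * sigma_y)


# Made-up values: y decreases linearly as x increases.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 8.0, 6.0, 4.0, 2.0]
print(pearson(x, y))  # very close to -1 (perfect negative correlation)
```

A constant variable has zero standard deviation, so the ratio is undefined; the sketch raises an error in that case rather than returning a value.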
Create a Dendrogram
To launch the dendrogram editor, select Visualize > Dendrogram from the menu. Alternatively, click the icon in the sidebar and then add New.
Enter a chart Title.
Select a Record set from the list, if required.
Select variables from the Variable list and click Input. Variables used in dendrograms must be numerical.
Click on Save.
The Dendrogram tool generates two different views:
- Dendrogram tree (see the Dendrogram tab) shows groups of linearly correlated variables and clusters highly correlated variables together on the tree. The closer the value is to 1 or -1, the higher the correlation. The most highly correlated variables are displayed on the right.
- Correlation matrix gives the full set of linear correlation factors, i.e. one for each pair of variables. Positive correlation factors are displayed in green, negative ones in red.
Find a Variable
Variables are listed alphabetically. To find a variable, use the scroll bar or enter the name in the Variable field.
To clone:
- Click More actions > Clone to clone the dendrogram or More actions > Clone as and select Trends, Dendrogram, Summary Chart or Multiplot.
To export data:
Click More actions > General Actions
- Click Download Data. Choose the CSV format (CSV US or CSV EU).
- Click Export Matrix to CSV to download the correlation matrix. Choose the CSV format (CSV US or CSV EU).
To export the graphic:
Click More actions > Export graphic as and select a file format: PDF, PNG or SVG.
To create a new variable selection:
In the Correlation Matrix tab, use the check boxes to select the variables one by one, or select all of them with the first check box (beside the empty field used to filter variables).
- Click on More Actions > Variable Selection. It is possible to create: Variable Set, Fill Missing Values, Differentiated variable, Moving Average, Shifted Variable.
To create different charts:
In the Correlation Matrix tab, use the check boxes to select the variables one by one, or select all of them with the first check box (beside the empty field used to filter variables).
- Click on More Actions > Variable Selection. It is possible to create: Histogram, Trends and a Dendrogram (using the new set of variables).
When to use a dendrogram?
A dendrogram is an effective tool for analyzing similarities among variables and for eliminating variables that are too strongly correlated with one another (and therefore probably carry redundant information). It is also useful for detecting important correlations between a variable of interest and the other variables, for example, between a goal variable and the input variables.
Example Visualization
Interpret the dendrogram and correlation matrix to identify which variables influence the target SUN_ENERGY_WEEK_AVRG (energy gathered from solar panels).
Tips
Here are some tips:
- First, look at the Dendrogram tree view and isolate the variables that are grouped near SUN_ENERGY_WEEK_AVRG with high correlation factor values (the top scale represents the correlation factor in absolute value, with a maximum of 1).
- Go to the Correlation matrix tab, find the SUN_ENERGY_WEEK_AVRG column, and click the column label to sort the correlation factor values. Rank the input variables that most influence SUN_ENERGY_WEEK_AVRG.
Don’t forget to validate your findings! Create a Scatter plot!
The following example illustrates the correlation of FUEL_WEEK_AVRG_MODEL with the energy gathered from solar panels. The minus sign (-) confirms that when there is abundant sunlight, fuel consumption is lower. The minimum correlation coefficient between SUN_ENERGY_WEEK_AVRG and SUN_WEEK_AVGR_HR and FUEL_WEEK_AVRG_MODEL is -0.502452.
How to interpret the Dendrogram tree
The dendrogram shows groups of correlated variables. This view is a graphical summary of the correlation matrix. Note: the dendrogram shows the absolute value of the coefficient. The coefficient itself ranges from -1 to 1, so the strength shown on the dendrogram ranges from 0 to 1: 0 means no correlation and 1 means perfect correlation (positive or negative).
How to interpret the Correlation Matrix
To check which variables are the most correlated with a specific one, click on it. The rows of the table are then sorted by the absolute value of the coefficients related to this variable. Clicking the header lists the variables correlated with the tag in decreasing order (from the most correlated to the least correlated).
Clicking the filter icon applies a filter that keeps only the variables whose correlation coefficient reaches a minimum absolute value.
E.g. click the icon under Tag Name. A field containing > 0.5 appears, which means the column is filtered to show only coefficients greater than 0.5. The field is editable, so you can enter another value, for example, > 0.8.
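The sort-and-filter behaviour described above amounts to a threshold on the absolute value of one row of the matrix. A minimal sketch, assuming the row is available as a name-to-coefficient mapping; the function name `filter_row`, the variable names and the coefficients are hypothetical:

```python
def filter_row(row, min_abs=0.5):
    """Keep coefficients with |rho| > min_abs, sorted from most to least correlated."""
    kept = {name: rho for name, rho in row.items() if abs(rho) > min_abs}
    return dict(sorted(kept.items(), key=lambda kv: abs(kv[1]), reverse=True))


# Hypothetical coefficients of one variable against four others.
row = {"VAR_A": 0.91, "VAR_B": -0.62, "VAR_C": 0.18, "VAR_D": -0.85}
print(filter_row(row))       # {'VAR_A': 0.91, 'VAR_D': -0.85, 'VAR_B': -0.62}
print(filter_row(row, 0.8))  # {'VAR_A': 0.91, 'VAR_D': -0.85}
```

Raising the threshold from 0.5 to 0.8 drops VAR_B, in the same way that editing the filter field narrows the list in the matrix view.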