# Clustering Models

# About Clustering Models

A clustering model is an unsupervised learning algorithm that groups similar records or similar variables. For example, you can use it to identify an operation in a production process, or to find variables that have similar behaviour.

## K-Means

For more information, see the online learning platform

K-means clustering is a method which aims to partition *n* observations into *k* clusters in which each observation belongs to the cluster with the nearest mean.
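The assign-to-nearest-mean idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the DATAmaestro implementation: the function name `kmeans`, the seeding rule (the first *k* points become the initial centers) and the sample data are all assumptions made for the example.

```python
import math

def kmeans(points, k, max_iterations=100):
    """Partition points into k clusters; returns (centers, labels)."""
    # Assumption: the first k points seed the centers (deterministic, for illustration).
    centers = [list(p) for p in points[:k]]
    labels = [None] * len(points)
    for _ in range(max_iterations):
        # Assignment step: each point goes to the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster: converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

# Hypothetical data: two well-separated groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.9, 1.1), (4.9, 5.1)]
centers, labels = kmeans(points, k=2)
print(labels)
```

The result depends on the starting centers, which is one reason a maximum number of iterations and a cluster number have to be chosen with care.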

To launch this model tool, select **Models** > **K-Means** from the menu. Alternatively, click the corresponding icon in the sidebar.

Type of variable

K-Means models can only be created with numerical variables.

Create variable set

### Create a K-Means

The parameters for this method are defined as follows:

- Enter a **Model name**.
- Select a **Datasource** from the list (if applicable).
- Select a **Learning set** from the list.
- Enter a **Cluster name prefix**. The default prefix is "CLUSTER".
- Enter a **Cluster number** (default 3). For more information, see Cluster number.
- Enter a **Maximum number of iterations**. For more information, see Maximum number of iterations.
- Select **Cluster Standardisation**:
  - **None**: no cluster standardization.
  - **Normalize**: transform the variable to have a max of 1 and a min of 0. The value is calculated with: scaled(x) = (x - min)/(max - min), where the min and max values are based on the learning data set.
  - **Standardize**: transform the variable to have a mean of 0 and a standard deviation of 1. The value is calculated with: scaled(x) = (x - µ)/(STDEV), where the average (µ) and standard deviation (STDEV) values are calculated on the learning data set.
- Select the **Calculate cluster silhouette** checkbox. For more information, see Calculate cluster silhouette.
- Select the **Optimize cluster number** checkbox. For more information, see Optimize cluster number.
- Select **variable**(s) from the list for the **Inputs**.
- Click **Save** to generate the clusters.
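The two scaling formulas behind the **Normalize** and **Standardize** options can be tried out as follows. This is a sketch under stated assumptions: the helper names `fit_normalize` and `fit_standardize` are hypothetical, and the population standard deviation is used because the text does not say which form of STDEV the tool applies.

```python
import statistics

def fit_normalize(learning_values):
    """Return scaled(x) = (x - min) / (max - min), with min/max from the learning set."""
    lo, hi = min(learning_values), max(learning_values)
    return lambda x: (x - lo) / (hi - lo)

def fit_standardize(learning_values):
    """Return scaled(x) = (x - mu) / STDEV, with mu/STDEV from the learning set."""
    mu = statistics.fmean(learning_values)
    # Assumption: population standard deviation; the tool may use the sample form.
    sd = statistics.pstdev(learning_values)
    return lambda x: (x - mu) / sd

# Hypothetical learning set for one variable.
learning = [10.0, 20.0, 30.0, 40.0]
normalize = fit_normalize(learning)
standardize = fit_standardize(learning)

print(normalize(25.0))    # inside the learning range
print(normalize(50.0))    # outside the learning range, so the result exceeds 1
print(standardize(25.0))  # the learning-set mean
```

Because the statistics come from the learning set only, values outside its range can fall below 0 or above 1 after normalization. A constant variable would make either denominator zero, which this sketch does not guard against.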

## Subclu

**Subclu** is an unsupervised clustering algorithm that defines groups or patterns within the data based on the density of data points. It marks as outliers the points that lie alone in low-density regions. Each cluster is expanded one dimension at a time, into a dimension that is known to have a cluster that differs from previous clusters in only one dimension. Therefore, it is not necessary to define the number of clusters as in K-Means.
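To make the density idea concrete, here is a small DBSCAN-style sketch over all input dimensions at once. It is not Subclu itself: the subspace-by-subspace expansion is left out, and the function name `density_clusters`, the label `-1` for outliers and the sample data are assumptions for illustration. The `epsilon` and `min_points` arguments play the role of the **Epsilon** and **Minimum points** parameters.

```python
import math

NOISE = -1  # assumed label for outliers in this sketch

def density_clusters(points, epsilon, min_points):
    """Label each point with a cluster index, or NOISE for low-density points."""
    labels = [None] * len(points)

    def neighbours(i):
        # All points within epsilon of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= epsilon]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_points:
            labels[i] = NOISE  # may later become a border point of a cluster
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not dense
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_points:
                queue.extend(reachable)  # dense point: keep expanding
    return labels

# Hypothetical data: two dense groups and one isolated point.
points = [(0.0, 0.0), (0.05, 0.0), (0.0, 0.05),
          (1.0, 1.0), (1.05, 1.0), (1.0, 1.05),
          (3.0, 3.0)]
labels = density_clusters(points, epsilon=0.1, min_points=3)
print(labels)
```

The number of clusters is never passed in; it falls out of the density parameters, and the isolated point is reported as an outlier.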

### Create Subclu

The parameters for this method are defined as follows:

- Enter a **Model name**.
- Select a **Datasource** from the list (if applicable).
- Select a **Learning set** from the list.
- Select variable(s) from the list for the **Input**.
- Enter a **Cluster name prefix**. The default prefix is "CLUSTER-".
- Enter a **Variable suffix**. The default suffix is "SUBCLU-1 Cluster".
- Enter an **Epsilon** (default 0.1). For more information, see Epsilon.
- Select **Calculate cluster silhouette**: yes or no.
- Enter a **Minimum points** value (default 10). For more information, see Minimum points.
- Enter **Min cluster Dimensions**.
- Select **Cluster Standardisation**:
  - **None**: no cluster standardization.
  - **Normalize**: transform the variable to have a max of 1 and a min of 0. The value is calculated with: scaled(x) = (x - min)/(max - min), where the min and max values are based on the learning data set.
  - **Standardize**: transform the variable to have a mean of 0 and a standard deviation of 1. The value is calculated with: scaled(x) = (x - µ)/(STDEV), where the average (µ) and standard deviation (STDEV) values are calculated on the learning data set.
- Click **Save** to generate the clusters.

## IMS

*Inductive Monitoring System (IMS)* is a new clustering method available in DATAmaestro, based on the distance between data points. It is not necessary to define the number of clusters as in K-Means. IMS checks, point by point, whether each point can be included in an existing cluster: first by comparing it with the min and max values of the cluster in each dimension, and second by comparing its distance to the center of the cluster. If the point cannot be included in an existing cluster, a new cluster is created.
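The point-by-point procedure can be sketched as below. This is one possible reading of the description, not the DATAmaestro implementation: the function name `ims_cluster`, the rule that a point joins a cluster when it lies inside the cluster's min/max box or within `epsilon` of its center, and the sample data are all assumptions.

```python
import math

def ims_cluster(points, epsilon=0.1):
    """Assign each point to the first cluster that accepts it, else open a new one."""
    clusters = []  # each cluster: {"min": [...], "max": [...], "points": [...]}
    labels = []
    for p in points:
        assigned = None
        for idx, c in enumerate(clusters):
            # First check: is the point inside the cluster's min/max range
            # in every dimension?
            inside = all(lo <= x <= hi
                         for x, lo, hi in zip(p, c["min"], c["max"]))
            # Second check: is it close enough to the cluster center?
            center = [sum(col) / len(c["points"]) for col in zip(*c["points"])]
            if inside or math.dist(p, center) <= epsilon:
                assigned = idx
                break
        if assigned is None:
            # No existing cluster accepts the point: create a new cluster.
            clusters.append({"min": list(p), "max": list(p), "points": [p]})
            assigned = len(clusters) - 1
        else:
            # Grow the accepted cluster's bounds to include the new point.
            c = clusters[assigned]
            c["min"] = [min(a, b) for a, b in zip(c["min"], p)]
            c["max"] = [max(a, b) for a, b in zip(c["max"], p)]
            c["points"].append(p)
        labels.append(assigned)
    return labels, clusters

# Hypothetical data, processed in order.
points = [(0.0, 0.0), (0.05, 0.05), (1.0, 1.0), (1.02, 0.98), (0.02, 0.03)]
labels, clusters = ims_cluster(points, epsilon=0.1)
print(labels)
```

In this reading, a larger `epsilon` lets clusters absorb more distant points, so fewer new clusters are opened.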

### Create IMS

The parameters for this method are defined as follows:

- Enter a **Model name**.
- Select a **Datasource** from the list (if applicable).
- Select a **Learning set** from the list.
- Enter a **Cluster name prefix**. The default prefix is "CLUSTER-".
- Select **Standardisation**:
  - **None**: no cluster standardization.
  - **Normalize**: transform the variable to have a max of 1 and a min of 0. The value is calculated with: scaled(x) = (x - min)/(max - min), where the min and max values are based on the learning data set.
  - **Standardize**: transform the variable to have a mean of 0 and a standard deviation of 1. The value is calculated with: scaled(x) = (x - µ)/(STDEV), where the average (µ) and standard deviation (STDEV) values are calculated on the learning data set.
- Enter an **Epsilon** (default 0.1). **Epsilon** is the maximal distance for a point to be in a cluster. A larger value tends to lead to a lower number of clusters.
- Select **Calculate cluster silhouette**: yes or no.
- Enter a **Variable name prefix**. The default prefix is "IMS-1 Cluster".
- Select the **Variable Set**.
- Select variable(s) from the list for the **Variable**.
- Click **Save** to generate the clusters.

Visualize K-Means, Subclu and IMS results

To visualize the K-Means, Subclu and IMS results, use the scatter plot: choose the variables for the x and y axes, then set the condition to the Nearest CLUSTER-Name.

## Hierarchical Clustering

For more information, see the online learning platform

Hierarchical clustering is a model that is viewed as a dendrogram. A dendrogram is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. For more information, see Dendrograms.
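A dendrogram records the order in which clusters are merged and the distance at which each merge happens. The sketch below builds that merge list bottom-up. It assumes single linkage (distance between the two closest members), which the text does not specify; the function name `agglomerate` and the sample data are also illustrative only.

```python
import math

def agglomerate(points):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    # Start with one cluster per point; ids 0..n-1 are the leaves of the tree.
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Assumption: single linkage, the closest pair of members.
                d = min(math.dist(points[p], points[q])
                        for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Merge the closest pair into a new cluster (an inner node of the tree).
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges

# Hypothetical one-dimensional data.
points = [(0.0,), (1.0,), (5.0,), (5.5,)]
merges = agglomerate(points)
for a, b, d in merges:
    print(f"merge {a} and {b} at distance {d}")
```

Each tuple corresponds to one junction in the tree, and the merge distance gives the height at which the junction is drawn.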

Learning set empty

Certain models in DATAmaestro are able to handle missing values, while other models are not. For example, clustering methods such as K-Means cannot handle missing values. If a row of data has a missing value, even for just one variable, the algorithm ignores the whole row. The message “Learning set is empty” indicates that the algorithm has removed every row because each one contains at least one missing value. If you have any variables with a high number of missing values, it is recommended to remove them or to use the “Fill missing values” tool under the “Transform” menu in DATAmaestro Analytics.
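The effect of row-wise removal can be reproduced on a small table. The sketch below is an illustration only: the helper names `complete_rows` and `missing_counts`, the variable names and the data are hypothetical, and missing values are assumed to be represented as `None` or NaN.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def complete_rows(rows):
    """Keep only the rows that have no missing value in any variable."""
    return [row for row in rows if not any(is_missing(v) for v in row)]

def missing_counts(rows, names):
    """Count missing values per variable, to spot the ones worth removing."""
    return {name: sum(is_missing(row[i]) for row in rows)
            for i, name in enumerate(names)}

# Hypothetical data: every row has at least one missing value.
names = ["temp", "flow", "pressure"]
rows = [(1.0, None, 3.0), (2.0, None, 4.0), (None, 5.0, 6.0)]

usable = complete_rows(rows)
counts = missing_counts(rows, names)
print(len(usable), counts)

# Removing the variable with the most missing values ("flow") recovers rows.
without_flow = [(r[0], r[2]) for r in rows]
usable_without_flow = complete_rows(without_flow)
print(len(usable_without_flow))
```

Here no row survives with all three variables, which is the situation the “Learning set is empty” message describes. Dropping the sparsest variable leaves two usable rows.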

Calculation time

In DATAmaestro, the calculation time can vary depending on the number of records, the number of input variables and the type of algorithm being used. For example, Subclu is significantly slower than K-Means for larger data sets. If you have a large dataset, it is recommended to use K-Means.