Clustering Models
About Clustering Models
A clustering model is an unsupervised learning algorithm that groups similar records or similar variables. For example, you can use a clustering model to identify an operation in a production process, or to find variables that have similar behaviour.
K-Means
For more information, see the online learning platform
K-means clustering is a method which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
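The assign-and-update loop behind this definition can be sketched in a few lines of Python. This is a minimal illustration of the standard K-Means iteration (Lloyd's algorithm), not DATAmaestro's implementation; the function name kmeans, the seeding from the first k points, and the sample data are assumptions made for the example.

```python
import math

def kmeans(points, k, max_iter=100):
    """Standard K-Means loop. Seeding with the first k points is an
    arbitrary choice made for this sketch, not a product setting."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation joins the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the means no longer move
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned observations.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical data: two well-separated groups of two points each.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The result depends on the initial centroids, so a different seeding can converge to a different partition of the same data.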
To launch this model tool, select Models > K-Means from the menu. Alternatively, click the corresponding icon in the sidebar.
Type of variable
K-Means models can only be created with numerical variables.
Create variable set
Create a K-Means
The parameters for this method are defined as follows:
- Enter Model name.
- Select a Datasource from the list (if applicable).
- Select a Learning set from the list.
- Enter a Cluster name prefix. The default prefix is "CLUSTER".
- Enter a Cluster number, default 3. For more information, see Cluster number.
- Enter a Maximum number of iterations. For more information, see Maximum number of iterations.
- Select Cluster Standardisation:
  - None: no cluster standardization.
  - Normalize: transform the variable to have a max of 1 and a min of 0. The value is calculated with: scaled(x) = (x - min)/(max - min) where the min and max values are based on the learning data set.
  - Standardize: transform the variable to have a mean of 0 and a standard deviation of 1. The value is calculated with: scaled(x) = (x - µ)/(STDEV) where the average (µ) and standard deviation (STDEV) values are calculated on the learning data set.
- Select Calculate cluster silhouette checkbox. For more information, see Calculate cluster silhouette.
- Select Optimize cluster number checkbox. For more information, see Optimize cluster number.
- Select variable(s) from the list for the Inputs.
- Click Save to generate the clusters.
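The two Cluster Standardisation formulas can be illustrated with a short Python sketch. The key point is that min, max, µ and STDEV are computed once on the learning data set and then reused for any other value. The function names and sample values are hypothetical, and the use of the sample standard deviation is an assumption, since this guide does not state which variant DATAmaestro uses.

```python
import statistics

def fit_normalize(learning_values):
    """Return scaled(x) = (x - min)/(max - min), with min and max
    taken from the learning data set."""
    lo, hi = min(learning_values), max(learning_values)
    if hi == lo:
        raise ValueError("a constant variable cannot be normalized")
    return lambda x: (x - lo) / (hi - lo)

def fit_standardize(learning_values):
    """Return scaled(x) = (x - mu)/STDEV, with mu and STDEV
    taken from the learning data set."""
    mu = statistics.mean(learning_values)
    # Assumption: sample standard deviation (n - 1 denominator).
    stdev = statistics.stdev(learning_values)
    return lambda x: (x - mu) / stdev

learning = [10.0, 20.0, 30.0]  # hypothetical learning-set values
normalize = fit_normalize(learning)
standardize = fit_standardize(learning)

print(normalize(20.0))    # 0.5
print(normalize(40.0))    # 1.5, because 40 lies above the learning-set max
print(standardize(30.0))  # 1.0
```

Values outside the learning-set range are not clipped in this sketch, so a normalized value can fall below 0 or above 1.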
Subclu
Subclu is an unsupervised clustering algorithm that defines groups or patterns within the data based on the density of data points. It marks as outliers the points that lie alone in low-density regions. Each cluster is expanded one dimension at a time, into a dimension that is known to have a cluster that differs from previous clusters in only one dimension. Unlike K-Means, Subclu does not require you to define the number of clusters.
Create Subclu
The parameters for this method are defined as follows:
- Enter Model name.
- Select a Datasource from the list (if applicable).
- Select a Learning set from the list.
- Select variable(s) from the list for the Input.
- Enter Cluster name prefix. The default prefix is ("CLUSTER-").
- Enter Variable suffix. The default suffix is ("SUBCLU-1 Cluster").
- Enter Epsilon, default 0.1. For more information, see Epsilon.
- Select Calculate cluster silhouette: yes or no. For more information, see Calculate cluster silhouette.
- Enter Minimum points, default 10. For more information, see Minimum points.
- Enter Min cluster Dimensions. For more information, see Min cluster Dimensions.
- Select Cluster Standardisation:
  - None: no cluster standardization.
  - Normalize: transform the variable to have a max of 1 and a min of 0. The value is calculated with: scaled(x) = (x - min)/(max - min) where the min and max values are based on the learning data set.
  - Standardize: transform the variable to have a mean of 0 and a standard deviation of 1. The value is calculated with: scaled(x) = (x - µ)/(STDEV) where the average (µ) and standard deviation (STDEV) values are calculated on the learning data set.
- Click Save to generate the clusters.
IMS
Inductive Monitoring System (IMS) is a new clustering method available in DATAmaestro, based on the distance between data points. As with Subclu, it is not necessary to define the number of clusters as in K-Means. IMS checks each point in turn to see whether it can be included in an existing cluster: first by comparing the point with the cluster's min and max values in each dimension, and second by comparing its distance to the center of the cluster. If the point cannot be included in an existing cluster, a new cluster is created.
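The point-by-point procedure can be sketched in Python as follows. This is a simplified reading of the description above and not DATAmaestro's actual implementation: the function names, the use of Euclidean distance, the mean of the members as cluster center, and the rule that an accepted point widens the cluster's min/max box are all assumptions made for illustration. The epsilon argument plays the role of a maximal distance to the cluster center.

```python
import math

def center(cluster):
    """Cluster center, taken here as the mean of the member points (assumption)."""
    members = cluster["members"]
    return [sum(col) / len(members) for col in zip(*members)]

def ims_sketch(points, epsilon):
    clusters = []  # each cluster keeps per-dimension min/max and its members
    labels = []
    for p in points:
        chosen = None
        # First check: is the point inside an existing cluster's min/max box?
        for i, c in enumerate(clusters):
            if all(lo <= x <= hi for lo, x, hi in zip(c["lo"], p, c["hi"])):
                chosen = i
                break
        # Second check: is the point within epsilon of the nearest cluster center?
        if chosen is None and clusters:
            dists = [math.dist(p, center(c)) for c in clusters]
            nearest = min(range(len(clusters)), key=dists.__getitem__)
            if dists[nearest] <= epsilon:
                chosen = nearest
        # Otherwise the point starts a new cluster.
        if chosen is None:
            clusters.append({"lo": list(p), "hi": list(p), "members": []})
            chosen = len(clusters) - 1
        c = clusters[chosen]
        c["members"].append(p)
        c["lo"] = [min(a, b) for a, b in zip(c["lo"], p)]
        c["hi"] = [max(a, b) for a, b in zip(c["hi"], p)]
        labels.append(chosen)
    return labels, clusters

# Hypothetical data: points near (0, 0) and near (1, 1).
points = [(0.0, 0.0), (0.05, 0.05), (1.0, 1.0), (0.02, 0.03), (1.05, 1.0)]
labels, clusters = ims_sketch(points, epsilon=0.1)
print(labels, len(clusters))
```

In this sketch the number of clusters is an outcome of the data and of epsilon rather than an input: rerunning the same points with epsilon=2.0 places every point in a single cluster.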
Create IMS
The parameters for this method are defined as follows:
- Enter Model name.
- Select a Datasource from the list (if applicable).
- Select a Learning set from the list.
- Enter Cluster name prefix. The default prefix is ("CLUSTER-").
- Select Standardisation:
  - None: no cluster standardization.
  - Normalize: transform the variable to have a max of 1 and a min of 0. The value is calculated with: scaled(x) = (x - min)/(max - min) where the min and max values are based on the learning data set.
  - Standardize: transform the variable to have a mean of 0 and a standard deviation of 1. The value is calculated with: scaled(x) = (x - µ)/(STDEV) where the average (µ) and standard deviation (STDEV) values are calculated on the learning data set.
- Enter Epsilon, default 0.1. Epsilon is the maximal distance for a point to be in a cluster. A larger value tends to lead to a lower number of clusters.
- Select Calculate cluster silhouette: yes or no. For more information, see Calculate cluster silhouette.
- Enter Variable name prefix. The default prefix is ("IMS-1 Cluster").
- Select the Variable Set.
- Select variable(s) from the list for the Variable.
- Click Save to generate the clusters.
Visualize K-Means, Subclu and IMS results
To visualize the K-Means, Subclu and IMS results, use the scatter plot: choose the variables for the x and y axes, then set the condition to the Nearest CLUSTER-Name.
Hierarchical Clustering
For more information, see the online learning platform
Hierarchical clustering is a model that is viewed as a dendrogram. A dendrogram is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. For more information, see Dendrograms.
Learning set empty
Certain models in DATAmaestro are able to handle missing values, while other models are not. For example, clustering methods, like K-means, are not able to handle missing values. If a row of data has a missing value, even for just one variable, the algorithm ignores that row. The message “Learning set is empty” indicates that the algorithm has removed every row, because each row has at least one missing value. If you have any variables with a high number of missing values, it is recommended to remove them or to use the “Fill missing values” tool under the “Transform” menu in DATAmaestro Analytics.
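The effect of a single sparse variable can be reproduced with a short Python sketch. It only illustrates the row-filtering rule described above; the function name complete_rows and the variables temp, pressure and flow are hypothetical.

```python
import math

def complete_rows(rows):
    """Keep only the rows in which every value is present (not None, not NaN)."""
    def present(value):
        return value is not None and not (
            isinstance(value, float) and math.isnan(value)
        )
    return [row for row in rows if all(present(v) for v in row.values())]

# Hypothetical learning set: "pressure" is missing in every row.
rows = [
    {"temp": 20.1, "pressure": None, "flow": 3.2},
    {"temp": 20.4, "pressure": None, "flow": 3.4},
    {"temp": 19.8, "pressure": None, "flow": float("nan")},
]

# Every row has at least one missing value, so nothing is left to learn from.
print(len(complete_rows(rows)))  # 0

# Removing the sparse variable leaves the rows that are otherwise complete.
without_pressure = [
    {name: value for name, value in row.items() if name != "pressure"}
    for row in rows
]
print(len(complete_rows(without_pressure)))  # 2
```

One variable that is mostly empty is enough to empty the whole learning set, which is why removing or filling such variables before creating the model resolves the message.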
Calculation time
In DATAmaestro, the calculation time can vary depending on the number of records, the number of input variables and the type of algorithm that is being used. For example, Subclu is significantly slower than K-means for larger data sets. If you have a large data set, it is recommended to use K-means.