DATAmaestro provides a number of statistical analysis methods. Any one method can be used alone to address a specific problem, or different methods can be used together to create a hybrid approach that exploits a combination of strengths.
To combine the methods and create multiple learning models, you can use the new variables created from one method as inputs for other methods.
Statistical Process Control (SPC) charts are a tool for evaluating the central tendency of a process over time. They are used to monitor and control a process, evaluate risk, and ensure a system operates at its full potential. The method provides a graphical display that helps assess the cause of data points that fall outside the expected range, which is bounded by the lower and upper control limits (LCL and UCL).
To define process control limits:
On the Properties tab:
Click Transform > Statistical Process Control in the menu.
Enter a name for the LCL and UCL variables. For more information, see LCL and UCL variables.
Select the Record set.
Select from the Variable list and use the arrow buttons to set variables for Index and Var. The Index is automatically selected as the time variable. Control limits are calculated for each variable set in Var; you can select one or multiple variables, but they must all be numeric.
In Cond., select the symbolic variable to be used as a condition. When a condition is specified, SPC calculates the limits (LCL and UCL) for each combination of conditions (e.g. product type, season). You can select one or multiple symbolic variables.
Enter a Coefficient value; the default is 2.66. For more information, see Coefficient.
For more information on how to change the coefficient, see Shewhart individual control chart.
Select the Chart Type from the list. The options are: X charts, Average and range control charts, Average and sigma control charts and Run charts.
The chart type determines how the control limits are computed.
Click Save.
Check the SPC results using a trend.
On the Advanced tab:
Check Moving ranges. This advanced parameter adds the Moving Range (MR), which is the difference between two consecutive points (point x and point x-1).
Enter the MR prefix, by default MR_.
Enter the MR LCL prefix, by default MR_LCL_.
Enter the MR UCL prefix, by default MR_UCL_.
Enter the MR LCL coefficient, by default 0. The lower control limit for the range (or lower range limit) is calculated by multiplying the average of the moving range by this coefficient [Wikipedia].
Enter the MR UCL coefficient, by default 3.27. The upper control limit for the range (or upper range limit) is calculated by multiplying the average of the moving range by this coefficient [Wikipedia].
Click Save.
Once calculated in Analytics, SPC limits can be deployed live in the Lake and on Dashboards.
For more information, see the online learning platform.
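As an illustration of how the default coefficients are typically used, the sketch below computes limits the way a standard Shewhart individuals chart does: the mean of the variable plus or minus the Coefficient times the average moving range, with the moving-range limits obtained by multiplying the average moving range by the MR coefficients. The function name, the use of the absolute difference for the moving range, and the sample readings are assumptions made for illustration; this is not taken from the DATAmaestro implementation.

```python
from statistics import mean


def individuals_limits(values, coefficient=2.66,
                       mr_lcl_coefficient=0.0, mr_ucl_coefficient=3.27):
    """Control limits for an individuals (X) chart and its moving-range chart.

    Hypothetical helper: the defaults mirror the values quoted in this page.
    """
    if len(values) < 2:
        raise ValueError("at least two points are needed for a moving range")

    # Moving range: absolute difference between consecutive points (assumption).
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    centre = mean(values)
    mr_bar = mean(moving_ranges)

    return {
        "LCL": centre - coefficient * mr_bar,
        "UCL": centre + coefficient * mr_bar,
        "MR_LCL": mr_lcl_coefficient * mr_bar,
        "MR_UCL": mr_ucl_coefficient * mr_bar,
    }


# Made-up readings, for illustration only.
readings = [10.0, 12.0, 11.0, 13.0, 12.0]
limits = individuals_limits(readings)

# Points outside [LCL, UCL] would be the ones to investigate.
out_of_control = [x for x in readings if not limits["LCL"] <= x <= limits["UCL"]]
```

When conditions are set in Cond., the same calculation would presumably be repeated on each group of records sharing the same combination of condition values.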
Principal Component Analysis (PCA) is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
PCA effectively reduces a large set of variables into a smaller set. The method is sensitive to the relative scaling of the original variables, and its results can be difficult to analyze. Ensure that the variables are not Gaussian functions.
To create principal components:
Click Transform > Principal components analysis in the menu.
Enter Model Name.
Select the Learning set.
Enter a Variable prefix for your analysis; the default is PCA-.
Enter a PCA number value; the default is 5.
Select variables from the Variable list.
Click Save.
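To make the idea concrete, here is a minimal sketch of what a principal component transform computes, using NumPy. It standardizes the variables (because PCA is sensitive to scaling), diagonalizes their covariance matrix, and keeps the strongest components. The function name, the way the prefix is combined with a component number (PCA-1, PCA-2, ...), and the synthetic data are assumptions for illustration, not a description of the DATAmaestro internals.

```python
import numpy as np


def principal_components(data, n_components=5, prefix="PCA-"):
    """Return (names, scores, variances) of the leading principal components.

    Hypothetical helper; output naming is an assumption.
    """
    x = np.asarray(data, dtype=float)

    # Standardize each variable so that no variable dominates through its scale.
    std = x.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant columns unscaled
    z = (x - x.mean(axis=0)) / std

    # Eigen-decomposition of the covariance matrix (an orthogonal transformation).
    cov = np.atleast_2d(np.cov(z, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(cov)

    # eigh returns ascending eigenvalues; keep the largest n_components.
    order = np.argsort(eigvals)[::-1][:n_components]
    scores = z @ eigvecs[:, order]

    names = [f"{prefix}{i + 1}" for i in range(scores.shape[1])]
    return names, scores, eigvals[order]


# Synthetic data: two strongly correlated variables and one independent variable.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
data = np.column_stack([
    base,
    2 * base + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

names, scores, variances = principal_components(data, n_components=2)
```

The resulting score columns are linearly uncorrelated, which is what allows them to be used as inputs to other methods in place of the original correlated variables.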
Change Point Analysis (CPA) groups data into segments, each with a consistent mean and standard deviation that differ from those of the neighbouring segments. A change point is identified where there is a significant change in mean and/or standard deviation.
Steps of Change Point Analysis :
To create a Change Point Analysis:
On the Properties tab:
Click Transform > Change Point Analysis in the menu.
Enter Variable prefix, by default CPA_.
Select the Record set.
Select from the Variables list and use the arrow buttons to set variables for Index and Variable. Note that the variable must be numeric. The Index is automatically selected as the time variable.
Enter a Coefficient. This factor is used to calculate the upper and lower limits from the segment mean (µ) and standard deviation (σ), e.g. upper = µ + coefficient * σ. A segment is a subset of data between two change points.
Click Save.
On the Advanced tab:
Enter the Record count. The data set is divided into blocks of this many points to begin searching for segments.
Enter the Minimum number of points. It is the minimum number of records required to create a segment. A segment is a subset of data between two change points.
Select the Cost Function. This is the function used to compare subsets to determine the best combination of segments across the data. There are three types of cost functions: Least squared deviation, Least deviation and Log-likelihood.
Each cost is computed on the segment, where:
xi = value at row i
S = [xi - x(i+n)]
Least squared deviation: Cost = Var(S) * n
Least deviation: µ = mean(S), Cost = ∑ |xi - µ|
Log-likelihood: Cost = n * log(2 * PI) + log(Var(S) + 1)
Enter a Beta value. This penalty is added to the cost function; it increases the cost and therefore reduces the number of change points and avoids overfitting. A higher Beta value reduces the number of change points. A Beta of zero leads to the maximum number of change points.
Click Save.
Check the Change Point Analysis results using a trend.
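The three cost functions, the Beta penalty and the Coefficient-based limits can be sketched in a few lines of Python. Several details are assumptions, since this page does not specify them: S is read as the list of values in the segment, n as the number of points in S, Var as the population variance, log as the natural logarithm, Beta as a penalty added once per segment, and the lower limit as the mirror image of the documented upper limit. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev, pvariance


def least_squared_deviation(segment):
    # Cost = Var(S) * n
    return pvariance(segment) * len(segment)


def least_deviation(segment):
    # Cost = sum of |xi - mean(S)|
    mu = mean(segment)
    return sum(abs(x - mu) for x in segment)


def log_likelihood(segment):
    # Cost = n * log(2 * PI) + log(Var(S) + 1)
    n = len(segment)
    return n * math.log(2 * math.pi) + math.log(pvariance(segment) + 1)


def penalised_cost(segments, cost, beta):
    # Assumption: Beta is added once for every segment in the combination.
    return sum(cost(s) + beta for s in segments)


def segment_limits(segment, coefficient):
    # upper = mean + coefficient * stdev; lower assumed symmetric.
    mu, sigma = mean(segment), pstdev(segment)
    return mu - coefficient * sigma, mu + coefficient * sigma


# Made-up series with an obvious shift in the middle.
data = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
one_segment = [data]
two_segments = [data[:4], data[4:]]

# A small Beta favours splitting at the shift; a large Beta does not.
split_with_small_beta = (
    penalised_cost(two_segments, least_squared_deviation, beta=10)
    < penalised_cost(one_segment, least_squared_deviation, beta=10)
)
split_with_large_beta = (
    penalised_cost(two_segments, least_squared_deviation, beta=40)
    < penalised_cost(one_segment, least_squared_deviation, beta=40)
)
```

In this toy series the split is kept when Beta is 10 and rejected when Beta is 40, which matches the stated effect of Beta: higher values lead to fewer change points.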
Example 1: Change Point Analysis, for a data set of 35 points where Record Count = 10 and Minimum Records = 5
All possible cost combinations are calculated:
Best segment combination is determined, for example:
Usual case of Change Point Analysis:
To simplify the notation, each range (x - y) is named by a letter: A = (0 - 10), C = (20 - 30), BC = (10 - 30), and so on.
To find the best set of segments:
Because the data is divided by the Record count (10), segments can only be built from the blocks A, B, C and D.
Example 2: Change Point Analysis, for a data set of 35 points where Record Count = 10 and Minimum Records = 15
What happens when Minimum Records > Record Count:
To simplify the notation, each range (x - y) is named by a letter: A = (0 - 10), C = (20 - 30), BC = (10 - 30), and so on.
To find the best set of segments:
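The block notation used in both examples can be reproduced with a short sketch that lists which segments are candidates for a given Record Count and Minimum Records, and which combinations of them cover the whole data set. This is one plausible reading of the examples, not the documented algorithm: it assumes segments are runs of consecutive blocks, that the last block D holds the remaining points (30 - 35), and that a segment shorter than Minimum Records is simply discarded. The function names are hypothetical.

```python
import string


def blocks(n_points, record_count):
    """Split rows 0..n_points into consecutive blocks of record_count rows."""
    return [(start, min(start + record_count, n_points))
            for start in range(0, n_points, record_count)]


def candidate_segments(n_points, record_count, min_points):
    """Runs of consecutive blocks holding at least min_points rows.

    Keys are block letters (A, AB, BC, ...), values are (start, end) ranges.
    """
    bl = blocks(n_points, record_count)
    if len(bl) > len(string.ascii_uppercase):
        raise ValueError("letter notation only covers up to 26 blocks")
    found = {}
    for i in range(len(bl)):
        for j in range(i, len(bl)):
            start, end = bl[i][0], bl[j][1]
            if end - start >= min_points:
                found[string.ascii_uppercase[i:j + 1]] = (start, end)
    return found


def partitions(candidates, n_points, start=0):
    """All ways to cover rows start..n_points with candidate segments."""
    if start == n_points:
        return [[]]
    result = []
    for name, (seg_start, seg_end) in candidates.items():
        if seg_start == start:
            for rest in partitions(candidates, n_points, seg_end):
                result.append([name] + rest)
    return result


# Example 1: 35 points, Record Count = 10, Minimum Records = 5
example_1 = candidate_segments(35, 10, 5)
# Example 2: 35 points, Record Count = 10, Minimum Records = 15
example_2 = candidate_segments(35, 10, 15)

combinations_1 = partitions(example_1, 35)
combinations_2 = partitions(example_2, 35)
```

Under these assumptions, Example 1 allows every run of blocks (A, B, AB, BCD, ...), while in Example 2 single blocks are too short, so only multi-block runs such as AB, BC or CD remain. The best combination would then be the one with the lowest total cost for the selected Cost Function and Beta.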