MashZone NextGen 10.2 | Appendix | Legacy Presto components | MashZone NextGen Analytics | RAQL Queries | Built-In RAQL Functions | Built-In Analytic Functions: Aggregate and Window
 
Built-In Analytic Functions: Aggregate and Window
Analytic functions most commonly perform calculations using sets of rows within a dataset. This may be the entire dataset or specific sets of rows defined as groups, partitions or windows.
There are two types of analytic functions:
*Aggregate analytic functions: use all rows in the dataset, group, partition or window to perform a calculation and return a single value. For example, sum adds the values of all rows in the current scope.
Aggregate analytic functions can be be used in the Select or Having clauses of RAQL queries:
*To return a single value for each group defined in a Group By clause.
*To return a single value for each partition or each window defined in an Over clause. This single value is added to each row in the partition or window. You can also use aggregate analytic functions to return running calculations for partitions or windows defined in an Over clause.
*To return a single value for the entire dataset if no group definition or partition definition is specified.
*Window analytic functions: use some of the rows in a partition or window to perform a calculation and return a value for each row. Typically, this uses rows that are relative to the current row, such as first_value or row_number.
You must include an Over clause in queries that use window analytic functions. This defines the partitions or windows used by the function.
Aggregate
Window
Function
Description
avg(Number col-or-expr)
Returns the average value from the specified column or expression for all rows within the group, partition or window scope. This ignores any null values.
correlation(Number col-or-expr, Number col-or-expr)
corr(Number col-or-expr, Number col-or-expr)
Returns a correlation coefficient (Pearson’s product-moment) for the values in the two columns or expressions for this partition or for the current window in this partition.
Null values are ignored.
This coefficient indicates whether the two columns have a linear relationship. The closer the value is to |1| the closer the relationship is to linear. If the value returned is:
*0: no linear relationship exists.
*0 < x <=1: a linear relationship exists where both columns increase as the other increases.
*-1 <= x < 0: there is an inverse linear relationship. One column decreases as the other increases.
count(Object col-or-expr)
Returns the number of rows for the specified column or expression for all rows within the group, partition or window scope. Rows with null values are ignored.
covariance(Number col-or-expr, Number col-or-expr)
covar(Number col-or-expr, Number col-or-expr)
Returns the measure of the covariance of the two specified columns or expressions.
This measure is not bias corrected. Rows with null values are ignored
A positive covariance indicate that the values for these columns tend to vary alike, as one grows larger so does the other. A negative covariance indicates an inverse relationship between the columns.
covar_pop(Number col-or-expr, Number col-or-expr)
Returns the full population covariance of the two specified col-or-exprs. See covariance for more information.
covar_samp(Number col-or-expr, Number col-or-expr)
Returns the sample covariance (an estimate) of the two specified columns. See covariance for more information.
cume_dist()
The cumulative distribution for the current row in the current partition. This is the percentage of values, including the current row, that are less than or equal to the value for the current row.
Note: This function requires that partitions be ordered and cannot be applied to windows within the partition.
dense_rank()
Assigns an indexed rank number indicating the order within this partition based on the unique combined values for all columns specified in the Order By clause defined for this partition. All rows with the same unique combination receive the same rank.
Rank numbers are contiguous across the partition. For an example illustrating this function, see Row_number, Rank and Dense_rank Example. See also the rank() plain function.
Note: This function requires that partitions be ordered and cannot be applied to windows within the partition.
analytics.discretize(Number col-or-expr, int binCount)
This function segments a continuous range of values for a column or expression into discrete bins based on the number of bins specified. If the value:
*< = minimum value for the column, it returns 0.
*> = maximum value for the column, it returns binCount - 1.
*< maximum and > minimum value for the column, it returns a fraction between 0 and binCount - 1 indicating the value’s relative position within the number of bins.
So 0 <= return value < 1, indicates the first bin, 1 <= return value < 2, indicates the second bin, and so on.
Note that this function does not support windows. It can only be applied to the entire partition.
first_value(Object col-or-expr)
first_value(Object col-or-expr, Any default)
Returns the value for the specified column or expression for the first row in the partition or window for the current row.
If a default value is specified, this returns the default value in cases where the specified window is empty.
gmean(Number col-or-expr)
Returns the geometric mean of the values for this column or expression within the partition or the current window in this partition.
The geometric mean is typically used to define a mean when more than one property is involved, especially if the scale for the properties is different.
analytics.kmean_clusters(Number col-or-expr1[,Number col-or-expr2,...Number col-or-exprN]; Integer k, Integer iterations, String measure)
Deprecated. See analytics.kmeans_clusters.
analytics.kmeans_clusters(Number col-or-expr1[,Number col-or-expr2,...Number col-or-exprN]; Integer k, Integer iterations, String measure)
Returns the optimum center point for the number of clusters specified as k that group rows within a minimum distance for each cluster. Inclusion in a cluster for each row is defined by the features specified in the list of column parameters (before the semi-colon). The distance from the cluster center point is measured by the formula specified in measure.
*col-or-expr1, col-or-expr2, ... col-or-exprN;: are individual columns or expressions used as features of the vectors that define clusters. This list of parameters must:
*Consist solely of columns with numeric values.
*Contain at least one column, but can have any number of columns.
*End with a semi-colon (;). This indicates the end of feature parameters and the beginning of the remaining, well-known parameters.
*k: is the number of clusters to create
*iterations: is the maximum number of iterations to perform to optimize clusters.
*measure: is the name of the formula to use to define membership for each row in a given cluster. Valid measures are:
*euclidean
*manhattan
*cosine
*tanimoto
*squaredeuclidean
analytics.kmean_observations(Number col-or-expr1[,Number col-or-expr2,...Number col-or-exprN]; Integer k, Integer iterations, String measure)
Deprecated. See analytics.kmeans_observations.
analytics.kmeans_observations(Number col-or-expr1[,Number col-or-expr2,...Number col-or-exprN]; Integer k, Integer iterations, String measure)
Returns the ID of the cluster that each row belongs to for the number of clusters specified as k. Inclusion in a cluster for each row is defined by the features specified in the list of column parameters (before the semi-colon). The distance from the cluster center point is measured by the formula specified in measure.
*col-or-expr1, col-or-expr2, ... col-or-exprN;: are individual columns or expressions used as features of the vectors that define clusters. This list of parameters must:
*Consist solely of columns with numeric values.
*Contain at least one column, but can have any number of columns.
*End with a semi-colon (;). This indicates the end of feature parameters and the beginning of the remaining, well-known parameters.
*k: is the number of clusters to create
*iterations: is the maximum number of iterations to perform to optimize clusters
*measure: is the name of the formula to use to define membership for each row in a given cluster. Valid measures are:
*euclidean
*manhattan
*cosine
*tanimoto
*squaredeuclidean
lag(Object col-or-expr)
lag(Object col-or-expr, int offset)
lag(Object col-or-expr, int offset, Object default)
Returns the value for the specified column or expression from a preceding row for the current row.
The preceding row is found using the offset specified or using 1 if this parameter is omitted.
If the specified preceding row is outside the scope of this partition or the current window within this partition, this returns the default, if specified, or returns null. The data type for default must match the data type for col-or-expr.
last_value(Object col-or-expr)
Returns the value for this column or expression for the last row in this partition or window.
lead(Object col-or-expr)
lead(Object col-or-expr, int offset)
lead(Object col-or-expr, int offset, Object default)
Returns the value for the specified column or expression from a following row for the current row.
The following row is found using the offset specified or using 1 if this parameter is omitted.
If the specified following row is outside the scope of this partition or the current window within this partition, this returns the default, ifspecified, or returns null. The data type for default must match the data type for col-or-expr.
analytics.linear_regression_intercept(Number col-or-expr1, Number col-or-expr2)
Returns the intercept value for the line resulting from a linear regression of the two columns or expressions passed as arguments. This uses an ordinary least squares regression where the first column is considered the independent variable.
analytics.linear_regression_interpolate(Number col-or-expr, Number col-or-expr)
Returns the expected value (the Y coordinate) for the point on the regression line for each row in this dataset. This line is derived from a linear regression applied to the two specified columns or expressions.
analytics.linear_regression_slope(Number col-or-expr, Number col-or-expr)
Returns the slope value for the line resulting from a linear regression of the two columns or expressions passed as arguments. This uses an ordinary least squares regression where the first column is considered the independent variable.
max(Number col-or-expr)
Returns the maximum value of this column or expression for all rows within the group, partition or window scope.
min(Number col-or-expr)
Returns the minimum value of this column or expression for all rows within the group, partition or window scope.
analytics.ordinal(String col-or-expr)
Returns an index number for each unique string value in the specified column or expression. Typically, assigning an ordinal number allows computations to work with string columns.
ntile(Number groups)
Distributes the rows in an ordered partition into the specified number of groups (the percentiles).
Note: This function requires that partitions be ordered and cannot be applied to windows within the partition.
nth_value(Object col-or-expr, Integer n)
nth_value(Object col-or-expr, Integer n, Object default)
Returns the value for the expression or column of the specified row (n) in the current partition and window.
The n parameter must be a positive integer. The col-or-expr parameter can be a column or any calculation.
If the specified nth row is outside the scope of this partition or the current window within this partition, this returns the default, if specified, or returns null. The data type for default must match the data type for col-or-expr.
percent_rank()
The percentage of rows that are less than the value for the current row based on the number of rows in the current partition and window not counting the highest value.
Note: This function requires that partitions be ordered and cannot be applied to windows within the partition.
rank()
Assigns an indexed rank number indicating the order within this partition based on the unique combined values for all columns in the Order By clause defined for this partition. All rows with the same unique value receive the same rank.
Rank numbers may not be contiguous, as they are skipped if multiple rows have identical ranks. For an example illustrating this function, see Row_number, Rank and Dense_rank Example. See also the dense_rank() plain function.
Note: This function requires that partitions be ordered and cannot be applied to windows within the partition.
regr_intercept(Number col-or-expr1, Number col-or-expr2)
See analytics.linear_regression_intercept.
regr_r2(Number col-or-expr1, Number col-or-expr2)
Returns the coefficient of determination (R squared) of a linear regression line based on the specified columns or expressions. This indicates how well the linear regression line is fitted to the data.
This ranges from 0 to 1, but can return null if there is no variance in the values of col-or-expr2.
regr_slope(Number col-or-expr1, Number col-or-expr2)
See analytics.linear_regression_slope.
row_number()
Returns a sequential index number for this row within the current partition. For an example illustrating this function, see Row_number, Rank and Dense_rank Example.
Note: This function cannot be applied to windows within the partition.
skew(Number col-or-expr)
Returns a measure of the asymmetry from a normal probability distribution of the values for this column or expression within the current partition or the current window in this partition. This uses the adjusted Fisher-Pearson standardized moment coefficient.
Positive values tend to indicate that more values are found left of the mean, while negative values imply the reverse.
This returns 0 if there are not at least three rows in the dataset. This function can also return NaN if the caluclation is undefined or not a number.
stddev(Number col-or-expr)
Returns the standard deviation of the values for this column or expression in the partition or the current window in this partition. Null values are ignored.
If there are no numeric values, this returns NaN. If there is only a single row, this returns zero.
stddev_pop(Number col-or-expr)
Returns the full population standard deviation of the values for this column or expression. See stddev for more information.
stddev_samp(Number col-or-expr)
Returns the sample standard deviation (an estimate) of the values for this column or expression. See stddev for more information.
sum(Number col-or-expr)
Returns the sum of all values of this column or expression for the rows within the group, partition or window scope. Null values are ignored.
If all values are null or empty, return null.
variance(Number col-or-expr)
Returns the variance of the distribution of values for this column or expression within the current partition or the current window in this partition. Null values are ignored.
If the column has no numeric values, this returns NaN.
var_pop(Number col-or-expr)
Returns the full population variance for this column or expression. See variance for more information.
var_samp(Number col-or-expr)
Returns the sample variance (an esitmate) for the values of this column or expression. See variance for more information.
Distinct Aggregations
Standard SQL aggregations functions such as AVG, SUM, COUNT, etc. (see list below) ignore duplicate values if their parameters are given in conjunction with the DISTINCT key word.
DISTINCT may be used with AVG, COUNT, MAX, MIN, STDDEV, STDDEV_POP, STDDEV_SAMP, SUM, VARIANCE, VAR_POP, VAR_SAMP and COUNT.
DISTINCT may however not be used with COVAR, COVAR_POP, COVAR_SAMP, CORR, REGR_SLOPE, REGR_INTERCEPT and any user-defined aggregation function.
Table 1. Example
aColumn
1
2
2
3
3
3
Then the query SELECT SUM(DISTINCT aColumn) FROM source only sums up the unique values of the column aColumn and returns 6.
Window functions over RANGEs
In RAQL SELECT aggregationFunction(x) OVER(ORDER BY y) FROM s is semantically equivalent to SELECT aggregationFunction (x) OVER(ORDER BY y ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM s . The SQL Standard however defines the semantics to correspond with SELECT aggregationFunction (x) OVER(ORDER BY y RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM s which means that not only the current row but all "peer" rows (rows identical to the current row with respect to the ordering) are contained in the window. That means, that a rolling aggregate without an explicit window frame specification currently yields different results in RAQL and any SQL standard compliant DBMS.
In order to ensure a high level of compliance with the SQL standard, window frames are also allowed to be specified using the RANGE key word.
Semantics
Window frames defined using the ROWS keyword are easy to understand. The window frame clause ROWS BETWEEN x PRECEDING AND y FOLLOWING simply defines two offsets x and y that for each row in the input partition define the first and the last row number of the corresponding window relative to the current row number. PRECEDING and FOLLOWING only indicate whether the offset is negative (the row number precedes the current row number) or positive respectively.
Table 2. Example
rowNumber
ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
1
Window = [1,4]
Window = []
Window = [1,10]
2
Window = [1,5]
Window = [1,1]
Window = [2,10]
3
Window = [1,6]
Window = [1,2]
Window = [3,10]
4
Window = [1,7] (from row number 4-3 to 4+3)
Window = [1,3]
Window = [4,10]
5
Window = [2,8]
Window = [2,4]
Window = [5,10]
6
Window = [3,9]
Window = [3,5]
Window = [6,10]
7
Window = [4,10]
Window = [4,6]
Window = [7,10]
8
Window = [5,10]
Window = [5,7]
Window = [8,10]
9
Window = [6,10]
Window = [6,8]
Window = [9,10]
10
Window = [7,10]
Window = [7,9]
Window = [10,10]
On the other hand the semantics of window frame specifications using the RANGE keyword does not depend on the physical row numbers but on the actual values of a column. Still, the window frame clause defines two offsets x and y, but these offsets are not added to the row number but to an actual column value. Hence, any window frame specification involving RANGE and an offset other than UNBOUNDED PRECEDING, UNBOUNDED FOLLOWING or CURRENT ROW requires the window to be ordered on a single column and this column needs to have a numeric data type (in order to allow for offset arithmetics).
Table 3. Example
rowNumber
salary
SELECT avg(salary) OVER (ORDER BY salary RANGE BETWEEN 300 PRECEDING AND 300 FOLLOWING) FROM ...
considered window
1
1750
1850
[1450,2050]
salaries between 1750 - 300 and 1750 + 300
2
1900
1850
[1600,2200]
3
1900
1850
[1600,2200]
4
2200
2100
[1900,2500]
5
2400
2300
[2100,2700]
6
2750
2866,67
[2450,3050]
7
2900
2980
[2600,3200]
8
2950
2980
[2650,3250]
9
3100
2980
[2800,3400]
10
3200
2980
[2900,3500]
Table 4. Example
rowNumber
salary
SELECT avg(salary) OVER (ORDER BY salary RANGE BETWEEN 300 PRECEDING AND 300 FOLLOWING) FROM ...
considered window
1
1750
1
[1750,1750]
salaries between 1750 - 0 and 1750 + 0
2
1900
2
[1900,1900]
3
1900
2
[1900,1900]
4
2200
1
[2200,2100]
5
2400
1
[2400,2400]
6
2750
1
[2750,2750]
7
2900
1
[2900,2900]
8
2950
1
[2950,2950]
9
3100
1
[3100,3100]
10
3200
1
[3200,3200]
Note: The offsets in a RANGE frame clause define logical offsets in the ordered sequence defined by the order by clause. If the order by clause specifies a descending order (see next example) the offset 10 PRECEDING translates to "a value that precedes the current value in the current sequence and differs at most by 10 (years/units/...)"
Table 5. Example
rowNumber
age
SELECT COUNT(*) OVER (ORDER BY age DESC RANGE 10 PRECEDING) FROM ...
considered window
1
75
1
[85, 75]
ages between 75 + 10 and 75 - 0
2
68
2
[78, 68]
3
68
2
[72, 62]
4
56
2
[66, 56]
5
55
3
[65, 55]
6
51
3
[61, 51]
7
48
5
[58, 48]
8
48
5
[58, 48]
9
47
6
[57, 47]
10
43
5
[53, 43]
Affected Queries
As mentioned above, the new RANGE specifier was introduced to ensure compliance with the SQL standard. There are however only very few cases where existing queries should be affected by those changes. Only queries specifying a window function with
1. an explicit order-by specification and
2. NO explicit window frame specification may be affected if
3. The expressions defined in the order-by clause are not unique for each row.
If the order-by expressions are unique, than for each row there are no peers with respect to the ordering and the implicitly added window frame clause RANGE UNBOUNDED PRECEDING is equivalent to ROWS UNBOUNDED PRECEDING and hence the query results would be the same as before. If however the order-by specification is not unique then there may be rows having one or more peer rows so that the query result differs from the result in former RAQL releases and the user might want to explicitly add the window frame clause ROWS UNBOUNDED PRECEDING to yield the former results.

Copyright © 2013-2018 | Software AG, Darmstadt, Germany and/or Software AG USA, Inc., Reston, VA, USA, and/or its subsidiaries and/or its affiliates and/or their licensors.
Innovation Release