Built-In Analytic Functions: Aggregate and Window

Aggregate

Window

Function

Description

avg(Number col-or-expr)

Returns the average value from the specified column or expression for all rows within the group, partition or window scope. This ignores any null values.

correlation(Number col-or-expr, Number col-or-expr)

Deprecated. See Update Built-In Aggregate or Window Analytic Functions for 3.8 for alternatives.

corr(Number col-or-expr, Number col-or-expr)

Returns a correlation coefficient (Pearson’s product-moment) for the values in the two columns or expressions for this partition or for the current window in this partition.

Null values are ignored.

This coefficient indicates whether the two columns have a linear relationship. The closer the value is to |1| the closer the relationship is to linear. If the value returned is:

0: no linear relationship exists.

0 < x <=1: a linear relationship exists where both columns increase as the other increases.

-1 <= x < 0: there is an inverse linear relationship. One column decreases as the other increases.

count(Object col-or-expr)

Returns the number of rows for the specified column or expression for all rows within the group, partition or window scope. Rows with null values are ignored.

covariance(Number col-or-expr, Number col-or-expr)

Deprecated. See Update Built-In Aggregate or Window Analytic Functions for 3.8 for alternatives.

covar(Number col-or-expr, Number col-or-expr)

Returns the measure of the covariance of the two specified columns or expressions.

This measure is not bias corrected. Rows with null values are ignored

A positive covariance indicate that the values for these columns tend to vary alike, as one grows larger so does the other. A negative covariance indicates an inverse relationship between the columns.

covar_pop(Number col-or-expr, Number col-or-expr)

Returns the full population covariance of the two specified col-or-exprs. See covariance for more information.

covar_samp(Number col-or-expr, Number col-or-expr)

Returns the sample covariance (an estimate) of the two specified columns. See covariance for more information.

cume_dist()

The cumulative distribution for the current row in the current partition. This is the percentage of values, including the current row, that are less than or equal to the value for the current row.

Note:

This function requires that partitions be ordered and cannot be applied to windows within the partition.

dense_rank()

Assigns an indexed rank number indicating the order within this partition based on the unique combined values for all columns specified in the Order By clause defined for this partition. All rows with the same unique combination receive the same rank.

Rank numbers are contiguous across the partition. For an example illustrating this function, see Row_number, Rank and Dense_rank Example. See also the rank() plain function.

Note:

This function requires that partitions be ordered and cannot be applied to windows within the partition.

analytics.discretize(Number col-or-expr, int binCount)

This function segments a continuous range of values for a column or expression into discrete bins based on the number of bins specified. If the value:

< = minimum value for the column, it returns 0.

> = maximum value for the column, it returns binCount - 1.

< maximum and > minimum value for the column, it returns a fraction between 0 and binCount - 1 indicating the value’s relative position within the number of bins.

So 0 <= return value < 1, indicates the first bin, 1 <= return value < 2, indicates the second bin, and so on.

Note that this function does not support windows. It can only be applied to the entire partition.

first_value(Object col-or-expr)

first_value(Object col-or-expr, Any default)

Returns the value for the specified column or expression for the first row in the partition or window for the current row.

If a default value is specified, this returns the default value in cases where the specified window is empty.

gmean(Number col-or-expr)

Returns the geometric mean of the values for this column or expression within the partition or the current window in this partition.

The geometric mean is typically used to define a mean when more than one property is involved, especially if the scale for the properties is different.

analytics.kmean_clusters(Number col-or-expr1[,Number col-or-expr2,...Number col-or-exprN]; Integer k, Integer iterations, String measure)

Deprecated. See analytics.kmeans_clusters.

analytics.kmeans_clusters(Number col-or-expr1[,Number col-or-expr2,...Number col-or-exprN]; Integer k, Integer iterations, String measure)

Returns the optimum center point for the number of clusters specified as k that group rows within a minimum distance for each cluster. Inclusion in a cluster for each row is defined by the features specified in the list of column parameters (before the semi-colon). The distance from the cluster center point is measured by the formula specified in measure.

col-or-expr1, col-or-expr2, ... col-or-exprN;: are individual columns or expressions used as features of the vectors that define clusters. This list of parameters must:

Consist solely of columns with numeric values.

Contain at least one column, but can have any number of columns.

End with a semi-colon (;). This indicates the end of feature parameters and the beginning of the remaining, well-known parameters.

k: is the number of clusters to create

iterations: is the maximum number of iterations to perform to optimize clusters.

measure: is the name of the formula to use to define membership for each row in a given cluster. Valid measures are:

euclidean

manhattan

cosine

tanimoto

squaredeuclidean

analytics.kmean_observations(Number col-or-expr1[,Number col-or-expr2,...Number col-or-exprN]; Integer k, Integer iterations, String measure)

Deprecated. See analytics.kmeans_observations.

analytics.kmeans_observations(Number col-or-expr1[,Number col-or-expr2,...Number col-or-exprN]; Integer k, Integer iterations, String measure)

Returns the ID of the cluster that each row belongs to for the number of clusters specified as k. Inclusion in a cluster for each row is defined by the features specified in the list of column parameters (before the semi-colon). The distance from the cluster center point is measured by the formula specified in measure.

col-or-expr1, col-or-expr2, ... col-or-exprN;: are individual columns or expressions used as features of the vectors that define clusters. This list of parameters must:

Consist solely of columns with numeric values.

Contain at least one column, but can have any number of columns.

End with a semi-colon (;). This indicates the end of feature parameters and the beginning of the remaining, well-known parameters.

k: is the number of clusters to create

iterations: is the maximum number of iterations to perform to optimize clusters

measure: is the name of the formula to use to define membership for each row in a given cluster. Valid measures are:

euclidean

manhattan

cosine

tanimoto

squaredeuclidean

lag(Object col-or-expr)

lag(Object col-or-expr, int offset)

lag(Object col-or-expr, int offset, Object default)

Returns the value for the specified column or expression from a preceding row for the current row.

The preceding row is found using the offset specified or using 1 if this parameter is omitted.

If the specified preceding row is outside the scope of this partition or the current window within this partition, this returns the default, if specified, or returns null. The data type for default must match the data type for col-or-expr.

last_value(Object col-or-expr)

Returns the value for this column or expression for the last row in this partition or window.

lead(Object col-or-expr)

lead(Object col-or-expr, int offset)

lead(Object col-or-expr, int offset, Object default)

Returns the value for the specified column or expression from a following row for the current row.

The following row is found using the offset specified or using 1 if this parameter is omitted.

If the specified following row is outside the scope of this partition or the current window within this partition, this returns the default, ifspecified, or returns null. The data type for default must match the data type for col-or-expr.

analytics.linear_regression_intercept(Number col-or-expr1, Number col-or-expr2)

Returns the intercept value for the line resulting from a linear regression of the two columns or expressions passed as arguments. This uses an ordinary least squares regression where the first column is considered the independent variable.

analytics.linear_regression_interpolate(Number col-or-expr, Number col-or-expr)

Returns the expected value (the Y coordinate) for the point on the regression line for each row in this dataset. This line is derived from a linear regression applied to the two specified columns or expressions.

analytics.linear_regression_slope(Number col-or-expr, Number col-or-expr)

Returns the slope value for the line resulting from a linear regression of the two columns or expressions passed as arguments. This uses an ordinary least squares regression where the first column is considered the independent variable.

max(Number col-or-expr)

Returns the maximum value of this column or expression for all rows within the group, partition or window scope.

min(Number col-or-expr)

Returns the minimum value of this column or expression for all rows within the group, partition or window scope.

analytics.ordinal(String col-or-expr)

Returns an index number for each unique string value in the specified column or expression. Typically, assigning an ordinal number allows computations to work with string columns.

ntile(Number groups)

Distributes the rows in an ordered partition into the specified number of groups (the percentiles).

Note:

This function requires that partitions be ordered and cannot be applied to windows within the partition.

nth_value(Object col-or-expr, Integer n)

nth_value(Object col-or-expr, Integer n, Object default)

Returns the value for the expression or column of the specified row (n) in the current partition and window.

The n parameter must be a positive integer. The col-or-expr parameter can be a column or any calculation.

If the specified nth row is outside the scope of this partition or the current window within this partition, this returns the default, if specified, or returns null. The data type for default must match the data type for col-or-expr.

percent_rank()

The percentage of rows that are less than the value for the current row based on the number of rows in the current partition and window not counting the highest value.

Note:

This function requires that partitions be ordered and cannot be applied to windows within the partition.

rank()

Assigns an indexed rank number indicating the order within this partition based on the unique combined values for all columns in the Order By clause defined for this partition. All rows with the same unique value receive the same rank.

Rank numbers may not be contiguous, as they are skipped if multiple rows have identical ranks. For an example illustrating this function, see Row_number, Rank and Dense_rank Example. See also the dense_rank() plain function.

Note:

This function requires that partitions be ordered and cannot be applied to windows within the partition.

regr_intercept(Number col-or-expr1, Number col-or-expr2)

See analytics.linear_regression_intercept.

regr_r2(Number col-or-expr1, Number col-or-expr2)

Returns the coefficient of determination (R squared) of a linear regression line based on the specified columns or expressions. This indicates how well the linear regression line is fitted to the data.

This ranges from 0 to 1, but can return null if there is no variance in the values of col-or-expr2.

regr_slope(Number col-or-expr1, Number col-or-expr2)

See analytics.linear_regression_slope.

row_number()

Returns a sequential index number for this row within the current partition. For an example illustrating this function, see Row_number, Rank and Dense_rank Example.

Note:

This function cannot be applied to windows within the partition.

skew(Number col-or-expr)

Returns a measure of the asymmetry from a normal probability distribution of the values for this column or expression within the current partition or the current window in this partition. This uses the adjusted Fisher-Pearson standardized moment coefficient.

Positive values tend to indicate that more values are found left of the mean, while negative values imply the reverse.

This returns 0 if there are not at least three rows in the dataset. This function can also return NaN if the caluclation is undefined or not a number.

stddev(Number col-or-expr)

Returns the standard deviation of the values for this column or expression in the partition or the current window in this partition. Null values are ignored.

If there are no numeric values, this returns NaN. If there is only a single row, this returns zero.

stddev_pop(Number col-or-expr)

Returns the full population standard deviation of the values for this column or expression. See stddev for more information.

stddev_samp(Number col-or-expr)

Returns the sample standard deviation (an estimate) of the values for this column or expression. See stddev for more information.

sum(Number col-or-expr)

Returns the sum of all values of this column or expression for the rows within the group, partition or window scope. Null values are ignored.

If all values are null or empty, return null.

variance(Number col-or-expr)

Returns the variance of the distribution of values for this column or expression within the current partition or the current window in this partition. Null values are ignored.

If the column has no numeric values, this returns NaN.

var_pop(Number col-or-expr)

Returns the full population variance for this column or expression. See variance for more information.

var_samp(Number col-or-expr)

Returns the sample variance (an esitmate) for the values of this column or expression. See variance for more information.

Table 1. Example
aColumn
1
2
2
3
3
3

Table 2. Example
rowNumber	ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING	ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING	ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
1	Window = [1,4]	Window = []	Window = [1,10]
2	Window = [1,5]	Window = [1,1]	Window = [2,10]
3	Window = [1,6]	Window = [1,2]	Window = [3,10]
4	Window = [1,7] (from row number 4-3 to 4+3)	Window = [1,3]	Window = [4,10]
5	Window = [2,8]	Window = [2,4]	Window = [5,10]
6	Window = [3,9]	Window = [3,5]	Window = [6,10]
7	Window = [4,10]	Window = [4,6]	Window = [7,10]
8	Window = [5,10]	Window = [5,7]	Window = [8,10]
9	Window = [6,10]	Window = [6,8]	Window = [9,10]
10	Window = [7,10]	Window = [7,9]	Window = [10,10]

Table 3. Example
rowNumber	salary	SELECT avg(salary) OVER (ORDER BY salary RANGE BETWEEN 300 PRECEDING AND 300 FOLLOWING) FROM ...	considered window
1	1750	1850	[1450,2050]	salaries between 1750 - 300 and 1750 + 300
2	1900	1850	[1600,2200]
3	1900	1850	[1600,2200]
4	2200	2100	[1900,2500]
5	2400	2300	[2100,2700]
6	2750	2866,67	[2450,3050]
7	2900	2980	[2600,3200]
8	2950	2980	[2650,3250]
9	3100	2980	[2800,3400]
10	3200	2980	[2900,3500]

Table 4. Example
rowNumber	salary	SELECT avg(salary) OVER (ORDER BY salary RANGE BETWEEN 300 PRECEDING AND 300 FOLLOWING) FROM ...	considered window
1	1750	1	[1750,1750]	salaries between 1750 - 0 and 1750 + 0
2	1900	2	[1900,1900]
3	1900	2	[1900,1900]
4	2200	1	[2200,2100]
5	2400	1	[2400,2400]
6	2750	1	[2750,2750]
7	2900	1	[2900,2900]
8	2950	1	[2950,2950]
9	3100	1	[3100,3100]
10	3200	1	[3200,3200]

Note:

The offsets in a RANGE frame clause define logical offsets in the ordered sequence defined by the order by clause. If the order by clause specifies a descending order (see next example) the offset 10 PRECEDING translates to "a value that precedes the current value in the current sequence and differs at most by 10 (years/units/...)"

Table 5. Example
rowNumber	age	SELECT COUNT(*) OVER (ORDER BY age DESC RANGE 10 PRECEDING) FROM ...	considered window
1	75	1	[85, 75]	ages between 75 + 10 and 75 - 0
2	68	2	[78, 68]
3	68	2	[72, 62]
4	56	2	[66, 56]
5	55	3	[65, 55]
6	51	3	[61, 51]
7	48	5	[58, 48]
8	48	5	[58, 48]
9	47	6	[57, 47]
10	43	5	[53, 43]