Analytic functions

Analytic functions most commonly perform calculations using sets of rows within a dataset. This may be the entire dataset or specific sets of rows defined as groups, partitions or windows.

Aggregation analytic functions: use all rows in the dataset, group, partition or window to perform a calculation and return a single value. For example, sum adds the values of all rows in the current scope. Aggregate analytic functions can be be used in the SELECT or HAVING clauses of RAQL queries:

Window analytic functions: use some of the rows in a partition or window to perform a calculation and return a value for each row. Typically, this uses rows that are relative to the current row, such as FIRST_VALUE or ROW_NUMBER.

You must include an OVER clause in queries that use window analytic functions. This defines the partitions or windows used by the function.

As with plain functions, RAQL provides analytic functions defined in the SQL standard as well as RAQL-specific extensions. See AggregateFunctionCall for more details on the standard SQL aggregation functions being supported.

The RAQL-specific extensions are:

Aggregate

Window

Function

Description

 

correlation(Number col-or-expr, Number col-or-expr)

Deprecated. See Update analytic functions for 3.8 for alternatives.

 

covar_samp(Number col-or-expr, Number col-or-expr)

Returns the sample covariance (an estimate) of the two specified columns. See covariance for more information.

 

analytics.discretize(Number col-or-expr, int binCount)

This function segments a continuous range of values for a column or expression into discrete bins based on the number of bins specified. If the value:

  • < = minimum value for the column, it returns 0.
  • > = maximum value for the column, it returns binCount - 1.
  • < maximum and > minimum value for the column, it returns a fraction between 0 and binCount - 1 indicating the value’s relative position within the number of bins.

So 0 <= return value < 1, indicates the first bin, 1 <= return value < 2, indicates the second bin, and so on.

Note that this function does not support windows. It can only be applied to the entire partition.

 

gmean(Number col-or-expr)

Returns the geometric mean of the values for this column or expression within the partition or the current window in this partition.

The geometric mean is typically used to define a mean when more than one property is involved, especially if the scale for the properties is different.

 

analytics.kmeans_clusters(Number col-or-expr1[,Number col-or-expr2,...Number col-or-exprN]; Integer k, Integer iterations, String measure)

Returns the optimum center point for the number of clusters specified as k that group rows within a minimum distance for each cluster. Inclusion in a cluster for each row is defined by the features specified in the list of column parameters (before the semi-colon). The distance from the cluster center point is measured by the formula specified in measure.

  • col-or-expr1, col-or-expr2, ... col-or-exprN;: are individual columns or expressions used as features of the vectors that define clusters. This list of parameters must:
  • Consist solely of columns with numeric values.
  • Contain at least one column, but can have any number of columns.
  • End with a semi-colon (;). This indicates the end of feature parameters and the beginning of the remaining, well-known parameters.
  • k: is the number of clusters to create
  • iterations: is the maximum number of iterations to perform to optimize clusters.
  • measure: is the name of the formula to use to define membership for each row in a given cluster. Valid measures are:
  • euclidean
  • manhattan
  • cosine
  • tanimoto
  • squaredeuclidean

 

analytics.kmeans_observations(Number col-or-expr1[,Number col-or-expr2,...Number col-or-exprN]; Integer k, Integer iterations, String measure)

Returns the ID of the cluster that each row belongs to for the number of clusters specified as k. Inclusion in a cluster for each row is defined by the features specified in the list of column parameters (before the semi-colon). The distance from the cluster center point is measured by the formula specified in measure.

  • col-or-expr1, col-or-expr2, ... col-or-exprN;: are individual columns or expressions used as features of the vectors that define clusters. This list of parameters must:
  • Consist solely of columns with numeric values.
  • Contain at least one column, but can have any number of columns.
  • End with a semi-colon (;). This indicates the end of feature parameters and the beginning of the remaining, well-known parameters.
  • k: is the number of clusters to create
  • iterations: is the maximum number of iterations to perform to optimize clusters
  • measure: is the name of the formula to use to define membership for each row in a given cluster. Valid measures are:
  • euclidean
  • manhattan
  • cosine
  • tanimoto
  • squaredeuclidean

 

analytics.ordinal(String col-or-expr)

Returns an index number for each unique string value in the specified column or expression. Typically, assigning an ordinal number allows computations to work with string columns.

 

skew(Number col-or-expr)

Returns a measure of the asymmetry from a normal probability distribution of the values for this column or expression within the current partition or the current window in this partition. This uses the adjusted Fisher-Pearson standardized moment coefficient.

Positive values tend to indicate that more values are found left of the mean, while negative values imply the reverse.

This returns 0 if there are not at least three rows in the dataset. This function can also return NaN if the caluclation is undefined or not a number.