Multi-Level Group Calculations
The example of a Group By clause in Getting Started,
Group and Analyze Rows, used a composite group containing two group columns to segment stock data into sets by symbol and then year within symbol.
Once groups are defined, you determine what analysis occurs and is returned using analytic functions in the Select clause. Each group returns one row of data, so the functions you choose must be aggregate analytic functions that return a single value for all rows in the group. See
Built-In Analytic Functions: Aggregate and Window for the list of built-in
MashZone NextGen analytic functions that you may use with Group By clauses.
For more examples, see:
Single Level Group
You can use a single column group such as this example which counts the number of legislators in each house of the US Congress:
SELECT chamber, COUNT(chamber) FROM congress GROUP BY chamber
Chamber | COUNT_chamber |
senate | 100 |
house | 439 |
Three Group Levels
With each level of group, you add columns to the expression. This example groups stock data by symbol, year and quarter within year to derive the average low price for each quarter:
SELECT symbol,extract_year(datetime), quarter(datetime), round(avg(low)) AS qtrLow
FROM stocks
GROUP BY symbol,extract_year(datetime), quarter(datetime)
ORDER BY symbol,extract_year(datetime) DESC, quarter(datetime)
Symbol | Extract_year_datetime | Quarter_datetime | QtrLow |
AAPL | 2011 | 1 | 337.0 |
AAPL | 2011 | 2 | 335.0 |
AAPL | 2010 | 1 | 206.0 |
AAPL | 2010 | 2 | 242.0 |
AAPL | 2010 | 3 | 253.0 |
AAPL | 2010 | 4 | 305.0 |
AAPL | 2009 | 1 | 89.0 |
AAPL | 2009 | 2 | 126.0 |
AAPL | 2009 | 3 | 161.0 |
AAPL | 2009 | 4 | 192.0 |
AAPL | 2008 | 1 | 130.0 |
AAPL | 2008 | 2 | 168.0 |
... | ... | ... | ... |
The columns that appear in the Select clause must either be used for group in the Group By clause or be used in aggregate calculations. It is also quite common to sort by the columns used in grouping.
Groups Using an Aggregate Analytic Function
This example calculates the standard deviation, using the stddev() aggregate analytic function, for the high prices of each stock and year:
SELECT symbol, extract_year(datetime) AS yr, stddev(high) AS highStdDev
FROM stocks
GROUP BY symbol, extract_year(datetime)
ORDER BY symbol, yr
Symbol | Yr | HighStdDev |
AAPL | 1984 | 1.6057616847575154 |
AAPL | 1985 | 4.628989380498321 |
AAPL | 1986 | 5.626270136472927 |
AAPL | 1987 | 14.69298548409742 |
AAPL | 1988 | 2.4555231784570233 |
AAPL | 1989 | 4.2820135239165875 |
AAPL | 1990 | 4.529251236509766 |
AAPL | 1991 | 7.4271879581244775 |
AAPL | 1992 | 7.62725912432657 |
AAPL | 1993 | 13.459208620188244 |
AAPL | 1994 | 4.371574862654053 |
AAPL | 1995 | 4.069107853754948 |
AAPL | 1996 | 3.336661126725468 |
... | ... | ... |
The columns that appear in the Select clause must either be used for group in the Group By clause or be used in aggregate calculations. It is also quite common to sort by the columns used in grouping.
Aggregate on Multiple Groups with ROLLUP, CUBE and GROUPING SETS
The examples above partitioned the data and calculated an aggregate value for each partition based only on a single group criterion. However, let's assume you are not only interested in the standard deviation of high prices for each stock and year, but additionally would like to see the overall standard deviation of high prices per stock, and the total standard deviation of high prices over all stocks. You could run three separate GROUP BY queries and combine their results as follows:
SELECT symbol, extract_year(datetime) AS yr, stddev(high) AS highStdDev FROM stocks
GROUP BY symbol, extract_year datetime)
UNION ALL
SELECT symbol, CAST(NULL AS DATE) AS yr, stddev(high) AS highStdDev FROM stocks
GROUP BY symbol
UNION ALL
SELECT CAST(NULL AS STRING) AS symbol, CAST(NULL AS DATE) AS yr, stddev(high) AS
highStdDev FROM stocks ORDER BY symbol NULLS FIRST, yr DESC NULLS FIRST
-- we specify an ordering here to see the aggregated symbol and year columns
in the example output --
MashZone NextGenfully supports the SQL syntax for complex GROUPING SETS, which allows to specify a single set of group expressions in the group-by clause, and to specify multiple sets that the data is partitioned and aggregated upon. Using this syntax, the above query could easily be rewritten to:
SELECT symbol, extract_year(datetime) AS yr, stddev(high) AS highStdDev FROM stocks
GROUP BY GROUPING SETS ( (symbol, extract_year(datetime)), symbol, () )
ORDER BY symbol NULLS FIRST, yr DESC NULLS FIRST
-- we specify an ordering here to see the aggregated symbol and year columns
in the example output --
Note: | Please note, that this query is not only much shorter, but may also be evaluated more efficiently by the query engine. |
The above GROUPING SETS clause defines three sets of group expressions (
symbol, extract_year(datetime)),
symbol, and the
empty grouping set (), so that in the result of the query we will find a row for every unique pair of (
symbol, extract_year(datetime)) values, concatenated by rows for each unique value of
symbol, and a single row representing the empty group. For each such group row, the values of the group expressions that are not considered in the current grouping are filled with
NULL, e. g. the
yr column is
NULL for the second grouping by
symbol (see also section
Null values in Grouping Set aggregations).
Symbol | Yr | HighStdDev | |
| | 95.99022287735269 | This row indicates the HighStdDev aggregated over all stocks and all years. |
AAPL | | 66.38352453765187 | This row indicates the HighStdDev for symbol AAPL aggregated over all years. |
AAPL | 2011 | 7.78515914570644 | |
AAPL | 2010 | 37.77321347891091 | |
AAPL | 2009 | 40.274807396460886 | |
AAPL | 2008 | 33.6190395688354 | |
AAPL | 2007 | 37.83959132285136 | |
AAPL | 2006 | 9.55659436868603 | |
AAPL | 2005 | 15.564417219126476 | |
AAPL | 2004 | 13.782231301956635 | |
AAPL | 2003 | 3.4495406114241156 | |
AAPL | 2002 | 4.4393911367617065 | |
AAPL | 2001 | 2.6334977228792624 | |
AAPL | 2000 | 43.679940128685516 | |
... | ... | ... | |
Special Grouping Sets: ROLLUP and CUBE
It is very common, to have hierarchical dimensions on which to partition the data. For instance, year and month could be such a dimension hierarchy. If you are interested in the highest price values among all stocks for each dimension level, you could use the GROUPING SETS syntax to specify the following four sets of group expressions:
SELECT extract_year(datetime) AS y, extract_month(datetime) AS m,
max(high) AS maxHigh
FROM stocks
GROUP BY GROUPING SETS ( (extract_year(datetime), extract_month(datetime)),
(extract_year(datetime)),
() )
ORDER BY y NULLS FIRST, m NULLS FIRST
-- we specify an ordering here to see the aggregated symbol and year columns
in the example output --
However, for queries like this, where the grouping set dimensionality is reduced from right to left, MashZone NextGen supports the SQL keyword ROLLUP , which - given a list of group expressions - rolls-up the list by consecutively removing one expression from the end of the list and calculating the next higher level aggregate, up to the overall aggregate.
Hence, the above query is equivalent to the following query using the ROLLUP syntax instead:
SELECT extract_year(datetime) AS y, extract_month(datetime) AS m, max(high) AS maxHigh
FROM stocks
GROUP BY ROLLUP (extract_year(datetime), extract_month(datetime))
ORDER BY y NULLS FIRST, m NULLS FIRST
-- we specify an ordering here to see the aggregated symbol and year columns
in the example output --
Note: | Please note, that the roll-up of the grouping columns is reflected in the result by a NULL value in the corresponding group column (shown as an empty column). |
Y | M | MaxHigh | |
| | 747.24 | This row indicates the MaxHigh aggregated over all years and all months. |
1984 | | 128.5 | This row indicates the HighStdDev for year 1984 aggregated over all months. |
1984 | 9 | 128.5 | |
1984 | 10 | 127.75 | |
1984 | 11 | 127.75 | |
1984 | 12 | 124.5 | |
1985 | | 158.75 | This row indicates the HighStdDev for year 1985 aggregated over all months. |
1985 | 1 | 137.63 | |
1985 | 2 | 138.25 | |
1985 | 3 | 136.38 | |
1985 | 4 | 130.38 | |
1985 | 5 | 133.25 | |
1985 | 6 | 130.63 | |
1985 | 7 | 132.75 | |
... | ... | ... | ... |
In addition to the ROLLUP keyword, which enumerates a hierarchy of grouping sets,MashZone NextGensupports the SQL keyword CUBE, to enumerate the power set of grouping sets, that is the set of all subsets. As an example, consider the group expressions chamber, state and gender. Then the clause CUBE(party, state, gender) translates into the following grouping sets:
(party, state, gender)
(party, state)
(party, gender)
(state, gender)
(party)
(state)
(gender)
()
In comparison, these would be the grouping sets produced by ROLLUP(party, state, gender) :
(party, state, gender)
(party, state)
(party)
()
That is the query:
SELECT party, state, gender, count(*) FROM congress
WHERE chamber = 'house' AND state LIKE 'K%' -- only to limit the result size here --
GROUP BY GROUPING SETS ( (party, state, gender),
(party, state), (party, gender), (state, gender)
(party), (state), (gender),
() )
It yields the exact same results as the much shorter query:
SELECT party, state, gender, count(*) FROM congress
WHERE chamber = 'house' AND state LIKE 'K%' -- only to limit the result size here --
GROUP BY CUBE(party, state, gender)
ORDER BY party NULLS FIRST, state NULLS FIRST, gender NULLS FIRST
Party | State | Gender | COUNT_ | |
| | | 10 | Overall number of congressmen representing Kansas or Kentucky |
| | F | 1 | Overall number of female congressmen representing Kansas or Kentucky |
| | M | 9 | Overall number of male congressmen representing Kansas or Kentucky |
| KS | | 4 | Number of congressmen representing Kansas |
| KS | F | 1 | Number of female congressmen representing Kansas |
| KS | M | 3 | Number of male congressmen representing Kansas |
| KS | | 6 | Overall number of congressmen representing Kentucky |
| KS | M | 6 | Number of male congressmen representing Kentucky |
D | | | 2 | Number of democratic congressmen representing Kansas or Kentucky |
D | | M | 2 | Number of male democratic congressmen representing Kansas or Kentucky |
D | KS | | 2 | Number of democratic congressmen representing Kentucky |
D | KS | M | 2 | Number of male democratic congressmen representing Kentucky |
R | | | 8 | Number of republican congressmen representing Kansas or Kentucky |
R | | F | 1 | Number of female republican congressmen representing Kansas or Kentucky |
... | ... | ... | ... | ... |
Using composite group criteria
With the CUBE and ROLLUP syntax it is also possible, to combine multiple columns into a group expression, that is treated as if it was a single column. For instance, the expression ROLLUP(a, (b,c), d) translates into the grouping sets (a, (b,c), d), (a, (b,c)), (a), (). I.e., c is not rolled-up from (b,c).
Note: | It is a best practice, to use ROLLUP and CUBE only in cases, when all calculated groupings are relevant. If only a subset of the produced groupings is of interest, it is more efficient to use the corresponding GROUPING SETS expression generating only those relevant groupings. |
Null values in Grouping Set aggregations
In the result of a grouping query over multiple grouping sets, NULL values in one of the group columns may have different meanings:
1. If the group column contains NULL values in its source table, all those NULL values are treated equally and grouped into a separate NULL group.
2. If the group column is aggregated, i. e., it is omitted in the grouping set for this grouping’s row, then the group column is also NULL.
Consequently, it is not possible to tell the two different meanings of NULL apart only from looking at the result row itself. Therefore, a special SQL function GROUPING is provided, the purpose of which is to tell for a single group expression, whether or not this expression has been aggregated in the result (result = 1) or not (result = 0).
SELECT symbol, GROUPING(symbol) AS grp_symbol, extract_year(datetime) AS yr,
GROUPING(extract_year(datetime)) AS grp_year,
stddev(high) AS highStdDev
FROM stocks
GROUP BY ROLLUP (symbol, extract_year(datetime))
ORDER BY symbol NULLS FIRST, yr DESC NULLS FIRST
Symbol | Grp_symbol | Yr | Grp_year | HighStdDev | |
| 1 | | 1 | 95.99022287735269 | NULL values here indicate that symbol and yr have been aggregated. |
AAPL | 0 | | 1 | 66.38352453765187 | NULL value in column yr here indicates that yr has been aggregated. |
AAPL | 0 | 2011 | 0 | 7.78515914570644 | |
AAPL | 0 | 2010 | 0 | 37.77321347891091 | |
AAPL | 0 | 2009 | 0 | 40.274807396460886 | |
AAPL | 0 | 2008 | 0 | 33.6190395688354 | |
... | ... | ... | ... | ... | ... |
Additionally, this function is overloaded to also take a number of group expressions as parameters. In this case, the result will be a numerical id for each resulting grouping. In order to make this id unique among all groupings, all group expressions should be given as parameters to the function.
SELECT symbol, extract_year(datetime) AS yr,
GROUPING(symbol, extract_year(datetime)) AS grp,
stddev(high) AS highStdDev
FROM stocks
WHERE extract_year(datetime) BETWEEN 2009 AND 2011 -- only to limit the result set here --
GROUP BY ROLLUP (symbol, extract_year(datetime))
ORDER BY symbol NULLS FIRST, yr DESC NULLS FIRST
Symbol | Yr | Grp | HighStdDev | |
| | 3 | 151.89615745136808 | There is only one row in this grouping representing the overall aggregate. |
AAPL | | 1 | 82.34565115212739 | This row belongs to the grouping 1 which relates to the grouping set (symbol). |
AAPL | 2011 | 0 | 7.785159145706435 | This row belongs to the grouping 0 which relates to the grouping set (symbol, extract_year(datetime)). |
AAPL | 2010 | 0 | 37.773213478910954 | |
AAPL | 2009 | 0 | 40.274807396460936 | |
AAPL | | 1 | 41.95096980230709 | This row also belongs to the grouping 1 which relates to the grouping set (symbol). |
... | ... | ... | ... | ... |
The GROUPING(col1, … colN) result is internally computed by combining the GROUPING results for each single column into a bit set which is finally interpreted as a LONG value. Example: Assume the following GROUPING values for single group columns a, b and c:
GROUPING(a) = 1, GROUPING(b) = 0 and GROUPING(c) = 1
Then GROUPING(a,b,c) yields the bit set 101 which is interpreted as the LONG value 5.
To make things clear, the following query shows all the distinct groupings resulting from a group by CUBE over three columns. Note how the single GROUPING values for each column are concatenated into a bit set representing the final GROUPING value.
SELECT DISTINCT
GROUPING(symbol) AS grpSmbl, GROUPING(extract_year(datetime)) AS grpYr,
GROUPING(extract_month(datetime)) AS grpMnth,
CAST(GROUPING(symbol) AS STRING) || CAST(GROUPING(extract_year(datetime)) AS STRING) ||
CAST(GROUPING(extract_month(datetime)) AS STRING) AS bits,
GROUPING(symbol, extract_year(datetime), extract_month(datetime)) AS grp
FROM stocks
GROUP BY CUBE (symbol, extract_year(datetime), extract_month(datetime))
ORDER BY grp ASC
grpSymbol | grpYear | grpMonth | bits | grp |
0 | 0 | 0 | 000 | 0 |
0 | 0 | 1 | 001 | 1 |
0 | 1 | 0 | 010 | 2 |
0 | 1 | 1 | 011 | 3 |
1 | 0 | 0 | 100 | 4 |
1 | 0 | 1 | 101 | 5 |
1 | 1 | 0 | 110 | 6 |
1 | 1 | 1 | 111 | 7 |
Note: | Please note that each unique GROUPING id only indicates the grouping set, which produced the groups in this grouping. It does not uniquely identify a group within this grouping! |
Combining and Merging Grouping Sets
The group by clause may not only take simple group expressions or grouping sets, but also any combination of them, e. g., GROUP BY a, ROLLUP(b,c), GROUPING SETS((d,e), (f)). The resulting grouping sets are then defined by the Cartesian product of the involved grouping sets, i.e. the example clause translates into the following grouping sets:
This is the result of the Cartesian product of {(a)} x {(b, c), (b), ()} x {(d, e), (f)}
(a, b, c, d, e)
(a, b, c, f)
(a, b, d, e)
(a, b, f)
(a, d, e)
(a, f)
Likewise, the GROUPING SETS clause may take nested ROLLUP , CUBE and GROUPING SET clauses, e. g. GROUP BY GROUPING SETS (a, ROLLUP(b,c), GROUPING SETS((d,e), (f)) ). In this case, the nested grouping sets are "flattened" and merged into the surrounding grouping set, i.e., the example clause translates into the following grouping sets:
This is the result of the grouping sets (a), followed by all grouping sets (b,c), (b), () resulting from the rollup clause, followed by the grouping sets (d,e) and (f).
(a)
(b, c)
(b)
()
(d,e)
(f)