Analytical and Window Functions using Snowflake

July 22, 2020 (updated July 29, 2020) · Blogs

Introduction

Snowflake is a cloud data platform and data warehouse that supports ANSI SQL, the most common standardized version of SQL. This means that most common SQL operations can be used within Snowflake. Snowflake supports the DDL and DML operations needed for a data warehousing implementation. Above all, ETL logic can be written easily in Snowflake, including transformations, aggregations, SCDs, batch processing, ranking/scoring, and more.

SQL is a domain-specific language designed for managing data held in an RDBMS, or for stream processing in an RDSMS. It is particularly useful for handling structured data, triangulating data points, and applying analytical logic to derive insights.

“At the same time, there’s a pure SQL implementation called window/analytical functions that is very readable, performant, and easy to debug.”

Mastering SQL analytical functions is a necessity for a BI/DWH professional.

Business intelligence (BI) applications are used for a variety of data-related activities, and SQL serves them well by combining data retrieval, analysis, and computation.

In this post on Snowflake SQL window functions, we will describe how these functions work in general. For data warehousing, Snowflake provides sophisticated analytic and window functions such as RANK, LEAD, LAG, and SUM, along with constructs like GROUP BY and PARTITION BY.

Execution Flow of Functions in SNOWFLAKE

When an analytical function executes, it organizes data into partitions, computes the function over those partitions in a specified order, and returns the result.

Analytical functions are processed after row filtering (WHERE) and GROUP BY, so that the calculations can be performed on the grouped results. The returned rows are ordered after the functions have been processed.

We place analytical functions in the SELECT list, just as we would with any other result columns in a SQL SELECT statement.

Data Model (Automobile Sales)

Below is the data model, created in Snowflake, to track and analyze automobile sales.


Schema consists of the following tables:

  • customers: Stores customer data.
  • products: Stores a list of scale model cars.
  • productlines: Stores a list of product line categories.
  • orders: Stores sales orders placed by customers.
  • orderdetails: Stores sales order line items for each sales order.
  • payments: Stores payments made by customers based on their accounts.
  • employees: Stores all employee information as well as the organization structure such as who reports to whom.
  • offices: Stores sales office data.

Let’s Start

Let us learn by writing simple SQL queries to get some insights out of the automobile sales database.

RANKING FUNCTIONS:

  • RANK

Syntax - RANK() OVER ( [ PARTITION BY <expr1> ] ORDER BY <expr2> [ { ASC | DESC } ] [ <window_frame> ] )

RANK is a simple analytical function that does not take any column list arguments. It returns a number or a rank based on the ordering of the rows; the ordering is based on a defined condition.

Let’s look at the example where we will rank products based on their quantity ordered and filter TOP N products.
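A query along these lines would do it. This is a sketch: the table and column names (products.productcode, orderdetails.quantityordered, etc.) are assumed from the classic automobile-sales sample schema described above.

```sql
-- Total quantity ordered per product, ranked from highest to lowest
SELECT p.productname,
       SUM(od.quantityordered) AS quantity_ordered,
       RANK() OVER (ORDER BY SUM(od.quantityordered) DESC) AS rank_no
FROM products p
JOIN orderdetails od
  ON od.productcode = p.productcode
GROUP BY p.productname
-- QUALIFY filters on the window-function result, giving us the TOP N
QUALIFY rank_no <= 5
ORDER BY rank_no;
```

QUALIFY is Snowflake's clause for filtering on window-function results; it saves wrapping the query in a subquery.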

Output we get:

Descending (DESC) order goes from the highest quantity to the lowest. If we want the BOTTOM N products instead, we can use Ascending (ASC) order.

Output we get:

  • DENSE_RANK

Syntax - DENSE_RANK() OVER ( [ PARTITION BY <expr1> ] ORDER BY <expr2> [ ASC | DESC ] [ <window_frame> ] )

DENSE_RANK has the same functionality for identifying TOP N values but avoids gaps in the ranks: it starts from 1 and assigns ranks sequentially, so tied values share a rank and the next distinct value gets the next consecutive rank.

PRODUCT NAME                   QUANTITY ORDERED   RANK   DENSE_RANK
1913 Ford Model T Speedster    900                1      1
1952 Alpine Renault 1300       800                2      2
Pont Yacht                     750                3      3
1972 Alfa Romeo GTA            750                3      3
The Titanic                    78                 5      4
1937 Lincoln Berline           78                 5      4
Corsair F4U ( Bird Cage)       78                 5      4
1940 Ford Delivery Sedan       50                 8      5

The rows with tied quantities (750 and 78) are the points where we can easily see the difference between DENSE_RANK and RANK.

Below is the SQL for DENSE_RANK implementation.
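A sketch of that SQL, under the same assumed sample schema, computing both functions side by side for comparison:

```sql
-- RANK leaves gaps after ties; DENSE_RANK does not
SELECT p.productname,
       SUM(od.quantityordered) AS quantity_ordered,
       RANK()       OVER (ORDER BY SUM(od.quantityordered) DESC) AS rank_no,
       DENSE_RANK() OVER (ORDER BY SUM(od.quantityordered) DESC) AS dense_rank_no
FROM products p
JOIN orderdetails od
  ON od.productcode = p.productcode
GROUP BY p.productname
ORDER BY rank_no;
```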

 

  • RUNNING/CUMULATIVE SUM

Syntax - SUM( <expr1> ) OVER ( [ PARTITION BY <expr2> ] [ ORDER BY <expr3> [ ASC | DESC ] [ <window_frame> ] ] )

The motive here is to understand running-sum calculations: we will calculate the running sum of the quantities ordered per product, building on the per-product totals from the previous examples.

Here we make use of the ROW_NUMBER function (which takes no arguments) to define the sequence of rows to follow when calculating the running sum.
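A sketch of the query, again assuming the sample schema. Window functions cannot be nested directly, so the ROW_NUMBER step lives in a CTE:

```sql
-- Running sum of quantities ordered, accumulated in ROW_NUMBER order
WITH product_qty AS (
    SELECT p.productname,
           SUM(od.quantityordered) AS quantity_ordered,
           ROW_NUMBER() OVER (ORDER BY SUM(od.quantityordered) DESC) AS rn
    FROM products p
    JOIN orderdetails od
      ON od.productcode = p.productcode
    GROUP BY p.productname
)
SELECT productname,
       quantity_ordered,
       -- ORDER BY inside OVER makes SUM cumulative row by row
       SUM(quantity_ordered) OVER (ORDER BY rn) AS running_sum
FROM product_qty;
```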

Output we get:

  • COUNT

Syntax - COUNT( [ DISTINCT ] <expr1> [ , <expr2> ... ] )

Count can be directly used as an aggregate function as described in the above syntax. It can also be used as a window function like:

Syntax - COUNT( <expr1> [ , <expr2> ... ] )
OVER ( [ PARTITION BY <expr3> ] [ ORDER BY <expr4> [ ASC | DESC ] [ <window_frame> ] ] )

Let us look at the normal aggregate form first, where we will simply calculate the count of orders placed per year, along with the number of orders actually shipped.
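A sketch of that query, assuming the sample schema's orders table, where shippeddate is NULL for unshipped orders:

```sql
-- Orders placed vs. orders actually shipped, per year
SELECT YEAR(orderdate)    AS order_year,
       COUNT(ordernumber) AS orders_placed,
       -- COUNT skips NULLs, so orders without a shippeddate drop out here
       COUNT(shippeddate) AS orders_shipped
FROM orders
GROUP BY YEAR(orderdate)
ORDER BY order_year;
```

The difference between the two counts per year is the number of unshipped orders.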

Output we get:

  • LEAD

Syntax - LEAD ( <expr> [ , <offset> , <default> ] ) [ { IGNORE | RESPECT } NULLS ] OVER ( [ PARTITION BY <expr1> ] ORDER BY <expr2> [ { ASC | DESC } ] )

LEAD is used to compare the current value with the next value, or with any other value at a desired offset. In a sense, it allows us to access data from the next row without joining the table to itself.

[ { IGNORE | RESPECT } NULLS ] - IGNORE NULLS simply skips the NULL values of the column to which the LEAD function is applied; when a NULL value appears, it pulls the next NOT NULL value instead. The default is RESPECT NULLS.

In this case, we will try to find the difference between the order amount of current order and next order using the LEAD function.
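A sketch of the query. The order amount is derived from orderdetails (quantityordered times priceeach, assumed column names from the sample schema):

```sql
-- Compare each order's amount with the next order's amount
WITH order_totals AS (
    SELECT ordernumber,
           SUM(quantityordered * priceeach) AS order_amount
    FROM orderdetails
    GROUP BY ordernumber
)
SELECT ordernumber,
       order_amount,
       LEAD(order_amount) OVER (ORDER BY ordernumber) AS next_value,
       LEAD(order_amount) OVER (ORDER BY ordernumber) - order_amount AS difference
FROM order_totals;
```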

Output we get:

The last value in the NEXT_VALUE column will be NULL because there is no next record after the last row.

  • LAG

Syntax - LAG ( <expr> [ , <offset> , <default> ] ) [ { IGNORE | RESPECT } NULLS ] OVER ( [ PARTITION BY <expr1> ] ORDER BY <expr2> [ { ASC | DESC } ] )

LAG is used to compare the current value with the previous value, or with any other earlier value at a desired offset. In a sense, it allows us to pull data from the previous row without joining the table to itself.

[ { IGNORE | RESPECT } NULLS ] - IGNORE NULLS simply skips the NULL values of the column to which the LAG function is applied; when a NULL value appears, it pulls the previous NOT NULL value instead. The default is RESPECT NULLS.

In this case, we will try to find the difference between the order amount of current order and previous order using the LAG function.
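A sketch of the query, mirroring the LEAD example with the same assumed order-amount derivation:

```sql
-- Compare each order's amount with the previous order's amount
WITH order_totals AS (
    SELECT ordernumber,
           SUM(quantityordered * priceeach) AS order_amount
    FROM orderdetails
    GROUP BY ordernumber
)
SELECT ordernumber,
       order_amount,
       LAG(order_amount) OVER (ORDER BY ordernumber) AS previous_value,
       order_amount - LAG(order_amount) OVER (ORDER BY ordernumber) AS difference
FROM order_totals;
```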

Output we get:

The first value in the PREVIOUS_VALUE column will be NULL because there is no previous record before the first row.

  • FIRST_VALUE

Syntax - FIRST_VALUE( <expr> ) [ { IGNORE | RESPECT } NULLS ]

OVER ( [ PARTITION BY <expr1> ] ORDER BY <expr2>  [ { ASC | DESC } ] [ <window_frame> ] )

First_Value is used to access or identify the first value in an ordered set of values within a group.

Since this function requires the data to be ordered, we have to specify a window in which the data is divided into groups (partitions) and ordered.
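A sketch under the assumed sample schema: for every order line, show the most-ordered product within its product line.

```sql
-- First value per partition: the top product of each product line
SELECT p.productline,
       p.productname,
       od.quantityordered,
       FIRST_VALUE(p.productname) OVER (
           PARTITION BY p.productline
           ORDER BY od.quantityordered DESC
       ) AS top_product_in_line
FROM products p
JOIN orderdetails od
  ON od.productcode = p.productcode;
```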

Output we get:

  • LAST_VALUE

Syntax - LAST_VALUE( <expr> ) [ { IGNORE | RESPECT } NULLS ]
OVER ( [ PARTITION BY <expr1> ] ORDER BY <expr2> [ { ASC | DESC } ] [ <window_frame> ] )

Last_Value is used to access or identify the last value in an ordered set of values within a group.

Since this function requires the data to be ordered, we have to specify a window in which the data is divided into groups (partitions) and ordered.
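A sketch, mirroring the FIRST_VALUE example. One point to watch: Snowflake's default window frame ends at the current row, so LAST_VALUE needs an explicit frame to see the whole partition.

```sql
-- Last value per partition: the least-ordered product of each product line
SELECT p.productline,
       p.productname,
       od.quantityordered,
       LAST_VALUE(p.productname) OVER (
           PARTITION BY p.productline
           ORDER BY od.quantityordered DESC
           -- without this frame, LAST_VALUE would just return the current row
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS bottom_product_in_line
FROM products p
JOIN orderdetails od
  ON od.productcode = p.productcode;
```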

Output we get:

  • GROUP BY GROUPING SETS

Syntax -  SELECT ...
FROM ...
GROUP BY GROUPING SETS ( groupSet [ , groupSet [ , ... ] ] )
[ ... ]

Where:

groupSet ::= { <column_alias> | <position> | <expr> }

GROUP BY GROUPING SETS is an extension of the GROUP BY clause. A grouping set is a set of dimensions, and we can apply GROUP BY over multiple grouping sets in a single statement.

We can also understand it as a UNION of two or more GROUP BY operations:

GROUP BY col1
UNION ALL
GROUP BY col2

is equivalent to

GROUP BY GROUPING SETS (col1, col2)

Snowflake allows us to use up to 128 grouping sets.

The result set also contains NULL values: when rows are aggregated for one grouping set, the columns belonging to the other set are returned as NULL, and vice versa.

Let's first have a look at the data aggregated and grouped by year and month.

Now we will apply grouping sets over that same data.
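A sketch of the grouping-sets query, with order amounts derived from the assumed sample schema:

```sql
-- One result set: totals per year AND totals per month
SELECT YEAR(o.orderdate)  AS order_year,
       MONTH(o.orderdate) AS order_month,
       SUM(od.quantityordered * od.priceeach) AS total_amount
FROM orders o
JOIN orderdetails od
  ON od.ordernumber = o.ordernumber
GROUP BY GROUPING SETS ( YEAR(o.orderdate), MONTH(o.orderdate) )
ORDER BY order_year, order_month;
```

In the per-year rows, order_month is NULL, and in the per-month rows, order_year is NULL, as described above.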

Output we get:

In this output, data is aggregated for the years 2003 and 2004, and then for months 1 and 2 (irrespective of the year).

  • GROUP BY ROLLUP

Syntax -
SELECT ...
FROM ...
GROUP BY ROLLUP ( groupRollup [ , groupRollup [ , ... ] ] )
[ ... ]
Where:
groupRollup ::= { <column_alias> | <position> | <expr> }

GROUP BY ROLLUP computes sub-total rows in addition to the grouped rows. The aggregations are calculated over progressively fewer of the listed columns, from the full grouping up to a grand total.

We can think of it as drilling through the data hierarchically, from summarized to detailed (or detailed to summarized) in one glance.

The result also contains NULL values: at a higher summary level the lower-level columns are not significant, so they are returned as NULL, and similarly down through each level.

It is also good practice to place the highest-level column first in the ROLLUP clause, since the hierarchy is assumed from left to right; we then get the detailed rows first and the summarized ones as we scan to the next level.
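A sketch of the rollup query over the same assumed year/month data:

```sql
-- Year/month detail rows, plus a sub-total per year and a grand total
SELECT YEAR(o.orderdate)  AS order_year,
       MONTH(o.orderdate) AS order_month,
       SUM(od.quantityordered * od.priceeach) AS total_amount
FROM orders o
JOIN orderdetails od
  ON od.ordernumber = o.ordernumber
GROUP BY ROLLUP ( YEAR(o.orderdate), MONTH(o.orderdate) )
ORDER BY order_year NULLS LAST, order_month NULLS LAST;
```

The row with both columns NULL is the grand total; rows with only order_month NULL are the per-year sub-totals.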

Output we get –

  • GROUP BY CUBE

Syntax -
SELECT ...
FROM ...
GROUP BY CUBE ( groupCube [ , groupCube [ , ... ] ] )
[ ... ]
Where:
groupCube ::= { <column_alias> | <position> | <expr> }

GROUP BY CUBE is an extension of GROUP BY ROLLUP and of partial grouping sets.

It generates all the rows produced by GROUP BY ROLLUP and additionally computes aggregations for the groupings that ROLLUP leaves out, i.e. every possible combination of the listed columns.
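Swapping ROLLUP for CUBE in the previous sketch is all it takes:

```sql
-- Everything ROLLUP produces, plus per-month totals across years
SELECT YEAR(o.orderdate)  AS order_year,
       MONTH(o.orderdate) AS order_month,
       SUM(od.quantityordered * od.priceeach) AS total_amount
FROM orders o
JOIN orderdetails od
  ON od.ordernumber = o.ordernumber
GROUP BY CUBE ( YEAR(o.orderdate), MONTH(o.orderdate) )
ORDER BY order_year NULLS LAST, order_month NULLS LAST;
```

The extra rows compared to ROLLUP are the per-month totals (order_year NULL, order_month filled), which ROLLUP's left-to-right hierarchy skips.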

Output we get:

 

  • IGNORE / RESPECT NULLS

[ { IGNORE | RESPECT } NULLS ]

IGNORE NULLS simply skips the NULL values of the column to which the LAG function is applied; when a NULL value appears, it pulls the previous NOT NULL value instead.

The default is RESPECT NULLS.

This handling of NULL values is best understood with an example using the LAG function:
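A sketch using the orders table from the assumed sample schema, where shippeddate is NULL for unshipped orders:

```sql
-- Compare the two NULL-handling modes side by side
SELECT ordernumber,
       shippeddate,
       -- RESPECT NULLS (the default) happily returns a NULL previous value
       LAG(shippeddate)              OVER (ORDER BY ordernumber) AS prev_respect_nulls,
       -- IGNORE NULLS skips back to the most recent NOT NULL shippeddate
       LAG(shippeddate) IGNORE NULLS OVER (ORDER BY ordernumber) AS prev_ignore_nulls
FROM orders;
```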

Output we get:

And that is a tour of sophisticated SQL analytical and window functions in Snowflake. We would love to hear about your experience with them.


Abhishek Gangrade


BI Analyst with 9 Years of experience in DW/BI. Experienced in widely used BI Visualization Tools.
