Businesses use different reporting tools such as Tableau, QlikView, and Power BI to represent data and to understand and communicate the outcomes of a project. This in turn is done with the help of backend ETL and database tools such as Snowflake, Oracle, Matillion, and Python.
Though many complex reports and visualizations can be created in these reporting tools, it is often recommended that the data is massaged and formatted as intended in the backend, so that hassle-free reports can be generated. One such aspect of reporting is pivoting of data.
For example: a drug manufacturing company wants to know the performance (sales) of a drug by financial year. Here, pivoting the data gives a better representation than the normal tabular format.
Pivoting means rotating a table by turning the unique values from one column into multiple columns, aggregating results where required based on other columns. The Snowflake PIVOT command provides this functionality.
However, this command has its limitations in Snowflake: it does not handle data dynamically. The PIVOT command requires static values for generating the final result. That is an issue in this scenario, where the financial year (the pivoted column) changes every year. Adding a new pivoted column therefore requires a code change, which makes the design less robust.
Though such scenarios can be handled in the reporting tools mentioned above, discussed below is a way to achieve this dynamically changing result set in the backend ELT, using Snowflake & Matillion. The solution itself is based on Snowflake; for the ETL layer, you can choose any other tool that supports Snowflake.
Building an End-to-End Orchestration & Transformation Pipeline using Snowflake & Matillion
In the design below, we have a basic ELT model whose output is a pivoted table, built from any input query that provides the aggregate & pivot columns.
Below is a sample scenario, in which we have implemented this design.
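To keep the walkthrough concrete, assume a source table like the one below. The table name and columns come from the queries later in this article; the rows themselves are illustrative figures, not actual sales data.

```sql
CREATE OR REPLACE TABLE DEMO_DB.DEMO_SCHEMA.DRUG_SALES_TBL (
    DRUG_ID        NUMBER,
    DRUG_NAME      VARCHAR,
    FINANCIAL_YEAR VARCHAR,
    SALES_IN_CRS   NUMBER(10,2)   -- sales in crores
);

-- Illustrative rows only
INSERT INTO DEMO_DB.DEMO_SCHEMA.DRUG_SALES_TBL VALUES
    (101, 'DRUG_A', '2017-2018', 120.50),
    (101, 'DRUG_A', '2018-2019', 135.00),
    (102, 'DRUG_B', '2017-2018',  80.25),
    (102, 'DRUG_B', '2019-2020',  95.75);
```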
Here, if you observe, the sales values have been segregated into separate columns per financial year. The basic way to achieve this in Snowflake is the PIVOT command, as below:
SELECT DISTINCT DRUG_ID, DRUG_NAME, "'2017-2018'", "'2018-2019'", "'2019-2020'"
FROM DEMO_DB.DEMO_SCHEMA.DRUG_SALES_TBL
PIVOT(SUM(SALES_IN_CRS) FOR FINANCIAL_YEAR IN ('2017-2018','2018-2019','2019-2020'));
This will give you the desired result, but the problem is that the pivoted column (financial year) is hard-coded. Whenever newer years are added, the query needs to change. That becomes an issue in more complex scenarios and when other columns are used for pivoting. The method discussed below resolves this.
First, we need to tweak the query so that it identifies which column is the pivot column and which aggregate value should appear under the pivoted columns. Below is the modified query:
SELECT DRUG_ID, DRUG_NAME,
    FINANCIAL_YEAR AS pivot_column,
    SUM(SALES_IN_CRS) AS pivot_value
FROM DEMO_DB.DEMO_SCHEMA.DRUG_SALES_TBL
WHERE FINANCIAL_YEAR IN
    (SELECT DISTINCT FINANCIAL_YEAR FROM DEMO_DB.DEMO_SCHEMA.DRUG_SALES_TBL ORDER BY FINANCIAL_YEAR DESC)
GROUP BY DRUG_ID, DRUG_NAME, FINANCIAL_YEAR;
So here we specify the column whose values are to be pivoted (pivot_column) and the aggregated value (pivot_value) that should go under each pivoted column.
The method makes use of a Snowflake user-defined (stored) procedure and the RESULT_SCAN function to achieve the result. We will pass the query into the procedure, which in turn prepares the table with pivoted values.
Let us understand the Procedure first:
Here, the above query is passed as input and processed further. Let's walk through this procedure step by step:
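The procedure itself is not reproduced here, so below is a minimal sketch of what such a procedure could look like, written as a Snowflake JavaScript stored procedure. The procedure name is an assumption, and the column names PIVOT_COLUMN / PIVOT_VALUE follow the modified query above; treat this as an illustrative sketch rather than the original code.

```sql
CREATE OR REPLACE PROCEDURE DYNAMIC_PIVOT_PROC(INPUT_QUERY VARCHAR)
RETURNS VARCHAR
LANGUAGE JAVASCRIPT
AS
$$
    // Step 1: run the input query and store its query ID
    var stmt = snowflake.createStatement({sqlText: INPUT_QUERY});
    stmt.execute();
    var query_id = stmt.getQueryId();

    // Step 2: build the quoted, comma-separated list of pivot column values
    // from the cached result of step 1
    var col_sql = "SELECT LISTAGG(DISTINCT '''' || PIVOT_COLUMN || '''', ',') " +
                  "FROM TABLE(RESULT_SCAN('" + query_id + "'))";
    var col_rs = snowflake.createStatement({sqlText: col_sql}).execute();
    col_rs.next();
    var col_list = col_rs.getColumnValue(1);

    // Step 3: pivot the cached result on the collected column values
    var pivot_sql = "SELECT * FROM (SELECT * FROM TABLE(RESULT_SCAN('" + query_id + "'))) " +
                    "PIVOT(SUM(PIVOT_VALUE) FOR PIVOT_COLUMN IN (" + col_list + "))";
    var pivot_stmt = snowflake.createStatement({sqlText: pivot_sql});
    pivot_stmt.execute();

    // Return the query ID of the pivoted result, to be consumed downstream
    return pivot_stmt.getQueryId();
$$;
```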
- Run the Input Query and Store the query ID of the same
The result of the input query will be as below:
The query_id points to the results of the query, which Snowflake retains for 24 hours; within that window they can be fetched at any point using the RESULT_SCAN function in Snowflake.
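For instance, a cached result can be re-read like this (the literal query ID is a placeholder; LAST_QUERY_ID() is a convenient alternative within the same session):

```sql
-- Re-read the result set of an earlier query by its ID
SELECT * FROM TABLE(RESULT_SCAN('01a2b3c4-0000-0000-0000-000000000000'));

-- Or re-read the result of the most recent query in this session
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
```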
- With the above, we prepare the list of pivot column data
The result of the above step is shown below:
The col_list variable will store this value. This will be used in the next step.
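One way to build such a list is with LISTAGG over the cached result, quoting each value so the list can be dropped straight into a PIVOT IN clause. This is a sketch; the query ID is a placeholder for the one captured in the first step.

```sql
-- Builds a string like: '2017-2018','2018-2019','2019-2020'
SELECT LISTAGG(DISTINCT '''' || PIVOT_COLUMN || '''', ',') AS COL_LIST
FROM TABLE(RESULT_SCAN('<query_id_from_step_1>'));
```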
- Here the final dataset will be prepared with the pivoted columns
Rather than outputting the entire table, we output the query_id of this final result set. Using this query_id, we can then write the data into a table. To execute this part, an ELT tool is used to run the pipeline end-to-end.
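With the collected column list substituted in, the statement the procedure ultimately executes would take roughly this shape (query ID and year values shown as placeholders):

```sql
SELECT *
FROM (SELECT * FROM TABLE(RESULT_SCAN('<query_id_from_step_1>')))
PIVOT(SUM(PIVOT_VALUE) FOR PIVOT_COLUMN IN ('2017-2018','2018-2019','2019-2020'));
```

Because the IN list is assembled at run time, newly arriving financial years appear as new columns without any code change.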
Creating the Matillion Job
The first step is to set up a metadata table to store the queries, to make it generic and reusable.
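Such a metadata table could be as simple as the following; the table and column names here are assumptions, to be adapted to your own naming standards.

```sql
CREATE OR REPLACE TABLE DEMO_DB.DEMO_SCHEMA.PIVOT_QUERY_METADATA (
    QUERY_NAME   VARCHAR,   -- identifier used by the Matillion iterator
    TARGET_TABLE VARCHAR,   -- table the pivoted output is written to
    INPUT_QUERY  VARCHAR    -- the modified query shown earlier
);
```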
The below Matillion orchestration job iterates through the queries stored in the metadata and creates tables as required.
Now at every iteration, the procedure runs through a Database Query Component, which stores the query_id of the procedure output in a temporary table.
This query_id is then fetched by the Query_to_Scalar component and then the result is processed in the transformation job to finally load the data into a table.
The below query will get the result using the query_id.
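Assuming the query_id has been captured into a Matillion job variable (referenced as ${query_id} below, per Matillion's variable syntax), the final load can be sketched as follows; the target table name is illustrative.

```sql
CREATE OR REPLACE TABLE DEMO_DB.DEMO_SCHEMA.DRUG_SALES_PIVOTED AS
SELECT * FROM TABLE(RESULT_SCAN('${query_id}'));
```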
The entire query run and computation happens in Snowflake, and we have used Matillion as an ETL tool to handle the entire pipeline.
The final output gives us the desired result even when the source data, and in particular the column values being pivoted on, changes frequently.
Hope this solution is helpful for you all. Do try out various scenarios and write back to continue the discussion on this solution.