Using Matillion for Data Loading into Snowflake – Metadata Driven Approach

Published April 8, 2020 · Updated April 28, 2020 · Blogs

Snowflake is a modern data warehouse built for the cloud. It supports distributed data processing, near-limitless horizontal auto-scaling (out/in) for concurrency, vertical scaling (up) for performance, and the full power of ANSI SQL on a pay-as-you-go model with no upfront costs or lock-in. This makes it one of the most preferred technologies for enterprises looking to migrate to the cloud or modernize their data platforms (read more here). This article discusses a practical migration use case in which we used Matillion for data loading into Snowflake.

Data loading into Snowflake is typically done as follows:

  • Exporting data from the source (using scripts or ETL/ELT tools)
  • Staging the exported data in an external stage such as AWS S3 or Azure Blob, or in a Snowflake internal stage. Ideally, in chunks of 50–100 MB per file for performance (to leverage parallel cores).
  • Using the “COPY INTO” SQL command to ingest these files into Snowflake database tables.
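The ingestion step above can be sketched as a small statement builder. This is a minimal illustration, not the article's actual job; the stage path, table, and file-format names are assumptions.

```python
# Hypothetical sketch: build the COPY INTO statement used to ingest staged
# files into a Snowflake table. All object names here are illustrative.
def build_copy_into(table: str, stage_path: str, file_format: str = "csv_format") -> str:
    """Return a COPY INTO statement for files staged on S3 or an internal stage."""
    return (
        f"COPY INTO {table} "
        f"FROM @{stage_path} "
        f"FILE_FORMAT = (FORMAT_NAME = '{file_format}') "
        f"ON_ERROR = 'ABORT_STATEMENT'"
    )

# Example: load staged Salesforce Account files into a landing table.
sql = build_copy_into("dev_landing.sales_sfdc.ACCOUNT", "s3_stage/sfdc/account/")
print(sql)
```

Generating the statement from parameters like this is what later lets the load be driven entirely by metadata.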

For one of our customers, we had to migrate from on-premise SQL Server to Snowflake. This involved hundreds of tables. They also had various other data sources, such as Salesforce and SAP. Naturally, with so many data sources, data loading into Snowflake was an integral discussion.

We came up with two options:

Option 1 – In-house framework (developed in both Java and Python)

  • Pros: provides a lot of automation scripts that leverage metadata, with support for complex transformations.
  • Cons: any new scenario had to be hand-coded, and coverage would improve only with time.

Option 2 – Modern ELT/ETL tool for both orchestration and transformation

  • Pros: full-blown platform (provides most of the required connectors) for orchestration and transformation.
  • Cons: the automation implemented in our in-house framework was not readily available here.

We then adopted the best of both worlds by integrating Matillion (as the engine) with our metadata-driven approach, which improved our time to value from 10 weeks (with 8 engineers) to 6 weeks (with 4 engineers).

Summarizing the need for automated data orchestration pipelines, we prioritized the asks as follows:

  • Rapid automation for new data sources and/or tables
    • Configure metadata using our existing framework
    • Leverage generic connectors
  • Export data from the source and stage it on AWS S3
  • Ingest it into Snowflake (Landing)
  • Transform from landing using SCD 2 into Data Lake / Operational Data Store (ODS)
    • Optionally, the ability to support data quality rules (DQ)
    • Optionally, the ability to support change data capture (CDC) / slowly changing dimension (SCD)
  • Leverage Snowflake best practices

Here are the details on various building blocks:

  1. Set up the Snowflake account
  2. Configure the required metadata
  3. Develop the ELT and ETL jobs leveraging the above
    • Ensure audit and error logs are captured
    • Provide a validation or reconciliation mechanism
  4. Actual performance metrics

1. Single Snowflake Account to Support Multiple Environments

If a single Snowflake account supports multiple environments (e.g. Dev, QA, UAT, PROD), the probability of it getting out of hand is higher. Environment-specific roles, virtual warehouses, databases, and users must be set up carefully so that environments do not collide with each other.

Sequence for setup ⇒ ROLE → VIRTUAL WAREHOUSE → DATABASE → USERS

Object Type     Format        Examples
Database        env_purpose   dev_landing, uat_ods, prod_analytics (or analytics)
DB.Schema       fn_source     sales_sfdc, finance_sap, crm_hubspot
Compute & Role  env_activity  dev_dataloader, uat_bi, prod_explore (or explore)
  • Usage of global roles (AccountAdmin, SysAdmin, SecurityAdmin) should be restricted.
  • Leverage environment-specific roles for managing and maintaining each environment, e.g. Dev_SysAdmin, Dev_SecurityAdmin.
  • Any new user addition gets assigned environment-specific roles and permissions.
  • Dev_Data_Loader is reserved for Matillion jobs and is used by the load process only.
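The ROLE → WAREHOUSE → DATABASE → USERS sequence can be sketched as a generator of setup statements per environment. This is an assumption-laden illustration: the warehouse size, auto-suspend value, and user/role names are examples, not the customer's actual configuration.

```python
# Hypothetical sketch of the per-environment setup sequence
# ROLE -> VIRTUAL WAREHOUSE -> DATABASE -> USERS, following the
# env_purpose / env_activity naming formats from the table above.
def env_setup_statements(env: str) -> list[str]:
    role = f"{env}_dataloader"
    return [
        f"CREATE ROLE IF NOT EXISTS {role};",
        f"CREATE WAREHOUSE IF NOT EXISTS {env}_wh WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60;",
        f"CREATE DATABASE IF NOT EXISTS {env}_landing;",
        f"GRANT USAGE ON WAREHOUSE {env}_wh TO ROLE {role};",
        f"GRANT USAGE ON DATABASE {env}_landing TO ROLE {role};",
        # The Matillion service user gets only the environment's loader role.
        f"CREATE USER IF NOT EXISTS {env}_matillion DEFAULT_ROLE = {role};",
        f"GRANT ROLE {role} TO USER {env}_matillion;",
    ]

for stmt in env_setup_statements("dev"):
    print(stmt)
```

Running the same generator with "dev", "uat", and "prod" keeps the environments structurally identical while fully isolated from each other.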

2. Metadata Design Approach ⇒

Usually, when a data lake is built, the source system structure is kept intact.
Using a metadata-driven approach to run Matillion jobs offers a robust solution: modularity, ease of maintenance, shorter development cycles, and fast onboarding of changes when tables are added or removed.

Model of Metadata table

Column Name               Description
METADATA_SK               Incremental surrogate key
SOURCE_SYSTEM             Source system: Salesforce, SQL Server, S3, …
SOURCE_OBJECT_NAME        Table name or S3 file URL
SOURCE_PREFIX_DB          Object prefix (for S3), or the schema/DB name to use at the source
LANDING_DB                Snowflake landing database name (Matillion stages all ODBC or S3 reads here first)
LANDING_SCHEMA            Landing schema name
LANDING_TABLE             Landing table name
DATALAKE_DB               Snowflake data lake database name
DATALAKE_SCHEMA           Snowflake data lake schema name
DATALAKE_TABLE            Data lake table name (SCD1 or SCD2 type)
CDC_COLUMN_NAME           Delta-identifier column name, if only incremental data has to be fetched from the source
CDC_MIN_DATE              Minimum of the delta identifier fetched
CDC_MAX_DATE              Maximum of the delta identifier fetched
PRIORITY                  Priority, in case a table has to be run first
IS_ACTIVE                 Flag to enable or disable a table's load
ADDITIONAL_WHERE_CLAUSE   Additional WHERE clause applied at the source side
AUDIT_RELATED_COLUMNS…    Entry created by, created date, updated by, updated date, etc.
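One row of this metadata table can be modeled as a small record type; the column names come from the table above, while the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative model of one ETL_METADATA row. Column names follow the table
# above; the example values below are assumptions for demonstration only.
@dataclass
class EtlMetadata:
    metadata_sk: int
    source_system: str                       # Salesforce, SQL Server, S3, ...
    source_object_name: str                  # table name or S3 file URL
    landing_db: str
    landing_schema: str
    landing_table: str
    datalake_db: str
    datalake_schema: str
    datalake_table: str
    cdc_column_name: Optional[str] = None    # delta identifier for incremental loads
    priority: int = 1
    is_active: bool = True
    additional_where_clause: Optional[str] = None

row = EtlMetadata(
    metadata_sk=1, source_system="SQL Server", source_object_name="dbo.Orders",
    landing_db="dev_landing", landing_schema="sales_sqlserver", landing_table="ORDERS",
    datalake_db="dev_datalake", datalake_schema="sales_sqlserver", datalake_table="ORDERS",
    cdc_column_name="LastModifiedDate",
)
```

Each such row is everything the generic Matillion job needs to load one table end to end.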

3. Basic Framework of ETL Orchestration Job

  1. Select all rows from the metadata table, i.e. ETL_METADATA
    1. It will have entries for the following source systems:
      1. SFDC
      2. SQL SERVER
      3. S3
  2. Map the ETL metadata columns to variables that are input to the Matillion job
  3. As the variable values change, the same (single) ELT job is run for each entry
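The three steps above can be sketched as a simple planning loop: filter active metadata rows, order them by priority, and map each row to the variable set passed into the one generic Matillion job. The variable names here are illustrative assumptions, not Matillion's actual variable names.

```python
# Hypothetical sketch of the orchestration loop: one generic load job is run
# once per active metadata row, with its variables derived from that row.
def plan_runs(metadata_rows):
    active = [r for r in metadata_rows if r["IS_ACTIVE"]]
    for row in sorted(active, key=lambda r: r["PRIORITY"]):
        yield {
            "src_system": row["SOURCE_SYSTEM"],
            "src_object": row["SOURCE_OBJECT_NAME"],
            "landing_table": f'{row["LANDING_DB"]}.{row["LANDING_SCHEMA"]}.{row["LANDING_TABLE"]}',
        }

rows = [
    {"SOURCE_SYSTEM": "SFDC", "SOURCE_OBJECT_NAME": "Account", "LANDING_DB": "dev_landing",
     "LANDING_SCHEMA": "sales_sfdc", "LANDING_TABLE": "ACCOUNT", "PRIORITY": 2, "IS_ACTIVE": True},
    {"SOURCE_SYSTEM": "SQL SERVER", "SOURCE_OBJECT_NAME": "dbo.Orders", "LANDING_DB": "dev_landing",
     "LANDING_SCHEMA": "sales_sqlserver", "LANDING_TABLE": "ORDERS", "PRIORITY": 1, "IS_ACTIVE": True},
]
for job_vars in plan_runs(rows):
    print(job_vars["src_system"], "->", job_vars["landing_table"])
```

Because the job itself never changes, onboarding a new table is just one more row in ETL_METADATA.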

This is how the main job is designed; it runs on the parameters held in ETL_METADATA.

The main job will:

  • Query the metadata (entries for the many tables to be loaded into the data lake); the loads can run concurrently or sequentially.
  • Load data from source to landing:
    • The Database component in Matillion stages data to landing.
      • The database component can be Salesforce, SQL Server, or any other DBMS.
      • The S3 Load component is used for S3 buckets (CSV, XML, Avro, Parquet).
  • A second job loads data from landing to the data lake.
  • All parameters, such as the table to load, are derived from the metadata schema as they change.
  • At every step, a common audit/error-notification framework is called to record failures or normal processing.

SCD2 Generic Job ⇒
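The article's generic SCD2 job is shown as a Matillion canvas screenshot; as a stand-in, the same idea can be sketched as two generated SQL statements: first close out current rows whose attributes changed, then insert the new versions. The column names (ROW_HASH, IS_CURRENT, EFFECTIVE_START/END) are assumptions for illustration.

```python
# Hypothetical sketch of a generic SCD2 step driven by metadata: given a
# landing table, a target data lake table, and a business key, generate the
# "close changed rows" and "insert new versions" statements. Tracking columns
# (ROW_HASH, IS_CURRENT, EFFECTIVE_START, EFFECTIVE_END) are assumed names.
def scd2_statements(landing: str, target: str, key: str, hash_col: str = "ROW_HASH") -> list[str]:
    close_stmt = (
        f"UPDATE {target} t SET t.EFFECTIVE_END = CURRENT_TIMESTAMP, t.IS_CURRENT = FALSE "
        f"WHERE t.IS_CURRENT AND EXISTS ("
        f"SELECT 1 FROM {landing} s WHERE s.{key} = t.{key} AND s.{hash_col} <> t.{hash_col})"
    )
    insert_stmt = (
        f"INSERT INTO {target} SELECT s.*, CURRENT_TIMESTAMP, NULL, TRUE "
        f"FROM {landing} s LEFT JOIN {target} t "
        f"ON s.{key} = t.{key} AND t.IS_CURRENT "
        f"WHERE t.{key} IS NULL OR s.{hash_col} <> t.{hash_col}"
    )
    # Run close_stmt first: changed rows then have no current version left,
    # so insert_stmt picks them up along with brand-new keys.
    return [close_stmt, insert_stmt]

for stmt in scd2_statements("dev_landing.sales.ORDERS", "dev_datalake.sales.ORDERS", "ORDER_ID"):
    print(stmt)
```

Because both statements are parameterized on names from ETL_METADATA, one job serves every SCD2 table.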

4. Actual Performance Metrics of Migrating Data from SQL Server to Snowflake

RA_LND = SQL Server to Landing
LND_DL = Landing to Data Lake (ODS)

* Migration log of some of the tables.

From Landing (in Snowflake) to the data lake (in Snowflake) using SCD2:

  • 76 minutes for 1 billion records, including the SCD2 functionality
  • (the snapshot above shows some of the large tables; 600+ tables were involved in total)

This concludes our journey of data loading into Snowflake using Matillion and a metadata-driven approach.

Snowflake also announced Streams & Tasks last year, which you can use to build ETL and SCD pipelines – read here for a complete how-to guide.

Would be great to hear about your experience and use of other ELT/ETL tools for migrating to Snowflake!

Prajakta Borkar

Enterprise Data Warehouse Architect with 14+ years of experience across a host of technologies including Oracle Exadata, Snowflake, SQL Server, Shell/PowerShell/PLSQL scripting, DataStage, and Ab Initio, specializing in modern cloud technologies such as Snowflake, Matillion, Denodo, Power BI, and Business Objects, along with data modelling and data analytics.
