Using Matillion for Data Loading into Snowflake – Metadata Driven Approach

By April 8, 2020 May 25th, 2021 Blogs
Using Matillion for Data Loading into Snowflake

Snowflake is a modern data warehouse, built for the cloud, supports distributed data processing, limitless auto-scaling in/out (horizontal) for concurrency & scaling up (vertical) for performance, with the complete power of ANSI SQL via a pay-as-you-go option (no upfront costs or lock-in) – making it one of the most preferred technologies – by enterprises looking to migrate to the cloud or trying to modernize their enterprise (read more here). This article discusses a practical migration use case where we used Matillion for data loading into Snowflake.


Data loading into Snowflake is typically done as follows:

  • Exporting data from the source (using Scripts or ETL/ELT tools)
  • Staging exported data in an external stage like AWS S3/ Azure Blob or Snowflake internal stage. Ideally, in chunks of 50-100 mb files for performance (leveraging cores).
  • Using the “Copy Into SQL” to ingest these files into Snowflake Database Tables.

For one of our customers, we had to migrate from SQL Server on-premise to Snowflake. This involved hundreds of tables. They also had various other data sources like Salesforce, SAP etc to name a few. Naturally, with so many data sources, data loading into Snowflake was an integral discussion.

We came up with the two options:

In-house framework

(Developed in both Java and Python)

Use modern ELT/ETL tool for both orchestration and transformation
Provides a lot of automation scripts that leverage metadata and support for complex transformations.Full-blown platform (provides most of the required connectors) for orchestration and transformation.
Any new scenarios had to be hand-coded and the coverage would improve only with time.The automation implemented using an in-house framework was not readily available here.

We then adopted the best of both worlds – by integrating Matillion (as an engine) with our MetaData driven approach – which improved our time to value from 10 weeks (with 8 engineers) to 6 weeks (with 4 engineers).

Summarizing the need for automated data orchestration pipelines, we prioritized the ASKS as follows:

  • Rapid automation for new data sources and/or tables
    • Configure metadata using our existing framework
    • Leverage generic connectors
  • Export data from the source and stage it on AWS S3
  • Ingest it into Snowflake (Landing)
  • Transform from landing using SCD 2 into Data Lake / Operational Data Store (ODS)
    • Optionally, the ability to support data quality rules (DQ)
    • Optionally, the ability to support change data capture (CDC) / slowly changing dimension (SCD)
  • Leverage Snowflake best practices

Here are the details on various building blocks:

  1. Set up Snowflake Account
  2. Configure the required Meta-Data
  3. Develop the ELT and ETL jobs leveraging above
    • Ensure audit & error logs are captured
    • Validation or reconciliation mechanism
  4. Actual performance matrix

1. Single Snowflake Account to support Multiple Environment Set Up

If there is a single Snowflake Account to support multiple environments for e.g. Dev, QA, UAT, PROD then probability to get it out of hand is more. We need to be more careful to set up Environment-related Roles, Virtual warehouse, Database and users accordingly so no environment collides with each other.

Sequence for Set Up ⇒ ROLE →   VIRTUAL  WAREHOUSE →   DATABASE  →   USERS

Object TypeFormatExamples
Databaseenv_purposedev_landinguat_odsprod_analytics or analytics
DB.Schemafn_sourcesales_sfdcfinance_sapcrm_hubspot
Compute & Roleenv_activitydev_dataloaderuat_biprod_explore or explore
  • Usage of Global Roles (AccountAdmin, SysAdmin, SecurityAdmin) should be restricted.
  • Leverage env specific role for managing and maintaining that env like Dev_Sysadmin, Dev_SecurityAdmin.
  • Any new user addition gets assigned env specific roles & permissions.
  • Dev_Data_Loader is only for Matillion Jobs and used by Load Process only.

2. Metadata Design Approach ⇒

Usually, when Data-Lake is built, the source system structure is kept in-take.
Using a metadata-driven approach to run Matillion Job offers a robust solution with modularity, ease of maintenance, less development cycle and fast onboarding changes if new tables get added or removed.

Model of Metadata table

Column NameDescription
METADATA_SKIncremental Value
SOURCE_SYSTEMSalesForce, SQL Server, S3?
SOURCE_OBJECT_NAMETable name or S3 File URL
SOURCE_PREFIX_DBObject Prefix in case of S3 Or Schema name/DB name setting at source to be used
LANDING_DBSnowflake Landing database name{ as Matillion all Odbc or S3 Read stage data first }
LANDING_SCHEMALanding schema name

SalesForce, Sql_Server, S3

LANDING_TABLELanding table name
DATALAKE_DBSnowflake data lake table name [ SCD1 or SCD2 types ]
DATALAKE_SCHEMASnowflake data lake schema name

SalesForce, SQL Server, S3

DATALAKE_TABLEDataLake Table Name
CDC_COLUMN_NAMEDelta Identifier column name if only incremental data has to be fetched from Source
CDC_MIN_DATEMinimum of Delta Identifier Fetched
CDC_MAX_DATEMax  of Delta Identifier Fetched
PRIORITYSet priority if this has to be Run first
IS_ACTIVEIn case some table load has to be disabled or enabled
ADDITIONAL_WHERE_CLAUSEAt Source Side, apply additional where clause
AUDIT_RELATED_COLUMNS….Entry Created By, Created Date, Updated By, Updated Date etc.

3. Basic Framework of ETL Orchestration Job

  1. Select all data from Metadata i.e. ETL_METADATA
    1. It will entry for following Source System
      1. SFDC
      2. SQL SERVER
      3. S3
  2. Set all as ETL Metadata Columns into variable to be Input in Matillion Job
  3. As variable values change, the ELT { one job } will be running.

This is how the Main Job will be designed which we will run on Parameters at ETL_TABLE.

Main Job which will

  • Query Metadata [ Entry for many tables to be loaded in Data Lake ] can be run concurrently or sequentially.
  • Load data from Source to Landing using
    • Database Component in Matillion stages it to Landing.
      • Database component can be Salesforce or SQL Server or any DBMS
      • S3 Load component to be used S3 Bucket i.e. CSV, XML, AVRO, PARQUET
    • The second job will be loading data from Landing to DataLake.
    • All parameters as table to load changes are derived from Metadata in Schema.
    • For every step, an audit/error notification common framework is called to record failover or normal processing.

SCD2 Generic Job ⇒

4. Actual Performance Matrix of migrating data from SQL Server to Snowflake

RA_LND = SQL Server to Landing
LND_DL = Landing to DataLake (ODS)


* Migration log of some of the tables.

From the Landing (in Snowflake) to Data lake (in snowflake) using SCD2:-

  • 76 Mins for 1 Billion Records including SCD2 functionality
  • [ snapshot above shows some of the large tables, total tables involved 600+ ]

This concludes our journey of data loading into Snowflake using Matillion and a metadata-driven approach.

However, Snowflake also announced Streams & Tasks last year at Snowflake, using which you can build ETL and SCD – Read here to for a complete How-To guide.

Would be great to hear about your experience and use of other ELT/ETL tools for migrating to Snowflake!

Prajakta Borkar

Prajakta Borkar

Enterprise Datawarehouse Architect, 14+ years of experience a host of technologies including Oracle Exadata, Snowflake, SQL-server, Shell/Powershell/PLSQL script, Datastage, Ab-Initio and specialized in modern cloud technologies like Snowflake, Matillion, Denodo, PowerBI, Business Objects with Data Modelling, Data analytics etc.

7 Comments

  • KD says:

    Good work Prajakta. This is really good article and would be a jump start for other upcoming triggers on cloud space.

  • Very good article! We are linking to this great article on our website.
    Keep up the good writing.

  • Vineet says:

    Hi Prajakta
    A very helpful article. This article of yours basically talks about the migration of On-Prem Data Sources to Snowflake.
    Can we use Matillion for migration data from Azure Sql Server DB to Snowflake Data Warehouse on Azure Platform?

    Cheers
    Vineet

    • Prajakta Borkar says:

      Yes.
      There is component called Azure SQL Query which Runs an SQL Query on an Azure SQL Database and copy the result to a table, via Azure Blob Storage.

  • Paul says:

    Hi! I could have sworn I’ve been to this web site before but after going through a few of the posts
    I realized it’s new to me. Anyhow, I’m certainly pleased I came across it and I’ll be bookmarking it and checking back regularly!

  • check this says:

    I’m extremely impressed with your writing skills
    as well as with the layout on your weblog. Is this a paid theme or did
    you modify it yourself? Either way keep up the nice quality writing, it is rare to see a
    nice blog like this one nowadays.

  • XF says:

    I love what you guys are usually up too. This type of clever work
    and exposure! Keep up the awesome works guys I’ve included you guys to
    my blogroll.