Implement Any Data Lake Using Matillion


Onboarding a new source system into a data lake for data warehousing involves many complexities: a long development cycle, source-specific technical knowledge, and a lengthy implementation cycle. When columns are added or deleted at the source, manual intervention is required to add or remove them, which delays the point at which those changes are reflected in the data lake schema. Creating a data lake can therefore be a very time-consuming and tedious task. The traditional implementation approach is to:

  • Create an ingestion pipeline per object
  • Write object-specific transformations
  • Create the schema, database objects, and landing area

This post describes how to create a data lake, with Salesforce as the source system, in a couple of very easy steps using Matillion as the ETL tool. A simple ingestion pipeline design reduces the overall development effort and adds more value to the actual purpose of the data lake, i.e. analyzing the data. An auto schema adjustment technique makes any newly added or deleted columns reflect in the data lake immediately, without manual intervention. The end-to-end process is seamless and easy to implement.

In this approach, all we need to do is:

  • Create a single pipeline
  • Create one metadata table at the table level
  • Create one metadata table at the column level
  • Auto-adjust the schema to consume new columns or disable any deleted columns

Here are the quick steps for how it can be done:

  • Step 1: Define Metadata 
  • Step 2: Define Matillion ETL Pipeline 

Step 1: Define Metadata

  • Metadata table at the table level to capture the following details (a DDL sketch follows this list):
    • Landing table/schema name
    • Data Lake table/schema name 
    • Landing CDC Column Name to capture incremental details
    • Data Lake CDC column Name to capture Incremental details
    • Start Range
    • End Range  
    • Is Active flag
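
As a rough sketch, the table-level metadata table could be defined as follows; all table and column names here are illustrative assumptions, not the original implementation:

    CREATE TABLE etl_meta_table (
        landing_schema_name    VARCHAR(128),
        landing_table_name     VARCHAR(128),
        datalake_schema_name   VARCHAR(128),
        datalake_table_name    VARCHAR(128),
        landing_cdc_column     VARCHAR(128),  -- column used to capture incremental details in Landing
        datalake_cdc_column    VARCHAR(128),  -- column used to capture incremental details in the Data Lake
        start_range            TIMESTAMP,     -- lower bound of the next incremental fetch
        end_range              TIMESTAMP,     -- upper bound of the current run
        is_active              BOOLEAN        -- only active tables are picked up by the pipeline
    );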


  • Metadata table at the column level to capture the following details (a DDL sketch follows this list):
    • Landing table/schema name
    • Data Lake table/schema name
    • Column name 
    • Column Checksum formula 
      • Since Redshift does not support hashing on every data type, we create a hash mechanism by defining a checksum expression at the column level
    • Data Type 
    • Is_Active
    • Column Type
      • HASH: the main primary key columns
      • ETL: ETL housekeeping columns, e.g. load date, run ID
      • AUDIT: source-level audit columns that we do not want to consider when detecting changes
      • ATTRIBUTE: columns for which changes need to be captured
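
A comparable sketch for the column-level metadata table; the checksum formula column stores a Redshift-friendly expression per column, e.g. an MD5 over the column cast to VARCHAR, since hashing is not available for every data type. Again, all names are assumed for illustration:

    CREATE TABLE etl_meta_column (
        landing_schema_name      VARCHAR(128),
        landing_table_name       VARCHAR(128),
        datalake_schema_name     VARCHAR(128),
        datalake_table_name      VARCHAR(128),
        column_name              VARCHAR(128),
        column_checksum_formula  VARCHAR(4000), -- e.g. MD5(COALESCE(CAST(amount AS VARCHAR), ''))
        data_type                VARCHAR(128),
        is_active                BOOLEAN,
        column_type              VARCHAR(20)    -- HASH | ETL | AUDIT | ATTRIBUTE
    );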


Step 2: Define Matillion ETL Pipeline


Implementation Steps:

  • Common:  
    • Batch start/end process: to capture the start and end time for a given Salesforce batch
    • Audit capture for any failure
  • Step 1: Read data from the table-level metadata to find out how many tables are in scope, then get the columns and build a grid, i.e. landing table/schema name, data lake table/schema name, the CDC date column, and the start range (a sample query follows)
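
A query along the following lines can feed the Matillion grid variable, including the three-day safety net mentioned in the note further below (table and column names follow the assumed DDL above):

    SELECT landing_schema_name,
           landing_table_name,
           datalake_schema_name,
           datalake_table_name,
           landing_cdc_column,
           datalake_cdc_column,
           DATEADD(day, -3, start_range) AS start_range,  -- reset the date back 3 days as a safety net
           end_range
    FROM   etl_meta_table
    WHERE  is_active = TRUE;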


Table-level metadata grid providing all the information needed to run Step 2


Note – Start_Date-3 resets the start date back by 3 days as a safety net.

  • Step 2: The grid built above is passed in here and a loop starts per table. Form a query, i.e. select * from the source table where the CDC column name > start range and the CDC column name < end range, to get the data from the source and store it in Landing (a sketch follows)
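
For each row of the grid, the extract query can be assembled from the grid values; a sketch, where the ${...} placeholders stand for Matillion job variables and are purely illustrative:

    -- ${source_table}, ${landing_cdc_column}, ${start_range} and ${end_range} come from the grid
    SELECT *
    FROM   ${source_table}
    WHERE  ${landing_cdc_column} > '${start_range}'
    AND    ${landing_cdc_column} < '${end_range}';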


The query can be variable-driven, with values passed in from Step 1.

Now the Landing schema [truncate and load] is ready.

  • Step 3: Form a query to load data from Landing to the Data Lake depending on the column type defined in the column-level metadata, whether HASH, ETL, or ATTRIBUTE
    • The first scalar component: get data from the column-level metadata table with a simple LISTAGG function per table (see the sketch after this list)
      • A list of the HASH columns, used to derive the join condition
      • A list of the checksum columns, used to derive the checksum condition
      • A list of the ETL columns, which are excluded from the checksum
      • An Input Column List : Output Column List : Target Table grid
    • [Go to Step 4] This forms a transaction boundary to load data into the data lake; as per Step 4, data is loaded depending on which load strategy is defined, i.e. SCD-1 or SCD-2. I have considered SCD-1 for this example.
    • Last step: update the start range, i.e. the max date loaded into the data lake, back into the metadata for the next fetch
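
A sketch of the LISTAGG-based lookup against the column-level metadata (same assumed names as above):

    -- Per target table: build the join condition from the HASH (key) columns,
    -- the checksum expression from the ATTRIBUTE columns, and the full column list.
    SELECT datalake_table_name,
           LISTAGG(CASE WHEN column_type = 'HASH'
                        THEN 'tgt.' || column_name || ' = src.' || column_name END,
                   ' AND ') WITHIN GROUP (ORDER BY column_name) AS join_condition,
           LISTAGG(CASE WHEN column_type = 'ATTRIBUTE'
                        THEN column_checksum_formula END,
                   ' || ') WITHIN GROUP (ORDER BY column_name)  AS checksum_expression,
           LISTAGG(column_name, ', ') WITHIN GROUP (ORDER BY column_name) AS column_list
    FROM   etl_meta_column
    WHERE  is_active = TRUE
    GROUP  BY datalake_table_name;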


  • Step 4: Load into the Data Lake [insert / update / expire old records in case of SCD-2]
    • Depending on the requirement, a simple load-into-data-lake orchestration job can be called for an SCD-1 or SCD-2 load
    • Even though the merge can be controlled here, don't call merge statements unconditionally; call them only when there are actual changes in the records (a sketch follows this list)
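
For an SCD-1 load, the update can be restricted to rows whose checksum actually differs, so the heavy work only happens when something changed. A sketch with hypothetical landing and data lake tables for a Salesforce Account object:

    -- Update only records whose attribute checksum differs between landing and data lake
    UPDATE datalake.account tgt
    SET    name      = src.name,
           industry  = src.industry,
           load_date = GETDATE()
    FROM   landing.account src
    WHERE  tgt.id = src.id                         -- join on the HASH (key) columns
    AND    MD5(COALESCE(CAST(tgt.name AS VARCHAR), '') ||
               COALESCE(CAST(tgt.industry AS VARCHAR), ''))
        <> MD5(COALESCE(CAST(src.name AS VARCHAR), '') ||
               COALESCE(CAST(src.industry AS VARCHAR), ''));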


For example, an insert statement to load new records:
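
One possible shape of that statement, again with illustrative names:

    -- Insert landing rows whose key does not yet exist in the data lake
    INSERT INTO datalake.account (id, name, industry, load_date)
    SELECT src.id, src.name, src.industry, GETDATE()
    FROM   landing.account src
    LEFT JOIN datalake.account tgt
           ON tgt.id = src.id                      -- join on the HASH (key) columns
    WHERE  tgt.id IS NULL;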


Auto Schema Adjustment Technique:

Matillion creates the landing table with a select *, so it always reflects the current schema at the source. Using information_schema.columns for the data lake, we find the difference, i.e. whether any new columns have been added or any existing columns have been deleted.

New column addition –

  • Find any new columns for each table listed in the table-level metadata (a sample query follows)
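
For example, a comparison of landing and data lake entries in information_schema.columns (schema and table names would come from the table-level metadata; the literals here are placeholders):

    -- Columns present in the landing table but missing from the data lake table
    SELECT l.column_name, l.data_type
    FROM   information_schema.columns l
    LEFT JOIN information_schema.columns d
           ON  d.table_schema = 'datalake'
           AND d.table_name   = 'account'
           AND d.column_name  = l.column_name
    WHERE  l.table_schema = 'landing'
    AND    l.table_name   = 'account'
    AND    d.column_name IS NULL;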


  • If any new column is found, call a two-part alter step [1. add the new column in the data lake; 2. add the new column in the column-level metadata]


  1. Alter the actual table [use the same data type as Landing, which keeps it easy]

  2. Add the new column in the metadata table (both parts are sketched below)
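
Both parts can be generated from the result of the comparison query; a sketch, with a hypothetical new column named fax:

    -- 1. Add the column to the data lake table, reusing the landing data type
    ALTER TABLE datalake.account ADD COLUMN fax VARCHAR(40);

    -- 2. Register the column in the column-level metadata as an ATTRIBUTE
    INSERT INTO etl_meta_column
           (landing_schema_name, landing_table_name,
            datalake_schema_name, datalake_table_name,
            column_name, column_checksum_formula, data_type, is_active, column_type)
    VALUES ('landing', 'account', 'datalake', 'account',
            'fax', 'MD5(COALESCE(CAST(fax AS VARCHAR), ''''))', 'VARCHAR(40)', TRUE, 'ATTRIBUTE');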


  • Any column dropped at the source –
    • Here we don't have to drop any column manually; we just mark it as inactive in the column-level metadata, where it is then excluded from the grid calculation.
    • Get the active columns from the metadata and find any columns that are no longer in the existing landing table, i.e. deleted columns
    • Disable them in the metadata (a sketch follows)
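
A sketch of both steps, using the same assumed names:

    -- Active metadata columns that no longer exist in the landing table
    SELECT m.column_name
    FROM   etl_meta_column m
    LEFT JOIN information_schema.columns l
           ON  l.table_schema = m.landing_schema_name
           AND l.table_name   = m.landing_table_name
           AND l.column_name  = m.column_name
    WHERE  m.is_active = TRUE
    AND    l.column_name IS NULL;

    -- Mark them inactive so they drop out of the grid calculation
    UPDATE etl_meta_column
    SET    is_active = FALSE
    WHERE  landing_table_name = 'account'
    AND    column_name = 'fax';            -- value returned by the query above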


In a nutshell, we have seen that with a single simple pipeline and a metadata-driven approach, we can onboard as many new tables or new source systems as we want into the DWH to enable further processing. This solution can be extended to any source system, e.g. S3 CSV files, MySQL, Oracle DB, Jira, NetSuite, Zendesk, etc. With Matillion, onboarding a new source system becomes very easy without requiring much knowledge of the source system schema.

Prajakta Borkar

Enterprise Data Warehouse Architect with 14+ years of experience across a host of technologies including Oracle Exadata, Snowflake, SQL Server, Shell/PowerShell/PLSQL scripting, DataStage, and Ab Initio, specializing in modern cloud technologies such as Snowflake, Matillion, Denodo, Power BI, and Business Objects, along with data modelling, data analytics, etc.