Evolving Metadata Driven Data Pipeline

By April 27, 2020 May 25th, 2021 Blogs

We moved from the world of SQL driven ETL (e.g., Perl+SQL) and reporting (like SQR) in the early 90s to a metadata-based (semantic layer) based approach. Although we have made many mistakes along the way, there is a net benefit – machine managed code.

We are not talking about AI/ML-based code generation here. That is definitely something we need to start looking at, but that is going to be the topic of the next blog.

The “machine managed code” here refers to externalization, i.e., we are providing the key parameters (metadata) for a software program to generate the required SQL to do the data transformation (ETL) or to extract data for reporting.

Let’s discuss why this is beneficial, and how to scale this thinking in a modern data platform to further benefit from this approach.

The key value proposition of providing metadata (tables, columns, joins, aggregates, multi-query rules, and other information) vs. writing SQL directly is a reduction in time to develop and test, achieve better quality, and reduced maintenance effort. If a line of code is written by a person, a person manages it. If a line of code is written by a machine, the machine manages it. It is much easier to manage the metadata then code as metadata is structured while code is semi-structured at best. 

When we look back, maybe all of these three benefits are not completely materialized for your organization. The reason for this lack of complete realization of benefit could be related to poor design, inefficient use of SQL procedures, lack of aggregated metadata, or any similar issue. But at the end of the day, benefits are clear and realized to a great extent by many organizations. 

The key areas left out in the current approach that we can consider in the modern data platform are:

  1. Data Source Connector Framework: Only the SQL was abstracted through metadata, but no specific framework was created as an additional layer of metadata specific to known data source connectors. Data sourcing has certain known components/steps, and creating a metadata framework allows for the second level of abstraction beyond just core SQL.
  2. Business Rule framework: Business rules are also coded through metadata to drive SQL and run on data. These rules are reused many times and if structured, business teams can potentially change and maintain them as opposed to always depending on the technology team to inform the ETL/ELT tools. If business rules are structured through a framework, it also helps “institutionalize in a structured fashion” the tribal knowledge, which currently goes from an individual’s head to multiple lines of metadata driving SQL. This new way of structuring can have huge implications in reducing the dependency on specific people. It is an effective hedge against brain drain.
  3. Data Validation framework: Similar, but not quite the same as business rules. The constructs of data validation rules have some differences in the metadata required. If structured, data validation rules which are business-facing can be managed by the business just like business rules and have other benefits as described in point #2 above.

In a nutshell, as we have leveraged metadata-driven SQL generation, it is time to recognize the opportunities where we can introduce the next level of structure – connectors, business rule, and data validation. This structure not only allows us the original three benefits – speed, quality, and maintenance but also creates the next level of structure to enable upcoming applications like bots, capturing decision pathways of organization, auto-discovery, and ingestion, and CI/CD in data DevOps among others.

Rajeev Dadia

Rajeev Dadia

With 24 years of experience in leadership roles at Saama Analytics and Silicon Graphics (SGI), Rajeev currently leads the Client Care and Delivery organizations, while simultaneously driving the company’s technology roadmap.