The last decade has seen major transitions in the technology space: a wave of fast-moving (streaming) data, changing business dynamics, and an explosion in the use of structured and unstructured data for decision making. Though the 3Vs (volume, velocity, variety) played a major role in the adoption of Big Data, Hadoop, and the NoSQL buzzword, they could not break the industry's close attachment to SQL; even newer tech platforms had to offer SQL-friendly interfaces to keep winning adoption from business users and analysts. Spreading data processing across many nodes for parallelism increased the hardware and software maintenance overhead, and veterans still spoke of the SQL/RDBMS architecture in a nostalgic way. With the modernization of cloud services came a surge in lowering the cost of support activities, whether software upgrades or hardware maintenance. SaaS DBMS stacks like Snowflake and Redshift have again proved the love and passion for SQL orientation in the data management space, with data lakes and data warehouses now supporting both structured and semi-structured data. The tech stack evolved and matured over the last 2-3 years to answer questions raised around compliance, security, and data protection policies, and its value has been well received by industry, but the true impetus was provided by Covid-19. Snowflake came out a clear winner due to its unique separation of storage and compute. By removing the heavy duty of DBAs on database management, Snowflake claims to need zero or minimal administration once the environment is up and running. Based on our experience, we lay out details of the granular control over administrative activities after migrating a legacy platform (on-prem or even cloud) to Snowflake, which can be divided into four areas:
- Data Availability
- Backup and Maintenance
- Monitoring Costs & Resources
- Tuning Warehouses
For our marquee clients, the metadata-based solution to these data issues was to introduce a Snowflake-based cloud data warehouse. Through these topics you will walk away with an understanding of how to configure and manage your Snowflake environment to optimize cost and performance.
Data Availability: The first and foremost requirement while supporting a Snowflake environment is to ensure there is no user or data-load contention. Snowflake allows you to create multiple virtual warehouses, which isolate processes such as loading data from users querying data or running reports, and which can be customized for business needs to support any number of workloads.
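As a minimal sketch of that isolation, separate warehouses can be dedicated to loading and to reporting so that BI queries never queue behind data loads (the warehouse names and sizes below are illustrative, not from the original setup):

```sql
-- Dedicated warehouse for ELT/data loading (name and size illustrative)
CREATE WAREHOUSE IF NOT EXISTS load_wh
  WAREHOUSE_SIZE = 'LARGE'
  AUTO_SUSPEND   = 60        -- suspend after 60 seconds idle
  AUTO_RESUME    = TRUE;

-- Separate warehouse for BI users and reports, so user queries
-- never contend with data loads
CREATE WAREHOUSE IF NOT EXISTS bi_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND   = 300
  AUTO_RESUME    = TRUE;
```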
Let's start with virtual warehouse management.
We will cover right-sizing, which matters because we don't want our customers using warehouses that are either too small or too large. We will talk about scaling up versus scaling out, the advantages of each and how they impact workloads, and then some best practices on when to automate suspend/resume and multi-cluster scale-out. While creating a warehouse, two critical factors to consider from a cost and performance perspective are:
- Warehouse size (i.e., the number of servers per cluster)
- Right-sizing, based on workload patterns, concurrent activities, performance targets, and the data volume to be processed
- Scaling the virtual warehouse up or out for complex or long-running queries, or when loading a large table from multiple files
- Auto-scaling functionality to avoid queuing
- Point-in-time checks or automated configuration for starting and suspending virtual warehouses
This helps address the common hue and cry from users, meet SLAs, and support timely decisions.
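Auto-suspend, auto-resume, and multi-cluster scale-out can all be set on an existing warehouse in one statement; a sketch (warehouse name illustrative; multi-cluster warehouses require Enterprise edition):

```sql
-- Enable automated suspend/resume and multi-cluster scale-out
ALTER WAREHOUSE bi_wh SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3       -- add clusters automatically when queries queue
  SCALING_POLICY    = 'STANDARD'
  AUTO_SUSPEND      = 300     -- suspend after 5 minutes idle to save credits
  AUTO_RESUME       = TRUE;   -- resume transparently on the next query
```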
Backup and Maintenance: Another important aspect of database administration is backup and restoration/replication for contingency. Continuous change and varying business dynamics tend to be error-prone, leaving data vulnerable to failures that require restoration. Similar to checkpoints or backup activities in traditional databases, Snowflake provides data protection through Time Travel and Fail-safe, which allow you to query or restore data back to a specific point in time. Time Travel lets you retrieve historical data for up to 90 days (on Enterprise edition; the default retention is 1 day), while Fail-safe, which is proprietary to Snowflake, provides a further non-configurable 7-day window after the Time Travel retention expires, recoverable only through Snowflake Support.
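Time Travel is exposed directly in SQL; a sketch of the common recovery patterns (table names illustrative, and the query ID placeholder must be filled in from the query history):

```sql
-- Query a table as it was one hour ago (within the retention period)
SELECT * FROM orders AT (OFFSET => -3600);

-- Restore a dropped table from Time Travel
UNDROP TABLE orders;

-- Recreate a table as of the state just before a bad statement ran
CREATE TABLE orders_restored CLONE orders
  BEFORE (STATEMENT => '<query_id>');
```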
The following activities and tasks are diligently handled by the support engineer:
- Database Replication and Failover/Failback, which can be configured for permanent or transient databases
- Masking Policies & Access Control
- Read replication of large, high-churn tables
- Zero-copy cloning for hotfixes, or to capture near-real-time production state in development or QA
- Data Sharing
These features require a good understanding of Snowflake architecture and the costs associated with them. We learned this the hard way, making conscious, thoughtful decisions on requests from business users.
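Two of the heaviest-used features above, zero-copy cloning and database replication, are each a couple of statements; a sketch with illustrative database and account names:

```sql
-- Zero-copy clone of production into a QA database: no extra storage
-- is consumed until either side changes (names illustrative)
CREATE DATABASE qa_db CLONE prod_db;

-- Enable replication of a database to a secondary account for failover
ALTER DATABASE prod_db
  ENABLE REPLICATION TO ACCOUNTS myorg.dr_account;

-- On the secondary account: create the replica, then refresh it on a schedule
CREATE DATABASE prod_db AS REPLICA OF myorg.primary_account.prod_db;
ALTER DATABASE prod_db REFRESH;
```

Note that clones are free only until data diverges, and replication is billed for both data transfer and compute, which is part of the cost awareness mentioned above.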
The third, and commercially perhaps the most important, aspect is Monitoring Costs & Resources. To control costs and avoid unexpected credit usage caused by running warehouses, Snowflake provides resource monitors. The support engineer uses resource monitors to impose limits on the number of credits consumed by:
- User-managed virtual warehouses (monitor their credit usage)
- Virtual warehouses used by cloud services (monitor the cloud services charges)
Snowflake charges on a pay-as-you-go model, so how long you keep a virtual warehouse running, and the size of the warehouse used for data wrangling, decide the number of credits consumed, e.g. long-running user queries, heavy runtime aggregation, high-volume data ingestion, or multiple micro-batches.
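A resource monitor with escalating triggers is a one-time setup per quota; a sketch with illustrative names and quota values:

```sql
-- Monthly credit quota with escalating actions
CREATE RESOURCE MONITOR monthly_rm
  WITH CREDIT_QUOTA = 100
       FREQUENCY = MONTHLY
       START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 75  PERCENT DO NOTIFY             -- alert account admins
           ON 90  PERCENT DO SUSPEND            -- let running queries finish
           ON 100 PERCENT DO SUSPEND_IMMEDIATE; -- cancel running queries

-- Attach the monitor to a warehouse so the quota applies to it
ALTER WAREHOUSE bi_wh SET RESOURCE_MONITOR = monthly_rm;
```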
With these Snowflake-provided features, our production support team was able to:
- Set a credit limit for a specified interval or date range. When these limits are approaching or reached, the resource monitor can trigger various actions, such as sending alert notifications and/or suspending the warehouse (Notify, Suspend, or Suspend Immediately)
- Closely monitor the credit limits or quotas defined against particular virtual warehouses at the specified frequency
- With the help of a resource monitor, closely track credit usage against a specific account or an individual warehouse or set of warehouses, e.g. to find the cause of heavy cost
- Modify these properties of an existing resource monitor:
- Adjust the credit quota under scrutiny
- Add or remove warehouses from the list
- Configure scheduling properties such as frequency and start/end timestamps
- Customize threshold percentages (add/remove/modify)
- Track the cloud services credit usage that Snowflake incurs to process user requests, from login to query dispatch, across the following components:
- Infrastructure management
- Metadata management
- Query parsing and optimization
- Access control
- Monitor data storage usage
- Reduce data redundancy by rationalizing objects or consolidating data
- Account for historical data maintained for Fail-safe
- Understand data transfer billing and control cloud providers' egress charges
- Multi-AZ data transfer process within the same cloud platform
- Data sharing out of the cloud platform
- Monitor and control the costs of zero-copy cloning and clustering
- Monitor the additional cost of multi-cluster warehouses; consult metadata tables such as query history to analyse daily workloads, slow query performance, peak periods, etc.
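The metadata tables mentioned above live in the shared `SNOWFLAKE.ACCOUNT_USAGE` schema; a sketch of two queries we found useful for this kind of monitoring (time windows and the `LIMIT` are illustrative):

```sql
-- Credits consumed per warehouse over the last 7 days
SELECT warehouse_name, SUM(credits_used) AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits DESC;

-- Slowest queries of the last day, to spot tuning candidates
SELECT query_id, warehouse_name, total_elapsed_time / 1000 AS seconds
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -1, CURRENT_TIMESTAMP())
ORDER BY total_elapsed_time DESC
LIMIT 20;
```

Note that `ACCOUNT_USAGE` views have some latency (up to a few hours), so they suit daily cost reviews rather than real-time alerting.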
Tune Data Warehouses/Data Lakes: As the business grows and the platform expands its capacity across the organization, or as data volumes explode, data warehouses/data lakes built on Snowflake require tuning to provide improved performance, concurrency, and latency. We experimented with different types of queries and different warehouse sizes to determine the combinations that best meet customers' specific query needs and workloads.
- Analyze query composition instead of scaling the virtual warehouse linearly; consider the following:
- Size of the table (horizontal/vertical growth) being queried
- Query filtering using predicates has an impact on processing, as does the number of joins/tables in the query
- Maximize cache usage to improve read performance by selecting only the required attributes
- Avoid resource-contention queuing and query performance issues; multi-cluster warehouses improve concurrency for large numbers of users and/or complex queries
- Regroup or restructure clustered tables across micro-partitions through Automatic Clustering, which reclusters tables as needed after DML operations
- Data clustering of large tables lets Snowflake leverage clustering information to avoid unnecessary scanning of micro-partitions during querying, significantly accelerating queries that reference the clustered columns
- Defining a clustering key on very large tables: using a clustering key to co-locate similar rows in the same micro-partitions enables several benefits
- Reclustering reorganizes column data so that related records are relocated to the same micro-partition
- Avoid large table scans when filtering a small number of rows from billions of rows across many micro-partitions by using point lookups via the search optimization service, which relies on a persistent data structure that serves as an optimized search access path. Related techniques include:
- Clustering a table
- Creating one or more materialized views (clustered or nonclustered)
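The clustering and search optimization steps above each come down to a single DDL statement; a sketch with illustrative table and column names (both Automatic Clustering and search optimization incur ongoing compute/storage costs, which ties back to the monitoring section):

```sql
-- Define a clustering key on a very large table
ALTER TABLE orders CLUSTER BY (order_date, region);

-- Check how well the table is clustered on those columns
SELECT SYSTEM$CLUSTERING_INFORMATION('orders', '(order_date, region)');

-- Enable the search optimization service for selective point lookups
ALTER TABLE orders ADD SEARCH OPTIMIZATION;
```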
Conclusion: So even though Snowflake's use and configuration is pretty straightforward and does not require specialized DBA skills, we still need to perform these orchestration operations on Snowflake to maintain data, its availability, performance, concurrency, and eventually the cost associated with its compute power. This can serve as a cheat sheet while supporting a Snowflake environment.