Snowflake Data Governance Overview
Kovid Rathee
Principal Solutions Architect, Data & Insights
Amidst a spree of data leaks over the last few years, data privacy and security have taken centre stage in technology-related conversations. Securing the perimeter within and outside an organization has become extremely important. Networking constructs like VPCs, NACLs, and security groups provide a multi-layered approach to security. Combined with a solid identity and access management framework, these measures serve the application and infrastructure layers well, but do they suffice for the data layer, too? I’d say they don’t.
While the data infrastructure can be secured using the tools mentioned above, the data itself needs several other layers of protection within the organization to prevent, among other things, unauthorized access and accidental sharing. To achieve that, data teams traditionally had to rely on an external data governance tool, until some data warehouse and data lake platform companies started offering an integrated data governance solution, Snowflake being one of the prominent ones.
Snowflake is unique in how it approaches data governance: the same way it approaches infrastructure, with everything done using SQL. Basic knowledge of SQL is something you can, more or less, take for granted in teams that deal directly with data, which makes Snowflake a desirable proposition for businesses. Snowflake’s governance features are not a replacement for a full-fledged governance tool, but they do some of the heavy lifting in certain areas. They help you answer the following questions and more:
- Which business categories do your data objects belong to? This is done using ‘object tagging’.
- Which columns contain sensitive, especially PII and PHI data? Snowflake’s built-in ‘data classification’ engine can help you do that.
- Which data objects should be visible to a user? Granular control of data objects is enabled by applying ‘row access policies’ and ‘column-level security’ features.
- Which users are reading data from and writing data to your objects? Snowflake logs all such activity in the `ACCESS_HISTORY` view.
- How has a data asset (a table or column) been transformed in a data pipeline through cleansing, wrangling, and transformation? Both `ACCESS_HISTORY` and `OBJECT_DEPENDENCIES` give you a detailed answer to that question.
This article will introduce you to the above data governance themes with examples to help you navigate your data governance journey on Snowflake. Let’s dive right in!
Tagging objects in Snowflake
Tags are general-purpose labelling constructs and provide a clean and highly flexible solution for categorizing data assets, from coarse-grained organization-level objects like accounts and warehouses to fine-grained schema-level objects like tables, views, and columns. Tags can be used in several ways, some prescribed by Snowflake as listed below, while others are left for you to discover for your business use case:
- Track cost and usage of org-wide data assets
- Classify and group data assets based on custom requirements
- Apply data masking policies on columns with certain tags
- Protect sensitive data using system-defined tags
In addition to governance, tags help with many other use cases, such as search and discovery in data catalogs, business intelligence tools, etc. Many of these tools, such as Select Star, have two-way tag sync functionality, where you can apply tags from your data catalog and they’ll get updated in Snowflake. This deserves mention because, in most cases, data catalogs only extract metadata from data sources and don’t write back to them. Think of this as out-of-the-box reverse ETL for metadata, i.e., the data catalog enriching the data source in return.
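To make this concrete, here’s a minimal sketch of how tags are created and applied; the tag, warehouse, table, and column names are hypothetical, and the statements assume you have the required privileges on the objects involved.

```sql
-- Create tags (hypothetical names); ALLOWED_VALUES restricts what can be assigned.
CREATE TAG IF NOT EXISTS governance.tags.cost_center
  ALLOWED_VALUES 'finance', 'marketing', 'engineering';
CREATE TAG IF NOT EXISTS governance.tags.pii_type;

-- Track cost and usage by tagging a warehouse.
ALTER WAREHOUSE transform_wh SET TAG governance.tags.cost_center = 'engineering';

-- Classify a table and one of its columns.
ALTER TABLE sales.public.customers SET TAG governance.tags.cost_center = 'marketing';
ALTER TABLE sales.public.customers
  MODIFY COLUMN email SET TAG governance.tags.pii_type = 'email';

-- Check which value, if any, a given tag holds on the table.
SELECT SYSTEM$GET_TAG('governance.tags.cost_center', 'sales.public.customers', 'table');
```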
Identifying sensitive data
While tags allow you to arbitrarily create labels for your data assets, in many cases, you would also need to classify your data assets based on their content. To do that, you can use Snowflake’s data classification feature. The classification process requires a certain amount of compute power, so you’ll need to have a warehouse up and running to use this feature. You can use this feature to:
- Identify sensitive data using native Snowflake features
- Have oversight of sensitive data
- Invoke external cloud-based functions for classification
The core step of the process is extracting semantic categories that represent personal attributes, such as name, address, age, salary, and so on. To do that, you can call the `EXTRACT_SEMANTIC_CATEGORIES` function, which uses a sample of the column data to give you the probability of each column containing sensitive data. Aside from that, it provides attributes like confidence and coverage, along with values for two system-defined tags, `SEMANTIC_CATEGORY` and `PRIVACY_CATEGORY`.
Data classification isn’t a fully automatic process; it goes through a cycle of analysis, review, and application, where Snowflake takes care of steps one and three. However, it falls upon the data engineers to review whether Snowflake’s interpretation of the column data is correct. After the review, you can use the `ASSOCIATE_SEMANTIC_CATEGORY_TAGS` stored procedure with the name of the table and the output of the `EXTRACT_SEMANTIC_CATEGORIES` function as parameters to automatically associate the tags with the table columns. You can then use these tags to protect sensitive data using Snowflake’s data masking capabilities.
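Here’s a minimal sketch of that analyze-review-apply cycle, assuming a hypothetical `sales.public.customers` table and a running warehouse:

```sql
-- Classification samples column data, so it needs a running warehouse.
USE WAREHOUSE transform_wh;

-- Step 1 (analyze): return semantic and privacy categories, with probability,
-- confidence, and coverage, for each column of the table.
SELECT EXTRACT_SEMANTIC_CATEGORIES('sales.public.customers');

-- Step 3 (apply): after reviewing the output, set the system-defined
-- SEMANTIC_CATEGORY and PRIVACY_CATEGORY tags on the matching columns.
CALL ASSOCIATE_SEMANTIC_CATEGORY_TAGS(
  'sales.public.customers',
  EXTRACT_SEMANTIC_CATEGORIES('sales.public.customers')
);
```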
Granular access to data
While RBAC (role-based access control) and DAC (discretionary access control) allow you to manage object-level access in Snowflake, there are two other methods that you can use to restrict access to data in tabular objects, such as tables, views, and materialized views. Some examples of where granular access to data would make sense are when you want to:
- Restrict access to sales records for one region by members of other regions
- Create an information barrier between teams to comply with regulatory requirements
- Protect PII and PHI data by masking fields partly or in full based on various factors
With row access policies and masking policies, you can restrict access to rows and columns. Both methods allow for on-the-fly evaluation of policies, and based on those evaluations, you’ll be granted access to specific rows or columns. Row access policies use both conditional expression functions and context functions for policy enforcement. For instance, you can check which role a user is assuming to access the data with the `CURRENT_ROLE` context function.
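Below is a minimal sketch of both policy types; the table, column, and role names, as well as the mapping table used by the row access policy, are all hypothetical.

```sql
-- Row access policy: admins see everything; other roles see only the regions
-- mapped to them in a (hypothetical) role-to-region mapping table.
CREATE OR REPLACE ROW ACCESS POLICY governance.policies.region_policy
  AS (sales_region VARCHAR) RETURNS BOOLEAN ->
    CURRENT_ROLE() = 'SALES_ADMIN'
    OR EXISTS (
      SELECT 1
      FROM governance.policies.region_role_map m
      WHERE m.role_name = CURRENT_ROLE()
        AND m.region = sales_region
    );

ALTER TABLE sales.public.orders
  ADD ROW ACCESS POLICY governance.policies.region_policy ON (region);

-- Masking policy: reveal the email column only to a privileged role.
CREATE OR REPLACE MASKING POLICY governance.policies.email_mask
  AS (val STRING) RETURNS STRING ->
    CASE WHEN CURRENT_ROLE() IN ('PII_READER') THEN val ELSE '*** MASKED ***' END;

ALTER TABLE sales.public.customers
  MODIFY COLUMN email SET MASKING POLICY governance.policies.email_mask;
```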
Mapping data lineage
While implementing the modern data stack, it’s usual for a data cataloging or lineage tool to be set up to allow business and technical users to understand the flow of data and how it’s been transformed from source to target. Snowflake’s data lineage metadata comes in handy when you have to:
- Visualize how data loads from sources, integrates with other sources, and becomes usable for the business
- Perform impact analysis when making changes to data objects
- Debug ETL issues that end up in stale, erroneous, and incomplete data
Snowflake’s data lineage capabilities rest upon two distinct features; the metadata for both is captured in the `ACCOUNT_USAGE` schema. The `OBJECT_DEPENDENCIES` view allows you to look at the dependency graph between objects, i.e., it shows you which objects need which other objects to function properly. This is helpful when planning changes to an upstream object and assessing the impact those changes would have on downstream objects.
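For example, a query along the following lines (the table name is hypothetical) lists the downstream objects that reference a given table, which is a useful starting point for impact analysis:

```sql
-- Which objects depend on SALES.PUBLIC.CUSTOMERS, and what kind of objects are they?
SELECT
  referencing_database || '.' || referencing_schema || '.' || referencing_object_name
    AS downstream_object,
  referencing_object_domain
FROM snowflake.account_usage.object_dependencies
WHERE referenced_database    = 'SALES'
  AND referenced_schema      = 'PUBLIC'
  AND referenced_object_name = 'CUSTOMERS';
```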
The other feature is where Snowflake logs a year’s worth of query history in the `ACCESS_HISTORY` view. This view provides details about each query, such as which tables, views, materialized views, and columns were accessed; the full query text can be retrieved by joining to the `QUERY_HISTORY` view on the query ID. Using access history, you can also bring in external tools with their native SQL parsers to build or enrich data lineage as and when required.
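As an illustration, the query below (with a hypothetical table name) flattens the `BASE_OBJECTS_ACCESSED` column to show who has read from a given table over the last 30 days:

```sql
-- Which users read from SALES.PUBLIC.CUSTOMERS in the last 30 days, and when?
SELECT
  ah.query_start_time,
  ah.user_name,
  obj.value:"objectName"::STRING AS object_name
FROM snowflake.account_usage.access_history AS ah,
     LATERAL FLATTEN(input => ah.base_objects_accessed) AS obj
WHERE obj.value:"objectName"::STRING = 'SALES.PUBLIC.CUSTOMERS'
  AND ah.query_start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP());
```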
Conclusion
This article walked you through Snowflake’s native data governance features. It also talked about identifying and protecting sensitive data, restricting access to data using various control measures, and using metadata to build the lineage graph. For more, check out Snowflake’s guide to data governance, which talks about these capabilities in much more detail. Also, check out the latest data governance-related blog posts on the community-driven Snowflake blog on Medium.