This article uncovers the unique features of Snowflake Data Warehouse. You will also get a quick walk-through on the implementation of some of the important Snowflake features such as schema-less loading of JSON/XML, Time Travel, Cloning, Data Clustering, etc., provided by Snowflake.
What is Snowflake Data Warehouse?
Snowflake Cloud Data Warehouse is a fully managed, cloud data warehouse available to customers in the form of Software-as-a-Service (SaaS). Here “fully managed” means users shouldn’t be concerned about any of the back-end work like server installation, maintenance, etc. A Snowflake data warehouse instance can easily be deployed on any of the three major cloud providers –
- Amazon Web Services (AWS)
- Google Cloud Platform (GCP)
- Microsoft Azure
Customers can choose whichever of these cloud platforms they want for their Snowflake instance. Snowflake queries follow the standard ANSI SQL protocol, and Snowflake supports fully structured as well as semi-structured data like JSON, Avro, Parquet, XML, etc.
Snowflake's architecture is a hybrid of the traditional shared-disk and shared-nothing architectures, offering the best of both. Let us walk through these two architectures and see how Snowflake combines them into a new hybrid architecture.
Shared-Disk Vs. Shared-Nothing Architectures
Shared Disk Architecture
Used in traditional databases, shared-disk architecture has a single storage layer accessible by all cluster nodes. The cluster nodes have their own CPU and memory but no local disk storage; they communicate with the central storage layer to fetch and process data.
Shared Nothing Architecture
Contrary to shared-disk architecture, shared-nothing architecture has distributed cluster nodes, each with its own disk storage, CPU, and memory. The advantage here is that data can be partitioned and stored across the cluster nodes, since each node has its own disk storage.
Snowflake Architecture – Hybrid Model
Snowflake's high-level architecture consists of 3 different layers:
- Storage Layer
- Compute Layer
- Cloud Services Layer
Storage Layer
Snowflake organizes data into multiple micro-partitions that are internally optimized, compressed, and stored in a columnar format. Data resides in cloud storage and works as a shared-disk model, providing simplicity in data management: users do not have to worry about distributing data across multiple nodes as in the shared-nothing model.
Compute nodes connect to the storage layer to fetch data for query processing. As the storage layer is independent, you pay only for the average monthly storage used. Since Snowflake is provisioned on the cloud, storage is elastic and is charged per TB of usage each month.
Compute Layer
Snowflake uses “Virtual Warehouses” (explained below) for running queries. Snowflake separates the query processing layer from the disk storage; queries execute in this layer using data from the storage layer.
Virtual Warehouses are MPP compute clusters consisting of multiple nodes with CPU and memory, provisioned on the cloud by Snowflake. Multiple Virtual Warehouses can be created in Snowflake for different requirements depending on workloads. All virtual warehouses work against the same single storage layer, but each virtual warehouse has its own independent compute cluster and doesn't interact with other virtual warehouses.
Virtual Warehouse Pros:
- Virtual Warehouses can be started, stopped, or scaled at any time without impacting queries that are already running.
- They can be set to auto-suspend and auto-resume, so that a warehouse is suspended after a specified period of inactivity and resumed when a new query is submitted.
- They can also be set to auto-scale with a minimum and maximum cluster size; e.g. with a minimum of 1 and a maximum of 3, Snowflake provisions between 1 and 3 clusters depending on the load.
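The warehouse options described above map directly onto DDL. A minimal sketch (the warehouse name `analytics_wh` is a hypothetical example):

```sql
-- Create a warehouse with auto-suspend, auto-resume, and multi-cluster scaling.
CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  AUTO_SUSPEND      = 300     -- suspend after 300 seconds of inactivity
  AUTO_RESUME       = TRUE    -- resume automatically when a query arrives
  MIN_CLUSTER_COUNT = 1       -- multi-cluster scaling bounds
  MAX_CLUSTER_COUNT = 3;

-- Warehouses can also be resized or suspended manually at any time:
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';
ALTER WAREHOUSE analytics_wh SUSPEND;
```

Resizing takes effect without interrupting queries already in flight, which is what makes on-the-fly scaling practical.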
Cloud Services Layer
All the coordination activities across Snowflake, such as authentication, security, metadata management of loaded data, and query optimization, happen in this layer.
- When a login request is placed, it has to go through this layer.
- A query submitted to Snowflake is sent to the optimizer in this layer and then forwarded to the compute layer for processing.
- Metadata required to optimize a query or to filter data is stored in this layer.
These three layers scale independently, and Snowflake charges for storage and virtual warehouses separately. The services layer is handled within the compute nodes provisioned and is hence not charged separately.
Features and Capabilities
Snowflake takes care of all database management and ensures the best performance when tables are queried. All one needs to do is create tables, load the data, and query it. There is no need to create partitions or indexes as in an RDBMS, or to run vacuum operations as in Redshift.
How Snowflake accomplishes this is best understood through the two principal concepts it uses to physically maintain table structures for best performance:
- Micro Partitions
- Data Clustering
1. Micro Partitions
Snowflake stores table data in columnar fashion, divided into several micro-partitions: contiguous units of storage holding roughly 50 MB to 500 MB of uncompressed data. Unlike traditional databases, partitions are not static and are not defined or maintained by users; they are managed automatically and dynamically by Snowflake.
Consider a table of student data. In Snowflake, this data is broken into multiple micro-partitions. Assuming 3 micro-partitions are created for the table, the data is split into 3 parts, each stored in columnar fashion. Data is also sorted and stored at each micro-partition level.
2. Data Clustering
Data clustering is similar to the concept of a sort key available in most MPP databases. As data is loaded into Snowflake, column data with the same values is co-located in the same micro-partition. This lets Snowflake filter efficiently during a scan: an entire micro-partition can be skipped if a value doesn't fall within that micro-partition's range.
Snowflake automatically clusters tables, and this natural clustering is good enough for most cases, giving good performance even for big tables.
However, if a user wants to control clustering manually, a clustering key can be defined on the table, and Snowflake uses this key to cluster the data. This is useful mainly for very large tables.
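A short sketch of defining and inspecting a clustering key (the `sales` table and its columns are hypothetical examples):

```sql
-- Define a clustering key at table creation time.
CREATE TABLE sales (
  sale_date DATE,
  region    VARCHAR,
  amount    NUMBER(10, 2)
) CLUSTER BY (sale_date, region);

-- Add or change the clustering key on an existing table:
ALTER TABLE sales CLUSTER BY (sale_date);

-- Inspect how well the table is clustered on a given key:
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(sale_date)');
```

Because reclustering consumes compute, a clustering key is worth defining only when queries on a very large table consistently filter on the same columns.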
Other Important Features:
Time Travel
Time Travel allows you to track changes to data over time. This Snowflake feature is available to all accounts and enabled by default, free of cost. It gives access to the historical data of a table: one can see how the table looked at any point in time within the retention period.
For standard accounts, a maximum of 1 day Time Travel retention period is given by default. In Enterprise edition this can be extended up to 90 days.
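Historical data is queried with the `AT` and `BEFORE` clauses. A minimal sketch (the `orders` table, the timestamp, and the query ID are hypothetical placeholders):

```sql
-- Table state as of 1 hour ago (offset is in seconds):
SELECT * FROM orders AT (OFFSET => -3600);

-- Table state at a specific timestamp:
SELECT * FROM orders AT (TIMESTAMP => '2023-01-15 10:00:00'::TIMESTAMP_LTZ);

-- Table state just before a given statement (query ID) ran:
SELECT * FROM orders BEFORE (STATEMENT => '<query-id>');

-- Extend the retention period (up to 90 days on Enterprise edition):
ALTER TABLE orders SET DATA_RETENTION_TIME_IN_DAYS = 30;
```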
Cloning
Another important feature of the Snowflake data warehouse is Cloning. The clone feature creates an instant copy of any Snowflake object, such as databases, schemas, and tables, in near real time with almost no waiting. This is possible because of Snowflake's architecture of storing data as immutable files in S3, versioning the changes, and storing those changes as metadata.
Hence a clone of an object is actually a metadata operation and doesn't duplicate the stored data. So, say you want a copy of the entire production database for testing: you can create it in seconds using the clone feature.
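The clone syntax is a one-liner, and it composes with Time Travel. A sketch (the object names are hypothetical):

```sql
-- Metadata-only copy: even a large production database clones in seconds.
CREATE DATABASE testdb CLONE proddb;

-- Cloning works at the schema and table level too, and can be combined
-- with Time Travel to clone an object as it existed in the past:
CREATE TABLE orders_backup CLONE orders AT (OFFSET => -3600);
```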
Undrop
A dropped object can be restored using the Undrop command in Snowflake, as long as the object has not yet been purged by the system. Undrop restores the object to the state it was in before being dropped. Databases, schemas, and tables can all be undropped:
```sql
drop database testdb;
undrop database testdb;

drop table some_table;
undrop table some_table;
```
Undrop can also be used to restore multiple versions. Say a table has been dropped 3 times and re-created each time: each dropped version can be undropped in reverse order of dropping, i.e. the last dropped version is restored first. However, Undrop fails if a table with the same name already exists, so the existing table must be renamed before each Undrop operation.
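The rename-then-undrop flow looks like this (table names are hypothetical):

```sql
-- A table named some_table currently exists, and an older dropped
-- version should be restored. Rename the current table out of the way:
ALTER TABLE some_table RENAME TO some_table_current;

-- UNDROP now restores the most recently dropped version under the
-- original name; repeating rename + UNDROP walks back through versions.
UNDROP TABLE some_table;
```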
Fail-Safe
Fail-Safe ensures historical data is protected in the event of disasters such as disk failures or other hardware failures. Snowflake provides 7 days of Fail-Safe protection, during which data can be recovered only by Snowflake.
The 7-day Fail-Safe period starts after the Time Travel retention period ends; a table with 90 days of Time Travel therefore has 97 days of recoverability overall. However, Fail-Safe data cannot be recovered by users or through DDL. It is handled by Snowflake only during catastrophic disasters.
Snowflake provides various connectivity options, including native connectors (e.g. Python), JDBC/ODBC drivers, a command line tool called “SnowSQL”, and a web interface that helps manage Snowflake as well as query data.
A Snowflake account can be hosted on AWS, Microsoft Azure, or Google Cloud Platform, so organizations can choose whichever matches their existing cloud service provider.
Pros:
- Fully managed, cloud-deployed DWH requiring minimal effort to get set up and running.
- Easy to scale with a lot of flexible options.
- Storage and Compute can be separated – This is not common in the cloud data warehousing solutions space.
- Follows ANSI SQL protocol; supports fully structured as well as semi-structured data types like JSON, Parquet, XML, ORC, etc.
- Quite popular and successful with a large number of clients using the service, Snowflake competes in the same league as Amazon Redshift, Google BigQuery, etc.
Cons:
- Snowflake's dependence on AWS, Azure, or GCP can be a problem when there is an outage in one of these cloud platforms that is independent of the health of Snowflake's own operations.
- It doesn’t have any provision to support unstructured data as of yet.
- Currently doesn’t have many options to work with geospatial data.
When to Use?
Snowflake can be used to host a data warehouse of any scale in the cloud, for analysis of data from different sources. Snowflake supports loading from various formats, from traditional compressed CSVs to semi-structured data such as JSON/XML, from cloud storage like S3 or Azure.
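The schema-less JSON loading mentioned in the introduction can be sketched as follows; the stage URL, credentials, and table names here are hypothetical placeholders:

```sql
-- External stage pointing at a cloud storage location.
CREATE STAGE json_stage
  URL = 's3://my-bucket/events/'
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>');

-- A single VARIANT column holds raw JSON: no schema needed up front.
CREATE TABLE raw_events (payload VARIANT);

COPY INTO raw_events
  FROM @json_stage
  FILE_FORMAT = (TYPE = 'JSON');

-- Nested attributes are queried directly with path notation and casts:
SELECT payload:user.id::STRING, payload:event_type::STRING
FROM raw_events;
```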
- It offers several editions, going from ‘Standard’ up to ‘Virtual Private Snowflake (VPS)’. This is a good thing, as customers have options to choose from based on their specific needs.
- Analytics on data available in Data Lake on Cloud Storage like S3.
- Rapid delivery of analytics to Business without worrying about managing the warehouse.
- Data Sharing: If an organization wants to share data outside then Snowflake can be used to quickly set up to share data without transferring or developing pipelines.
- The ability to separate storage and compute is something to consider, along with how that relates to the kind of data warehousing operations you'd be looking for.
- Snowflake is designed to require the least possible user input and interaction for any performance- or maintenance-related activity. This is not standard among cloud DWHs; for instance, Redshift needs user-driven data vacuuming.
- It has some cool querying features like undrop, fast clone, etc. These might be worth checking out as they may account for a good chunk of your day-to-day data operations.