Thursday, September 22, 2022

Snowflake Cloud - Introduction

 


After the evolution of the cloud, enterprises started to explore the possibility of implementing a DWH on a cloud platform. As data sources have multiplied due to social media, data enthusiasts expect a cloud DWH to be capable of handling both structured and non-structured (semi-structured and unstructured) data. Snowflake is currently one of the emerging cloud data warehouses and can be deployed on any major cloud platform. We are going to discuss Snowflake cloud in this post.

What is Snowflake?

Snowflake is a fully managed data platform delivered under the SaaS (Software as a Service) model. It was developed for the most diverse types of use, such as data warehousing, data lakes, data engineering, data science, data application development and secure data sharing, even with real-time data. It can work with structured, semi-structured, unstructured and streaming data.

How does the Snowflake architecture work?

Snowflake combines two architectures: shared-disk and shared-nothing. The part that resembles the shared-disk architecture is that Snowflake uses a central data repository for persisted data that is accessible from all compute nodes on the platform. The part that resembles the shared-nothing architecture is that Snowflake processes queries using MPP (Massively Parallel Processing) clusters, where each node in the cluster locally stores a portion of the entire dataset.

With this combination, we get both the simplified data management of a shared-disk architecture and the performance of a shared-nothing architecture.

What are the 3 layers of Snowflake?

In addition to the above, Snowflake's unique architecture consists of three key layers:

Database Storage

Query Processing

Cloud Services

Database Storage Layer

When data is loaded onto the platform, Snowflake reorganizes it internally, optimizing and compressing it into a columnar format.

Snowflake manages all aspects of how this data is stored: organization, file size, structure, compression, metadata, statistics, etc. These data objects are not directly visible or accessible to clients; the data is only accessible through SQL queries run in Snowflake itself.
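As an illustrative sketch (the table name, file path and format options are my own examples, not from Snowflake docs), once rows land in a table Snowflake decides how they are stored, and you only ever touch them through SQL:

CREATE TABLE customers (id INT, name STRING, signup_date DATE);

-- From the SnowSQL client, upload a local CSV to the table's internal stage:
-- PUT file:///tmp/customers.csv @%customers;

COPY INTO customers
  FROM @%customers
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

SELECT * FROM customers;  -- the only way to reach the underlying storage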

Query Processing Layer

Snowflake processes queries using “virtual warehouses”. Each virtual warehouse is an MPP cluster made up of multiple compute nodes allocated by Snowflake from a cloud provider.

Each virtual warehouse is an independent cluster that does not share compute resources with other virtual warehouses, so one warehouse has no impact on the performance of another.
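A minimal sketch of creating such a warehouse (the name and sizing values are illustrative):

CREATE WAREHOUSE analytics_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60      -- suspend after 60 idle seconds to stop burning credits
  AUTO_RESUME    = TRUE;   -- wake up automatically when a query arrives

USE WAREHOUSE analytics_wh;

A second warehouse created the same way gets its own compute nodes, which is why a heavy ETL warehouse cannot slow down a BI warehouse.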

Cloud Services Layer

This layer provides a collection of services that coordinate activities across the entire platform, from handling user requests to managing metadata.

Some services managed in this layer include authentication, infrastructure management, metadata management, and access control.




Cloud platforms that support the use of Snowflake

As Snowflake is provisioned as SaaS, it runs completely on cloud infrastructure, meaning all three layers mentioned earlier are deployed and managed entirely on a cloud platform.

A Snowflake account can be hosted on any of the three major cloud players on the market:

Amazon Web Services (AWS)

Google Cloud Platform (GCP)

Microsoft Azure (Azure)

 

Platform costs

Platform costs are basically based on the usage of the three layers:

Use of data storage, use of computing resources (virtual warehouses), and use of cloud services.

Storage usage is calculated from the average daily amount of data (in bytes) stored in the system.

Compute usage is calculated based on the number of Snowflake credits consumed by virtual warehouses to execute queries, load data, and perform other DML operations.

Finally, cloud services are only billed when the account's daily consumption of cloud services exceeds 10% of the daily usage of compute resources; the charge is calculated daily. For example, if virtual warehouses consume 100 credits in a day, up to 10 credits of cloud services usage that day is free, and only consumption above that threshold is billed.
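As a sketch, assuming the SNOWFLAKE.ACCOUNT_USAGE share is enabled in the account, the daily split between compute and cloud services credits can be inspected like this (view and column names follow the ACCOUNT_USAGE metering views; verify against your edition):

SELECT usage_date,
       credits_used_compute,
       credits_used_cloud_services,
       credits_adjustment_cloud_services,  -- the 10% waiver applied by Snowflake
       credits_billed
FROM   snowflake.account_usage.metering_daily_history
ORDER  BY usage_date DESC;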

Why do we need Snowflake?

Separation of storage and compute: On-prem engineers have been trying to solve this for a long time. VMware did a great job of accomplishing this for over a decade: server code would run on a VM and storage would sit in a NetApp or EMC storage appliance. But you still needed to pay for the OS license, the database license, and the physical hardware for the VM host. After taking that CAPEX hit, what happens when you outgrow it? Or even worse, what happens if you take years to grow into it?
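In Snowflake the two sides scale independently: storage grows as data is loaded, while compute is resized on demand. A minimal sketch (the warehouse name and sizes are illustrative):

-- Scale compute up for a heavy batch window, then back down; the data itself
-- never moves, because storage lives in the central repository.
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';
-- ... run the heavy workload ...
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'XSMALL';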

Columnar database: In the SQL Server world, columnstore indexes were introduced in the 2012 release and became fully updatable in 2014. That was a game changer for analytics. Unlike OLTP, in analytics we do calculations on a single column or a small handful of columns. In a row-based design, you need to retrieve the entire row to analyze a single column of data. When your table is narrow (fewer than 10 columns) this is fine; however, imagine having 50+ columns when you only need 1/50 of them. You had to bring in all that data just to discard it.
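Snowflake's storage layer is columnar by default, so a query like the sketch below (table and column names are illustrative) reads only the one column it aggregates instead of all 50:

-- Only the AMOUNT column's compressed column data is scanned;
-- the other 49 columns of SALES are never touched.
SELECT AVG(amount) FROM sales;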

Automatic clustering: Maintaining proper indexes in an RDBMS is a full-time job, and proper index management requires a relatively advanced understanding of databases (software and hardware). People who fully grasp it are not easy to come by, and when you do find them, they are not cheap. With Snowflake, you do not need an administrator, at least not in the way we used to understand administrators: everything is very much automated, and if you know how to operate Gmail, you should be able to “administer” Snowflake (a bit of an exaggeration). With database administrators costing $100k to $140k, eliminating that labor cost alone can often offset Snowflake's cost. No more licensing cost, no more hardware cost, no more DBA cost.
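The closest thing to index tuning is optionally declaring a clustering key; Snowflake's automatic clustering service then maintains it in the background. A sketch (table and key are illustrative):

-- Declare how rows should be co-located; there are no rebuild jobs to schedule.
ALTER TABLE sales CLUSTER BY (sale_date);

-- Check how well the table is clustered on that key.
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(sale_date)');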

External stage (external table): Similar to PolyBase in the SQL Server world, you can store non-relational data types (JSON, Avro, Parquet) in blob storage and access them via built-in functions as if they were tables. Views can be built on top of them and, if performance is a concern, those views can be materialized. Both Microsoft Azure Blob Storage and AWS S3 can store objects of up to 5 TB, and Azure recently announced a preview limit of around 200 TB. Remember when 1 TB used to be considered a VLDB?
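A minimal sketch of an external stage plus external table over JSON files (the bucket URL and names are hypothetical, and the credentials/storage-integration setup is omitted):

CREATE STAGE events_stage
  URL = 's3://my-bucket/events/'   -- hypothetical bucket
  FILE_FORMAT = (TYPE = JSON);

CREATE EXTERNAL TABLE events_ext
  WITH LOCATION = @events_stage
  FILE_FORMAT = (TYPE = JSON);

-- External tables expose each record in a VARIANT column named VALUE.
SELECT value:userId::STRING AS user_id FROM events_ext;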

Fail-safe and Time Travel: Another reason why DBAs are well paid is that data is often the most important asset a company has, and DBAs are tasked with safeguarding that data: from intrusion, obviously, but from its own users as well. The tale of Bobby Tables is funny because it's real. I've witnessed a colleague being escorted out of the building 20 minutes after he mistakenly dropped a table, pink slip in hand. He would have kept his job if we had had Time Travel back then. If you accidentally drop a table, you just un-drop it.
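The un-drop really is one statement, and Time Travel also lets you query a table as it was at an earlier point (the table name and offset are illustrative; recovery must happen within the retention period):

DROP TABLE orders;     -- the career-ending mistake
UNDROP TABLE orders;   -- the career-saving fix

-- Query the table as it looked 5 minutes ago:
SELECT * FROM orders AT (OFFSET => -60 * 5);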

