After evolution of cloud, enterprises are starting to explore
the possibility of implementing DWH in cloud platform. As data sources are increased
due to social media, data enthusiasts are expecting that cloud DWH should be
capable enough to handle both structured and non-structured data (semi,
unstructured). Currently Snowflake is one of the emerging Cloud Data warehouse,
can deployed on any cloud platform. We are going to discuss Snowflake cloud in
this post.
What is SnowFlake?
Snowflake is a fully managed data platform, that is, it is
within the SaaS (Software as a Service) model. It is a platform that was
developed for the most diverse types of use, such as data warehousing, data
lakes, data engineering, data science, data application development and secure
data sharing, even real-time data. It can be used to work with data: structured,
unstructured, semi-structured and streaming data.
How does snowflake architecture?
Snowflake combines the use of two architectures: Shared
Disk Architecture and Shareless Architecture . The part
that resembles the shared disk architecture is that Snowflake uses a central
data repository for persisted data that is accessible from all compute nodes on
the platform. The part that resembles the sharingless architecture is that
Snowflake processes queries using MPP (Massively Parallel Processing) clusters
where each node in the cluster locally stores a portion of the entire dataset.
With this combination, we can get both simplified data management
by shared disk and performance by shareless architecture.
What are the 3 layers of SnowFlake?
In addition to the above, Snowflake's unique architecture
consists of three key layers:
Database Storage
Query processing
Cloud Services
Database Storage Layer
When the data is loaded onto the platform Snowflake
reorganizes the data by internally optimizing it and compressing it into a
columnar format.
Snowflake manages all aspects of how this data is stored: organization, file size, structure, compression, metadata, statistics, etc. These data objects are not directly visible or accessible by clients, the data is only accessible through SQL queries using Snowflake itself.
Query Processing Layer
Snowflake processes queries using “virtual warehouses”. Each
of these virtual warehouses is an MPP cluster made up of multiple compute nodes
allocated by Snowflake through a cloud provider.
Each virtual warehouse is an independent cluster that does
not share computing resources with other virtual warehouses. Therefore, each of
them has no impact on the performance of the other.
Cloud Services Layer
This layer provides a collection of services that are
coordinated activities across the entire platform, from user requests to even
metadata management.
Some services managed in this layer include: Authentication,
infrastructure management, metadata management, and access control.
Cloud platforms that support the use of Snowflake
As Snowflake is provisioned as SaaS, it means that it runs completely on a cloud infrastructure, meaning all those three layers mentioned earlier are released and managed entirely on a cloud platform.
A Snowflake account can be used on the three major cloud players on the market:
Amazon Web Services( AWS )
Google Cloud Platform( GCP )
Microsoft Azure( Azure )
Platform costs
Platform costs are basically based on the use of the three tiers:
Use of data storage, use of computing resources (virtual
warehouses), and use of cloud services.
Usage per data store is calculated by the average daily amount of data (in bytes) stored on the system.
Usage by compute resources is calculated based on the number of Snowflake credits that were consumed by virtual warehouses to execute queries, load data, and perform other DML operations.
Finally, billing for cloud services is only performed if the
account's daily consumption of cloud services exceeds 10% of the daily usage of
computing resources. The charge is calculated daily.
Why does we need SnowFlake?
Separation of storage and compute: On prem engineers
have been trying to solve this for a long time. VMWare did a great job of
accomplishing this for decade+. Server code would run on VM and storage would
sit in one of Netapp or EMC storage appliances. But you still needed to pay for
OS license, database license, and physical hardware for the VM host. After
taking on all that CAPEX hit, what happens when you outgrow it? Or even worse,
what happens if you take years to grow into it?
Columnar Database: In the SQL Server world, columnar
index was introduced in 2014 version. That was a game changer for analytics.
Unlike OLTP, in analytics, we are doing calculations on a single column or a
small handful of columns. In a row-based design, you needed to retrieve the
entire row/table to analyze a single column of data. When your table is narrow
(<10 columns), this is fine, however, imagine having 50+ columns. You only
need 1/50 of the columns?!??! You had to bring in all that data just to be
discarded.
Automatic clustering: Maintaining proper indexes in a
RDBMS is a full time job. And proper index management requires relatively
advanced understanding of databases (software and hardware). And those that are
able to fully grasp it are not easy to come by. And when you do, they are not
cheap. With Snowflake, you do not need an administrator. Not in the way we used
to understand administrators. Everything is very much automated and if you know
how to operate Gmail, you should be able to “administer” Snowflake (a bit of an
exaggeration). With database administrators costing $100k to $140k, just
eliminating that labor cost can often offset Snowflake’s cost. No more
licensing cost. No more hardware cost. No more DBA cost.
External Stage (external table): Similar to Polybase
in SQL Server world, you can store non-relational data types (JSON, AVRO,
Parquet) into blob storage and access it via built in functions as if they were
tables. Views can be built on top of it and, if performance is a concern, those
views can be materialized. Both Microsoft Azure Blob storage and AWS S3 storage
can be used to store to 5TB. And Azure recently announced its preview limit of
200TB. Remember when 1TB used to be consider VLDB?
Failsafe and time travel: Another reason why DBAs are
well paid is because data is often the most important asset a company has. And
DBAs are tasked with safeguarding that data. From intrusion, obviously, but
from its own users as well. The tale of Bobby Tables is funny because it’s
real. I’ve witnessed a colleague being escorted out of the building 20 minutes
after he mistakenly dropped a table; pink slip in hand. He would have kept his
job if we had time travel back then. If you accidentally dropped a table, you
just un-drop.