Saturday, August 20, 2022

Data Architecture Series: Data Lakes vs. Data Warehouses vs. Data Lakehouses






With the rise of the cloud, we have been flooded with technology stacks for advanced analytics. Modernizing data platforms has become more complex, and the number of architecture solutions on the market keeps growing rapidly.

Here I highlight the available data architectures, along with their advantages and limitations.

i). Data Lake

The data lake (a term generally credited to James Dixon, then CTO of Pentaho, around 2010-2011; Pentaho was later acquired by Hitachi) emerged as a way to reduce the data silos that were forming in data warehouse-based ecosystems. A data lake (DL) is a centralized repository that stores data as is and lets you run different types of analytics on it: SQL queries, big data analytics, full-text search, real-time analytics, and machine learning.

Advantages:
-Stores relational and non-relational data
-Supports schema-on-read (saves the up-front time spent defining data structures, schemas, and transformations)
-Highly scalable, low-cost storage and architecture
-Feeds data to machine learning, predictive analytics, data discovery, and profiling

Limitations:
-Raw data is stored with no oversight of its contents, which may lead to a 'data swamp'.
-Requires the right tooling to handle complex data integration.
-Needs additional governance, semantic consistency, and access controls.
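
To make the schema-on-read point concrete, here is a minimal sketch in plain Python. It assumes raw events are stored as JSON lines exactly as they arrived; the sample records, `READ_SCHEMA`, and `read_with_schema` are hypothetical names made up for illustration, not part of any data lake product.

```python
import io
import json

# Raw events landed in the lake exactly as produced; fields vary per record.
# (Hypothetical sample data, held in memory instead of object storage.)
RAW_EVENTS = io.StringIO(
    '{"user": "a1", "amount": "19.90", "channel": "web"}\n'
    '{"user": "b2", "amount": 5}\n'
    '{"user": "c3", "note": "no amount recorded"}\n'
)

# Hypothetical schema chosen by the reader; it is not stored with the data.
READ_SCHEMA = {"user": str, "amount": float}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project/cast them to `schema` at read time."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Missing fields become None instead of failing the whole read.
            row[field] = cast(value) if value is not None else None
        yield row


rows = list(read_with_schema(RAW_EVENTS, READ_SCHEMA))
total = sum(r["amount"] for r in rows if r["amount"] is not None)
```

Nothing was validated when the records were written, so the reader carries the burden of casting and handling gaps. That flexibility is also where the 'data swamp' risk comes from.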

ii). Cloud-native Data Warehouse:

The first dedicated decision support system was created by Teradata in 1983, and the field grew in the 1990s through the work of pioneers such as Inmon and Kimball. Today, clients are migrating their data warehouses (DWH) to the cloud and trying to improve governance along the way.
It helps to know when a DWH is the right fit. It works well when a client mainly needs BI reporting and analytics rather than advanced analytics, and it is an ideal choice when an organization's skills and infrastructure lean towards SQL. Snowflake is one cloud-native warehouse that has been gaining market share recently.

Advantages:
-More structured architecture
-Faster query retrieval
-SQL is sufficient to manage it (no skills like Python/Spark required)

Limitations:
-High-cost architecture (BI compute can be expensive)
-Governance still needs to be planned for.
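
As a small illustration of the warehouse style, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a real cloud warehouse (an assumption for the sake of a runnable example; the `sales` table and its rows are made up). The structure is declared before loading, bad rows are rejected at write time, and the report itself is plain SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for a warehouse

# Schema-on-write: structure and constraints are declared before loading.
conn.execute(
    """
    CREATE TABLE sales (
        region TEXT NOT NULL,
        amount REAL NOT NULL CHECK (amount >= 0)
    )
    """
)
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)

# A row that violates the declared schema is rejected at load time.
try:
    conn.execute(
        "INSERT INTO sales (region, amount) VALUES (?, ?)", ("east", None)
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# A typical BI-style aggregate, expressed in SQL alone.
report = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
```

The trade-off against the data lake is visible here: the structure has to be agreed on up front, but every query afterwards can rely on it.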

iii). Data Lakehouse:


A lakehouse merges the data lake and the data warehouse: it is essentially a data lake with warehouse-like features, including schema enforcement, governance, and ACID transactions. It is an alternative for data lake users who want more control over their data without running a separate DWH, and for clients who do not want to maintain a separate DWH and data lake.
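
The two warehouse-like features named above, schema enforcement and all-or-nothing writes, can be sketched with a toy in-memory table. `ToyLakehouseTable` is a hypothetical class written only for this post and is not any vendor's API; a real lakehouse applies the same idea to files in object storage and also has to handle concurrent writers and durability, which this sketch ignores.

```python
class ToyLakehouseTable:
    """Toy table: schema-checked, all-or-nothing appends over a list of rows."""

    def __init__(self, schema):
        self.schema = schema  # column name -> expected Python type
        self._rows = []       # committed rows only

    def _validate(self, row):
        if set(row) != set(self.schema):
            raise ValueError(f"columns {sorted(row)} do not match schema")
        for column, expected in self.schema.items():
            if not isinstance(row[column], expected):
                raise TypeError(f"{column!r} must be {expected.__name__}")

    def append(self, batch):
        """Validate the whole batch first, then commit it in one step."""
        staged = list(batch)
        for row in staged:
            self._validate(row)  # any failure aborts before the commit
        self._rows = self._rows + staged  # single swap to the new version

    def rows(self):
        return list(self._rows)


table = ToyLakehouseTable({"user": str, "amount": float})
table.append([{"user": "a1", "amount": 19.9}])

try:
    # The second row has the wrong type, so the whole batch is rejected.
    table.append(
        [{"user": "b2", "amount": 5.0}, {"user": "c3", "amount": "n/a"}]
    )
except TypeError:
    pass

committed = table.rows()
```

After the failed append, `committed` still holds only the first row: the valid row from the rejected batch was never made visible, which is the behaviour a plain data lake does not give you by default.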

Advantages:
-Cost-effective architecture
-Supports large datasets with heavy analytical workloads

Limitations:
-Requires programming skills beyond SQL, which not every client has
-Less mature; more built-in functions are needed
-Offered by very few vendors.
