Saturday, August 20, 2022

Apache Spark Series-Introduction


 Apache Spark

As many of you requested, I am starting a Spark series for beginners. Before getting into the various Spark concepts, you need to understand Spark itself, its components, and its architecture. Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning workloads on a single node or on clusters.

Limitations of MapReduce:


1.      In a MapReduce program, each phase (shuffle, sort, group by, reduce) reads and writes data to disk every time (I/O operations), which reduces computation speed.
2.      MapReduce can handle historical (past) data (batch processing), but because of its slowness, it cannot be used to handle real-time streaming.

Why do we need Spark?

1.      In-memory computation (faster than MapReduce; low latency, real-time computation)
2.      Lazy evaluation (computation happens only when needed; see the sketch after this list)
3.      Supports languages like Java, Scala, Python, and R
4.      Easy integration with Hadoop
5.      Supports machine learning modules
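
To make the first two points concrete, here is a minimal PySpark sketch (assuming a local Spark installation; the DataFrame and column names are illustrative) showing that transformations only build a plan, and computation runs when an action is called:

```python
# Minimal sketch of lazy evaluation and in-memory caching in PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

df = spark.range(1, 1_000_000)               # transformation: no job runs yet
doubled = df.selectExpr("id * 2 AS twice")   # still lazy: only a plan is built

# Computation happens only when an action is called:
print(doubled.count())                       # action: triggers the actual job

# cache() keeps the data in memory, so later actions avoid recomputation:
doubled.cache()
print(doubled.filter("twice > 10").count())  # reuses the in-memory data

spark.stop()
```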

Spark Modules:

1. Spark SQL: It is for working with structured data; it supports SQL and HiveQL, and it works with a variety of data formats, including Hive tables, Parquet, and JSON (see the sketch after this list).

2. Spark Streaming: It is mainly for streaming (real-time) data; it supports handling streams from Kafka, Flume, ZeroMQ, and Twitter, and it writes output to HDFS and databases.
3. Spark Core: It provides the engine's in-memory computing capabilities, including fast execution, memory management, and integration with storage systems.
4. MLlib: It consists of (iterative) ML algorithms, mainly for delivering advanced analytics.
5. GraphX: It is a graph computation engine that lets you process graph data at scale.
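
As a small illustration of the Spark SQL module, here is a hedged sketch (the file name people.json and the columns name/age are assumptions) that reads JSON, registers a temporary view, queries it with plain SQL, and writes Parquet:

```python
# Spark SQL sketch: JSON in, SQL query, Parquet out.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Spark infers the structure of the JSON records at read time.
people = spark.read.json("people.json")        # hypothetical input file
people.createOrReplaceTempView("people")

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

# The same result can be stored in another supported format, e.g. Parquet.
adults.write.mode("overwrite").parquet("adults.parquet")

spark.stop()
```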

Components of Spark:

Job: A piece of code that reads some data from HDFS or locally, performs some computation on it, and writes some output data.

Stages: Jobs are divided into stages. You can loosely correlate them with map and reduce stages (it's easier to understand if you have worked on Hadoop). Not all computations (operators) can run in a single stage; Spark splits a job into stages at computational (shuffle) boundaries, so a job typically executes as several stages.

Tasks: Each stage has a number of tasks, one for each partition. One task is executed on one partition of data on one executor (machine).

DAG: DAG stands for Directed Acyclic Graph, and in this context, it refers to a DAG of operators.

Executor: The process responsible for executing a task.

Driver: The program/process responsible for running the Job over the Spark Engine

Master: The machine on which the Driver program runs

Slave: The machine on which the Executor program runs
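
The sketch below (all names are illustrative) ties these terms together: one action submits one job, a shuffle splits it into two stages, and each stage runs one task per partition on the executors:

```python
# Mapping code to Spark's components: job, stages, tasks, DAG, driver, executors.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("components-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100), numSlices=4)  # 4 partitions -> 4 tasks per stage

# reduceByKey forces a shuffle, so the scheduler splits the job into two
# stages (a map-side stage and a reduce-side stage) in the DAG of operators.
pairs = rdd.map(lambda x: (x % 2, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# collect() is the action: the Driver submits the Job, and the Executors on
# the worker machines run the individual tasks.
print(counts.collect())  # e.g. [(0, 50), (1, 50)]

spark.stop()
```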





Python Series-Data Structure-Part-1

 





Broadly speaking, data structures can be classified into two types:
i). Primitive
ii). Non-primitive

The former is the basic way of representing data that contains simple values. The latter is a more advanced and complex way of representing data that contains a collection of values in various formats.

Non-primitive data structures can further be categorized into:
i). Built-in
ii). User-defined structures

Python offers implicit support for built-in structures that include
i). List, ii). Tuple, iii). Set, iv). Dictionary.

Users can also create their own data structures (like Stack, Tree, Queue, etc.), enabling them to have full control over their functionality.

The same classification is shown in the figure below.

[Figure: classification of Python data structures]
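
Here is a quick sketch of the four built-in structures plus one user-defined structure (a minimal Stack; all variable names are illustrative):

```python
# The four built-in structures, each with its defining property.
fruits = ["apple", "banana", "cherry"]   # List: ordered, mutable
point = (10, 20)                         # Tuple: ordered, immutable
unique_ids = {1, 2, 3}                   # Set: unordered, no duplicates
ages = {"alice": 30, "bob": 25}          # Dictionary: key-value pairs

# User-defined structure: a simple Stack built on top of a list.
class Stack:
    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        return self._items.pop()         # raises IndexError if empty

    def is_empty(self):
        return not self._items

s = Stack()
s.push(1)
s.push(2)
print(s.pop())       # 2 (last in, first out)
print(s.is_empty())  # False
```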









Data Architecture Series-Data lakes Vs Data Warehouses Vs Data Lakehouses






With the evolution of the cloud, we have been flooded with technology stacks in the area of advanced analytics. Modernizing these data platforms has become more complex, and the number of architecture solutions in the market is growing rapidly.

Here I am going to highlight the available data architectures, their advantages, and their limitations.

i). Data Lake

The data lake (the term was coined around 2010 by James Dixon of Pentaho, which later became part of Hitachi) emerged as a way to reduce the data silos that were forming in data warehouse-based ecosystems. A data lake is a centralized repository that stores data as-is and lets you run different types of analytics on it (SQL queries, big data analytics, full-text search, real-time analytics, and machine learning).

Advantages:
-Stores relational and non-relational data
-Supports schema-on-read, saving time in defining data structures, schemas, and transformations (see the sketch after the limitations below)
-Highly scalable, low-cost storage/architecture
-Feeds data for machine learning, predictive analytics, data discovery, and profiling

Limitations:
-Raw data is stored with no oversight of the contents, which may lead to a ‘data swamp’
-Demands the right tools for handling complex data integration
-Needs more governance, semantic consistency, and access controls
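
To illustrate the schema-on-read advantage above, here is a hedged PySpark sketch (the bucket path and the event_type column are hypothetical): the schema is discovered when the raw files are read, not when they are written into the lake.

```python
# Schema-on-read sketch against raw files in a data lake.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-demo").getOrCreate()

# Raw JSON lands in the lake as-is; no schema was defined up front.
events = spark.read.json("s3a://my-lake/raw/events/")   # hypothetical path

# The structure is inferred only now, at read time:
events.printSchema()
events.groupBy("event_type").count().show()             # assumed column name

spark.stop()
```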

ii). Cloud-native Data Warehouse:

The first dedicated decision-support system was created by Teradata in 1983, and the field grew with pioneers Inmon and Kimball in the 90s. Clients are now migrating their DWHs to the cloud and trying to improve governance.
One needs to know when to use a DWH. It is a good fit when a client needs BI reporting and analytics rather than advanced analytics, and it is an ideal choice when an organization's skills and infrastructure are biased towards SQL. Snowflake has been growing in this market recently.

Advantages:
-More structured architecture
-Faster query retrieval
-SQL is sufficient to manage it (no Python/Spark skills required)

Limitations:
-High-cost architecture (BI compute can be expensive)
-Governance still needs to be considered

iii). Data Lakehouse:


A lakehouse is a merger of the data lake and the data warehouse: it is essentially a data lake with warehouse-like features, including schema enforcement, governance, and ACID transactions. It is an alternative for data lake users who want more control over their data without a separate DWH, and for clients who do not want to maintain both a separate DWH and a data lake (see the sketch after the limitations below).

Advantages:
-Cost-effective architecture
-Supports large datasets with heavy analytical patterns

Limitations:
-Requires programming skills beyond SQL, which not every client has
-Less mature; needs more built-in functions
-Offered by very few vendors
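
For a concrete feel of the warehouse-like features on top of a lake, here is a hedged sketch using open-source Delta Lake, one common lakehouse storage layer (it assumes the delta-spark package and its jars are installed; the path and column name are illustrative):

```python
# Lakehouse sketch: ACID writes and schema enforcement with Delta Lake.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-demo")
    # Standard open-source Delta Lake session configuration:
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

orders = spark.range(5).withColumnRenamed("id", "order_id")

# ACID write: readers never see a half-written table.
orders.write.format("delta").mode("overwrite").save("/tmp/lakehouse/orders")

# Schema enforcement: appending a frame with a mismatched schema raises an
# error instead of silently corrupting the table.
spark.read.format("delta").load("/tmp/lakehouse/orders").show()

spark.stop()
```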
