Saturday, August 20, 2022

Apache Spark Series - Introduction



As many of you requested, I am starting a Spark series for beginners. Before getting into the various Spark concepts, you need to understand what Spark is, along with its components and architecture. Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning workloads on a single node or on clusters.
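
Below is a minimal PySpark sketch (assuming the pyspark package is installed) that starts Spark in single-node local mode and runs a tiny computation; the application name and sample data are only illustrative.

# Start Spark on a single node (local mode) and run a small job.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("intro-example")      # illustrative name
         .master("local[*]")            # single-node mode; a cluster URL would go here instead
         .getOrCreate())

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
print(df.count())                       # prints 2

spark.stop()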

Limitations of MapReduce:


1. In a MapReduce program, each phase (shuffle, sort, group by, reduce) reads data from and writes data back to disk (I/O operations), which reduces computation speed.
2. MapReduce can only handle historical (past) data in batch processing; because of its slowness, it cannot be used for real-time streaming.

Why do we need Spark?

1. In-memory computation (faster than MapReduce, low latency, real-time computation)
2. Lazy evaluation (computation happens only when needed; see the sketch after this list)
3. Supports languages such as Java, Scala, Python, and R
4. Easy integration with Hadoop
5. Supports machine learning modules
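
To illustrate lazy evaluation, here is a small sketch (the data is illustrative): transformations such as filter() only build up a plan, and nothing is computed until an action such as count() is called.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval").master("local[*]").getOrCreate()

numbers = spark.range(1000000)                 # DataFrame of ids 0..999999
evens = numbers.filter(numbers.id % 2 == 0)    # transformation: lazy, nothing runs yet

print(evens.count())                           # action: Spark now executes the job (prints 500000)

spark.stop()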

Spark Modules:

1. Spark SQL: It is for working with structured data; it supports SQL and HiveQL and works with a variety of data formats, including Hive tables, Parquet, and JSON (a short example follows this list).

2. Spark Streaming: It is mainly for streaming (real-time) data; it can consume streams from Kafka, Flume, ZeroMQ, and Twitter, and write the output to HDFS and databases.
3. Spark Core: It provides the underlying execution engine, with in-memory computing, memory management, and integration with storage systems.
4. MLlib: It consists of (iterative) machine learning algorithms, mainly for delivering advanced analytics.
5. GraphX: It is a graph computation engine that lets you process graph data at scale.
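
As a short example of Spark SQL, the sketch below registers a DataFrame as a temporary view and queries it with plain SQL; the file people.json and its name/age columns are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-example").master("local[*]").getOrCreate()

people = spark.read.json("people.json")        # hypothetical file; Parquet and Hive tables work similarly
people.createOrReplaceTempView("people")       # expose the DataFrame to SQL

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

spark.stop()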

Components of Spark:

Job: A piece of code that reads some data from HDFS or the local file system, performs some computation on it, and writes some output data.

Stages: Jobs are divided into stages. Map and reduce stages are the two types of stages (this is easier to follow if you have worked on Hadoop and want to correlate). All the computations (operators) of a job cannot run in a single stage; Spark splits them into stages at computational boundaries such as shuffles, so a job executes over several stages (see the sketch after these definitions).

Tasks: Each stage has a number of tasks, one per partition. One task is executed on one data partition on one executor (machine).

DAG: DAG stands for Directed Acyclic Graph; in this context, it refers to the DAG of operators that Spark builds for a job.

Executor: The process responsible for executing a task.

Driver: The program/process responsible for running the job on the Spark engine.

Master: The machine on which the Driver program runs

Slave: The machine on which the Executor program runs
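
To see how these pieces fit together, here is a sketch (column names and data are illustrative): the groupBy below forces a shuffle, so the single action becomes a job whose work is split into stages at that boundary, with one task per partition running on the executors.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stages-example").master("local[*]").getOrCreate()

sales = spark.createDataFrame(
    [("fruit", 3), ("fruit", 5), ("veg", 2)],
    ["category", "amount"])

totals = sales.groupBy("category").agg(F.sum("amount").alias("total"))

totals.explain()   # the physical plan shows an Exchange (shuffle), the stage boundary
totals.show()      # the action that actually triggers the job

spark.stop()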




