Saturday, September 10, 2022

Spark Series - Spark RDD vs DataFrame vs Dataset



Continuing the Spark series, today we are going to look at three important user-facing APIs of Spark. To understand the Spark architecture and its components, please refer to my previous post linked in the first comment. Those APIs are as follows:


1. Resilient Distributed Dataset (2011)

2. Spark DataFrame (2013)

3. Spark Dataset (2015)

Resilient Distributed Dataset (RDD):

RDDs, or Resilient Distributed Datasets, are the fundamental data structure of Spark. An RDD is a collection of objects whose data is partitioned across the multiple nodes of a cluster, which allows it to be processed in parallel.

RDD Features:

1.      Resilient & Distributed
2.      In-memory
3.      Immutable

Limitations:

1.      No built-in optimization engine (no Catalyst optimizer)
2.      Serialization and garbage-collection overhead
3.      Performance degrades when there is not enough memory
4.      No schema view of structured data

Spark DataFrames:

Spark DataFrames are a distributed collection of data organized into named columns. A DataFrame is conceptually equivalent to a table in a relational database. Because a DataFrame carries a schema, Spark can analyse a query and report errors such as references to missing columns before the job runs, which is not possible with plain RDDs. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs. As DataFrames use the Catalyst optimizer, they are capable of processing large datasets efficiently.

Features:
1. Catalyst optimizer for efficient data processing across multiple languages
2. Capable of handling structured and semi-structured data
3. Support for various data sources and formats, such as Hive, CSV, XML, JSON, Cassandra, Parquet, and existing RDDs

Limitation:
1.      No compile-time type safety: unlike RDDs, type errors in a query surface only at runtime

Spark Dataset:

The Spark Dataset is an extension of the DataFrame API that combines the benefits of RDDs and DataFrames. It is fast and provides a type-safe interface. Type safety means that the compiler validates the data types of all the columns in the Dataset at compile time and throws an error if there is any mismatch in the data types.

Features:
1. Optimized query performance
2. Analysis errors caught at compile time
3. Inter-convertible with RDDs and DataFrames
4. Persistent storage
5. Single API for Java and Scala

The differences between these three APIs are given in the picture below.






