Saturday, September 10, 2022

Cloud Computing - IaaS, SaaS, PaaS

 

Cloud Computing:

Cloud computing is the delivery of computing services—including servers, storage, databases, networking, software, analytics, and intelligence—over the Internet (“the cloud”) to offer faster innovation, flexible resources, and economies of scale. You typically pay only for the cloud services you use, helping you lower your operating costs, run your infrastructure more efficiently, and scale as your business needs change.

Benefits of cloud computing:

1. Cost 
2. Global scale 
3. Speed 
4. Performance 
5. Security 
6. Reliability 
7. Productivity

Types of cloud computing


Not all clouds are the same, and no one type of cloud computing is right for everyone. Several different models, types, and services have evolved to help offer the right solution for your needs.

First, you need to determine the type of cloud deployment, or cloud computing architecture, that your cloud services will be implemented on. There are three different ways to deploy cloud services: on a public cloud, private cloud, or hybrid cloud.

Public cloud

Public clouds are owned and operated by third-party cloud service providers, which deliver computing resources like servers and storage over the Internet. Microsoft Azure is an example of a public cloud. With a public cloud, all hardware, software, and other supporting infrastructure is owned and managed by the cloud provider. You access these services and manage your account using a web browser.

Private cloud


A private cloud refers to cloud computing resources used exclusively by a single business or organization. A private cloud can be physically located in the company’s on-site datacenter. Some companies also pay third-party service providers to host their private cloud. A private cloud is one in which the services and infrastructure are maintained on a private network.


Hybrid cloud

Hybrid clouds combine public and private clouds, bound together by technology that allows data and applications to be shared between them. By allowing data and applications to move between private and public clouds, a hybrid cloud gives your business greater flexibility and more deployment options, and helps optimize your existing infrastructure, security, and compliance.

Types of cloud services:

Today, everyone is moving towards the cloud world (AWS/GCP/Azure/PCF/VMC). It might be a public cloud, a private cloud, or a hybrid cloud.

But are you aware of the services that cloud computing provides?

There are three major categories of cloud computing services:

a) Infrastructure as a Service (IaaS): It provides only the base infrastructure (virtual machines, software-defined networking, attached storage). End users have to configure and manage the platform and environment themselves, and deploy applications on it.

AWS (EC2), GCP (Compute Engine), and Microsoft Azure (VMs) are examples of IaaS.

b) Software as a Service (SaaS): It is sometimes referred to as “on-demand software” and is typically accessed by users through a thin client such as a web browser. In SaaS, everything is managed by the vendor: applications, runtime, data, middleware, OSes, virtualization, servers, storage, and networking. End users just use the software.

Gmail is the best example of SaaS: the Google team manages everything, and we just use the application through a client or in a browser. Other examples are SAP and Salesforce.

c) Platform as a Service (PaaS): It provides a platform allowing end users to develop, run, and manage applications without the complexity of building and maintaining the underlying infrastructure.

Google App Engine, Cloud Foundry, Heroku, and AWS (Elastic Beanstalk) are some examples of PaaS.

Fig 1.0 below gives a visual comparison of these service models.

fig 1.0

d) Container as a Service (CaaS): It is a form of container-based virtualization in which container engines, orchestration, and the underlying compute resources are delivered to users as a service by a cloud provider.

Google Container Engine (GKE), AWS (ECS), Azure (ACS), and Pivotal (PKS) are some examples of CaaS.

e) Function as a Service (FaaS): It provides a platform allowing customers to develop, run, and manage application functionality without the complexity of building and maintaining the underlying infrastructure.

AWS (Lambda) and Google Cloud Functions are some examples of FaaS.

Hope this gives you a clear idea of the cloud computing services available in the market!


Spark Series - Spark RDD vs DataFrame vs Dataset



As a continuation of the Spark series, today we are going to look at three important user-facing APIs of Spark. To understand the Spark architecture and its components, please refer to my previous post, linked in the first comment. The APIs are as follows:


1. Resilient Distributed Dataset (RDD, 2011)

2. Spark DataFrame (2013)

3. Spark Dataset (2015)

Resilient Distributed Dataset (RDD):

RDDs, or Resilient Distributed Datasets, are the fundamental data structure of Spark. An RDD is a collection of objects whose data is partitioned across multiple nodes of the cluster, which allows it to be processed in parallel.
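A minimal sketch in Scala, assuming a local run (the app name and data are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    object RddExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))
        // Distribute a local collection across the cluster as an RDD
        val nums = sc.parallelize(1 to 10)
        // Transformations are lazy; nothing runs until an action is called
        val evenSquares = nums.map(n => n * n).filter(_ % 2 == 0)
        // collect() is an action: it triggers execution and returns results to the driver
        println(evenSquares.collect().mkString(", "))
        sc.stop()
      }
    }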

RDD Features:

1. Resilient and distributed
2. In-memory
3. Immutable

Limitations:

1. No built-in optimization engine (no Catalyst optimizer)
2. Type errors in structured data surface only at runtime
3. Performance degrades when there is not enough memory
4. No schema view of structured data

Spark DataFrames:

Spark DataFrames are a distributed collection of data organized into named columns, conceptually equivalent to a table in a relational database. Because a DataFrame carries a schema, the data is easier to inspect and debug at runtime than with an RDD. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs. Since DataFrames use the Catalyst optimizer for optimization, they are capable of processing large datasets efficiently.
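A minimal sketch in Scala, assuming a local SparkSession (names and data are illustrative), of building a DataFrame and running a column-based query that the Catalyst optimizer plans:

    import org.apache.spark.sql.SparkSession

    object DataFrameExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("df-demo").master("local[*]").getOrCreate()
        import spark.implicits._
        // Build a DataFrame from a local collection; the columns get the given names
        val df = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")
        // Queries against named columns are planned by the Catalyst optimizer
        df.filter($"age" > 30).select("name").show()
        // Reading from files works the same way, e.g. spark.read.json(...)
        spark.stop()
      }
    }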

Features:
1. Catalyst optimizer for efficient data processing across multiple languages
2. Capable of handling structured and semi-structured data
3. Support for various data sources and formats, such as Hive, CSV, XML, JSON, RDDs, Cassandra, Parquet, etc.

Limitation:
1. Lacks the compile-time type safety and object-level convenience of RDDs.

Spark Dataset:

Spark Datasets are an extension of the DataFrame API that provides the benefits of both RDDs and DataFrames. A Dataset is fast and provides a type-safe interface. Type safety means that the compiler validates the data types of all the columns in the dataset at compile time and throws an error if there is any mismatch.
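A minimal sketch in Scala, assuming a local SparkSession (the Person case class and data are illustrative), showing the compile-time checking a Dataset provides:

    import org.apache.spark.sql.SparkSession

    // The case class gives the Dataset its compile-time schema
    case class Person(name: String, age: Int)

    object DatasetExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("ds-demo").master("local[*]").getOrCreate()
        import spark.implicits._
        // A strongly typed Dataset[Person]; field access is checked by the compiler
        val people = Seq(Person("alice", 34), Person("bob", 29)).toDS()
        // people.map(_.agee) would fail at compile time, not at runtime
        people.filter(_.age > 30).map(_.name).show()
        spark.stop()
      }
    }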

Features:
1. Optimized query performance
2. Analysis at compile time
3. Inter-convertible (with RDDs and DataFrames)
4. Persistent storage
5. Single API for Java and Scala

To summarize the differences: RDDs offer low-level control and compile-time type safety but no query optimizer; DataFrames add a schema and the Catalyst optimizer but catch errors only at runtime; Datasets combine the Catalyst optimizer with compile-time type safety.






Thursday, September 8, 2022

Free Datasets for Data Enthusiasts


Where can I find free datasets to practice data analysis/data science projects?


Most of us are planning to get into data platform related jobs like Data Engineer/Data Analyst/Data Scientist. To land these jobs, theoretical skills are not enough. Whether you have some experience in another IT domain or are a fresher, doing some data projects is essential to sustain and survive in these jobs.


You can get tons and tons of material on YouTube and other websites covering the theoretical part, but you may find it difficult to locate suitable datasets to practice with. Data enthusiasts who are looking for practice data can refer to the websites below:


Financials, economic & alternative datasets - Nasdaq data link: https://data.nasdaq.com/
US Government's open data - Data gov: https://data.gov/
Google's data search: https://lnkd.in/eZNsCa6g
Codebasics open datasets: https://lnkd.in/eU_72tVK
ImageNet (Images dataset) - https://www.image-net.org/
Cocodataset (images dataset): https://lnkd.in/ezDhyHtN
Open images dataset with annotations: https://lnkd.in/ezn8DBkK
UCI Machine learning repository: https://lnkd.in/gmPg2KQX
Libsvm datasets: https://lnkd.in/gGj-egax
Fake news dataset from Kaggle: https://lnkd.in/g5SmjykQ
Inventory management datasets: https://lnkd.in/gttsbt4q
MovieLens dataset: https://lnkd.in/gKqHbPkn
Ecommerce dataset: https://lnkd.in/gakCNUz2
Data exploration & predictive modeling datasets: https://lnkd.in/gHpmdP2k
Emotional detection from face: https://lnkd.in/g__jcMDQ
IMDB review datasets: https://lnkd.in/gWCC2Beq
Sentiment lexicons for 81 languages: https://lnkd.in/gYQwYepa


How to Decide on Spark Cluster Configuration?


Spark Optimization:

Spark optimization is one of the most important concepts. It is the responsibility of the developer to design the Spark application in an optimized manner so that we can take full advantage of the Spark execution engine. When an application is not optimized, even simple code takes longer to execute, resulting in performance lags and downtime, and it affects other applications that use the same cluster.

There are two ways to optimize a Spark application:

1. Resource/cluster-level optimization

2. Code/application-level optimization

Even if the developer writes optimized code, it is of little use when the application does not have enough resources to run it, as the job will still take a lot of time due to the lack of resources.

Spark Architecture:

Before diving deep into Spark resource-level optimization, let's first try to understand the Spark architecture, so the process of resource-level optimization will be easier to follow.

Apache Spark follows a master/slave architecture with two main daemons and a cluster manager:

a) Master daemon (master/driver process)

b) Worker daemon (slave process)

c) Cluster manager

A Spark cluster has a master node and slave/worker nodes. The driver and the executors run as individual Java processes, and users can run them on the same horizontal Spark cluster or on separate machines.

Master node or driver process: The master is the driver that runs the main() program, where the SparkContext is created. It then interacts with the cluster manager to schedule the job execution and perform the tasks.

Slave/worker nodes: A worker hosts the processes that run in parallel to perform the tasks scheduled by the driver program. These processes are called executors/containers.

The role of worker nodes/executors:

1. Perform the data processing for the application code

2. Read data from and write data to external sources

3. Store the computation results in memory or on disk

Executors/containers contain cores and RAM. They are launched once, at the beginning of a Spark application, and then run for the entire lifetime of the application. All the cores and the total memory are divided among the executors. One node can hold one or more executors. The individual tasks in a given Spark job run inside the Spark executors/containers.

An executor/container, then, is a combination of memory (RAM) and cores.

The number of cores decides the number of parallel tasks in each executor. So, say we have 5 cores in one executor; then a maximum of 5 tasks can execute in that executor in parallel.

Tasks: A task is the smallest execution unit in Spark. A task executes a series of instructions; for example, reading data, filtering it, and applying map() can be combined into a single task. Tasks are executed inside an executor. Based on the HDFS block size (128 MB), Spark creates partitions and assigns one task per partition. So if we have 1 GB (1024 MB) of data, Spark will create 1024 / 128 = 8 partitions.
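As a small illustration in Scala (the HDFS path is hypothetical, and sc is assumed to be an existing SparkContext):

    // Load ~1 GB of data from HDFS; with a 128 MB block size,
    // Spark creates roughly 1024 / 128 = 8 partitions
    val rdd = sc.textFile("hdfs:///data/events.log")
    // One task is scheduled per partition
    println(rdd.getNumPartitions)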

Stage: A group of tasks creates a stage. The number of stages depends on data shuffling (narrow vs. wide transformations). When Spark encounters a function that requires a shuffle, it creates a new stage. Transformations like reduceByKey() and join() trigger a shuffle and result in a new stage.

Job: A group of stages creates a job; Spark submits one job for each action, as the sketch below shows.
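A short Scala sketch (the HDFS path is hypothetical, and sc is assumed to be an existing SparkContext) showing where the stage boundary falls:

    // Stage 1: read the file and apply narrow transformations
    val counts = sc.textFile("hdfs:///data/words.txt")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _) // wide transformation: the shuffle here starts stage 2
    counts.collect()      // action: submits one job containing both stages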

How to Decide Cluster Configuration

Let's assume we have a 10-node cluster:

16 CPU cores per node

64 GB RAM per node

Now we can decide the number of executors based on the number of CPU cores. Each executor can hold a minimum of 1 core; the maximum is the number of CPU cores available.

Suppose we take the minimum value, i.e. one core per executor.

In that case we will have 16 executors with 1 core per executor, and each executor will have 64 GB / 16 = 4 GB RAM.

As we discussed earlier, the number of cores decides the number of parallel tasks, so in this case each executor cannot execute more than one task at a time.

If we create a broadcast or accumulator variable, a copy of it will be created on each executor.

So it is not a good idea to assign one core per executor. This configuration is also called the "tiny executor" approach.

Let's discuss the second approach, where we create a single executor per node and assign all 16 cores to it. Then we will have one executor with 16 cores and 64 GB RAM per node.

Again, in this case we will have 16 parallel tasks per executor.

It has been observed that if we run more than 5 parallel tasks, i.e. have more than 5 cores per executor, HDFS throughput suffers.

And if an executor holds a huge amount of memory (64 GB), garbage collection takes a long time.

So this is also not a good cluster configuration. This configuration is also called the "fat executor" approach.

Neither the tiny executor nor the fat executor is a good cluster configuration.

We should always take a balanced approach, so we can use our cluster effectively.

Let's try to configure an optimized cluster.

Resources

1. Total Number Of Nodes : 10

2. Cores Per Node : 16

3. RAM Per Node : 64 GB

So, from the above resources, we need to reserve 1 core and 1 GB RAM on each node for the OS and other activities.

So now we are left with:

1. Total number of nodes: 10
2. Cores per node: 15
3. RAM per node: 63 GB

As discussed above, 5 cores per executor is the best and preferred choice.

With 15 cores and 63 GB RAM per machine, we can have 15 / 5 = 3 executors per machine. Memory: 63 / 3 = 21 GB per executor.

1 node → 3 executors → 5 cores per executor → 21 GB RAM per executor.

Out of this 21 GB RAM, some will go towards overhead memory (off-heap):

overhead memory = max(384 MB, 7% of executor memory)

i.e. max(384 MB, 7% of 21 GB) ≈ 1.5 GB

So the remaining memory will be 21 - 1.5 = 19.5 GB, which we round down to 19 GB per executor.

So the calculation will be:

10-node cluster

Each node has 3 executors: 10 × 3 = 30 executors

So we have 30 executors, each holding 5 CPU cores and 19 GB of memory.

Now, out of these 30 executors, 1 executor will be given to the YARN Application Master.

So now our final cluster configuration will look like this:

Number of nodes: 10
Total number of executors: 29
Cores per executor: 5
RAM per executor: 19 GB
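A minimal sketch in Scala of how this sizing might be applied when building the session; the same values map to the spark-submit flags --num-executors, --executor-cores, and --executor-memory (the app name is illustrative):

    import org.apache.spark.sql.SparkSession

    object SizedJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("sized-job")                      // illustrative name
          .config("spark.executor.instances", "29")  // 30 executors minus 1 for the YARN Application Master
          .config("spark.executor.cores", "5")       // at most 5 parallel tasks per executor
          .config("spark.executor.memory", "19g")    // 21 GB minus ~1.5 GB off-heap overhead
          .getOrCreate()
        // ... job logic ...
        spark.stop()
      }
    }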




