Sharing is Caring: 2022-09-11

Saturday, September 17, 2022

𝐘𝐨𝐮𝐭𝐮𝐛𝐞 𝐜𝐡𝐚𝐧𝐧𝐞𝐥𝐬 𝐟𝐨𝐫 D𝐚𝐭𝐚 E𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠

Most of the data enthusiasts are interested to learn big data/DE technical skills. To upgrade skills on big data and Data engineering, you can refer below Youtube channels.

📌 𝐄-𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠 𝐁𝐫𝐢𝐝𝐠𝐞 |

https://lnkd.in/dJ5-fa7J

📌 𝐎𝐧𝐥𝐢𝐧𝐞𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠𝐂𝐞𝐧𝐭𝐞𝐫

https://lnkd.in/duc8YdX9

📌 𝐓𝐫𝐞𝐧𝐝𝐲𝐭𝐞𝐜𝐡 𝐈𝐧𝐬𝐢𝐠𝐡𝐭𝐬

https://lnkd.in/dwQMmBSt

📌 𝐀𝐧𝐤𝐢𝐭 𝐁𝐚𝐧𝐬𝐚𝐥

https://lnkd.in/dfz_y5Bq

📌 𝐃𝐚𝐫𝐬𝐡𝐢𝐥 𝐏𝐚𝐫𝐦𝐚𝐫

https://lnkd.in/dACwVCuy

📌 𝐓𝐡𝐞 𝐁𝐢𝐠 𝐃𝐚𝐭𝐚 𝐒𝐡𝐨𝐰

https://lnkd.in/dNPCCAxm

📌 𝐭𝐞𝐜𝐡𝐓𝐅𝐐

https://lnkd.in/dJvE3c24

📌 𝐒𝐞𝐚𝐭𝐭𝐥𝐞 𝐃𝐚𝐭𝐚 𝐆𝐮𝐲 |

https://lnkd.in/dr6CkApX

Common Issues in Data Platform

In data platform, we can encounter many common issues in day-to-day life. In this post, we will discuss issues and actions to be taken to resolve them. Here I am given them as a one liner.

Issue: Lack of data definition
Action: a central catalog of data definition and business glossary

Issue: Cross-system mismatch
Action: map data across the system in a unanimous fashion

Issue: Orphaned data
Action: all data files should be indexed/cataloged

Issue: Irrelevant Data
Action: identify and reconcile on a regular basis

Issue: Lack of history
Action: use snapshot enabled tools

Issue: Mishandling of late data
Action: separate data pipeline for backfilling with least downtime

Issue: Missing Attributes
Action: schema validation and reconciliation

Issue: Missing Values
Action: drop or impute

Issue: Missing Records
Action: attach metadata

Issue: Default Values
Action: cover in data catalog

Issue: Duplication of records
Action: check and purge/merge

Issue: Attribute format inconsistency
Action: standardize attribute format throughout its lifecycle

In summary, we should place controls at various levels to fix issues in our data platform:

- Fixes at the source system
- Fixes during the transformation process
- Continuous data profiling
- Guardrails at the Metadata layer
- DQ checks and alerts at the consumption layer.

Resources For Data Science and AI

Resources for Data Science and AI

Many of data science enthusiasts are keep asking on resources to upskill to data science and AI. To learn Data science and AI, you can refer the below courses and books:

Courses
📌 (Beginner) - Udacity - Intro to Machine Learning
📌 (Intermediate) - Coursera - Deep Learning Specialization
📌 (Advanced) - Coursera/UdacityUpgrad for specific courses on HCI, NLP, Reinforcement Learning, Computer Vision

Pet Projects
📌 https://www.kaggle.com/
📌 https://lnkd.in/gQ5cfG5T
📌 https://machinehack.com/

Books (Top 5 Favs)
1. Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python

2. Advances in Financial Machine Learning

3. Reinforcement Learning - Sutton

4. Prediction Machines: The Simple Economics of Artificial Intelligence

5. Trustworthy AI: A Business Guide for Navigating Trust and Ethics in AI

Free Books

1- Data Science at the Command Line by Jeroen Janssens: https://lnkd.in/gbjdkW9M

2- Deep Learning on Graphs by Yao Ma and Jiliang Tang: https://lnkd.in/g3g-puib

3- Hands-on Machine Learning with Scikit-learn, Keras and Tensorflow by Aurelien Geron: https://lnkd.in/gzeASHUd

4- Practical Statistics for Data Science by Peter Bruce & Andrew Bruce https://lnkd.in/gfUUfb6K

5-An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani: https://lnkd.in/eBCkgBS

6-Learning Deep Architectures for AI by Yoshua Bengio: https://lnkd.in/gHNKMzE2

7- Python for Data Science Handbook by Jake VanderPlas: https://lnkd.in/bxTAdNY

8- The Hundred-Page Machine Learning Book by Andriy Burkov:https://lnkd.in/gdbbUuPH

9- A Course in Machine Learning by Hal Daumé III: https://lnkd.in/gDr2C7qi

10- Intuitive ML and Big Data in C++, Scala, Java, and Python by Kareem Alkaseer: https://lnkd.in/eVanhXm

11- Python Notes for Professionals book: https://lnkd.in/g2cNnFjJ

12- Learning Pandas https://lnkd.in/gM9C2BvN

13- Machine Learning - A First Course for Engineers and Scientists by Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön: https://lnkd.in/gzuNxKi3

14- Dive into Deep Learning by Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola: https://d2l.ai/d2l-en.pdf

15- A Comprehensive Guide to Machine Learning Soroush Nasiriany, Garrett Thomas, William Wang, Alex Yang, Jennifer Listgarten, Anant Sahai: https://lnkd.in/gp3AKgMY

16- SQL Notes for Professionals book: https://lnkd.in/g5dNZCuD

17-Algorithms Notes for Professionals book: https://lnkd.in/eX6YkWv

18- Deep Learning Interviews: Hundreds of fully solved job interview questions from a wide range of key topics in AI by Shlomo Kashani, Amir Ivry: https://lnkd.in/gMFVTbrn

19- Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz and Shai Ben-David : https://lnkd.in/gEJGTfB7

20. Data Science Interview Questions: kojino-interview-questions

More books: https://lnkd.in/gPNmRcdV

Thursday, September 15, 2022

Free Resources for SQL

Even after evolution of many technologies, SQL(Structured Query Language) is still dominate in data platform on data processing. To learn and practice SQL, you can validate the below free resources:

1. SQL Interview prep doc: nerdsfornerds.in
2. Select * SQL: selectstarsql.com/
3. Leetcode: lnkd.in/g3c5JGC
4. LinkedIn Learning: lnkd.in/gQXFc4n
5. Window Functions: lnkd.in/g3RtPCJ
6. HackerRank: lnkd.in/grv_9sB
7. W3 Schools: lnkd.in/gJPfrrv
8. CodeAcademy: lnkd.in/gT5xmpN
9. SQLZOO: sqlzoo.net/
10. SQL Bolt: sqlbolt.com/
11. Danny Ma's SQL: 8weeksqlchallenge.com/
12. Interactive SQL: sqlcourse.com/
13. Pythonish SQL: stratascratch.com
14. Just TSQL: sqlservertutorial.net
15. Last but not the least: mode.com

Kafka Series-Architecture

What is Apache Kafka?

Apache Kafka is a distributed data store optimized for ingesting and processing streaming data in real-time. Streaming data is data that is continuously generated by thousands of data sources, which typically send the data records in simultaneously. A streaming platform needs to handle this constant influx of data, and process the data sequentially and incrementally.

Kafka provides three main functions to its users:

· Publish and subscribe to streams of records

· Effectively store streams of records in the order in which records were generated

· Process streams of records in real time

Kafka is primarily used to build real-time streaming data pipelines and applications that adapt to the data streams. It combines messaging, storage, and stream processing to allow storage and analysis of both historical and real-time data.

Why would you use Kafka?

Kafka is used to build real-time streaming data pipelines and real-time streaming applications. A data pipeline reliably processes and moves data from one system to another, and a streaming application is an application that consumes streams of data. For example, if you want to create a data pipeline that takes in user activity data to track how people use your website in real-time,

Kafka would be used to ingest and store streaming data while serving reads for the applications powering the data pipeline. Kafka is also often used as a message broker solution, which is a platform that processes and mediates communication between two applications.

How does Kafka work?

Kafka combines two messaging models, queuing and publish-subscribe, to provide the key benefits of each to consumers. Queuing allows for data processing to be distributed across many consumer instances, making it highly scalable. However, traditional queues aren’t multi-subscriber. The publish-subscribe approach is multi-subscriber, but because every message goes to every subscriber it cannot be used to distribute work across multiple worker processes. Kafka uses a partitioned log model to stitch together these two solutions. A log is an ordered sequence of records, and these logs are broken up into segments, or partitions, that correspond to different subscribers. This means that there can be multiple subscribers to the same topic and each is assigned a partition to allow for higher scalability. Finally, Kafka’s model provides replay ability, which allows multiple independent applications reading from data streams to work independently at their own rate.

Benefits of Kafka's approach

Scalable

Kafka’s partitioned log model allows data to be distributed across multiple servers, making it scalable beyond what would fit on a single server.

Fast

Kafka decouples data streams so there is very low latency, making it extremely fast.

Durable

Partitions are distributed and replicated across many servers, and the data is all written to disk. This helps protect against server failure, making the data very fault-tolerant and durable.

Kafka's architecture

Kafka remedies the two different models by publishing records to different topics. Each topic has a partitioned log, which is a structured commit log that keeps track of all records in order and appends new ones in real time. These partitions are distributed and replicated across multiple servers, allowing for high scalability, fault-tolerance, and parallelism. Each consumer is assigned a partition in the topic, which allows for multi-subscribers while maintaining the order of the data. By combining these messaging models, Kafka offers the benefits of both. Kafka also acts as a very scalable and fault-tolerant storage system by writing and replicating all data to disk. By default, Kafka keeps data stored on disk until it runs out of space, but the user can also set a retention limit. Kafka has four APIs:

· Producer API: used to publish a stream of records to a Kafka topic.

· Consumer API: used to subscribe to topics and process their streams of records.

· Streams API: enables applications to behave as stream processors, which take in an input stream from topic(s) and transform it to an output stream which goes into different output topic(s).

· Connector API: allows users to seamlessly automate the addition of another application or data system to their current Kafka topics.

Wednesday, September 14, 2022

How to build an optimized Spark Application?

Few key points to remember while doing building spark applications to optimize performance

Spark UI (Monitor and Inspect Jobs).
Level of Parallelism (Clusters will not be fully utilized unless the level of parallelism for each operation is high enough. Spark automatically sets the number of partitions of an input file according to its size and for distributed shuffles, such as groupByKey and reduceByKey, it uses the largest parent RDD’s number of partitions. You can pass the level of parallelism as a second argument to an operation. In general, 2–3 tasks per CPU core in your cluster are recommended. That said, having tasks that are too small is also not advisable as there is some over head paid to schedule and run a task.As a rule of thumb tasks should take at least 100 ms to execute).
Reduce working set size (Operations like groupByKey can fail terribly when their working set is huge. Best way to deal with this will be to change the level of parallelism)
Avoid groupByKey for associative operations(use operations that can combine)
Multiple Disk (give spark multiple disks for intermediate persistence. This done via setting in ResourceManager)
Degree of Parallelism (~ 2 to 3 time the number of cores on Worker nodes)
Performance due to chosen Language (Scala > Java >> Python > R)
Higher level APIs are better (Use Dataframe for core processing, MLlibfor Machine Learning, SparkSQL for Queryand GraphXfor Graphprocessing)
Avoid collecting large RDDs (use take or takeSample).
Use Dataframe (This is more efficient and uses Catalyst optimizer.)
Use Scope as provided in maven to avoid packaging all the dependencies
Filter First, Shuffle next
Cache after hard work
Spark Streaming — enable back pressure (This will tell kafka to slow down rate of sending messages if the processing time is coming more than batch interval and scheduling delay is increasing)
If using Kafka, choose Direct Kafka approach
Extend Catalyst Optimizer’s code to add/modify rules
Improve Shuffle Performance:
a. Enable LZF or SnappyCompression (for shuffle)
b.Enable Kryo Serialization
c. Keep shuffle data small(using reduceByKey or filter before shuffle)
d. No Shuffle block canbe greater than2GB in size. Else exception:size is greater than Interger.MAX_SIZE. Spark uses ByteBuffer for ShuffleBlocks. ByteBuffer is limitedby Integer.MAX_SIZE = 2 GB. Ideally, each partition should have roughly128 MB.
e. Think about partition/ bucketing ahead of time.
f. Do as much as possible with a single shuffle
Use cogroup (instead of rdd.flatmap.join.groupby)
Spend time of reading RDD lineage graph (handy way is to read RDD.toDebugString() )
Optimize Join Performance a. Use Salting to avoid SkewKeys. Skew sets are the ones where data is not distributed evenly. One for Few partitions have huge amount of Data in comparison to other partitions.
a. Here change the (regular key) to (concatenate (regular key, “:”, random number)).
b.Once this is done, then first do join operation on salted keys and then do the operation on unsalted keys b. Use partitionBy(new hash partition())
Use Caching (Instead of MEM_ONLY, use MEM_ONLY_SER. This has better GC for larger datasets)
Always cache after repartition.
A Map after partitionBy will lose the partition information. Use mapValue instead
Speculative Execution (Enable Speculative execution to tackle stragglers)
Coalesce or repartition to avoid massive partitions (smaller partitions work better)
Use Broadcast variables
Use Kryo Serialization (more compact and faster than Java Serialization. Kryo is only supported in RDD caching and shuffling– not in Serialize To disk operations like SaveAsObjectFile)

Free resources for Data Science Interview preparation

Please refer the below free resources for Data Science preparation:

1. Alexey Grigorev's Data Science Interviews GitHub repo:

https://lnkd.in/gC9CC3SQ

2. The 9-Day Data Science Interview Crash Course

(has 27 FAANG problems sent over 9 emails)

https://lnkd.in/dPMR-NdF

3. Chip Huyen's Machine Learning Interview Book:

https://lnkd.in/gfH9D_jp

4. DataLemur: Ace the #SQL Interview

Has 60+ SQL interview questions from FAANG.

All the hints + solutions are 100% free
(unlike LeetCode and HackerRank)

http://datalemur.com/

5. This Reddit Thread With 21 Questions

These questions come from a Redditor interviewing for a $120k entry-level Data Science job in Washington D.C:

(it has a really good comment section discussing interview prep!)

https://lnkd.in/gA7Bt8Ad

6. 40 Prob/Stat questions asked by FAANG:

https://lnkd.in/dTRiQAv

Sharing is Caring