Sharing is Caring

Thursday, May 11, 2023

Spark- Window Function

Window functions in Spark

================================================

-> Spark window functions operate on a group of rows (a frame within a partition) and return a single value for every input row. Spark SQL supports three kinds of window functions:

a) Ranking functions

b) Analytic functions

c) Aggregate functions
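
As a rough sketch of the three categories, the snippet below applies one function of each kind to a small DataFrame. The sample rows, the column names, and the local SparkSession are assumptions made for illustration; rank, lag, and avg are standard functions in org.apache.spark.sql.functions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, lag, rank}

// Local session for illustration only (in spark-shell, `spark` already exists)
val spark = SparkSession.builder().master("local[*]").appName("window-kinds").getOrCreate()
import spark.implicits._

// Hypothetical sample data, not taken from the post
val df = Seq(("HR", 3000), ("HR", 5000), ("HR", 7000), ("IT", 7000), ("IT", 8000))
  .toDF("Department", "Salary")

// Ordered window: needed by ranking and analytic functions
val ordered = Window.partitionBy("Department").orderBy("Salary")
// Unordered window: the aggregate sees the whole partition
val whole = Window.partitionBy("Department")

val result = df
  .withColumn("rank", rank().over(ordered))                  // ranking function
  .withColumn("prev_salary", lag("Salary", 1).over(ordered)) // analytic function
  .withColumn("avg_salary", avg("Salary").over(whole))       // aggregate function

result.show()
```

Every input row is kept in the result; each function only adds a column computed from the other rows in the same department.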

Ranking Functions:
=============

-> ROW_NUMBER(): Assigns a unique sequential number, starting at 1, to each row within a window partition, following the window's ordering.

-> RANK(): Assigns a rank to each row within a window partition. Tied rows get the same rank and the ranks that follow are skipped, so this function leaves gaps when there are ties.

-> DENSE_RANK(): Assigns a rank to each row within a window partition without any gaps. It is similar to RANK(); the difference is that RANK() leaves gaps in the ranking when there are ties.

-> NTILE(N): Distributes the ordered rows of a window partition into N groups of as equal a size as possible and returns the group number (1 to N) for each row. The desired number of groups, N, must be specified.
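
The difference between ROW_NUMBER, RANK, and DENSE_RANK can be checked by hand on a small ordered list. The sketch below is plain Scala, not Spark, and only mimics the numbering rules; it uses the five IT salaries from the sample data in the code snippet in this post, sorted ascending.

```scala
// IT salaries from the sample data, already sorted ascending
val salaries = Seq(7000, 7000, 8000, 9990, 10000)

// ROW_NUMBER: position in the ordered partition; ties still get different numbers
val rowNumber = salaries.indices.map(_ + 1)

// RANK: 1 + number of rows with a strictly smaller value (ties share a rank, gaps follow)
val rank = salaries.map(s => salaries.count(_ < s) + 1)

// DENSE_RANK: 1 + number of distinct smaller values (ties share a rank, no gaps)
val denseRank = salaries.map(s => salaries.distinct.count(_ < s) + 1)

println(rowNumber.mkString(", ")) // 1, 2, 3, 4, 5
println(rank.mkString(", "))      // 1, 1, 3, 4, 5
println(denseRank.mkString(", ")) // 1, 1, 2, 3, 4
```

The two rows tied at 7000 share rank 1 in both cases; RANK then jumps to 3, while DENSE_RANK continues with 2.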

✔️ Without PARTITION BY:

NTILE(2) treats the whole result set as a single partition and splits its rows into two groups.

✔️ With PARTITION BY:

With NTILE(2), the rows of each department partition are divided into two groups.
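
To make the group sizes concrete, here is a minimal plain-Scala sketch of how NTILE assigns group numbers. ntileBuckets is a hypothetical helper written for this post, not a Spark API; it assumes the usual NTILE rule that, when the row count does not divide evenly, the earlier groups receive the extra rows.

```scala
// Group number assigned to each row of an ordered partition with `rowCount` rows
def ntileBuckets(n: Int, rowCount: Int): Seq[Int] = {
  val base = rowCount / n   // minimum number of rows per group
  val extra = rowCount % n  // the first `extra` groups get one more row
  (1 to n).flatMap { bucket =>
    Seq.fill(base + (if (bucket <= extra) 1 else 0))(bucket)
  }
}

// Without PARTITION BY: all 12 sample rows form one partition
println(ntileBuckets(2, 12).mkString(", ")) // 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2

// With PARTITION BY Department: IT has 5 rows, HR has 3
println(ntileBuckets(2, 5).mkString(", "))  // 1, 1, 1, 2, 2
println(ntileBuckets(2, 3).mkString(", "))  // 1, 1, 2
```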

✔️ Code Snippet:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, ntile, rank, row_number}

// Sample employees: (id, name, department, salary)
val df = Seq((101, "Mohan", "Admin", 4000),
  (102, "Rajkumar", "HR", 5000),
  (103, "Akbar", "IT", 9990),
  (104, "Dorvin", "Finance", 7000),
  (105, "Rohit", "HR", 3000),
  (106, "Rajesh", "Finance", 9800),
  (107, "Preet", "HR", 7000),
  (108, "Maryam", "Admin", 8000),
  (109, "Sanjay", "IT", 7000),
  (110, "Vasudha", "IT", 7000),
  (111, "Melinda", "IT", 8000),
  (112, "Komal", "IT", 10000))

// Enables toDF on a Seq; assumes a SparkSession named `spark`, as in spark-shell
import spark.implicits._

val df2 = df.toDF("id", "Name", "Department", "Salary")

// One window per department, ordered by ascending salary
val window = Window.partitionBy("Department").orderBy("Salary")

df2.withColumn("row_number", row_number().over(window))
  .withColumn("rank", rank().over(window))
  .withColumn("dense_rank", dense_rank().over(window))
  .withColumn("ntile", ntile(2).over(window))
  .show()

Want to become a Data Architect?

Are you looking for a resource for learning to become a Data Architect?

Data Architect is one of the most senior roles in the data world. To become a Data Architect, one should be an expert in many data-related areas, such as database and data warehouse (DWH) management, ETL, the design and development of big data applications, and visualization.

Check out the guide and resources below to become a Data Architect.

📍 𝗗𝗮𝘁𝗮 𝗪𝗮𝗿𝗲𝗵𝗼𝘂𝘀𝗶𝗻𝗴.....
A data warehouse allows stakeholders to make well-informed business decisions by supporting the process of drawing meaningful conclusions through data analytics.

Learn here for FREE: https://www.youtube.com/watch?v=CHYPF7jxlik

📍 𝗗𝗮𝘁𝗮𝗯𝗮𝘀𝗲 𝗠𝗮𝗻𝗮𝗴𝗲𝗺𝗲𝗻𝘁....
Having a solid foundation in database management will help data engineers build, design, and maintain the overall data infrastructure that supports the business requirements.

Learn here for FREE: https://lnkd.in/dqVJr2yv

📍 𝗘𝗧𝗟........
Data engineers have to work in teams to extract data from various sources, transform it into a reliable form, and load it into systems that other data teams can use to build their own applications.

Learn here for FREE: https://lnkd.in/geaxTyKr

📍 𝗕𝗶𝗴 𝗗𝗮𝘁𝗮 𝗧𝗼𝗼𝗹𝘀 𝗮𝗻𝗱 𝗧𝗲𝗰𝗵𝗻𝗼𝗹𝗼𝗴𝗶𝗲𝘀..........
With vast amounts of data generated every second, companies now face the problem of efficiently handling and storing petabyte-scale data. Among the top tools for handling such big data through distributed processing are Apache Hadoop and Apache Spark.

FREE data engineering roadmap here: https://lnkd.in/g_2BVCFp

📍 𝗖𝗹𝗼𝘂𝗱 𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 𝗦𝗸𝗶𝗹𝗹𝘀......
Learn Azure here: https://lnkd.in/g44QFhWZ

Learn AWS here: https://lnkd.in/gYbF9_4H

Learn GCP here: https://lnkd.in/dc98RFBN

📍 𝗗𝗮𝘁𝗮 𝗠𝗼𝗱𝗲𝗹𝗶𝗻𝗴 𝗮𝗻𝗱 𝗦𝗰𝗵𝗲𝗺𝗮 𝗗𝗲𝘀𝗶𝗴𝗻.....
https://lnkd.in/duTGjk6H

📍 𝗗𝗮𝘁𝗮 𝗩𝗶𝘀𝘂𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻......
The data engineering role often involves using data visualization tools to better understand data and its features.
Learn Power BI: https://lnkd.in/gV2Kn9M8

📍 𝗥𝗲𝗮𝗹-𝗧𝗶𝗺𝗲 𝗗𝗮𝘁𝗮 𝗣𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴 𝗙𝗿𝗮𝗺𝗲𝘄𝗼𝗿𝗸𝘀....
Knowledge of real-time data processing frameworks is crucial for data engineers, as they are often responsible for handling streaming data.

Learn Kafka Here for FREE: https://lnkd.in/gqxpf43i

📍 𝗣𝗿𝗼𝗴𝗿𝗮𝗺𝗺𝗶𝗻𝗴 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲......
Learn Python/Scala here: https://lnkd.in/ggMZRfpf
https://lnkd.in/gSzSiW6h

Learn SQL here: https://lnkd.in/gAGY-vX3

📍 Portfolio projects...
Build a strong #dataengineering portfolio here:
https://lnkd.in/gwzyHuu9

This is an excellent roadmap to succeed in your data journey.

These are the best resources and are definitely recommended.