Menu Close

How to Find the Nth Highest Salary Using PySpark

How to find nth Highest Salary using PySpark

In this article, we will see how to find the Nth highest salary using PySpark with the help of the examples. Throughout this article, we will explore two things, Nth highest salary in all the data and the Nth highest salary within a group The group can be anything like department, country, gender, etc.

This is one of the asked questions by interviewers in Python developer, Data Scientist, or Data Engineering interviews as a data engineer I have got this question many times in interviews.

I have prepared small datasets with some records as you can see, in Real-life applications, you might have large datasets but as of now, we are about to find Nth highest salary using PySpark using this dataset.

Find the Nth Highest Salary Using PySpark
Sample CSV Data

As you can see in the dataset, there are multiple columns, First, we will get the Nth highest salary in the whole dataset and second we will see the Nth highest salary in each department.

Here Nth means any number you want to for I am about to find out 2nd highest salaried employee.

Let’s start.

Find the Nth Highest Salary Using PySpark

Let’s find out the Nth highest salary by considering the department or without considering the department.

Without Department

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import desc, row_number



# creating spark session
spark = (
    SparkSession.builder.master("local[*]")
    .appName("www.programmingfunda.com")
    .getOrCreate()
)


# creating DataFrame from csv file
df = spark.read.option("header", "true").csv("./sample_data.csv")


# creating a window specification
# windowFunction = Window.orderBy(desc("salary"))


# applying window function
df = df.withColumn("rank", row_number().over(windowFunction))

# getting 2nd highest salaried employee
df = df.filter(df["rank"] == 2)

df.show()

Output:

+----------+---------+------+----------+---+----------+---+------+----+
|first_name|last_name|gender|department|age|      date| id|salary|rank|
+----------+---------+------+----------+---+----------+---+------+----+
| Vishvajit|      Rao|  Male|   Testing| 24|2023-04-10|232| 35000|   2|
+----------+---------+------+----------+---+----------+---+------+----+

In the Output, Vishvajit is an employee who is second in salary among all the employees.

Let’s move on to the second, where we will see Nth highest salary in each department in PySpark.

By Department

Here we will see the second-highest salary in each department, we have three departments dataset Testing, Marketing, and IT.

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import desc, row_number



# creating spark session
spark = (
    SparkSession.builder.master("local[*]")
    .appName("www.programmingfunda.com")
    .getOrCreate()
)


# creating DataFrame from csv file
df = spark.read.option("header", "true").csv("./sample_data.csv")


# creating a window specification
windowFunction = Window.partitionBy("department").orderBy(desc("salary"))


# applying window function
df = df.withColumn("rank", row_number().over(windowFunction))

# getting 2nd highest salaried employee
df = df.filter(df["rank"] == 2)

df.show()

Output

+----------+---------+------+----------+---+----------+----+------+----+
|first_name|last_name|gender|department|age|      date|  id|salary|rank|
+----------+---------+------+----------+---+----------+----+------+----+
|  Kathleen|   Hanner|Female|        IT| 25|2017-10-15|1876| 29000|   2|
|     Dulce|    Abril|Female| Marketing| 32|2017-10-15|1562| 28750|   2|
| Vishvajit|      Rao|  Male|   Testing| 24|2023-04-10| 232| 35000|   2|
+----------+---------+------+----------+---+----------+----+------+----+

This is how you can get nth highest salary in a specific group or in the whole dataset.


See Also:


Conclusion

Throughout this article, we have explored the nth highest salary using PySpark with the help of the example. Now you can easily think kind of the questions. You can use any column instead of the department, for example, nth highest salary in country, gender, etc.

If you found this article helpful, please share and keep visiting for further PySpark tutorials.

Happy Coding!

Internal Working of Reduce Action in PySpark

Related Posts