
PySpark Sort Function with Examples

Hi PySpark lovers! In this article, you will learn everything about the PySpark sort functions, with proper explanations and examples showing how to sort specific columns of a DataFrame.

Throughout this article, we will explore the various sort functions you can apply to a PySpark DataFrame column, and we will also see how to sort a PySpark DataFrame by multiple columns.
PySpark provides six sort functions that you can use to sort a column of a PySpark DataFrame or RDD in ascending or descending order.

What is the PySpark sort function?

Sort functions in PySpark are defined inside the pyspark.sql.functions module, which ships with PySpark; you just need to import them. Each sort function takes a column as a parameter and returns a sort expression.

You can find all the sort functions below, each of which can be used to sort a PySpark DataFrame column.

  • asc():- Sorts the PySpark DataFrame column in ascending order. It takes a column name as a parameter and returns a sort expression based on the ascending order of the given column.
  • asc_nulls_first():- Also sorts the given column in ascending order, but places null values before non-null values.
  • asc_nulls_last():- Also sorts the given column in ascending order, but places non-null values before null values.
  • desc():- Sorts the passed column in descending order.
  • desc_nulls_first():- Sorts the passed column in descending order, with null values placed before non-null values.
  • desc_nulls_last():- Sorts the passed column in descending order, with non-null values placed before null values.

Why do we need sort functions in PySpark?

It all depends on your requirement. When you are working on a PySpark application and, after applying all your transformations, you generate a final DataFrame, you may want to sort one or more of its columns in a specific order, i.e. ascending or descending. In that case you can use the sort functions, which work on numeric columns as well as string columns.

Now, let’s see how to use the sorting functions on a PySpark DataFrame. We will apply all of them in turn, so first of all we need to create a DataFrame with some records.

Creating PySpark DataFrame

To create a DataFrame in PySpark, we need to import some resources from the PySpark library: the SparkSession class from the pyspark.sql module, which is used to create a Spark session, the entry point of our application.

Then we use the createDataFrame() method of the Spark session to create a new PySpark DataFrame. I am not going to say more about createDataFrame() here because I have already written a detailed article about how to create a PySpark DataFrame.

I have created a new PySpark DataFrame using the code below.

from pyspark.sql import SparkSession
from pyspark.sql.functions import asc, desc, asc_nulls_first, asc_nulls_last, desc_nulls_first, desc_nulls_last, col

data = [
    ('Rambo', 'Developer', 'IT', 33000),
    ('John', 'Developer', 'IT', 40000),
    ('Harshita', 'HR Executive', 'HR', 25000),
    ('Vanshika', 'Senior HR Manager', 'HR', 50000),
    (None, 'Senior Marketing Expert', 'IT', None),
    ('Harry', 'SEO Analyst', 'Marketing', 33000),
    ('Shital', 'HR Executive', 'HR', 25000),
    (None, 'HR Executive', 'HR', None),
]

columns = ['name', 'designation', 'department', 'salary']

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

df = spark.createDataFrame(data, columns)
df.show(truncate=False)

After creating the PySpark DataFrame, it looks like this.

+--------+-----------------------+----------+------+
|name    |designation            |department|salary|
+--------+-----------------------+----------+------+
|Rambo   |Developer              |IT        |33000 |
|John    |Developer              |IT        |40000 |
|Harshita|HR Executive           |HR        |25000 |
|Vanshika|Senior HR Manager      |HR        |50000 |
|null    |Senior Marketing Expert|IT        |null  |
|Harry   |SEO Analyst            |Marketing |33000 |
|Shital  |HR Executive           |HR        |25000 |
|null    |HR Executive           |HR        |null  |
+--------+-----------------------+----------+------+

Now that we have created the PySpark DataFrame successfully, it’s time to apply all the sorting functions mentioned above.

How to use the sorting function in PySpark DataFrame?

Since all the sort functions take a column name as a parameter and return a sort expression, we need to use the PySpark DataFrame orderBy() method to apply that sort expression to the DataFrame.

Let’s see all the functions one by one with the help of an example.

PySpark asc(col) sorting function

The asc() sort function accepts a col parameter that represents the column you want to sort on, and returns a sort expression based on the ascending order of that column. Here, I sort the name column.

df.orderBy(asc(col("name"))).show(truncate=False)

Output

+--------+-----------------------+----------+------+
|name    |designation            |department|salary|
+--------+-----------------------+----------+------+
|null    |Senior Marketing Expert|IT        |null  |
|null    |HR Executive           |HR        |null  |
|Harry   |SEO Analyst            |Marketing |33000 |
|Harshita|HR Executive           |HR        |25000 |
|John    |Developer              |IT        |40000 |
|Rambo   |Developer              |IT        |33000 |
|Shital  |HR Executive           |HR        |25000 |
|Vanshika|Senior HR Manager      |HR        |50000 |
+--------+-----------------------+----------+------+

PySpark asc_nulls_first(col) sorting function

This function also accepts the col parameter, which represents the column name of the PySpark DataFrame. It also sorts the column in ascending order, but it places null values before non-null values.

df.orderBy(asc_nulls_first(col("name"))).show(truncate=False)

Output

+--------+-----------------------+----------+------+
|name    |designation            |department|salary|
+--------+-----------------------+----------+------+
|null    |HR Executive           |HR        |null  |
|null    |Senior Marketing Expert|IT        |null  |
|Harry   |SEO Analyst            |Marketing |33000 |
|Harshita|HR Executive           |HR        |25000 |
|John    |Developer              |IT        |40000 |
|Rambo   |Developer              |IT        |33000 |
|Shital  |HR Executive           |HR        |25000 |
|Vanshika|Senior HR Manager      |HR        |50000 |
+--------+-----------------------+----------+------+


PySpark asc_nulls_last(col) sorting function

This function also accepts the col parameter, which represents the column name of the PySpark DataFrame. It also sorts the column in ascending order, but it places non-null values before null values.

df.orderBy(asc_nulls_last(col("name"))).show(truncate=False)

Output

+--------+-----------------------+----------+------+
|name    |designation            |department|salary|
+--------+-----------------------+----------+------+
|Harry   |SEO Analyst            |Marketing |33000 |
|Harshita|HR Executive           |HR        |25000 |
|John    |Developer              |IT        |40000 |
|Rambo   |Developer              |IT        |33000 |
|Shital  |HR Executive           |HR        |25000 |
|Vanshika|Senior HR Manager      |HR        |50000 |
|null    |Senior Marketing Expert|IT        |null  |
|null    |HR Executive           |HR        |null  |
+--------+-----------------------+----------+------+


PySpark desc(col) sorting function

The desc() function works the opposite of asc(): asc() sorts the column in ascending order whereas desc() sorts the passed column in descending order. It takes col, which refers to the column name of the DataFrame.

Here, I apply the desc() function to the DataFrame’s salary column.

df.orderBy(desc(col("salary"))).show(truncate=False)

Output

+--------+-----------------------+----------+------+
|name    |designation            |department|salary|
+--------+-----------------------+----------+------+
|Vanshika|Senior HR Manager      |HR        |50000 |
|John    |Developer              |IT        |40000 |
|Rambo   |Developer              |IT        |33000 |
|Harry   |SEO Analyst            |Marketing |33000 |
|Harshita|HR Executive           |HR        |25000 |
|Shital  |HR Executive           |HR        |25000 |
|null    |HR Executive           |HR        |null  |
|null    |Senior Marketing Expert|IT        |null  |
+--------+-----------------------+----------+------+

PySpark desc_nulls_first(col) sorting function

The desc_nulls_first() function accepts a column name and returns a descending sort expression for the given column, placing null values before non-null values.

df.orderBy(desc_nulls_first(col("salary"))).show(truncate=False)

Output

+--------+-----------------------+----------+------+
|name    |designation            |department|salary|
+--------+-----------------------+----------+------+
|null    |HR Executive           |HR        |null  |
|null    |Senior Marketing Expert|IT        |null  |
|Vanshika|Senior HR Manager      |HR        |50000 |
|John    |Developer              |IT        |40000 |
|Harry   |SEO Analyst            |Marketing |33000 |
|Rambo   |Developer              |IT        |33000 |
|Harshita|HR Executive           |HR        |25000 |
|Shital  |HR Executive           |HR        |25000 |
+--------+-----------------------+----------+------+

PySpark desc_nulls_last(col) sorting function

The desc_nulls_last() function accepts a column name and returns a descending sort expression for the given column, placing non-null values before null values.

df.orderBy(desc_nulls_last(col("salary"))).show(truncate=False)

Output

+--------+-----------------------+----------+------+
|name    |designation            |department|salary|
+--------+-----------------------+----------+------+
|Vanshika|Senior HR Manager      |HR        |50000 |
|John    |Developer              |IT        |40000 |
|Harry   |SEO Analyst            |Marketing |33000 |
|Rambo   |Developer              |IT        |33000 |
|Harshita|HR Executive           |HR        |25000 |
|Shital  |HR Executive           |HR        |25000 |
|null    |Senior Marketing Expert|IT        |null  |
|null    |HR Executive           |HR        |null  |
+--------+-----------------------+----------+------+

PySpark sort on multiple columns

So far, we have applied sort functions to only a single column, but we can also apply PySpark sorting functions to multiple columns. Let’s see how we can do that.

To sort multiple columns of a PySpark DataFrame, pass multiple sort expressions to the orderBy() method, like this.

df.orderBy(asc(col("name")), asc(col("salary"))).show(truncate=False)

Output

+--------+-----------------------+----------+------+
|name    |designation            |department|salary|
+--------+-----------------------+----------+------+
|null    |HR Executive           |HR        |null  |
|null    |Senior Marketing Expert|IT        |null  |
|Harry   |SEO Analyst            |Marketing |33000 |
|Harshita|HR Executive           |HR        |25000 |
|John    |Developer              |IT        |40000 |
|Rambo   |Developer              |IT        |33000 |
|Shital  |HR Executive           |HR        |25000 |
|Vanshika|Senior HR Manager      |HR        |50000 |
+--------+-----------------------+----------+------+

It is not mandatory to apply the same function to each column; you can apply any sorting function as per your requirement, as you can see below.

df.orderBy(asc_nulls_last(col("name")), desc(col("department"))).show(truncate=False)

Output

+--------+-----------------------+----------+------+
|name    |designation            |department|salary|
+--------+-----------------------+----------+------+
|Harry   |SEO Analyst            |Marketing |33000 |
|Harshita|HR Executive           |HR        |25000 |
|John    |Developer              |IT        |40000 |
|Rambo   |Developer              |IT        |33000 |
|Shital  |HR Executive           |HR        |25000 |
|Vanshika|Senior HR Manager      |HR        |50000 |
|null    |Senior Marketing Expert|IT        |null  |
|null    |HR Executive           |HR        |null  |
+--------+-----------------------+----------+------+

So, we have covered all the sort functions along with examples. You can find the complete code below.


Complete Source Code

from pyspark.sql import SparkSession
from pyspark.sql.functions import asc, desc, asc_nulls_first, asc_nulls_last, desc_nulls_first, desc_nulls_last, col

data = [
    ('Rambo', 'Developer', 'IT', 33000),
    ('John', 'Developer', 'IT', 40000),
    ('Harshita', 'HR Executive', 'HR', 25000),
    ('Vanshika', 'Senior HR Manager', 'HR', 50000),
    (None, 'Senior Marketing Expert', 'IT', None),
    ('Harry', 'SEO Analyst', 'Marketing', 33000),
    ('Shital', 'HR Executive', 'HR', 25000),
    (None, 'HR Executive', 'HR', None),
]

columns = ['name', 'designation', 'department', 'salary']

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

df = spark.createDataFrame(data, columns)

# asc
df.orderBy(asc(col("name"))).show(truncate=False)

# asc_nulls_first
df.orderBy(asc_nulls_first(col("name"))).show(truncate=False)

# asc_nulls_last
df.orderBy(asc_nulls_last(col("name"))).show(truncate=False)

# desc
df.orderBy(desc(col("salary"))).show(truncate=False)

# desc_nulls_first
df.orderBy(desc_nulls_first(col("salary"))).show(truncate=False)

# desc_nulls_last
df.orderBy(desc_nulls_last(col("salary"))).show(truncate=False)

# asc
df.orderBy(asc(col("name")), asc(col("salary"))).show(truncate=False)

# asc_nulls_last and desc
df.orderBy(asc_nulls_last(col("name")), desc(col("department"))).show(truncate=False)

Summary

I hope the process of applying sort functions to a PySpark DataFrame was easy and straightforward. You can now sort any column of a DataFrame using the PySpark sort functions according to your requirement, on a single column as well as on multiple columns together, as shown above.

If you found this article helpful, please share it and keep visiting for further PySpark tutorials.

If you have any queries regarding this article, please let us know via email.

Have a nice day!
