
PySpark Column Class with Examples

Hi PySpark lovers! In this tutorial, you will learn about the PySpark Column class and PySpark column functions with examples, so that you can easily work with any column in a PySpark DataFrame.

The PySpark Column class provides several functions to work with and manipulate the values of a PySpark DataFrame column.

As a Data Engineer or PySpark developer, most of the time we have to work on a PySpark DataFrame column and perform some operation on it, such as retrieving a column value or retrieving a part of the column value. That is where the PySpark Column functions come in.

Before going deeper into this article, let’s understand some fundamental concepts about the PySpark Column class.

What is PySpark Column Class?

The PySpark Column class represents a column in a PySpark DataFrame. It is defined inside the pyspark.sql.column module.
The Column class has multiple methods that are used to filter rows, retrieve part of a value, and manipulate the value of a column in a PySpark DataFrame.

Key points about the PySpark Column class:

  • It always represents a column of a PySpark DataFrame.
  • It provides several methods for working with PySpark DataFrame columns and manipulating rows and columns.
  • Some of the Column class methods evaluate to boolean values that can be used to filter the records of a PySpark DataFrame.
  • PySpark also provides additional functions that take a Column class instance and return a Column.

Create a PySpark Column class object

To create a PySpark Column class object, first we have to create a PySpark DataFrame with some records. We can create Column class instances in various ways; we will see each of them one by one.

Let’s create a PySpark DataFrame.

from pyspark.sql import SparkSession
from pyspark.sql.column import Column

data = [
    ('Sharu', 'Developer', 'IT', 33000),
    ('John', 'Developer', 'IT', 40000),
    ('Jaiyka', 'HR Executive', 'HR', 25000),
    ('Vanshika', 'Senior HR Manager', 'HR', 50000),
    ('Harsh', 'Senior Marketing Expert', 'Marketing', 45000),
    ('Harry', 'SEO Analyst', 'Marketing', 33000),
    ('Shital', 'HR Executive', 'HR', 25000),
    ('Veronika', 'Developer', 'IT', 43000),
]

columns = ['name', 'designation', 'department', 'salary']

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
df = spark.createDataFrame(data, columns)

# displaying dataframe
df.show(truncate=False)

After executing the above code, the DataFrame will be:

+--------+-----------------------+----------+------+
|name    |designation            |department|salary|
+--------+-----------------------+----------+------+
|Sharu   |Developer              |IT        |33000 |
|John    |Developer              |IT        |40000 |
|Jaiyka  |HR Executive           |HR        |25000 |
|Vanshika|Senior HR Manager      |HR        |50000 |
|Harsh   |Senior Marketing Expert|Marketing |45000 |
|Harry   |SEO Analyst            |Marketing |33000 |
|Shital  |HR Executive           |HR        |25000 |
|Veronika|Developer              |IT        |43000 |
+--------+-----------------------+----------+------+

Now, let’s see all the possible ways to create an instance or object of the Column class.

Creating Column object using DataFrame object:

The DataFrame object can be used to create an instance of the Column class. For instance, in the above code, df is a PySpark DataFrame instance, and it can be used to create a Column object.

Here, I am about to create a Column class instance for the PySpark DataFrame name column.

df["name"]

Creating Column instance using col() function:

The col() function is a built-in function defined inside the pyspark.sql.functions module. It takes a DataFrame column name and returns an instance of the pyspark.sql.column.Column class.
To use the col() function, we have to import it from the pyspark.sql.functions module.

from pyspark.sql.functions import col
col_obj = col("department")
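The returned Column object can be used wherever a column expression is expected, for example in select(). A small usage sketch against the df created above:

from pyspark.sql.functions import col

df.select(col("department")).distinct().show()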

Creating Column object using DataFrame object with a dot (.):

We can also use dot (.) notation on the DataFrame object in order to create an instance of the PySpark Column class.

col_obj = df.department

So this is how you can create an instance of the Column class. Now let’s look at the Column class methods.

PySpark Column Functions

Let’s see some popular Column class methods. The table below lists the PySpark Column class methods along with a short description of each.

  • alias(*alias, **kwargs): Provides an alias (a new name or expression) for the column.
  • name(*alias, **kwargs): Works the same as alias().
  • asc(): Sorts the column in ascending order.
  • asc_nulls_first(): Sorts the column in ascending order, with null values coming first.
  • asc_nulls_last(): Sorts the column in ascending order, with null values coming last.
  • astype(dataType): Changes the column to another data type.
  • cast(dataType): Same as astype().
  • between(lowerBound, upperBound): Returns True if the current column value lies between lowerBound and upperBound.
  • bitwiseAND(other): Computes bitwise AND of this expression with another expression.
  • bitwiseOR(other): Computes bitwise OR of this expression with another expression.
  • bitwiseXOR(other): Computes bitwise XOR of this expression with another expression.
  • desc(): Sorts the column in descending order.
  • desc_nulls_first(): Sorts the column in descending order, with null values coming first.
  • desc_nulls_last(): Sorts the column in descending order, with null values coming last.
  • startswith(other): Returns True if the string starts with other.
  • endswith(other): Returns True if the string ends with other.
  • contains(other): Checks whether the column value contains another string.
  • eqNullSafe(other): Equality test that is safe for null values.
  • like(other): Equivalent to the SQL LIKE operation.
  • ilike(other): Equivalent to the SQL ILIKE operation (case-insensitive LIKE).
  • rlike(other): SQL RLIKE expression (LIKE with a regular expression).
  • isNotNull(): Returns True if the current expression is not null.
  • isNull(): Returns True if the current expression is null.
  • isin(*cols): Checks whether the value is present in the given list.
  • over(window): Applies a window function, like a SQL window function.
  • substr(startPos, length): Returns a column that is a substring of the column.
  • when(condition, value): Similar to the SQL CASE statement; the value is returned if the condition evaluates to True.
  • otherwise(value): The passed value is returned if no when(condition, value) condition evaluates to True.
  • dropFields(*fieldNames): Drops fields in a StructType by name.
  • withField(fieldName, col): Adds or replaces a field in a StructType by name.

PySpark Column Functions Examples

In this section, I use the DataFrame created above to apply the PySpark Column class methods to particular columns.

PySpark column alias() function:

The alias() method is used to provide a new name for a column in a PySpark DataFrame. As you can see, the above DataFrame has multiple columns; here I rename the name column to employee_name and the designation column to employee_designation using the alias() method.

df.select(
    df["name"].alias("employee_name"), df["designation"].alias("employee_designation")
).show(truncate=False)

PySpark column name() function:

The name() method works the same as the PySpark Column alias() method. In the above code, I have just replaced alias() with name().

df.select(
    df["name"].name("employee_name"), df["designation"].name("employee_designation")
).show(truncate=False)

The output of both alias() and name() is the same, as you can see below.

+-------------+-----------------------+
|employee_name|employee_designation   |
+-------------+-----------------------+
|Sharu        |Developer              |
|John         |Developer              |
|Jaiyka       |HR Executive           |
|Vanshika     |Senior HR Manager      |
|Harsh        |Senior Marketing Expert|
|Harry        |SEO Analyst            |
|Shital       |HR Executive           |
|Veronika     |Developer              |
+-------------+-----------------------+

PySpark Column asc() method:

The asc() method is used to sort the PySpark DataFrame column in ascending order. It will sort both string and number columns.

df.select(df["salary"]).orderBy(df["salary"].asc()).show()

PySpark Column asc_nulls_first() method:

The asc_nulls_first() method sorts the given column in ascending order, but null values come first, before the non-null values.

df.select(df["salary"]).orderBy(df["salary"].asc_nulls_first()).show()

PySpark Column asc_nulls_last() method:

The asc_nulls_last() method sorts the given column in ascending order, but null values come last, after the non-null values.

df.select(df["salary"]).orderBy(df["salary"].asc_nulls_last()).show()

PySpark Column astype() method:

The astype() method is used to change the data type of the provided column. Here, I change the data type of the salary column from long to string.

Before changing the data type:

df.select(df["salary"]).printSchema()

Output

root
 |-- salary: long (nullable = true)

After changing the data type:

df.select(df["salary"].astype("string")).printSchema()

Output

root
 |-- salary: string (nullable = true)

PySpark Column cast() method:

It works the same as the astype() method.

Before changing the data type:

df.select(df["salary"]).printSchema()

Output

root
 |-- salary: long (nullable = true)

After changing the data type:

df.select(df["salary"].cast("string")).printSchema()

Output

root
 |-- salary: string (nullable = true)
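
Both astype() and cast() also accept a DataType object from the pyspark.sql.types module instead of a type name string. A small sketch:

from pyspark.sql.types import IntegerType

# cast the salary column using a DataType object instead of a string
df.select(df["salary"].cast(IntegerType())).printSchema()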

PySpark Column between() method:

The between() method works like the SQL BETWEEN operator. It returns True if the current column value lies between the lower and upper values. It takes two parameters, lowerBound and upperBound, and is often used to filter records.

For example, let’s check for each employee whether their salary is between 25000 and 40000.

df.select(df.name, df["salary"].between(25000, 40000)).show(truncate=False)

Output

+--------+-----------------------------------------+
|name    |((salary >= 25000) AND (salary <= 40000))|
+--------+-----------------------------------------+
|Sharu   |true                                     |
|John    |true                                     |
|Jaiyka  |true                                     |
|Vanshika|false                                    |
|Harsh   |false                                    |
|Harry   |true                                     |
|Shital  |true                                     |
|Veronika|false                                    |
+--------+-----------------------------------------+
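
The select() above only projects the boolean result. To actually keep the matching rows, the same expression can be passed to filter():

# keeps only the employees earning between 25000 and 40000 (inclusive)
df.filter(df["salary"].between(25000, 40000)).show(truncate=False)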

PySpark Column desc() method:

The desc() method is used to sort the column in descending order, meaning from largest to smallest. It sorts both string and number columns.

df.select(df.name).orderBy(df['name'].desc()).show(truncate=False)

PySpark Column desc_nulls_first() method:

The desc_nulls_first() method sorts the column in descending order but null values will come first.

df.select(df.name).orderBy(df['name'].desc_nulls_first()).show(truncate=False)

PySpark Column desc_nulls_last() method:

The desc_nulls_last() method sorts the column in descending order, but null values come last.

df.select(df.name).orderBy(df['name'].desc_nulls_last()).show(truncate=False)

PySpark Column startswith() method:

The startswith() method is used to check whether a column value starts with a given string or not. For example, I am about to select all the employees whose name starts with ‘H‘ only.

df.filter(df.name.startswith("H")).show(truncate=False)

Output

+-----+-----------------------+----------+------+
|name |designation            |department|salary|
+-----+-----------------------+----------+------+
|Harsh|Senior Marketing Expert|Marketing |45000 |
|Harry|SEO Analyst            |Marketing |33000 |
+-----+-----------------------+----------+------+

PySpark Column endswith() method:

The endswith() method is used to check whether a column value ends with a given string or not. For example, I am about to select all the employees whose name ends with ‘a‘ only.

df.filter(df.name.endswith("a")).show(truncate=False)

Output

+--------+-----------------+----------+------+
|name    |designation      |department|salary|
+--------+-----------------+----------+------+
|Jaiyka  |HR Executive     |HR        |25000 |
|Vanshika|Senior HR Manager|HR        |50000 |
|Veronika|Developer        |IT        |43000 |
+--------+-----------------+----------+------+

PySpark Column contains() method:

The contains() method is used to check whether the current column value includes a given string or not. I am going to select all the employees whose names contain the character ‘r‘.

df.filter(df.name.contains("r")).show(truncate=False)

Output

+--------+-----------------------+----------+------+
|name    |designation            |department|salary|
+--------+-----------------------+----------+------+
|Sharu   |Developer              |IT        |33000 |
|Harsh   |Senior Marketing Expert|Marketing |45000 |
|Harry   |SEO Analyst            |Marketing |33000 |
|Veronika|Developer              |IT        |43000 |
+--------+-----------------------+----------+------+

PySpark Column like() method:

The like() method is used to search for a specific pattern in a column. It is equivalent to the SQL LIKE operator. For instance, I search for ‘lop‘ in the designation column. Here, the percent sign ( % ) represents zero or more characters. Remember that it is case sensitive, meaning ‘lop‘ and ‘LOP‘ are treated as different patterns.

df.filter(df.designation.like("%lop%")).show(truncate=False)

Output

+--------+-----------+----------+------+
|name    |designation|department|salary|
+--------+-----------+----------+------+
|Sharu   |Developer  |IT        |33000 |
|John    |Developer  |IT        |40000 |
|Veronika|Developer  |IT        |43000 |
+--------+-----------+----------+------+

PySpark Column ilike() method:

The ilike() method is used to search for a specific pattern in a column. It is equivalent to the SQL ILIKE operator. For instance, I search for ‘lop‘ in the designation column. Here, the percent sign ( % ) represents zero or more characters. Remember that it is case insensitive, meaning ‘lop‘ and ‘LOP‘ are treated as the same pattern.

df.filter(df.designation.ilike("%LOP%")).show(truncate=False)

Output

+--------+-----------+----------+------+
|name    |designation|department|salary|
+--------+-----------+----------+------+
|Sharu   |Developer  |IT        |33000 |
|John    |Developer  |IT        |40000 |
|Veronika|Developer  |IT        |43000 |
+--------+-----------+----------+------+

PySpark Column rlike() method

The rlike() method is used to search for a specific pattern in a column using a regular expression. It is the same as the SQL RLIKE operator (LIKE with a regular expression). For instance, I select all the employees whose designation ends with ‘er‘.

df.filter(df.designation.rlike("er$")).show(truncate=False)

Output

+--------+-----------+----------+------+
|name    |designation|department|salary|
+--------+-----------+----------+------+
|Sharu   |Developer  |IT        |33000 |
|John    |Developer  |IT        |40000 |
|Veronika|Developer  |IT        |43000 |
+--------+-----------+----------+------+

PySpark Column isNotNull() method:

The isNotNull() method returns True if the column value is not null.

df.filter(df.name.isNotNull()).show(truncate=False)

PySpark Column isNull() method:

The isNull() method returns True if the current column value is null and is commonly used to filter such rows. Here, I select only those employee records whose name is null.

df.filter(df.name.isNull()).show(truncate=False)
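
Because none of the names in the sample DataFrame are null, the filter above returns no rows. Here is a minimal sketch with hypothetical data containing a null value that shows both methods in action:

# hypothetical data with a missing designation
null_df = spark.createDataFrame(
    [("Sharu", "Developer"), ("John", None)],
    ["name", "designation"],
)

null_df.filter(null_df.designation.isNull()).show()     # keeps only John
null_df.filter(null_df.designation.isNotNull()).show()  # keeps only Sharu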

PySpark Column isin() method:

This method is used to check whether the current column value is present in the given list or not. If present, it returns True; otherwise, it returns False.

df.filter(df.department.isin(["HR", "IT", "Account", "Finance"])).show(truncate=False)

Output

+--------+-----------------+----------+------+
|name    |designation      |department|salary|
+--------+-----------------+----------+------+
|Sharu   |Developer        |IT        |33000 |
|John    |Developer        |IT        |40000 |
|Jaiyka  |HR Executive     |HR        |25000 |
|Vanshika|Senior HR Manager|HR        |50000 |
|Shital  |HR Executive     |HR        |25000 |
|Veronika|Developer        |IT        |43000 |
+--------+-----------------+----------+------+
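
The check can also be inverted with the ~ (NOT) operator to keep only the rows whose department is not in the list:

# employees outside the listed departments (only the Marketing team here)
df.filter(~df.department.isin(["HR", "IT", "Account", "Finance"])).show(truncate=False)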

PySpark Column over() method:

The over() method is used to apply the window function on top of the column. Window functions are used to perform some calculations on a set of rows or a particular window.

Let’s take an example using the above-created PySpark DataFrame. We have three departments, ‘IT‘, ‘Marketing‘, and ‘HR‘, and suppose our requirement is to rank employees by salary within each department.

To perform this kind of calculation, we will use a ranking window function called dense_rank().

Let’s see.

from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank


partition_by = Window.partitionBy("department").orderBy("salary")
df.withColumn("ranking", dense_rank().over(partition_by)).show(truncate=False)

Output

+--------+-----------------------+----------+------+-------+
|name    |designation            |department|salary|ranking|
+--------+-----------------------+----------+------+-------+
|Jaiyka  |HR Executive           |HR        |25000 |1      |
|Shital  |HR Executive           |HR        |25000 |1      |
|Vanshika|Senior HR Manager      |HR        |50000 |2      |
|Sharu   |Developer              |IT        |33000 |1      |
|John    |Developer              |IT        |40000 |2      |
|Veronika|Developer              |IT        |43000 |3      |
|Harry   |SEO Analyst            |Marketing |33000 |1      |
|Harsh   |Senior Marketing Expert|Marketing |45000 |2      |
+--------+-----------------------+----------+------+-------+

As you can see in the above DataFrame, a rank has been assigned to each employee within their department. Employees with equal salaries in the same department receive the same rank.
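
A common follow-up is to keep only the top earner of each department. A sketch that reuses the same idea, ordering the window by salary in descending order and filtering on the rank:

from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank, col

desc_window = Window.partitionBy("department").orderBy(col("salary").desc())

df.withColumn("ranking", dense_rank().over(desc_window)) \
    .filter(col("ranking") == 1) \
    .show(truncate=False)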

PySpark Column substr() method:

The substr() method returns a column that is a substring of the column. Here I return a new column called substr that contains the first four characters of the name column.

df.select(df["name"].substr(1, 4).alias("substr")).show(truncate=False)

Output

+------+
|substr|
+------+
|Shar  |
|John  |
|Jaiy  |
|Vans  |
|Hars  |
|Harr  |
|Shit  |
|Vero  |
+------+

PySpark Column when() and otherwise() method:

The when() and otherwise() methods work like the SQL CASE statement. when() evaluates one or more conditions and returns the corresponding value for the first condition that matches. If no condition matches, the value passed to the otherwise() method is returned.

For example, I want to add 1000 to the salary of each employee whose salary is more than 40000, and 1500 otherwise.

from pyspark.sql.functions import when

df.select(
    "*",
    when(df.salary > 40000, df.salary + 1000).otherwise(df.salary + 1500).alias("new_salary"),
).show(truncate=False)

Output

+--------+-----------------------+----------+------+----------+
|name    |designation            |department|salary|new_salary|
+--------+-----------------------+----------+------+----------+
|Sharu   |Developer              |IT        |33000 |34500     |
|John    |Developer              |IT        |40000 |41500     |
|Jaiyka  |HR Executive           |HR        |25000 |26500     |
|Vanshika|Senior HR Manager      |HR        |50000 |51000     |
|Harsh   |Senior Marketing Expert|Marketing |45000 |46000     |
|Harry   |SEO Analyst            |Marketing |33000 |34500     |
|Shital  |HR Executive           |HR        |25000 |26500     |
|Veronika|Developer              |IT        |43000 |44000     |
+--------+-----------------------+----------+------+----------+
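
The when() calls can also be chained to handle more than two cases, similar to a CASE statement with several WHEN branches. A small sketch with hypothetical salary bands:

from pyspark.sql.functions import when

df.select(
    "*",
    when(df.salary >= 45000, "senior")
    .when(df.salary >= 30000, "mid")
    .otherwise("junior")
    .alias("band"),
).show(truncate=False)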

PySpark Column dropFields() method:

This method is used to drop fields in StructType by name.

from pyspark.sql import Row

df = spark.createDataFrame([
    Row(a=Row(b=1, c=2, d=3, e=Row(f=4, g=5, h=6)))])
# before applying dropFields
df.show()

# after applying dropFields()
df.withColumn("a", df["a"].dropFields('e')).show()

Output

+--------------------+
|                   a|
+--------------------+
|{1, 2, 3, {4, 5, 6}}|
+--------------------+

+---------+
|        a|
+---------+
|{1, 2, 3}|
+---------+
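
dropFields() also accepts several field names in a single call, and nested fields can be addressed with dot notation, as shown in this small sketch against the same struct:

# drop two top-level fields in one call
df.withColumn("a", df["a"].dropFields("c", "d")).show()

# drop the nested field g inside the struct e
df.withColumn("a", df["a"].dropFields("e.g")).show()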

PySpark Column withField() method:

The withField() method is used to add or replace a field in a StructType by name.

from pyspark.sql.functions import lit

df = spark.createDataFrame([
    Row(a=Row(b=1, c=2, d=3, e=Row(f=4, g=5, h=6)))])
# before applying withField
df.show(truncate=False)

# after applying withField()
df.withColumn("a", df["a"].withField("b", lit(20))).show(truncate=False)

Output

+--------------------+
|a                   |
+--------------------+
|{1, 2, 3, {4, 5, 6}}|
+--------------------+

+---------------------+
|a                    |
+---------------------+
|{20, 2, 3, {4, 5, 6}}|
+---------------------+
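
Since withField() adds the field when it does not already exist, the same call can be used to append a new field to the struct. A small sketch:

from pyspark.sql.functions import lit

# the field i does not exist in the struct yet, so it gets added
df.withColumn("a", df["a"].withField("i", lit(99))).show(truncate=False)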

Complete Source Code:

Below is the complete source code that we have explored above.

from pyspark.sql import SparkSession
from pyspark.sql.functions import dense_rank, when
from pyspark.sql.window import Window

data = [
    ("Sharu", "Developer", "IT", 33000),
    ("John", "Developer", "IT", 40000),
    ("Jaiyka", "HR Executive", "HR", 25000),
    ("Vanshika", "Senior HR Manager", "HR", 50000),
    ("Harsh", "Senior Marketing Expert", "Marketing", 45000),
    ("Harry", "SEO Analyst", "Marketing", 33000),
    ("Shital", "HR Executive", "HR", 25000),
    ("Veronika", "Developer", "IT", 43000),
]

columns = ["name", "designation", "department", "salary"]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
df = spark.createDataFrame(data, columns)

# using alias

df.select(
    df["name"].alias("employee_name"), df["designation"].alias("employee_designation")
).show(truncate=False)

# using name
df.select(
    df["name"].name("employee_name"), df["designation"].name("employee_designation")
).show(truncate=False)


# using asc()
df.select(df["salary"]).orderBy(df["salary"].asc_()).show()

# using asc_nulls_first()
df.select(df["salary"]).orderBy(df["salary"].asc_nulls_first()).show()


# using asc_nulls_last()
df.select(df["salary"]).orderBy(df["salary"].asc_nulls_last()).show()

# using cast
df.select(df["salary"].cast("string")).printSchema()

# using astype
df.select(df["salary"].astype("string")).printSchema()

# using between
df.select(df.name, df["salary"].between(25000, 40000)).show(truncate=False)

# using desc
df.select(df.name).orderBy(df['name'].desc()).show(truncate=False)

# using desc_nulls_first
df.select(df.name).orderBy(df['name'].desc_nulls_first()).show(truncate=False)

# using desc_nulls_last
df.select(df.name).orderBy(df['name'].desc_nulls_last()).show(truncate=False)


# using startswith
df.filter(df.name.startswith("H")).show(truncate=False)

# using endswith
df.filter(df.name.endswith("a")).show(truncate=False)

# using contains
df.filter(df.name.contains("r")).show(truncate=False)


# using like
df.filter(df.designation.like("%lop%")).show(truncate=False)

# using ilike
df.filter(df.designation.ilike("%LOP%")).show(truncate=False)

# using rlike
df.filter(df.name.rlike("a$")).show(truncate=False)


# using isNotNull()
df.filter(df.name.isNotNull()).show(truncate=False)

# using isNull()
df.filter(df.name.isNull()).show(truncate=False)

# using isin

df.filter(df.department.isin(["HR", "IT", "Account", "Finance"])).show(truncate=False)


# --------- apply rank window function

partition_by = Window.partitionBy("department").orderBy("salary")
df.withColumn("ranking", dense_rank().over(partition_by)).show(truncate=False)

# ------- close rank window function


# using substr
df.select(df["name"].substr(1, 4).alias("substr")).show(truncate=False)


# using when() and otherwise() method
df.select(
    "*",
    when(df.salary > 40000, df.salary + 1000).otherwise(df.salary + 1500).alias("new_salary"),
).show(truncate=False)


Summary

So this is all about the PySpark Column class and its methods. I hope the presentation of the Column class methods was easy and straightforward. In real-life PySpark projects, the Column class and its methods are used all the time, and you can use any of them as per your requirement.

If you like this article and its way of presentation, please share it and keep visiting for further PySpark tutorials.

Thanks for your valuable time…

have a nice day….
