
PySpark col() Function with Examples


Hi there! In this article, you will learn how to use the PySpark col() function with examples. It is one of the most useful built-in functions in PySpark for selecting a particular column from a PySpark DataFrame. Whenever we want to perform an operation on a particular column of a PySpark DataFrame, the col() function comes in handy.


Throughout this article, we will explore the PySpark col() function with the help of various examples so that you can use it for a range of purposes.

What is the PySpark col() Function?

The col() function in PySpark is a built-in function defined inside the pyspark.sql.functions module. It is used to select a particular column from a PySpark DataFrame and return it so that you can apply transformations on top of that column.

Parameter

The col() function in PySpark accepts the name of a PySpark DataFrame column and returns a Column object so that transformation methods can be applied on top of it.

Return Type

The return type of the PySpark col() function is an instance of the PySpark Column class.
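
As a quick check, the minimal sketch below prints the type of the object returned by col(); no DataFrame is needed just to create a Column object:

from pyspark.sql.functions import col

c = col('salary')
print(type(c))
# <class 'pyspark.sql.column.Column'>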

Note:- Remember, throughout this article we will use the PySpark col() function along with PySpark DataFrame methods, and the col() function returns a PySpark DataFrame Column.

How to use the col() function in PySpark DataFrame?

Before using the col() function, we must have a PySpark DataFrame so that we can apply col() to select a particular column of the DataFrame and perform some operations on top of it.

Creating a PySpark DataFrame

As we know, the PySpark col() function takes a DataFrame column name as a parameter and returns an instance of the Column class, so we first need a PySpark DataFrame with some records. Throughout this article, we will use the col() function with a DataFrame only.

PySpark code to create DataFrame:

from pyspark.sql import SparkSession

data = [
    ('Sharu', 'Developer', 'IT', 33000),
    ('John', 'Developer', 'IT', 40000),
    ('Jaiyka', 'HR Executive', 'HR', 25000),
    ('Vanshika', 'Senior HR Manager', 'HR', 50000),
    ('Harsh', 'Senior Marketing Expert', 'Marketing', 45000),
    ('Harry', 'SEO Analyst', 'Marketing', 33000),
    ('Shital', 'HR Executive', 'HR', 25000),
    ('Veronika', 'Developer', 'IT', 43000),
]

columns = ['name', 'designation', 'department', 'salary']

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

df = spark.createDataFrame(data, columns)
df.show()

After executing the above code, the output will be:

+--------+-----------------------+----------+------+
|name    |designation            |department|salary|
+--------+-----------------------+----------+------+
|Sharu   |Developer              |IT        |33000 |
|John    |Developer              |IT        |40000 |
|Jaiyka  |HR Executive           |HR        |25000 |
|Vanshika|Senior HR Manager      |HR        |50000 |
|Harsh   |Senior Marketing Expert|Marketing |45000 |
|Harry   |SEO Analyst            |Marketing |33000 |
|Shital  |HR Executive           |HR        |25000 |
|Veronika|Developer              |IT        |43000 |
+--------+-----------------------+----------+------+

Now it’s time to use the col() function along with PySpark DataFrame methods that take a column as a parameter.

Note:- To use the col() function, you have to import it from the pyspark.sql.functions module.
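
For example:

from pyspark.sql.functions import col

The aggregate functions used later in this article, such as sum, also come from the pyspark.sql.functions module.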

Using col() function with groupBy() Method

The PySpark groupBy() method is used to perform aggregate functions on groups of rows. It takes the names of one or more columns to group by. Here, I am about to apply groupBy() on the department column along with the sum aggregate function on the salary column, in order to calculate the total salary of the employees within each department.

df.groupBy(col('department')).agg(sum("salary").alias("sum")).show(truncate=False)

Output

+----------+------+
|department|sum   |
+----------+------+
|IT        |116000|
|HR        |100000|
|Marketing |78000 |
+----------+------+

Besides sum, you can also apply other aggregate functions such as max, min, count, avg, mean, etc.
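
As a small sketch, here is how a few of these aggregates could be computed together (avg and max are imported from pyspark.sql.functions alongside sum; note that this shadows Python's built-in max in the session):

from pyspark.sql.functions import avg, max

df.groupBy(col('department')).agg(
    avg(col('salary')).alias('avg_salary'),
    max(col('salary')).alias('max_salary'),
).show(truncate=False)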

Using col() function with drop() Method

The PySpark drop() method is used to drop the columns passed to it. It always returns a new PySpark DataFrame with the specified columns removed. For instance, I am about to drop the department column from the PySpark DataFrame.

df.drop(col('department')).show(truncate=False)

Output

+--------+-----------------------+------+
|name    |designation            |salary|
+--------+-----------------------+------+
|Sharu   |Developer              |33000 |
|John    |Developer              |40000 |
|Jaiyka  |HR Executive           |25000 |
|Vanshika|Senior HR Manager      |50000 |
|Harsh   |Senior Marketing Expert|45000 |
|Harry   |SEO Analyst            |33000 |
|Shital  |HR Executive           |25000 |
|Veronika|Developer              |43000 |
+--------+-----------------------+------+

You can also pass more than one column name separated by commas, like drop(col('department'), col('name')).
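
For example, a minimal sketch of dropping two columns at once (passing multiple Column objects to drop() works on reasonably recent Spark versions; on older versions, pass plain column names instead, e.g. df.drop('department', 'name')):

df.drop(col('department'), col('name')).show(truncate=False)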

Using col() function with orderBy() Method

The orderBy() method is used to sort the DataFrame by the specified columns and always returns a new PySpark DataFrame. Here, I am about to sort the name column in ascending order. By default, orderBy() sorts a column in ascending order; if you want to sort the specified column in descending order, pass ascending=False to the orderBy() method.

df.orderBy(col('name')).show(truncate=False)

Output

+--------+-----------------------+----------+------+
|name    |designation            |department|salary|
+--------+-----------------------+----------+------+
|Harry   |SEO Analyst            |Marketing |33000 |
|Harsh   |Senior Marketing Expert|Marketing |45000 |
|Jaiyka  |HR Executive           |HR        |25000 |
|John    |Developer              |IT        |40000 |
|Sharu   |Developer              |IT        |33000 |
|Shital  |HR Executive           |HR        |25000 |
|Vanshika|Senior HR Manager      |HR        |50000 |
|Veronika|Developer              |IT        |43000 |
+--------+-----------------------+----------+------+
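
If you instead want the name column in descending order, a minimal sketch looks like this:

df.orderBy(col('name'), ascending=False).show(truncate=False)
# equivalently: df.orderBy(col('name').desc()).show(truncate=False)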

Using col() function with select() Method

The select() method is used to select specific columns from the DataFrame and returns a new DataFrame. You can pass one or more columns to the select() method in order to fetch them.

Here, I am about to select only the name and salary columns of the employees.

df.select(col('name'), col('salary')).show(truncate=False)

Output

+--------+------+
|name    |salary|
+--------+------+
|Sharu   |33000 |
|John    |40000 |
|Jaiyka  |25000 |
|Vanshika|50000 |
|Harsh   |45000 |
|Harry   |33000 |
|Shital  |25000 |
|Veronika|43000 |
+--------+------+
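
Because col() returns a Column object, you can also build expressions with it inside select(). The sketch below adds a derived column computed from salary (the raised_salary alias is just an illustrative name):

df.select(col('name'), (col('salary') * 1.1).alias('raised_salary')).show(truncate=False)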

Using col() function with sort() Method

The PySpark sort() method is synonymous with the orderBy() method. It is used to sort a specific column in ascending or descending order.
To sort in descending order, pass ascending=False to the sort() method; by default, it is True.

df.sort(col('salary')).show(truncate=False)

Output

+--------+-----------------------+----------+------+
|name    |designation            |department|salary|
+--------+-----------------------+----------+------+
|Shital  |HR Executive           |HR        |25000 |
|Jaiyka  |HR Executive           |HR        |25000 |
|Harry   |SEO Analyst            |Marketing |33000 |
|Sharu   |Developer              |IT        |33000 |
|John    |Developer              |IT        |40000 |
|Veronika|Developer              |IT        |43000 |
|Harsh   |Senior Marketing Expert|Marketing |45000 |
|Vanshika|Senior HR Manager      |HR        |50000 |
+--------+-----------------------+----------+------+
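
You can also sort by several columns at once. As a small sketch, the following sorts by department in ascending order and, within each department, by salary in descending order:

df.sort(col('department').asc(), col('salary').desc()).show(truncate=False)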

Using col() function with filter() Method

The PySpark DataFrame filter() method is used to filter the DataFrame based on the condition passed to it. For instance, I am about to select only those employees whose salary is greater than 30000.

df.filter(col('salary') > 30000).show(truncate=False)

Output

+--------+-----------------------+----------+------+
|name    |designation            |department|salary|
+--------+-----------------------+----------+------+
|Sharu   |Developer              |IT        |33000 |
|John    |Developer              |IT        |40000 |
|Vanshika|Senior HR Manager      |HR        |50000 |
|Harsh   |Senior Marketing Expert|Marketing |45000 |
|Harry   |SEO Analyst            |Marketing |33000 |
|Veronika|Developer              |IT        |43000 |
+--------+-----------------------+----------+------+
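
Conditions built with col() can be combined using the & (and) and | (or) operators, with each individual condition wrapped in parentheses. A minimal sketch:

df.filter((col('salary') > 30000) & (col('department') == 'IT')).show(truncate=False)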

Using col() function with where() Method

The PySpark DataFrame where() method is used to filter the DataFrame based on the condition passed to it. For instance, I am about to select only those employees whose names start with 'H'. The where() method and the filter() method are the same; you can use either.

df.where(col('name').startswith("H")).show(truncate=False)

Output

+-----+-----------------------+----------+------+
|name |designation            |department|salary|
+-----+-----------------------+----------+------+
|Harsh|Senior Marketing Expert|Marketing |45000 |
|Harry|SEO Analyst            |Marketing |33000 |
+-----+-----------------------+----------+------+
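
The where() method accepts any Column-based condition, not just startswith(). For example, a small sketch using isin() to keep only the HR and IT employees:

df.where(col('department').isin('HR', 'IT')).show(truncate=False)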

So this is how you can use the PySpark col() function to perform operations on top of a particular DataFrame column.

Complete Source Code

You can find all the code that we have seen throughout this tutorial here.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

data = [
    ('Sharu', 'Developer', 'IT', 33000),
    ('John', 'Developer', 'IT', 40000),
    ('Jaiyka', 'HR Executive', 'HR', 25000),
    ('Vanshika', 'Senior HR Manager', 'HR', 50000),
    ('Harsh', 'Senior Marketing Expert', 'Marketing', 45000),
    ('Harry', 'SEO Analyst', 'Marketing', 33000),
    ('Shital', 'HR Executive', 'HR', 25000),
    ('Veronika', 'Developer', 'IT', 43000),
]

columns = ['name', 'designation', 'department', 'salary']

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
df = spark.createDataFrame(data, columns)

# groupBy()
df.groupBy(col('department')).agg(sum("salary")).show(truncate=False)

# drop()
df.drop(col('department')).show(truncate=False)


# orderBy
df.orderBy(col('name')).show(truncate=False)

# select
df.select(col('name'), col('salary')).show(truncate=False)

# sort
df.sort(col('salary')).show(truncate=False)

# using col function with filter method
df.filter(col('salary') > 30000).show(truncate=False)

# using col function with where method
df.where(col('name').startswith("H")).show(truncate=False)

👉How to load CSV Files into PySpark DataFrame

Conclusion

So, in this article, we have covered the PySpark col() function with examples. It is one of the most important functions in PySpark because, as a PySpark developer, you sometimes have to perform operations on top of a particular column, for example applying a filter on a column, sorting a column, or grouping by a column. In those cases, col() is used.

If you found this article helpful, please share it and keep visiting for more interesting PySpark tutorials.

Thanks for your valuable time…

Have a nice day.
