Menu Close

How to Fill Null Values in PySpark DataFrame

Fill Null values in PySpark DataFrame

Hi PySpark Developers, I how you are doing well. In today’s articles, we are going to learn all about how to fill null values in PySpark DataFrame with zero ( 0 ), constant value, string, empty string, etc.

PySpark provides two methods called fillna() and fill() that are always used to fill missing values in PySpark DataFrame in order to perform any kind of transformation and actions.

Handling missing values in PySpark DataFrame is one of the most common tasks by PySpark Developers, Data Engineers, Data Analysts, etc. Before processing the PySpark DataFrame we have to handle all those missing values or null values or none values in Pyspark DataFrame.

Why do we need to fill null values in PySpark DataFrame?

Being a PySpark Developer, Data Engineer, Data Analyst, etc when you load Dataset from different-difference source systems in order to extract some insights from that data for organizational decisions, Then it might be a high probability that the dataset contained some missing values, So it’s our responsibility to handle all those missing values and then start processing towards the new data. Here missing values means null values or None values. Before processing the Data, we have to fill all the missing values with some other values like zero, empty string, any constant value, etc.

Before using the PySpark fillna() and fill() method, let’s create a sample PySpark DataFrame along with some null values.

Create PySpark DataFrame From CSV

Here, I have prepared sample CSV files along with some null values. Now I am about to load that CSV file using PySpark.

sample_data.csv

First Name,Last Name,Gender,Country,Age,Date,Id
Dulce,Abril,Female,,32,2017-10-15,1562
Mara,,Female,Great Britain,,2016-08-16,1582
Philip,Gent,Male,,36,2015-05-21,2587
Kathleen,Hanner,Female,,25,,1876
Vishvajit,Rao,,India,,2023-04-24,1230
Harry,,Male,India,27,2024-05-23,5664
Pankaj,,Male,,30,2019-09-23,434
Harshita,Kumari,,USA,24,2018-06-20,352

PySpark Code to load the above sample_data.csv file:

from pyspark.sql import SparkSession

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating PySpark DataFrame
dataframe = spark.read.option('header', 'true').csv('sample_data.csv')

# displaying
dataframe.show(truncate=False)

After executing the above code, The DataFrame will be like this.

+----------+---------+------+-------------+----+----------+----+
|First Name|Last Name|Gender|Country      |Age |Date      |Id  |
+----------+---------+------+-------------+----+----------+----+
|Dulce     |Abril    |Female|null         |32  |2017-10-15|1562|
|Mara      |null     |Female|Great Britain|null|2016-08-16|1582|
|Philip    |Gent     |Male  |null         |36  |2015-05-21|2587|
|Kathleen  |Hanner   |Female|null         |25  |null      |1876|
|Vishvajit |Rao      |null  |India        |null|2023-04-24|1230|
|Harry     |null     |Male  |India        |27  |2024-05-23|5664|
|Pankaj    |null     |Male  |null         |30  |2019-09-23|434 |
|Harshita  |Kumari   |null  |USA          |24  |2018-06-20|352 |
+----------+---------+------+-------------+----+----------+----+

As you can see in the above table, some null values exist. In fact, I have highlighted all those null values in red color in the below screenshot.

Let’s explain the above PySpark Code:

  • Firstly, I have imported the SparkSession class from the pyspark.sql module.
  • Second, I have created a spark session with the help of SparkSession.builder.appName(“testing”).getOrCreate().
  • Third, I have used the read attribute and csv() method to load the sample_data.csv file.
  • Fourth, I have displayed the PySpark DataFrame.

Now, It’s time to fill or replace all null values.

What is the PySpark DataFrame fillna() method

PySpark fillna() is a PySpark DataFrame method that was introduced in spark version 1.3.1. PySpark DataFrame fillna() method is used to replace the null values with other specified values. It accepts two parameter values and subsets.

  • value:- It is a value that will come in place of null values. The Data Type of value can be string, dict, bool, int, and float.
  • subset:- It is an optional parameter that represents the list of column names to consider. if it is not provided, then all the null values will be replaced in the PySpark DataFrame.

Let’s use fillna() method to fill null values in PySpark DataFrame, we will take the above created DataFrame to replace null values.

Example: Replace Null values with an empty string

For example, I want to replace all the null values with an empty string.

from pyspark.sql import SparkSession

# creating spark session   
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating PySpark DataFrame
dataframe = spark.read.option('header', 'true').csv('sample_data.csv')

# replace all the null values with empty string
dataframe = dataframe.fillna('')

# displaying
dataframe.show(truncate=False)

After replacing null values with empty strings the output will be.

+----------+---------+------+-------------+---+----------+----+
|First Name|Last Name|Gender|Country      |Age|Date      |Id  |
+----------+---------+------+-------------+---+----------+----+
|Dulce     |Abril    |Female|             |32 |2017-10-15|1562|
|Mara      |         |Female|Great Britain|   |2016-08-16|1582|
|Philip    |Gent     |Male  |             |36 |2015-05-21|2587|
|Kathleen  |Hanner   |Female|             |25 |          |1876|
|Vishvajit |Rao      |      |India        |   |2023-04-24|1230|
|Harry     |         |Male  |India        |27 |2024-05-23|5664|
|Pankaj    |         |Male  |             |30 |2019-09-23|434 |
|Harshita  |Kumari   |      |USA          |24 |2018-06-20|352 |
+----------+---------+------+-------------+---+----------+----+

Example: fill null values with 0 in PySpark DataFrame:

We can also replace null values with zero, Pass zero into fillna() method.

from pyspark.sql import SparkSession

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating PySpark DataFrame
dataframe = spark.read.option('header', 'true').csv('sample_data.csv')


dataframe.printSchema()
# replace all the null values with empty string
dataframe = dataframe.fillna('0')

# displaying
dataframe.show(truncate=False)

Output

+----------+---------+------+-------------+---+----------+----+
|First Name|Last Name|Gender|Country      |Age|Date      |Id  |
+----------+---------+------+-------------+---+----------+----+
|Dulce     |Abril    |Female|0            |32 |2017-10-15|1562|
|Mara      |0        |Female|Great Britain|0  |2016-08-16|1582|
|Philip    |Gent     |Male  |0            |36 |2015-05-21|2587|
|Kathleen  |Hanner   |Female|0            |25 |0         |1876|
|Vishvajit |Rao      |0     |India        |0  |2023-04-24|1230|
|Harry     |0        |Male  |India        |27 |2024-05-23|5664|
|Pankaj    |0        |Male  |0            |30 |2019-09-23|434 |
|Harshita  |Kumari   |0     |USA          |24 |2018-06-20|352 |
+----------+---------+------+-------------+---+----------+----+

So far we have seen how to fill all null values in PySpark DataFrame with an empty string and zero, In fact, you can pass anything in place of zero. But we can also replace null values in a particular column of PySpark DataFrame.

To fill particular columns’ null values in PySpark DataFrame, We have to pass all the column names and their values as Python Dictionary to value parameter to the fillna() method.

In The main data frame, I am about to fill 0 to the age column and 2023-04-10 to the Date column and the rest will be null itself.

from pyspark.sql import SparkSession

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating PySpark DataFrame
dataframe = spark.read.option('header', 'true').csv('sample_data.csv')


dataframe.printSchema()
# replace all the null values with empty string
dataframe = dataframe.fillna({"age": 0, "Date": '2023-04-10'})

# displaying
dataframe.show(truncate=False)

Output

+----------+---------+------+-------------+---+----------+----+
|First Name|Last Name|Gender|Country      |Age|Date      |Id  |
+----------+---------+------+-------------+---+----------+----+
|Dulce     |Abril    |Female|null         |32 |2017-10-15|1562|
|Mara      |null     |Female|Great Britain|0  |2016-08-16|1582|
|Philip    |Gent     |Male  |null         |36 |2015-05-21|2587|
|Kathleen  |Hanner   |Female|null         |25 |2023-04-10|1876|
|Vishvajit |Rao      |null  |India        |0  |2023-04-24|1230|
|Harry     |null     |Male  |India        |27 |2024-05-23|5664|
|Pankaj    |null     |Male  |null         |30 |2019-09-23|434 |
|Harshita  |Kumari   |null  |USA          |24 |2018-06-20|352 |
+----------+---------+------+-------------+---+----------+----+

Even you can pass column names explicitly to the subset parameter in order to replace null values.

from pyspark.sql import SparkSession

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating PySpark DataFrame
dataframe = spark.read.option('header', 'true').csv('sample_data.csv')


dataframe.printSchema()
# replace all the null values with empty string
dataframe = dataframe.fillna('0', subset=['Last Name', 'Gender'])

# displaying
dataframe.show(truncate=False)

Output

+----------+---------+------+-------------+----+----------+----+
|First Name|Last Name|Gender|Country      |Age |Date      |Id  |
+----------+---------+------+-------------+----+----------+----+
|Dulce     |Abril    |Female|null         |32  |2017-10-15|1562|
|Mara      |0        |Female|Great Britain|null|2016-08-16|1582|
|Philip    |Gent     |Male  |null         |36  |2015-05-21|2587|
|Kathleen  |Hanner   |Female|null         |25  |null      |1876|
|Vishvajit |Rao      |0     |India        |null|2023-04-24|1230|
|Harry     |0        |Male  |India        |27  |2024-05-23|5664|
|Pankaj    |0        |Male  |null         |30  |2019-09-23|434 |
|Harshita  |Kumari   |0     |USA          |24  |2018-06-20|352 |
+----------+---------+------+-------------+----+----------+----+

PySpark DataFrame fill() Method

The fill() is a method that is used to replace null values in PySpark DataFrame.PySpark DataFrame fill() and fillna() methods are aliases of each other. The parameter of the fill() method will be the same as fillna() method.

Example: Fill null values in PySpark DataFrame using fill() method

from pyspark.sql import SparkSession

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating PySpark DataFrame
dataframe = spark.read.option('header', 'true').csv('sample_data.csv')

# replace all the null values with empty string
dataframe = dataframe.na.fill('0')

# displaying
dataframe.show(truncate=False)

Output

+----------+---------+------+-------------+---+----------+----+
|First Name|Last Name|Gender|Country      |Age|Date      |Id  |
+----------+---------+------+-------------+---+----------+----+
|Dulce     |Abril    |Female|0            |32 |2017-10-15|1562|
|Mara      |0        |Female|Great Britain|0  |2016-08-16|1582|
|Philip    |Gent     |Male  |0            |36 |2015-05-21|2587|
|Kathleen  |Hanner   |Female|0            |25 |0         |1876|
|Vishvajit |Rao      |0     |India        |0  |2023-04-24|1230|
|Harry     |0        |Male  |India        |27 |2024-05-23|5664|
|Pankaj    |0        |Male  |0            |30 |2019-09-23|434 |
|Harshita  |Kumari   |0     |USA          |24 |2018-06-20|352 |
+----------+---------+------+-------------+---+----------+----+

Example: Fill null values with string in PySpark DataFrame using fill() method

from pyspark.sql import SparkSession

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating PySpark DataFrame
dataframe = spark.read.option('header', 'true').csv('sample_data.csv')

# replace all the null values with empty string
dataframe = dataframe.na.fill('N/A')

# displaying
dataframe.show(truncate=False)

Output

+----------+---------+------+-------------+---+----------+----+
|First Name|Last Name|Gender|Country      |Age|Date      |Id  |
+----------+---------+------+-------------+---+----------+----+
|Dulce     |Abril    |Female|N/A          |32 |2017-10-15|1562|
|Mara      |N/A      |Female|Great Britain|N/A|2016-08-16|1582|
|Philip    |Gent     |Male  |N/A          |36 |2015-05-21|2587|
|Kathleen  |Hanner   |Female|N/A          |25 |N/A       |1876|
|Vishvajit |Rao      |N/A   |India        |N/A|2023-04-24|1230|
|Harry     |N/A      |Male  |India        |27 |2024-05-23|5664|
|Pankaj    |N/A      |Male  |N/A          |30 |2019-09-23|434 |
|Harshita  |Kumari   |N/A   |USA          |24 |2018-06-20|352 |
+----------+---------+------+-------------+---+----------+----+

PySpark Other Tutorials


Conclusion

As you have seen, how we have filled the null values in PySpark DataFame along with examples.PySpark DataFrame fillna() and fill() methods are the most useful method to fill the null values with string, float, any literal, etc.

If you found this article helpful, Please share and keep visiting for further PySpark interesting tutorials.

How to Drop Duplicate Rows from PySpark DataFrame
Merge Two DataFrames in PySpark with Different Column Names

Related Posts