
How to Count Null and NaN Values in Each Column in PySpark DataFrame?


Hi PySpark Developers! In this article, you will learn how to count null and NaN values in each column of a PySpark DataFrame with the help of examples. We will also see how to count Null, None, and NaN values in a single column of a PySpark DataFrame.

When we work with real-life datasets, there is a high possibility that they contain lots of Null, NaN, and None values, so as Data Engineers and Developers it is our responsibility to handle all the null values in the dataset.

To load large datasets we will use PySpark, because it is one of the most powerful in-memory computational frameworks for large-scale data. Throughout this article, we will count None, NaN, and Null values in each column as well as in a single column.

Let’s first understand the meaning of None, NaN, and Null values in a PySpark DataFrame.

Note: In Python, the None keyword is used to represent a null value or an empty variable. In other words, None means nothing, and in PySpark the None keyword is equivalent to a Null value.

What are None, Null, and NaN values in a PySpark DataFrame?

In a PySpark DataFrame, None, Null, and NaN values represent missing values. It is important to handle all None, Null, and NaN values in any real-life PySpark application so that we can perform operations on top of the DataFrame more efficiently. In PySpark, None is equivalent to a Null value.

To count Null, None, and NaN values, we first need a PySpark DataFrame that contains these values. So let’s create a PySpark DataFrame with None values.

PySpark provides some built-in functions to check for None values and count them. All of these functions are defined in the pyspark.sql.functions module, so we need to import them from pyspark.sql.functions before using them.

The functions are given below; a short sketch of how they fit together follows the list.

  • count(col): an aggregate function that returns the total number of non-null values in the passed column.
  • when(condition, value): a built-in function that evaluates one or more conditions and returns one of several possible result expressions.
  • col(col_name): a built-in function that takes a column name as a parameter and returns an instance of the PySpark Column class.
  • isNull(): a method of the PySpark Column class that returns True for rows where the column value is null; it is applied directly to a DataFrame column.
  • alias(): also a method of the PySpark Column class, used to provide an alias name for the column it is applied to.
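
Here is a minimal sketch of how these functions fit together (the column name some_column is just a placeholder, not a column of the example DataFrame below). The trick is that when() returns null whenever its condition is false, and count() ignores nulls, so the expression counts only the rows where the column is null.

from pyspark.sql.functions import count, when, col

# when() yields null for rows that fail the condition; count() skips nulls,
# so only the rows where "some_column" is null are counted
null_count_expression = count(when(col("some_column").isNull(), "some_column")).alias("null_count")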

Create PySpark DataFrame with None Values

To count the None values in a PySpark DataFrame, I have created a sample DataFrame with some None values in each column. Now we will walk through counting None values in a single column as well as in multiple columns.

from pyspark.sql import SparkSession

data = [
    ("Vikas", None, "25", 'IT', None),
    (None, "Kumar", "26", 'HR', "23000"),
    ('Hanshika', "Kumari", None, 'IT', None),
    ('Pankaj', "Singh", None, 'IT', "25000"),
    (None, None, '24', 'IT', "35000"),
    ('Vinay', 'Kumari','25', None, None),
    ('Vishal', None, None, 'HR', "27000"),
]

columns = ['first_name', 'last_name', 'age', 'department', 'salary']

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
employee_dataframe = spark.createDataFrame(data, columns)

# display created DataFrame
employee_dataframe.show()

After executing the above PySpark code, the DataFrame will look like this.

+----------+---------+----+----------+------+
|first_name|last_name| age|department|salary|
+----------+---------+----+----------+------+
|     Vikas|     null|  25|        IT|  null|
|      null|    Kumar|  26|        HR| 23000|
|  Hanshika|   Kumari|null|        IT|  null|
|    Pankaj|    Singh|null|        IT| 25000|
|      null|     null|  24|        IT| 35000|
|     Vinay|   Kumari|  25|      null|  null|
|    Vishal|     null|null|        HR| 27000|
+----------+---------+----+----------+------+

As you can see in the above PySpark DataFrame, the None values have been converted into null values. Now we will count the total null values in a single column and in multiple columns.

PySpark DataFrame Count Null Values in a Column

Here we are going to count the total number of null values in a single column with the help of the above-mentioned functions. We will count the total null values in the first_name column.

Example: Count Null Values in a Single PySpark DataFrame Column

from pyspark.sql.functions import count, when, col
employee_dataframe.select(
    count(when(col("first_name").isNull(), "first_name")).alias("total null values")
).show()

Note: Remember, employee_dataframe refers to the DataFrame created above.

+-----------------+
|total null values|
+-----------------+
|                2|
+-----------------+

As you can see in the output above, the total null values have been counted. You can verify the result against the main DataFrame created earlier.
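
As a quick sanity check, you can also get the same number with filter() and count() instead of an aggregate expression. A minimal sketch, assuming the employee_dataframe created above:

from pyspark.sql.functions import col

# keep only the rows where first_name is null, then count them
null_first_name_count = employee_dataframe.filter(col("first_name").isNull()).count()
print(null_first_name_count)  # 2 for the sample data above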

PySpark DataFrame Count Null Values in Multiple Columns

We can also count the null values in multiple columns. Here I am going to count the total null values in the first_name, last_name, and age columns.

Example: Count Null values in multiple PySpark DataFrame Columns

from pyspark.sql.functions import count, when, col
employee_dataframe.select(
    [
        count(when(col("first_name").isNull(), "first_name")).alias(
            "first_name"
        ),
        count(when(col("last_name").isNull(), "last_name")).alias(
            "last_name"
        ),
        count(when(col("age").isNull(), "age")).alias(
            "age"
        ),
    ]
).show()

Output

+----------+---------+---+
|first_name|last_name|age|
+----------+---------+---+
|         2|        3|  3|
+----------+---------+---+

PySpark DataFrame Count Null Values in Each Column

So far we have seen how to count null values in single or multiple columns, but sometimes we want to count null values in every column of a PySpark DataFrame. We can use the columns attribute of the PySpark DataFrame, which returns the list of all column names, and build one count expression per column.

Example: Count Null Values in Each PySpark DataFrame Column

from pyspark.sql.functions import count, when, col
employee_dataframe.select(
    [
        count(when(col(column_name).isNull(), column_name)).alias(column_name)
        for column_name in employee_dataframe.columns
    ]
).show()

Output

+----------+---------+---+----------+------+
|first_name|last_name|age|department|salary|
+----------+---------+---+----------+------+
|         2|        3|  3|         1|     3|
+----------+---------+---+----------+------+

You can verify all of these results against the main PySpark DataFrame created above.
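
If you need these per-column counts programmatically rather than just printed, you can collect the single result row into a Python dictionary. A minimal sketch, assuming the same employee_dataframe:

from pyspark.sql.functions import count, when, col

# build one null-count expression per column and collect the single result row as a dict
null_counts = employee_dataframe.select(
    [count(when(col(c).isNull(), c)).alias(c) for c in employee_dataframe.columns]
).first().asDict()

print(null_counts)  # {'first_name': 2, 'last_name': 3, 'age': 3, 'department': 1, 'salary': 3}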

Now we are going to count NaN values in a PySpark DataFrame. Let’s first see what exactly a NaN value is in Python.

What is the NaN value in Python?

NaN stands for Not a Number and represents a missing value in the data. It is one of the major problems in real-life data analysis projects, so we should always handle it. NaN is a special floating-point value, and it cannot be converted into any type other than float.
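
A small plain-Python illustration of this behavior (no PySpark needed):

import math

nan_value = float("nan")

print(type(nan_value))         # <class 'float'> - NaN is a floating-point value
print(nan_value == nan_value)  # False - NaN is not equal to anything, not even itself
print(math.isnan(nan_value))   # True - the reliable way to test for NaN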

Create PySpark DataFrame with NaN values

To create a PySpark DataFrame with NaN values, I will use the Python NumPy library, which is one of the most popular Python libraries for performing a variety of mathematical operations.

from pyspark.sql import SparkSession
import numpy as np

data = [
    ("Pankaj", "Kumar", "Male", "BTech", 1550.50, 101, "2022-12-20"),
    ("Harshita", "Kumari", "Female", "BCom", np.nan, 102, "2022-12-20"),
    ("Hari", "Sharma", "Male", "BCA", 1400.00, 103, "2018-03-12"),
    ("Anshika", "Kumari", "Female", "MCA", 1550.30, 104, "2029-05-19"),
    ("Shantanu", "Saini", "Male", "BSc", np.nan, 105, "2019-08-20"),
    ("Avantika", "Srivastava", "Female", "BCom", 1350.40, 106, "2020-10-21"),
    ("Jay", "Kumar", "Male", "BTech", 1450.30, 107, "2019-08-29"),

]

columns = [
    "first_name",
    "last_name",
    "gender",
    "course",
    "marks",
    "roll_number",
    "admission_date",
]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
students_dataframe = spark.createDataFrame(data, columns)

students_dataframe.show()

After executing the above code, the PySpark DataFrame will look like this.

+----------+----------+------+------+------+-----------+--------------+
|first_name| last_name|gender|course| marks|roll_number|admission_date|
+----------+----------+------+------+------+-----------+--------------+
|    Pankaj|     Kumar|  Male| BTech|1550.5|        101|    2022-12-20|
|  Harshita|    Kumari|Female|  BCom|   NaN|        102|    2022-12-20|
|      Hari|    Sharma|  Male|   BCA|1400.0|        103|    2018-03-12|
|   Anshika|    Kumari|Female|   MCA|1550.3|        104|    2029-05-19|
|  Shantanu|     Saini|  Male|   BSc|   NaN|        105|    2019-08-20|
|  Avantika|Srivastava|Female|  BCom|1350.4|        106|    2020-10-21|
|       Jay|     Kumar|  Male| BTech|1450.3|        107|    2019-08-29|
+----------+----------+------+------+------+-----------+--------------+

As you can see, some NaN values exist in the marks column. Now we will count them.

Example: Count NaN values in PySpark DataFrame Column

PySpark provides a built-in function called isnan() that takes a column (or column name) as a parameter and returns True for rows where the column contains a NaN value, otherwise False.

from pyspark.sql.functions import count, when, col, isnan

students_dataframe.select(
    [
        count(when(isnan(col('marks')), 'marks')).alias("total nan values")
    ]
).show()

Output

+----------------+
|total nan values|
+----------------+
|               2|
+----------------+

So this is how you can count NaN values in a PySpark DataFrame.
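
If you want NaN counts for every numeric column at once, you can restrict the per-column loop to float and double columns, since isnan() only makes sense for floating-point types. A minimal sketch, assuming the students_dataframe created above:

from pyspark.sql.functions import count, when, isnan, col

# only float/double columns can hold NaN, so pick them out by dtype first
numeric_columns = [name for name, dtype in students_dataframe.dtypes if dtype in ("float", "double")]

students_dataframe.select(
    [count(when(isnan(col(c)), c)).alias(c) for c in numeric_columns]
).show()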


Conclusion

I hope the explanation in this article was easy to follow. If your data contains None or NaN values, you can follow any of the approaches we have seen throughout this article whenever you want to perform operations on a PySpark DataFrame based on the count of None and NaN values.

To count None and NaN values we used several functions and methods, such as count(), when(), isnan(), isNull(), alias(), and col(). The count() function counts the total number of non-null values in a column, when() checks one or more conditions and returns a result when a condition evaluates to True, and isnan() checks whether the passed column value is NaN.

The isNull() method is a Column method that returns True if the column value is None (null), otherwise False. The col() function takes a column name as a parameter and returns an instance of the PySpark Column class, and the alias() method is also a Column method used to provide an alias name for a column.

If you found this article helpful, please share and keep visiting for further interesting tutorials.

Have a nice day.
Happy coding.


FAQs ( Frequently Asked Questions )

What is the None Value in PySpark DataFrame Column?

Ans: A None value in a PySpark DataFrame column represents a missing value. The PySpark Column class has a method called isNull() to check whether a column value is None or not. Remember one thing: the Python None keyword is equal to null in a PySpark DataFrame.

How do you fill None with a value in PySpark?

Ans: PySpark provides DataFrame methods such as fillna() and na.fill() to fill or replace None or null values in a PySpark DataFrame.
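
For example, a minimal sketch using the employee_dataframe from earlier (the replacement values are only illustrative, and they must match the column types, which are all strings here):

# replace nulls in specific columns with the given values
filled_dataframe = employee_dataframe.fillna({"salary": "0", "department": "Unknown"})
filled_dataframe.show()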

PySpark isnan vs isnull

Ans: Both are used to handle missing values in a PySpark DataFrame, but they are not the same. The isnan() function checks for NaN (Not a Number) values, which can only occur in floating-point columns, while isnull() (and the equivalent Column method isNull()) checks for null values. Null represents nothing or no value at all, not even an empty string. The isnull() check returns True if the column value is null, otherwise False, and note that a NaN value is not considered null.
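
To see the difference on the students_dataframe from earlier, compare the two checks on the marks column; note that NaN values are not counted as null (a minimal sketch):

from pyspark.sql.functions import count, when, isnan, col

students_dataframe.select(
    count(when(isnan(col("marks")), "marks")).alias("nan_count"),     # 2
    count(when(col("marks").isNull(), "marks")).alias("null_count"),  # 0 - NaN is not null
).show()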
