
How to Convert PySpark Row to Dictionary

In this article, we will see how to convert a PySpark Row to a dictionary with the help of examples. Row is the core class in PySpark that represents a single record, or row, in a PySpark DataFrame.
To create an instance of the Row class, we first have to import it from the pyspark.sql module, where it is exposed.
There are two ways to create an instance of the Row class: by using named arguments, or by creating a custom class from the Row class.

Throughout this article, we will walk through the complete process of converting a PySpark Row to a dictionary with the help of proper examples.

Now, let’s see the process of creating an object or instance of the PySpark Row class.

Creating an instance of the Row class

There are two ways to create an instance of the Row class: using named arguments, or using a custom class created from the Row class.

Let's look at both ways of creating an instance of the Row class.

Creating a Row object using named arguments:

A Row object can be created using named arguments because the Row class accepts a variable number of keyword arguments. We can pass any number of keyword arguments to the Row class in order to create an instance of it.

In the code below, I have created an instance of the Row class called row_obj. The keywords name, gender, and age represent column names in a PySpark DataFrame; such keywords are also called named arguments.

from pyspark.sql import Row
row_obj = Row(name="John", gender="Male", age=25)

Creating a Row object using a custom class:

Another option for creating an instance of the Row class is a custom class. First, we create a custom class by passing the column names to the Row class; then we call that class with the column values to create the instance.

As you can see below.

from pyspark.sql import Row
# creating custom class
Person = Row('name', 'gender', 'age')

# creating object
obj1 = Person('John', 'Male', 30)

You can also use the Python type() function to check the data type of the Row class instance.

print(type(obj1))

Output

<class 'pyspark.sql.types.Row'>

The value of the Row object will be like this.

Row(name='John', gender='Male', age=30)

The objects row_obj and obj1 are both instances of the PySpark Row class.
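
Once you have a Row instance, its fields can be read by attribute, by key, or by position. Here is a quick sketch using the two objects created above:

from pyspark.sql import Row

row_obj = Row(name="John", gender="Male", age=25)
Person = Row('name', 'gender', 'age')
obj1 = Person('John', 'Male', 30)

# attribute-style access
print(row_obj.name)   # John

# dictionary-style access
print(obj1['age'])    # 30

# positional access
print(obj1[0])        # John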

Now let’s see the process of converting the Row instance to PySpark Dictionary.

Converting PySpark Row to Dictionary

So far, we have seen two ways to create instances of the PySpark Row class. Now let's look at the process of converting a Row instance to a dictionary.

The PySpark Row class has a method called asDict() that converts a Row instance to a dictionary, as you can see below.

from pyspark.sql import Row

# creating custom class
Person = Row('name', 'gender', 'age')

# creating object
obj1 = Person('John', 'Male', 30)

# convert to dictionary
print(obj1.asDict())

After executing the above code, the output will be the following.

{'name': 'John', 'gender': 'Male', 'age': 30}
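
Note that asDict() returns a plain Python dict, so all the usual dictionary operations work on the result. A small check:

d = obj1.asDict()
print(type(d))     # <class 'dict'>
print(d['name'])   # John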

Converting nested Row to Dictionary

Sometimes we have a nested Row object that we want to convert to a dictionary. In that scenario, we still use the asDict() method, but with the recursive=True parameter, which is set to False by default.

Let’s create a nested Row object in PySpark.

row_obj = Row(name="Harshita", gender="Female", age=25, skills=Row(backend="Python", frontend="Angular", database="MySQL"))

As you can see in the above code, I have created a Row object, and inside it, another Row object called skills. To convert this nested object to a dictionary, we have to call the asDict() method with the recursive=True parameter, just like below.

row_obj.asDict(recursive=True)

The complete code will be like this.

from pyspark.sql import Row

row_obj = Row(name="Harshita", gender="Female", age=25, skils=Row(backend="Python", frontend="Angular", database="MySQL"))
print(row_obj.asDict(recursive=True))

After executing the above code, the resulting dictionary will look like this.

{'name': 'Harshita', 'gender': 'Female', 'age': 25, 'skills': {'backend': 'Python', 'frontend': 'Angular', 'database': 'MySQL'}}
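
For comparison, calling asDict() on the same nested Row without recursive=True leaves the inner Row object unconverted, which is usually not what we want here:

print(row_obj.asDict())
# {'name': 'Harshita', 'gender': 'Female', 'age': 25,
#  'skills': Row(backend='Python', frontend='Angular', database='MySQL')}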

Converting PySpark DataFrame to Dictionary

As we know, a PySpark Row represents a record, or row, in a PySpark DataFrame, which is why it is also possible to convert a whole PySpark DataFrame to a list of dictionaries.

For demonstration, I have created a PySpark DataFrame as you can see below.

from pyspark.sql import SparkSession

data = [
    ("Sharu", "Developer", "IT", 33000),
    ("John", "Developer", "IT", 40000),
    ("Jaiyka", "HR Executive", "HR", 25000)
]

columns = ["name", "designation", "department", "salary"]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
df = spark.createDataFrame(data, columns)

# displaying df
df.show(truncate=False)

The DataFrame will be:

+------+------------+----------+------+
|name  |designation |department|salary|
+------+------------+----------+------+
|Sharu |Developer   |IT        |33000 |
|John  |Developer   |IT        |40000 |
|Jaiyka|HR Executive|HR        |25000 |
+------+------------+----------+------+

Now let’s convert the above-created PySpark DataFrame to Dictionary.

To convert a PySpark DataFrame to dictionaries, we mainly have to follow two steps. In the first step, we call the collect() method on the DataFrame, which returns each record of the DataFrame as a Row instance.

In the second step, we iterate over the collected records and call the asDict() method on each one.

collect(): returns a list containing Row instances.

from pprint import pprint

lst = []

for record in df.collect():
    lst.append(record.asDict())

pprint(lst)

Output

[{'department': 'IT',
  'designation': 'Developer',
  'name': 'Sharu',
  'salary': 33000},
 {'department': 'IT',
  'designation': 'Developer',
  'name': 'John',
  'salary': 40000},
 {'department': 'HR',
  'designation': 'HR Executive',
  'name': 'Jaiyka',
  'salary': 25000}]

The complete code will be like this.

from pyspark.sql import SparkSession

data = [
    ("Sharu", "Developer", "IT", 33000),
    ("John", "Developer", "IT", 40000),
    ("Jaiyka", "HR Executive", "HR", 25000)
]

columns = ["name", "designation", "department", "salary"]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
df = spark.createDataFrame(data, columns)


from pprint import pprint

lst = []

for record in df.collect():
    lst.append(record.asDict())

pprint(lst)

Note: pprint is a built-in Python module used for pretty-printing.
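
For reference, the same conversion can be written more compactly. Both snippets below are sketches of common alternatives: a list comprehension over the collected Rows, and a pandas-based route (which assumes pandas is installed) using toPandas() with to_dict('records').

# list comprehension over collected Rows
lst = [record.asDict() for record in df.collect()]

# pandas-based alternative (requires pandas to be installed)
lst = df.toPandas().to_dict('records')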




Summary

So, in this article, we have seen the full process of converting a PySpark Row to a dictionary with the help of different examples. Now you can easily convert any Row instance, whether simple or nested, into a dictionary. Just remember one thing: when converting a nested Row instance, you have to pass recursive=True to the asDict() method.

I hope this article explained the process well. If you have any queries regarding this article, let us know by email.

Please share it and keep visiting for more interesting PySpark tutorials.

I appreciate your valuable time.

Have a nice day …..
