
How to Convert PySpark DataFrame Column to List


In this article, you will learn how to convert a PySpark DataFrame column to a Python list with the help of examples. In real-life projects, we sometimes need to convert a PySpark DataFrame column to a list, and PySpark provides multiple ways to do it.

Before diving in, let’s take a brief look at the PySpark DataFrame and the Python list.

What is a Python list?

A list in Python is a fundamental data structure that stores a collection of elements, possibly of different data types. It is a mutable (changeable) data type in Python, meaning we can change the values of a list after it has been created.

For illustration, I have defined some Python lists that store values of different data types.

lst1 = ['Python', 'Java', 'C', 'C++']
lst2 = [1, 2, 3, 4, 5]
lst3 = ['Python', 1, 'C', 2, 'C++']

As you can see in the above code, I have defined three Python lists. lst1 stores only string values, lst2 stores only integer values, and lst3 stores a mix of string and integer items. In fact, we can also store float values.
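
For example, because lists are mutable, we can update one in place after creating it:

lst1 = ['Python', 'Java', 'C', 'C++']

# lists are mutable: replace an element and append a new one
lst1[1] = 'Scala'
lst1.append('Go')

print(lst1)  # ['Python', 'Scala', 'C', 'C++', 'Go']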

What is PySpark DataFrame?

PySpark DataFrame is a core data structure in PySpark that stores values in a tabular (row and column) format, like a spreadsheet, in a distributed environment.

A DataFrame in PySpark is similar to a table in an RDBMS (Relational Database Management System). PySpark allows us to perform SQL operations on top of a PySpark DataFrame.

For reference, you can find a sample PySpark DataFrame below.

+--------+-----------------+----------+------+
|name    |designation      |department|salary|
+--------+-----------------+----------+------+
|Sharu   |Developer        |IT        |33000 |
|John    |Developer        |IT        |40000 |
|Jaiyka  |HR Executive     |HR        |25000 |
|Shantanu|Manual Tester    |IT        |25000 |
|Avantika|Senior HR Manager|HR        |45000 |
|Vaishali|Junior Accountant|Account   |23000 |
|Vinay   |Senior Accountant|Account   |40000 |
+--------+-----------------+----------+------+

In the above PySpark DataFrame, name, designation, department, and salary represent the columns of the DataFrame.
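
Because a PySpark DataFrame behaves like a SQL table, we can register it as a temporary view and query it with plain SQL. A minimal sketch, assuming the DataFrame above is stored in a variable named df (the view name employees is just illustrative):

# register the DataFrame as a temporary SQL view
df.createOrReplaceTempView("employees")

# run an ordinary SQL query on top of the DataFrame
spark.sql("SELECT name, salary FROM employees WHERE department = 'IT'").show()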

Throughout this article, we will see how to convert the columns of a PySpark DataFrame into a Python list.

Now, let’s proceed with creating a DataFrame in PySpark.

Creating PySpark DataFrame

PySpark provides multiple ways to create a DataFrame, but here we will create one from a list of tuples, where each tuple contains information about an employee: name, designation, department, and salary.

Code to generate the PySpark DataFrame:

from pyspark.sql import SparkSession

data = [
    ("Pankaj", "Developer", "IT", 33000),
    ("Hari", "Developer", "IT", 40000),
    ("Anshika", "HR Executive", "HR", 25000),
    ("Shantanu", "Manual Tester", "IT", 25000),
    ("Avantika", "Senior HR Manager", "HR", 45000),
    ("Jay", "Junior Accountant", "Account", 23000),
    ("Vinay", "Senior Accountant", "Account", 40000),
]

columns = ["name", "designation", "department", "salary"]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
df = spark.createDataFrame(data, columns)

# displaying dataframe
df.show(truncate=False)

Explanation of the above code:

As you can see, the above code contains a set of instructions to create a PySpark DataFrame. Let’s walk through each statement so that you can understand the code easily.

  • First, I imported SparkSession from the pyspark.sql module.
  • Defined a list of tuples, where each tuple holds information about an employee: name, designation, department, and salary.
  • Defined a list of column names: “name”, “designation”, “department”, and “salary”.
  • Created a Spark session using SparkSession.builder along with the appName() and getOrCreate() methods.
  • Created the PySpark DataFrame using the createDataFrame() method, which is a SparkSession method.
  • Displayed the PySpark DataFrame using the show() method.

After executing the above code, The PySpark DataFrame will be like this.

Output

+--------+-----------------+----------+------+
|    name|      designation|department|salary|
+--------+-----------------+----------+------+
|  Pankaj|        Developer|        IT| 33000|
|    Hari|        Developer|        IT| 40000|
| Anshika|     HR Executive|        HR| 25000|
|Shantanu|    Manual Tester|        IT| 25000|
|Avantika|Senior HR Manager|        HR| 45000|
|     Jay|Junior Accountant|   Account| 23000|
|   Vinay|Senior Accountant|   Account| 40000|
+--------+-----------------+----------+------+
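
If you want to verify the column types that createDataFrame() inferred from the Python tuples, you can call the printSchema() method. Here the salary column comes out as long because the input values are Python integers:

# inspect the inferred schema
df.printSchema()

Output

root
 |-- name: string (nullable = true)
 |-- designation: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: long (nullable = true)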

Now, let’s see how we can convert a PySpark DataFrame column to a Python list.

Convert PySpark DataFrame Column to List

There are multiple ways to convert a PySpark DataFrame column to a Python list. Let’s go through the most common ones, one by one.

Using flatMap()

The flatMap() method is a PySpark RDD method that returns a new RDD by applying a function to each element of the RDD and then flattening the results.

Since flatMap() applies a function to each element of an RDD, we first select the required column from the existing PySpark DataFrame using the select() method.

Syntax:

dataframe.select(column).rdd.flatMap(lambda x: x).collect()

In the above syntax:

  • dataframe represents the PySpark DataFrame.
  • select() is a method that returns another DataFrame containing only the columns passed into it.
  • column indicates the column of the PySpark DataFrame that you want to convert into a list.
  • rdd is an attribute of the PySpark DataFrame that converts the DataFrame into an RDD.
  • flatMap() is an RDD method that applies a function to each element of the RDD and flattens the results.
  • collect() is a method that returns a list containing all of the elements of the RDD.

Example: Convert Single PySpark DataFrame Column to List

from pyspark.sql import SparkSession

data = [
    ("Pankaj", "Developer", "IT", 33000),
    ("Hari", "Developer", "IT", 40000),
    ("Anshika", "HR Executive", "HR", 25000),
    ("Shantanu", "Manual Tester", "IT", 25000),
    ("Avantika", "Senior HR Manager", "HR", 45000),
    ("Jay", "Junior Accountant", "Account", 23000),
    ("Vinay", "Senior Accountant", "Account", 40000),
]

columns = ["name", "designation", "department", "salary"]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
df = spark.createDataFrame(data, columns)

# selecting one column
df_one = df.select("name")

# converting into PySpark RDD
rdd = df_one.rdd

# convert into list
lst = rdd.flatMap(lambda x: x).collect()

# display the lst
print(lst)

Output

['Pankaj', 'Hari', 'Anshika', 'Shantanu', 'Avantika', 'Jay', 'Vinay']

Example: Convert Multiple PySpark DataFrame Columns to List

from pyspark.sql import SparkSession

data = [
    ("Pankaj", "Developer", "IT", 33000),
    ("Hari", "Developer", "IT", 40000),
    ("Anshika", "HR Executive", "HR", 25000),
    ("Shantanu", "Manual Tester", "IT", 25000),
    ("Avantika", "Senior HR Manager", "HR", 45000),
    ("Jay", "Junior Accountant", "Account", 23000),
    ("Vinay", "Senior Accountant", "Account", 40000),
]

columns = ["name", "designation", "department", "salary"]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
df = spark.createDataFrame(data, columns)

# selecting multiple columns
df_one = df.select(["name", "designation"])

# converting DataFrame into PySpark RDD
rdd = df_one.rdd

# convert into list
lst = rdd.flatMap(lambda x: x).collect()

# display the lst
print(lst)


Output

['Pankaj', 'Developer', 'Hari', 'Developer', 'Anshika', 'HR Executive', 'Shantanu', 'Manual Tester', 'Avantika', 'Senior HR Manager', 'Jay', 'Junior Accountant', 'Vinay', 'Senior Accountant']

Using the map() Method

The map() method applies a function to each element of the RDD and returns a new RDD.

Syntax:

dataframe.select(column_name).rdd.map(lambda x: x[0]).collect()

In the above syntax:

  • dataframe represents the PySpark DataFrame.
  • select() is a method that returns another DataFrame containing only the columns passed into it.
  • column_name indicates the column of the PySpark DataFrame that you want to convert into a list.
  • rdd is an attribute of the PySpark DataFrame that converts the DataFrame into an RDD.
  • map() is an RDD method that applies a function to each element of the RDD.
  • collect() is a method that returns a list containing all of the elements of the RDD.

Example: Convert Single PySpark DataFrame Column to List

from pyspark.sql import SparkSession

data = [
    ("Pankaj", "Developer", "IT", 33000),
    ("Hari", "Developer", "IT", 40000),
    ("Anshika", "HR Executive", "HR", 25000),
    ("Shantanu", "Manual Tester", "IT", 25000),
    ("Avantika", "Senior HR Manager", "HR", 45000),
    ("Jay", "Junior Accountant", "Account", 23000),
    ("Vinay", "Senior Accountant", "Account", 40000),
]

columns = ["name", "designation", "department", "salary"]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
df = spark.createDataFrame(data, columns)

# selecting one column
df_one = df.select(["designation"])

# converting DataFrame into PySpark RDD
rdd = df_one.rdd

# convert into list
lst = rdd.map(lambda x: x[0]).collect()

# display the lst
print(lst)


Output

['Developer', 'Developer', 'HR Executive', 'Manual Tester', 'Senior HR Manager', 'Junior Accountant', 'Senior Accountant']

Example: Convert Multiple PySpark Columns to Python List

from pyspark.sql import SparkSession

data = [
    ("Pankaj", "Developer", "IT", 33000),
    ("Hari", "Developer", "IT", 40000),
    ("Anshika", "HR Executive", "HR", 25000),
    ("Shantanu", "Manual Tester", "IT", 25000),
    ("Avantika", "Senior HR Manager", "HR", 45000),
    ("Jay", "Junior Accountant", "Account", 23000),
    ("Vinay", "Senior Accountant", "Account", 40000),
]

columns = ["name", "designation", "department", "salary"]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
df = spark.createDataFrame(data, columns)

# selecting multiple columns
df_one = df.select(["name", "designation"])

# converting DataFrame into PySpark RDD
rdd = df_one.rdd

# convert into list
lst = rdd.map(lambda x: (x[0], x[1])).collect()

# display the lst
print(lst)
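
Output

[('Pankaj', 'Developer'), ('Hari', 'Developer'), ('Anshika', 'HR Executive'), ('Shantanu', 'Manual Tester'), ('Avantika', 'Senior HR Manager'), ('Jay', 'Junior Accountant'), ('Vinay', 'Senior Accountant')]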


Convert PySpark DataFrame Column to List Using collect()

The collect() method is another way to convert a PySpark DataFrame column to a list. The collect() method returns a list of Row objects, and each Row object represents one record of the PySpark DataFrame.

Along with the collect() method, we will use a list comprehension to extract the column values and store them in a list.

from pyspark.sql import SparkSession

data = [
    ("Pankaj", "Developer", "IT", 33000),
    ("Hari", "Developer", "IT", 40000),
    ("Anshika", "HR Executive", "HR", 25000),
    ("Shantanu", "Manual Tester", "IT", 25000),
    ("Avantika", "Senior HR Manager", "HR", 45000),
    ("Jay", "Junior Accountant", "Account", 23000),
    ("Vinay", "Senior Accountant", "Account", 40000),
]

columns = ["name", "designation", "department", "salary"]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
df = spark.createDataFrame(data, columns)

# selecting one column
df_one = df.select("department")

# converting DataFrame into PySpark RDD
rdd = df_one.rdd

# convert into list
lst = [data[0] for data in rdd.collect()]

# display the lst
print(lst)


Output

['IT', 'IT', 'HR', 'IT', 'HR', 'Account', 'Account']
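
The collect() method can also be called directly on the DataFrame, without converting it to an RDD first, since DataFrame.collect() likewise returns a list of Row objects. A minimal sketch, reusing the df defined above:

# DataFrame.collect() also returns Row objects, so the .rdd step is optional
lst = [row["department"] for row in df.select("department").collect()]

print(lst)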

Convert PySpark DataFrame Column to List using Pandas

Pandas is a popular Python library for data analysis. To convert a DataFrame column to a list with pandas, we first convert the required column to a pandas DataFrame using the toPandas() method of the PySpark DataFrame, and then convert it into a Python list using the list() function.

from pyspark.sql import SparkSession

data = [
    ("Pankaj", "Developer", "IT", 33000),
    ("Hari", "Developer", "IT", 40000),
    ("Anshika", "HR Executive", "HR", 25000),
    ("Shantanu", "Manual Tester", "IT", 25000),
    ("Avantika", "Senior HR Manager", "HR", 45000),
    ("Jay", "Junior Accountant", "Account", 23000),
    ("Vinay", "Senior Accountant", "Account", 40000),
]

columns = ["name", "designation", "department", "salary"]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
df = spark.createDataFrame(data, columns)

# selecting one column and converting it to pandas
pandas_df = df.select("department").toPandas()['department']

# convert into list
print(list(pandas_df))


Output

['IT', 'IT', 'HR', 'IT', 'HR', 'Account', 'Account']
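
As a small variation, the pandas Series returned above also provides a tolist() method, which performs the same conversion:

# Series.tolist() is an equivalent way to get a plain Python list
lst = df.select("department").toPandas()["department"].tolist()

print(lst)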

Related Articles

  • How to convert PySpark DataFrame to RDD
  • How to Write PySpark DataFrame to CSV

Conclusion

So, in this article, we have seen how to convert a PySpark DataFrame column to a list with the help of examples. You can use any one of these approaches to convert a PySpark DataFrame column to a list, and you can even convert multiple DataFrame columns at once.

I hope the process of converting a PySpark DataFrame column to a list was pretty easy and straightforward.

If you like this article, please share it and keep visiting for further tutorials.

Have a nice day…
