Drop One or Multiple columns from PySpark DataFrame

In this article, you will learn how to drop one or multiple columns from a PySpark DataFrame, with the help of examples. In real-life projects, we sometimes need to delete columns from a PySpark DataFrame.

A PySpark DataFrame has a method called drop() that is used to delete columns from the DataFrame. This article walks through dropping a single column as well as multiple columns, so that by the end you know how to drop columns in a PySpark DataFrame.

To apply the drop() method, we first need a PySpark DataFrame. Let me create a simple one for demonstration. You can skip this part if you already have a PySpark DataFrame.

Create PySpark DataFrame

To create the PySpark DataFrame, I have prepared a list of tuples. Each tuple contains information about one student: first_name, last_name, course, marks, roll_number, and admission_date. All values are stored as strings.

Code to create PySpark DataFrame:

from pyspark.sql import SparkSession

data = [
    ("Pankaj", "Kumar", "BTech", "1550.50", "101", "2022-12-20"),
    ("Hari", "Sharma", "BCA", "1400.00", "102", "2018-03-12"),
    ("Anshika", "Kumari", "MCA", "1450.00", "103", "2029-05-19"),
    ("Shantanu", "Saini", "BSc", "1350.50", "104", "2019-08-20"),
    ("Avantika", "Srivastava", "BCom", "1350.00", "105", "2020-10-21"),
    ("Jay", "Kumar", "BTech", "1540.00", "106", "2019-08-29"),
    ("Vinay", "Singh", "BCA", "1480.50", "107", "2017-09-17"),
]

columns = [
    "first_name",
    "last_name",
    "course",
    "marks",
    "roll_number",
    "admission_date",
]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
student_dataframe = spark.createDataFrame(data, columns)

# displaying dataframe
student_dataframe.show()

After the above code runs successfully, the created DataFrame will look like this:

+----------+----------+------+-------+-----------+--------------+
|first_name| last_name|course|  marks|roll_number|admission_date|
+----------+----------+------+-------+-----------+--------------+
|    Pankaj|     Kumar| BTech|1550.50|        101|    2022-12-20|
|      Hari|    Sharma|   BCA|1400.00|        102|    2018-03-12|
|   Anshika|    Kumari|   MCA|1450.00|        103|    2029-05-19|
|  Shantanu|     Saini|   BSc|1350.50|        104|    2019-08-20|
|  Avantika|Srivastava|  BCom|1350.00|        105|    2020-10-21|
|       Jay|     Kumar| BTech|1540.00|        106|    2019-08-29|
|     Vinay|     Singh|   BCA|1480.50|        107|    2017-09-17|
+----------+----------+------+-------+-----------+--------------+

Let me explain the above code step by step.

  1. First, I imported SparkSession from the pyspark.sql module.
  2. I prepared a list of Python tuples, where each tuple contains information about a student: first_name, last_name, course, marks, roll_number, and admission_date.
  3. I created a Python list containing the column names for the PySpark DataFrame.
  4. I created a Spark session using SparkSession.builder.appName("testing").getOrCreate(), because the Spark session is the entry point of a Spark application.
  5. I called the createDataFrame() method of the Spark session, passing the list of tuples and the list of column names, to create the PySpark DataFrame.
  6. Finally, I displayed the created PySpark DataFrame with show().

Now it's time to explore the PySpark DataFrame drop() method and use it to drop one or multiple columns from a PySpark DataFrame.

PySpark DataFrame drop() Method

The drop() method is a PySpark DataFrame method that removes columns from a DataFrame. It takes one or more column names as parameters and returns a new DataFrame without those columns; the original DataFrame is not modified. You can pass a single column or multiple columns, depending on what you need to drop.

The sections below use drop() to remove a single column, multiple columns, and all columns from a PySpark DataFrame.

Drop one or Multiple columns from PySpark DataFrame

We will look at several ways to drop one or more columns from a PySpark DataFrame, each with an example.

PySpark DataFrame drop Single Column

To drop a single column, pass the column name to the drop() method. The method returns a new DataFrame without that column. In this example, I am going to drop the admission_date column.

from pyspark.sql import SparkSession

data = [
    ("Pankaj", "Kumar", "BTech", "1550.50", "101", "2022-12-20"),
    ("Hari", "Sharma", "BCA", "1400.00", "102", "2018-03-12"),
    ("Anshika", "Kumari", "MCA", "1450.00", "103", "2029-05-19"),
    ("Shantanu", "Saini", "BSc", "1350.50", "104", "2019-08-20"),
    ("Avantika", "Srivastava", "BCom", "1350.00", "105", "2020-10-21"),
    ("Jay", "Kumar", "BTech", "1540.00", "106", "2019-08-29"),
    ("Vinay", "Singh", "BCA", "1480.50", "107", "2017-09-17"),
]

columns = [
    "first_name",
    "last_name",
    "course",
    "marks",
    "roll_number",
    "admission_date",
]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
student_dataframe = spark.createDataFrame(data, columns)

# dropping admission_date column
student_dataframe2 = student_dataframe.drop('admission_date')

# displaying dataframe
student_dataframe2.show()

After dropping the admission_date column, the DataFrame will look like this:

+----------+----------+------+-------+-----------+
|first_name| last_name|course|  marks|roll_number|
+----------+----------+------+-------+-----------+
|    Pankaj|     Kumar| BTech|1550.50|        101|
|      Hari|    Sharma|   BCA|1400.00|        102|
|   Anshika|    Kumari|   MCA|1450.00|        103|
|  Shantanu|     Saini|   BSc|1350.50|        104|
|  Avantika|Srivastava|  BCom|1350.00|        105|
|       Jay|     Kumar| BTech|1540.00|        106|
|     Vinay|     Singh|   BCA|1480.50|        107|
+----------+----------+------+-------+-----------+

PySpark DataFrame Drop Multiple Columns

To drop multiple columns, pass the column names to the drop() method separated by commas. The drop() method removes all the passed columns and returns a new DataFrame; the existing DataFrame is left as it is.

I am going to drop the course, marks, and admission_date columns from the PySpark DataFrame.

from pyspark.sql import SparkSession

data = [
    ("Pankaj", "Kumar", "BTech", "1550.50", "101", "2022-12-20"),
    ("Hari", "Sharma", "BCA", "1400.00", "102", "2018-03-12"),
    ("Anshika", "Kumari", "MCA", "1450.00", "103", "2029-05-19"),
    ("Shantanu", "Saini", "BSc", "1350.50", "104", "2019-08-20"),
    ("Avantika", "Srivastava", "BCom", "1350.00", "105", "2020-10-21"),
    ("Jay", "Kumar", "BTech", "1540.00", "106", "2019-08-29"),
    ("Vinay", "Singh", "BCA", "1480.50", "107", "2017-09-17"),
]

columns = [
    "first_name",
    "last_name",
    "course",
    "marks",
    "roll_number",
    "admission_date",
]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
student_dataframe = spark.createDataFrame(data, columns)

# dropping multiple columns
student_dataframe2 = student_dataframe.drop('course', 'marks', 'admission_date')

# displaying dataframe
student_dataframe2.show()

After dropping the above columns, the new PySpark DataFrame will look like this:

+----------+----------+-----------+
|first_name| last_name|roll_number|
+----------+----------+-----------+
|    Pankaj|     Kumar|        101|
|      Hari|    Sharma|        102|
|   Anshika|    Kumari|        103|
|  Shantanu|     Saini|        104|
|  Avantika|Srivastava|        105|
|       Jay|     Kumar|        106|
|     Vinay|     Singh|        107|
+----------+----------+-----------+

PySpark DataFrame Drop All Columns

To drop all columns from a PySpark DataFrame, pass every column name to the drop() method. The example below does this by unpacking the columns list with the * operator. The drop() method then returns a PySpark DataFrame that has no columns.

from pyspark.sql import SparkSession

data = [
    ("Pankaj", "Kumar", "BTech", "1550.50", "101", "2022-12-20"),
    ("Hari", "Sharma", "BCA", "1400.00", "102", "2018-03-12"),
    ("Anshika", "Kumari", "MCA", "1450.00", "103", "2029-05-19"),
    ("Shantanu", "Saini", "BSc", "1350.50", "104", "2019-08-20"),
    ("Avantika", "Srivastava", "BCom", "1350.00", "105", "2020-10-21"),
    ("Jay", "Kumar", "BTech", "1540.00", "106", "2019-08-29"),
    ("Vinay", "Singh", "BCA", "1480.50", "107", "2017-09-17"),
]

columns = [
    "first_name",
    "last_name",
    "course",
    "marks",
    "roll_number",
    "admission_date",
]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
student_dataframe = spark.createDataFrame(data, columns)

# dropping all columns
student_dataframe2 = student_dataframe.drop(*columns)

# displaying dataframe
student_dataframe2.show()

The new DataFrame has no columns, so show() prints one empty line per row, like this:

++
||
++
||
||
||
||
||
||
||
++

Using the drop() method with the col() function to drop columns

Instead of a column name, we can pass a Column object to drop(). The col() function returns a Column object that represents a column of a PySpark DataFrame. For example, I am going to drop last_name from the PySpark DataFrame. To do that, I first pass "last_name" to the col() function and then pass the result to the drop() method. To use the col() function, you have to import it from the pyspark.sql.functions module.

Let's see.

Example: Using drop() with col() to drop a single column

from pyspark.sql import SparkSession
from pyspark.sql.functions import col


data = [
    ("Pankaj", "Kumar", "BTech", "1550.50", "101", "2022-12-20"),
    ("Hari", "Sharma", "BCA", "1400.00", "102", "2018-03-12"),
    ("Anshika", "Kumari", "MCA", "1450.00", "103", "2029-05-19"),
    ("Shantanu", "Saini", "BSc", "1350.50", "104", "2019-08-20"),
    ("Avantika", "Srivastava", "BCom", "1350.00", "105", "2020-10-21"),
    ("Jay", "Kumar", "BTech", "1540.00", "106", "2019-08-29"),
    ("Vinay", "Singh", "BCA", "1480.50", "107", "2017-09-17"),
]

columns = [
    "first_name",
    "last_name",
    "course",
    "marks",
    "roll_number",
    "admission_date",
]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
student_dataframe = spark.createDataFrame(data, columns)

# dropping last_name by passing a Column object created with col()
student_dataframe2 = student_dataframe.drop(col("last_name"))
student_dataframe2.show()

Output

+----------+------+-------+-----------+--------------+
|first_name|course|  marks|roll_number|admission_date|
+----------+------+-------+-----------+--------------+
|    Pankaj| BTech|1550.50|        101|    2022-12-20|
|      Hari|   BCA|1400.00|        102|    2018-03-12|
|   Anshika|   MCA|1450.00|        103|    2029-05-19|
|  Shantanu|   BSc|1350.50|        104|    2019-08-20|
|  Avantika|  BCom|1350.00|        105|    2020-10-21|
|       Jay| BTech|1540.00|        106|    2019-08-29|
|     Vinay|   BCA|1480.50|        107|    2017-09-17|
+----------+------+-------+-----------+--------------+

Conclusion

I hope this tutorial was helpful and easy to understand. Throughout this article, we have seen how to drop one or multiple columns from a PySpark DataFrame, with an example for each case: a single column, multiple columns, all columns, and a column passed through col(). You can use whichever fits your project requirements.

If you found this article helpful, please share it and keep visiting for further PySpark tutorials.

Have a great day.
