
How to Change DataType of Column in PySpark DataFrame

Hi, in this article you will learn how to change the data type of a column in a PySpark DataFrame. In real-life PySpark projects we sometimes need to modify a column's type before we can perform operations on it. Throughout this article, we will explore multiple ways to change the data type of a column in a PySpark DataFrame.

To modify a column's type, you first need a PySpark DataFrame. Let's write some PySpark code to create one; if you already have a PySpark DataFrame, you can skip this part.

Creating PySpark DataFrame

I have created a simple PySpark DataFrame to demonstrate changing column types. The DataFrame holds some information about students: first name, last name, course, marks, roll number, and admission date.

from pyspark.sql import SparkSession

data = [
    ("Pankaj", "Kumar", "BTech", "1550.50", "101", "2022-12-20"),
    ("Hari", "Sharma", "BCA", "1400.00", "102", "2018-03-12"),
    ("Anshika", "Kumari", "MCA", "1450.00", "103", "2029-05-19"),
    ("Shantanu", "Saini", "BSc", "1350.50", "104", "2019-08-20"),
    ("Avantika", "Srivastava", "BCom", "1350.00", "105", "2020-10-21"),
    ("Jay", "Kumar", "BTech", "1540.00", "106", "2019-08-29"),
    ("Vinay", "Singh", "BCA", "1480.50", "107", "2017-09-17"),
]

columns = ["first_name", "last_name", "course", "marks", "roll_number", "admission_date"]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
student_dataframe = spark.createDataFrame(data, columns)

# displaying PySpark DataFrame
student_dataframe.show(truncate=False)

After executing the above code, the newly created PySpark DataFrame will look like this:

+----------+----------+------+-------+-----------+--------------+
|first_name|last_name |course|marks  |roll_number|admission_date|
+----------+----------+------+-------+-----------+--------------+
|Pankaj    |Kumar     |BTech |1550.50|101        |2022-12-20    |
|Hari      |Sharma    |BCA   |1400.00|102        |2018-03-12    |
|Anshika   |Kumari    |MCA   |1450.00|103        |2029-05-19    |
|Shantanu  |Saini     |BSc   |1350.50|104        |2019-08-20    |
|Avantika  |Srivastava|BCom  |1350.00|105        |2020-10-21    |
|Jay       |Kumar     |BTech |1540.00|106        |2019-08-29    |
|Vinay     |Singh     |BCA   |1480.50|107        |2017-09-17    |
+----------+----------+------+-------+-----------+--------------+

Let me explain the above PySpark code so that you have more clarity about what's going on inside it.

  • Firstly, I imported SparkSession from the pyspark.sql module.
  • Prepared a list of Python tuples; each tuple contains information about a student: first name, last name, course, marks, roll number, and date of admission.
  • Created a Python list containing the column names for the PySpark DataFrame.
  • Created a Spark session using SparkSession.builder.appName("testing").getOrCreate(), because the Spark session is the entry point of any Spark application.
  • Then I used the createDataFrame() method of the Spark session, passing the list of tuples and the column names, to create the PySpark DataFrame.
  • Finally, I displayed the created PySpark DataFrame using the DataFrame show() method.

How to Check the schema of PySpark DataFrame?

PySpark provides a DataFrame method called printSchema(). It prints the schema of the DataFrame: each column's name, data type, and nullability. Let's see how we can check the schema of the above-created DataFrame.

To print the schema, just call the printSchema() method on the DataFrame object. In my case, student_dataframe is the DataFrame object.

student_dataframe.printSchema()

The output will be:

root
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- course: string (nullable = true)
 |-- marks: string (nullable = true)
 |-- roll_number: string (nullable = true)
 |-- admission_date: string (nullable = true)

As you can see in the above schema tree, the data type of every column is string. Throughout this article, we will change column types in multiple ways; for example, we will change the marks column from string to float and the roll_number column from string to integer.
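
As a side note, if you prefer a plain Python structure over the printed tree, the DataFrame also exposes a dtypes attribute. A quick sketch, using the student_dataframe from above:

# dtypes returns a list of (column_name, type_name) tuples
print(student_dataframe.dtypes)
# e.g. [('first_name', 'string'), ('last_name', 'string'), ...]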

Why change a column's data type at all? Suppose we want to add up the marks of all students in the same course: we can't sum them meaningfully while they are strings. Similarly, comparison operators on the roll_number column won't behave numerically as long as its values are strings.
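
For instance, here is a minimal sketch of summing marks per course after converting the column to float (it uses the student_dataframe created above; the cast() method used here is covered in the next section):

from pyspark.sql import functions as F

# cast marks to float first, then sum the marks per course
totals = (
    student_dataframe
    .withColumn("marks", student_dataframe["marks"].cast("float"))
    .groupBy("course")
    .agg(F.sum("marks").alias("total_marks"))
)
totals.show()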

PySpark cast function

The PySpark cast() function is a method of the Column class, so it is applied to individual DataFrame columns. It takes the target data type as a parameter, either a type-name string or a DataType object, and returns the column converted to that type.
You can explore the PySpark Column class and its methods in our PySpark Column class article.
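
As a quick illustration, here are the two argument forms cast() accepts; this is a sketch of standalone column expressions, not yet attached to any DataFrame:

from pyspark.sql.functions import col
from pyspark.sql.types import FloatType

# Two equivalent ways to build the same cast expression on a Column:
marks_as_float = col("marks").cast("float")       # type name as a string
marks_as_float = col("marks").cast(FloatType())   # explicit DataType object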

Changing DataType of PySpark DataFrame Column

Let's see all the possible ways to change the data type of a column in a PySpark DataFrame.

Using DataFrame.withColumn() Method:

withColumn() is a PySpark DataFrame method, so it is called on an existing DataFrame. It takes two arguments: a column name and a column expression. It is used to add a new column or replace an existing one, and it always returns a new DataFrame.

In this example, we change the column type from string to float. The DataFrame created above has a column called marks that holds string values; we will convert it to float using the cast() method. The cast() method takes a parameter indicating the target data type, in this case 'float'.

Example: PySpark Change Column Type to Float

student_dataframe1 = student_dataframe.withColumn(
    "marks", student_dataframe["marks"].cast("float")
)
student_dataframe1.printSchema()

Output:

root
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- course: string (nullable = true)
 |-- marks: float (nullable = true)
 |-- roll_number: string (nullable = true)
 |-- admission_date: string (nullable = true)

As you can see in the above schema, the column type of marks has been changed successfully from string to float.
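
For reference, the same cast can also be written with the col() helper instead of DataFrame indexing; the two forms are equivalent:

from pyspark.sql.functions import col

# same cast as above, written with the col() helper
student_dataframe1 = student_dataframe.withColumn("marks", col("marks").cast("float"))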

Example: PySpark Change Data Type of Multiple Columns

We can also chain withColumn() calls to change the data type of multiple PySpark DataFrame columns. Let's see.

student_dataframe1 = student_dataframe.withColumn(
    "marks", student_dataframe["marks"].cast("float")
).withColumn("roll_number", student_dataframe["roll_number"].cast("integer"))
student_dataframe1.printSchema()

Output

root
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- course: string (nullable = true)
 |-- marks: float (nullable = true)
 |-- roll_number: integer (nullable = true)
 |-- admission_date: string (nullable = true)

As you can see, the data types of both the marks and roll_number columns have been changed.
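
If you have many columns to convert, chaining withColumn() calls by hand gets verbose. A loop over a name-to-type mapping is a common generalization; this is a sketch, and the type_map dict is a hypothetical example, not from the article:

# hypothetical mapping of column names to target types; adjust as needed
type_map = {"marks": "float", "roll_number": "integer"}

df_casted = student_dataframe
for name, new_type in type_map.items():
    # each withColumn() call returns a new DataFrame with one column replaced
    df_casted = df_casted.withColumn(name, df_casted[name].cast(new_type))
df_casted.printSchema()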

Using DataFrame.select() Method:

The select() method is also a PySpark DataFrame method. It takes column names (or column expressions) to select from the existing PySpark DataFrame and returns a new PySpark DataFrame.
While selecting columns in the select() method, we can apply the cast() method to particular columns.

Example: PySpark change data type to date

The above-created DataFrame has a column called admission_date with the string data type. In this example, I am going to change it from string to date.

student_dataframe1 = student_dataframe.select(
    'first_name',
    'last_name',
    'course',
    'marks',
    'roll_number',
    student_dataframe["admission_date"].cast("date"),
)

student_dataframe1.printSchema()

Output

root
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- course: string (nullable = true)
 |-- marks: string (nullable = true)
 |-- roll_number: string (nullable = true)
 |-- admission_date: date (nullable = true)

As you can see in the above schema, the data type of the admission_date column has been updated successfully from string to date.
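
Note that cast("date") works here because admission_date holds ISO-style strings like "2022-12-20". If your dates use a different layout, to_date() with an explicit pattern is the safer choice; a sketch, assuming the same column:

from pyspark.sql.functions import to_date

# to_date() takes an explicit pattern describing the source strings
student_dataframe.select(
    to_date(student_dataframe["admission_date"], "yyyy-MM-dd").alias("admission_date")
).printSchema()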

Example: PySpark Change Data Type of Multiple Columns

We can also change the data type of multiple columns in the select() method.

student_dataframe1 = student_dataframe.select(
    'first_name',
    'last_name',
    'course',
    student_dataframe["marks"].cast("float"),
    student_dataframe["roll_number"].cast("integer"),
    student_dataframe["admission_date"].cast("date"),
)

student_dataframe1.printSchema()

Output

root
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- course: string (nullable = true)
 |-- marks: float (nullable = true)
 |-- roll_number: integer (nullable = true)
 |-- admission_date: date (nullable = true)
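
A variant that scales better is to build the select list with a comprehension over DataFrame.columns, casting only the columns named in a mapping. This is a sketch; the casts dict is a hypothetical example:

from pyspark.sql.functions import col

# hypothetical mapping; columns not listed keep their current type
casts = {"marks": "float", "roll_number": "integer", "admission_date": "date"}
student_dataframe1 = student_dataframe.select(
    *[col(c).cast(casts[c]) if c in casts else col(c) for c in student_dataframe.columns]
)
student_dataframe1.printSchema()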

Using spark.sql():

We can also use the spark.sql() function to change the data type of a column in a PySpark DataFrame.

Let's see how we can change the data type of single and multiple PySpark DataFrame columns using the spark.sql() method. To use spark.sql(), we first have to register the created PySpark DataFrame as a temporary view, using the DataFrame createOrReplaceTempView() method. The temporary view remains available until the current Spark session is stopped.

The temporary view acts like a normal table in SQL.

# creating temporary view
student_dataframe.createOrReplaceTempView("student_data")
# changing type of column
df = spark.sql(
    "select first_name, last_name, marks, course, CAST(roll_number AS INTEGER), admission_date from student_data"
)
df.printSchema()
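
If you want SQL-style CAST syntax without registering a temp view, the DataFrame's selectExpr() method accepts the same expressions. An equivalent sketch; the AS alias keeps the column name explicit:

df = student_dataframe.selectExpr(
    "first_name", "last_name", "marks", "course",
    "CAST(roll_number AS INTEGER) AS roll_number", "admission_date",
)
df.printSchema()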

Example: Change Data Type of Multiple Columns in PySpark DataFrame

We can also use the spark.sql() method to cast the data type of multiple columns; here we change the data types of three columns: marks, roll_number, and admission_date.

# creating temporary view
student_dataframe.createOrReplaceTempView("student_data")

# changing the data type of columns
df = spark.sql(
    "select first_name, "
    "last_name, "
    "CAST(marks AS FLOAT), "
    "course, CAST(roll_number AS INTEGER), "
    "CAST(admission_date AS DATE) from student_data "
)

df.printSchema()

Output:

root
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- marks: float (nullable = true)
 |-- course: string (nullable = true)
 |-- roll_number: integer (nullable = true)
 |-- admission_date: date (nullable = true)

So this is how spark.sql() can be used to change the data type of PySpark columns.
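
One caveat worth knowing: with Spark's default (non-ANSI) settings, a value that cannot be converted becomes NULL instead of raising an error, so it pays to check for unexpected NULLs after casting. A small sketch, assuming the default configuration:

# 'abc' cannot be parsed as an integer, so the cast produces NULL by default
spark.sql("SELECT CAST('abc' AS INTEGER) AS bad_cast").show()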

Conclusion

I hope the process of changing the data type of a column in a PySpark DataFrame was easy and straightforward. We have covered multiple ways to change the data type of a column in a PySpark DataFrame.

If your PySpark DataFrame stores float values, integer values, or date values as strings and you want to perform operations on top of them, you will definitely need to convert the data type of those columns, because we cannot, for example, sum numbers that are stored as strings. To change the data type, you can go with any of the approaches that we have seen throughout this article.

If you found this article helpful, please share and keep visiting for further PySpark Tutorials.

Have a great day.
