How to Format a String in PySpark DataFrame using Column Values

In this PySpark article, we will see how to format a string in a PySpark DataFrame using column values, with the help of an example. PySpark provides a string function called format_string() that builds a formatted string from the values of DataFrame columns.

Note:- This question might be asked in PySpark interviews.

PySpark Sample DataFrame

To apply the format_string() function, we need a PySpark DataFrame, so I have created a sample CSV file with a few records, as you can see below.

employees.csv
emp_full_name,emp_email,emp_gender,emp_salary,emp_department,date_of_joining,age
Mayank Kumar,[email protected],Male,25000,BPO,11/1/2023,25
Vishvajit Rao,[email protected],Male,40000,IT,11/2/2023,30
Harshita Mathur,[email protected],Female,20000,Sales,11/3/2023,23
Kavya Singh,[email protected],Female,20000,SEO,11/4/2023,24
Vishal Kumar,[email protected],Male,60000,IT,11/5/2023,28
Vaishali Mehta,[email protected],Female,35000,SEO,11/6/2023,27
Vaishali Mehta,[email protected],Female,35000,SEO,11/6/2023,25
James Bond,[email protected],Male,42000,IT,11/7/2023,23
Mariya Katherine,[email protected],Female,32000,Sales,11/8/2023,29
Mariya Katherine,[email protected],Female,40000,Sales,11/8/2023,31
Harshali Kumari,[email protected],Female,21000,BPO,11/9/2023,20
Vinay Singh,[email protected],Male,18000,BPO,11/10/2023,24
Vinay Mehra,[email protected],Male,45000,IT,11/11/2023,33
Akshara Singh,[email protected],Female,55000,IT,11/12/2023,30

Next, I create a PySpark DataFrame from the CSV file with the help of the csv() method of the PySpark DataFrameReader class.

Example: Creating a DataFrame from a CSV file

from pyspark.sql import SparkSession

# creating spark session
spark = (
    SparkSession.builder.master("local[*]")
    .appName("www.programmingfunda.com")
    .getOrCreate()
)

# creating DataFrame
df = spark.read.option("header", "true").csv("../Datasets/employees.csv")
df.show(truncate=False)

After executing the above code, a new PySpark DataFrame is created and df.show() prints its rows.

Figure: Creating a DataFrame from a CSV file

Now, I am going to create a new column that stores a formatted string like “My name is ABC and I am x years old, Thanks”, where ABC is replaced with the value of the emp_full_name column and x is replaced with the value of the age column.

Format a String in PySpark DataFrame using Column Values

Before applying the format_string() function, let’s look briefly at the function and its parameters.

format_string():- A string function in PySpark that formats its arguments in printf-style and returns the result as a string column.

It takes two parameters:

  • format:- A printf-style format string containing placeholders such as %s, each of which is replaced by the value of the corresponding column.
  • cols:- The column names or Column objects whose values fill the placeholders, in order.

Note:- To use the format_string() function, we have to import it from the pyspark.sql.functions module.
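To get a feel for printf-style placeholders before touching Spark, the same template can be tried with Python's own % operator. This is only an analogy, not Spark itself: format_string() is evaluated by Spark on the JVM, so behaviour may differ for less common format specifiers. The sample values are taken from the first row of employees.csv.

```python
# Plain-Python analogy for printf-style formatting (not Spark itself).
template = "My name is %s and I am %s years old, Thanks"

# Values from the first row of employees.csv. The age is kept as a string,
# which is how Spark reads CSV columns when no schema is inferred.
name, age = "Mayank Kumar", "25"

# Each %s is replaced, in order, by the next value in the tuple.
intro = template % (name, age)
print(intro)
```

The order of the values matters: the first placeholder takes the first value, the second placeholder takes the second value, and so on.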

Let’s apply the format_string() function to make a new column intro with the help of the other column values.

Example: Using the PySpark format_string() function

from pyspark.sql import SparkSession
from pyspark.sql.functions import format_string


# creating spark session
spark = (
    SparkSession.builder.master("local[*]")
    .appName("www.programmingfunda.com")
    .getOrCreate()
)

# creating DataFrame
df = spark.read.option("header", "true").csv("../Datasets/employees.csv")

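# adding the intro column; without inferSchema every CSV column is read
# as a string, so %s is used for both emp_full_name and age
new_df = df.withColumn(
    "intro",
    format_string("My name is %s and I am %s years old, Thanks", df.emp_full_name, df.age),
)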
new_df.show(truncate=False)

After executing the above code, the new DataFrame contains an extra intro column holding the formatted string for each row.

Figure: PySpark format_string function Example

Conclusion

So in this article, we have seen how to format a string in a PySpark DataFrame using column values, with the help of an example. The format_string() function takes a printf-style format string and fills its placeholders with the values of the given DataFrame columns.

This question might be asked in PySpark interviews. If you found this article helpful, please share it and keep visiting for further PySpark tutorials.

Thanks for visiting.
