How to Remove Time Part from PySpark DateTime Column

Hi, in this article we are going to see how to remove the time part from a PySpark DateTime column with the help of examples. This question may come up in data engineering interviews, which is why it is an important one for PySpark developers and data engineers.

To remove the time part, we first need a PySpark DataFrame with a column that holds both the date and the time.

Let's create a PySpark DataFrame with a DateTime column.

Remove Time Part from PySpark DateTime Column

Here, I have created a sample PySpark DataFrame with a dob column, which holds each student's date of birth as a DateTime string.

from pyspark.sql import SparkSession

# list of tuples
data = [
    ("1", "Vishvajit", "Rao", "2021-01-12 04:30:20"),
    ("2", "Harsh", "Goal", "2020-04-10 04:40:20"),
    ("3", "Pankaj", "Kumar", "2019-08-09 04:35:50"),
    ("4", "Pranjal", "Rao", "2013-11-12 02:11:20"),
    ("5", "Ritika", "Kumari", "2017-04-07 05:36:10"),
    ("6", "Diyanshu", "Saini", "2018-06-12 03:34:55"),
]


# columns
column_names = ["id", "first_name", "last_name", "dob"]

# creating spark session
spark = (
    SparkSession.builder.master("local[*]")
    .appName("www.programmingfunda.com")
    .getOrCreate()
)

# creating DataFrame
df = spark.createDataFrame(data=data, schema=column_names)
df.show(truncate=False)

Output:


+---+----------+---------+-------------------+
|id |first_name|last_name|dob                |
+---+----------+---------+-------------------+
|1  |Vishvajit |Rao      |2021-01-12 04:30:20|
|2  |Harsh     |Goal     |2020-04-10 04:40:20|
|3  |Pankaj    |Kumar    |2019-08-09 04:35:50|
|4  |Pranjal   |Rao      |2013-11-12 02:11:20|
|5  |Ritika    |Kumari   |2017-04-07 05:36:10|
|6  |Diyanshu  |Saini    |2018-06-12 03:34:55|
+---+----------+---------+-------------------+

We have successfully created the PySpark DataFrame. Now it's time to remove the time part from the dob column.
To do this, we have to follow these steps.

1. Convert the string DateTime column to a timestamp column.

new_df = df.withColumn("dob_col", to_timestamp('dob', 'yyyy-MM-dd HH:mm:ss'))

2. Use the PySpark to_date() function to extract the date part from the timestamp column created above.

new_df = new_df.withColumn("dob_col", to_date('dob_col', 'yyyy-MM-dd'))

3. Delete or replace the old column called 'dob'. Here I am dropping the old column. drop() returns a new DataFrame, so assign the result back.

new_df = new_df.drop('dob')

Note: To use the to_timestamp() and to_date() functions, you have to import them from the pyspark.sql.functions module (from pyspark.sql.functions import to_date, to_timestamp).

And finally, your result will be:


+---+----------+---------+----------+
|id |first_name|last_name|dob_col   |
+---+----------+---------+----------+
|1  |Vishvajit |Rao      |2021-01-12|
|2  |Harsh     |Goal     |2020-04-10|
|3  |Pankaj    |Kumar    |2019-08-09|
|4  |Pranjal   |Rao      |2013-11-12|
|5  |Ritika    |Kumari   |2017-04-07|
|6  |Diyanshu  |Saini    |2018-06-12|
+---+----------+---------+----------+

Complete Source Code

You can get the complete source code from here.

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, to_timestamp


# list of tuples
data = [
    ("1", "Vishvajit", "Rao", "2021-01-12 04:30:20"),
    ("2", "Harsh", "Goal", "2020-04-10 04:40:20"),
    ("3", "Pankaj", "Kumar", "2019-08-09 04:35:50"),
    ("4", "Pranjal", "Rao", "2013-11-12 02:11:20"),
    ("5", "Ritika", "Kumari", "2017-04-07 05:36:10"),
    ("6", "Diyanshu", "Saini", "2018-06-12 03:34:55"),
]


# columns
column_names = ["id", "first_name", "last_name", "dob"]

# creating spark session
spark = (
    SparkSession.builder.master("local[*]")
    .appName("www.programmingfunda.com")
    .getOrCreate()
)

# creating DataFrame
df = spark.createDataFrame(data=data, schema=column_names)
new_df = df.withColumn("dob_col", to_timestamp('dob', 'yyyy-MM-dd HH:mm:ss'))
new_df = new_df.withColumn("dob_col", to_date('dob_col', 'yyyy-MM-dd'))
new_df = new_df.drop('dob')
new_df.show(truncate=False)

This is how you can extract the date from a PySpark DataFrame DateTime column.

Conclusion

So, in this article, we have seen how to remove the time part from a PySpark DateTime column with the help of examples. This is an important question from the interview point of view.

As a Data Engineer or PySpark Developer, you should know how to answer it.

To find this solution again, bookmark this site. If you found this article helpful, please share it and keep visiting for further PySpark tutorials.

Thanks for reading 🙏🙏