
Merge Two DataFrames in PySpark with Same Column Names

Hi PySpark Developers, in this PySpark recipe you are going to learn how to merge two DataFrames in PySpark that have the same column names, using various approaches. As a PySpark developer, you should know how to merge two DataFrames with the same columns, because in real-life projects you will often have to merge two PySpark DataFrames that share the same column names and structure. Throughout this article, we will look at a total of three ways to merge two PySpark DataFrames that have the same column names and data types.

PySpark provides some DataFrame methods to merge two or more DataFrames together, but you have to remember one thing: the column names and data types of all the DataFrames must be the same.

Let's create both PySpark DataFrames one by one and then merge them using the different methods.

Creating First PySpark DataFrame

Here, I have created my first PySpark DataFrame, called dataframe_one. The DataFrame has five columns: first_name, last_name, age, department, and salary. The data type of all the columns is string, except age and salary, which are long.

from pyspark.sql import SparkSession


data1 = [
    ("Vikas", "Kumar", 25, "IT", 24000),
    ("Sachin", "Singh", 26, "HR", 23000),
    ("Hanshika", "Kumari", 15, "IT", 23000),
    ("Pankaj", "Singh", 28, "Finance", 25000),
    ("Shyam", "Singh", 24, "IT", 35000),
    ("Vinay", "Kumari", 25, "Finance", 24000),
    ("Vishal", "Srivastava", 21, "HR", 27000),
]


columns = ["first_name", "last_name", "age", "department", "salary"]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
dataframe_one = spark.createDataFrame(data1, columns)

# displaying dataframe
dataframe_one.show()

After executing the above code, the DataFrame will look like this:

+----------+----------+---+----------+------+
|first_name| last_name|age|department|salary|
+----------+----------+---+----------+------+
|     Vikas|     Kumar| 25|        IT| 24000|
|    Sachin|     Singh| 26|        HR| 23000|
|  Hanshika|    Kumari| 15|        IT| 23000|
|    Pankaj|     Singh| 28|   Finance| 25000|
|     Shyam|     Singh| 24|        IT| 35000|
|     Vinay|    Kumari| 25|   Finance| 24000|
|    Vishal|Srivastava| 21|        HR| 27000|
+----------+----------+---+----------+------+

Let’s see the schema or data type of the above DataFrame using the printSchema() method.

dataframe_one.printSchema()

After applying printSchema() to the DataFrame, you will see the schema of the DataFrame printed as a tree structure.

root
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: long (nullable = true)

As you can see in the above schema, the data type of first_name, last_name, and department is string, while the data type of the age and salary columns is long.

Let's create our second PySpark DataFrame with the same column names and the same data types.

Creating second PySpark DataFrame

Here, I have created my second PySpark DataFrame, called dataframe_two. dataframe_two also has five columns: first_name, last_name, age, department, and salary. The data type of all the columns is string, except age and salary, which are long.

from pyspark.sql import SparkSession

data2 = [
    ("John", "Doe", 30, "IT", 45000),
    ("Sakshi", "Kaushik", 23, "HR", 18000),
    ("Ayush", "Shukla", 27, "Account", 30000),
    ("Vishvajit", "Rao", 28, "IT", 25000),
    ("Ankita", "Singh", 29, "HR", 35000),
    ("Raghu", "Pandey", 26, "Account", 24000),
    ("Pritam", "Srivastava", 35, "Account", 40000),
]


columns = ["first_name", "last_name", "age", "department", "salary"]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
dataframe_two = spark.createDataFrame(data2, columns)

# displaying dataframe
dataframe_two.show()


# displaying schema
dataframe_two.printSchema()

After the successful execution of the above code, the DataFrame and its schema will look like this:

Output

+----------+----------+---+----------+------+
|first_name| last_name|age|department|salary|
+----------+----------+---+----------+------+
|      John|       Doe| 30|        IT| 45000|
|    Sakshi|   Kaushik| 23|        HR| 18000|
|     Ayush|    Shukla| 27|   Account| 30000|
| Vishvajit|       Rao| 28|        IT| 25000|
|    Ankita|     Singh| 29|        HR| 35000|
|     Raghu|    Pandey| 26|   Account| 24000|
|    Pritam|Srivastava| 35|   Account| 40000|
+----------+----------+---+----------+------+

root
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: long (nullable = true)

You can see that the column names and schemas of both PySpark DataFrames, dataframe_one and dataframe_two, are the same. Now it's time to see multiple ways to merge two DataFrames in PySpark with the same column names.
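
If you want to confirm programmatically that the two DataFrames really do have the same schema before merging them, you can compare their schema objects directly. Below is a minimal sketch, assuming dataframe_one and dataframe_two from the snippets above are available in the same Spark session.

# optional sanity check: DataFrame.schema returns a StructType, and two
# StructTypes compare equal when column names, types, and order all match
if dataframe_one.schema == dataframe_two.schema:
    print("Schemas match, safe to merge")
else:
    print("Schemas differ:")
    print(dataframe_one.schema.simpleString())
    print(dataframe_two.schema.simpleString())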

Merge Two DataFrames in PySpark with Same Column Names

PySpark provides some DataFrame methods that are used to merge two or more DataFrames together. Here, I have listed all the methods along with their descriptions.

union(): A PySpark DataFrame method that is used to combine two PySpark DataFrames together. It is always called on top of a DataFrame and takes another DataFrame as a parameter in order to combine them.
unionAll(): unionAll() is also a PySpark DataFrame method that is used to combine two PySpark DataFrames together. Like union(), it is called on top of a DataFrame and takes another DataFrame as a parameter.
unionByName(): The unionByName() method is a little bit different from union() and unionAll() because it merges the DataFrames by the names of the columns, not by position.

Let’s see all these methods one by one with the help of an example.

Merge Two DataFrames in PySpark with the same column name using the union() method

The union() method is a PySpark DataFrame method that combines two PySpark DataFrames and returns a new DataFrame. It is similar to UNION ALL in SQL (Structured Query Language). It is called on a PySpark DataFrame and takes another DataFrame as a parameter.

Example:- Merging two PySpark DataFrames using the union() method

from pyspark.sql import SparkSession

data1 = [
    ("Vikas", "Kumar", 25, "IT", 24000),
    ("Sachin", "Singh", 26, "HR", 23000),
    ("Hanshika", "Kumari", 15, "IT", 23000),
    ("Pankaj", "Singh", 28, "Finance", 25000),
    ("Shyam", "Singh", 24, "IT", 35000),
    ("Vinay", "Kumari", 25, "Finance", 24000),
    ("Vishal", "Srivastava", 21, "HR", 27000),
]

data2 = [
    ("John", "Doe", 30, "IT", 45000),
    ("Sakshi", "Kaushik", 23, "HR", 18000),
    ("Ayush", "Shukla", 27, "Account", 30000),
    ("Vishvajit", "Rao", 28, "IT", 25000),
    ("Ankita", "Singh", 29, "HR", 35000),
    ("Raghu", "Pandey", 26, "Account", 24000),
    ("Pritam", "Srivastava", 35, "Account", 40000),
]

columns = ["first_name", "last_name", "age", "department", "salary"]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating first dataframe
dataframe_one = spark.createDataFrame(data1, columns)

# creating second dataframe
dataframe_two = spark.createDataFrame(data2, columns)

# merging both DataFrames using union() method
final_dataframe = dataframe_one.union(dataframe_two)

# displaying final dataframe
final_dataframe.show()

After merging both PySpark DataFrames using the union() method, the final DataFrame will be:

+----------+----------+---+----------+------+
|first_name| last_name|age|department|salary|
+----------+----------+---+----------+------+
|     Vikas|     Kumar| 25|        IT| 24000|
|    Sachin|     Singh| 26|        HR| 23000|
|  Hanshika|    Kumari| 15|        IT| 23000|
|    Pankaj|     Singh| 28|   Finance| 25000|
|     Shyam|     Singh| 24|        IT| 35000|
|     Vinay|    Kumari| 25|   Finance| 24000|
|    Vishal|Srivastava| 21|        HR| 27000|
|      John|       Doe| 30|        IT| 45000|
|    Sakshi|   Kaushik| 23|        HR| 18000|
|     Ayush|    Shukla| 27|   Account| 30000|
| Vishvajit|       Rao| 28|        IT| 25000|
|    Ankita|     Singh| 29|        HR| 35000|
|     Raghu|    Pandey| 26|   Account| 24000|
|    Pritam|Srivastava| 35|   Account| 40000|
+----------+----------+---+----------+------+

Merge Two DataFrames in PySpark with the same column name using the unionAll() method

The unionAll() method is also a PySpark DataFrame method that combines two PySpark DataFrames and returns a new DataFrame. It is similar to UNION ALL in SQL (Structured Query Language). It is called on a PySpark DataFrame and takes another DataFrame as a parameter.

Example:- Merging two PySpark DataFrames using the unionAll() method

from pyspark.sql import SparkSession

data1 = [
    ("Vikas", "Kumar", 25, "IT", 24000),
    ("Sachin", "Singh", 26, "HR", 23000),
    ("Hanshika", "Kumari", 15, "IT", 23000),
    ("Pankaj", "Singh", 28, "Finance", 25000),
    ("Shyam", "Singh", 24, "IT", 35000),
    ("Vinay", "Kumari", 25, "Finance", 24000),
    ("Vishal", "Srivastava", 21, "HR", 27000),
]

data2 = [
    ("John", "Doe", 30, "IT", 45000),
    ("Sakshi", "Kaushik", 23, "HR", 18000),
    ("Ayush", "Shukla", 27, "Account", 30000),
    ("Vishvajit", "Rao", 28, "IT", 25000),
    ("Ankita", "Singh", 29, "HR", 35000),
    ("Raghu", "Pandey", 26, "Account", 24000),
    ("Pritam", "Srivastava", 35, "Account", 40000),
]

columns = ["first_name", "last_name", "age", "department", "salary"]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating first dataframe
dataframe_one = spark.createDataFrame(data1, columns)

# creating second dataframe
dataframe_two = spark.createDataFrame(data2, columns)

# merging both DataFrames using unionAll() method
final_dataframe = dataframe_one.unionAll(dataframe_two)

# displaying final dataframe
final_dataframe.show()

After merging both PySpark DataFrames using the unionAll() method, the final DataFrame will be:

+----------+----------+---+----------+------+
|first_name| last_name|age|department|salary|
+----------+----------+---+----------+------+
|     Vikas|     Kumar| 25|        IT| 24000|
|    Sachin|     Singh| 26|        HR| 23000|
|  Hanshika|    Kumari| 15|        IT| 23000|
|    Pankaj|     Singh| 28|   Finance| 25000|
|     Shyam|     Singh| 24|        IT| 35000|
|     Vinay|    Kumari| 25|   Finance| 24000|
|    Vishal|Srivastava| 21|        HR| 27000|
|      John|       Doe| 30|        IT| 45000|
|    Sakshi|   Kaushik| 23|        HR| 18000|
|     Ayush|    Shukla| 27|   Account| 30000|
| Vishvajit|       Rao| 28|        IT| 25000|
|    Ankita|     Singh| 29|        HR| 35000|
|     Raghu|    Pandey| 26|   Account| 24000|
|    Pritam|Srivastava| 35|   Account| 40000|
+----------+----------+---+----------+------+

Merge Two DataFrames in PySpark with the same column name using the unionByName() method

The unionByName() method is also a PySpark DataFrame method that combines two PySpark DataFrames and returns a new DataFrame. It is a bit different from union() and unionAll() because it merges by column name rather than by position, and it takes an additional parameter called allowMissingColumns, which accepts True or False and is set to False by default.

When allowMissingColumns is set to True, any column that exists in one DataFrame but not in the other is filled with null values in the merged result; the columns that the two DataFrames do share must have matching names. Like the other methods, it is called on a PySpark DataFrame and takes another DataFrame as a parameter.
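
To make the allowMissingColumns behaviour concrete, here is a minimal sketch. It uses two small hypothetical DataFrames, df_a and df_b (not part of the examples above), where df_b does not have the bonus column. Note that allowMissingColumns was added in Spark 3.1.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("testing").getOrCreate()

# df_a has an extra "bonus" column that df_b does not have
df_a = spark.createDataFrame(
    [("Vikas", 24000, 2000)], ["first_name", "salary", "bonus"]
)
df_b = spark.createDataFrame(
    [("John", 45000)], ["first_name", "salary"]
)

# with allowMissingColumns=True, the "bonus" column missing from df_b
# is filled with nulls in the merged result
merged = df_a.unionByName(df_b, allowMissingColumns=True)
merged.show()
# John's row has null in the bonus column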

Example:- Merging two PySpark DataFrames using the unionByName() method

from pyspark.sql import SparkSession

data1 = [
    ("Vikas", "Kumar", 25, "IT", 24000),
    ("Sachin", "Singh", 26, "HR", 23000),
    ("Hanshika", "Kumari", 15, "IT", 23000),
    ("Pankaj", "Singh", 28, "Finance", 25000),
    ("Shyam", "Singh", 24, "IT", 35000),
    ("Vinay", "Kumari", 25, "Finance", 24000),
    ("Vishal", "Srivastava", 21, "HR", 27000),
]

data2 = [
    ("John", "Doe", 30, "IT", 45000),
    ("Sakshi", "Kaushik", 23, "HR", 18000),
    ("Ayush", "Shukla", 27, "Account", 30000),
    ("Vishvajit", "Rao", 28, "IT", 25000),
    ("Ankita", "Singh", 29, "HR", 35000),
    ("Raghu", "Pandey", 26, "Account", 24000),
    ("Pritam", "Srivastava", 35, "Account", 40000),
]

columns = ["first_name", "last_name", "age", "department", "salary"]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating first dataframe
dataframe_one = spark.createDataFrame(data1, columns)

# creating second dataframe
dataframe_two = spark.createDataFrame(data2, columns)

# merging both DataFrames using unionByName() method
final_dataframe = dataframe_one.unionByName(dataframe_two)

# displaying final dataframe
final_dataframe.show()

After merging both PySpark DataFrames using the unionByName() method, the final DataFrame will be:

+----------+----------+---+----------+------+
|first_name| last_name|age|department|salary|
+----------+----------+---+----------+------+
|     Vikas|     Kumar| 25|        IT| 24000|
|    Sachin|     Singh| 26|        HR| 23000|
|  Hanshika|    Kumari| 15|        IT| 23000|
|    Pankaj|     Singh| 28|   Finance| 25000|
|     Shyam|     Singh| 24|        IT| 35000|
|     Vinay|    Kumari| 25|   Finance| 24000|
|    Vishal|Srivastava| 21|        HR| 27000|
|      John|       Doe| 30|        IT| 45000|
|    Sakshi|   Kaushik| 23|        HR| 18000|
|     Ayush|    Shukla| 27|   Account| 30000|
| Vishvajit|       Rao| 28|        IT| 25000|
|    Ankita|     Singh| 29|        HR| 35000|
|     Raghu|    Pandey| 26|   Account| 24000|
|    Pritam|Srivastava| 35|   Account| 40000|
+----------+----------+---+----------+------+

Note:- DataFrame methods like union(), unionAll(), and unionByName() do not remove duplicate rows by default. If you want to remove duplicate rows, you have to call the distinct() method on the final PySpark DataFrame, as you can see below.

# remove duplicate rows
unique_dataframe = final_dataframe.distinct()

# displaying DataFrame
unique_dataframe.show()
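
To see why this matters, here is a small sketch that merges dataframe_one (created earlier) with itself: union() keeps every row, so the row count doubles, and distinct() brings it back down.

# union() keeps exact duplicates, so merging a DataFrame with itself doubles the row count
doubled = dataframe_one.union(dataframe_one)
print(doubled.count())             # 14 (7 + 7)

# distinct() removes the duplicate rows again
print(doubled.distinct().count())  # 7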

Conclusion

So, throughout this PySpark tutorial, we have covered PySpark DataFrame's union(), unionAll(), and unionByName() methods for merging two DataFrames in PySpark with the same column names.
You can use any one of them at your convenience. This is going to be very helpful when you want to load data from different sources that have the same column names and merge them while performing transformations.

If you found this article helpful, please share and keep visiting for further PySpark tutorials.

Have a nice day….
Happy Coding


FAQs (Frequently Asked Questions)

What is PySpark DataFrame?

Ans:- A PySpark DataFrame is a distributed collection of data organized into named columns. It stores data in the form of rows and columns and is similar to a table in an RDBMS (Relational Database Management System).

PySpark union() vs unionAll()

Ans:- PySpark union() and unionAll() are both transformation methods that are used to merge two PySpark DataFrames that have the same schema and structure. If the schemas of the two DataFrames are different, these methods will raise an error.
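
In Spark 2.0 and later, unionAll() behaves the same as union(): both merge by column position and keep duplicate rows. A quick sketch, reusing dataframe_one and dataframe_two from the examples above:

# both methods merge by column position and keep duplicate rows
by_union = dataframe_one.union(dataframe_two)
by_union_all = dataframe_one.unionAll(dataframe_two)

# the two results contain exactly the same rows
print(by_union.count() == by_union_all.count())   # True
print(by_union.exceptAll(by_union_all).count())   # 0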

PySpark union() vs unionByName()

Ans:- The major difference between these two methods is that union() merges the DataFrames by column position, while unionByName() merges them by column name.
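
The difference is easiest to see when the column order differs between the two DataFrames. Here is a minimal sketch with two small hypothetical DataFrames, df_left and df_right, whose columns appear in a different order:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("testing").getOrCreate()

df_left = spark.createDataFrame([("Vikas", "Kumar")], ["first_name", "last_name"])
df_right = spark.createDataFrame([("Doe", "John")], ["last_name", "first_name"])

# union() matches columns by position, so df_right's last_name values
# end up under first_name (row 2: first_name=Doe, last_name=John)
df_left.union(df_right).show()

# unionByName() matches columns by name, so the values line up correctly
# (row 2: first_name=John, last_name=Doe)
df_left.unionByName(df_right).show()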
