
How to Apply groupBy in PySpark DataFrame


In this article, you will learn how to apply groupBy in a PySpark DataFrame with multiple aggregate functions, with the help of examples.

In real-life projects, we sometimes want to generate a summary of similar records in a given PySpark DataFrame. Grouping is one of the most useful concepts in PySpark because it comes up in almost every real-life Spark application.

If you already know GROUP BY in SQL, the Group By concept in PySpark will be very easy to understand because both work the same way.

Before we dive deep into this article, let's understand what PySpark groupBy is and why we need to use it.

What is groupBy in PySpark?

The groupBy() or groupby() method in PySpark is the same as GROUP BY in SQL. It summarizes rows with identical values into a single row with the help of aggregate functions like avg, max, min, sum, count, etc.
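For intuition, here is a minimal sketch of that equivalence. Note that df and the students table here are hypothetical placeholders; the actual DataFrame used throughout this article is created below.

# PySpark: sum of marks per course (df is a placeholder DataFrame
# with "course" and "marks" columns)
df.groupBy("course").sum("marks").show()

# The equivalent SQL statement would be:
# SELECT course, SUM(marks) FROM students GROUP BY course;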

Why do we need to use Group By in PySpark?

To understand this, let me describe a scenario. Suppose we have some data about students, comprising information like first_name, last_name, gender, marks, course, roll_number, and so on. Now I may want the sum of all students' marks in each course, the average marks per course at the gender level, or a count of the male and female students in each course.

There are a lot of scenarios in which groupby or groupBy is used in a PySpark DataFrame. It all depends on your requirement.

This is why the Group By statement is so useful in real-life projects: it lets us get summarized information about the data.

Let's explore PySpark DataFrame groupBy with different aggregate functions. Obviously, to apply Group By we must first have a PySpark DataFrame.

Creating PySpark DataFrame

I have created a sample PySpark DataFrame for the demonstrations in this article; you can skip this part if you already have a PySpark DataFrame.

from pyspark.sql import SparkSession


data = [
    ("Pankaj", "Kumar", "Male", "BTech", 1550, 101, "2022-12-20"),
    ("Harshita", "Kumari", "Female", "BCom", 1250, 102, "2022-12-20"),
    ("Hari", "Sharma", "Male", "BCA", 1400, 103, "2018-03-12"),
    ("Anshika", "Kumari", "Female", "MCA", 1550, 104, "2029-05-19"),
    ("Shantanu", "Saini", "Male", "BSc", 1350, 105, "2019-08-20"),
    ("Avantika", "Srivastava", "Female", "BCom", 1350, 106, "2020-10-21"),
    ("Jay", "Kumar", "Male", "BTech", 1450, 107, "2019-08-29"),
    ("Vinay", "Singh", "Male", "BCA", 1480, 108, "2017-09-17"),
    ("Pyush", "Rao", "Male", "BSc", 1400, 109, "2019-08-20"),
    ("Jackson", "Doe", "Male", "MCA", 1600, 1010, "2029-01-10"),
    ("Shashank", "Gupta", "Male", "BTech", 1750, 1011, "2029-03-11"),
    ("Vaibhav", "Pathak", "Male", "BCA", 1450, 1012, "2018-02-10"),
    ("Harshali", "Kaushik", "Female", "BTech", 1650, 1013, "2019-11-10"),
    ("Pyush", "Goal", "Male", "BSc", 1350, 1014, "2019-08-20"),
    ("Manshi", "Kumari", "Female", "BCom", 1290, 1015, "2022-12-25"),

]

columns = [
    "first_name",
    "last_name",
    "gender",
    "course",
    "marks",
    "roll_number",
    "admission_date",
]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
student_dataframe = spark.createDataFrame(data, columns)
student_dataframe.show()

After executing the above PySpark code, the created DataFrame will look like this.

+----------+----------+------+------+-----+-----------+--------------+
|first_name| last_name|gender|course|marks|roll_number|admission_date|
+----------+----------+------+------+-----+-----------+--------------+
|    Pankaj|     Kumar|  Male| BTech| 1550|        101|    2022-12-20|
|  Harshita|    Kumari|Female|  BCom| 1250|        102|    2022-12-20|
|      Hari|    Sharma|  Male|   BCA| 1400|        103|    2018-03-12|
|   Anshika|    Kumari|Female|   MCA| 1550|        104|    2029-05-19|
|  Shantanu|     Saini|  Male|   BSc| 1350|        105|    2019-08-20|
|  Avantika|Srivastava|Female|  BCom| 1350|        106|    2020-10-21|
|       Jay|     Kumar|  Male| BTech| 1450|        107|    2019-08-29|
|     Vinay|     Singh|  Male|   BCA| 1480|        108|    2017-09-17|
|     Pyush|       Rao|  Male|   BSc| 1400|        109|    2019-08-20|
|   Jackson|       Doe|  Male|   MCA| 1600|       1010|    2029-01-10|
|  Shashank|     Gupta|  Male| BTech| 1750|       1011|    2029-03-11|
|   Vaibhav|    Pathak|  Male|   BCA| 1450|       1012|    2018-02-10|
|  Harshali|   Kaushik|Female| BTech| 1650|       1013|    2019-11-10|
|     Pyush|      Goal|  Male|   BSc| 1350|       1014|    2019-08-20|
|    Manshi|    Kumari|Female|  BCom| 1290|       1015|    2022-12-25|
+----------+----------+------+------+-----+-----------+--------------+

Let me explain the above code so that you have more clarity about it.

  1. Firstly, I imported SparkSession from the pyspark.sql package.
  2. Prepared a list of Python tuples; each tuple contains information about a student: first_name, last_name, gender, course, marks, roll_number, and admission_date.
  3. Created a Python list containing the column names for the PySpark DataFrame.
  4. Created a Spark session using SparkSession.builder.appName("testing").getOrCreate(), because the Spark session is the entry point of any Spark application.
  5. Then I used the createDataFrame() method of the Spark session and passed the list of tuples and the column list to it in order to create the PySpark DataFrame.
  6. Finally, displayed the created PySpark DataFrame.

Now it’s time to apply groupby on PySpark DataFrame along with various aggregate functions.

How to apply groupBy in PySpark DataFrame?

To apply group by on top of a PySpark DataFrame, PySpark provides two methods called groupby() and groupBy(). Both are DataFrame methods that take one or more column names as parameters and group rows by their identical values, returning a GroupedData object on which aggregate functions can then be applied.

Remember one thing: we have to use at least one aggregate function along with groupby() or groupBy() in order to get a DataFrame back and see the effect of the grouping.
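To see what this means in practice, here is a small sketch run against the student_dataframe created above. groupBy() on its own returns a GroupedData object, and only applying an aggregate function turns it back into a DataFrame.

# groupBy() alone returns a GroupedData object, not a DataFrame
grouped = student_dataframe.groupBy("course")
print(type(grouped))  # <class 'pyspark.sql.group.GroupedData'>

# applying an aggregate function returns a DataFrame again
result = grouped.count()
print(type(result))   # <class 'pyspark.sql.dataframe.DataFrame'>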

I have listed some popular aggregate functions that we can use with the groupby() and groupBy() methods.

  • sum():- It returns the total sum of the values in each group.
  • count():- It returns the total count of values in each group.
  • avg():- It returns the average of the values in each group.
  • max():- It returns the maximum value in each group.
  • min():- It returns the minimum value in each group.
  • mean():- It returns the mean of the values in each group (an alias of avg()).
  • agg():- It is not an aggregate function itself, but with it we can perform more than one aggregation at a time.

Let me explain all the above PySpark aggregate functions along with examples.

Note:- You can use either groupby() or groupBy() because both are aliases of the same method.
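As a quick sketch, both spellings can be used interchangeably and produce the same result:

# both spellings group the same way and give identical results
counts_a = student_dataframe.groupby('course').count()
counts_b = student_dataframe.groupBy('course').count()
# counts_a and counts_b contain the same rows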

I am about to apply all the aggregate functions on top of the above-created PySpark DataFrame.

PySpark DataFrame groupBy count():

Here I want to count the total number of students in each course, female and male included, because this is my project requirement. To count the total number of students in each course, we group by the course column and use the count() aggregate function.

students_count = student_dataframe.groupBy('course').count()
students_count.show()

Output

+------+-----+
|course|count|
+------+-----+
| BTech|    4|
|  BCom|    3|
|   BCA|    3|
|   MCA|    2|
|   BSc|    3|
+------+-----+

As you can see in the above table, the total number of students in each course has been calculated.

PySpark DataFrame groupBy sum():

The sum() aggregate function is used to calculate the sum of the values in each group. For example, I want to calculate the total marks of the students in each course.
To calculate the sum of marks, we have to pass the marks column to the sum() function.

students_sum = student_dataframe.groupBy('course').sum('marks')
students_sum.show()

Output

+------+----------+
|course|sum(marks)|
+------+----------+
| BTech|      6400|
|  BCom|      3890|
|   BCA|      4330|
|   MCA|      3150|
|   BSc|      4100|
+------+----------+

PySpark DataFrame groupBy avg():

The avg() aggregate function works just like the sum() aggregate function, except that it returns the average of the values in each group instead of their total.

students_avg = student_dataframe.groupBy('course').avg('marks')
students_avg.show()

Output

+------+------------------+
|course|        avg(marks)|
+------+------------------+
| BTech|            1600.0|
|  BCom|1296.6666666666667|
|   BCA|1443.3333333333333|
|   MCA|            1575.0|
|   BSc|1366.6666666666667|
+------+------------------+

PySpark DataFrame groupBy mean():

The mean() aggregate function calculates the mean of the values in each group. It is an alias of avg(), which is why the output column below is still named avg(marks).

students_mean = student_dataframe.groupBy('course').mean('marks')
students_mean.show()

Output

+------+------------------+
|course|        avg(marks)|
+------+------------------+
| BTech|            1600.0|
|  BCom|1296.6666666666667|
|   BCA|1443.3333333333333|
|   MCA|            1575.0|
|   BSc|1366.6666666666667|
+------+------------------+

PySpark DataFrame groupBy min():

The min() aggregate function is used to return the smallest value in each group. For example, I want to select the lowest marks among the students in each course.

students_min = student_dataframe.groupBy('course').min('marks')
students_min.show()

Output

+------+----------+
|course|min(marks)|
+------+----------+
| BTech|      1450|
|  BCom|      1250|
|   BCA|      1400|
|   MCA|      1550|
|   BSc|      1350|
+------+----------+

PySpark DataFrame groupBy max():

The max() aggregate function is used to return the maximum value in each group. For example, I want to select the maximum marks of the students in each course.

students_max = student_dataframe.groupBy('course').max('marks')
students_max.show()

Output

+------+----------+
|course|max(marks)|
+------+----------+
| BTech|      1750|
|  BCom|      1350|
|   BCA|      1480|
|   MCA|      1600|
|   BSc|      1400|
+------+----------+

So far we have seen how to use PySpark DataFrame groupBy with a single column, but sometimes our requirement is to group by multiple columns. In that scenario, we have to pass multiple column names to the groupBy() or groupby() method.

Let's see some examples of applying groupBy on multiple PySpark DataFrame columns.

PySpark DataFrame groupBy multiple columns:

In this section, we will see how to group by multiple columns of a PySpark DataFrame.

For example, in the above-created PySpark DataFrame, I want to count the total number of male and female students in each course. In that scenario, we have to group by the course and gender columns along with the count() aggregate function.

students_count = student_dataframe.groupBy('course', 'gender').count()
students_count.show()

Output

+------+------+-----+
|course|gender|count|
+------+------+-----+
| BTech|  Male|    3|
|   BCA|  Male|    3|
|  BCom|Female|    3|
|   BSc|  Male|    3|
|   MCA|Female|    1|
|   MCA|  Male|    1|
| BTech|Female|    1|
+------+------+-----+

As you can see in the above table, we have successfully fetched the total number of male and female students in each course.

With the same approach, you can use all the aggregate functions in order to fulfill your requirement.

PySpark DataFrame groupBy with multiple aggregate functions

So far we have applied only a single aggregate function along with groupBy, but we can also apply multiple aggregate functions at once.
To use multiple aggregate functions, we call the agg() method after groupBy() and pass a Python dictionary whose keys are column names and whose values are aggregate function names.

Let's see an example.

students_agg = student_dataframe.groupBy('course').agg({'course': 'count', 'marks': 'sum'})
students_agg.show()

Note:- Remember, you can't pass multiple aggregate functions for the same column with this dictionary syntax, because a Python dictionary doesn't allow duplicate keys.

To apply multiple aggregate functions to the same column, you have to import the aggregate functions from the pyspark.sql.functions module and then pass them inside the agg() function.

from pyspark.sql.functions import avg, sum, max, count, min

students_agg = student_dataframe.groupBy("course").agg(
    count("course"), sum("marks"), min("marks"), max("marks"), avg("marks")
)
students_agg.show()

Output

+------+-------------+----------+----------+----------+------------------+
|course|count(course)|sum(marks)|min(marks)|max(marks)|        avg(marks)|
+------+-------------+----------+----------+----------+------------------+
| BTech|            4|      6400|      1450|      1750|            1600.0|
|  BCom|            3|      3890|      1250|      1350|1296.6666666666667|
|   BCA|            3|      4330|      1400|      1480|1443.3333333333333|
|   MCA|            2|      3150|      1550|      1600|            1575.0|
|   BSc|            3|      4100|      1350|      1400|1366.6666666666667|
+------+-------------+----------+----------+----------+------------------+

So this is how you can use PySpark DataFrame groupBy with aggregate functions to summarize a result set.
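As a small additional sketch, you can also give the aggregate columns friendlier names with alias(), instead of keeping defaults like sum(marks). The column names total_students, total_marks, and average_marks below are just illustrative choices:

from pyspark.sql.functions import count, sum, avg

# renaming aggregate columns with alias() for readability
students_agg = student_dataframe.groupBy("course").agg(
    count("course").alias("total_students"),
    sum("marks").alias("total_marks"),
    avg("marks").alias("average_marks"),
)
students_agg.show()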


Conclusion

In conclusion, I hope this tutorial was easy to follow and fully informative. Throughout this article we have covered how to use PySpark DataFrame groupBy along with aggregate functions. This will be very helpful, especially when you work on a real-life PySpark application, because our requirement is often to summarize a PySpark DataFrame based on similar records.

You can easily use any aggregate function along with the PySpark DataFrame groupby() method. You can use either groupby() or groupBy() because both are synonymous.

If you found this article helpful, please share and keep visiting for further PySpark Tutorials.


PySpark FAQs ( Frequently Asked Questions )

What is groupby in PySpark?

groupby is similar to GROUP BY in SQL. The groupby() function is used to collect similar data into groups on a DataFrame and perform aggregate functions like count, sum, max, min, mean, etc.

What are aggregate functions in PySpark?

Aggregate functions in PySpark are used to perform operations on a set of rows and produce a single return value for each group.

PySpark groupby vs reduceByKey.

The reduceByKey function works only with a Resilient Distributed Dataset (RDD) containing key-value pairs, while groupby is used to perform aggregate functions on top of a set of DataFrame rows and produce a single return value per group.
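Here is a minimal sketch of the difference, using the SparkSession created earlier in this article. The sample pairs are illustrative:

# reduceByKey works on an RDD of (key, value) pairs
rdd = spark.sparkContext.parallelize(
    [("BTech", 1550), ("BCom", 1250), ("BTech", 1450)]
)
marks_sum = rdd.reduceByKey(lambda a, b: a + b)
print(marks_sum.collect())  # e.g. [('BTech', 3000), ('BCom', 1250)]

# the DataFrame equivalent uses groupBy with an aggregate function
student_dataframe.groupBy("course").sum("marks").show()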

Can I use group by in PySpark DataFrame?

Yes, of course, you can use group by in a PySpark DataFrame in order to summarize identical data. Group by is always used with aggregate functions like count, sum, min, max, mean, and avg.

PySpark group by vs order by.

Group By in PySpark is used to perform aggregate functions on identical rows, while Order By is used to sort the result set in ascending or descending order.
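A minimal sketch combining the two, using the student_dataframe from this article:

from pyspark.sql.functions import col

# groupBy summarizes, orderBy sorts the summarized result
course_counts = student_dataframe.groupBy("course").count()
course_counts.orderBy(col("count").desc()).show()  # largest group first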
