In this article, You will learn everything about how to **apply groupBy in PySpark DataFrame** with multiple aggregate functions with the help of the examples.

In real-life projects sometimes we want to generate a summary of given PySpark PySpark DataFrame similar records. This is one of the most useful concepts in PySpark because this is mostly used in real-life Spark applications.

If you have an idea of Group By in SQL Then it will be very easy to understand Group By concept in PySpark because both are the same.

Before going to deep dive into this article, let’s understand what is PySpark GroupBy and why we need to use this.

Headings of Contents

- 1 What is groupBy in PySpark?
- 2 Why do we need to use Group By in PySpark?
- 3 Creating PySpark DataFrame
- 4 How to apply groupBy in Pyspark DataFrame?
- 5 PySpark DataFrame groupBy multiple columns:
- 6 Pyspark DataFrame groupBy with multiple aggregate functions
- 7 Conclusion
- 8 PySpark FAQs ( Frequently Asked Questions )

## What is groupBy in PySpark?

The **groupBy()** or **groupby()** in PySpark is the same as Group By in SQL. It is used to summarize identical data into a single row with the help of some aggregate functions like **avg**, **max**, **min**, **sum**, **count**, etc.

## Why do we need to use Group By in PySpark?

To understand this, let me take a scenario so that I can present it in a better way. Suppose we have some data of students which comprises some information about the students like **first_name**, **last_name**, **gender**, **marks**, **course**, **roll_number**, and so on, and now I want to get a sum of all the student’s marks in each course or average of student’s marks in each course in gender level or It can be a count of total Male and Female students in each course, etc.

There are a lot of scenarios that can happen to be used groupby or groupBy in PySpark DataFrame. It all depends on your requirement.

This is why Group By statement is most useful in real-life projects because it allows us to get summarized information about the data.

Let’s explore PySpark DataFrame groupBy or groupby with different-different aggregate functions. Obviously to apply Group By we must have a PySpark data frame.

## Creating PySpark DataFrame

I have created a sample PySpark DataFrame for demonstration of this article, you can skip this part if you have already a PySpark DataFrame.

from pyspark.sql import SparkSession data = [ ("Pankaj", "Kumar", "Male", "BTech", 1550, 101, "2022-12-20"), ("Harshita", "Kumari", "Female", "BCom", 1250, 102, "2022-12-20"), ("Hari", "Sharma", "Male", "BCA", 1400, 103, "2018-03-12"), ("Anshika", "Kumari", "Female", "MCA", 1550, 104, "2029-05-19"), ("Shantanu", "Saini", "Male", "BSc", 1350, 105, "2019-08-20"), ("Avantika", "Srivastava", "Female", "BCom", 1350, 106, "2020-10-21"), ("Jay", "Kumar", "Male", "BTech", 1450, 107, "2019-08-29"), ("Vinay", "Singh", "Male", "BCA", 1480, 108, "2017-09-17"), ("Pyush", "Rao", "Male", "BSc", 1400, 109, "2019-08-20"), ("Jackson", "Doe", "Male", "MCA", 1600, 1010, "2029-01-10"), ("Shashank", "Gupta", "Male", "BTech", 1750, 1011, "2029-03-11"), ("Vaibhav", "Pathak", "Male", "BCA", 1450, 1012, "2018-02-10"), ("Harshali", "Kaushik", "Female", "BTech", 1650, 1013, "2019-11-10"), ("Pyush", "Goal", "Male", "BSc", 1350, 1014, "2019-08-20"), ("Manshi", "Kumari", "Female", "BCom", 1290, 1015, "2022-12-25"), ] columns = [ "first_name", "last_name", "gender", "course", "marks", "roll_number", "admission_date", ] # creating spark session spark = SparkSession.builder.appName("testing").getOrCreate() # creating dataframe student_dataframe = spark.createDataFrame(data, columns) student_dataframe.show()

After executing the above PySpark Code, Created DataFrame will be like this.

```
+----------+----------+------+------+-----+-----------+--------------+
|first_name| last_name|Gender|course|marks|roll_number|admission_date|
+----------+----------+------+------+-----+-----------+--------------+
| Pankaj| Kumar| Male| BTech| 1550| 101| 2022-12-20|
| Harshita| Kumari|Female| BCom| 1250| 102| 2022-12-20|
| Hari| Sharma| Male| BCA| 1400| 103| 2018-03-12|
| Anshika| Kumari|Female| MCA| 1550| 104| 2029-05-19|
| Shantanu| Saini| Male| BSc| 1350| 105| 2019-08-20|
| Avantika|Srivastava|Female| BCom| 1350| 106| 2020-10-21|
| Jay| Kumar| Male| BTech| 1450| 107| 2019-08-29|
| Vinay| Singh| Male| BCA| 1480| 108| 2017-09-17|
| Pyush| Rao| Male| BSc| 1400| 109| 2019-08-20|
| Jackson| Doe| Male| MCA| 1600| 1010| 2029-01-10|
| Shashank| Gupta| Male| BTech| 1750| 1011| 2029-03-11|
| Vaibhav| Pathak| Male| BCA| 1450| 1012| 2018-02-10|
| Harshali| Kaushik|Female| BTech| 1650| 1013| 2019-11-10|
| Pyush| Goal| Male| BSc| 1350| 1014| 2019-08-20|
| Manshi| Kumari|Female| BCom| 1290| 1015| 2022-12-25|
+----------+----------+------+------+-----+-----------+--------------+
```

Let me explain the above code so that you can get more clarity with that code.

- Firstly, I have imported the SparkSession from the
**pyspark.sql**package. - Prepared the list of Python tuples and each tuple contains information about the Student like
**first_name**,**last_name**,**gender**,**course**,**marks**,**roll_number**, and**admission_date**. - I have created the Python list that contained column names for PySpark DataFrame.
- Created a spark session using
**SparkSession.builder.appName(“testing”).getOrCreate()**because spark session is the entry point of any spark application. - And then I used the
**createDataFrame()**method of the spark session and passed a list of tuples and columns inside it in order to create PySpark DataFrame. - And finally displayed the created PySpark DataFrame.

Now it’s time to apply groupby on PySpark DataFrame along with various aggregate functions.

## How to apply groupBy in Pyspark DataFrame?

To apply group by on top of PySpark DataFrame, We are about to use methods **groupby()** and **groupBy()** and these two methods come under the **PySpark DataFrame methods**. These two methods are the methods for PySpark DataFrame and these methods take column names as a parameter and group them on behalf of identical values and finally return a new PySpark DataFrame.

We have to remember one thing, we have to use at least one aggregate function along with **groupby()** and **groupBy()** methods in order to see the effect of **groupby()** and **groupBy()** methods.

I have listed some popular aggregate functions that we can use with groupby() and groupBy() methods.

**sum():-**It returns the total sum of values in each group.**count():**– It returns the total count of values in each group.**avg():-**It returns the average of values in each group.**max():-**It returns the max value in each group.**min():-**It returns the minimum value in each group.**mean():-**It returns the mean of values in each group.**agg():-**it is not an aggregate function but using this we can perform more than one aggregate function at a time.

Let me explain all the above PySpark aggregate functions along with examples.

Note:-You can usegroupby()orgroupBy()because both are quite similar.

I am about to apply all the aggregate functions on top of the above-created PySpark DataFrame.

### PySpark DataFrame groupBy count():

Here I am about to count the total number of students in each course including females and males because this is my project requirement. To count the total number of students in each course we will group by course column and also use the **count()** aggregate function.

students_count = student_dataframe.groupBy('course').count() students_count.show()

**Output**

```
+------+-----+
|course|count|
+------+-----+
| BTech| 4|
| BCom| 3|
| BCA| 3|
| MCA| 2|
| BSc| 3|
+------+-----+
```

As you can in the above table, the Total number of students in each **course **has been calculated.

### PySpark DataFrame groupBy sum():

The **sum()** aggregate function is used to calculate the sum of value in each group. For example, I want to calculate the sum of the total marks of each student in each course.

To calculate the sum of marks we have to pass the marks column in the **sum()** function.

students_sum = student_dataframe.groupBy('course').sum('marks') students_sum.show()

**Output**

```
+------+----------+
|course|sum(marks)|
+------+----------+
| BTech| 6400|
| BCom| 3890|
| BCA| 4330|
| MCA| 3150|
| BSc| 4100|
+------+----------+
```

### PySpark DataFrame groupBy avg():

The **avg()** aggregate function is the same as the **sum()** aggregate function unlike the **sum()** function which returns the average of values on each group.

students_avg = student_dataframe.groupBy('course').avg('marks') students_avg.show()

**Output**

```
+------+------------------+
|course| avg(marks)|
+------+------------------+
| BTech| 1600.0|
| BCom|1296.6666666666667|
| BCA|1443.3333333333333|
| MCA| 1575.0|
| BSc|1366.6666666666667|
+------+------------------+
```

### PySpark DataFrame groupBy mean():

The **mean()** aggregate functions calculate the mean of values in each group.

students_mean = student_dataframe.groupBy('course').mean('marks') students_mean.show()

**Output**

```
+------+------------------+
|course| avg(marks)|
+------+------------------+
| BTech| 1600.0|
| BCom|1296.6666666666667|
| BCA|1443.3333333333333|
| MCA| 1575.0|
| BSc|1366.6666666666667|
+------+------------------+
```

### PySpark DataFrame groupBy min():

The **min()** aggregate function is used to return the less value in each group, For example, I want to select fewer marks of students in each **course**.

students_min = student_dataframe.groupBy('course').min('marks') students_min.show()

**Output**

```
+------+----------+
|course|min(marks)|
+------+----------+
| BTech| 1450|
| BCom| 1250|
| BCA| 1400|
| MCA| 1550|
| BSc| 1350|
+------+----------+
```

### PySpark DataFrame groupBy max():

The **max()** aggregate function is used to return the maximum value in each group, For example, I want to select the maximum marks of students in each **course**.

students_max = student_dataframe.groupBy('course').max('marks') students_mas.show()

**Output**

```
+------+----------+
|course|max(marks)|
+------+----------+
| BTech| 1750|
| BCom| 1350|
| BCA| 1480|
| MCA| 1600|
| BSc| 1400|
+------+----------+
```

So far we have seen how to use PySpark DataFrame groupby with the single column but sometimes our requirement is to apply groupby with multiple columns. In that scenario, we have to pass multiple column names inside **groupBy()** or **groupby()** methods.

let’s see some examples to apply groupBy on multiple PySpark DataFrame columns.

## PySpark DataFrame groupBy multiple columns:

In this section, we are about to see the process of grouping multiple columns in PySpark DataFrame.

For example, In the above created PySpark DataFrame, I want to count the total number of Male and Female students in each **course**. In that scenario, we have to group by **course **and **gender **columns along with the **count()** aggregate function.

students_count = student_dataframe.groupBy('course', 'gender').count() students_count.show()

**Output**

```
+------+------+-----+
|course|gender|count|
+------+------+-----+
| BTech| Male| 3|
| BCA| Male| 3|
| BCom|Female| 3|
| BSc| Male| 3|
| MCA|Female| 1|
| MCA| Male| 1|
| BTech|Female| 1|
+------+------+-----+
```

As you can see in the above table, we have successfully been able to fetch the total numbers of **Male **and **Female **students in each course itself.

With the same approach, you can use all the aggregate functions in order to fulfill your requirement.

## Pyspark DataFrame groupBy with multiple aggregate functions

So far we have applied only a single aggregate function on top of PySpark DataFrame along with groupby but we can also apply multiple aggregate functions on top of PySpark DataFrame.

To use multiple aggregate functions we have to use **agg()** method along with groupby() and pass the column name and aggregate function as a key value within the Python Dictionary inside **agg()** function.

let’s see an example.

students_agg = student_dataframe.groupBy('course').agg({'course': 'count', 'marks': 'sum'}) students_agg.show()

Note:-Remember, You can't pass multiple aggregate functions with the same keys inagg()function because the Python dictionary doesn't allow duplicate keys.

To pass multiple aggregate functions with the same keys, you have to import all the aggregate functions from pyspark.sql.functions module and then pass it inside **agg() **function.

from pyspark.sql.functions import avg, sum, max, count, min, avg students_agg = student_dataframe.groupBy("course").agg( count("course"), sum("marks"), min("marks"), max("marks"), avg("marks") ) students_agg.show()

Output

```
+------+-------------+----------+----------+----------+------------------+
|course|count(course)|sum(marks)|min(marks)|max(marks)| avg(marks)|
+------+-------------+----------+----------+----------+------------------+
| BTech| 4| 6400| 1450| 1750| 1600.0|
| BCom| 3| 3890| 1250| 1350|1296.6666666666667|
| BCA| 3| 4330| 1400| 1480|1443.3333333333333|
| MCA| 2| 3150| 1550| 1600| 1575.0|
| BSc| 3| 4100| 1350| 1400|1366.6666666666667|
+------+-------------+----------+----------+----------+------------------+
```

So this is how you can use PySpark DataFrame groupby with aggregate functions to summarize the result set.

Reference:- Click Here

**PySpark Related Articles:**

- How to convert PySpark Row To Dictionary
- PySpark Column Class with Examples
- PySpark Sort Function with Examples
- How to read CSV files using PySpark
- PySpark col() Function with Examples
- Convert PySpark DataFrame Column to List
- How to Write PySpark DataFrame to CSV
- How to Convert PySpark DataFrame to JSON

## Conclusion

In conclusion, I hope this tutorial was easy and fully informative, Throughout this article we have successfully covered **how to use PySpark dataFrame** groupby on top of PySpark DataFrame along with aggregate functions. This is going to be very helpful, especially when you are going to be working with any real-life PySpark application because sometimes our requirement is to summarize the PySpark DataFrame on behalf of simial records.

You can easily use any aggregate function along with the PySpoark DataFrame groupby() method. You can use any method between **groupby()** and **groupBy(**) because both are synonymous.

If you found this article helpful, please share and keep visiting for further PySpark Tutorials.

## PySpark FAQs ( Frequently Asked Questions )

**What is groupby in PySpark?**

**groupby** is similar as GROUP BY in SQL. The **groupby()** function is used to get similar data into groups on DataFrame and perform aggregate functions like **count**, **sum**, **max**, **min**, **mean**, etc.

W**hat are aggregate functions in PySpark?**

Aggregate functions in PySpark are used to perform operations on a set of rows and produce a single return value for each group.

**PySpark groupby vs reducebykey.**

The reducebykey function works only work with a Resilient Distributed Dataset or RDD containing key and value pair kind of content.

while groupby is used to perform aggregate functions and the top of a set of rows and produce single return value.

**Can I use group by in PySpark DataFrame?**

Yes, of course, you can use group by in PySpark DataFrame in order to summarize the identical data. group by will always be used with aggregate functions like **count, sum, min, max, mean, and avg.**

**Python group by vs order by.**

The Group By in PySpark is used to perform some aggregate functions on identical rows and Order By is used to sort the result set in ascending or descending order.