How to use GroupBy in Pandas DataFrame

In this Pandas tutorial, we will explore all about the Pandas groupby() method with the help of the examples. Pandas groupby() is very similar to the Group by in SQL that is used to group the result set according to similar values and apply aggregate functions like sum, min, max, avg, etc on a group of the result set.

If you are familiar with Group By in SQL then you can easily understand Group By in Pandas DataFrame.

Pandas provides a DataFrame groupby() method that is used to group the result set based on the passed columns. Before going to deep dive into this article, Let’s see all about the Pandas groupby() method.

Headings of Contents

1 Pandas DataFrame groupby() Method
2 How to use GroupBy in Pandas DataFrame
- 2.1 Group By Department in Pandas DataFrame
3 How does Pandas Groupby work?
4 Conclusion

Pandas DataFrame groupby() Method

groupby() is a DataFrame method that is used to grouping the Pandas DataFrame by mapper or series of columns. The groupby() method splits the objects into groups, applying an aggregate function on the groups and finally combining the result set.

Syntax of Pandas groupby() Method:

This is the syntax of the Pandas groupby() method.

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True)

Parameter of groupby() Method:

The Pandas groupby() Method accepts some keyword argument parameters that can be used as per requirement.

by:-  Required, A labels, A list of labels or a function used to specifiy how to group the DataFrame.

axis:- {0 for 'Index', 1 for 'column'}. Default is 0. For the series, this parameter is unused and by default set to 0.

level:- int, level name, If the axis is multiplex, group by particular level or levels. Do not specify both by and level.

as_index:- Boolean, Default is True.Return the objects with group lavels.

sort:- Boolean, Default is True. It is used to sort the group keys.You can get better performance by turning this off.

group_keys:- Boolean, Default is True. Set to False if the result should not add the groups to the index.

observed:- It has been deprecated from Pandas version 2.1.0.

dropna:- Boolean, default is True. If True and if group keys contains NA value, NA values together with rows and columns will be droped.

Now let’s see how to use group in Pandas DataFrame with the help of some examples.

How to use GroupBy in Pandas DataFrame

For a demonstration of this article, I have prepared a sample CSV dataset along with some dummy data. Throughout this article, we are about to use this sample CSV dataset.

I have loaded the above Sample CSV dataset into Pandas DataFrame with the help of the Pandas read_csv() method.

use GroupBy in Pandas DataFrame — **Pandas Sample DataFrame**

Group By Department in Pandas DataFrame

As you can see in the above Pandas DataFrame, we have one emp_department column that has a department for each employee. Now we want to group the Pandas DataFrame based on the department and perform some aggregate functions on each group.
To apply the aggregate function, we can use the agg() function along with the aggregate function otherwise we can use the corresponding aggregate function directly on groupby() object.

The most common aggregate functions are sum(), mean(), count(), min(), max(), size(), median() and var().

sum() aggregate function:

The sum() function is used to add the total salary of all the employees within a department. For example, I want to get the total salary of employees in a particular department.

import pandas as pd
df = pd.read_csv(
                 '../../Datasets/employees.csv'
                )
x = df.groupby(['emp_department']).sum()
x[['emp_salary']]

count() aggregate function:

The count() aggregate function is used to return the total number of rows in each group. For example, I want to get the total number of employees in each department.

import pandas as pd
import numpy as np
df = pd.read_csv(
                 '../../Datasets/employees.csv'
                )
x = df.groupby(['emp_department']).count()
x.rename({"emp_full_name": "Total Employees"}, axis=1,inplace=True)
x[["Total Employees"]]

How does Pandas Groupby work?

Pandas groupby() Method performs three operations behind the scenes. Three operations are listed below.

Splitting:- Splitting the original object into groups based on the defined criteria.
Applying:- Applying aggregate function on each group.
Combining:- Combining the results.

This process is also called a split-apply-combine chain.

Let’s understand all the above operations step by step with the help of the examples.

Splitting the original objects into groups:

When we call the Pandas groupby() method on the top of the Pandas DataFrame.The groupby() method splits the objects into groups based on predefined criteria. The groupby() functions map the labels to the names of the groups.

For example, we are grouping the data based on the emp_department.

groups = df.groupby('emp_department')

We can also group the data based on the multiple-column names.

groups = df.groupby(['emp_gender', 'emp_department'])

The groupby() function always returns a groupby object which contains multiple groups.

We can get all the groups by iterating over the groupby() object.

Let’s use Python for loop to iterate each item of the groupby object. Here, I am trying to group the Pandas DataFrame based on the emp_department column.

import pandas as pd
import numpy as np
df = pd.read_csv(
                 '../../Datasets/employees.csv'
                )
groups = df.groupby(['emp_department'])

for group_name, data in groups:
    print("Group Name is:- ", group_name)
    print("-------------------------")
    print("Group Data:- \n\n", data)

The Output will be:


Group Name is:-  ('BPO',)
-------------------------
Group Data:- 
       emp_full_name                  emp_email emp_gender  emp_salary  \
0      Mayank Kumar     [email protected]       Male       25000   
10  Harshali Kumari  [email protected]     Female       21000   
11      Vinay Singh      [email protected]       Male       18000   

   emp_department date_of_joining  
0             BPO       11/1/2023  
10            BPO       11/9/2023  
11            BPO      11/10/2023  
Group Name is:-  ('IT',)
-------------------------
Group Data:- 
     emp_full_name                emp_email emp_gender  emp_salary  \
1   Vishvajit Rao  [email protected]       Male       40000   
4    Vishal Kumar   [email protected]       Male       60000   
7      James Bond     [email protected]       Male       42000   
12    Vinay Mehra    [email protected]       Male       45000   
13  Akshara Singh  [email protected]     Female       55000   

   emp_department date_of_joining  
1              IT       11/2/2023  
4              IT       11/5/2023  
7              IT       11/7/2023  
12             IT      11/11/2023  
13             IT      11/12/2023  
Group Name is:-  ('SEO',)
-------------------------
Group Data:- 
     emp_full_name                 emp_email emp_gender  emp_salary  \
3     Kavya Singh     [email protected]     Female       20000   
5  Vaishali Mehta  [email protected]     Female       35000   
6  Vaishali Mehta  [email protected]     Female       35000   

  emp_department date_of_joining  
3            SEO       11/4/2023  
5            SEO       11/6/2023  
6            SEO       11/6/2023  
Group Name is:-  ('Sales',)
-------------------------
Group Data:- 
       emp_full_name                   emp_email emp_gender  emp_salary  \
2   Harshita Mathur   [email protected]     Female       20000   
8  Mariya Katherine  [email protected]     Female       32000   
9  Mariya Katherine  [email protected]     Female       40000   

  emp_department date_of_joining  
2          Sales       11/3/2023  
8          Sales       11/8/2023  
9          Sales       11/8/2023

Access Specific Group:

Pandas groupby() object has a special method called get_group() which takes the group name parameter and returns the data in the form of Pandas DataFrame.

Each department name represents a group name, For example, I want to select the IT group, and Then we will pass the IT name in the get_group() method.

Let’s see.

import pandas as pd
import numpy as np
df = pd.read_csv(
                 '../../Datasets/employees.csv'
                )
groups = df.groupby(['emp_department'])
groups.get_group('IT')

This is how you can get single group information in the form of Pandas DataFrame by using the get_group() method. Now, let’s see how we can use aggregate functions.

Applying Aggregate Functions

After splitting the original objects into groups, we can apply aggregate functions on top of the GroupBy object even if we can apply aggregate functions by each group.

Let’s apply aggregate functions on the GroupBy object.

import pandas as pd
df = pd.read_csv(
                 '../../Datasets/employees.csv'
                )
groups = df.groupby(['emp_department'])
groups.max()

The max() aggregate function returns the maximum salary of each department.

aggregate functions can also be applied to a single group.

Let’s apply the median() aggregate function to the IT group.

import pandas as pd
df = pd.read_csv(
                 '../../Datasets/employees.csv'
                )
groups = df.groupby(['emp_department'])
groups.get_group('IT')['emp_salary'].median()

This is the way you can use aggregate functions on GroupBy objects or single groups.

Combining

This is the last step in the GroupBy process. After applying aggregate functions on each group it combines the result set of all the groups into a single DataFrame.This stage is performed by the Pandas itself.

This is how Pandas GroupBy works.

Helpful Pandas Articles

👉 Pandas DataFrame groupby() Method Documentation:- Click Here

Conclusion

So during this article, we have seen how to use groupby in Pandas DataFrame along with some examples and we have seen the working of Pandas groupby() method. This is one of the useful features in Data Engineering and Data Analysis that we must know Pandas GroupBy method.

If you found this article helpful, Please share and keep visiting for further Pandas tutorials.