How to Split String in Pandas DataFrame Column

In this Pandas tutorial, we will see how to split a string in a Pandas DataFrame column with the help of examples. Sometimes a single column in a Pandas DataFrame holds several pieces of information, and we want to split that information and store it in multiple columns. To split a string, Pandas provides a string method called split() that splits each string and returns a Python list.

With expand=True, the Pandas split() method can also store the pieces in multiple columns instead of a list.

I have prepared a sample CSV dataset along with some dummy records as you can see below.

Sample CSV Dataset

From the above CSV dataset, I have created a Pandas DataFrame.

Pandas Sample DataFrame


If you don’t know how to load a CSV file into a Pandas DataFrame, then 👉 click here to learn the complete tutorial.

Before using the Pandas string split() method, let’s look at how it works.

Pandas str.split() Method

It is a Pandas Series method that splits each string around a given delimiter. It is applied through the .str accessor of a Pandas Series that contains string values. Each column of a Pandas DataFrame is a Pandas Series.

This is the syntax of the split() method.

Series.str.split(pat=None, *, n=-1, expand=False, regex=None)

Parameters:

  • pat:- The string or regular expression to split on. If it is not specified, the string is split on whitespace. You can also think of this as the delimiter.
  • n:- It limits the number of splits. The default of -1 returns all splits; None and 0 are interpreted the same way.
  • expand:- It will be used when we want to split the string into multiple columns. It takes only a boolean value; True or False.
    • True:- Return DataFrame/MultiIndex expanding dimensionality.
    • False:- Return series which contains a list of strings after splitting.
  • regex:- It is used to determine if the passed-in pattern is a regular expression:
    • If True, assumes the passed-in pattern is a regular expression
    • If False, treats the pattern as a literal string.
    • If None and pat length is 1, treat pat as a literal string.
    • If None and pat length is not 1, treat pat as a regular expression.
    • Cannot be set to False if pat is a compiled regex.
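
The parameters above can be tried out on a small in-memory Series. This is a minimal sketch; the values in `s` are hypothetical and are not taken from the article's CSV file.

```python
import pandas as pd

# Hypothetical sample values, not taken from the article's dataset.
s = pd.Series(["a b c", "d e"])

# Default call: split on whitespace, one list per row.
as_lists = s.str.split()

# n=1: perform at most one split per row.
one_split = s.str.split(n=1)

# expand=True: return a DataFrame with one column per piece;
# rows with fewer pieces are padded with a missing value.
as_columns = s.str.split(expand=True)

print(as_lists.tolist())   # [['a', 'b', 'c'], ['d', 'e']]
print(one_split.tolist())  # [['a', 'b c'], ['d', 'e']]
print(as_columns.shape)    # (2, 3)
```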

Let’s see the use case of the split() method with the help of the examples.

How to Split String in Pandas DataFrame Column

Here we will go through several use cases of the split() method and use each of its parameters along the way.

Split string into a List:

As you can see in the above Pandas DataFrame, it has a column emp_full_name that combines the first name and last name of each employee. Now we will see how to split the full name into a list.

The split() method splits on whitespace by default, so we don’t need to pass a delimiter through the pat parameter. It returns a list of strings for each row, which will be stored in a new column called split_string.

Let’s see how we can do that.

import pandas as pd

# Load the sample employee dataset.
df = pd.read_csv('../../Datasets/employees.csv')

# Split each full name on whitespace; every row gets a list of strings.
df['split_string'] = df['emp_full_name'].str.split()
print(df)
Pandas split column by space delimiter

As you can see in the output above, a new column named split_string has been added, and it holds the list for each row.
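
One detail worth knowing is that the default whitespace split and an explicit single-space delimiter behave differently when a value contains consecutive spaces. The sketch below illustrates this with hypothetical names that are not part of the article's dataset.

```python
import pandas as pd

# Hypothetical names; the second one contains two consecutive spaces.
names = pd.Series(["John Smith", "Mary  Jones"])

# No pat: runs of whitespace count as a single delimiter.
default_split = names.str.split()

# pat=" ": every single space is a delimiter, so an empty string appears.
space_split = names.str.split(pat=" ")

print(default_split.tolist())  # [['John', 'Smith'], ['Mary', 'Jones']]
print(space_split.tolist())    # [['John', 'Smith'], ['Mary', '', 'Jones']]
```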

Pandas split column by delimiter:

Sometimes a string column uses a delimiter other than a space. In that case, we can pass that delimiter as the first argument (pat) of the split() method.

I have replaced the space in the emp_full_name column with a comma (,) so that the comma can be used as the delimiter in the split() method.

Let’s see.

import pandas as pd

# Load the sample employee dataset.
df = pd.read_csv('../../Datasets/employees.csv')

# Split each full name on the comma delimiter.
df['split_string'] = df['emp_full_name'].str.split(pat=',')
print(df)
Pandas split column by comma delimiter
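
The same idea can be checked without the CSV file. The following sketch uses hypothetical comma-separated names, including one value with no comma, to show what each row looks like after the split.

```python
import pandas as pd

# Hypothetical values; the last one has no comma at all.
names = pd.Series(["John,Smith", "Mary,Jones", "Cher"])

# Split on the comma; a value without the delimiter becomes a one-item list.
parts = names.str.split(pat=",")

print(parts.tolist())  # [['John', 'Smith'], ['Mary', 'Jones'], ['Cher']]
```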

Limit Number of Splits:

The emp_email column holds the email of each employee. Each email contains two dots, so splitting on the dot delimiter performs two splits and returns a list of three items. If we want to perform only one split, we can use the n parameter of the split() method, which sets the maximum number of splits to perform.

Let’s see how we can limit the number of splits.

import pandas as pd

# Load the sample employee dataset.
df = pd.read_csv('../../Datasets/employees.csv')

# Split each email on the first dot only (n=1 allows a single split).
df['split_email'] = df['emp_email'].str.split(pat='.', n=1)
print(df)
Limit Number of Splits in Pandas DataFrame
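
To compare the result with and without the limit, here is a minimal sketch on hypothetical email addresses (not the ones in the article's dataset). Because the pattern is a single character and regex is left at its default, the dot is treated as a literal dot rather than a regular expression wildcard.

```python
import pandas as pd

# Hypothetical email addresses with two dots each.
emails = pd.Series(["john.smith@example.com", "mary.jones@example.org"])

# No limit: every dot produces a split, giving three items per row.
all_splits = emails.str.split(pat=".")

# n=1: only the first dot is used, giving two items per row.
one_split = emails.str.split(pat=".", n=1)

print(all_splits.tolist())
# [['john', 'smith@example', 'com'], ['mary', 'jones@example', 'org']]
print(one_split.tolist())
# [['john', 'smith@example.com'], ['mary', 'jones@example.org']]
```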

Pandas split one column into multiple columns:

The expand parameter of the split() method returns the pieces as separate columns instead of a list. For example, the emp_full_name column of the DataFrame combines the first name and last name of each employee, and we want to split it into two columns, first_name and last_name.

To create the new columns we will use expand=True in the split() method.

import pandas as pd

# Load the sample employee dataset.
df = pd.read_csv('../../Datasets/employees.csv')

# expand=True returns one column per piece, assigned to two new columns.
df[['first_name', 'last_name']] = df['emp_full_name'].str.split(pat=' ', expand=True)
print(df)
Pandas split one column into multiple columns

Now you can see that two columns, first_name and last_name, have been added to the original DataFrame.
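
The assignment to exactly two columns works when every name splits into exactly two pieces. If some names could contain more than one space, one possible safeguard is to combine expand=True with n=1 so that at most two columns are produced. The sketch below assumes hypothetical names, one of which has a middle name.

```python
import pandas as pd

# Hypothetical data; the second name has three words.
df = pd.DataFrame({"emp_full_name": ["John Smith", "Mary Ann Jones"]})

# n=1 keeps everything after the first space together, so the result
# has exactly two columns and matches the two target column names.
df[["first_name", "last_name"]] = df["emp_full_name"].str.split(
    pat=" ", n=1, expand=True
)

print(df["first_name"].tolist())  # ['John', 'Mary']
print(df["last_name"].tolist())   # ['Smith', 'Ann Jones']
```

Whether the middle name belongs with the first or the last name depends on the data, so treat this as one option rather than a general rule.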

Access specific items from the list after splitting:

Pandas provides a string method called get() that extracts the item at a specified position or key from each list, tuple, string, or dict in a Series.

Let’s split the emp_full_name column on spaces and store the first item of the list in the first_name column and the second item in the last_name column with the help of the get() method. The get() method takes the position of the item.

import pandas as pd

# Load the sample employee dataset.
df = pd.read_csv('../../Datasets/employees.csv')

# Split the full name into a list, then pick items by position.
df['split_full_name'] = df['emp_full_name'].str.split()
df['first_name'] = df['split_full_name'].str.get(0)  # first item
df['last_name'] = df['split_full_name'].str.get(1)   # second item
print(df)
Access specific items from the list after splitting
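
It can help to see what get() returns when the requested position does not exist. In this sketch the names are hypothetical, and one of them consists of a single word, so there is no second item to extract.

```python
import pandas as pd

# Hypothetical data; "Cher" has no second word.
df = pd.DataFrame({"emp_full_name": ["John Smith", "Cher"]})

parts = df["emp_full_name"].str.split()

# get() returns a missing value when the position is out of range.
df["first_name"] = parts.str.get(0)
df["last_name"] = parts.str.get(1)

print(df["first_name"].tolist())       # ['John', 'Cher']
print(df["last_name"].isna().tolist()) # [False, True]
```

The shorthand parts.str[0] does the same positional lookup as parts.str.get(0).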

Extract Domain Name from Email in Pandas:

In the Pandas DataFrame, we have a column emp_email that holds the email of each employee, and now we want to get the domain name from the email.
To get the domain name, we split the email on the @ sign, because the string after @ is the domain name.
The @ sign is passed as the delimiter through the pat parameter of the split() method.

Let’s see how we can do that.

import pandas as pd

# Load the sample employee dataset.
df = pd.read_csv('../../Datasets/employees.csv')

# Split on '@' and keep the second item, which is the domain name.
df['domain'] = df['emp_email'].str.split(pat='@').str.get(1)
print(df)
Extract Domain Name from Email in Pandas

As you can see, we have successfully extracted the domain name from the email. This approach is not limited to gmail.com; it extracts whatever domain follows the @ sign.
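
Here is a small self-contained version of the same chain, run on hypothetical addresses. The last value has no @ sign, which shows that such rows end up with a missing value instead of raising an error.

```python
import pandas as pd

# Hypothetical addresses; the last value is not a valid email.
emails = pd.Series(["john@gmail.com", "mary@example.org", "no-at-sign"])

# Split on '@' and take the item at position 1 (the domain).
domain = emails.str.split(pat="@").str.get(1)

print(domain.tolist()[:2])    # ['gmail.com', 'example.org']
print(domain.isna().tolist()) # [False, False, True]
```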

Use Regex to Split The String:

Regex is short for regular expression, a pattern language used to search for specific text in a string. To use a regex in the split() method you need to know Python regular expressions, but don’t worry: we have written a complete article on Python regular expressions, which you can get by 👉 clicking here.

In the first column of the Pandas DataFrame, you can see several full names that contain ‘al’, and now I want to split the full name on ‘al’. When regex=True is passed, the value of the pat parameter is treated as a Python regular expression. If regex is left as None, a pattern longer than one character is also treated as a regular expression, so regex=True mainly makes the intent explicit.

If ‘al’ does not occur in a name in the emp_full_name column, a list containing only the original string is returned.

Let’s see a simple example.

import pandas as pd

# Load the sample employee dataset.
df = pd.read_csv('../../Datasets/employees.csv')

# Split each full name wherever 'al' occurs; regex=True treats pat as a regex.
df['split_string'] = df['emp_full_name'].str.split(pat='al', regex=True)
print(df)
Use Regex to Split The String

As you can see in the emp_full_name column of the output DataFrame, some names do not contain ‘al’, so for those rows the split_string column holds a list with only the original name. Where ‘al’ does occur, the list of strings after splitting is returned.

This is how you can use Python regular expression in the split() method to split the Pandas DataFrame column.
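
The pattern ‘al’ contains no special regex characters, so it splits the same way as a literal string would. To see the regex parameter make a real difference, the sketch below uses a hypothetical Series and a character-class pattern, and compares it with a literal split using regex=False.

```python
import pandas as pd

# Hypothetical values that mix two different delimiters.
s = pd.Series(["red, green;blue", "one;two"])

# Regex: split on a comma or semicolon, plus any whitespace after it.
regex_split = s.str.split(pat=r"[,;]\s*", regex=True)

# Literal: split only on the exact two-character string ", ".
literal_split = s.str.split(pat=", ", regex=False)

print(regex_split.tolist())    # [['red', 'green', 'blue'], ['one', 'two']]
print(literal_split.tolist())  # [['red', 'green;blue'], ['one;two']]
```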

split() Method Documentation:- Click Here

Conclusion

The split() method is the standard way in Pandas to split string values in a DataFrame column or Series. In real-life Pandas applications, we often have to split a DataFrame column to get some specific value, and the split() method handles that. Remember, the split() method should be applied only to string values.

If you found this article helpful, please share and keep visiting for further Pandas tutorials.

Thanks for your valuable time.