Menu Close

How to Use AWS S3 Select to Query CSV

Use AWS S3 Select to Query CSV

Hi, In this article, you will learn all about How to Use AWS S3 Select to Query CSV with the help of the examples. Throughout this article, we will try to Query CSV files using the AWS management console and Python boto3 library with the help of examples.

Let’s understand all about AWS S3 SELECT.

What is AWS S3 SELECT?

Amazon S3 SELECT is a feature of AWS S3 ( Simple Storage Services ) that allows us to retrieve a subset of data from an object with the help of simple SQL operations.
By using AWS S3 SELECT, we can reduce the amount of data that AWS S3 transfers which reduce the cost and latency to retrieve the data.

Amazon S3 SELECT works only with objects that are stored in JSON, CSV, and Parquet format. It works with objects that are compressed in GZIP or BZIP2 ( Only for CSV and JSON ).

AWS S3 SELECT supports subsets of SQL SELECT statements which are given below.

  • SELECT:- it is used to select a specific column from the object.
  • FROM:- It is used to specify the object name.
  • WHERE:- It is used to specify a condition.
  • LIMIT:- it limits the result sets.

When should we use AWS S3 SELECT?

You can use the AWS S3 SELECT statement when you need to retrieve a subset of data from the object stored in S3. This can be useful in the following scenarios.

  • When you need to improve the performance of the applications. AWS S3 SELECT can significantly improve the performance of the applications by avoiding the need of downloading entire objects.
  • When you need to cost your application. AWS S3 SELECT can help to reduce the cost by reducing the amount of data that needs to be stored and transferred.
  • When you need to filter the data in the object. AWS S3 SELECT statement allows us to use SQL extension to filter the data and in the object which can be helpful for data analysis and other tasks.

Here are some examples of when you might want to use AWS S3 SELECT.

  • Suppose you have large CSV files with customer data and you want to retrieve data for customers that belong to a specific state.
  • You have a JSON file with product data and you want to retrieve products that are in stock.

Limitations of AWS S3 SELECT

These are some limitations of AWS S3 SELECT that you should remember.

  • It does not support all SQL features.
  • It can only retrieve a subset of objects stored in S3.
  • It only works with JSON, CSV, and Parquet files.
  • The maximum length of SQL expression must be 256KB.
  • The maximum length of a record in input or result is 1 MB.
  • Amazon S3 SELECT does not support parquet output. You will have to specify the output as JSON or CSV.

I hope you would have understood the basics of the AWS S3 SELECT statement, Now let’s see how we can use AWS S3 SELECT with the help of the management console and Pythpn boto3 library. We will see all the ways one by one.

How to Use AWS S3 Select to Query CSV using AWS Management Console?

First, we will see the AWS management console to query CSV files using AWS S3 SELECT.

Use AWS S3 Select to Query CSV
  • Choose the bucket that contains the objects that you want to select content from and then choose the name of the object. In my case, the bucket name is myemployeesdata.
Use AWS S3 Select to Query CSV
  • Click on the object name which you want to get content. In my case, it will be employee_one.csv
Use AWS S3 Select to Query CSV
  • Choose Object options and then choose Query with S3 Select.
Use AWS S3 Select to Query CSV
  • Configure the input settings based on the format of your CSV data.
  • Configure output settings based on the format of data that you want to receive.
  • To select the records from the object, under the SQL query, enter the SQL SELECT commands. After entering the SQL SELECT command, click on Run SQL Query.
Use AWS S3 Select to Query CSV
  • After executing the above SQL SELECT command, you will see your results like this.
Use AWS S3 Select to Query CSV
  • To show the output in tabular format, click on the Formatted option.
Use AWS S3 Select to Query CSV

S3 SELECT statement with WHERE Clause:

We can also use SQL WHERE Clause to specify the condition for data. For example, I want to select only those employee records whose salary is greater than 30000.

My SQL SELECT command will be the following.

SELECT * FROM s3object s WHERE s._7 > '30000';

_7 indicates the position of the column means in the CSV file salary column on the 7th position. Position always starts from 1.

After executing the above SQL SELECT command, The result will be:

Use AWS S3 Select to Query CSV

As you can see, all the employees have been fetched whose salary is greater than 30000.

How to Use AWS S3 Select to Query CSV using Python boto3?

Boto3 is AWS SDK written in Python programming language in order to make requests for various AWS services. As developers sometimes we want to use AWS S3 SELECT in our application then we can use Python boto3 library.

Python boto3 is an external Python library that first of all we have to install it by using the Python pip command.

You can use the below command to install boto3.

pip install boto3

Prerequisites to use boto3, You must have AWS secret key and access key and configure both the secret key and access key in your machine.
You can follow the below tutorials for creating an access key and secret key and configuring AWS CLI ( Command Line Interface ).

Now, I am assuming, You have got the access key and secret key and configured AWS CLI ( Command Line Interface ) also.

Now let’s see how we can use the boto3 library to use the AWS S3 SELECT statement to query data from the S3 object.

I have created a Python file called s3_tutorial.py and mentioned the below code inside that file.

import boto3


client = boto3.client('s3')
response = client.select_object_content(
    Bucket='myemployeesdata',
    Key='employees_one.csv',
    Expression="SELECT * FROM s3object s WHERE s.salary > '30000';",
    ExpressionType='SQL',
    InputSerialization={
        'CSV': {
            'FileHeaderInfo': 'USE'
        }
    },
    OutputSerialization={
        'JSON': {}
    }

)

for event in response['Payload']:
    if 'Records' in event:
        print(event['Records']['Payload'].decode('utf-8'))
    elif 'Stats' in event:
        print(event['Stats'])

In the above, I am fetching all those employee records whose salary is greater than 30000. Executed the code by using the below command.

python s3_tutorial.py 

After executing the code, You will see your result. In my case, the results will be the following.

Use AWS S3 Select to Query CSV

In fact, I have written the same code in my PyCharm IDE and executed it also, As you can see.

Use AWS S3 Select to Query CSV

So this is how you can use AWS S3 SELECT statement query CSV file.

AWS S3 SELECT Documentation:- Click Here

Conclusion

So in this tutorial recipe, We have successfully covered all about how to use AWS S3 Select to query CSV files. We have seen a total of two ways to query CSV files first us using the AWS Management Console and the second using the Python boto3 library.

I hope the process of explaining this article was pretty easy, You can go with it as per your requirement. if your want to AWS S3 SELECT in your Python application then I will recommend you go with Python boto3 library other you can go with the management console.

If you found this helpful, please share and keep visiting for further AWS tutorials.

Happy coding….

How to Read CSV Files from S3 Using Python
How to Give AWS Lambda Access to IAM User

Related Posts