How to Implement AWS Crawler using Boto3


Introduction

Hi all, today we will be implementing an AWS Glue crawler using Boto3. We will code everything ourselves: creating a crawler, starting it, and deleting it. Crawlers are used to create tables out of your data. The data can be files in S3, a JDBC source, or some other connection; we will be using S3 as our connection.

Creating AWS Crawler

import json

import boto3

# Create a Glue client; change the region as per your requirement.
client = boto3.client('glue', region_name="us-east-1")

response = client.create_crawler(
    Name='CrawlerBoto3',
    # ARN of an IAM role with Glue access; replace with your own role's ARN.
    Role='arn:aws:iam::967091080535:role/service-role/AWSGlueServiceRole-3',
    # Database where the crawler will create tables; created if it does not exist.
    DatabaseName='Boto3',
    Targets={
        'S3Targets': [
            {
                # S3 folder containing the files the crawler will read.
                'Path': 's3://aki-aws-athena-1/data/',
            },
        ]
    },
    # What to do when the schema of the source data changes.
    SchemaChangePolicy={
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'DEPRECATE_IN_DATABASE'
    },
    # Crawl all files on every run, not only new folders.
    RecrawlPolicy={
        'RecrawlBehavior': 'CRAWL_EVERYTHING'
    },
    LineageConfiguration={
        'CrawlerLineageSettings': 'DISABLE'
    }
)

print(json.dumps(response, indent=4, sort_keys=True, default=str))

Here, we are using a Boto3 Glue client. Provide the region name as per your requirement; you can change the name of your crawler as well.

This code requires the ARN of a role that has full Glue access. Kindly create one and provide its ARN as shown in the code. Change the database name as per your requirement; the crawler will create the database if it does not already exist.

In the Targets section, provide the path of the S3 folder where the files are stored, from which the crawler will create your table. You can also specify what should happen if the schema changes across files (SchemaChangePolicy), and whether to crawl everything inside the folder or only new content (RecrawlPolicy).
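
To confirm the crawler was created, you can fetch it back with get_crawler. Here is a minimal sketch; it assumes the CrawlerBoto3 name and the client from above.

# Fetch the crawler we just created and print its current state.
response = client.get_crawler(Name='CrawlerBoto3')
crawler = response['Crawler']
print(crawler['Name'], crawler['State'])  # a newly created crawler starts in the READY state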

Listing AWS Crawler

client = boto3.client('glue', region_name="us-east-1")
# list_crawlers returns the crawler names under the 'CrawlerNames' key.
response = client.list_crawlers()
print(json.dumps(response, indent=4, sort_keys=True, default=str))
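
list_crawlers returns one page of names at a time, so if you have many crawlers you need to follow NextToken to see them all. A minimal sketch, reusing the client from above:

# Collect every crawler name, following NextToken across pages.
names = []
kwargs = {}
while True:
    page = client.list_crawlers(**kwargs)
    names.extend(page['CrawlerNames'])
    if 'NextToken' not in page:
        break
    kwargs['NextToken'] = page['NextToken']
print(names)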

Starting AWS Crawler

Here is the code to start a crawler that was created earlier. We start the first crawler returned by list_crawlers; if you have several crawlers, select the one you want by name instead.

client = boto3.client('glue', region_name="us-east-1")
response = client.list_crawlers()

# Start the first crawler from the list; select by name if you have several.
response2 = client.start_crawler(
    Name=response['CrawlerNames'][0]
)

print(json.dumps(response2, indent=4, sort_keys=True, default=str))

Once the crawler run finishes, you can see the table created in your database.
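
Note that start_crawler returns immediately while the crawl runs in the background. Here is a minimal polling sketch to wait for completion and then list the resulting tables, assuming the CrawlerBoto3 crawler and Boto3 database from above:

import time

# Poll until the crawler finishes its run and returns to the READY state.
while client.get_crawler(Name='CrawlerBoto3')['Crawler']['State'] != 'READY':
    time.sleep(10)

# List the tables the crawler created in our database.
tables = client.get_tables(DatabaseName='Boto3')
for table in tables['TableList']:
    print(table['Name'])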

Deleting AWS Crawler

We can also delete a crawler from code.

client = boto3.client('glue', region_name="us-east-1")
response = client.list_crawlers()
# Delete the first crawler from the list; pass a specific name in practice.
response2 = client.delete_crawler(
    Name=response['CrawlerNames'][0]
)
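
Deleting the crawler does not remove the database or tables it created. If you want to clean those up as well, you can delete the database from the Data Catalog; a minimal sketch, assuming the Boto3 database created earlier:

# Remove the Boto3 database (and the tables in it) from the Glue Data Catalog.
client.delete_database(Name='Boto3')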

Conclusion

In this blog, we have seen how to create, start, and delete a crawler using Boto3. Crawlers are useful for creating tables out of the data we provide. Using those tables, we can also query results with Athena; we will see that in the next blog.

You can also go through the official documentation: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
