AWS Serverless Services: Athena

Reading Time: 2 minutes

Hi folks, in my this series about the AWS Serverless Services, I will be covering few the services provided by AWS and which can be used as an part of your Server less Architecture. This is going to be a developers perspective, so I would be writing everything about service but security and connectivity. At the end of this series we will try to put all the things together to build a completely Server less API.

So the first service I am going to talk about is AWS Athena.

Athena

Athena is a service that explicitly queries Amazon Simple Storage Service, or Amazon S3, using ANSI-standard SQL. On the backend AWS Athena uses a opensource facebook DB named Presto. Since its release, Serverless has picked up pace as we don’t have to manage or create an infrastructure. It follows the AWS model of pay per use. Where you only pay for the queries you run.

Athena is versatile enough to manage a number of tasks related to database queries, as described earlier. It runs standard SQL, supporting standard formats of data like CSV, JSON, ORC, Avro, and Parquet. With ANSI SQL support, Athena uses Presto — an open-source SQL database engine — so it is not a proprietary database tool that users would have to learn from the ground up.

Now lets look at a typical problem of reading and querying a nested JSON file stored in S3.

Example

For this example lets take a snapshot of the json file produced by the gatling:

{
  "name": "JXMQ Auto Test",
  "numberOfRequests": {
    "total": 5060,
    "ok": 5054,
    "ko": 6
  },
  "minResponseTime": {
    "total": 219,
    "ok": 219,
    "ko": 219
  },
  "maxResponseTime": {
    "total": 1087,
    "ok": 1087,
    "ko": 647
  },
  "meanResponseTime": {
    "total": 344,
    "ok": 344,
    "ko": 294
  },
  "standardDeviation": {
    "total": 213,
    "ok": 213,
    "ko": 158
  },
  "percentiles1": {
    "total": 236,
    "ok": 236,
    "ko": 225
  },
  "percentiles2": {
    "total": 260,
    "ok": 260,
    "ko": 228
  },
  "group1": {
    "name": "t < 800 ms",
    "count": 4598,
    "percentage": 91
  },
  "meanNumberOfRequestsPerSecond": {
    "total": 77.84615384615384,
    "ok": 77.75384615384615,
    "ko": 0.09230769230769231
  },
  "dateandtime": "2020-06-08 13:44:27.729128"
}

To Parse this nested JSON we will create a table in the Athena

CREATE EXTERNAL TABLE automation_reports (
name string,
numberOfRequests struct<total:float,ok:float, ko:float>,
minResponseTime struct<total:float,ok:float, ko:float>,
maxResponseTime struct<total:float,ok:float, ko:float>,
meanResponseTime struct<total:float,ok:float, ko:float>,
standardDeviation struct<total:float,ok:float, ko:float>,
percentiles1 struct<total:float,ok:float, ko:float>,
percentiles2 struct<total:float,ok:float, ko:float>,
group1 struct<name: string, count: BIGINT, percentage: float>,
meanNumberOfRequestsPerSecond struct<total:float,ok:float, ko:float>,
dateandtime timestamp
)         
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://athena-folder/' 

Note: For Row Format we are using the parsing methods provided by the AWS, we are using OpenX JSON SerDe . For other fileformats and parsing methods refer to https://docs.aws.amazon.com/athena/latest/ug/supported-serdes.html

And then to query the table for specific information you can write a sql query as follows

Select numberofRequests.total as totalNumberOfRequests, 
       numberofRequests.ok, 
       meanresponseTime, 
       dateandtime 
from "dbName"."automation_reports"

This results in the columnar result

So this is a simple example showcasing how you can parse and query an complex JSON file stored in S3 using Athena.

Stay tuned for upcoming AWS Serverless Services

Knoldus-blog-footer-image