Using AWS Athena to query S3 Server Access Logs.

Written by mannem on . Posted in EMR || Elastic Map Reduce

S3 server access logs can grow very big over time, and it is very hard for a single machine to process, query, and analyze all of them. So, we can use distributed computing to query the logs quickly. Ideally we might think of Apache Hadoop/Hive/Spark/Presto etc. to process these logs. But the simplicity of the AWS Athena service, being serverless, makes it even easier. This article will guide you through using Athena to process your S3 access logs, with example queries, and covers some partitioning considerations that can help you query terabytes of logs in just a few seconds.

The server access log files consist of a sequence of newline-delimited log records. Each log record represents one request and consists of space-delimited fields; a sample record is shown below.

This article is based on the log format of access logs specified here: https://docs.aws.amazon.com/AmazonS3/latest/dev/LogFormat.html. The fields and their format may change over time, and the DDL query should be changed accordingly.

Sample Data:
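
For example, a single record in this format looks like the following (the canonical user IDs, IP address, request ID and byte counts below are illustrative placeholders):

79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be examplebucket [06/Feb/2014:00:00:38 +0000] 192.0.2.3 79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be 3E57427F3EXAMPLE REST.GET.OBJECT gdelt/1980.csv.gz "GET /gdelt/1980.csv.gz HTTP/1.1" 200 - 2662992 2662992 70 10 "-" "aws-cli/1.7.36 Python/2.7.9 Linux/3.14.44-32.39.amzn1.x86_64" -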

Data Format considerations:

At first look, the data format appears simple: a text file with a space field delimiter and newline (\n) record delimiter. However, there is a catch in this data format: columns like Time, Request-URI and User-Agent can have spaces in their data (e.g. [06/Feb/2014:00:00:38 +0000], "GET /gdelt/1980.csv.gz HTTP/1.1" and aws-cli/1.7.36 Python/2.7.9 Linux/3.14.44-32.39.amzn1.x86_64), which breaks naive SerDe parsing.

For the Request-URI and User-Agent fields, the data is enclosed in quotes with spaces inside. This can be parsed by any SerDe that supports quotes. Unfortunately, Athena does not support such SerDes, like org.apache.hadoop.hive.serde2.OpenCSVSerde, which does have a quote feature.

The org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe included with Athena does not support quotes yet.

A custom SerDe called com.amazon.emr.hive.serde.s3.S3LogDeserializer ships with all EMR AMIs just for parsing these logs. A query like the following would create the table easily. However, this SerDe is not supported by Athena.
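
As a rough sketch, such an EMR-only Hive DDL would look something like this (the table name Accesslogs_s3serde and the column names/types are illustrative and may not match exactly what this SerDe expects):

-- Hive on EMR only; Athena cannot use this SerDe.
CREATE EXTERNAL TABLE IF NOT EXISTS Accesslogs_s3serde (
  BucketOwner string, Bucket string, RequestDateTime string, RemoteIP string,
  Requester string, RequestID string, Operation string, `Key` string,
  RequestURI string, HTTPstatus string, ErrorCode string, BytesSent string,
  ObjectSize string, TotalTime string, TurnAroundTime string, Referrer string,
  UserAgent string, VersionId string)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.s3.S3LogDeserializer'
LOCATION 's3://s3server-accesslogs-bucket/logs/';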

Looking at these limitations, org.apache.hadoop.hive.serde2.RegexSerDe seems like the only feasible option. Writing a regex for this log structure was a bit time consuming, but I was able to write it with the help of the ELB log structure example and http://regexr.com/ .

I used the following Regex for the S3 server access logs :

([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (-|[0-9]*) (-|[0-9]*) ([-0-9]*) ([-0-9]*) ([-0-9]*) ([-0-9]*) ([^ ]*) (\"[^\"]*\") ([^ ]*)$

It can also be found at http://regexr.com/3erct , where I removed some extra escape characters like the \'s. In the DDL, we need to add these escape characters back, as in the following DDL.

([^ ]*) – matches a contiguous string. Because number fields like 'time' can contain a '-' character, I did not use the ([-0-9]*) regex that is used for numbers. For the same reason, I had to use STRING for all fields in the DDL. We can use Presto string functions in DML for appropriate conversions and comparison operators.

\\[(.*?)\\] or \\\[(.*?)\\\] – is for the Time field. For a string like [06/Feb/2014:00:00:38 +0000], it will match and return 06/Feb/2014:00:00:38 +0000 without the brackets, which is easier for querying between times.

\\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" – is for Request-URI field – removes the quotes and splits the "GET /mybucket?versioning HTTP/1.1" into three groups.

(\"[^\"]*\") – is for User-Agent – matches the whole string with quotes.

(-|[0-9]*) – is for HTTPstatus, which is usually a number like 200 or 404. If you expect a '-' string, you can use ([^ ]*) instead.

CREATE TABLE DDL :
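
The following is a minimal sketch of that DDL, using the RegexSerDe and the regex above. The column names are illustrative, and every column is a STRING for the reasons discussed earlier; the table name Accesslogs is referenced in the rest of this article:

CREATE EXTERNAL TABLE IF NOT EXISTS Accesslogs (
  BucketOwner string, Bucket string, RequestDateTime string, RemoteIP string,
  Requester string, RequestID string, Operation string, `Key` string,
  RequestURI_operation string, RequestURI_key string, RequestURI_httpProtoversion string,
  HTTPstatus string, ErrorCode string, BytesSent string, ObjectSize string,
  TotalTime string, TurnAroundTime string, Referrer string, UserAgent string, VersionId string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (-|[0-9]*) (-|[0-9]*) ([-0-9]*) ([-0-9]*) ([-0-9]*) ([-0-9]*) ([^ ]*) (\"[^\"]*\") ([^ ]*)$'
)
LOCATION 's3://s3server-accesslogs-bucket/logs/';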

Where s3://s3server-accesslogs-bucket/logs/ is the bucket and prefix of your access logs.

Once you create the table, you can query this table for specific information :

Example Queries :
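
A couple of sketches using the illustrative column names from the DDL above (the bucket name and time window are placeholders):

-- Requests per HTTP status for one source bucket:
SELECT HTTPstatus, count(*) AS requests
FROM Accesslogs
WHERE Bucket = 'examplebucket'
GROUP BY HTTPstatus
ORDER BY requests DESC;

-- Requests within a time window; every column is a STRING, so convert the time
-- field with Presto's date_parse before comparing:
SELECT RequestDateTime, RemoteIP, Operation, HTTPstatus, BytesSent
FROM Accesslogs
WHERE date_parse(RequestDateTime, '%d/%b/%Y:%H:%i:%s +0000')
      BETWEEN timestamp '2016-12-05 00:00:00' AND timestamp '2016-12-06 00:00:00';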

Any field can be set to “-” to indicate that the data was unknown or unavailable, or that the field was not applicable to this request.

Considerations:

1. From time to time, AWS might extend the access log record format by adding new fields to the end of each line. The DDL/regex must be written to handle trailing fields that it does not understand.
2. If the S3 directory is huge, with a lot of log files, then depending on the query you might hit Athena's 30-minute query timeout (http://docs.aws.amazon.com/athena/latest/ug/service-limits.html). Some optimizations are covered here: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/

Partitioning / Compression :

Over time, the logs in your target prefix (s3://s3server-accesslogs-bucket/logs/) can grow big. This is especially true if you use the same bucket/prefix as the target for other buckets' access logs. If you enable logging on multiple source buckets that identify the same target bucket, the target bucket will have access logs for all those source buckets, but each log object will report access log records for a specific source bucket.

Amazon S3 uses the following object key format for the log objects it uploads to the target bucket: TargetPrefixYYYY-mm-DD-HH-MM-SS-UniqueString. So it basically creates many files, each containing a batch of log records, in the S3 bucket.

Querying can be slow if there is a large number of small files in text format.

To speed things up, we need at least one of these options to consolidate these logs:
1. PARTITIONING – queries can work on a subset of files instead of going through all files.
2. CONVERT TO COLUMNAR STORAGE – like ORC or Parquet, for faster query performance.
3. ENABLE COMPRESSION – Snappy or GZip.
4. BLOOM FILTERS

Note that S3 access log data is not partitioned: it is not organized with prefixes like /year/month/day or /year=2010/month=10/day=11/, from which Athena could create static or dynamic partitions respectively. Access logs are also uncompressed, flat text files.

Unfortunately, we cannot enable all of the above options just by defining a DDL. We need to copy/move our text-based data set with the above options enabled. One option is to use AWS EMR to periodically structure and partition the S3 access logs so that you can query those logs easily with Athena. At this moment, it is not possible to use Athena itself to convert non-partitioned data into partitioned data.

On EMR, we could use either:
HIVE – to partition/compress/convert, or
the S3-DIST-CP utility – to compress/groupBy.

Note that we can use EMR’s Presto / Spark etc to query the logs, but Athena being a serverless model is much easier to query without managing anything.

Here's an example of converting non-partitioned S3 access logs into partitioned S3 access logs on EMR:
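
The sketch below runs in Hive on EMR. It assumes the non-partitioned Accesslogs table from the DDL above has also been created in EMR Hive's own metastore (EMR does not share Athena's catalog, as noted in the comments), and it writes an ORC-formatted, partitioned copy to a placeholder s3://s3server-accesslogs-bucket/partitioned-logs/ prefix:

-- Destination table: partitioned by year/month/day and stored as ORC.
CREATE EXTERNAL TABLE IF NOT EXISTS Accesslogs_partitionedbyYearMonthDay (
  BucketOwner string, Bucket string, RequestDateTime string, RemoteIP string,
  Requester string, RequestID string, Operation string, `Key` string,
  RequestURI_operation string, RequestURI_key string, RequestURI_httpProtoversion string,
  HTTPstatus string, ErrorCode string, BytesSent string, ObjectSize string,
  TotalTime string, TurnAroundTime string, Referrer string, UserAgent string, VersionId string)
PARTITIONED BY (year string, month string, day string)
STORED AS ORC
LOCATION 's3://s3server-accesslogs-bucket/partitioned-logs/';

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Derive partition values from the time field, e.g. 06/Feb/2014:00:00:38 +0000.
INSERT OVERWRITE TABLE Accesslogs_partitionedbyYearMonthDay PARTITION (year, month, day)
SELECT BucketOwner, Bucket, RequestDateTime, RemoteIP, Requester, RequestID, Operation,
       `Key`, RequestURI_operation, RequestURI_key, RequestURI_httpProtoversion,
       HTTPstatus, ErrorCode, BytesSent, ObjectSize, TotalTime, TurnAroundTime,
       Referrer, UserAgent, VersionId,
       substr(RequestDateTime, 8, 4) AS year,   -- '2014'
       substr(RequestDateTime, 4, 3) AS month,  -- 'Feb'
       substr(RequestDateTime, 1, 2) AS day     -- '06'
FROM Accesslogs;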

Continuing the Partitioning for incoming logs :

Note that the above discussion covers a one-time creation of all partitions. There are two things to notice:
1. While the above job is running, there could be new incoming logs that will not be partitioned onto the destination.
2. After the job is completed, you keep getting new logs.

Which S3 files will be missed by the above job really depends on the hive-client's split calculation on your source S3 bucket. When you run your 'insert overwrite' command, the hive-client calculates splits initially by listing all objects inside the S3 prefix. This usually takes minutes and depends on the number of S3 objects. After this split calculation is done, the actual partitioning work is done by mappers/reducers. So, any log files pushed to the source prefix after split calculation is done will not be included in your job.

When you want to automatically add partitions to the Athena table (Accesslogs_partitionedbyYearMonthDay) as new files, with newer request days, arrive in your S3 access log bucket/prefix, you will need a 'batch job on schedule' or a 'trigger mechanism' to make sure the new data is also partitioned accordingly.
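
Whichever mechanism you use, the Athena catalog also needs to learn about newly written partitions. A minimal sketch, assuming the partitioned table has also been defined in Athena over the year=/month=/day= layout produced above:

-- Discover every partition under the table LOCATION:
MSCK REPAIR TABLE Accesslogs_partitionedbyYearMonthDay;

-- Or register a single day explicitly (values and location are placeholders):
ALTER TABLE Accesslogs_partitionedbyYearMonthDay ADD IF NOT EXISTS
  PARTITION (year='2016', month='Dec', day='06')
  LOCATION 's3://s3server-accesslogs-bucket/partitioned-logs/year=2016/month=Dec/day=06/';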

Batch job on schedule :

Note that our source S3 access logs are dumped into one single directory. Each file name looks like 2016-12-05-17-17-06-5D51435906E10586. Since we run our batch jobs after the above one-time bulk job, on the daily (or more frequent) batches we need to avoid the files already processed by the previous one-time job. For this, we can rely on the file name itself to get a list of newly generated files and then use an ETL job to convert these files to partitioned data, as sketched below. We could also maintain a manifest file of the files already processed and the files that still need to be processed.
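
A rough sketch of the file-name based approach in Hive, using the INPUT__FILE__NAME virtual column to skip records from objects that were already processed (the cutoff value is a placeholder you would track yourself, for example in a manifest; Hive still lists and reads the whole source prefix, it only filters which records get written):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE Accesslogs_partitionedbyYearMonthDay PARTITION (year, month, day)
SELECT BucketOwner, Bucket, RequestDateTime, RemoteIP, Requester, RequestID, Operation,
       `Key`, RequestURI_operation, RequestURI_key, RequestURI_httpProtoversion,
       HTTPstatus, ErrorCode, BytesSent, ObjectSize, TotalTime, TurnAroundTime,
       Referrer, UserAgent, VersionId,
       substr(RequestDateTime, 8, 4) AS year,
       substr(RequestDateTime, 4, 3) AS month,
       substr(RequestDateTime, 1, 2) AS day
FROM Accesslogs
-- Object names start with YYYY-mm-DD-HH-MM-SS, so a lexical comparison on the
-- source file name works as a "newer than" filter:
WHERE INPUT__FILE__NAME > 's3://s3server-accesslogs-bucket/logs/2016-12-05';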

Managing all of that is beyond the scope of this blog. There are multiple ways to do the ETL. One way is to run an AWS Data Pipeline with EMR, on a schedule, and run the Hive queries on the new data.


Trigger based solution :

If you want to run a partition job on every S3 PUT, or on batches of PUTs, you can use AWS Lambda, which can trigger a piece of code on every S3 object PUT. You can get metadata about that PUT and handle it in your Lambda code accordingly, for example to fire off an ETL job to do the partitioning. AWS Lambda ships with the AWS SDK, and you can make all types of AWS API calls with the proper permissions.
http://docs.aws.amazon.com/lambda/latest/dg/with-s3.html
http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html

Other Input Regexes :

CloudFront
'^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$'
ELB access logs
'([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*):([0-9]*) ([.0-9]*) ([.0-9]*) ([.0-9]*) (-|[0-9]*) (-|[0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" ?(\"[^\"]*\")? ?([A-Z0-9-]+)? ?([A-Za-z0-9.-]*)?$'
Apache web logs
'([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?'

You may also like:
Using Athena to Query CloudTrail logs.
Parsing AWS Cloudtrail logs with EMR Hive and Spark



Comments (10)

  • John Richards


    Hello mannem, and thank you for the informative post. I was wondering if you figured out a way to implement partitioning? I have an S3 bucket access logs folder with multiple terabytes of data in it. I tried to query it, and the query timed out after 30 minutes, so I need to do something to improve the performance.

    Thanks,
    John Richards


    • mannem


      John, I updated my blog to add an example for "non-partitioned S3 access logs to partitioned". Basically, the S3 access logs are not partitioned. We need to partition them and convert them to a columnar format for better querying and retrieval by Athena. Ideally, we should keep partitioning incoming access logs over time. In your case, you might run a big one-time job on a decent-sized/tuned EMR cluster to partition TBs' worth of S3 logs based on DATE and/or BUCKET into a different S3 path, in a columnar format, to better query it with Athena.


  • David Panofsky


    The Regex given for ELB ACCESS LOGS will only work for logs created after May 18, 2015. You may want to consider replacing it with:

    '([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*):([0-9]*) ([.0-9]*) ([.0-9]*) ([.0-9]*) (-|[0-9]*) (-|[0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" ?(\"[^\"]*\")? ?([A-Z0-9-]+)? ?([A-Za-z0-9.-]*)?$'

    This regex will work with both old and new logs. The older records will just always have 'user_agent', 'ssl_cipher' and 'ssl_protocol' come back as null.


    • mannem


      David, thanks for the tip. I updated the blog to reflect your regex.


  • James


    Hello John,

    In your case, I would use Amazon EMR to process the logs and instead of producing a text file as an output, insert those logs into Athena.
    James


  • Serdar


    Hi Mannem,
    I was trying to follow the steps to partition the S3 bucket, but when I try to execute the script it returns the error FAILED: SemanticException [Error 10001]: Line 25:6 Table not found 'tabl_created_in_athena'. Does AWS EMR have access to the table created in Athena?


    • mannem


      Serdar@, no, AWS EMR does not have access to the metadata used by Athena. EMR uses its own local MySQL metastore by default (which is empty for a new cluster), so you'll need to run the DDLs separately.


  • Drowsy


    This was very helpful, thank you.


  • Trent


    I have a very large S3 bucket (~1.5 TB) and I keep getting Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.thrift.transport.TTransportException when creating the ‘Accesslogs’ table in Hive. I’m looking through the Hive logs but nothing stands out. Any help?


