Query AWS CloudTrail with Athena for GxP compliance & auditing

Aneesh Karve

Published in

Quilt

3 min readFeb 23, 2023

Compliance regimes like GxP require teams to audit facts such as the following:

Who accessed data?
Which data did they access?
When did they access this data?
Were the data downloaded to an external device?

Pre-requisites

Your data are in Amazon S3 buckets
You are monitoring the S3 buckets in CloudTrail
You have set up an Athena table to query your CloudTrail logs.
You use Amazon Athena Engine version 3 so that you can access to the latest Trino functions.

Example

Suppose you wish to check CloudTrail for download events within the last hour against s3://YOUR_BUCKET. You can use the following SQL:

SELECT * FROM "YOUR_DB"."YOUR_TABLE"
WHERE date_diff('hour', from_iso8601_timestamp(eventtime), now()) <= 1
and eventname = 'GetObject'
and json_extract_scalar(requestparameters, '$.bucketName') = 'YOUR_BUCKET';

Let’s now examine the WHERE clause and some CloudTrail columns to understand this query.

eventtime

eventtime contains strings like "2023–02–20T21:12:08Z" that we can parse with from_iso8601_timestamp. date_diff handily creates a temporal window around its last argument. The units of the window range from nanoseconds to (ns) to days (d).

eventname

Depending on which agents use your S3 buckets, eventname may have a broad variety of values. A few examples of Get events include GetDownloadUrlForLayer, GetAuthorizationToken, GetTrailStatus, and (of interest to us because we wish to audit object downloads) GetObject.

useridentity

useridentity is a STRUCT. We’ll focus on the .userName and .principalid fields. In our example, we’re auditing usage of S3 via the Quilt Catalog, a JavaScript application wherein all users have assumed roles. We notice that userName is empty but principalid contains email addresses: "ABC1234567:user@example.com".

We can isolate usernames as follows:

SELECT element_at(split(useridentity.userName, ':'), -1)

requestparameters

requestparameters is a JSON string that we can parse with the indispensable json_extract_scalar, which uses a JSONPath expression to extract any JSON element as a string. json_extract_scalar(requestparameters, '$.bucketName') pulls out the name of the parent S3 bucket. Similarly, we can isolate file names (object keys, really) with json_extract_scalar(requestparameters, '$.key').

But did they download anything?

This is where things get interesting. In the Quilt Catalog users can preview S3 objects (which results in a GetObject call but the data stay within AWS Lambda), or they can download S3 objects to disk. To distinguish preview from download we can examine sourceipaddress and useragent. If our downloading users all share a specific IP prefix, perhaps from our VPN, we can do something like this:

WHERE sourceipaddress LIKE '10.162.%'

Alternatively, we can filter out known AWS IPs to eliminate GetObject calls from within the AWS cloud.

We can even determine exactly how many bytes were transferred out:

SELECT json_extract(additionaleventdata, '$.bytesTransferredOut')
FROM "YOUR_DB"."YOUR_TABLE"
WHERE CAST(json_extract(additionaleventdata, '$.bytesTransferredOut') AS INTEGER) > 0

Another clue as to “who downloaded this S3 object” can be found in the useragent column. Notice that the quilt3 Python client and Quilt Catalog have distinct user agents.

"[Python-urllib/3.7]"
"[Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36]"

Conclusion

AWS Athena is a straightforward way to comb through CloudTrail logs for compliance and auditing purposes. The queries are not always performant because they often require large table scans. In a future article, we may examine partition projection to improve performance.