Query AWS CloudTrail with Athena for GxP compliance & auditing
Compliance regimes like GxP require teams to audit facts such as the following:
- Who accessed data?
- Which data did they access?
- When did they access this data?
- Were the data downloaded to an external device?
Pre-requisites
- Your data are in Amazon S3 buckets
- You are monitoring the S3 buckets in CloudTrail
- You have set up an Athena table to query your CloudTrail logs.
- You use Amazon Athena Engine version 3 so that you can access to the latest Trino functions.
Example
Suppose you wish to check CloudTrail for download events within the last hour against s3://YOUR_BUCKET
. You can use the following SQL:
SELECT * FROM "YOUR_DB"."YOUR_TABLE"
WHERE date_diff('hour', from_iso8601_timestamp(eventtime), now()) <= 1
and eventname = 'GetObject'
and json_extract_scalar(requestparameters, '$.bucketName') = 'YOUR_BUCKET';
Let’s now examine the WHERE clause and some CloudTrail columns to understand this query.
eventtime
eventtime
contains strings like "2023–02–20T21:12:08Z"
that we can parse with from_iso8601_timestamp
. date_diff
handily creates a temporal window around its last argument. The units of the window range from nanoseconds to (ns
) to days (d
).
eventname
Depending on which agents use your S3 buckets, eventname
may have a broad variety of values. A few examples of Get events include GetDownloadUrlForLayer, GetAuthorizationToken, GetTrailStatus
, and (of interest to us because we wish to audit object downloads) GetObject
.
useridentity
useridentity
is a STRUCT
. We’ll focus on the .userName
and .principalid
fields. In our example, we’re auditing usage of S3 via the Quilt Catalog, a JavaScript application wherein all users have assumed roles. We notice that userName
is empty but principalid
contains email addresses: "ABC1234567:user@example.com"
.
We can isolate usernames as follows:
SELECT element_at(split(useridentity.userName, ':'), -1)
requestparameters
requestparameters
is a JSON string that we can parse with the indispensable json_extract_scalar
, which uses a JSONPath expression to extract any JSON element as a string. json_extract_scalar(requestparameters, '$.bucketName')
pulls out the name of the parent S3 bucket. Similarly, we can isolate file names (object keys, really) with json_extract_scalar(requestparameters, '$.key')
.
But did they download anything?
This is where things get interesting. In the Quilt Catalog users can preview S3 objects (which results in a GetObject call but the data stay within AWS Lambda), or they can download S3 objects to disk. To distinguish preview from download we can examine sourceipaddress
and useragent
. If our downloading users all share a specific IP prefix, perhaps from our VPN, we can do something like this:
WHERE sourceipaddress LIKE '10.162.%'
Alternatively, we can filter out known AWS IPs to eliminate GetObject
calls from within the AWS cloud.
We can even determine exactly how many bytes were transferred out:
SELECT json_extract(additionaleventdata, '$.bytesTransferredOut')
FROM "YOUR_DB"."YOUR_TABLE"
WHERE CAST(json_extract(additionaleventdata, '$.bytesTransferredOut') AS INTEGER) > 0
Another clue as to “who downloaded this S3 object” can be found in the useragent
column. Notice that the quilt3
Python client and Quilt Catalog have distinct user agents.
"[Python-urllib/3.7]"
"[Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36]"
Conclusion
AWS Athena is a straightforward way to comb through CloudTrail logs for compliance and auditing purposes. The queries are not always performant because they often require large table scans. In a future article, we may examine partition projection to improve performance.