Macie: AWS’ answer to “Gee I didn’t know that was there…”

Oct 22, 2020

Recently I’ve read a number of articles about enterprises misconfiguring their IT assets and exposing sensitive data to people who shouldn’t see it. In light of this, I’ve decided to pen some thoughts on AWS’ revamped Macie service, and how it could assist organisations looking to understand what sits in their S3 Buckets. I’ve also added my two cents on a few configuration trade-offs. I should point out that these are my opinions, and don’t reflect those of my employer.

Oh, I didn’t realise that was in there…

S3 Buckets can function a lot like the drawer in the hallway or kitchen: “The Everything Drawer”. Their versatility can present an issue from a risk profile perspective, given the breadth of objects that may reside in there. Even if you think you know what might be in your hallway drawer or S3 bucket, chances are you don’t get everything 100% right. This is important. How would your decisions change if your passport was in there, as opposed to your favourite triple A batteries? Would you think about putting a lock on it, with a set number of keys that you give only to your immediate family? Perhaps you’d relocate it to a more secure part of the house. They’d have to be some pretty special batteries if you are going to those lengths to protect them. Important distinctions between what poses a risk, what warrants a conversation, and what requires urgent remediation can’t be made until you’re fully aware of what is in your environment, or your hallway drawers. Enter Macie.

Let’s face it, everyone has a drawer ~ “Day 149 — Money” by DaGoaty

Where does Macie come in?

Macie is a managed service by AWS, which monitors the contents of S3 buckets and their configurations. S3 configuration alerting isn’t unique to Macie, but the flagging of sensitive data within files is the key differentiator of the service. S3 configuration deserves its own blog post, so for the purposes of this post I am going to focus on Macie’s sensitive information discovery.

Macie uses a combination of black box machine learning and regex functions, in conjunction with keyword lists, to flag potentially sensitive information in text-based files. Whilst it’s limited to S3 only, you can get round this by exporting from RDS and DynamoDB in Apache Parquet format, and temporarily uploading the export to a bucket to be scanned.
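As a rough illustration of the RDS half of that workaround, here is a minimal sketch that builds the arguments for the RDS snapshot export API (start_export_task in boto3, which writes Parquet to S3). The ARNs, bucket name, prefix and task name are all placeholders of my own, and the client is passed in so nothing here touches a real account.

```python
def build_rds_export_request(snapshot_arn, bucket, iam_role_arn, kms_key_id,
                             task_id="macie-scan-export"):
    """Build keyword arguments for the RDS start_export_task call.

    All values are supplied by the caller; the default task_id and the
    S3 prefix below are illustrative placeholders.
    """
    return {
        "ExportTaskIdentifier": task_id,
        "SourceArn": snapshot_arn,          # snapshot to export
        "S3BucketName": bucket,             # staging bucket Macie will scan
        "S3Prefix": "macie-staging/",       # hypothetical prefix
        "IamRoleArn": iam_role_arn,         # role RDS assumes to write to S3
        "KmsKeyId": kms_key_id,             # key used to encrypt the export
    }


def start_export(rds_client, request):
    """Submit the export; rds_client would be boto3.client("rds")."""
    return rds_client.start_export_task(**request)
```

Once the export lands in the staging bucket, a one-time Macie job over that bucket can scan it, after which the export can be deleted.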

Macie looks for the sensitive data from the following broad categories:

Personally Identifiable Information ~ Names, Addresses, Telephone numbers, Driver’s License numbers etc
Health Information ~ North American dominated formats
Financial Data ~ Credit Card Numbers, CVV/C, Bank Account Numbers etc
Credentials ~ AWS Secrets, OpenSSH, PKCS etc

Basically, all the juicy stuff that really bulks out an ITNews article, or makes the day of anyone with a penchant for Black Hats stumbling across it. In addition to these categories, Macie lets you define custom data identifiers (a regex, optionally paired with keywords) that work in much the same way as its in-built identifiers.
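To make the custom identifier idea a little more concrete, here is a small sketch. The "EMP-" employee ID format is invented purely for illustration, and the local regex check uses Python's re module, which is only an approximation of how Macie itself evaluates patterns. The request dictionary follows the shape of the Macie create_custom_data_identifier API.

```python
import re
import uuid

# Hypothetical format: internal employee IDs such as "EMP-123456".
EMPLOYEE_ID_REGEX = r"EMP-\d{6}"


def build_custom_identifier_request(name, regex, keywords, max_distance=50):
    """Build keyword arguments for macie2's create_custom_data_identifier."""
    return {
        "name": name,
        "regex": regex,
        "keywords": keywords,                  # words expected near a match
        "maximumMatchDistance": max_distance,  # characters between keyword and match
        "clientToken": str(uuid.uuid4()),      # idempotency token
    }


def find_candidates(text, regex):
    """Try the pattern locally before handing it to Macie (approximation only)."""
    return re.findall(regex, text)
```

Running find_candidates over a few sample files before creating the identifier is a cheap way to catch a pattern that is too greedy, which matters when every false positive becomes a finding to triage.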

Macie operates by configuring ‘jobs’, which instruct Macie to run over a certain collection of S3 buckets and the objects within them. As sensitive data discovery is billed on the volume of data ingested, scoping the scans effectively is important from an operational and cost perspective (discussed further below). These jobs can be scheduled, or run as a once-only process. In addition, the scope of objects covered in a bucket can be limited based on the presence or absence of tags, or to a random sample that makes up a set percentage of coverage, e.g. 50% of objects. The sampling depth and one-time jobs are great for dipping your toe in the Macie pool. That said, where a bucket holds a single potentially sensitive file (e.g. a pem key), a sample may well miss it and so give an inaccurate picture of the bucket. Longer term, standardising object-level tagging for inclusion and exclusion across an environment (easier said than done) can be a great way of ensuring that Macie is focused on the objects it should be.
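The options above map fairly directly onto the Macie create_classification_job API. The sketch below only builds the request dictionary; the job naming scheme, the weekly Monday schedule and the example tag are my own assumptions rather than anything Macie prescribes.

```python
import uuid


def build_job_request(account_id, bucket, sampling_percentage=100,
                      exclude_tag=None, scheduled=False):
    """Build keyword arguments for macie2's create_classification_job."""
    s3_job_definition = {
        "bucketDefinitions": [{"accountId": account_id, "buckets": [bucket]}],
    }
    if exclude_tag is not None:
        tag_key, tag_value = exclude_tag
        # Skip any object carrying this tag (hypothetical tagging standard).
        s3_job_definition["scoping"] = {
            "excludes": {
                "and": [{
                    "tagScopeTerm": {
                        "comparator": "EQ",
                        "key": "TAG",
                        "tagValues": [{"key": tag_key, "value": tag_value}],
                        "target": "S3_OBJECT",
                    }
                }]
            }
        }
    request = {
        "clientToken": str(uuid.uuid4()),
        "name": f"macie-{bucket}",  # assumed naming convention
        "jobType": "SCHEDULED" if scheduled else "ONE_TIME",
        "s3JobDefinition": s3_job_definition,
        "samplingPercentage": sampling_percentage,
    }
    if scheduled:
        # Assumed cadence; daily and monthly schedules are also available.
        request["scheduleFrequency"] = {"weeklySchedule": {"dayOfWeek": "MONDAY"}}
    return request
```

With boto3 installed and credentials in place, the result would be passed along as boto3.client("macie2").create_classification_job(**request). A one-time job at 50% sampling is a cheap first look, but as noted above it roughly halves the chance of catching a lone sensitive object.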

There are a number of ways to moderate the files for Macie to oversee. This section plays a huge part in deciding whether your Macie implementation represents “Death by Alerts” or “Crickets”.

For each file flagged as containing sensitive information, Macie produces a finding which is shown in the console, with a 30-day retention period. As a highly encouraged additional configuration, an S3 bucket can be set as the destination for all Macie findings for indefinite storage. Being able to interpret the detailed reports will help locate the exact position of the data; without them, looking through files and folders is akin to searching for a needle in a haystack. If it’s not immediately obvious, Macie can provide the data to fuel conversations aimed at understanding what the business need is for this data to reside within a bucket. There may be perfectly legitimate reasons for potentially sensitive data to reside in a bucket, but the configuration around the bucket (access control and encryption as a starting point) needs to reflect the risk profile.
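For the long-term storage piece, the relevant Macie API call is put_classification_export_configuration, which points the detailed discovery results at a bucket and KMS key of your choosing. A minimal sketch, with the bucket name, key ARN and prefix all being placeholders:

```python
def build_export_configuration(bucket_name, kms_key_arn, key_prefix="macie-results/"):
    """Build keyword arguments for macie2's put_classification_export_configuration."""
    return {
        "configuration": {
            "s3Destination": {
                "bucketName": bucket_name,  # destination for detailed results
                "keyPrefix": key_prefix,    # hypothetical prefix
                "kmsKeyArn": kms_key_arn,   # results are encrypted with this key
            }
        }
    }


def apply_export_configuration(macie_client, request):
    """Apply the configuration; macie_client would be boto3.client("macie2")."""
    return macie_client.put_classification_export_configuration(**request)
```

The destination bucket deserves the same scrutiny as any other: it will hold pointers to exactly where your sensitive data lives.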

So, you think you know what’s in your bucket? An example of Macie’s findings page, with different types of sensitive information across multiple files.

As it stands, Macie doesn’t have the ability to explicitly mark findings as false positives. Instead, it requires users to create suppression rules, which are collections of attributes that are checked against findings. Any finding that fits that set of attributes (e.g. a personal data finding in bucket1234) will still be generated, but is marked as archived and filtered out of the default findings dashboard.

Suppression Rules allow you to tune out findings that aren’t of interest and focus on those that matter.
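In API terms, a suppression rule is a findings filter whose action is ARCHIVE. The sketch below builds one for the bucket1234 example via the shape of create_findings_filter, alongside a simplified local check of which findings it would catch. The local check only understands "eq" conditions and is my own approximation, not Macie's matching logic.

```python
import uuid


def build_suppression_rule(name, bucket, finding_type):
    """Build keyword arguments for macie2's create_findings_filter."""
    return {
        "name": name,
        "action": "ARCHIVE",  # archive matching findings rather than surface them
        "findingCriteria": {
            "criterion": {
                "resourcesAffected.s3Bucket.name": {"eq": [bucket]},
                "type": {"eq": [finding_type]},
            }
        },
        "clientToken": str(uuid.uuid4()),
    }


def lookup(finding, dotted_path):
    """Follow a dotted field path through a finding dict; None if absent."""
    value = finding
    for part in dotted_path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def would_suppress(finding, rule):
    """Simplified local check: every 'eq' condition must match."""
    criterion = rule["findingCriteria"]["criterion"]
    return all(lookup(finding, field) in condition["eq"]
               for field, condition in criterion.items())
```

Keeping rules this narrow (one bucket, one finding type) means a new kind of finding in the same bucket still shows up on the dashboard.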

Configurations for Thought

For anyone considering implementing Macie within their environments, there are a few configuration decisions that are worth deliberating over.

Jobs to Bucket Ratio

Macie jobs govern the scope of S3 Buckets and objects to scan, which is then tied to a Job ID in the Macie console. Jobs are immutable, and as such I’d recommend scoping each job to a single S3 bucket. Whilst this does create a slight administrative overhead initially (negligible if your Macie onboarding process is automated), the flexibility to onboard and offload buckets individually is well worth the trade-off. For anyone interested, I’ve uploaded some code to my GitHub as a point of reference for configuring multiple single-bucket Macie jobs.
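To show the shape of the one-job-per-bucket idea (this is a fresh sketch for this post, not the code from my repository), the helpers below create one scheduled job per bucket and keep a bucket-to-job mapping, so a single bucket can later be offloaded by cancelling only its job. The naming convention and daily schedule are assumptions, and the client is passed in rather than created here.

```python
import uuid


def build_single_bucket_jobs(account_id, buckets):
    """One create_classification_job request per bucket."""
    return [
        {
            "clientToken": str(uuid.uuid4()),
            "name": f"macie-{bucket}",              # assumed naming convention
            "jobType": "SCHEDULED",
            "scheduleFrequency": {"dailySchedule": {}},  # assumed cadence
            "s3JobDefinition": {
                "bucketDefinitions": [
                    {"accountId": account_id, "buckets": [bucket]}
                ]
            },
        }
        for bucket in buckets
    ]


def onboard(macie_client, requests):
    """Create the jobs and return a {bucket: job_id} mapping."""
    job_ids = {}
    for request in requests:
        bucket = request["s3JobDefinition"]["bucketDefinitions"][0]["buckets"][0]
        response = macie_client.create_classification_job(**request)
        job_ids[bucket] = response["jobId"]
    return job_ids


def offload(macie_client, job_ids, bucket):
    """Cancel only the job that covers the given bucket."""
    macie_client.update_classification_job(jobId=job_ids.pop(bucket),
                                           jobStatus="CANCELLED")
```

Because jobs can't be edited after creation, a multi-bucket job would have to be cancelled and recreated to drop one bucket; with this layout the other jobs carry on untouched.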

Scoping

Like most things, investing in setting up and scoping the Macie scans properly at the beginning will save you time, money and sanity (think log file UIDs that just happen to look like credit card numbers). As an example, buckets storing objects like VPC flow logs will likely yield minimal true positives if AWS defaults and logging best practices are adhered to. As a starting point, creating a scan of these buckets with a sample percentage depth configuration can be a good way to indicate whether more long-term options are required. There is always going to be a trade-off associated with bucket characteristics when deciding whether Macie is a worthwhile pursuit. As an example, when comparing S3 Buckets with differing data flows:

Job configuration location

Macie is natively integrated with AWS Organizations, and allows a delegated Macie administrator account to be configured to scan S3 Buckets across the organisation. Whilst this does give a ‘one view’ of the environment, sensitive data job results can only be viewed in the Macie console of the account in which the job was created. This can present an issue when it comes to triaging the results, particularly given that the teams responsible for Macie operation/administration and for triaging the results may be different. For ease of triage and a reduction in IAM requirements, it can be more practical to configure the jobs in the account in which the scanned S3 bucket resides. If aggregating the findings back into an Audit account or SIEM is important, an EventBridge-driven Lambda function may be your best bet.
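A minimal sketch of what that Lambda might look like is below. Macie publishes findings to EventBridge with the source "aws.macie" and detail type "Macie Finding"; the handler trims each finding to a handful of fields and writes it to the function's log. Where the summary goes from there (a central bucket, a queue, a SIEM endpoint) is left open, and the choice of fields is my own.

```python
import json

# EventBridge rule pattern that routes Macie findings to the function.
EVENT_PATTERN = {"source": ["aws.macie"], "detail-type": ["Macie Finding"]}


def summarise_finding(detail):
    """Reduce a Macie finding to the fields a triage team is likely to need."""
    resources = detail.get("resourcesAffected", {})
    return {
        "id": detail.get("id"),
        "type": detail.get("type"),
        "severity": detail.get("severity", {}).get("description"),
        "account": detail.get("accountId"),
        "bucket": resources.get("s3Bucket", {}).get("name"),
        "key": resources.get("s3Object", {}).get("key"),
    }


def handler(event, context):
    """Lambda entry point: log a one-line JSON summary of the finding."""
    summary = summarise_finding(event.get("detail", {}))
    # Forwarding to an Audit account or SIEM would replace or follow this print.
    print(json.dumps(summary))
    return summary
```

Emitting a compact, consistent JSON line per finding keeps the downstream parsing simple, whichever destination ends up consuming it.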

A piece of the sensitive information discovery puzzle

Macie presents a turnkey solution for people to understand more about what is in their environments, with minimal initial configuration required. Understanding the full data landscape enables a fully informed risk discussion on appropriate configurations within environments. Running Macie across somewhere like a Sandpit environment can create a lot of “Oh I didn’t realise that was there” moments, and with them some quick, basic security hygiene wins. With the basic building blocks for scalability in place, Macie can be easily utilised as a one-time recon tool, or inserted as a standalone layer in a defence-in-depth strategy.

I hope that you’ve got some useful information out of this blog post. If you’ve got any constructive criticism, whether technical, literary or otherwise, I’d be all ears.

Written by Ben Leembruggen