What is AWS Aurora Serverless v2 and why we didn't use it
Alex Norman

Publish Date: Aug 11

We’re a big user of RDS Aurora from AWS to run our production workloads because it doesn’t require much infrastructure maintenance and feels like a very durable platform. If you run databases or are using RDS in AWS, then you’ve probably heard of AWS’s Aurora Serverless v2. In this article, I’m going to touch on what Aurora is, how Aurora Serverless v2 can help with your dynamic loads, and ultimately why we didn’t use it for our product.

What is Aurora

Ripped straight from the AWS docs:

Amazon Aurora (Aurora) is a fully managed relational database engine that's compatible with MySQL and PostgreSQL

What is Aurora Serverless v2

Ripped straight from the AWS docs:

Aurora Serverless v2 is an on-demand, autoscaling configuration for Amazon Aurora. Aurora Serverless v2 helps to automate the processes of monitoring the workload and adjusting the capacity for your databases.

I’m being very specific about what “v2” is because it’s a huge improvement over the product’s v1. So much so that even their own documentation is very clear that they are two different things.

[Image: Different serverless]

What problem were we trying to solve

Our product is heavy on the server side, which means quite a lot goes through the database. You can read more about it in the blog post “Why is Kinde built the way it is?”.

Keeping on top of database performance and scaling is critical. When an unexpected event occurs, such as a large customer migrating their entire user base in a weekend or being on the receiving end of a DDoS, our infrastructure needs to meet those needs. The default model for RDS Aurora is to provision on-demand instances with a fixed amount of vCPU and memory based on your requirements. You would have an instance size that is larger than your baseline so that unexpected events can be absorbed. If something too large was coming over the hill, then you'd have to provision a larger instance and then fail over to it. This all sounds quite inefficient.

Enter Aurora Serverless v2. Instead of provisioning based on a static vCPU and memory size, you now set a minimum and maximum threshold of those resources. Aurora Serverless v2 capacity is measured in ACUs (Aurora Capacity Units).

1 ACU has approximately 2 GiB of memory with corresponding CPU and networking, similar to that used in Aurora provisioned instances.

Using this model, we would only need to set a practical minimum and then set a maximum so that our bill doesn't explode.
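As a rough illustration of that model, here's a minimal boto3 sketch of setting the scaling range on a cluster. The cluster identifier and region are placeholders rather than our real setup; the point is that Serverless v2 capacity is configured as a min/max ACU range at the cluster level:

```python
import boto3

# Hypothetical cluster identifier, used purely for illustration.
CLUSTER_ID = "example-aurora-cluster"

rds = boto3.client("rds", region_name="us-west-2")

# Serverless v2 capacity is configured on the cluster as a min/max ACU range.
# Instances added with DBInstanceClass="db.serverless" scale within this range.
rds.modify_db_cluster(
    DBClusterIdentifier=CLUSTER_ID,
    ServerlessV2ScalingConfiguration={
        "MinCapacity": 64.0,   # practical floor so the cache stays warm
        "MaxCapacity": 128.0,  # ceiling so the bill doesn't explode
    },
    ApplyImmediately=True,
)
```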

Setup

We have a synthetics testing system that's used to measure latency for a typical username and password authentication session. This can also be used to measure overall performance of the serverless cluster to learn a bit more about how our product is impacted by the dynamic changes.
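For context, a synthetic probe of this kind essentially boils down to timing a full authentication round-trip and recording the latency. The sketch below is illustrative only; the endpoint and credentials are hypothetical placeholders, not our actual synthetics system:

```python
import time

import requests  # third-party HTTP client

# Hypothetical token endpoint and credentials; the real synthetics system
# drives a full username/password authentication session.
AUTH_URL = "https://auth.example.com/oauth2/token"


def measure_auth_latency_ms() -> float:
    """Time one authentication round-trip and return the latency in milliseconds."""
    start = time.perf_counter()
    resp = requests.post(
        AUTH_URL,
        data={
            "grant_type": "password",
            "username": "synthetic-user",
            "password": "synthetic-pass",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return (time.perf_counter() - start) * 1000.0


if __name__ == "__main__":
    print(f"auth round-trip: {measure_auth_latency_ms():.1f} ms")
```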

For this test we went big, since smaller tests had all worked really well and we wanted to properly stress it. First we provisioned a big ole db.r8g.8xlarge on-demand instance, then hit it with an extremely large load that is fairly representative of a production baseline. For the serverless instance, we set a minimum of 64 ACUs and a maximum of 128 ACUs. With the serverless instance as the reader, it had a chance to warm up by itself. Once things looked to have settled down, we failed over the cluster so that the serverless instance became the writer. Crunch time.
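In boto3 terms, the serverless side of that setup looks roughly like the sketch below. The identifiers and engine are assumptions for illustration, not our real configuration:

```python
import boto3

rds = boto3.client("rds", region_name="us-west-2")

# Hypothetical identifiers for illustration only.
CLUSTER_ID = "example-aurora-cluster"
SERVERLESS_INSTANCE_ID = "example-serverless-reader"

# Add a Serverless v2 instance to the cluster as a reader.
# "db.serverless" is the instance class that opts an instance into ACU scaling.
rds.create_db_instance(
    DBInstanceIdentifier=SERVERLESS_INSTANCE_ID,
    DBClusterIdentifier=CLUSTER_ID,
    DBInstanceClass="db.serverless",
    Engine="aurora-postgresql",  # assumption: must match the cluster's engine
)

# ... wait for the reader to become available and warm up under load ...

# Promote the serverless reader to writer by failing the cluster over to it.
rds.failover_db_cluster(
    DBClusterIdentifier=CLUSTER_ID,
    TargetDBInstanceIdentifier=SERVERLESS_INSTANCE_ID,
)
```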

Testing good

So how did it go? So-so.

Functionally everything worked as advertised. It was quite cool to see the ACUs automatically adjust based on the load and how it would ebb and flow depending on extra jobs we would add to the database.

The instance failed over at the same speed as we're used to with a pair of on-demand instances, we experienced no timeouts due to memory pressure, and the ACU scaling could be seen as the database load changed.

In the ACU chart below, you can see the scaling kick off when the instance was provisioned.

[Image: ACU metrics]

And once the workload settled down after the failover, it landed on about 80 ACUs.

[Image: ACU baseline]

One part that was surprising to us was the time it took to provision the ACUs.

It took 15 minutes to provision 90 ACUs

For some reason I was expecting it to scale up much faster.
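If you want to pull the same ACU numbers outside the console, they're exposed through the CloudWatch ServerlessDatabaseCapacity metric. A rough boto3 sketch, with a placeholder instance identifier:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

# Hypothetical instance identifier.
INSTANCE_ID = "example-serverless-reader"

end = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="ServerlessDatabaseCapacity",  # current ACUs for a Serverless v2 instance
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": INSTANCE_ID}],
    StartTime=end - timedelta(hours=1),
    EndTime=end,
    Period=60,
    Statistics=["Average"],
)

# Print the last hour of ACU readings in time order.
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1), "ACUs")
```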

Testing bad

What was even more surprising was the poor performance of the cluster once the scaling had settled down after the failover. Everything was a bit slower. The next graph shows the response latency for our synthetics monitoring in milliseconds. While the total transaction is still quite fast, it's noticeably slower across the board.

[Image: Aurora latency]

Not only is the latency baseline slower, it also seems more susceptible to small spikes. This is like moving from SSDs back to those hybrid magnetic HDDs.

Our baseline was running on an r8g on-demand instance, which is the latest generation available to RDS. I have no proof of this, but I suspect that Aurora Serverless v2 is re-using older hardware, basically giving AWS a longer return on investment for the data centre hardware that isn't being used by current customers. Over the years, we've noticed increasingly better performance and more consistent latency when moving from r6g to r7g and then finally to r8g.

Using Aurora Serverless v2 seems like a step in the wrong direction for overall performance. I guess in the grand scheme of things, this will need to be a considered trade-off in return for dynamically increasing and reducing resource allocations for a system that is traditionally very static.

Cost

The pricing is a bit predatory too. Let's take the example we just tested. At the time of writing, in the AWS us-west-2 (Oregon) region:

  • 1 ACU = $0.12 per hour

So if we worked out the per hour cost of our serverless baseline:

  • 80 ACUs (160 GiB memory) × $0.12 = $9.60 per hour

And then the price we were paying for the on-demand instance that had sufficient overhead already:

  • 1 × db.r8g.8xlarge (256 GiB memory) = $4.416 per hour

This makes Aurora Serverless v2 over double the cost at the top end of what we provisioned. The 8xlarge is already over-provisioned so that we could handle the occasional spikes.
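Putting those numbers side by side (using the us-west-2 prices quoted above):

```python
# Hourly cost comparison using the us-west-2 prices quoted above.
ACU_PRICE_PER_HOUR = 0.12           # 1 ACU per hour
SERVERLESS_BASELINE_ACUS = 80       # where the workload settled after failover
ON_DEMAND_8XLARGE_PER_HOUR = 4.416  # db.r8g.8xlarge on-demand

serverless_per_hour = SERVERLESS_BASELINE_ACUS * ACU_PRICE_PER_HOUR
print(f"Serverless v2 baseline:     ${serverless_per_hour:.2f}/hour")          # $9.60/hour
print(f"Provisioned db.r8g.8xlarge: ${ON_DEMAND_8XLARGE_PER_HOUR:.3f}/hour")   # $4.416/hour
print(f"Ratio: {serverless_per_hour / ON_DEMAND_8XLARGE_PER_HOUR:.2f}x")       # ~2.17x
```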

Why it’s not right for us

In the end, the baseline performance and the costs are too far away from our expectations to consider using Aurora Serverless v2.

The baseline performance really surprised us. I have noticed that AWS doesn't publish what type of CPU or hardware generation is being used. Maybe that's my older-generation-hardware conspiracy theory at work? More likely it's the extra layers of virtualisation imposing a performance penalty.

The pricing wasn't too much of a surprise, since the delta between ACU pricing and the equivalent on-demand resourcing is already well known. But the baseline our instance sat at was a bit surprising: given the over-provisioning we'd already done, we were hoping it would settle much closer to the minimum allocation.

I think if we were to use this again in the future, it would be for highly, highly dynamic workloads where the instance could effectively be dropped down to 4 ACUs and then automatically scale up to something like 128 for the necessary period of time. That would likely save a lot compared to keeping a large enough instance available. And that also assumes the database even needs to be online the whole time.
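As a rough back-of-the-envelope comparison of that scenario, here's a sketch where every utilisation figure is a made-up assumption purely for illustration:

```python
# Back-of-the-envelope monthly comparison for a very spiky workload.
# All utilisation numbers here are made-up assumptions for illustration.
ACU_PRICE_PER_HOUR = 0.12
HOURS_PER_MONTH = 730

# Assumption: idle at 4 ACUs ~95% of the time, bursting to 128 ACUs ~5% of the time.
serverless_monthly = (
    4 * ACU_PRICE_PER_HOUR * HOURS_PER_MONTH * 0.95
    + 128 * ACU_PRICE_PER_HOUR * HOURS_PER_MONTH * 0.05
)

# Versus keeping a db.r8g.8xlarge around for the whole month.
provisioned_monthly = 4.416 * HOURS_PER_MONTH

print(f"Serverless (spiky assumption): ${serverless_monthly:,.0f}/month")   # ~$894
print(f"Always-on db.r8g.8xlarge:      ${provisioned_monthly:,.0f}/month")  # ~$3,224
```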

Anyways, pop a message through or a comment if you've had an experience with Aurora Serverless v2. Good or bad.
