How We Cut Our Client's AWS Bill by 90% by Going Serverless
Mehul Budasana

Publish Date: Aug 8

$28,492.61.

That was the AWS bill for a single month. I remember staring at the number in the Cost Explorer dashboard and refreshing the page, hoping it was some glitch or untagged resource anomaly. It wasn’t.

The number was real, and it was rising fast.

Our client, a data-heavy B2B SaaS platform, had been scaling rapidly. Their user base had doubled in just six months. The product was growing. The team was shipping fast. But behind the scenes, the infrastructure was groaning under the weight, and the cloud bill had become a runaway train.

In theory, cloud costs were supposed to scale with usage. In reality, they were scaling with inefficiency.

Their infrastructure, built on traditional EC2 instances, RDS clusters, and long-running background workers, was designed for availability, not efficiency. Even at 2:00 AM, compute and storage services were humming, burning through budgets like a production-grade app under peak load.

As experienced AWS consultants, we weren’t just looking at a cost problem; we were looking at a design problem. What came next was a complete rethink of the system from the ground up. Here’s exactly what we did, and how.

Phase 1: Understanding the Problem

Before we could fix anything, we needed to understand what was really causing the high costs.

We started by digging deep into the AWS billing reports, usage dashboards, and CloudWatch logs. Our goal was to find out where exactly the money was going, and what parts of the system were using the most resources.
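
To give a feel for that kind of digging, here is a minimal sketch of pulling one month of spend grouped by service through boto3's Cost Explorer API. The date range is an example and the script assumes AWS credentials are already configured; it is an illustration of the approach, not the client's actual tooling.

```python
import boto3

# Cost Explorer lives behind a single endpoint in us-east-1.
ce = boto3.client("ce", region_name="us-east-1")

# One month of unblended cost, grouped by service, to see
# which services dominate the bill.
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-07-01", "End": "2024-08-01"},  # example month
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

# Print services sorted by spend, largest first.
for result in response["ResultsByTime"]:
    groups = sorted(
        result["Groups"],
        key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]),
        reverse=True,
    )
    for group in groups:
        service = group["Keys"][0]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(f"{service}: ${amount:,.2f}")
```

A report like this makes it obvious within minutes whether EC2, RDS, or data transfer is the line item dragging the bill upward.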

A few patterns stood out right away:

  • EC2 instances were running 24/7, even when there was no traffic.
  • RDS (Relational Database Service) was heavily over-provisioned, sized for peak performance but rarely running anywhere near capacity.
  • Background jobs were always running, including data processing tasks that didn’t need to be active all the time.
  • There was no autoscaling in place for many services, which meant resources were always "on," regardless of actual demand.

The system was designed to stay online at all costs, which meant it was burning money during quiet hours, weekends, and holidays. We also realized that many tasks ran on timers or schedules instead of being triggered by real user activity or data changes. In short, the system was always “on,” even when no one was using it.

At this point, it was clear:

The problem wasn’t just the bill; it was how the system was designed to operate.

This phase helped us confirm one thing: we didn’t just need cost optimization; we needed a new architecture.

Phase 2: Planning the Serverless Migration

Now that we understood the problem, it was time to plan the solution. But we didn’t want to rush into rebuilding everything at once. The system was live, users were active, and any big mistake could affect the business.

So, we started small, with a clear goal in mind:

Replace the most expensive and least efficient parts first.

What We Decided to Change First:

  • Background jobs that didn’t need to run 24/7
  • Data processing tasks triggered by schedules, not user actions
  • Internal workflows that didn’t need to be part of the main application

These areas were low-risk to change and had a high impact on cost.

What Tools We Chose:

We decided to move toward a serverless architecture, where services only run when needed. That way, we’d pay only for what we use, and nothing more.

Here’s what we used:

  • AWS Lambda for running code without managing servers
  • API Gateway to connect users to backend services
  • Amazon SQS to queue tasks and handle background jobs
  • Step Functions to manage complex workflows
  • Amazon EventBridge to trigger actions based on events
  • DynamoDB for fast, scalable, and cost-effective data storage

Instead of always-on servers, these tools would let us build a system that responds to demand, scales automatically, and costs almost nothing when idle.
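
To make the event-driven idea concrete, here is a rough sketch of how those pieces can be wired together with boto3: an SQS queue buffers work, an EventBridge rule routes events into it, and a Lambda function drains it on demand. The queue name, rule name, event pattern, and function name are hypothetical, and the IAM roles and queue policy that let EventBridge deliver to SQS are omitted for brevity.

```python
import json
import boto3

sqs = boto3.client("sqs")
events = boto3.client("events")
lambda_client = boto3.client("lambda")

# 1. A queue to buffer work so bursts don't overwhelm downstream processing.
queue_url = sqs.create_queue(QueueName="report-jobs")["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# 2. An EventBridge rule that fires only when new data actually arrives
#    (hypothetical custom event pattern).
events.put_rule(
    Name="new-report-data",
    EventPattern=json.dumps({
        "source": ["app.reports"],
        "detail-type": ["ReportDataReady"],
    }),
)
events.put_targets(
    Rule="new-report-data",
    Targets=[{"Id": "report-queue", "Arn": queue_arn}],
)

# 3. Lambda polls the queue and runs only when messages exist.
#    (Assumes a function named "process-report" is already deployed.)
lambda_client.create_event_source_mapping(
    EventSourceArn=queue_arn,
    FunctionName="process-report",
    BatchSize=10,
)
```

When the queue is empty, nothing runs and nothing is billed beyond storage; when events arrive, Lambda scales out with the backlog.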

The Migration Approach:

We followed a step-by-step approach:

  1. Start with non-critical tasks that don’t affect users
  2. Test everything in a staging environment first
  3. Deploy using feature flags so we can roll back quickly if needed (a sketch follows below)
  4. Monitor performance and cost impact closely

We weren’t just replacing technology; we were changing the way the system worked.
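
On the feature-flag point above, a percentage-based rollout can be as simple as a deterministic hash on the user ID. This is a hedged illustration, not the client's actual flag system; the function name, user ID, and percentages are made up.

```python
import hashlib

def use_serverless_path(user_id: str, rollout_percent: int) -> bool:
    """Route a stable subset of users to the new serverless code path.

    Hashing the user ID keeps the decision deterministic, so the same
    user always lands on the same side of the flag between requests.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100   # map the user to a bucket 0-99
    return bucket < rollout_percent  # e.g. 10 -> roughly 10% of users

# Example: start at 10%, widen the rollout as confidence grows,
# or drop back to 0 to roll back instantly.
if use_serverless_path("user-42", rollout_percent=10):
    pass  # invoke the new Lambda-backed path
else:
    pass  # fall back to the legacy EC2-backed path
```

Because the flag lives outside the new components, rolling back never requires a redeploy: flipping the percentage to zero sends all traffic back to the old path.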

With a solid plan in place, we moved on to the actual execution: building, testing, and rolling out the new serverless components.

Phase 3: The Execution

With our plan ready, it was time to get our hands dirty and start rebuilding the system piece by piece.

We didn’t touch the entire system at once. Instead, we picked one small but important task to begin with: a background job that processed user reports every night. It was simple, isolated, and used a lot of resources, making it the perfect starting point.

Step-by-Step Rollout

  1. Rewrote the job as a Lambda function: Instead of running all night, it would now run only when triggered (a simplified handler is sketched after this list).
  2. Connected it to EventBridge and SQS: This allowed us to trigger the job only when there was new data, and queue tasks if multiple reports were generated at once.
  3. Set up CloudWatch monitoring: So we could track function runs, failures, and performance in real-time.
  4. Tested everything in staging first: We created test scenarios to simulate real usage, making sure the new system was reliable and fast.
  5. Used feature flags for a smooth launch: We turned the new system on slowly, starting with 10% of the users, then increasing gradually once we saw it was working well.
  6. Built a CI/CD pipeline: Using GitHub Actions and AWS CodePipeline, we made sure every update could be tested and deployed automatically with minimal risk.
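
To give a feel for steps 1 and 2, here is a simplified sketch of what an SQS-triggered report handler can look like. The payload fields, the generate_report helper, and the partial-batch behavior (which requires ReportBatchItemFailures on the event source mapping) are illustrative assumptions, not the client's actual code.

```python
import json

def handler(event, context):
    """Process report requests delivered in SQS batches.

    Lambda invokes this only when messages arrive, so nothing runs
    (or is billed) while the queue is empty.
    """
    failures = []

    for record in event["Records"]:           # SQS delivers messages in batches
        try:
            job = json.loads(record["body"])  # e.g. {"customer_id": "...", "report_date": "..."}
            generate_report(job["customer_id"], job["report_date"])
        except Exception:
            # Report only the failed message so it alone is retried,
            # not the whole batch.
            failures.append({"itemIdentifier": record["messageId"]})

    return {"batchItemFailures": failures}

def generate_report(customer_id: str, report_date: str) -> None:
    """Placeholder for the actual report-generation logic."""
    ...
```

The same shape (small handler, idempotent work item, partial-batch retries) carried over to most of the jobs we converted afterwards.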

This process became our template for the rest of the migration. Each week, we picked another function or process to convert, tested it, deployed it, and monitored the results.

Adapting Along the Way

Not everything was smooth. For example:

  • Some Lambda functions suffered from slow cold starts: We fixed this by reducing the package size and using AWS’s Provisioned Concurrency for time-sensitive functions.
  • Migrating from RDS to DynamoDB: This meant rethinking how we stored and accessed data around query patterns rather than joins (see the sketch after this list). It took some planning, but in the long run, it made the system faster and cheaper.
  • Debugging became tricky: With multiple small services instead of one big system, tracking issues was harder at first. We solved this by using CloudWatch Logs, X-Ray, and Grafana dashboards to get a full picture of what was happening.
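
On the RDS-to-DynamoDB point, the main shift was designing the table around access patterns instead of relational joins. The sketch below is hypothetical (the table name, key names, and key values are illustrative, not the client's schema), but it shows the shape of the change:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("reports")  # hypothetical table name

# With a composite key (pk = customer, sk = date-prefixed report id),
# "all reports for a customer in July" becomes a single Query --
# no joins, no full scans, and cost scales with the items actually read.
response = table.query(
    KeyConditionExpression=(
        Key("pk").eq("CUSTOMER#42")
        & Key("sk").begins_with("REPORT#2024-07")
    )
)

for item in response["Items"]:
    print(item["sk"], item.get("status"))
```

Queries that used to mean joining several RDS tables became single-partition reads, which is where much of the speed and cost improvement came from.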

Over time, more and more of the old system was replaced by event-driven, serverless components. Performance improved, response times dropped, and, most importantly, costs started going down.

In just a few months, we had restructured the core parts of the system and were ready to measure the full impact.

Phase 4: The Results

Moving to a serverless, event-driven setup made a real difference.

We saw a 60% drop in cloud costs almost immediately. Since services like Lambda and DynamoDB only charge when they’re used, we no longer paid for idle resources.

Performance also improved. The system became faster and more reliable, especially during sudden traffic spikes. Users noticed quicker response times, and the team saw fewer failures.

Security and monitoring got easier too. With tighter access controls, centralized logging, and tools like CloudWatch and X-Ray, we had better visibility and control over everything running in production.

And most importantly, development got faster. With smaller, focused functions and automated deployment pipelines, the team could ship updates more quickly, and roll them back just as easily if needed.

In the end, it wasn’t just about saving money. It was about building smarter, moving faster, and staying in control.
