From Monolith to Microservices: Lessons (and Stories from the Trenches)
Shushyam Malige Sharanappa


Publish Date: Jun 1

There’s a certain comfort to a monolith—until you try to change it. A few years ago, I was managing a pricing platform at a global e-commerce company. What started as a tidy codebase had grown into a giant, “don’t-touch-anything-on-Friday” system. If you’ve ever hesitated before deploying on a Friday night, you know the feeling.

Note: All public tool references (Kafka, SQS, Hystrix, Prometheus, ELK, Jaeger, gRPC, etc.) are mentioned as industry parallels—these are not what we used at Amazon, where internal proprietary systems provided similar functions.


Why We Needed a Change

The warning signs were everywhere:

  • A single deployment meant waiting for all tests to pass—and fixing random integration failures that only showed up in production.
  • Adding a new feature required code changes across modules nobody really owned.
  • Scaling meant throwing more hardware at the entire app, even if just the pricing engine was under pressure.

That’s when the conversation about microservices got real. For us, it wasn’t just about the latest buzzword—it was about regaining agility.


Where We Started: Domain-Driven Discovery

We didn’t jump straight to Kubernetes or fancy service meshes. First, we mapped our domain:

  • Bounded Contexts: We grouped code into pricing logic, promotions, eligibility, inventory sync, and external vendor integrations.
  • Ownership: Each context had a “product owner” and clear documentation. We used event storming (collaborative modeling) to spot tight couplings and fragile flows.

Lesson: Don’t just chop code by feature. Understand your data and business flows first.


Our Migration Approach: Patterns & Pitfalls

1. Strangler Fig Pattern

We wrapped the monolith with APIs (Application Programming Interfaces), then gradually redirected traffic. This allowed us to cut over services one by one, minimizing risk.
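The wrapping layer can be sketched as a simple path-based router. This is a minimal illustration, not our actual implementation; the service URLs and path prefixes are hypothetical. Each time a context was carved out, its prefix moved into the migrated table while everything else kept falling through to the monolith.

```python
# Minimal strangler-fig routing sketch. Hostnames and prefixes are
# illustrative, not real endpoints.

# Paths whose handling has been migrated to a new service.
MIGRATED_PREFIXES = {
    "/pricing": "https://pricing-service.internal",
    "/promotions": "https://promotions-service.internal",
}

# Everything not yet migrated still goes to the monolith.
LEGACY_BACKEND = "https://monolith.internal"

def route(path: str) -> str:
    """Return the backend base URL that should handle this request path."""
    for prefix, backend in MIGRATED_PREFIXES.items():
        if path.startswith(prefix):
            return backend
    return LEGACY_BACKEND
```

Cutting over a service is then just adding one entry to the table, which is what makes the rollback story so cheap.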

2. Polyglot Persistence

Some microservices needed new data models (think: NoSQL for fast lookups). We started with shared DB tables, then migrated to service-specific stores, tackling consistency with event sourcing and outbox patterns.
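The outbox pattern is easier to see in code than to describe. Here is a toy sketch using SQLite standing in for the service's store; the table and event names are invented for illustration. The point is that the business write and the event record commit in the same transaction, and a separate relay later publishes the outbox rows to the message bus.

```python
import json
import sqlite3

# Transactional-outbox sketch. Schema and event shape are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (sku TEXT PRIMARY KEY, amount REAL)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT)")

def update_price(sku: str, amount: float) -> None:
    # One transaction: the price row and its event commit together,
    # or neither does. No dual-write inconsistency window.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO prices (sku, amount) VALUES (?, ?)",
            (sku, amount),
        )
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"event": "PriceChanged", "sku": sku, "amount": amount}),),
        )

def drain_outbox() -> list:
    """Stand-in for the relay that ships outbox rows to the message bus."""
    with conn:
        rows = conn.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
        conn.execute("DELETE FROM outbox")
    return [json.loads(payload) for _, payload in rows]
```

In production the relay also needs idempotent publishing (consumers may see an event twice), but the atomic-commit core is the whole trick.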

3. CI/CD Automation

We invested heavily in CI/CD (Continuous Integration/Continuous Deployment) pipelines using blue-green deployments and feature flags. Canary releases were our safety net.
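A canary rollout behind a feature flag can be as small as a stable hash bucket. This sketch (flag names are made up) hashes the user ID so the same user always lands in the same bucket while the rollout percentage ramps from 1% to 100%.

```python
import hashlib

# Percentage-based canary check. Flag names are illustrative.
def in_canary(user_id: str, flag: str, rollout_percent: int) -> bool:
    # A stable hash gives each (flag, user) pair a fixed bucket in 0..99,
    # so ramping the percentage only ever adds users, never flip-flops them.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent
```

Blue-green handles the infrastructure swap; a check like this handles the gradual exposure of the new code path within it.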

4. Observability from Day One

While tools like Prometheus (metrics), ELK (Elasticsearch, Logstash, Kibana for logs), and Jaeger (distributed tracing) are common in the industry, we relied on Amazon’s internal monitoring, logging, and tracing services. If you’re outside of big tech, these open-source tools serve as great alternatives.
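Whatever the backend, the instrumentation pattern is the same: count requests and record latency per endpoint. This in-process sketch imitates what a Prometheus client library would do; in reality you would export these values rather than keep them in dictionaries.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Tiny metrics registry: a request counter and raw latency samples
# per endpoint. A real setup would export these to a metrics backend.
counters = defaultdict(int)
latencies = defaultdict(list)

@contextmanager
def track(endpoint: str):
    """Wrap a handler call to record one count and one latency sample."""
    start = time.perf_counter()
    try:
        yield
    finally:
        counters[endpoint] += 1
        latencies[endpoint].append(time.perf_counter() - start)
```

Usage is just `with track("/pricing"): handle(request)`; the payoff comes later, when an SLO dashboard needs exactly these numbers.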


Real-World Surprises

  • N+1 Calls Multiply: Microservices can turn a simple flow into a web of network hops. We learned to batch requests, use async message queues (think SQS or Kafka), and implement circuit breakers (like Hystrix, but we used an internal platform solution).
  • Versioning is Hard: Changing a contract means negotiating with every consumer. Our first breaking change broke five teams at once—lesson learned: backward compatibility isn’t optional.
  • SRE (Site Reliability Engineering) is a Culture Shift: We had to train engineers to think in terms of SLIs (Service Level Indicators), SLOs (Service Level Objectives), and error budgets. “It works on my box” doesn’t cut it when dozens of services interact.
  • Documentation and RFCs: As our system grew, we introduced RFCs (Requests for Comments—detailed technical proposals and reviews), robust onboarding documentation, and clear API specifications.
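The circuit-breaker idea mentioned above fits in a few lines. This is a minimal sketch with illustrative thresholds, not the internal platform we used: after enough consecutive failures the breaker opens and fails fast, then allows a single trial call once the cool-down passes.

```python
import time

# Minimal circuit breaker. Thresholds are illustrative.
class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when breaker opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Open: don't even attempt the downstream call.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: permit one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the breaker again
        return result
```

Failing fast is what keeps one slow dependency from tying up every thread upstream of it.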
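The error-budget arithmetic we drilled into teams is simple enough to show directly (the SLO value here is just an example): a 99.9% availability target over a 30-day window leaves roughly 43 minutes of allowable downtime.

```python
# Error-budget arithmetic for an availability SLO. Numbers illustrative.
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)
```

Once a team has burned its budget, feature work pauses and reliability work takes over; that policy, more than the formula, is the culture shift.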

Performance and Scalability: Not Always What You Expect

Some microservices were slower at first—network latency, more serialization, and cold starts. While gRPC (a high-performance RPC framework) is often used in the industry, we used internal RPC platforms at Amazon. The key is to match protocols and serialization formats to your system’s needs.
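The serialization trade-off is concrete: here is a toy comparison of the same pricing record as a JSON document versus a fixed binary layout (field names and the u64/u32 layout are invented for illustration, roughly the kind of saving a schema-based format like protobuf buys).

```python
import json
import struct

# Same record, two wire formats. Fields and layout are hypothetical.
record = {"sku_id": 123456789, "price_cents": 1999}

json_bytes = json.dumps(record).encode()
# ">QI" = big-endian unsigned 64-bit sku id + unsigned 32-bit cents.
binary_bytes = struct.pack(">QI", record["sku_id"], record["price_cents"])
```

The binary form is 12 bytes against ~40 for the JSON, at the cost of a schema both sides must agree on and version carefully; that is exactly the trade-off to weigh when picking a protocol.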


Testing and Validation: More Than Unit Tests

  • Consumer-Driven Contracts: Tools like Pact are popular for contract testing, though we relied on internal tools.
  • End-to-End Synthetic Tests: Simulated user journeys made sure every dependency played nicely.
  • Chaos Engineering: Public tools like Gremlin are well-known; we injected faults using Amazon’s proprietary capabilities to test resilience.
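A consumer-driven contract is, at its core, just a pinned response shape. This hand-rolled sketch (Pact-style in spirit; the endpoint and fields are hypothetical) shows the idea: the consumer declares the fields it depends on, and the provider's build runs this check against its own responses.

```python
# Consumer-driven contract check. Endpoint and fields are illustrative.
CONSUMER_CONTRACT = {
    "endpoint": "/pricing/v1/quote",
    "required_fields": {"sku": str, "amount": float, "currency": str},
}

def satisfies_contract(response: dict) -> bool:
    """True if the response carries every field the consumer relies on."""
    fields = CONSUMER_CONTRACT["required_fields"]
    return all(
        name in response and isinstance(response[name], expected)
        for name, expected in fields.items()
    )
```

Because the provider runs the consumer's check, a breaking change fails the provider's pipeline before it ever reaches the five downstream teams.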

Human Side: Team Ownership and Communication

  • DevOps Alignment: Our teams owned their code from “keyboard to production.” Pager rotation (being on-call for your code) quickly motivates clean code and robust monitoring.
  • APIs as Products: Each service published internal API contracts, held office hours, and maintained up-to-date documentation.
  • Communication Overhead: Slack channels multiplied, so did RFCs and architecture reviews. We invested in onboarding and mentorship to help new engineers thrive.

Did It Work? The Results

  • Deployment velocity increased 5x. Teams deployed independently, with automated rollbacks.
  • Failures were isolated. No more all-hands-on-deck outages when a single module crashed.
  • Business agility soared. New features shipped faster; A/B testing became routine.

But… It’s Not a Silver Bullet

We still wrestled with distributed transaction complexity and service discovery glitches. Microservices brought new challenges—but gave us the agility we desperately needed.


A Final Note on Confidentiality

All named technologies are examples for illustration—at Amazon, we leveraged proprietary solutions. If you’re hoping for detailed blueprints on how we solved every microservice migration challenge at Amazon, I can’t share specifics—that’s confidential. The high-level strategies, patterns, and lessons here reflect what any engineering organization can apply.


Advice for Your Journey

  • Start with your real pain points, not hype.
  • Automate observability and recovery from the start.
  • Invest in people and process, not just tech.
  • Iterate and celebrate small wins—you’ll need the momentum.

Final Thought:

Breaking up a monolith is as much a story about team growth and culture change as it is about code. If you get the people, patterns, and automation right, the architecture will follow.


About the Author:

Shushyam Malige Sharanappa is a Software Development Manager at Amazon. He builds scalable platforms, mentors distributed teams, and enjoys sharing war stories and best practices from the front lines of engineering.


Questions or want to swap stories about microservices? I’m always up for a tech chat!
