Building secure, private API architectures on AWS can be deceptively complex. What appears to be a straightforward integration between API Gateway, VPC Links, and Network Load Balancers often reveals multiple layers of configuration challenges that aren't immediately obvious from the documentation.
This post chronicles a real-world troubleshooting journey that started with what seemed like a simple infrastructure deployment and evolved into a deep investigation across Lambda runtimes, API Gateway authentication, and AWS networking internals.
Whether you're setting up your first private API integration or debugging existing infrastructure, this troubleshooting story will help you understand common pitfalls and their solutions in AWS private networking architectures.
Goal
The goal was to create a secure, private API architecture with the following requirements:
- Private API Gateway accessible only from within the VPC
- Secure authentication using API keys and custom authorization logic
- High availability through load balancing across multiple containers
- Private networking with no internet exposure
- Scalable backend using Fargate containerized application
Target Architecture
┌─────────────────────────────────────────────────────────────────┐
│ VPC │
│ │
│ ┌─────────────────┐ ┌──────────────────┐ │
│ │ EC2 Client │ │ API Gateway │ │
│ │ (Testing) │───▶│ using VPC link │ │
│ └─────────────────┘ └──────────────────┘ │
│ │ │
│ ┌────────▼────────┐ │
│ │ Private API │ │
│ │ Gateway │ │
│ │ │ │
│ │ ┌─────────────┐ │ │
│ │ │ Lambda │ │ │
│ │ │ Authorizer │ │ │
│ │ └─────────────┘ │ │
│ └────────┬────────┘ │
│ │ │
│ ┌────▼────┐ │
│ │VPC Link │ │
│ └────┬────┘ │
│ │ │
│ ┌────────▼────────┐ │
│ │Internal Network │ │
│ │ Load Balancer │ │
│ │ (Port 80) │ │
│ └────────┬────────┘ │
│ │ │
│ ┌────────────────────┼────────────────────┐ │
│ │ │ │ │
│ ┌─────────▼─────────┐ ┌────────▼────────┐ ┌────────▼────────┐│
│ │ Fargate Task │ │ Fargate Task │ │ Fargate Task ││
│ │ (Container 1) │ │ (Container 2) │ │ (Container 3) ││
│ │ │ │ │ │ ││
│ │ ┌───────────────┐ │ │ ┌─────────────┐ │ │ ┌─────────────┐ ││
│ │ │ Application │ │ │ │Application │ │ │ │Application │ ││
│ │ │ Server │ │ │ │ Server │ │ │ │ Server │ ││
│ │ └───────────────┘ │ │ └─────────────┘ │ │ └─────────────┘ ││
│ └───────────────────┘ └─────────────────┘ └─────────────────┘│
│ │
└─────────────────────────────────────────────────────────────────┘
External Internet ❌ (No direct access - Private API only)
Public subnet access ❌ (No direct access - only clients inside the VPC's private subnets can reach it)
The target architecture involved:
- Private API Gateway (accessible only within VPC)
- VPC Link for private integration
- Internal Network Load Balancer
- Fargate containers as targets
- Custom Lambda authorizer for API key validation, invoked by the API Gateway
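For context, a private REST API is declared in CloudFormation roughly like the sketch below. The logical IDs (PrivateRestApi, ExecuteApiVpcEndpoint) are illustrative rather than taken from the actual template, and the snippet assumes an execute-api interface VPC endpoint already exists in the VPC:
PrivateRestApi:
  Type: AWS::ApiGateway::RestApi
  Properties:
    Name: my-private-api                  # illustrative name
    EndpointConfiguration:
      Types:
        - PRIVATE
      VpcEndpointIds:
        - !Ref ExecuteApiVpcEndpoint      # assumed execute-api interface endpoint
    Policy:
      Version: "2012-10-17"
      Statement:
        # Allow invocation in general...
        - Effect: Allow
          Principal: "*"
          Action: execute-api:Invoke
          Resource: execute-api:/*
        # ...but deny any request that did not arrive through the VPC endpoint
        - Effect: Deny
          Principal: "*"
          Action: execute-api:Invoke
          Resource: execute-api:/*
          Condition:
            StringNotEquals:
              aws:SourceVpce: !Ref ExecuteApiVpcEndpoint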
The Challenge
What seemed like a straightforward setup quickly revealed multiple layers of complexity. Despite following AWS documentation and best practices, I still encountered a series of issues that required deep troubleshooting across different AWS services.
Initially, our infrastructure appeared to work perfectly during development. Direct access to the NLB returned responses immediately, target groups were healthy, and everything looked correct.
> curl -v NLB-1xxxxx234567897.elb.ap-southeast-2.amazonaws.com
* Host NLB-1xxxxx234567897.elb.ap-southeast-2.amazonaws.com:80 was resolved.
* IPv6: (none)
* IPv4: xx.xxx.xxx.00, xx.xxx.x.111
* Trying xx.xxx.x.111:80...
* Connected to NLB-1xxxxx234567897.elb.ap-southeast-2.amazonaws.com (xx.xxx.xx.00) port 80
* using HTTP/1.x
> GET / HTTP/1.1
> Host: NLB-1xxxxx234567897.elb.ap-southeast-2.amazonaws.com
> User-Agent: curl/x.x.xx
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 200 OK
< x-correlation-id: aas19175-1b77-3333-0000-123456787
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 5
< ETag: W/"5-fvYUSdz1234556789"
< Date: Fri, 11 Jul 2025 08:48:13 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
<
* Connection #0 to host NLB-1xxxxx234567897.elb.ap-southeast-2.amazonaws.com left intact
hello
However, when trying to access the API Gateway through the VPC Link integration (I created an EC2 instance with access to the VPC's private subnets and ran curl from there), I encountered various issues that weren't immediately obvious.
The First Challenge: Lambda Runtime Issues
Our Lambda authorizer had been working reliably for over three years, originally built with Node.js 12.x when the infrastructure was first deployed. However, when I needed to redeploy the stack, I encountered an immediate issue:
Error: Cannot find module 'aws-sdk'
The Root Cause: AWS had deprecated the Node.js 12.x runtime, and our deployment pipeline automatically upgraded to Node.js 18.x. However, the Node.js 18.x runtime doesn't include AWS SDK v2; it ships with AWS SDK v3 instead - a breaking change that affected our existing Lambda code.
Legacy Code (Node.js 12.x - worked for 3 years):
var AWS = require('aws-sdk');
var ssm = new AWS.SSM({region: 'ap-southeast-2'});
const paramValue = await ssm.getParameter({...}).promise();
Updated Code (Node.js 18.x - required migration):
const { SSMClient, GetParameterCommand } = require('@aws-sdk/client-ssm');
const ssmClient = new SSMClient({ region: 'ap-southeast-2' });
const paramValue = await ssmClient.send(new GetParameterCommand({...}));
Lesson Learned: AWS runtime deprecations can affect long-running infrastructure. When Lambda runtimes are deprecated, existing functions continue to work, but new deployments require code updates to use supported runtimes and SDK versions.
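To make the migration concrete, here is a minimal sketch of what an SDK v3 authorizer handler can look like on Node.js 18.x. It assumes a REQUEST-type authorizer, and the parameter name and header handling are placeholders, not the actual function:
// Minimal sketch of a Node.js 18.x Lambda authorizer using AWS SDK v3 (bundled with the runtime).
const { SSMClient, GetParameterCommand } = require('@aws-sdk/client-ssm');

const ssmClient = new SSMClient({ region: 'ap-southeast-2' });

exports.handler = async (event) => {
  // Fetch the expected API key from Parameter Store (placeholder name, SecureString assumed)
  const param = await ssmClient.send(new GetParameterCommand({
    Name: '/example/api-key',
    WithDecryption: true,
  }));

  const expectedKey = param.Parameter.Value;
  // REQUEST authorizers receive the incoming headers; header casing may vary by client
  const providedKey = event.headers && event.headers['x-api-key'];
  const effect = providedKey === expectedKey ? 'Allow' : 'Deny';

  // Return a standard IAM policy document that API Gateway evaluates
  return {
    principalId: 'api-client',
    policyDocument: {
      Version: '2012-10-17',
      Statement: [{
        Action: 'execute-api:Invoke',
        Effect: effect,
        Resource: event.methodArn,
      }],
    },
  };
};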
The Second Challenge: API Key Configuration
Even after fixing the Lambda runtime issue, I continued getting 403 Forbidden responses.
> curl -H "x-api-key: xxxxxx" -H "accept: application/json" -X GET "https://12345apigateway.execute-api.ap-southeast-2.amazonaws.com/" -i
HTTP/1.1 403 Forbidden
Server: Server
Date: Thu, 10 Jul 2025 06:05:15 GMT
Content-Type: application/json
Content-Length: 24
Connection: keep-alive
x-amzn-RequestId: 123ff3234-0000-xxx
x-amzn-ErrorType: ForbiddenException
x-amz-apigw-id: 1er12344556=
{"message":"Forbidden"}
The Lambda authorizer logs showed it was finding the API key and returning an "Allow" policy, but API Gateway was still rejecting requests.
The Missing Piece: API Usage Plans require explicit API key associations!
While I had created both the Usage Plan and API Key in the CloudFormation template, I was missing the crucial link between them:
ApiUsagePlanKey:
  Type: AWS::ApiGateway::UsagePlanKey
  Properties:
    KeyId: !Ref ApiKey
    KeyType: API_KEY
    UsagePlanId: !Ref UsagePlan
This is a common gotcha that many developers encounter when setting up API Gateway authentication.
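For completeness, the two resources that the UsagePlanKey ties together look roughly like this. The logical IDs, stage name, and DependsOn reference are illustrative, not the actual template:
ApiKey:
  Type: AWS::ApiGateway::ApiKey
  Properties:
    Enabled: true
UsagePlan:
  Type: AWS::ApiGateway::UsagePlan
  DependsOn: ApiDeployment            # illustrative: the stage must exist before the plan references it
  Properties:
    ApiStages:
      - ApiId: !Ref PrivateRestApi    # illustrative logical ID
        Stage: prod
# Note: each protected method must also set ApiKeyRequired: true for the key check to apply.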
The Third Challenge: VPC Link Connectivity
With authentication working, I faced a new issue: VPC Link was timing out after 11 seconds with "Internal server error" messages.
> curl -H "x-api-key: xxxxxx" -H "accept: application/json" -X GET "https://12345apigateway.execute-api.ap-southeast-2.amazonaws.com/" -i
HTTP/1.1 500 Internal Server Error
Server: Server
Date: Fri, 11 Jul 2025 08:03:15 GMT
Content-Type: application/json
Content-Length: 36
Connection: keep-alive
x-amzn-RequestId: 34d5467890-15db-4a40-9acb-dr123456789
x-amzn-ErrorType: InternalServerErrorException
x-amz-apigw-id: EEgTmDq5556tYRwd=
{"message": "Internal server error"}
This was particularly puzzling because:
- ✅ Direct NLB access worked perfectly
- ✅ VPC Link showed as "AVAILABLE"
- ✅ Target groups were healthy
- ✅ Security groups seemed correct
The Mysterious Timeout Pattern
The pattern was clear: Direct connectivity worked flawlessly, but VPC Link couldn't reach the same NLB endpoints.
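For reference, the VPC Link and the private integration behind it are wired roughly as in the sketch below. The logical IDs and the GET-only root method are illustrative, not the exact template:
VpcLink:
  Type: AWS::ApiGateway::VpcLink
  Properties:
    Name: nlb-vpc-link                  # illustrative name
    TargetArns:
      - !Ref NetworkLoadBalancer        # Ref returns the ARN of the internal NLB
RootMethod:
  Type: AWS::ApiGateway::Method
  Properties:
    RestApiId: !Ref PrivateRestApi
    ResourceId: !GetAtt PrivateRestApi.RootResourceId
    HttpMethod: GET
    AuthorizationType: CUSTOM
    AuthorizerId: !Ref LambdaAuthorizer # illustrative AWS::ApiGateway::Authorizer resource
    ApiKeyRequired: true
    Integration:
      Type: HTTP_PROXY
      IntegrationHttpMethod: GET
      ConnectionType: VPC_LINK
      ConnectionId: !Ref VpcLink
      Uri: !Sub http://${NetworkLoadBalancer.DNSName}/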
Deep Dive into Network Security
I spent considerable time investigating network-level issues:
Security Group Configuration
Initially, I configured the NLB security group to allow traffic from specific subnet CIDR blocks where the NLB was deployed. This seemed logical but didn't work.
# This approach failed
SecurityGroupIngress:
  - IpProtocol: tcp
    FromPort: 80
    ToPort: 80
    CidrIp: xx.x.x.0/22  # NLB subnet 1
  - IpProtocol: tcp
    FromPort: 80
    ToPort: 80
    CidrIp: xx.x.x.0/22  # NLB subnet 2
The Debugging Process
I systematically verified:
- Network ACLs: Properly configured with allow rules
- DNS Resolution: NLB resolved to correct private IPs
- Target Health: All Fargate containers responding
- VPC Endpoints: API Gateway VPC endpoint working correctly
- Direct Connectivity: Manual curl tests from EC2 instances succeeded
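Most of these checks can be run from the CLI on the test EC2 instance; for example (the target group ARN below is a placeholder):
# Confirm every Fargate target behind the NLB is healthy (placeholder ARN)
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:ap-southeast-2:111122223333:targetgroup/example/1234567890abcdef

# Confirm the NLB name resolves to private IPs from inside the VPC
dig +short NLB-1xxxxx234567897.elb.ap-southeast-2.amazonaws.com

# Confirm direct connectivity to the NLB, bypassing API Gateway and the VPC Link
curl -i http://NLB-1xxxxx234567897.elb.ap-southeast-2.amazonaws.com/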
The Breakthrough Discovery
After extensive troubleshooting, I discovered that only allowing 0.0.0.0/0 in the NLB security group made VPC Link work. This was concerning from a security perspective but provided a crucial clue about the traffic flow.
The Real Solution: AWS Documentation
While preparing to contact AWS Support about this networking puzzle, I discovered the official solution in AWS documentation:
For NLBs used with VPC Link, you should disable security group evaluation.
This is documented in the AWS API Gateway Developer Guide under point 4b.
So I updated the template that creates the NLB:
# PrivateLink traffic enforcement
NetworkLoadBalancer:
  Type: AWS::ElasticLoadBalancingV2::LoadBalancer
  Properties:
    Name: !Join ['', [!Ref ServiceName, NetworkLoadBalancer]]
    Type: network
    Scheme: internal
    Subnets: !Split
      - ","
      - Fn::ImportValue: !Sub ${VpcStack}-PrivateSubnetIDs
    SecurityGroups:
      - !Ref NetworkLoadBalancerSecurityGroup
    # This property controls PrivateLink traffic enforcement
    EnforceSecurityGroupInboundRulesOnPrivateLinkTraffic: "off"
    LoadBalancerAttributes:
      - Key: load_balancing.cross_zone.enabled
        Value: "true"
      - Key: access_logs.s3.enabled
        Value: "false"
      - Key: deletion_protection.enabled
        Value: "false"
Here is how it looks in the AWS console.
Why This Works
VPC Link uses AWS-managed infrastructure that operates outside your VPC's IP ranges. Rather than trying to guess or discover these IP ranges, AWS recommends disabling security group evaluation for NLBs used in VPC Link integrations.
Security is still maintained through:
- Private subnet placement (no internet gateway routes)
- API Gateway authentication layers
- Network ACLs (if configured restrictively)
- Application-level security controls
Key Takeaways
1. AWS SDK Version Compatibility
When upgrading Lambda runtimes, be aware of SDK version changes. The Node.js 18.x runtime ships with AWS SDK v3 and no longer includes SDK v2, so code written against v2 must be migrated or bundle its own copy of the SDK.
2. API Gateway Usage Plan Associations
Creating an API key and usage plan isn't enough - you must explicitly associate them with AWS::ApiGateway::UsagePlanKey.
3. VPC Link Security Groups
For NLBs used with VPC Link, follow AWS guidance and disable security group evaluation rather than trying to configure specific IP ranges.
4. Systematic Troubleshooting
When facing complex networking issues:
- Verify each component independently
- Test direct connectivity vs. service-mediated connectivity
- Check AWS documentation for service-specific configuration patterns
- Consider contacting AWS Support for service-level investigations
5. Documentation First
Before implementing complex workarounds, always check the official AWS documentation. The solution to our VPC Link issue was documented but easy to miss.
Final Architecture
Our final working configuration:
- API Gateway with proper Usage Plan and API key associations
- Lambda authorizer using AWS SDK v3
- VPC Link connecting to NLB with security group evaluation disabled
- Internal NLB in private subnets
- Fargate containers handling application logic
This setup provides a secure, scalable private API architecture that follows AWS best practices.
Conclusion
Complex AWS networking scenarios often require understanding the interaction between multiple services. What seemed like a simple VPC Link + NLB integration revealed several layers of configuration requirements, from Lambda runtime compatibility to service-specific security patterns.
The key to successful troubleshooting is methodical verification of each component, careful reading of AWS documentation, and understanding that AWS services sometimes have specific configuration patterns that differ from general networking principles.
When in doubt, the AWS documentation and support team are invaluable resources for understanding the intended architecture patterns for complex service integrations.