Troubleshooting AWS API Gateway VPC Link with Network Load Balancer
Clariza Look

Publish Date: Jul 11

Building secure, private API architectures on AWS can be deceptively complex. What appears to be a straightforward integration between API Gateway, VPC Links, and Network Load Balancers often reveals multiple layers of configuration challenges that aren't immediately obvious from the documentation.

This post chronicles a real-world troubleshooting journey that started with what seemed like a simple infrastructure deployment and evolved into a deep investigation across Lambda runtimes, API Gateway authentication, and AWS networking internals.

Whether you are setting up your first private API integration or debugging existing infrastructure, this troubleshooting story should help you understand common pitfalls in AWS private networking architectures and how to resolve them.

Goal

The goal was to create a secure, private API architecture with the following requirements:

  • Private API Gateway accessible only from within the VPC
  • Secure authentication using API keys and custom authorization logic
  • High availability through load balancing across multiple containers
  • Private networking with no internet exposure
  • Scalable backend running a containerized application on Fargate

Target Architecture

┌─────────────────────────────────────────────────────────────────┐
│                           VPC                                   │
│                                                                 │
│  ┌─────────────────┐    ┌──────────────────┐                    │
│  │   EC2 Client    │    │  API Gateway     │                    │
│  │   (Testing)     │───▶│  using VPC link  │                    │
│  └─────────────────┘    └──────────────────┘                    │
│                                   │                             │
│                          ┌────────▼────────┐                    │
│                          │  Private API    │                    │
│                          │    Gateway      │                    │
│                          │                 │                    │
│                          │ ┌─────────────┐ │                    │
│                          │ │   Lambda    │ │                    │
│                          │ │ Authorizer  │ │                    │
│                          │ └─────────────┘ │                    │
│                          └────────┬────────┘                    │
│                                   │                             │
│                              ┌────▼────┐                        │
│                              │VPC Link │                        │
│                              └────┬────┘                        │
│                                   │                             │
│                          ┌────────▼────────┐                    │
│                          │Internal Network │                    │
│                          │ Load Balancer   │                    │
│                          │   (Port 80)     │                    │
│                          └────────┬────────┘                    │
│                                   │                             │
│              ┌────────────────────┼────────────────────┐        │
│              │                    │                    │        │
│    ┌─────────▼─────────┐ ┌────────▼────────┐ ┌────────▼────────┐│
│    │   Fargate Task    │ │   Fargate Task  │ │   Fargate Task  ││
│    │   (Container 1)   │ │   (Container 2) │ │   (Container 3) ││
│    │                   │ │                 │ │                 ││
│    │ ┌───────────────┐ │ │ ┌─────────────┐ │ │ ┌─────────────┐ ││
│    │ │ Application   │ │ │ │Application  │ │ │ │Application  │ ││
│    │ │   Server      │ │ │ │   Server    │ │ │ │   Server    │ ││
│    │ └───────────────┘ │ │ └─────────────┘ │ │ └─────────────┘ ││
│    └───────────────────┘ └─────────────────┘ └─────────────────┘│
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

External Internet ❌ (No direct access - Private API only)
Public subnet access ❌ (No direct access; only clients inside the VPC's private subnets can reach the API)

The target architecture involved:

  • Private API Gateway (accessible only within VPC)
  • VPC Link for private integration
  • Internal Network Load Balancer
  • Fargate containers as targets
  • Custom Lambda authorizer, used by API Gateway for API key validation

The Challenge

What seemed like a straightforward setup quickly revealed multiple layers of complexity. Despite following AWS documentation and best practices, I still encountered a series of issues that required deep troubleshooting across different AWS services.

Initially, our infrastructure appeared to work perfectly during development. Direct access to the NLB returned responses immediately, target groups were healthy, and everything looked correct.

> curl -v NLB-1xxxxx234567897.elb.ap-southeast-2.amazonaws.com

* Host NLB-1xxxxx234567897.elb.ap-southeast-2.amazonaws.com:80 was resolved.

* IPv6: (none)
* IPv4: xx.xxx.xxx.00, xx.xxx.x.111
*   Trying xx.xxx.x.111:80...
* Connected to NLB-1xxxxx234567897.elb.ap-southeast-2.amazonaws.com (xx.xxx.xx.00) port 80
* using HTTP/1.x
> GET / HTTP/1.1
> Host: NLB-1xxxxx234567897.elb.ap-southeast-2.amazonaws.com 
> User-Agent: curl/x.x.xx
> Accept: */*
> 
* Request completely sent off
< HTTP/1.1 200 OK
< x-correlation-id: aas19175-1b77-3333-0000-123456787
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 5
< ETag: W/"5-fvYUSdz1234556789"
< Date: Fri, 11 Jul 2025 08:48:13 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
< 
* Connection #0 to host NLB-1xxxxx234567897.elb.ap-southeast-2.amazonaws.com  left intact
hello


However, when trying to access API Gateway through the VPC Link integration (I created an EC2 instance with access to the VPC's private subnets and ran curl from there), I encountered various issues that weren't immediately obvious.

The First Challenge: Lambda Runtime Issues

Our Lambda authorizer had been working reliably for over three years, originally built with Node.js 12.x when the infrastructure was first deployed. However, when I needed to redeploy the stack, I encountered an immediate issue:

Error: Cannot find module 'aws-sdk'

The Root Cause: AWS had deprecated the Node.js 12.x runtime, and our deployment pipeline automatically upgraded to Node.js 18.x. However, the Node.js 18.x runtime bundles AWS SDK v3 instead of v2, so the require('aws-sdk') call no longer resolves. This was a breaking change for our existing Lambda code.

Legacy Code (Node.js 12.x - worked for 3 years):

var AWS = require('aws-sdk');
var ssm = new AWS.SSM({region: 'ap-southeast-2'});
const paramValue = await ssm.getParameter({...}).promise();

Updated Code (Node.js 18.x - required migration):

const { SSMClient, GetParameterCommand } = require('@aws-sdk/client-ssm');
const ssmClient = new SSMClient({ region: 'ap-southeast-2' });
const paramValue = await ssmClient.send(new GetParameterCommand({...}));

Lesson Learned: AWS runtime deprecations can affect long-running infrastructure. When Lambda runtimes are deprecated, existing functions continue to work, but new deployments require code updates to use supported runtimes and SDK versions.
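One way to keep an authorizer resilient to SDK changes is to isolate the SDK call from the policy logic. The following is a minimal sketch, not the original authorizer: it assumes a REQUEST-type authorizer that compares the x-api-key header with a value stored in SSM. The names createAuthorizer, fetchExpectedKey, the principal ID, and the parameter name are all hypothetical.

```javascript
// Sketch of a REQUEST-type Lambda authorizer for the Node.js 18.x runtime.
// The parameter lookup is injected, so the policy logic has no SDK dependency.

// Builds the response shape API Gateway expects from a Lambda authorizer.
function buildPolicy(principalId, effect, resourceArn) {
  return {
    principalId,
    policyDocument: {
      Version: '2012-10-17',
      Statement: [{ Action: 'execute-api:Invoke', Effect: effect, Resource: resourceArn }],
    },
  };
}

// Header names may arrive in any case, so look the key up case-insensitively.
function extractApiKey(event) {
  const headers = event.headers || {};
  const name = Object.keys(headers).find((h) => h.toLowerCase() === 'x-api-key');
  return name ? headers[name] : undefined;
}

// `fetchExpectedKey` is a hypothetical async function returning the stored key.
function createAuthorizer(fetchExpectedKey) {
  return async function handler(event) {
    const presented = extractApiKey(event);
    const expected = await fetchExpectedKey();
    const effect = presented !== undefined && presented === expected ? 'Allow' : 'Deny';
    return buildPolicy('api-client', effect, event.methodArn);
  };
}

// Possible wiring with the SDK v3 client bundled in the Node.js 18.x runtime
// (the parameter name below is a placeholder, not the real one):
//
// const { SSMClient, GetParameterCommand } = require('@aws-sdk/client-ssm');
// const ssm = new SSMClient({ region: 'ap-southeast-2' });
// exports.handler = createAuthorizer(async () => {
//   const out = await ssm.send(new GetParameterCommand({
//     Name: '/hypothetical/api-key', WithDecryption: true,
//   }));
//   return out.Parameter.Value; // v3 returns a promise directly; no .promise()
// });
```

Because the lookup is passed in, the policy logic can be exercised locally with a stub, and a future SDK change only touches the few lines that do the wiring.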

The Second Challenge: API Key Configuration

Even after fixing the Lambda runtime issue, I continued getting 403 Forbidden responses.

> curl -H "x-api-key: xxxxxx" -H "accept: application/json" -X GET "https://12345apigateway.execute-api.ap-southeast-2.amazonaws.com/" -i


HTTP/1.1 403 Forbidden
Server: Server
Date: Thu, 10 Jul 2025 06:05:15 GMT
Content-Type: application/json
Content-Length: 24
Connection: keep-alive
x-amzn-RequestId: 123ff3234-0000-xxx
x-amzn-ErrorType: ForbiddenException
x-amz-apigw-id: 1er12344556=

{"message":"Forbidden"}


The Lambda authorizer logs showed it was finding the API key and returning an "Allow" policy, but API Gateway was still rejecting requests.

The Missing Piece: API Usage Plans require explicit API key associations!

While I had created both the Usage Plan and the API Key in the CloudFormation template, I was missing the crucial link between them:

ApiUsagePlanKey:
  Type: AWS::ApiGateway::UsagePlanKey
  Properties:
    KeyId: !Ref ApiKey
    KeyType: API_KEY
    UsagePlanId: !Ref UsagePlan

This is a common gotcha that many developers encounter when setting up API Gateway authentication.
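A missing association like this can be caught before deployment. The sketch below is a hypothetical lint helper, not part of any AWS tooling: it assumes the template has already been parsed into a plain object and that references use the long form { Ref: ... } (the YAML short form !Ref needs a CloudFormation-aware parser first). The function name findUnlinkedApiKeys is made up for illustration.

```javascript
// Hypothetical lint helper: lists API keys in a parsed CloudFormation template
// that no AWS::ApiGateway::UsagePlanKey resource attaches to a usage plan.
function findUnlinkedApiKeys(template) {
  const resources = Object.entries(template.Resources || {});
  const ofType = (type) => resources.filter(([, r]) => r.Type === type);

  // Collect the logical IDs referenced by each UsagePlanKey's KeyId.
  const linked = new Set(
    ofType('AWS::ApiGateway::UsagePlanKey')
      .map(([, r]) => (r.Properties || {}).KeyId)
      .filter((keyId) => keyId && typeof keyId.Ref === 'string')
      .map((keyId) => keyId.Ref)
  );

  return ofType('AWS::ApiGateway::ApiKey')
    .map(([logicalId]) => logicalId)
    .filter((logicalId) => !linked.has(logicalId));
}

// A template with a key and a plan but no UsagePlanKey, as in the failing stack.
const incompleteTemplate = {
  Resources: {
    ApiKey: { Type: 'AWS::ApiGateway::ApiKey' },
    UsagePlan: { Type: 'AWS::ApiGateway::UsagePlan' },
  },
};
console.log(findUnlinkedApiKeys(incompleteTemplate)); // [ 'ApiKey' ]
```

Run against the fixed template, the same check would return an empty array, which makes it easy to use as a pre-deployment gate.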

The Third Challenge: VPC Link Connectivity

With authentication working, I faced a new issue: VPC Link was timing out after 11 seconds with "Internal server error" messages.

> curl -H "x-api-key: xxxxxx" -H "accept: application/json" -X GET "https://12345apigateway.execute-api.ap-southeast-2.amazonaws.com/" -i


HTTP/1.1 500 Internal Server Error
Server: Server
Date: Fri, 11 Jul 2025 08:03:15 GMT
Content-Type: application/json
Content-Length: 36
Connection: keep-alive
x-amzn-RequestId: 34d5467890-15db-4a40-9acb-dr123456789
x-amzn-ErrorType: InternalServerErrorException
x-amz-apigw-id: EEgTmDq5556tYRwd=

{"message": "Internal server error"}

This was particularly puzzling because:

  • ✅ Direct NLB access worked perfectly
  • ✅ VPC Link showed as "AVAILABLE"
  • ✅ Target groups were healthy
  • ✅ Security groups seemed correct

The Mysterious Timeout Pattern

[Image: Direct NLB access worked perfectly]

The pattern was clear: Direct connectivity worked flawlessly, but VPC Link couldn't reach the same NLB endpoints.

Deep Dive into Network Security

I spent considerable time investigating network-level issues:

Security Group Configuration

Initially, I configured the NLB security group to allow traffic from the CIDR blocks of the subnets where the NLB was deployed. This seemed logical but didn't work.

# This approach failed
SecurityGroupIngress:
  - IpProtocol: tcp
    FromPort: 80
    ToPort: 80
    CidrIp: xx.x.x.0/22  # NLB subnet 1
  - IpProtocol: tcp
    FromPort: 80
    ToPort: 80
    CidrIp: xx.x.x.0/22  # NLB subnet 2

The Debugging Process

I systematically verified:

  • Network ACLs: Properly configured with allow rules
  • DNS Resolution: NLB resolved to correct private IPs
  • Target Health: All Fargate containers responding
  • VPC Endpoints: API Gateway VPC endpoint working correctly
  • Direct Connectivity: Manual curl tests from EC2 instances succeeded

The Breakthrough Discovery

After extensive troubleshooting, I discovered that VPC Link only worked when the NLB security group allowed 0.0.0.0/0. This was concerning from a security perspective but provided a crucial clue about the traffic flow.

The Real Solution: AWS Documentation

While preparing to contact AWS Support about this networking puzzle, I discovered the official solution in AWS documentation:

For NLBs used with VPC Link, you should disable security group evaluation.

This is documented in the AWS API Gateway Developer Guide under point 4b.

So I updated the CloudFormation template that creates the NLB:

# PrivateLink traffic enforcement
NetworkLoadBalancer:
  Type: AWS::ElasticLoadBalancingV2::LoadBalancer
  Properties:
    Name: !Join ['', [!Ref ServiceName, NetworkLoadBalancer]]
    Type: network
    Scheme: internal
    Subnets: !Split
      - ","
      - Fn::ImportValue: !Sub ${VpcStack}-PrivateSubnetIDs
    SecurityGroups:
      - !Ref NetworkLoadBalancerSecurityGroup
    # This property controls PrivateLink traffic enforcement
    EnforceSecurityGroupInboundRulesOnPrivateLinkTraffic: "off"
    LoadBalancerAttributes:
      - Key: load_balancing.cross_zone.enabled
        Value: "true"
      - Key: access_logs.s3.enabled
        Value: "false"
      - Key: deletion_protection.enabled
        Value: "false"

How it looks in the console:

[Image: Network load balancer]

Why This Works

VPC Link uses AWS-managed infrastructure that operates outside your VPC's IP ranges. Rather than trying to guess or discover these IP ranges, AWS recommends disabling security group evaluation for NLBs used in VPC Link integrations.
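To illustrate why the subnet-based rules never matched, here is a small sketch of IPv4 CIDR matching. The addresses are hypothetical stand-ins (the real ones are redacted in this post), and the source address is only an assumed example of traffic arriving from outside the VPC's ranges.

```javascript
// Converts a dotted-quad IPv4 address to an unsigned integer.
function ipv4ToInt(ip) {
  const parts = ip.split('.');
  if (parts.length !== 4 || parts.some((p) => !/^\d+$/.test(p) || Number(p) > 255)) {
    throw new Error(`Invalid IPv4 address: ${ip}`);
  }
  return parts.reduce((acc, p) => acc * 256 + Number(p), 0);
}

// Returns true when `ip` falls inside the `cidr` block (for example 10.0.0.0/22).
function ipInCidr(ip, cidr) {
  const [base, bitsText] = cidr.split('/');
  const bits = Number(bitsText);
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) {
    throw new Error(`Invalid prefix length in: ${cidr}`);
  }
  const size = 2 ** (32 - bits); // number of addresses in the block
  const start = Math.floor(ipv4ToInt(base) / size) * size; // align to the block
  const value = ipv4ToInt(ip);
  return value >= start && value < start + size;
}

// Lists which ingress CIDRs would admit the given source address.
function matchingRules(sourceIp, cidrs) {
  return cidrs.filter((cidr) => ipInCidr(sourceIp, cidr));
}

// Hypothetical values: two NLB subnets and a source outside the VPC.
const subnetRules = ['10.0.0.0/22', '10.0.4.0/22'];
console.log(matchingRules('172.16.5.9', subnetRules)); // []
console.log(matchingRules('172.16.5.9', [...subnetRules, '0.0.0.0/0'])); // [ '0.0.0.0/0' ]
```

Under that assumption, no amount of tuning the subnet CIDRs would help, which is consistent with only 0.0.0.0/0 making the integration work.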

Security is still maintained through:

  • Private subnet placement (no internet gateway routes)
  • API Gateway authentication layers
  • Network ACLs (if configured restrictively)
  • Application-level security controls

Key Takeaways

1. AWS SDK Version Compatibility

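When upgrading Lambda runtimes, be aware of SDK version changes. The Node.js 18.x runtime bundles AWS SDK v3 rather than v2, so code that requires 'aws-sdk' must either migrate to v3 or package v2 with the function.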

2. API Gateway Usage Plan Associations

Creating an API key and usage plan isn't enough - you must explicitly associate them with AWS::ApiGateway::UsagePlanKey.

3. VPC Link Security Groups

For NLBs used with VPC Link, follow AWS guidance and disable security group evaluation rather than trying to configure specific IP ranges.

4. Systematic Troubleshooting

When facing complex networking issues:

  • Verify each component independently
  • Test direct connectivity vs. service-mediated connectivity
  • Check AWS documentation for service-specific configuration patterns
  • Consider contacting AWS Support for service-level investigations
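The direct versus service-mediated comparison can be scripted so that both paths are measured the same way. This is a rough sketch using only Node.js built-ins; the function names probe and compareEndpoints, the 15-second default timeout, and the placeholder URLs in the usage comment are assumptions for illustration.

```javascript
const http = require('node:http');
const https = require('node:https');

// Sends one GET request and always resolves with the outcome (status or error,
// plus elapsed milliseconds), so several endpoints can be compared side by side.
function probe(url, { headers = {}, timeoutMs = 15000 } = {}) {
  const client = url.startsWith('https:') ? https : http;
  const startedAt = Date.now();
  return new Promise((resolve) => {
    const finish = (outcome) => resolve({ url, ...outcome, ms: Date.now() - startedAt });
    try {
      const req = client.get(url, { headers, timeout: timeoutMs }, (res) => {
        res.resume(); // discard the body; only status and timing matter here
        res.on('end', () => finish({ status: res.statusCode }));
        res.on('error', (err) => finish({ error: err.message }));
      });
      req.on('timeout', () => req.destroy(new Error(`no response within ${timeoutMs} ms`)));
      req.on('error', (err) => finish({ error: err.message }));
    } catch (err) {
      finish({ error: err.message }); // for example, a malformed URL
    }
  });
}

// Probes each named endpoint in turn and returns the results keyed by name.
async function compareEndpoints(endpoints) {
  const results = {};
  for (const [name, { url, headers }] of Object.entries(endpoints)) {
    results[name] = await probe(url, { headers });
  }
  return results;
}

// Usage from an instance inside the VPC (placeholders, not real endpoints):
//
// compareEndpoints({
//   direct: { url: 'http://<nlb-dns-name>/' },
//   viaGateway: {
//     url: 'https://<api-id>.execute-api.<region>.amazonaws.com/',
//     headers: { 'x-api-key': '<key>' },
//   },
// }).then(console.log);
```

A fast 200 on the direct path next to a slow 500 on the gateway path points at the hop between API Gateway and the NLB rather than at the application itself.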

5. Documentation First

Before implementing complex workarounds, always check the official AWS documentation. The solution to our VPC Link issue was documented but easy to miss.

Final Architecture

Our final working configuration:

  • API Gateway with proper Usage Plan and API key associations
  • Lambda authorizer using AWS SDK v3
  • VPC Link connecting to NLB with security group evaluation disabled
  • Internal NLB in private subnets
  • Fargate containers handling application logic

This setup provides a secure, scalable private API architecture that follows AWS best practices.

Conclusion

Complex AWS networking scenarios often require understanding the interaction between multiple services. What seemed like a simple VPC Link + NLB integration revealed several layers of configuration requirements, from Lambda runtime compatibility to service-specific security patterns.

The key to successful troubleshooting is methodical verification of each component, careful reading of AWS documentation, and understanding that AWS services sometimes have specific configuration patterns that differ from general networking principles.

When in doubt, the AWS documentation and support team are invaluable resources for understanding the intended architecture patterns for complex service integrations.
