Some thoughts on running Perforce P4 on AWS


Publish Date: Jul 7

We needed a cloud deployment of Perforce P4 (Helix Core + Swarm).

Perforce is the industry standard version control system for game development—used by studios like Epic, EA, and Ubisoft—because it handles huge binary assets, massive repos, and global teams better than anything else.

AWS offered the basic building blocks, but getting everything production-ready meant hitting a few sharp edges and doing some deep dives.

Here's the story from my perspective, including the gotchas and fixes I wish someone had told me about earlier. The view is more from an administrative perspective than an end-user one.

For deployment we used the Cloud Game Development Toolkit (CGD) - which was enriched with some additional pieces here and there to fit our use case.

I deployed with CGD v1.1.2-alpha, and also contributed by flagging future improvements to the framework. At the time of writing, some of the issues we ran into have already been tackled in newer releases.

But for the purposes of this blog you can refer to the example deployment from the toolkit - you can also test it out yourself if you want.

Perforce does let you use the product without a license if you have fewer than five users.

Fixing the Swarm Docker Image

After initially deploying the environment we noticed an issue: every time we triggered a redeploy of the Swarm container, Swarm stopped working because its extension configuration changed on the commit server.

It's a feature: the container is meant to update the configuration to make sure everything works after it is recreated. Tokens are shared and requests are pointed to the correct place. Or at least they should be.

The official perforce/helix-swarm Docker image has a hardcoded http scheme in its configure-swarm.sh script. Since our setup used an AWS Network Load Balancer (NLB) + AWS Application Load Balancer (ALB) with an SSL certificate terminating TLS in front of the Fargate service, every time the Swarm container configured the Swarm extension on the commit server, it set the Swarm URL with the wrong scheme.

The container only lets you configure the hostname part of the Swarm URL - http:// is hardcoded. So instead of using the Perforce-provided container image, we had to create our own and host it on Amazon Elastic Container Registry (ECR).

Here's a short snippet of what to put into your Dockerfile:

FROM perforce/helix-swarm

USER root

# Change hardcoded http -> https for SWARM_URL
RUN sed -E 's/http(:\/\/\$SWARM_HOST)/https\1/g' -i /opt/perforce/swarm/sbin/configure-swarm.sh

# Make sure the image-defined entry point won't interfere
ENTRYPOINT []

# Ensure the container starts as the original image would
CMD ["/bin/sh", "-c", "/opt/perforce/swarm/sbin/swarm-docker-setup.sh"]

I'll assume here that you're familiar with Docker, and won't go into the details of building your own image.

Once you've built a custom image on top of perforce/helix-swarm, push it to Amazon ECR and deploy from there (update your Terraform to point to it).

echo "Building Docker image..."
docker buildx build --platform "${PLATFORM}" -t sc-helix-swarm:latest .

echo "Logging in to ECR..."
aws ecr get-login-password --region "${REGION}" | docker login --username AWS --password-stdin "${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com"

echo "Tagging local image as ${IMAGE_URI}..."
docker tag sc-helix-swarm:latest "${IMAGE_URI}"

echo "Pushing image to ECR..."
docker push "${IMAGE_URI}"
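
To finish up, point your Terraform at the new image. This is only a sketch - the module path and input name below are assumptions, so check what the swarm module in your CGD toolkit version actually exposes:

module "helix_swarm" {
  source = "./modules/helix-swarm" # wherever the CGD toolkit swarm module lives in your repo

  # Hypothetical input name - check the module's variables.tf for the real one
  helix_swarm_container_image = "<ACCOUNT_ID>.dkr.ecr.<region>.amazonaws.com/sc-helix-swarm:latest"

  # ...the rest of your existing module inputs stay as they are
}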

Attempting to scale Swarm containers

Could we run into performance issues with Swarm? When we have 500+ users, will it work? Will user experience be good?

Out of the box, Swarm comes with 3 workers and is a rather small container running on Fargate. Our estimate was that this would at some point become a bottleneck - we haven't hit issues yet.

And since Swarm is behind an AWS Application Load Balancer (ALB), I assumed it would support high availability and automatic scaling. That made total sense to me. Who would release something that doesn't?

But no - not the case.

The template back then supported a container count variable.

So I tried running multiple Swarm containers by increasing the container count, running multiple tasks in Fargate. Looked nice, everything came up OK. But only 1/3 of requests actually went through; all the others got an ERROR: Swarm communication error (Missing or invalid token) message.

Why is that? You give the container rights to update its own extension configuration on the core. So it calls home and updates where it resides, along with a secret token that the core uses to talk to the Swarm server. Since each container updates the central extension config with its own token, the last one to register wins and the others get invalidated. This left one of our three Swarm tasks with a valid token, while the other two had invalid ones.

I thought for a moment about whether I could circumvent this - maybe I could build a container where the tokens are always the same. But in the end the containers would need to talk to each other in some way for the user experience to stay sane, and the task became too cumbersome to pursue. I hope Perforce themselves take another look at their setup and come up with a more HA-friendly solution.

The official answer from Perforce support, when I asked about the topic, was "Swarm does not scale though, you can only have a single Swarm service".

So, more ideas: maybe I could make Swarm perform better, so we wouldn't hit the problems we forecast.

To do that, I was thinking I would tune the container to handle requests more efficiently by replacing mpm_prefork with mpm_worker, installing php8.1-fpm, and forwarding PHP requests through PHP-FPM.

It didn't take too long to write a Dockerfile that would swap these into the default setup - after all, it's just a basic Apache + PHP configuration one needs to do. I spent a few hours on it, and was happy it deployed nice and clean.

Turns out: threading and Swarm’s PHP stack aren’t friends.

The end result was that it didn't work at all - the current Swarm PHP setup simply doesn't support threading.

So unless you're planning to rewrite Swarm, don't bother trying to make the container run better/faster/harder. You can give it more memory/CPU and change the worker count, but other than that, just live with it.
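
If you do want to throw resources at it, the remaining knobs are the Fargate task size (and Swarm's own worker count in its config). A rough Terraform sketch of the task sizing side - the variable names here are assumptions, so check the swarm module's actual inputs before copying anything:

# Variable names below are assumptions - check the swarm module's inputs.
# Note that Fargate only accepts specific cpu/memory pairings
# (e.g. 2048 cpu units can be paired with 4096-16384 MB of memory).
module "helix_swarm" {
  source = "./modules/helix-swarm"

  helix_swarm_container_cpu    = 2048 # 2 vCPU
  helix_swarm_container_memory = 4096 # 4 GB

  # ...the rest of your existing module inputs
}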

Using SES to send emails from Swarm

A request came in to have Swarm send out emails.

You can configure Swarm to send emails through SES. Here's the configuration that worked for us - it goes under the 'mail' key in Swarm's config.php:

        'transport' => array(
            'host' => 'email-smtp.<region>.amazonaws.com',
            'port' => 587,
            'connection_class' => 'login',
            'connection_config' => array(
                'username' => '<SES USER KEY>',
                'password' => '<SES USER SECRET>',
                'ssl' => 'tls',
            ),
        ),

I would have thought someone else would have done this earlier, but I was unable to find clear instructions for what one needs to set - so it was a bit of trial and error. The documentation I found on Google wasn't too clear on what the connection_class needs to be.

Again, issues with how the container in the CGD toolkit is set up: the configuration above works, but it is stored on ephemeral storage and gets recreated when the container starts. So every restart (an automatic one after an error, for example) causes email to stop going out.

The setup script doesn't support giving it any more detail than -e (--email-host), which ends up in the config as:

    'transport' => array(
        'host' => '$EMAIL_HOST',
    ),

There is no persistent storage in the Fargate container - a clear misunderstanding on our part. We (wrongly) assumed a separate volume would only be needed for persistent data; it turned out the config was getting wiped on every restart.

I ended up writing code to mount EFS into the Fargate container, and that enabled the configuration to persist.

In case you need to add EFS to your Swarm module, here you go:

# Define EFS file system
resource "aws_efs_file_system" "swarm" {
  creation_token = "helix-swarm-efs"
  lifecycle_policy {
    transition_to_ia = "AFTER_7_DAYS"
  }
  encrypted = true
}

# Create a mount target in the appropriate subnet
resource "aws_efs_mount_target" "swarm" {
  for_each = toset(var.helix_swarm_service_subnets)

  file_system_id  = aws_efs_file_system.swarm.id
  subnet_id       = each.key
  security_groups = [aws_security_group.swarm_efs.id]
}

# Security group to allow access to EFS from ECS tasks
resource "aws_security_group" "swarm_efs" {
  name        = "swarm-efs-sg"
  description = "Allow ECS tasks to connect to EFS"
  vpc_id      = var.vpc_id

  ingress {
    from_port   = 2049
    to_port     = 2049
    protocol    = "tcp"
    security_groups = [aws_security_group.helix_swarm_service_sg.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# EFS Access Point
resource "aws_efs_access_point" "swarm" {
  file_system_id = aws_efs_file_system.swarm.id

  root_directory {
    path = "/swarm"
    creation_info {
      owner_gid   = 0
      owner_uid   = 0
      permissions = "0777"
    }
  }
}

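The EFS resources alone aren't enough - the file system still has to be wired into the ECS task definition and mounted into the container. Here's a minimal sketch of the pieces involved, assuming the Swarm data directory is /opt/perforce/swarm/data (where the stock image keeps config.php); merge this into your existing task definition rather than copying it verbatim:

resource "aws_ecs_task_definition" "helix_swarm" {
  family                   = "helix-swarm"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 1024
  memory                   = 2048

  # Declare the EFS file system as a task volume, going through the access point
  volume {
    name = "swarm-data"
    efs_volume_configuration {
      file_system_id     = aws_efs_file_system.swarm.id
      transit_encryption = "ENABLED"
      authorization_config {
        access_point_id = aws_efs_access_point.swarm.id
      }
    }
  }

  container_definitions = jsonencode([
    {
      name  = "helix-swarm"
      image = "<your Swarm image URI>"
      # Mount the volume over Swarm's data directory so config.php persists
      mountPoints = [
        {
          sourceVolume  = "swarm-data"
          containerPath = "/opt/perforce/swarm/data"
        }
      ]
    }
  ])
}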

After persistent storage was added to the setup, the SES email configuration persists across restarts.

Just make sure you take a look at how SSO is configured, if you're using SSO and have the parameter enabled in Terraform. (You'll see what I mean.)

Observability: Install P4Prometheus Early

Perforce can be really memory-hungry in certain situations. When you have a lot of files, tags, and branches, lists of files tend to grow. And if you're running commands that make Perforce look at the whole depot's storage, it can easily OOM itself. We ran into memory issues early on while testing and developing the environment. Figuring out why and where took a lot of time, plus some questions back and forth with people smarter than ourselves.

A tip there: limit users from being able to access everything, and educate them on how their workspaces need to be set up.

The thing that Perforce does suggest is to use p4prometheus, which we then did - even though it's not our monitoring tool of choice for all other environments (and still isn't).

Installing p4prometheus helped pinpoint bottlenecks.

(Image: Grafana dashboard for p4prometheus)

It ships with ready-made Grafana dashboards and gives real visibility into performance, without the effort of building everything yourself. And the installation is straightforward.

Highly recommended to do early on. We opted for the EC2 route, running this on a small Graviton instance. Where would you run it in your environment?
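
For reference, the instance side of it is nothing special - a rough Terraform sketch of the kind of box we run it on (the AMI filter and subnet variable are illustrative, adjust for your environment):

# Latest Amazon Linux 2023 ARM64 image for a Graviton instance
data "aws_ami" "al2023_arm64" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-2023*-arm64"]
  }
}

resource "aws_instance" "p4_monitoring" {
  ami           = data.aws_ami.al2023_arm64.id
  instance_type = "t4g.small"              # a small Graviton instance is plenty
  subnet_id     = var.monitoring_subnet_id # assumption: your own subnet variable

  tags = {
    Name = "p4-monitoring"
  }
}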

FSx for NetApp ONTAP

When the storage requirement goes over 16 TB, you can't run on a single Elastic Block Store (EBS) volume anymore - maximum size is maximum size. Hard limit.

So what to do? Well, you could run software RAID on your instance, or you could use separate volumes for separate depots. But that would introduce more complexity. So rather than doing that, we opted for Amazon FSx for NetApp ONTAP (FSxN).

It will scale up to 72 GB/s of throughput, up to 2.4 million IOPS, and up to 1 PiB of SSD storage.

Insane numbers if you ask me, and overkill for most, but perfect when you need it. It does the trick, does everything one might think of needing.

Just don't test it with the Terraform provider's default example when creating it (at the time I was building this, the CGD toolkit didn't support creating FSxN, so I learned and did it myself).

Provider example:

resource "aws_fsx_ontap_volume" "snaplock_volume" {
  name                       = "snaplock-vol"
  storage_virtual_machine_id = aws_fsx_ontap_storage_virtual_machine.example.id
  size_in_megabytes          = 102400
  junction_path              = "/snaplock-vol"
  ontap_volume_type          = "RW"
  security_style             = "UNIX"
  tiering_policy {
    name = "SNAPSHOT_ONLY"
  }

  snaplock_configuration {
    snaplock_type = "COMPLIANCE"

    retention_period {
      default_retention {
        type  = "MONTHS"
        value = 6
      }
      minimum_retention {
        type  = "MONTHS"
        value = 6
      }
      maximum_retention {
        type  = "MONTHS"
        value = 6
      }
    }
  }
}

Of course I tried it out before I understood not to create SnapLock volumes if you don't really need them - like us, in a dev environment while testing out the architecture.

Cannot delete the volume because it contains unexpired log files.

We ended up having the FSxN volume for 6 months before being able to delete it - luckily it was just a tiny single-AZ deployment. Even AWS can't help you remove it; you just have to wait until the SnapLock retention expires.

So be careful out there.
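
For the record, the fix is simply to leave the snaplock_configuration block out entirely when you don't need WORM retention - a sketch along the same lines as the provider example (names are placeholders):

resource "aws_fsx_ontap_volume" "p4_depots" {
  name                       = "p4_depots"
  storage_virtual_machine_id = aws_fsx_ontap_storage_virtual_machine.example.id
  size_in_megabytes          = 102400
  junction_path              = "/p4_depots"
  ontap_volume_type          = "RW"
  security_style             = "UNIX"

  tiering_policy {
    name = "SNAPSHOT_ONLY"
  }

  # No snaplock_configuration block: without it the volume can be deleted
  # whenever you want, which is what you want in a dev/test environment.
}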

Hidden pitfalls that we fell into

  1. Free space must exist in the Storage Virtual Machine (SVM), not just in the file system itself. When the SVM got near full, Perforce started failing writes, despite what the top-level volume metrics showed.

We had allocated too much of the SVM storage to the iSCSI block storage mounted on the server. iSCSI just stopped writing at times, with no clear reason why. Storage looked like it had room (the documented percentage of free space was still there), but it was in the wrong place.

  2. Don't forget _netdev in /etc/fstab for iSCSI mounts. Missing it caused the server to hang on reboot. We rebuilt our instance a few times before catching this.

A human does become blind to their own tiny mistakes - a second pair of eyes might have helped.

Use SDP Tools - replicating data

Perforce's Server Deployment Package (SDP) is a gift. It gives you structure, scripts, backups, rotations—and a documented best-practices baseline.

Yes, you could do it your own way. But unless you're a masochist or need a one-off snowflake deployment, just use SDP.

We tried googling how to set up a Perforce edge server, ending up in documentation that really wasn't for our use case, but we tried it anyway.

After some trial and error, we were pointed to the mkrep.sh script; it does all the required magic under the hood and prints out the manual steps needed to get replication up and running.

Graph depot replication bug

When users started using a graph depot through edge servers, we started getting reports of Blob data not found in archives for sha <sha> errors and of users not being able to work. For some reason, files were missing from the edge server.

We quickly identified that manually copying over the files works. OK, initial fire put out. But it lit up again after the next updates to the graph depot.

I spent hours on the phone with Perforce support, walking through logs and packet traces. It became a bit of a detective story.

We tcpdumped the network traffic between the edge and the commit server, and everything looked fine. The edge server sent a request upstream to the core server, and the core server dutifully responded with the blob in question. We literally saw the data leave the core and arrive at the edge — but somehow it never made it back to the client.

Here's how the flow looked:
Client -> Edge -> Core -> Answer to Edge -> Error to client

What made this harder was that everything seemed healthy — no logs complained (other than the error sent to client), and replication said it succeeded.

Eventually, Perforce support traced the issue to a bug in the system. Instead of writing to the mounted depot volume, the data was being written to /p4/1/root on the local file system.

So if you experience this error message, take a look at whether those blobs are actually being written into the root folder.

A temporary fix is to create a symlink to the depot volume.

ln -s /p4/1/depots/<graphdepot> /p4/1/root/<graphdepot>

This causes the data to be written to the correct place, from where p4 then serves it to the client. I expect Perforce to have this fixed in some future release.

Final Thoughts

Running Perforce on AWS can absolutely be done - but sometimes you'll need to look under the hood and adapt. Remember to set up observability early, and don't hesitate to call Perforce support if/when you hit weird issues in replication. And remember: Swarm doesn't scale horizontally.
