A benchmark sponsored by EDB, a PostgreSQL company, in 2019 contributed to the myth that MongoDB transactions are slow. Even though the work was done by the reputable OnGres team, the code wasn't properly designed to test MongoDB's scalability. At that time, the feature was new, likely not well-documented, and the benchmark framework overlooked the retry logic. In this context, no one is to blame for past publications, but analyzing this benchmark will help prevent the spread of these myths.
MongoDB uses lock-free optimistic concurrency control (OCC) with fail-on-conflict behavior: as soon as a write detects a concurrent change under Multi-Version Concurrency Control (MVCC), it raises an error, requiring applications to handle transient errors differently than in a traditional RDBMS with pessimistic locking and wait-on-conflict behavior. The benchmark developers, PostgreSQL experts, likely missed this because they based the benchmark on a MongoDB demo focused on capabilities, not performance, which neglected proper concurrency control.
We should disregard this benchmark today, but this blog post series offers an opportunity to analyze its flaws, debunk myths, and educate readers on effective transaction handling in MongoDB applications.
The problematic code in the MongoDB 4.0 demo from 7 years ago was:
def run_transaction_with_retry(functor, session):
    assert isinstance(functor, Transaction_Functor)
    while True:
        try:
            with session.start_transaction():
                result = functor(session)  # performs transaction
                commit_with_retry(session)
            break
        except (pymongo.errors.ConnectionFailure, pymongo.errors.OperationFailure) as exc:
            # If transient error, retry the whole transaction
            if exc.has_error_label("TransientTransactionError"):
                print("TransientTransactionError, retrying "
                      "transaction ...")
                continue
            else:
                raise
    return result
It was translated to Java in the benchmark code as:
private void runWithRetry() {
    while (true) {
        try {
            runnable.run();
            break;
        } catch (RetryUserOperationException ex) {
            retryMeter.mark();
            continue;
        }
    }
}
If you are familiar with optimistic, fail-on-conflict concurrency control, you may recognize a significant issue: there is no wait (backoff) before retrying. With such an infinite loop, high-concurrency access acts like a DDoS attack on the database rather than resolving contention.
A typical retry loop implements exponential backoff, and here is an example:
private void runWithRetry() {
    final long initialDelayMillis = 5; // start with 5ms
    final long maxDelayMillis = 1000;  // max wait of 1s
    long delay = initialDelayMillis;
    while (true) {
        try {
            runnable.run();
            break;
        } catch (RetryUserOperationException ex) {
            retryMeter.mark();
            try {
                // jitter by up to 50% to avoid thundering herd
                long jitter = (long) (Math.random() * delay / 2);
                long sleep = delay + jitter;
                Thread.sleep(sleep);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                // Optionally log or handle interruption here
                throw new RuntimeException("Retry loop interrupted", ie);
            }
            delay = Math.min(delay * 2, maxDelayMillis);
            continue;
        }
    }
}
This code makes the first retry wait roughly 5–7 ms, then 10–15 ms, 20–30 ms, and so on, up to 1000–1500 ms. If you use Spring Data, you can simply annotate your @Transactional method with:
@Retryable(
    value = RetryUserOperationException.class,
    maxAttempts = 10,
    backoff = @Backoff(
        delay = 5,        // first delay in ms
        maxDelay = 1000,  // max delay in ms
        multiplier = 2,   // exponential
        random = true     // adds jitter
    )
)
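The same exponential backoff with jitter can be written driver-agnostically in Python. This is a minimal sketch, not the benchmark's code; the TransientError class and run_with_backoff name are illustrative stand-ins for a driver's retryable error and your own wrapper:

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a driver's transient/retryable error (illustrative)."""


def run_with_backoff(operation, initial_delay=0.005, max_delay=1.0, max_attempts=10):
    """Retry `operation` on TransientError with exponential backoff and jitter."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # jitter by up to 50% to avoid a thundering herd of retries
            time.sleep(delay + random.uniform(0, delay / 2))
            delay = min(delay * 2, max_delay)
```

Bounding the attempts, unlike the benchmark's infinite loop, also surfaces persistent contention to the caller instead of hiding it.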
MongoDB employs a similar approach for auto-commit single-document transactions, transparently, so that it appears as if the application is waiting for a lock to be acquired. However, it cannot automatically cancel and retry explicit transactions where the application might perform non-transactional work, such as writing to a file, pushing to a queue, or sending an email. For transactions involving multiple statements, no database can automatically retry the process. The application itself must handle retries.
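For explicit multi-statement transactions, drivers do offer a callback API that centralizes this retry logic: pymongo's session.with_transaction retries the callback on TransientTransactionError (immediately, without backoff, so contended workloads may still want the backoff wrapper above it). A sketch, assuming a running replica set; the book_seat function, database, and collection names are hypothetical:

```python
def book_seat(session):
    # All writes use the same session, so they commit or abort atomically.
    db = session.client.get_database("airline")
    db.seats.update_one({"_id": "AB123-12A", "taken": False},
                        {"$set": {"taken": True}}, session=session)
    db.audit.insert_one({"seat": "AB123-12A"}, session=session)


def main():
    # Import here so the sketch stays self-contained; requires pymongo
    # and a reachable replica set (transactions need one).
    from pymongo import MongoClient
    client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
    with client.start_session() as session:
        # with_transaction runs the callback, commits, and transparently
        # retries on TransientTransactionError.
        session.with_transaction(book_seat)
```

Crucially, the callback must contain only database work: anything non-transactional (files, queues, emails) would be repeated on every retry.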
In PostgreSQL, a conflict can raise a serialization error at the higher isolation levels, and even under Read Committed, deadlocks can still occur. PostgreSQL locks rows as it writes them during a transaction, using two-phase locking, and typically waits for a conflicting lock to be released. In this case, the impact of an inefficient retry loop is minimal.
However, MongoDB is optimized for high concurrency, allowing it to avoid holding locks between database calls. Instead of waiting, it detects write conflicts instantly and raises a retriable error. Therefore, implementing an efficient retry mechanism is essential.
The benchmark's flaws can't be blamed on anyone: it was created when MongoDB transactions were new and perhaps not well documented, at a time when distributed-systems knowledge was less widespread and monolithic RDBMS were more popular. The problem is that people still reference this benchmark without understanding what was wrong. The poor performance was caused by unnecessary retries, because no backoff was implemented in the retry loop.
The authors of the benchmark found documentation that they believed explained this behavior, which likely contributed to their decision not to implement backoff in the application, mistakenly thinking it was handled by the database:
Since the probability of collision increases (possibly exponentially) with the effective number of transactions processed, it follows that MongoDB is more eager to retry transactions. This is consistent with the expectation set on MongoDB’s documentation about transactions and locking, which states that “by default, transactions waits up to 5 milliseconds to acquire locks required by the operations in the transaction. If the transaction cannot acquire its required locks within the 5 milliseconds, the transaction aborts”. This behavior can be changed by setting the maxTransactionLockRequestTimeoutMillis parameter.
What is called a "lock" here is different from what SQL databases call a "lock" in two-phase locking transactions, where locks are held for the duration of the transaction. MongoDB is lock-free in that sense, using optimistic concurrency control rather than locking. What is called a "lock" here is closer to what SQL databases call a "latch" or "lightweight lock": short-lived, never spanning multiple database calls. For such waits, five milliseconds is a good default. But this is not what the benchmark experienced.
Such timeout would raise the following exception: Unable to acquire lock ... within a max lock request timeout of '5ms' milliseconds.
What the benchmark catches in the retry loop is a write conflict, which happens before any attempt to acquire such a short-lived lock: Command failed with error 112 (WriteConflict): 'Caused by :: Write conflict during plan execution and yielding is disabled. :: Please retry your operation or multi-document transaction.'
Such a write conflict has nothing to do with maxTransactionLockRequestTimeoutMillis: the transaction doesn't even try to acquire a lock, because there is nothing left to write once transaction isolation (the 'I' in ACID) can no longer be guaranteed. While reading, it detected that its read snapshot, the state as of the beginning of the transaction, had been modified by another transaction. It doesn't wait, because the snapshot would be stale if the other transaction committed, so it immediately returns to the application. The application must compensate or roll back what it did during the transaction, wait a small amount of time (the exponential backoff), and retry.
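In application code, the reliable way to classify such errors is the error label rather than the error code. A minimal sketch; it duck-types against pymongo's has_error_label method so any exception exposing that method works:

```python
def is_transient(exc):
    """Return True if the error is labeled as safe to retry as a whole transaction."""
    has_label = getattr(exc, "has_error_label", None)
    return bool(has_label and has_label("TransientTransactionError"))
```

A retry loop would catch the driver's exception, call is_transient, and either back off and retry or re-raise.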
In PostgreSQL, under the Read Committed isolation level, a transaction can wait because it is allowed to read a state that may not be consistent with its start time. If a concurrent transaction commits during this time, PostgreSQL simply continues reading the committed data, mixing data from different states. This is not permitted at higher isolation levels, like Serializable, where a transient error must be raised, as in MongoDB, to guarantee ACID properties. However, PostgreSQL uses a pessimistic locking approach and waits to see whether the other transaction commits, allowing it to retry immediately once the conflict is resolved. This is why retry logic without backoff does not have the same consequences there.
You may wonder why MongoDB doesn't implement waiting like PostgreSQL does. PostgreSQL is designed as a single-writer instance, which cannot scale horizontally, making it simple to use a wait queue in shared memory. However, when sharding PostgreSQL with the Citus extension, this design breaks down, leading to eventual consistency for cross-shard reads. In contrast, MongoDB is built for horizontal scalability and opts for optimistic concurrency control instead of a distributed wait queue, providing consistent cross-shard reads across nodes (when the read concern is set to majority).
I prefer not to link the benchmark paper to avoid helping search engines or LLM crawlers find outdated content, but it is easy to find. The benchmark code is available in a repository. I prefer to link to the MongoDB transaction documentation instead. Now you know where the myth about slow transactions comes from: incorrect understanding of MongoDB lock-free ACID transactions.
There's more to say about this benchmark. In the benchmark code, there's a hotspot on the "audit" table which is indexed for the PostgreSQL definition, but not for the MongoDB definition. This is visible as MongoDB logs slow queries by default:
mongo-1 | {"t":{"$date":"2025-08-05T21:31:22.655+00:00"},"s":"I", "c":"WRITE", "id":51803, "ctx":"conn31"
,"msg":"Slow query","attr":{"type":"update","isFromUserConnection":true,"ns":"postgres.audit","collectionType":"normal"
,"command":{"q":{"schedule_id":4778,"day":{"$date":"2025-08-05T00:00:00.000Z"}},"u":{"$set":{"date":{"$date":"2025-08-05T21:31:22.533Z"}},"$inc":{"seats_occupied":1}},"multi":false,"upsert":true}
,"planSummary":"COLLSCAN","planningTimeMicros":255,"keysExamined":0,"docsExamined":5962,"nMatched":1,"nModified":1,"nUpserted":0,"keysInserted":0,"keysDeleted":0,"numYields":0,"planCacheShapeHash":"99470B66","queryHash":"99470B66","planCacheKey":"031BFB16"
,"locks":{"MultiDocumentTransactionsBarrier":{"acquireCount":{"w":1}},"ReplicationStateTransition":{"acquireCount":{"w":3}},"Global":{"acquireCount":{"w":1}},"Database":{"acquireCount":{"w":1}},"Collection":{"acquireCount":{"w":5}}},"flowControl":{"acquireCount":1},"readConcern":{"level":"snapshot","provenance":"clientSupplied"},"storage":{"data":{"txnBytesDirty":128}},"cpuNanos":32601533,"remote":"172.18.0.6:42292","queues":{"execution":{"admissions":1},"ingress":{"admissions":1}},"workingMillis":121,"durationMillis":121}}
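The plan summary above shows a COLLSCAN examining 5,962 documents for a filter on schedule_id and day; a compound index on those two fields would turn it into an index scan. A sketch, assuming pymongo and the benchmark's collection name; `db` is a pymongo Database handle:

```python
def create_audit_index(db):
    """Create the compound index the slow query above is missing.

    The key order matches the query filter {schedule_id: ..., day: ...},
    so the upsert becomes an IXSCAN instead of a COLLSCAN.
    """
    return db.audit.create_index([("schedule_id", 1), ("day", 1)])
```

With the index in place, the hotspot on the audit counter remains, but each conflicting retry at least stops paying a full collection scan.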
To improve performance and scalability, create indexes and avoid hotspots. If hotspots are unavoidable, fail fast by performing the operations subject to write conflicts early in the transaction rather than at the end, as done here. The data model should allow critical transactions to be single-document, avoiding the need for normalization across multiple tables, but this benchmark uses the same normalized data model on both databases. Finally, no real application performs a business transaction like this one: reserving a flight seat, recording the payment, and incrementing an audit counter all in one database transaction. You don't want to hold database state and locks while waiting for a payment validation that typically depends on an external service.
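To illustrate the single-document approach: if a flight document carries both its seat count and its capacity (hypothetical field names), the reservation becomes one atomic, conditional update with no multi-document transaction at all. A sketch building the filter/update pair:

```python
def reserve_seat_update(flight_id):
    """Build the filter/update pair for an atomic, single-document reservation.

    The filter only matches while seats remain, so the $inc can never
    oversell; a matched_count of 0 from update_one means the flight is full.
    """
    filter_doc = {"_id": flight_id,
                  "$expr": {"$lt": ["$seats_occupied", "$capacity"]}}
    update_doc = {"$inc": {"seats_occupied": 1}}
    return filter_doc, update_doc
```

Usage would be something like `collection.update_one(*reserve_seat_update("AB123-2025-08-05"))`, relying on MongoDB's single-document atomicity instead of an explicit transaction.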
Some myths are spread from outdated online content. We can use these to identify common mistakes and best practices for MongoDB. That's the goal of this blog post series.