One million $lookup challenge
Franck Pachot


Publish Date: Jun 19

If you have read my previous post, $lookup: more than just a SQL join, you understand that $lookup is not designed to join scalar values across thousands of documents. $lookup is efficient at the end of an aggregation pipeline, once the data has been reduced, not before the aggregation on a million-document collection (examples in Comparison of JOINS 👉🏻 aggregation pipeline and CTEs). However, such a collection should not require a join in the first place: documents are designed to aggregate multiple related objects, unlike relational databases, which normalize business data into multiple tables.

In a many-to-one relationship, it is common in a document model to embed fields, even when they are duplicated. Normalization plays a crucial role in relational databases to prevent these duplicates, as RDBMS were designed for interactive users executing SQL statements, where a missed update can lead to data integrity issues. While triggers can help manage updates to duplicated values and prevent anomalies, they introduce new challenges because they operate behind the update statement.

When updates originate from well-reviewed and tested programs, it is manageable to modify data in multiple locations, particularly when such updates are infrequent. Let's illustrate joins and the absence of joins with a simple test.

To join multiple documents with a small lookup table, you can cache the lookup table in your application. In this post, I tested several methods for retrieving a value from a lookup table: using a collection, a map, and an array. I integrated these methods into an aggregation pipeline, but keep in mind that this can also be accomplished within the application itself.

I created a dimension table with one thousand documents, and a fact table with one million. The fact table has a "ref" field that references the "_id" of the dimension table:

db.dim.drop();
db.fact.drop();

db.dim.insertMany(
    Array.from({ length: 1000 }, (_, i) => ({
        _id: i + 1,
        value: Math.random()
    }))
);

db.fact.insertMany(
    Array.from({ length: 1000000 }, () => ({
        ref: Math.ceil(Math.random() * 1000),
        value: Math.random()
    }))
);

Lookup (IndexedLoopJoin): 10 seconds

Here is an aggregation pipeline with a lookup.

x=db.fact.aggregate([
    {
        $lookup: {
            from: "dim",
            localField: "ref",
            foreignField: "_id",
            as: "dimData" ,
        }
    },
]).explain("executionStats")
;
print(x["executionStats"]["executionTimeMillis"]/1000+" seconds")


On this data, it runs in ten seconds. The query planner chooses an Index Nested Loop Join because there is an index on the foreign field (the "_id" of "dim"). Without an index, it could use a hash join.
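
Whether the plan used an indexed loop join or a hash join can be checked in the explain output. Here is a quick sketch to grep the strategy from the plan; the exact field names and their location depend on the MongoDB version and execution engine, so take it as an illustration only:

// Look for the join strategy reported by the $lookup stage in the explain output
// (for example IndexedLoopJoin, HashJoin, or NestedLoopJoin with the slot-based engine)
const plan = db.fact.aggregate([
  { $lookup: { from: "dim", localField: "ref", foreignField: "_id", as: "dimData" } }
]).explain("executionStats");
print(JSON.stringify(plan, null, 2).match(/"strategy"\s*:\s*"[^"]+"/g));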

Map to object and $getField: 61 seconds

To avoid the lookup, I read the dimension table into an object with one field per value, the field name being the "_id" converted to a string, and retrieve the value with $getField:


const dimMap = {};  
db.dim.find().forEach(doc => {  
    dimMap[doc._id] = doc.value;  
});  
print( dimMap )


x=db.fact.aggregate([
    {  
        $addFields: {  
            dimValue: {  
                $getField: {  
                    field: { $toString: "$ref" },  
                    input: dimMap  
                }  
            }  
        }  
    }  
]).explain("executionStats")
;
print(x["stages"][0]["$cursor"]
["executionStats"]["executionTimeMillis"]/1000+" seconds")


On this data, it runs in about one minute. Accessing a field by name is not an optimal operation: it is O(n) in the number of fields, so it is a viable solution only for very small lookup tables.

Map to $switch branches: 23 seconds

Instead of using that map, I build a $switch expression to use in the aggregation pipeline.


const dimMap = {};  
db.dim.find().forEach(doc => {  
    dimMap[doc._id] = doc.value;  
});  
print( dimMap )

const switchBranches = Object.entries(dimMap).map(([id, value]) => ({  
    case: { $eq: ["$ref", parseInt(id)] },  
    then: value  
}));  
print( switchBranches )

x=db.fact.aggregate([  
    {  
        $addFields: {  
            dimValue: {  
                $switch: {  
                    branches: switchBranches,  
                    default: null  
                }  
            }  
        }  
    }  
]).explain("executionStats")
;
print(x["stages"][0]["$cursor"]
["executionStats"]["executionTimeMillis"]/1000+" seconds")


On this data, it runs in about twenty seconds, and since it inlines all the lookup logic into the query text, it is acceptable only for small lookup tables.

Map to array and $arrayElemAt: 1 second

Instead of a map, I use an array where the position is the dimension "_id". As I have no guarantee that the identifiers are sequential with no gaps, I build a sparse array that I fill with the existing values.


// Get the maximum ID
const maxId = db.dim.aggregate([
 {$group:{_id:null,max:{$max:"$_id"}}}
]).toArray()[0].max;  
// Create a sparse array for all values
const dimValues = new Array(maxId + 1).fill(null);  
// store the values at the right ID
db.dim.find({},{_id:1,value:1}).forEach(
 d => dimValues[d._id] = d.value
);  
print(dimValues)

//
x=db.fact.aggregate([  
    { $addFields: { dimValue: { $arrayElemAt: [dimValues, "$ref"] } } }  
]).explain("executionStats")
;
print(x["stages"][0]["$cursor"]
["executionStats"]["executionTimeMillis"]/1000+" seconds")


This is fast and runs in one second. However, it works only when the lookup identifiers are under your control, ideally starting from one and forming a gap-free sequence.

Embed rather than join (denormalization)

Finally, as recommended for a document model (Model One-to-Many Relationships with Embedded Documents), I duplicate the dimension value into each fact document. I run this update with an aggregation pipeline.


const startTime = new Date(); 
db.fact.aggregate([
    {
        $lookup: {
            from: "dim",
            localField: "ref",
            foreignField: "_id",
            as: "dimData"
        }
    },
    // keep only the scalar dimension value, embedded in each fact document
    { $set: { dimValue: { $first: "$dimData.value" } } },
    { $unset: "dimData" },
    {
        $out: "fact"
    }
])

const endTime = new Date(); 
const executionTime = (endTime - startTime) / 1000;  
print(`Update execution time: ${executionTime} seconds`);


This should be executed once, and then only the updated dimension values should be synchronized. This update took 16 seconds on my data.
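
As a sketch of that synchronization (the _id value 42 is just an example), updating one dimension value and propagating it to the embedded copies could look like this; an index on { ref: 1 } would make the propagation efficient:

// Hypothetical change of one dimension value
const newValue = Math.random();
db.dim.updateOne({ _id: 42 }, { $set: { value: newValue } });

// Propagate the new value to the embedded copies in the fact collection
db.fact.updateMany({ ref: 42 }, { $set: { dimValue: newValue } });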

To compare, I can simply read the document and project the embedded value:

x=db.fact.aggregate([  
    {  
        $project: {  
            _id: 1,  
            ref: 1,  
            value: 1,  
            dimValue: 1  // simply project the pre-computed (embedded) field  
        }  
    }  
]).explain("executionStats");  
print(x["executionStats"]["executionTimeMillis"]/1000+" seconds")


This query takes around 0.5 seconds on my data. Embedding is advisable unless you are dealing with frequently updated lookup tables. Additionally, in MongoDB, a single compound index can cover all the fields a query needs, since they live in the same document. Typically, when filtering on a dimension, lookup, or reference table, the filter is applied to a business field rather than the internal "_id", and that business field can be embedded and indexed directly in the fact collection.
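
As an illustration of that point, here is a minimal sketch, with an index and a filter that are my own assumptions for this dataset rather than part of the measurements above: a compound index on the embedded value and the other projected fields covers the query, so the filter is served without touching the documents.

// Index the embedded (denormalized) dimension value together with the projected fields
db.fact.createIndex({ dimValue: 1, ref: 1, value: 1 });

// A filter on the embedded value is then served by the index alone (covered query)
db.fact.find(
  { dimValue: { $gt: 0.99 } },
  { _id: 0, dimValue: 1, ref: 1, value: 1 }
).explain("executionStats");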

Conclusion

I have tested my example using various cardinalities for both the fact table and the dimension lookup table. Below is the raw data: the time for each approach ($lookup, $getField, $switch, $arrayElemAt), for the one-time embedding update with $out, and for the read of the embedded value from single documents (no join).

dim    fact      lookup    $getField  $switch   $arrayElemAt  update    single doc
10     1000      0.008s    0.002s     0.001s    0.001s        0.08s     0s
100    1000      0.008s    0.006s     0.005s    0.001s        0.078s    0s
1000   1000      0.011s    0.062s     0.033s    0.001s        0.082s    0s
10000  1000      0.013s    0.754s     0.067s    0.003s        0.08s     0s
10     10000     0.075s    0.021s     0.016s    0.012s        0.199s    0.005s
100    10000     0.078s    0.066s     0.055s    0.013s        0.191s    0.005s
1000   10000     0.105s    0.62s      0.292s    0.013s        0.229s    0.005s
10000  10000     0.104s    6.94s      0.305s    0.015s        0.237s    0.005s
10     100000    0.738s    0.215s     0.171s    0.129s        1.306s    0.052s
100    100000    0.781s    0.673s     0.571s    0.131s        1.359s    0.052s
1000   100000    1.044s    6.259s     2.71s     0.141s        1.756s    0.054s
10000  100000    1.068s    73.205s    2.702s    0.144s        1.769s    0.059s
10     1000000   7.583s    2.199s     1.761s    1.332s        12.524s   0.559s
100    1000000   7.992s    6.634s     5.741s    1.346s        13.03s    0.557s
1000   1000000   10.551s   62.385s    26.4s     1.398s        16.771s   0.557s
10000  1000000   10.794s   742.086s   26.039s   1.437s        17.008s   0.578s
10     10000000  76.225s   22.127s    17.795s   13.196s       124.922s  5.789s
100    10000000  80.828s   67.602s    57.981s   13.695s       131.738s  5.714s
1000   10000000  106.194s  622.382s   267.555s  14.054s       168.854s  5.778s
10000  10000000  107.211s  7351.675s  265.404s  14.046s       171.13s   5.767s

An array, when queried with $arrayElemAt, is optimized for retrieving a value by position, while the other expression-based structures ($getField, $switch) have a complexity of O(n) in the number of dimension entries. However, the array holds a fixed set of values captured when the query is built, which limits its flexibility compared to a table or collection. You may find more suitable structures in your application language; they resemble the hash tables that SQL databases use for hash joins. MongoDB can also use a hash join for $lookup when the lookup collection is small, spilling to disk is permitted, and there is no index on the foreign field.
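
As an illustration of such an application-side structure, here is a minimal sketch in mongosh (the dimCache name and the dimValue field are mine, for illustration): the small dimension collection is read once into a JavaScript Map, and fact documents are enriched on the client side without any join in the pipeline.

// Cache the small lookup collection in the application (a hash map)
const dimCache = new Map();
db.dim.find({}, { _id: 1, value: 1 }).forEach(d => dimCache.set(d._id, d.value));

// Enrich fact documents client-side, without any $lookup in the pipeline
db.fact.find().limit(5).forEach(f => {
  printjson({ ...f, dimValue: dimCache.get(f.ref) });
});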

When the lookup table is infrequently updated, applying the updates to the embedded values is generally preferable: you pay the price once at write time and get faster reads. MongoDB offers developers greater control over data access patterns and cardinalities, rather than relying solely on the query planner, which can lead to plan instability and runaway queries. In contrast, SQL databases must do all optimizations in the query planner to follow Codd's rules on data independence for relational databases.

A key distinction between MongoDB and SQL databases, including those that expose a MongoDB API on top of an RDBMS, is their physical data model capabilities. RDBMS prioritize normalization and rely on efficient join algorithms for relational data models. In contrast, MongoDB provides flexible schemas for application objects and supports joins where the join key can be an array ($lookup: more than just a SQL join) as part of an aggregation pipeline. While this may be less efficient for simple many-to-one relationships with scalar values, MongoDB's document data model can often eliminate the need for joins altogether. Additionally, caching lookup values in the application is a viable option.
