One million $lookup challenge
Franck Pachot


Publish Date: Jun 19

If you have read my previous post, $lookup: more than just a SQL join, you understand that $lookup is not designed to join scalar values across thousands of documents. $lookup is efficient at the end of an aggregation pipeline, once the data has been reduced, not before the aggregation on a million-document collection (examples in Comparison of JOINS 👉🏻 aggregation pipeline and CTEs). However, such a collection should not require a join in the first place: documents are designed to aggregate multiple related objects, unlike relational databases, which normalize business data into multiple tables.

In a many-to-one relationship, it is common in a document model to embed fields, even when they are duplicated. Normalization plays a crucial role in relational databases to prevent these duplicates, as RDBMS were designed for interactive users executing SQL statements, where a missed update can lead to data integrity issues. While triggers can help manage updates to duplicated values and prevent anomalies, they introduce new challenges because they operate behind the update statement.

When updates originate from well-reviewed and tested programs, it is manageable to modify data in multiple locations, particularly when such updates are infrequent. Let's illustrate joins and the absence of joins with a simple test.

To join multiple documents with a small lookup table, you can cache the lookup table in your application. In this post, I tested several methods for retrieving a value from a lookup table: using a collection, a map, and an array. I integrated these methods into an aggregation pipeline, but keep in mind that this can also be accomplished within the application itself.

I created a dimension table with one thousand documents, and a fact table with one million. The fact table has a "ref" field that references the "_id" of the dimension table:

db.dim.drop();
db.fact.drop();

db.dim.insertMany(
    Array.from({ length: 1000 }, (_, i) => ({
        _id: i + 1,
        value: Math.random()
    }))
);

db.fact.insertMany(
    Array.from({ length: 1000000 }, () => ({
        ref: Math.ceil(Math.random() * 1000),
        value: Math.random()
    }))
);

Lookup (IndexedLoopJoin): 10 seconds

Here is an aggregation pipeline with a lookup.

x=db.fact.aggregate([
    {
        $lookup: {
            from: "dim",
            localField: "ref",
            foreignField: "_id",
            as: "dimData" ,
        }
    },
]).explain("executionStats")
;
print(x["executionStats"]["executionTimeMillis"]/1000+" seconds")


On this data, it runs in ten seconds. The query planner chooses an Index Nested Loop Join because there is an index on the foreign field (the "_id" of "dim"). Without an index, it could use a hash join.
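
Whether the plan used an indexed loop join or a hash join can be checked in the explain output. Here is a quick sketch to grep the strategy from the plan; the exact field names and their location depend on the MongoDB version and execution engine, so take it as an illustration only:

// Look for the join strategy reported by the $lookup stage in the explain output
// (for example IndexedLoopJoin, HashJoin, or NestedLoopJoin with the slot-based engine)
const plan = db.fact.aggregate([
  { $lookup: { from: "dim", localField: "ref", foreignField: "_id", as: "dimData" } }
]).explain("executionStats");
print(JSON.stringify(plan, null, 2).match(/"strategy"\s*:\s*"[^"]+"/g));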

Map to object and $getField: 61 seconds

To avoid the lookup, I read the dimension table into an object with one field per value, the field name being the "_id" converted to a string, and retrieve the value with $getField:


const dimMap = {};  
db.dim.find().forEach(doc => {  
    dimMap[doc._id] = doc.value;  
});  
print( dimMap )


x=db.fact.aggregate([
    {  
        $addFields: {  
            dimValue: {  
                $getField: {  
                    field: { $toString: "$ref" },  
                    input: dimMap  
                }  
            }  
        }  
    }  
]).explain("executionStats")
;
print(x["stages"][0]["$cursor"]
["executionStats"]["executionTimeMillis"]/1000+" seconds")


On this data, it runs in about one minute. Accessing a field by name is not an optimal operation: it is O(n) in the number of fields, so it is a viable solution only for very small lookup tables.

Map to $switch branches: 23 seconds

Instead of using that map, I build a $switch expression to use in the aggregation pipeline.


const dimMap = {};  
db.dim.find().forEach(doc => {  
    dimMap[doc._id] = doc.value;  
});  
print( dimMap )

const switchBranches = Object.entries(dimMap).map(([id, value]) => ({  
    case: { $eq: ["$ref", parseInt(id)] },  
    then: value  
}));  
print( switchBranches )

x=db.fact.aggregate([  
    {  
        $addFields: {  
            dimValue: {  
                $switch: {  
                    branches: switchBranches,  
                    default: null  
                }  
            }  
        }  
    }  
]).explain("executionStats")
;
print(x["stages"][0]["$cursor"]
["executionStats"]["executionTimeMillis"]/1000+" seconds")


On this data, it runs in about twenty seconds, and since it inlines all the lookup logic into the query text, it is acceptable only for small lookup tables.

Map to array and $arrayElemAt: 1 second

Instead of a map, I use an array where the position is the dimension "_id". As I have no guarantee that the identifiers are sequential with no gaps, I build a sparse array that I fill with the existing values.


// Get the maximum ID
const maxId = db.dim.aggregate([
 {$group:{_id:null,max:{$max:"$_id"}}}
]).toArray()[0].max;  
// Create a sparse array for all values
const dimValues = new Array(maxId + 1).fill(null);  
// store the values at the right ID
db.dim.find({},{_id:1,value:1}).forEach(
 d => dimValues[d._id] = d.value
);  
print(dimValues)

//
x=db.fact.aggregate([  
    { $addFields: { dimValue: { $arrayElemAt: [dimValues, "$ref"] } } }  
]).explain("executionStats")
;
print(x["stages"][0]["$cursor"]
["executionStats"]["executionTimeMillis"]/1000+" seconds")


This is fast and runs in one second. However, it works only when the lookup identifiers are under your control, ideally starting from one and forming a gap-free sequence.

Embed rather than join (denormalization)

Finally, as recommended for a document model (Model One-to-Many Relationships with Embedded Documents), I duplicate the dimension value into each fact document. I run this update with an aggregation pipeline.


const startTime = new Date(); 
db.fact.aggregate([
    {
        $lookup: {
            from: "dim",
            localField: "ref",
            foreignField: "_id",
            as: "dimData"
        }
    },
    // keep only the scalar dimension value, embedded in each fact document
    { $set: { dimValue: { $first: "$dimData.value" } } },
    { $unset: "dimData" },
    {
        $out: "fact"
    }
])

const endTime = new Date(); 
const executionTime = (endTime - startTime) / 1000;  
print(`Update execution time: ${executionTime} seconds`);


This should be executed once, and then only the updated dimension values should be synchronized. This update took 16 seconds on my data.
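
As a sketch of that synchronization (the _id value 42 is just an example), updating one dimension value and propagating it to the embedded copies could look like this; an index on { ref: 1 } would make the propagation efficient:

// Hypothetical change of one dimension value
const newValue = Math.random();
db.dim.updateOne({ _id: 42 }, { $set: { value: newValue } });

// Propagate the new value to the embedded copies in the fact collection
db.fact.updateMany({ ref: 42 }, { $set: { dimValue: newValue } });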

To compare, I can simply read the document and project the embedded value:

x=db.fact.aggregate([  
    {  
        $project: {  
            _id: 1,  
            ref: 1,  
            value: 1,  
            dimValue: 1  // simply project the pre-computed (embedded) field  
        }  
    }  
]).explain("executionStats");  
print(x["executionStats"]["executionTimeMillis"]/1000+" seconds")


This query takes around 0.5 seconds on my data. Embedding is advisable unless you are dealing with frequently updated lookup tables. Additionally, in MongoDB, a single compound index can cover all the fields a query needs, since they live in the same document. Typically, when filtering on a dimension, lookup, or reference table, the filter is applied to a business field rather than the internal "_id", and that business field can be embedded and indexed directly in the fact collection.
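
As an illustration of that point, here is a minimal sketch, with an index and a filter that are my own assumptions for this dataset rather than part of the measurements above: a compound index on the embedded value and the other projected fields covers the query, so the filter is served without touching the documents.

// Index the embedded (denormalized) dimension value together with the projected fields
db.fact.createIndex({ dimValue: 1, ref: 1, value: 1 });

// A filter on the embedded value is then served by the index alone (covered query)
db.fact.find(
  { dimValue: { $gt: 0.99 } },
  { _id: 0, dimValue: 1, ref: 1, value: 1 }
).explain("executionStats");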

Conclusion

I have tested my example using various cardinalities for both the fact table and the dimension lookup table. Below is the raw data: the time for each approach ($lookup, $getField, $switch, $arrayElemAt), for the one-time embedding update with $out, and for the read of the embedded value from single documents (no join).

dim    fact      lookup    $getField  $switch   $arrayElemAt  update    single doc
10     1000      0.008s    0.002s     0.001s    0.001s        0.08s     0s
100    1000      0.008s    0.006s     0.005s    0.001s        0.078s    0s
1000   1000      0.011s    0.062s     0.033s    0.001s        0.082s    0s
10000  1000      0.013s    0.754s     0.067s    0.003s        0.08s     0s
10     10000     0.075s    0.021s     0.016s    0.012s        0.199s    0.005s
100    10000     0.078s    0.066s     0.055s    0.013s        0.191s    0.005s
1000   10000     0.105s    0.62s      0.292s    0.013s        0.229s    0.005s
10000  10000     0.104s    6.94s      0.305s    0.015s        0.237s    0.005s
10     100000    0.738s    0.215s     0.171s    0.129s        1.306s    0.052s
100    100000    0.781s    0.673s     0.571s    0.131s        1.359s    0.052s
1000   100000    1.044s    6.259s     2.71s     0.141s        1.756s    0.054s
10000  100000    1.068s    73.205s    2.702s    0.144s        1.769s    0.059s
10     1000000   7.583s    2.199s     1.761s    1.332s        12.524s   0.559s
100    1000000   7.992s    6.634s     5.741s    1.346s        13.03s    0.557s
1000   1000000   10.551s   62.385s    26.4s     1.398s        16.771s   0.557s
10000  1000000   10.794s   742.086s   26.039s   1.437s        17.008s   0.578s
10     10000000  76.225s   22.127s    17.795s   13.196s       124.922s  5.789s
100    10000000  80.828s   67.602s    57.981s   13.695s       131.738s  5.714s
1000   10000000  106.194s  622.382s   267.555s  14.054s       168.854s  5.778s
10000  10000000  107.211s  7351.675s  265.404s  14.046s       171.13s   5.767s

An array, when queried with $arrayElemAt, is optimized for retrieving a value by position, while the other expression-based structures ($getField, $switch) have a complexity of O(n) in the number of dimension entries. However, the array holds a fixed set of values captured when the query is built, which limits its flexibility compared to a table or collection. You may find more suitable structures in your application language; they resemble the hash tables that SQL databases use for hash joins. MongoDB can also use a hash join for $lookup when the lookup collection is small, spilling to disk is permitted, and there is no index on the foreign field.
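
As an illustration of such an application-side structure, here is a minimal sketch in mongosh (the dimCache name and the dimValue field are mine, for illustration): the small dimension collection is read once into a JavaScript Map, and fact documents are enriched on the client side without any join in the pipeline.

// Cache the small lookup collection in the application (a hash map)
const dimCache = new Map();
db.dim.find({}, { _id: 1, value: 1 }).forEach(d => dimCache.set(d._id, d.value));

// Enrich fact documents client-side, without any $lookup in the pipeline
db.fact.find().limit(5).forEach(f => {
  printjson({ ...f, dimValue: dimCache.get(f.ref) });
});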

When the lookup table is infrequently updated, applying the updates to the embedded values is generally preferable: you pay the price once at write time and get faster reads. MongoDB offers developers greater control over data access patterns and cardinalities, rather than relying solely on the query planner, which can lead to plan instability and runaway queries. In contrast, SQL databases must do all optimizations in the query planner to follow Codd's rules on data independence for relational databases.

A key distinction between MongoDB and SQL databases, including those that expose a MongoDB API on top of an RDBMS, is their physical data model capabilities. RDBMS prioritize normalization and rely on efficient join algorithms for relational data models. In contrast, MongoDB provides flexible schemas for application objects and supports joins where the join key can be an array ($lookup: more than just a SQL join) as part of an aggregation pipeline. While this may be less efficient for simple many-to-one relationships with scalar values, MongoDB's document data model can often eliminate the need for joins altogether. Additionally, caching lookup values in the application is a viable option.
