If you have read my previous post, $lookup: more than just a SQL join, you understand that $lookup is not designed to join scalar values from thousands of documents. $lookup is efficient at the end of an aggregation pipeline, not when applied before aggregation to a million-document collection (examples in Comparison of JOINS 👉🏻 aggregation pipeline and CTEs). However, such a collection should not require a join, as documents are designed to aggregate multiple related objects, unlike relational databases, which normalize business data into multiple tables.
In a many-to-one relationship, it is common in a document model to embed fields, even when they are duplicated. Normalization plays a crucial role in relational databases to prevent these duplicates, because RDBMS were designed for interactive users executing SQL statements, where a missed update can lead to data integrity issues. Triggers can help maintain duplicated values and prevent such anomalies, but they introduce new challenges, as they operate hidden behind the update statement.
When updates originate from well-reviewed and tested programs, it is manageable to modify data in multiple locations, particularly when such updates are infrequent. Let's illustrate joins and the absence of joins with a simple test.
To join multiple documents with a small lookup table, you can cache the lookup table in your application. In this post, I tested several methods for retrieving a value from a lookup table: using a collection, a map, and an array. I integrated these methods into an aggregation pipeline, but keep in mind that this can also be accomplished within the application itself.
I created a dimension table with one thousand documents, and a fact table with one million. The fact table has a "ref" field that references the "_id" of the dimension documents:
db.dim.drop();
db.fact.drop();
db.dim.insertMany(
Array.from({ length: 1000 }, (_, i) => ({
_id: i + 1,
value: Math.random()
}))
);
db.fact.insertMany(
Array.from({ length: 1000000 }, () => ({
ref: Math.ceil(Math.random() * 1000),
value: Math.random()
}))
);
Lookup (IndexedLoopJoin): 10 seconds
Here is an aggregation pipeline with a lookup.
x=db.fact.aggregate([
{
$lookup: {
from: "dim",
localField: "ref",
foreignField: "_id",
as: "dimData",
}
},
]).explain("executionStats")
;
print(x["executionStats"]["executionTimeMillis"]/1000+" seconds")
On this data, it runs in ten seconds. The query planner chooses an Index Nested Loop Join because the foreign field, the "_id" of "dim", is indexed. Without an index, it could use a hash join.
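To check which strategy the planner picked, you can look at the explain output captured above. The exact structure varies by MongoDB version, so this small sketch simply searches the serialized plan for the reported strategy:
// version-dependent: with the slot-based engine, the $lookup stage reports
// a "strategy" such as "IndexedLoopJoin", "HashJoin" or "NestedLoopJoin"
print(JSON.stringify(x).match(/"strategy":"\w+"/g));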
Map to object and $getField: 61 seconds
To avoid the lookup, I read the dimension table into an object with one field per dimension document, the field name being the dimension "_id", and get the value with $getField:
const dimMap = {};
db.dim.find().forEach(doc => {
dimMap[doc._id] = doc.value;
});
print( dimMap )
x=db.fact.aggregate([
{
$addFields: {
dimValue: {
$getField: {
field: { $toString: "$ref" },
input: dimMap
}
}
}
}
]).explain("executionStats")
;
print(x["stages"][0]["$cursor"]
["executionStats"]["executionTimeMillis"]/1000+" seconds")
On this data, it runs in one minute. Accessing a field by name in this way is not an optimal operation: it is O(n) in the number of fields, so it is a viable solution only for a very small lookup table.
Map to $switch branches: 23 seconds
Instead of using that map, I build a $switch statement to use in the aggregation pipeline:
const dimMap = {};
db.dim.find().forEach(doc => {
dimMap[doc._id] = doc.value;
});
print( dimMap )
const switchBranches = Object.entries(dimMap).map(([id, value]) => ({
case: { $eq: ["$ref", parseInt(id)] },
then: value
}));
print( switchBranches )
x=db.fact.aggregate([
{
$addFields: {
dimValue: {
$switch: {
branches: switchBranches,
default: null
}
}
}
}
]).explain("executionStats")
;
print(x["stages"][0]["$cursor"]
["executionStats"]["executionTimeMillis"]/1000+" seconds")
On this data, it runs in about twenty seconds, and since it inlines all the branches into the query itself, it is acceptable only for small lookup tables.
Map to array and $arrayElemAt: 1 second
Instead of a map, I use an array indexed by the dimension "_id". As I have no guarantee that these ids are sequential with no gaps, I build a sparse array that I fill with the existing values:
// Get the maximum ID
const maxId = db.dim.aggregate([
{$group:{_id:null,max:{$max:"$_id"}}}
]).toArray()[0].max;
// Create a sparse array for all values
const dimValues = new Array(maxId + 1).fill(null);
// store the values at the right ID
db.dim.find({},{_id:1,value:1}).forEach(
d => dimValues[d._id] = d.value
);
print(dimValues)
//
x=db.fact.aggregate([
{ $addFields: { dimValue: { $arrayElemAt: [dimValues, "$ref"] } } }
]).explain("executionStats")
;
print(x["stages"][0]["$cursor"]
["executionStats"]["executionTimeMillis"]/1000+" seconds")
This is fast and runs in one second. However, it works only when the lookup identifiers are under your control, ideally starting from one in a gap-free sequence.
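As a quick sanity check of that assumption (a small sketch, not part of the measured test), you can verify that the dimension ids are dense positive integers before trusting $arrayElemAt, which simply yields no value for out-of-range indexes:
// verify that dimension ids start at 1 and have no gaps
const idStats = db.dim.aggregate([
  { $group: { _id: null, count: { $sum: 1 }, min: { $min: "$_id" }, max: { $max: "$_id" } } }
]).toArray()[0];
// with a gap-free sequence starting at 1: min === 1 and count === max
print(idStats);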
Embed rather than join (denormalization)
Finally, as recommended for a document model (Model One-to-Many Relationships with Embedded Documents), I duplicate the dimension value into each fact document. I run this update with an aggregation pipeline:
const startTime = new Date();
db.fact.aggregate([
  {
    $lookup: {
      from: "dim",
      localField: "ref",
      foreignField: "_id",
      as: "dimData"
    }
  },
  {
    // keep only the scalar dimension value in each fact document
    $addFields: { dimValue: { $first: "$dimData.value" } }
  },
  {
    $unset: "dimData"
  },
  {
    $out: "fact"
  }
])
const endTime = new Date();
const executionTime = (endTime - startTime) / 1000;
print(`Update execution time: ${executionTime} seconds`);
This should be executed once, and then only the updated dimension values should be synchronized. This update took 16 seconds on my data.
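For completeness, here is a minimal sketch of that synchronization, assuming a hypothetical change to the dimension document with _id 42; an index on the fact's "ref" field would make the fan-out update efficient:
// hypothetical: dimension 42 gets a new value
const newValue = Math.random();
db.dim.updateOne({ _id: 42 }, { $set: { value: newValue } });
// propagate the change to every fact document that embeds it
// (an index on { ref: 1 } avoids a collection scan here)
db.fact.updateMany({ ref: 42 }, { $set: { dimValue: newValue } });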
To compare, I can simply read the documents and project the embedded value:
x=db.fact.aggregate([
{
$project: {
_id: 1,
ref: 1,
dimValue: 1, // Simply project the pre-computed field
// Add any other fact fields you need
someFactField: 1
}
}
]).explain("executionStats");
print(x["executionStats"]["executionTimeMillis"]/1000+" seconds")
This query takes around 0.5 seconds on my data. This embedding approach is advisable unless you are dealing with frequently updated lookup tables. Additionally, in MongoDB, a single compound index can cover all the fields a query needs, since they live in the same document. Typically, when filtering on a dimension, lookup, or reference table, the filter is applied to a business field rather than the internal "_id".
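For example, here is a sketch with a hypothetical filter: a single compound index on the fact collection can serve a range filter on the embedded dimension value and cover the projection, with no join involved:
// hypothetical compound index on the embedded value and the reference
db.fact.createIndex({ dimValue: 1, ref: 1 });
// the filter and the projection are both served from the index (covered query)
db.fact.find(
  { dimValue: { $gt: 0.99 } },
  { _id: 0, ref: 1, dimValue: 1 }
).explain("executionStats");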
Conclusion
I have tested my example using various cardinalities for both the fact table and the dimension lookup table. Below is the raw data.
dim | fact | $lookup | $getField | $switch | $arrayElemAt | embed (update) | embedded read |
---|---|---|---|---|---|---|---|
10 | 1000 | 0.008s | 0.002s | 0.001s | 0.001s | 0.08s | 0s |
100 | 1000 | 0.008s | 0.006s | 0.005s | 0.001s | 0.078s | 0s |
1000 | 1000 | 0.011s | 0.062s | 0.033s | 0.001s | 0.082s | 0s |
10000 | 1000 | 0.013s | 0.754s | 0.067s | 0.003s | 0.08s | 0s |
10 | 10000 | 0.075s | 0.021s | 0.016s | 0.012s | 0.199s | 0.005s |
100 | 10000 | 0.078s | 0.066s | 0.055s | 0.013s | 0.191s | 0.005s |
1000 | 10000 | 0.105s | 0.62s | 0.292s | 0.013s | 0.229s | 0.005s |
10000 | 10000 | 0.104s | 6.94s | 0.305s | 0.015s | 0.237s | 0.005s |
10 | 100000 | 0.738s | 0.215s | 0.171s | 0.129s | 1.306s | 0.052s |
100 | 100000 | 0.781s | 0.673s | 0.571s | 0.131s | 1.359s | 0.052s |
1000 | 100000 | 1.044s | 6.259s | 2.71s | 0.141s | 1.756s | 0.054s |
10000 | 100000 | 1.068s | 73.205s | 2.702s | 0.144s | 1.769s | 0.059s |
10 | 1000000 | 7.583s | 2.199s | 1.761s | 1.332s | 12.524s | 0.559s |
100 | 1000000 | 7.992s | 6.634s | 5.741s | 1.346s | 13.03s | 0.557s |
1000 | 1000000 | 10.551s | 62.385s | 26.4s | 1.398s | 16.771s | 0.557s |
10000 | 1000000 | 10.794s | 742.086s | 26.039s | 1.437s | 17.008s | 0.578s |
10 | 10000000 | 76.225s | 22.127s | 17.795s | 13.196s | 124.922s | 5.789s |
100 | 10000000 | 80.828s | 67.602s | 57.981s | 13.695s | 131.738s | 5.714s |
1000 | 10000000 | 106.194s | 622.382s | 267.555s | 14.054s | 168.854s | 5.778s |
10000 | 10000000 | 107.211s | 7351.675s | 265.404s | 14.046s | 171.13s | 5.767s |
An array, when queried with $arrayElemAt, is optimized for quickly retrieving values by position, while the other data structures tested here have O(n) complexity. However, the array is built once with fixed values, which limits its flexibility compared to tables or collections. You may find more suitable structures, such as hash maps, in your application language; they resemble how SQL databases use hash tables for joins. MongoDB can use a hash join for $lookup when the lookup table is small, when spilling to disk is permissible, and when there is no index on the foreign field.
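For example, here is a minimal sketch of caching the dimension collection in the application with a JavaScript Map (field names as in the test above), which gives constant-time lookups and keeps the aggregation pipeline untouched:
// cache the small dimension collection in the application
const dimCache = new Map(
  db.dim.find().toArray().map(d => [d._id, d.value])
);
// enrich fact documents on the client side while iterating the cursor
db.fact.find().limit(5).forEach(f => {
  f.dimValue = dimCache.has(f.ref) ? dimCache.get(f.ref) : null;
  printjson(f);
});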
When the lookup table is infrequently updated, embedding and applying updates to the embedded values is generally preferable: you pay the price once at write time and get faster reads. MongoDB gives developers greater control over data access patterns and cardinalities, rather than relying solely on the query planner, which can lead to plan instability and runaway queries. In contrast, SQL databases must leave all such optimizations to the query planner, in order to follow Codd's rules on data independence for relational databases.
A key distinction between MongoDB and SQL databases, including those that expose a MongoDB API on top of an RDBMS, is their physical data model capabilities. RDBMS prioritize normalization and use efficient join algorithms for relational data models. In contrast, MongoDB provides flexible schemas for application objects and supports joins where the join key can be an array ($lookup: more than just a SQL join) as part of an aggregation pipeline. While this may be less efficient for simple many-to-one relationships with scalar values, MongoDB's document data model can often eliminate the need for joins altogether. Additionally, caching lookup values in the application is a viable option.