When transitioning from a SQL background to MongoDB, the $lookup operation in an aggregation pipeline resembles a LEFT OUTER JOIN. For instance, the following MongoDB query:
$lookup: {
from: "xxb",
localField: "xxa1",
foreignField: "xxb1",
as: "xxb"
}
is often compared to the following SQL statement
SELECT *
FROM xxa
LEFT JOIN xxb
ON xxa.xxa1 = xxb.xxb1;
If you assume that these two operations are identical across various databases, you may misinterpret the expected outcomes and their performance:
- In MongoDB, any field can be a scalar or an array. The
$lookup
function operates as an intersection of two sets, unlike SQL's equality operator (=
) which operates on two values. It is more similar to PostgreSQL's&&
operator for arrays. - The results in MongoDB do not duplicate outer rows for each inner row as in SQL. Instead, an array of matching inner documents is projected to the outer document. If an
$unwind
stage follows, the$lookup
will then unnest this array into multiple documents.
To illustrate the semantics, consider this simple example: I will begin with scalars from two collections ("xxa" and "xxb") and then used arrays in their fields instead of scalars.
Scalar on the outer and inner collections
I define two small collections with some values in common and some which do not match:
db.xxa.drop();
db.xxa.insert([
{ xxa1: 101 },
{ xxa1: 102 },
]);
db.xxb.drop();
db.xxb.insert([
{ xxb1: 102 },
{ xxb1: 103 },
]);
The following query "joins" the two with a $lookup:
db.xxa.aggregate([
{
$lookup: {
from: "xxb",
localField: "xxa1",
foreignField: "xxb1",
as: "xxb"
}
}
]);
[
{ _id: ObjectId('68504d1df99599b54cd4b118'), xxa1: 101, xxb: [] },
{
_id: ObjectId('68504d1df99599b54cd4b119'),
xxa1: 102,
xxb: [ { _id: ObjectId('68504d26f99599b54cd4b11a'), xxb1: 102 } ]
}
]
For each document from the outer collection, an array is created. If there is no matching document in the inner collection, the array remains empty. If there is one matching document, it will be included in the array. This looks like a LEFT OUTER JOIN in SQL that returns a structured one-to-many rather than a tabular result with duplicated values. However, this is a special case where we are joining on scalar fields.
Array on the outer, scalar in the inner
I replace the outer collection with some arrays:
db.xxa.drop();
db.xxa.insert([
{ xxa1: [100,101] },
{ xxa1: [101,102] },
{ xxa1: [102,103] },
]);
I run the same query:
db.xxa.aggregate([
{
$lookup: {
from: "xxb",
localField: "xxa1",
foreignField: "xxb1",
as: "xxb"
}
}
]);
[
{
_id: ObjectId('68504d43f99599b54cd4b11c'),
xxa1: [ 100, 101 ],
xxb: []
},
{
_id: ObjectId('68504d43f99599b54cd4b11d'),
xxa1: [ 101, 102 ],
xxb: [ { _id: ObjectId('68504d26f99599b54cd4b11a'), xxb1: 102 } ]
},
{
_id: ObjectId('68504d43f99599b54cd4b11e'),
xxa1: [ 102, 103 ],
xxb: [
{ _id: ObjectId('68504d26f99599b54cd4b11a'), xxb1: 102 },
{ _id: ObjectId('68504d26f99599b54cd4b11b'), xxb1: 103 }
]
}
]
The semantics are similar, except that a document from the inner collection matches as soon as its value exists in the outer array. When it is not contained, the added array is empty. When multiple values from the outer collection find a matching document from the inner collection, those multiple documents are added to the array in the result.
The array in the outer collection acts as a one-to-many reference. In SQL, one-to-many relationships cannot exist and are transformed into many-to-one on the opposite side, and an index is created on the foreign key to allow navigating in the other way.
Scalar on the outer, array in the inner
I do the opposite:
db.xxa.drop();
db.xxa.insert([
{ xxa1: 101 },
{ xxa1: 102 },
]);
db.xxb.drop();
db.xxb.insert([
{ xxb1: [100,101] },
{ xxb1: [101,102] },
{ xxb1: [102,103] },
]);
I run the same query:
db.xxa.aggregate([
{
$lookup: {
from: "xxb",
localField: "xxa1",
foreignField: "xxb1",
as: "xxb"
}
}
]);
[
{
_id: ObjectId('68504d5ef99599b54cd4b11f'),
xxa1: 101,
xxb: [
{ _id: ObjectId('68504d65f99599b54cd4b121'), xxb1: [ 100, 101 ] },
{ _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] }
]
},
{
_id: ObjectId('68504d5ef99599b54cd4b120'),
xxa1: 102,
xxb: [
{ _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] },
{ _id: ObjectId('68504d65f99599b54cd4b123'), xxb1: [ 102, 103 ] }
]
}
]
A document from the inner collection matches as soon as one of its values equals a value from the inner collection. If multiple inner documents match, they will all appear in the resulting document array.
Arrays on both side
The general case can handle arrays on both sides:
db.xxa.drop();
db.xxa.insert([
{ xxa1: [100,101] },
{ xxa1: [101,102] },
{ xxa1: [102,103] },
]);
db.xxa.aggregate([
{
$lookup: {
from: "xxb",
localField: "xxa1",
foreignField: "xxb1",
as: "xxb"
}
}
]);
[
{
_id: ObjectId('68504d7ef99599b54cd4b124'),
xxa1: [ 100, 101 ],
xxb: [
{ _id: ObjectId('68504d65f99599b54cd4b121'), xxb1: [ 100, 101 ] },
{ _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] }
]
},
{
_id: ObjectId('68504d7ef99599b54cd4b125'),
xxa1: [ 101, 102 ],
xxb: [
{ _id: ObjectId('68504d65f99599b54cd4b121'), xxb1: [ 100, 101 ] },
{ _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] },
{ _id: ObjectId('68504d65f99599b54cd4b123'), xxb1: [ 102, 103 ] }
]
},
{
_id: ObjectId('68504d7ef99599b54cd4b126'),
xxa1: [ 102, 103 ],
xxb: [
{ _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] },
{ _id: ObjectId('68504d65f99599b54cd4b123'), xxb1: [ 102, 103 ] }
]
}
]
In this context, 'matching' refers to the intersection between the set of keys from the outer document and the set of keys found in an inner document.
Representing the same in a SQL database is much more complex than two tables, as it needs additional tables instead of arrays:
-- Main entities
CREATE TABLE xxa (
id SERIAL PRIMARY KEY
-- other scalar fields if any
);
CREATE TABLE xxb (
id SERIAL PRIMARY KEY
-- other scalar fields if any
);
-- Junction tables to represent the arrays
CREATE TABLE xxa_values (
xxa_id INT REFERENCES xxa(id),
value INT,
PRIMARY KEY (xxa_id, value)
);
CREATE TABLE xxb_values (
xxb_id INT REFERENCES xxb(id),
value INT,
PRIMARY KEY (xxb_id, value)
);
Joining all tables and deduplicating the results serves as the equivalent of a MongoDB lookup:
SELECT
xxa.id as xxa_id,
ARRAY_AGG(DISTINCT xxa_vals.value) as xxa1,
ARRAY_AGG(
DISTINCT jsonb_build_object(
'id', xxb.id,
'xxb1', ARRAY(
SELECT value
FROM xxb_values
WHERE xxb_id = xxb.id
ORDER BY value
)
)
) FILTER (WHERE xxb.id IS NOT NULL) as xxb
FROM xxa
LEFT JOIN xxa_values xxa_vals ON xxa.id = xxa_vals.xxa_id
LEFT JOIN xxb_values xxb_vals ON xxa_vals.value = xxb_vals.value
LEFT JOIN xxb ON xxb_vals.xxb_id = xxb.id
GROUP BY xxa.id
ORDER BY xxa.id;
The next time you hear that a lookup in MongoDB is the same as a SQL join and don't understand the result or the performance, you should remember that it is a different, higher-level operation, on a flexible schema.
A simple left outer join in SQL often disappears in MongoDB, as the interconnected objects belong to a single document, like an aggregate in Domain Driver Design.
Conclusion
I used abstract names for collections and fields, in order to explain the behavior, but many modern applications benefit from a small array and an additional association table may complicate things unnecessarily. In contemporary apps, users are identified by one or more email addresses, resources by tags, blogs by categories, and videos by interests, and many queries look for common interests, categories, or tags.
If you've published on Dev.to, read this: free tokens now live for Dev.to contributors in recognition of your efforts on Dev.to! Claim your rewards here (wallet connection required). – Dev.to Airdrop Desk