$lookup: more than just a SQL join

When transitioning from a SQL background to MongoDB, the $lookup operation in an aggregation pipeline resembles a LEFT OUTER JOIN. For instance, the following MongoDB query:

        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }

is often compared to the following SQL statement

        SELECT *  
        FROM xxa  
        LEFT JOIN xxb
        ON xxa.xxa1 = xxb.xxb1;

If you assume that these two operations are identical across various databases, you may misinterpret the expected outcomes and their performance:

In MongoDB, any field can be a scalar or an array. The $lookup function operates as an intersection of two sets, unlike SQL's equality operator (=) which operates on two values. It is more similar to PostgreSQL's && operator for arrays.
The results in MongoDB do not duplicate outer rows for each inner row as in SQL. Instead, an array of matching inner documents is projected to the outer document. If an $unwind stage follows, the $lookup will then unnest this array into multiple documents.

To illustrate the semantics, consider this simple example: I will begin with scalars from two collections ("xxa" and "xxb") and then used arrays in their fields instead of scalars.

Scalar on the outer and inner collections

I define two small collections with some values in common and some which do not match:

db.xxa.drop();
db.xxa.insert([
  { xxa1: 101 },
  { xxa1: 102 },
]);

db.xxb.drop();
db.xxb.insert([
  { xxb1: 102 },
  { xxb1: 103 },
]);

The following query "joins" the two with a $lookup:

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }
    }
]);

[
  { _id: ObjectId('68504d1df99599b54cd4b118'), xxa1: 101, xxb: [] },
  {
    _id: ObjectId('68504d1df99599b54cd4b119'),
    xxa1: 102,
    xxb: [ { _id: ObjectId('68504d26f99599b54cd4b11a'), xxb1: 102 } ]
  }
]

For each document from the outer collection, an array is created. If there is no matching document in the inner collection, the array remains empty. If there is one matching document, it will be included in the array. This looks like a LEFT OUTER JOIN in SQL that returns a structured one-to-many rather than a tabular result with duplicated values. However, this is a special case where we are joining on scalar fields.

Array on the outer, scalar in the inner

I replace the outer collection with some arrays:

db.xxa.drop();
db.xxa.insert([
  { xxa1: [100,101] },
  { xxa1: [101,102] },
  { xxa1: [102,103] },
]);

I run the same query:

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }
    }
]);

[
  {
    _id: ObjectId('68504d43f99599b54cd4b11c'),
    xxa1: [ 100, 101 ],
    xxb: []
  },
  {
    _id: ObjectId('68504d43f99599b54cd4b11d'),
    xxa1: [ 101, 102 ],
    xxb: [ { _id: ObjectId('68504d26f99599b54cd4b11a'), xxb1: 102 } ]
  },
  {
    _id: ObjectId('68504d43f99599b54cd4b11e'),
    xxa1: [ 102, 103 ],
    xxb: [
      { _id: ObjectId('68504d26f99599b54cd4b11a'), xxb1: 102 },
      { _id: ObjectId('68504d26f99599b54cd4b11b'), xxb1: 103 }
    ]
  }
]

The semantics are similar, except that a document from the inner collection matches as soon as its value exists in the outer array. When it is not contained, the added array is empty. When multiple values from the outer collection find a matching document from the inner collection, those multiple documents are added to the array in the result.

The array in the outer collection acts as a one-to-many reference. In SQL, one-to-many relationships cannot exist and are transformed into many-to-one on the opposite side, and an index is created on the foreign key to allow navigating in the other way.

Scalar on the outer, array in the inner

I do the opposite:

db.xxa.drop();
db.xxa.insert([
  { xxa1: 101 },
  { xxa1: 102 },
]);

db.xxb.drop();
db.xxb.insert([
  { xxb1: [100,101] },
  { xxb1: [101,102] },
  { xxb1: [102,103] },
]);

I run the same query:

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }
    }
]);

[
  {
    _id: ObjectId('68504d5ef99599b54cd4b11f'),
    xxa1: 101,
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b121'), xxb1: [ 100, 101 ] },
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] }
    ]
  },
  {
    _id: ObjectId('68504d5ef99599b54cd4b120'),
    xxa1: 102,
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] },
      { _id: ObjectId('68504d65f99599b54cd4b123'), xxb1: [ 102, 103 ] }
    ]
  }
]

A document from the inner collection matches as soon as one of its values equals a value from the inner collection. If multiple inner documents match, they will all appear in the resulting document array.

Arrays on both side

The general case can handle arrays on both sides:

db.xxa.drop();
db.xxa.insert([
  { xxa1: [100,101] },
  { xxa1: [101,102] },
  { xxa1: [102,103] },
]);

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }
    }
]);

[
  {
    _id: ObjectId('68504d7ef99599b54cd4b124'),
    xxa1: [ 100, 101 ],
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b121'), xxb1: [ 100, 101 ] },
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] }
    ]
  },
  {
    _id: ObjectId('68504d7ef99599b54cd4b125'),
    xxa1: [ 101, 102 ],
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b121'), xxb1: [ 100, 101 ] },
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] },
      { _id: ObjectId('68504d65f99599b54cd4b123'), xxb1: [ 102, 103 ] }
    ]
  },
  {
    _id: ObjectId('68504d7ef99599b54cd4b126'),
    xxa1: [ 102, 103 ],
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] },
      { _id: ObjectId('68504d65f99599b54cd4b123'), xxb1: [ 102, 103 ] }
    ]
  }
]

In this context, 'matching' refers to the intersection between the set of keys from the outer document and the set of keys found in an inner document.

Representing the same in a SQL database is much more complex than two tables, as it needs additional tables instead of arrays:

-- Main entities  
CREATE TABLE xxa (  
    id SERIAL PRIMARY KEY  
    -- other scalar fields if any  
);  

CREATE TABLE xxb (  
    id SERIAL PRIMARY KEY  
    -- other scalar fields if any  
);  

-- Junction tables to represent the arrays  
CREATE TABLE xxa_values (  
    xxa_id INT REFERENCES xxa(id),  
    value INT,  
    PRIMARY KEY (xxa_id, value)  
);  

CREATE TABLE xxb_values (  
    xxb_id INT REFERENCES xxb(id),  
    value INT,  
    PRIMARY KEY (xxb_id, value)  
);

Joining all tables and deduplicating the results serves as the equivalent of a MongoDB lookup:

SELECT   
    xxa.id as xxa_id,  
    ARRAY_AGG(DISTINCT xxa_vals.value) as xxa1,  
    ARRAY_AGG(  
        DISTINCT jsonb_build_object(  
            'id', xxb.id,  
            'xxb1', ARRAY(  
                SELECT value   
                FROM xxb_values   
                WHERE xxb_id = xxb.id   
                ORDER BY value  
            )  
        )  
    ) FILTER (WHERE xxb.id IS NOT NULL) as xxb  
FROM xxa  
LEFT JOIN xxa_values xxa_vals ON xxa.id = xxa_vals.xxa_id  
LEFT JOIN xxb_values xxb_vals ON xxa_vals.value = xxb_vals.value  
LEFT JOIN xxb ON xxb_vals.xxb_id = xxb.id  
GROUP BY xxa.id  
ORDER BY xxa.id;

The next time you hear that a lookup in MongoDB is the same as a SQL join and don't understand the result or the performance, you should remember that it is a different, higher-level operation, on a flexible schema.
A simple left outer join in SQL often disappears in MongoDB, as the interconnected objects belong to a single document, like an aggregate in Domain Driver Design.

Conclusion

I used abstract names for collections and fields, in order to explain the behavior, but many modern applications benefit from a small array and an additional association table may complicate things unnecessarily. In contemporary apps, users are identified by one or more email addresses, resources by tags, blogs by categories, and videos by interests, and many queries look for common interests, categories, or tags.

Franck Pachot @franckpachot