MongoDB Arrays: Sort Order and Comparison
Franck Pachot

Franck Pachot @franckpachot

About: 🥑 Developer Advocate at 🍃 MongoDB, 🔶 AWS Data Hero, 🐘 PostgreSQL fan,▝▞ YugabyteDB expert, 🅾️ Oracle Certified Master, and 💚 loving all databases 🛢️

Location:
Lausanne, Switzerland
Joined:
Nov 12, 2018

MongoDB Arrays: Sort Order and Comparison

Publish Date: Aug 18
2 0

Arrays in MongoDB are ordered lists of values, and querying them can be confusing because a field might be a scalar value in one document and an array in another. Unlike SQL databases, there's no need to choose between one-to-one, one-to-many, or many-to-many relationships once for all future data. Within the limits of schema validation, each document can have a unique schema, allowing fields to be any data type, scalar or array. When indexing or querying a collection, field names are used, and operations like filtering and sorting must work for all documents, regardless of their data types. This flexibility can lead to behavior that may seem strange to developers accustomed to the strict schemas of SQL databases.

This post covers the following:

Sort and comparison in SQL databases

I've heard users familiar with PostgreSQL being surprised to learn that MongoDB employs a different semantic approach when it comes to sorting and comparing data. For instance, when sorting fruits alphabetically in ascending order, 'Apple' appears before 'Banana' and you also expect 'Apple' > 'Banana' to be true. I was not so surprised because I know other databases and other languages. For example, I create a table of fruits in Oracle Autonomous Database in Zurich:

SQL> CREATE TABLE fruits_de (
  2      icon NVARCHAR2(4),
  3      name NVARCHAR2(50)
  4* );

Table FRUITS_DE erstellt.

SQL> INSERT INTO fruits_de (icon, name) VALUES
  2    (N'🍌', N'Banane'),
  3    (N'🍏', N'Äpfel'),
  4    (N'🍇', N'Traube'),
  5*   (N'🍊', N'Orange');

4 Zeilen eingefügt.

SQL> COMMIT;
Commit abgeschlossen.
Enter fullscreen mode Exit fullscreen mode

A sort shows the apple (Äpfel) first:

SQL> SELECT icon, name 
     FROM fruits_de 
     ORDER BY name
;

ICON    NAME
_______ _________
🍏      Äpfel
🍌      Banane
🍊      Orange
🍇      Traube
Enter fullscreen mode Exit fullscreen mode

If comparison and sort worked the same, I would expect a filter on name > 'Banane' to show only Orange and Traube, but Äpfel is still there:

SQL> SELECT icon, name FROM fruits_de WHERE name >'Banane' order by name ;

ICON    NAME
_______ _________
🍏      Äpfel
🍊      Orange
🍇      Traube
Enter fullscreen mode Exit fullscreen mode

By default, NLS_SORT determines ordering based on language (in German, 'Ä' is treated as 'Ae'), while NLS_COMP, defaulting to binary, dictates comparison (in UTF8, 'Ä' follows all ASCII characters).

In Oracle Database, this difference happens with text, as it depends on the language, and can be configured (the default is there for historical reason and better performance). With text, MongoDB is consistent by default between comparison and sorting:

db.fruits_de.insertMany([
      { icon: "🍌", name: "Banane" },  { icon: "🍏", name: "Äpfel" },  { icon: "🍇", name: "Traube" },  { icon: "🍊", name: "Orange" }
    ]);

db.fruits_de.find(
  { _id:0 }
).collation({ locale: "de", strength: 1 }).sort({ name: 1 })
      { }  ,
      { _id:0 }
    ).collation({ locale: "de", strength: 1 }).sort({ name: 1 })

[
  { icon: '🍏', name: 'Äpfel' },
  { icon: '🍌', name: 'Banane' },
  { icon: '🍊', name: 'Orange' },
  { icon: '🍇', name: 'Traube' }
]

db.fruits_de.find(
  { name: { $gt: "Banane" } }  ,
  { _id:0 }
).collation({ locale: "de", strength: 1 }).sort({ name: 1 })

[ { icon: '🍊', name: 'Orange' }, { icon: '🍇', name: 'Traube' } ]

Enter fullscreen mode Exit fullscreen mode

When the field that is sorted or compared is an array, there's not one obvious rule that can be used for sorting and comparing. Is [Cherry, Apple] greater or less than [Banana, Orange]? Which comes first in ascending order: the one that starts with Banana (which is before Cherry), or the one with Apple (which is before both Banana and Orange)?

MongoDB and arrays (comparison and sort)

I have a collection of fruits, identified by a text name ("txt" field) and an indexed array of characters ("arr" field):

db.fruits.createIndex({txt:1});

db.fruits.createIndex({arr:1});

db.fruits.insertMany([
  { _id: "🍏", txt: "apple",              arr: ["a","p","p","l","e"] },
  { _id: "🍐", txt: "pear",               arr: ["p","e","a","r"] },
  { _id: "🍊", txt: "tangerine",          arr: ["t","a","n","g","e","r","i","n","e"] },
  { _id: "🍋", txt: "lemon",              arr: ["l","e","m","o","n"] },
  { _id: "🍌", txt: "banana",             arr: ["b","a","n","a","n","a"] },
  { _id: "🍉", txt: "watermelon",         arr: ["w","a","t","e","r","m","e","l","o","n"] },
  { _id: "🍇", txt: "grapes",             arr: ["g","r","a","p","e","s"] },
  { _id: "🍓", txt: "strawberry",         arr: ["s","t","r","a","w","b","e","r","r","y"] },
  { _id: "🫐", txt: "blueberries",        arr: ["b","l","u","e","b","e","r","r","i","e","s"] },
  { _id: "🍈", txt: "melon",              arr: ["m","e","l","o","n"] },
  { _id: "🍒", txt: "cherries",           arr: ["c","h","e","r","r","i","e","s"] },
  { _id: "🍑", txt: "peach",              arr: ["p","e","a","c","h"] },
  { _id: "🥭", txt: "mango",              arr: ["m","a","n","g","o"] },
  { _id: "🍍", txt: "pineapple",          arr: ["p","i","n","e","a","p","p","l","e"] },
  { _id: "🥥", txt: "coconut",            arr: ["c","o","c","o","n","u","t"] },
  { _id: "🥝", txt: "kiwi fruit",         arr: ["k","i","w","i"," ","f","r","u","i","t"] }
]);
Enter fullscreen mode Exit fullscreen mode

Here are the documents ordered by their identifier:

mdb> db.fruits.find( ).
              sort( { "_id": 1 } ).
              forEach(doc => {  print(JSON.stringify(doc));  })

{"_id":"🍇","txt":"grapes","arr":["g","r","a","p","e","s"]}
{"_id":"🍈","txt":"melon","arr":["m","e","l","o","n"]}
{"_id":"🍉","txt":"watermelon","arr":["w","a","t","e","r","m","e","l","o","n"]}
{"_id":"🍊","txt":"tangerine","arr":["t","a","n","g","e","r","i","n","e"]}
{"_id":"🍋","txt":"lemon","arr":["l","e","m","o","n"]}
{"_id":"🍌","txt":"banana","arr":["b","a","n","a","n","a"]}
{"_id":"🍍","txt":"pineapple","arr":["p","i","n","e","a","p","p","l","e"]}
{"_id":"🍏","txt":"apple","arr":["a","p","p","l","e"]}
{"_id":"🍐","txt":"pear","arr":["p","e","a","r"]}
{"_id":"🍑","txt":"peach","arr":["p","e","a","c","h"]}
{"_id":"🍒","txt":"cherries","arr":["c","h","e","r","r","i","e","s"]}
{"_id":"🍓","txt":"strawberry","arr":["s","t","r","a","w","b","e","r","r","y"]}
{"_id":"🥝","txt":"kiwi fruit","arr":["k","i","w","i"," ","f","r","u","i","t"]}
{"_id":"🥥","txt":"coconut","arr":["c","o","c","o","n","u","t"]}
{"_id":"🥭","txt":"mango","arr":["m","a","n","g","o"]}
{"_id":"🫐","txt":"blueberries","arr":["b","l","u","e","b","e","r","r","i","e","s"]}
Enter fullscreen mode Exit fullscreen mode

The identifiers are in binary order of their Unicode point:
🍇 U+1F347 < 🍈 U+1F348 < 🍉 U+1F349 < 🍊 U+1F34A < 🍋 U+1F34B < 🍌 U+1F34C < 🍍 U+1F34D < 🍏 U+1F34F < 🍐 U+1F350 < 🍑 U+1F351 < 🍒 U+1F352 < 🍓 U+1F353 < 🥝 U+1F95D < 🥥 U+1F965 < 🥭 U+1F96D < 🫐 U+1FAD0

Here are the documents ordered by their name in text:

mdb> db.fruits.find( ).
              sort( { "txt": 1 } ).
              forEach(doc => {  print(JSON.stringify(doc));  })

{"_id":"🍏","txt":"apple","arr":["a","p","p","l","e"]}
{"_id":"🍌","txt":"banana","arr":["b","a","n","a","n","a"]}
{"_id":"🫐","txt":"blueberries","arr":["b","l","u","e","b","e","r","r","i","e","s"]}
{"_id":"🍒","txt":"cherries","arr":["c","h","e","r","r","i","e","s"]}
{"_id":"🥥","txt":"coconut","arr":["c","o","c","o","n","u","t"]}
{"_id":"🍇","txt":"grapes","arr":["g","r","a","p","e","s"]}
{"_id":"🥝","txt":"kiwi fruit","arr":["k","i","w","i"," ","f","r","u","i","t"]}
{"_id":"🍋","txt":"lemon","arr":["l","e","m","o","n"]}
{"_id":"🥭","txt":"mango","arr":["m","a","n","g","o"]}
{"_id":"🍈","txt":"melon","arr":["m","e","l","o","n"]}
{"_id":"🍑","txt":"peach","arr":["p","e","a","c","h"]}
{"_id":"🍐","txt":"pear","arr":["p","e","a","r"]}
{"_id":"🍍","txt":"pineapple","arr":["p","i","n","e","a","p","p","l","e"]}
{"_id":"🍓","txt":"strawberry","arr":["s","t","r","a","w","b","e","r","r","y"]}
{"_id":"🍊","txt":"tangerine","arr":["t","a","n","g","e","r","i","n","e"]}
{"_id":"🍉","txt":"watermelon","arr":["w","a","t","e","r","m","e","l","o","n"]}
Enter fullscreen mode Exit fullscreen mode

This is the alphabetical order, from 'banana' to 'watermelon'. It is the binary order, as I've not defined a collation.

Let's add a comparison. The last five fruits in alphabetical order have a name > 'pine':

mdb> db.fruits.find( { "txt": { $gt: "pine" } } ).
               sort( { "txt": 1 } ).
               forEach(doc => {  print(JSON.stringify(doc));  })

{"_id":"🍍","txt":"pineapple","arr":["p","i","n","e","a","p","p","l","e"]}
{"_id":"🍓","txt":"strawberry","arr":["s","t","r","a","w","b","e","r","r","y"]}
{"_id":"🍊","txt":"tangerine","arr":["t","a","n","g","e","r","i","n","e"]}
{"_id":"🍉","txt":"watermelon","arr":["w","a","t","e","r","m","e","l","o","n"]}
Enter fullscreen mode Exit fullscreen mode

In text comparison, items are sorted semantically from left to right. For example, 'banana' is greater than 'apple' because its first letter is higher, and 'blueberries' is greater than 'banana' as the first letter is the same, but the second is higher. This applies to sorting and comparison.

Arrays function differently from other data types because they hold multiple values, and those values can be any type, including nested objects and arrays. To compare them, MongoDB examines the ordered values from left to right, much like the characters in a text. This fits with the array of characters, but consider your annual grades at university: Is [A, F, C, F] better than [B, B+, A+, A] simply because the first grade is higher? Or the opposite, because the maximum grade is higher? MongoDB had to establish a consistent rule that applies uniformly across all data types.

Compare the fruits with name > 'pine' to the same operator on the array, arr > [p,i,n,e]:

mdb> db.fruits.find( { "arr": { $gt: ["p","i","n","e"] } } ).
               sort( { "txt": 1 } ).
               forEach(doc => {  print(JSON.stringify(doc));  })

{"_id":"🍍","txt":"pineapple","arr":["p","i","n","e","a","p","p","l","e"]}
{"_id":"🍓","txt":"strawberry","arr":["s","t","r","a","w","b","e","r","r","y"]}
{"_id":"🍊","txt":"tangerine","arr":["t","a","n","g","e","r","i","n","e"]}
{"_id":"🍉","txt":"watermelon","arr":["w","a","t","e","r","m","e","l","o","n"]}
Enter fullscreen mode Exit fullscreen mode

The behavior for comparison of arrays is from left to right, comparing items at the same position. Let's order by the array to see the difference:

mdb> db.fruits.find( { "arr": { $gt: ["p","i","n","e"] } } ).
               sort( { "arr": 1 } ).
               forEach(doc => {  print(JSON.stringify(doc));  })

{"_id":"🍍","txt":"pineapple","arr":["p","i","n","e","a","p","p","l","e"]}
{"_id":"🍊","txt":"tangerine","arr":["t","a","n","g","e","r","i","n","e"]}
{"_id":"🍉","txt":"watermelon","arr":["w","a","t","e","r","m","e","l","o","n"]}
{"_id":"🍓","txt":"strawberry","arr":["s","t","r","a","w","b","e","r","r","y"]}
Enter fullscreen mode Exit fullscreen mode

The result set is the same, but the order differs because sorting arrays works differently from comparing arrays in a query filter. Comparisons evaluate array elements from left to right until a difference is found, while sorting uses only a single representative value from the array.
For an ascending sort, MongoDB uses the array’s smallest element (by BSON comparison order), the minimum. For a descending sort, it uses the maximum. In this example, the sort { "arr": 1 } is ascending, and each document’s smallest element is "a", so they all rank the same. In a descending sort { "arr": -1 }, "🍓" and "🍉" rank first with "w", while "🍊" and "🍍" follow with "t" and "p" respectively.

mdb> db.fruits.find( { "arr": { $gt: ["p","i","n","e"] } } ).
               sort( { "arr": -1 } ).
               forEach(doc => {  print(JSON.stringify(doc));  })

{"_id":"🍓","txt":"strawberry","arr":["s","t","r","a","w","b","e","r","r","y"]}
{"_id":"🍉","txt":"watermelon","arr":["w","a","t","e","r","m","e","l","o","n"]}
{"_id":"🍊","txt":"tangerine","arr":["t","a","n","g","e","r","i","n","e"]}
{"_id":"🍍","txt":"pineapple","arr":["p","i","n","e","a","p","p","l","e"]}
Enter fullscreen mode Exit fullscreen mode

⚠️ Ascending and descending sorts of arrays differ beyond direction. One isn't the reverse of the other.

This behavior may look weird for this data, but if you want a specific sort behavior, you can store it as is, like in the "txt" field, or calculate it at query time, like this:

db.fruits.aggregate([  
  { $match:     { "arr": { $gt: ["p","i","n","e"] } }  },  
  { $addFields: {  
      mySort: { $reduce: {  
        input: "$arr",  
        initialValue: "",  
        in: { $concat: ["$$value", "$$this"] }  
      }}  
    } },  
  {  $sort:     { mySort: 1 } },  
  {  $project:  { _id: 1, txt: 1, mySort: 1 } }  
]);  

[
  { _id: '🍍', txt: 'pineapple', mySort: 'pineapple' },
  { _id: '🍓', txt: 'strawberry', mySort: 'strawberry' },
  { _id: '🍊', txt: 'tangerine', mySort: 'tangerine' },
  { _id: '🍉', txt: 'watermelon', mySort: 'watermelon' }
]
Enter fullscreen mode Exit fullscreen mode

Another example of MongoDB arrays (sort on min/max)

For another usage of arrays, this behavior may sound more logical. For example, I add some random ratings to the fruits:

for (let i = 0; i < 20; i++) {  
  const fruit = db.fruits.aggregate([{ $sample: { size: 1 } }]).next();  
  const rating = Math.round(Math.random() * 50) / 10;  
  db.fruits.updateOne(  
    { _id: fruit._id },  
    { $push: { ratings: rating } }  
  );  
  print(`Added rating ${rating} to ${fruit.txt}`);  
}  
Enter fullscreen mode Exit fullscreen mode

Here are the fruits ordered by ascending rating:

db.fruits.find({}, { txt: 1, ratings: 1 }).sort({ ratings: 1 })  

[
  { _id: '🍓', txt: 'strawberry', ratings: [] },
  { _id: '🍑', txt: 'peach', ratings: [] },
  { _id: '🍍', txt: 'pineapple', ratings: [] },
  { _id: '🥥', txt: 'coconut', ratings: [] },
  { _id: '🍋', txt: 'lemon', ratings: [ 0.7, 4.6, 0.9 ] },
  { _id: '🍈', txt: 'melon', ratings: [ 0.7 ] },
  { _id: '🍇', txt: 'grapes', ratings: [ 0.8, 4.9, 4.6 ] },
  { _id: '🥝', txt: 'kiwi fruit', ratings: [ 3.2, 1.3, 3.8 ] },
  { _id: '🫐', txt: 'blueberries', ratings: [ 2 ] },
  { _id: '🍊', txt: 'tangerine', ratings: [ 2.4, 2.7 ] },
  { _id: '🍐', txt: 'pear', ratings: [ 2.5 ] },
  { _id: '🍏', txt: 'apple', ratings: [ 4.3, 3 ] },
  { _id: '🍒', txt: 'cherries', ratings: [ 3 ] },
  { _id: '🍉', txt: 'watermelon', ratings: [ 3.7 ] },
  { _id: '🍌', txt: 'banana', ratings: [ 4.4 ] },
  { _id: '🥭', txt: 'mango', ratings: [ 4.6 ] }
]
Enter fullscreen mode Exit fullscreen mode

In ascending order, you find first the arrays which contain a small value, an empty array being smaller than everything: 0.7 for 🍋 and 🍈, 0.8 for 🍇, 1.3 for 🥝 ... and it ends with those having only high values.

Here are the fruits ordered by descending rating:

db.fruits.find({}, { txt: 1, ratings: 1 }).sort({ ratings: -1 }) 

[
  { _id: '🍇', txt: 'grapes', ratings: [ 0.8, 4.9, 4.6 ] },
  { _id: '🍋', txt: 'lemon', ratings: [ 0.7, 4.6, 0.9 ] },
  { _id: '🥭', txt: 'mango', ratings: [ 4.6 ] },
  { _id: '🍌', txt: 'banana', ratings: [ 4.4 ] },
  { _id: '🍏', txt: 'apple', ratings: [ 4.3, 3 ] },
  { _id: '🥝', txt: 'kiwi fruit', ratings: [ 3.2, 1.3, 3.8 ] },
  { _id: '🍉', txt: 'watermelon', ratings: [ 3.7 ] },
  { _id: '🍒', txt: 'cherries', ratings: [ 3 ] },
  { _id: '🍊', txt: 'tangerine', ratings: [ 2.4, 2.7 ] },
  { _id: '🍐', txt: 'pear', ratings: [ 2.5 ] },
  { _id: '🫐', txt: 'blueberries', ratings: [ 2 ] },
  { _id: '🍈', txt: 'melon', ratings: [ 0.7 ] },
  { _id: '🍓', txt: 'strawberry', ratings: [] },
  { _id: '🍑', txt: 'peach', ratings: [] },
  { _id: '🍍', txt: 'pineapple', ratings: [] },
  { _id: '🥥', txt: 'coconut', ratings: [] }
] 
Enter fullscreen mode Exit fullscreen mode

In descending order, 🍇 is at the top because it has the highest rating (4.9), despite also receiving lower ratings. In ascending order, we didn't see it at the bottom because it also has a low value (0.8)

All this is documented in Comparison/Sort Order, and I will explain the rationale later.

Comparison with PostgreSQL

If you are familiar with PostgreSQL, here is how to get similar comparison and sorting:

postgres> with fruits(id, txt, arr, ratings) as (VALUES
-- same data in a CTE
('🍈','melon', ARRAY['m','e','l','o','n'], ARRAY[0.7]),
('🍒','cherries', ARRAY['c','h','e','r','r','i','e','s'], ARRAY[3]),
('🍓','strawberry', ARRAY['s','t','r','a','w','b','e','r','r','y'], ARRAY[]::numeric[]),
('🍍','pineapple', ARRAY['p','i','n','e','a','p','p','l','e'], ARRAY[]::numeric[]),
('🍐','pear', ARRAY['p','e','a','r'], ARRAY[2.5]),
('🍌','banana', ARRAY['b','a','n','a','n','a'], ARRAY[4.4]),
('🫐','blueberries', ARRAY['b','l','u','e','b','e','r','r','i','e','s'], ARRAY[2]),
('🍋','lemon', ARRAY['l','e','m','o','n'], ARRAY[0.7,4.6,0.9]),
('🍇','grapes', ARRAY['g','r','a','p','e','s'], ARRAY[0.8,4.9,4.6]),
('🍉','watermelon', ARRAY['w','a','t','e','r','m','e','l','o','n'], ARRAY[3.7]),
('🍑','peach', ARRAY['p','e','a','c','h'], ARRAY[]::numeric[]),
('🥝','kiwi fruit', ARRAY['k','i','w','i',' ','f','r','u','i','t'], ARRAY[3.2,1.3,3.8]),
('🍏','apple', ARRAY['a','p','p','l','e'], ARRAY[4.3,3]),
('🥥','coconut', ARRAY['c','o','c','o','n','u','t'], ARRAY[]::numeric[]),
('🍊','tangerine', ARRAY['t','a','n','g','e','r','i','n','e'], ARRAY[2.4,2.7]),
('🥭','mango', ARRAY['m','a','n','g','o'], ARRAY[4.6])
) select
 -- sort on id
 row_number() over (order by id  asc)  "id🔼",
 -- sort on text
 row_number() over (order by txt asc)  "txt🔼",
 -- sort on array
 row_number() over (order by arr asc)  "arr🔼",
 row_number() over (order by arr desc) "arr🔽",
 -- comparisons on array and text
 case when txt > 'pine'                 then ' > ' end as "txt >" ,
 case when arr > ARRAY['p','i','n','e'] then ' > ' end as "arr >",
 -- ascending sort on minimum rating
 rank() over ( order by
  ( select unnest from unnest(ratings) order by unnest asc   limit 1)
  asc  nulls first ) as "rating🔼",
 -- descending sort on maximum rating
 rank() over ( order by
  ( select unnest from unnest(ratings) order by unnest desc  limit 1)
  desc nulls last) as "rating🔽",
 -- all columns
 *
from fruits order by txt asc;

 id🔼 | txt🔼 | arr🔼 | arr🔽 | txt > | arr > | rating🔼 | rating🔽 | id |     txt     |           arr           |    ratings
------+-------+-------+-------+-------+-------+----------+----------+----+-------------+-------------------------+---------------
    8 |     1 |     1 |    16 |       |       |       12 |        5 | 🍏 | apple       | {a,p,p,l,e}             | {4.3,3}
    6 |     2 |     2 |    15 |       |       |       15 |        4 | 🍌 | banana      | {b,a,n,a,n,a}           | {4.4}
   16 |     3 |     3 |    14 |       |       |        9 |       11 |  🫐 | blueberries | {b,l,u,e,b,e,r,r,i,e,s} | {2}
   11 |     4 |     4 |    13 |       |       |       12 |        8 | 🍒 | cherries    | {c,h,e,r,r,i,e,s}       | {3}
   14 |     5 |     5 |    12 |       |       |        1 |       13 | 🥥 | coconut     | {c,o,c,o,n,u,t}         | {}
    1 |     6 |     6 |    11 |       |       |        7 |        1 | 🍇 | grapes      | {g,r,a,p,e,s}           | {0.8,4.9,4.6}
   13 |     7 |     7 |    10 |       |       |        8 |        6 | 🥝 | kiwi fruit  | {k,i,w,i," ",f,r,u,i,t} | {3.2,1.3,3.8}
    5 |     8 |     8 |     9 |       |       |        5 |        2 | 🍋 | lemon       | {l,e,m,o,n}             | {0.7,4.6,0.9}
   15 |     9 |     9 |     8 |       |       |       16 |        2 | 🥭 | mango       | {m,a,n,g,o}             | {4.6}
    2 |    10 |    10 |     7 |       |       |        5 |       12 | 🍈 | melon       | {m,e,l,o,n}             | {0.7}
   10 |    11 |    11 |     6 |       |       |        1 |       13 | 🍑 | peach       | {p,e,a,c,h}             | {}
    9 |    12 |    12 |     5 |       |       |       11 |       10 | 🍐 | pear        | {p,e,a,r}               | {2.5}
    7 |    13 |    13 |     4 |  >    |  >    |        1 |       13 | 🍍 | pineapple   | {p,i,n,e,a,p,p,l,e}     | {}
   12 |    14 |    14 |     3 |  >    |  >    |        1 |       13 | 🍓 | strawberry  | {s,t,r,a,w,b,e,r,r,y}   | {}
    4 |    15 |    15 |     2 |  >    |  >    |       10 |        9 | 🍊 | tangerine   | {t,a,n,g,e,r,i,n,e}     | {2.4,2.7}
    3 |    16 |    16 |     1 |  >    |  >    |       14 |        7 | 🍉 | watermelon  | {w,a,t,e,r,m,e,l,o,n}   | {3.7}
(16 rows)
Enter fullscreen mode Exit fullscreen mode

To get the same behavior as MongoDB for ascending sort on the ratings array, I had to:

  • Unnest (equivalent to unwind), to get one row per item.
  • Within the unnested array, order by ascending limit 1, to get the minimum.
  • Across all initial rows, order by ascending in the minimum.
  • Put nulls first to get empty arrays at the top.

I did that with a window function per row, on the unnested array:

 -- ascending sort on minimum rating
 rank() over ( order by
  ( select unnest from unnest(ratings) order by unnest asc   limit 1)
  asc  nulls first ) as "rating🔼",
Enter fullscreen mode Exit fullscreen mode

To get the same behavior as MongoDB for descending sort on the ratings array, I had to:

  • Unnest (equivalent to unwind), to get one row per item.
  • Within the unnested array, order by descending limit 1, to get the maximum.
  • Across all initial rows, order by descending in the minimum.
  • Put nulls last to get empty arrays at the bottom.

I did that with a window function per row, on the unnested array:

 -- descending sort on maximum rating
 rank() over ( order by
  ( select unnest from unnest(ratings) order by unnest desc  limit 1)
  desc nulls last) as "rating🔽",
Enter fullscreen mode Exit fullscreen mode

This is not the most optimal query, but I made it readable. I could use one lateral join to unnest arrays and apply a window function with a partition to re-group them. This is complex in a relational database and explains why MongoDB emulations on SQL databases may be less performant for some queries due to RecordId deduplication after index scan.

Sort semantics is close to how indexes work

I used an ORDER BY LIMIT for the example on PostgreSQL. Of course, there are other possibilities, such as using min() and max(), but this approach is closer to how MongoDB works. The sort behavior is designed to be index-friendly. MongoDB multi-key indexes are built by unwinding the items to index entries (similar to UNNEST). Here is the order of the index entries on "arr" (I use the "_id" but index entries store the internal RecordId):

(space)🥝 a🍇 a🍉 a🍊 a🍌 a🍌 a🍌 a🍍 a🍏 a🍐 a🍑 a🍓 a🥭 b🍌 b🍓 b🫐 b🫐 c🍑 c🍒 c🥥 c🥥 e🍇 e🍈 e🍉 e🍉 e🍊 e🍊 e🍋 e🍍 e🍍 e🍏 e🍐 e🍑 e🍒 e🍒 e🍓 e🫐 e🫐 e🫐 f🥝 g🍇 g🍊 g🥭 h🍑 h🍒 🍍 i🍒 i🥝 i🥝 i🥝 i🫐 k🥝 l🍈 l🍉 l🍋 l🍍 l🍏 l🫐 m🍈 m🍉 m🍋 m🥭 n🍈 n🍉 n🍊 n🍊 n🍋 n🍌 n🍌 n🍍 n🥥 n🥭 o🍈 o🍉 o🍋 o🥥 o🥥 o🥭 p🍇 p🍍 p🍍 p🍍 p🍏 p🍏 p🍐 p🍑 r🍇 r🍉 r🍊 r🍐 r🍒 r🍒 r🍓 r🍓 r🍓 r🥝 r🫐 r🫐 s🍇 s🍒 s🍓 s🫐 t🍉 t🍊 t🍓 t🥝 t🥥 u🥝 u🥥 u🫐 w🍉 w🍓 w🥝 y🍓

The query reads them in ascending or descending order (similar to ORDER BY), and deduplication keeps only the first one (similar to LIMIT 1). For example, find( ).sort( { "arr": 1 } ) examines those index keys, drops the duplicates (❌) and returns one entry per document (✅):

[" ",🥝]✅ ["a",🍇]✅ ["a",🍉]✅ ["a",🍊]✅ ["a",🍌]✅ ["a",🍌]❌ ["a",🍌]❌ ["a",🍍]✅ ["a",🍏]✅ ["a",🍐]✅ ["a",🍑]✅ ["a",🍓]✅ ["a",🥭]✅ ["b",🍌]❌ ["b",🍓]❌ ["b",🫐]✅ ["b",🫐] ["c",🍑]❌ ["c",🍒]✅ ["c",🥥]✅ ["c",🥥]❌ ["e",🍇]❌ ["e",🍈]✅ ["e",🍉]❌ ["e",🍉]❌ ["e",🍊]❌ ["e",🍊]❌ ["e",🍋]✅ ["e",🍍]❌ ["e",🍍]❌ ["e",🍏]❌ ["e",🍐]❌ ["e",🍑]❌ ["e",🍒]❌ ["e",🍒]❌ ["e",🍓]❌ ["e",🫐]❌ ["e",🫐]❌ ["e",🫐]❌ ["f",🥝]❌ ["g",🍇]❌ ["g",🍊]❌ ["g",🥭]❌ ["h",🍑]❌ ["h",🍒]❌ ["i",🍊]❌ ["i",🍍]❌ ["i",🍒]❌ ["i",🥝]❌ ["i",🥝]❌ ["i",🥝]❌ ["i",🫐]❌ ["k",🥝]❌ ["l",🍈]❌ ["l",🍉]❌ ["l",🍋]❌ ["l",🍍]❌ ["l",🍏]❌ ["l",🫐]❌ ["m",🍈]❌ ["m",🍉]❌ ["m",🍋]❌ ["m",🥭]❌ ["n",🍈]❌ ["n",🍉]❌ ["n",🍊]❌ ["n",🍊]❌ ["n",🍋]❌ ["n",🍌]❌ ["n",🍌]❌ ["n",🍍]❌ ["n",🥥]❌ ["n",🥭]❌ ["o",🍈]❌ ["o",🍉]❌ ["o",🍋]❌ ["o",🥥]❌ ["o",🥥]❌ ["o",🥭]❌ ["p",🍇]❌ ["p",🍍]❌ ["p",🍍]❌ ["p",🍍]❌ ["p",🍏]❌ ["p",🍏]❌ ["p",🍐]❌ ["p",🍑]❌ ["r",🍇]❌ ["r",🍉]❌ ["r",🍊]❌ ["r",🍐]❌ ["r",🍒]❌ ["r",🍒]❌ ["r",🍓]❌ ["r",🍓]❌ ["r",🍓]❌ ["r",🥝]❌ ["r",🫐]❌ ["r",🫐]❌ ["s",🍇]❌ ["s",🍒]❌ ["s",🍓]❌ ["s",🫐]❌ ["t",🍉]❌ ["t",🍊]❌ ["t",🍓]❌ ["t",🥝]❌ ["t",🥥]❌ ["u",🥝]❌ ["u",🥥]❌ ["u",🫐]❌ ["w",🍉]❌ ["w",🍓]❌ ["w",🥝]❌ ["y",🍓]❌

This has examined 93 keys, dropped 77 index keys during deduplication, and returned 16 entries. This is visible in IXSCAN execution statistics as keysExamined, dupsDropped, and nReturned:


db.fruits.find( ).
          sort( { "arr": 1 } ).
          explain("executionStats").executionStats

{
  executionSuccess: true,
  nReturned: 16,
  executionTimeMillis: 0,
  totalKeysExamined: 93,
  totalDocsExamined: 16,
  executionStages: {
    isCached: false,
    stage: 'FETCH',
    nReturned: 16,
    executionTimeMillisEstimate: 1,
    works: 94,
    advanced: 16,
    needTime: 77,
    needYield: 0,
    saveState: 0,
    restoreState: 0,
    isEOF: 1,
    docsExamined: 16,
    alreadyHasObj: 0,
    inputStage: {
      stage: 'IXSCAN',
      nReturned: 16,
      executionTimeMillisEstimate: 1,
      works: 94,
      advanced: 16,
      needTime: 77,
      needYield: 0,
      saveState: 0,
      restoreState: 0,
      isEOF: 1,
      keyPattern: { arr: 1 },
      indexName: 'arr_1',
      isMultiKey: true,
      multiKeyPaths: { arr: [ 'arr' ] },
      isUnique: false,
      isSparse: false,
      isPartial: false,
      indexVersion: 2,
      direction: 'forward',
      indexBounds: { arr: [ '[MinKey, MaxKey]' ] },
      keysExamined: 93,
      seeks: 1,
      dupsTested: 93,
      dupsDropped: 77
    }
  }
}
Enter fullscreen mode Exit fullscreen mode

With a comparison, the index is still used to get fast access for the first item in the filter (indexBounds: { arr: [ '["p", MaxKey]' ] } in IXSCAN), the other items are part of an additional filter (filter: { arr: { '$gt': [ 'p', 'i', 'n', 'e' ] } } in FETCH), and the result is sorted according to the sort semantic (sortPattern: { arr: 1 } in SORT):


db.fruits.find( { "arr": { $gt: ["p","i","n","e"] } } ).
          sort( { "arr": 1 } ).
          explain("executionStats").executionStats

{
  executionSuccess: true,
  nReturned: 4,
  executionTimeMillis: 5,
  totalKeysExamined: 29,
  totalDocsExamined: 12,
  executionStages: {
    isCached: false,
    stage: 'SORT',
    nReturned: 4,
    executionTimeMillisEstimate: 6,
    works: 36,
    advanced: 4,
    needTime: 30,
    needYield: 0,
    saveState: 0,
    restoreState: 0,
    isEOF: 1,
    sortPattern: { arr: 1 },
    memLimit: 33554432,
    type: 'simple',
    totalDataSizeSorted: 753,
    usedDisk: false,
    spills: 0,
    spilledDataStorageSize: 0,
    inputStage: {
      stage: 'FETCH',
      filter: { arr: { '$gt': [ 'p', 'i', 'n', 'e' ] } },
      nReturned: 4,
      executionTimeMillisEstimate: 6,
      works: 30,
      advanced: 4,
      needTime: 25,
      needYield: 0,
      saveState: 0,
      restoreState: 0,
      isEOF: 1,
      docsExamined: 12,
      alreadyHasObj: 0,
      inputStage: {
        stage: 'IXSCAN',
        nReturned: 12,
        executionTimeMillisEstimate: 3,
        works: 30,
        advanced: 12,
        needTime: 17,
        needYield: 0,
        saveState: 0,
        restoreState: 0,
        isEOF: 1,
        keyPattern: { arr: 1 },
        indexName: 'arr_1',
        isMultiKey: true,
        multiKeyPaths: { arr: [ 'arr' ] },
        isUnique: false,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'forward',
        indexBounds: { arr: [ '["p", MaxKey]' ] },
        keysExamined: 29,
        seeks: 1,
        dupsTested: 29,
        dupsDropped: 17
      }
    }
  }
}
Enter fullscreen mode Exit fullscreen mode

The query planner could have chosen the same index to serve the sort() first (indexBounds: { arr: [ '[MinKey, MaxKey]' ] } in IXSCAN) and filter on all index entries later (filter: { arr: { '$gt': [ 'p', 'i', 'n', 'e' ] } } in FETCH), which doesn't necessitate a SORT operation. In my case, this plan was estimated more expensive and was rejected:

db.fruits.find( { "arr": { $gt: ["p","i","n","e"] } } ).
          sort( { "arr": 1 } ).
          explain().queryPlanner.rejectedPlans

[
  {
    isCached: false,
    stage: 'FETCH',
    filter: { arr: { '$gt': [ 'p', 'i', 'n', 'e' ] } },
    inputStage: {
      stage: 'IXSCAN',
      keyPattern: { arr: 1 },
      indexName: 'arr_1',
      isMultiKey: true,
      multiKeyPaths: { arr: [ 'arr' ] },
      isUnique: false,
      isSparse: false,
      isPartial: false,
      indexVersion: 2,
      direction: 'forward',
      indexBounds: { arr: [ '[MinKey, MaxKey]' ] }
    }
  }
]
Enter fullscreen mode Exit fullscreen mode

Interestingly, the same index can serve both comparison and sorting purposes by enabling quick access to the minimum or maximum value, or by filtering a range based on the first item. Unlike many multi-value indexes in other databases, MongoDB multi-key indexes can be used to optimize sorts and comparisons.

Conclusion

This might clarify why MongoDB arrays have different semantics for sorting and comparison, or at least help you remember it. Comparison operators ($gt, $lt, etc.) work on two arrays, comparing items based on their position. In contrast, sorting (sort()) is performed on an unwound array ordered by the item's value in either ascending or descending order, and retaining the first encountered item in this order, which is the minimum or maximum.

Comments 0 total

    Add comment