- MongoSQL Syntax & Semantics
- Overview
- Abstract Model
- MongoSQL Grammar
- Type System
- Clauses
- Set Operations
- Expressions
- Scoping Rules
- Miscellaneous
- Future Work
- Appendix
This document describes the syntax and semantics of MongoSQL, a SQL dialect that is designed to provide first-class support for working with structured and schemaless data in MongoDB, while remaining as compatible with the SQL-92 standard as possible.
This document is intended to be descriptive, not a formal specification, though we do provide some formal definitions as part of our behavioral descriptions where it is helpful to do so. At the moment, the only formal specification that we provide for MongoSQL is a set of spec tests. We expect the language to evolve as needed over the course of the MongoSQL initiative (INIT-63) and beyond.
This section provides a model for describing the behavior of a MongoSQL query, and standardizes definitions and notation that are used throughout this document for precisely specifying the behavior of various language features. This model is an "Abstract" model because it only exists for the purposes of this document, and does not imply or prescribe any particular implementation decisions.
Each MongoSQL (sub-)query and MongoSQL (sub-)expression q is evaluated
within a catalog environment ρc and a values environment
ρn, where n is a subscript used to denote different
versions of ρ during evaluation. The values environment, ρn,
contains the values that are in scope at a given point (temporally and syntactically) in a query
execution, n. ρc can be thought of as the catalog from most SQL
databases. The initial values environment at the start of execution is denoted
ρ0. In general, n need not be an integer; the subscript serves
only as a convenient way to distinguish distinct environments during
query execution.
In all cases, an environment is a binding tuple, which is a set
⟨(x1,i1): v1, ..., (xn,in): vn⟩
where each xi is a valid BSON key or ⟘, each ii is a subquery nesting depth, starting
at 0 for the top-level query, and each vi is a BSON value. The
shorthand ⟨X:v, ...⟩ stands in for ⟨(X, 0):v, ...⟩.
In a top-level query, ρ0 is always the empty binding tuple because there are no local value bindings prior to query execution.
We also define the common set operators over arrays where the results are ordered in terms of the left argument.
In MongoSQL, each clause can be defined in isolation from the others. A clause is a function that inputs and outputs collections of binding tuples and the environments defined above. Formally, for Ρ, the set of all possible environments (binding tuples), and C, the set of all possible arrays/collections of binding tuples, each clause is a function:
Clause: Ρ × C → Ρ × C
That is, each clause is a function accepting a tuple of (ρi, c)
where each ρ is a values environment as defined before, and each c
is an array/collection of binding tuples. The global catalog
environment, ρc, is always accessible, and never modified, so we
elide it from function inputs and outputs. No clauses actually make
changes to ρ, but instead create modified versions of ρ for use in
subqueries or later clauses.
This is very similar to MQL's stages, which are functions over
collections of BSON documents with one global environment containing the
catalog and the state of certain global variables (e.g. $$NOW). The
MongoSQL values environment ρi is analogous to the set of fields
defined by the current document or by let bindings in stages like
$lookup.
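To make this model concrete, here is a small sketch (in Python, which this document uses only for illustration; none of these names are prescribed by MongoSQL) that models binding tuples as dictionaries keyed by (name, nesting depth) pairs, and a clause as a function from an environment and an array of binding tuples to the same shape:

# A binding tuple maps (name, nesting_depth) keys to BSON-like values,
# e.g. {("x", 0): {"a": 24.5}, ("y", 0): {"a": 41, "b": 42}}.

def tuple_concat(left, right):
    """Concatenate two binding tuples; bindings in `right` take precedence."""
    out = dict(left)
    out.update(right)
    return out

def where_clause(condition, rho, tuples):
    """A clause maps (values environment, array of binding tuples) to the same
    shape. WHERE keeps the tuples whose condition evaluates to TRUE, evaluating
    the condition in the concatenation of the environment and the tuple."""
    return rho, [t for t in tuples if condition(tuple_concat(rho, t)) is True]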
As in standard SQL, the clauses of a MongoSQL query are evaluated in the following order:
- FROM
- WHERE
- GROUP BY
- HAVING
- SELECT
- ORDER BY
- OFFSET
- LIMIT
Consider the following example:
ρc: ⟨ test: {
foo: [
{'a': 24.5},
{'a': 999}
],
bar: [
{'a':41, 'b':42},
{'a':21, 'c':23}
]
}⟩
ρ0: ⟨⟩
SELECT *
FROM test.foo AS x CROSS JOIN test.bar AS y
WHERE x.a > y.a
The query will be executed as follows, producing the specified inputs and outputs at each step. This example provides a basic introduction to the semantics of the SELECT, FROM, and WHERE clauses. See the specific sections for a more in-depth treatment.
FROM test.foo AS x CROSS JOIN test.bar AS y
FROMin = (ρ0, cFROMin = [ ])
FROMout = WHEREin = (ρ0, cWHEREin = [
⟨(x, 0): {a: 24.5}, (y, 0): {a: 41, b: 42}⟩,
⟨(x, 0): {a: 24.5}, (y, 0): {a: 21, c: 23}⟩,
⟨(x, 0): {a: 999}, (y, 0): {a: 41, b: 42}⟩,
⟨(x, 0): {a: 999}, (y, 0): {a: 21, c: 23}⟩
])
WHERE x.a > y.a
WHEREout = SELECTin = (ρ0, cSELECTin = [
⟨(x, 0): {a: 24.5}, (y, 0): {a: 21, c: 23}⟩,
⟨(x, 0): {a: 999}, (y, 0): {a: 41, b: 42}⟩,
⟨(x, 0): {a: 999}, (y, 0): {a: 21, c: 23}⟩
])
SELECT *
SELECTout = (ρ0, RESULT = [
⟨(x, 0): {a: 24.5}, (y, 0): {a: 21, c: 23}⟩,
⟨(x, 0): {a: 999}, (y, 0): {a: 41, b: 42}⟩,
⟨(x, 0): {a: 999}, (y, 0): {a: 21, c: 23}⟩
])
First, the FROM clause is evaluated within the environments ρc and
ρ0 shown at the top of the example. After that, the FROM clause
outputs the shown array of binding tuples, leaving the environments
unchanged. In each binding tuple of FROMout, the field names correspond
to the aliases defined in the FROM clause, namely x and y. Because this
is a cross join, the resulting binding tuples are formed from the cross
product of test.foo and test.bar.
There are no restrictions that the types of values bound to x and y be
homogeneous, and fields can be missing, as the example shows. The
binding tuples themselves, however, are homogeneous: the same set of
binding-tuple keys must appear in every tuple from the same result set.
Each clause following the FROM clause takes as its input an environment and an array of binding tuples, and outputs an environment and an array of binding tuples (which then become the input to the next clause).
In the above example, the WHERE clause filters out those binding tuples
where x.a is not greater than y.a by evaluating its condition for each
binding tuple in the input array, and producing a new output array of
binding tuples containing only those tuples that meet the condition.
Each conditional evaluation in the WHERE clause is performed within the environments ρc and ρi = tupleConcat(ρ0, tj). This means that the values from ρ0 will be consistent, while the values in ρi that come from the current tuple tj will vary from tuple to tuple.
The value environment when the first input tuple is evaluated by the WHERE clause is:
ρWHERE 1 = tupleConcat(ρ0, t1)
= tupleConcat(⟨⟩, ⟨(x, 0): {a: 24.5}, (y, 0): {a: 41, b: 42}⟩)
= ⟨(x, 0): {a: 24.5}, (y, 0): {a: 41, b: 42}⟩
As we can see, ρWHERE 1 is equivalent to the first input binding tuple, because ρ0 for this query is empty.
The WHERE condition evaluates to false for the first binding tuple in WHEREin because:
(ρc, ρWHERE 1) ⊢ x.a > y.a
→ {a: 24.5}.a > {a: 41, b: 42}.a
→ 24.5 > 41
→ false
The last three binding tuples match the condition, and thus three binding tuples are sent to the SELECT clause:
(ρc, ρWHERE 2 = tupleConcat(⟨⟩, ⟨(x, 0): {a: 24.5}, (y, 0): {a: 21, c: 23}⟩))
⊢ x.a > y.a
→ {a: 24.5}.a > {a: 21, c: 23}.a
→ 24.5 > 21
→ true
(ρc, ρWHERE 3 = tupleConcat(⟨⟩, ⟨(x, 0): {a: 999}, (y, 0): {a: 41, b: 42}⟩))
⊢ x.a > y.a
→ {a: 999}.a > {a: 41, b: 42}.a
→ 999 > 41
→ true
(ρc, ρWHERE 4 = tupleConcat(⟨⟩, ⟨(x, 0): {a: 999}, (y, 0): {a: 21, c: 23}⟩))
⊢ x.a > y.a
→ {a: 999}.a > {a: 21, c: 23}.a
→ 999 > 21
→ true
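Continuing the illustrative Python sketch from the Abstract Model section, the WHERE evaluation above can be reproduced mechanically:

rho0 = {}
where_in = [
    {("x", 0): {"a": 24.5}, ("y", 0): {"a": 41, "b": 42}},
    {("x", 0): {"a": 24.5}, ("y", 0): {"a": 21, "c": 23}},
    {("x", 0): {"a": 999}, ("y", 0): {"a": 41, "b": 42}},
    {("x", 0): {"a": 999}, ("y", 0): {"a": 21, "c": 23}},
]

def condition(rho):
    # x.a > y.a, evaluated in the concatenated values environment
    return rho[("x", 0)]["a"] > rho[("y", 0)]["a"]

_, where_out = where_clause(condition, rho0, where_in)
assert len(where_out) == 3  # the first tuple is filtered out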
This pattern of "input an array (stream) of binding tuples, evaluate the constituent expressions, output an array (stream) of binding tuples" has a couple of exceptions: LIMIT and OFFSET do not need to evaluate any expressions for each binding tuple, and the FROM clause takes no input. For example, a LIMIT 10 clause that inputs an array or stream with 100 binding tuples need not access binding tuples 11-100.
Overall, each clause of MongoSQL is an operator that inputs/outputs arrays (or streams) of binding tuples. As such, we can define the semantics of each clause separately from the semantics of the other clauses.
MongoDB's wire protocol allows only for result sets of documents, not binding tuples. Therefore, we will need a way of transforming binding tuples in our result sets into BSON documents so that they can be returned in MongoDB result sets.
There are many possible algorithms for rendering a binding tuple to a BSON document, and MongoSQL does not require any particular one as part of the specification. The choice of rendering algorithm(s) is left to the database hosting the MongoSQL implementation. For the sake of consistency, we describe a reasonable default algorithm below:
- Check each binding in the binding tuple:
  - If keys exist with nesting depths other than 0, it is an error¹.
  - If the key's datasource name is ⟘, the value for that key must be a document, and all the keys of that document are materialized in the root of the target document.
  - If the key's datasource name is a valid BSON key name:
    - If the value is a document:
      - If no keys of that document conflict with any keys of any other subdocuments or the root, we materialize that document's bindings in the root of the target document.
      - If there are key conflicts, we materialize the document as a nested subdocument of the root materialized document, with the binding-tuple key as its BSON key.
    - If the value is not a document, we materialize it as a BSON key-value pair using the binding-tuple key.
Examples:
⟨(a, 0): 1, (b, 0): 2⟩
materializes as
{a: 1, b: 2}
⟨(⟘, 0): {a: 1}, (b, 0): {a: 1, b: 2}⟩
materializes as
{a: 1, b: {a: 1, b: 2}}
Because b.a would conflict with a if rerooted.
⟨(⟘, 0): {a: 1}, (b, 0): {c: 1, b: 2}⟩
materializes as
{a: 1, c: 1, b: 2}
Because there are no conflicts introduced by rerooting b.
If the keys of documents in the generated binding tuples cannot be statically enumerated at query planning time, we assume that conflicts exist, meaning that the result will be materialized as a nested subdocument under the binding tuple key.
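As an illustration only (the host database may choose any algorithm), the default algorithm above can be sketched in Python, with BOT standing in for the ⟘ key:

BOT = "\u27d8"  # sentinel standing in for the ⟘ datasource name

def render(binding_tuple):
    """Render ⟨(name, depth): value, ...⟩ to a single BSON-like document."""
    if any(depth != 0 for (_name, depth) in binding_tuple):
        raise ValueError("nesting depth other than 0 at the root of a query")
    bindings = {name: v for (name, _depth), v in binding_tuple.items()}

    def other_keys(name):
        # Keys every *other* binding would contribute to the root document.
        keys = set()
        for n, v in bindings.items():
            if n != name:
                keys |= set(v.keys()) if isinstance(v, dict) else {n}
        return keys

    doc = {}
    for name, value in bindings.items():
        if name == BOT:
            doc.update(value)  # must be a document; its keys go in the root
        elif isinstance(value, dict) and not (set(value) & other_keys(name)):
            doc.update(value)  # no conflicts: materialize in the root
        else:
            doc[name] = value  # conflicts, or not a document: keep the key
    return doc

# render({("a", 0): 1, ("b", 0): 2})                        == {'a': 1, 'b': 2}
# render({(BOT, 0): {"a": 1}, ("b", 0): {"a": 1, "b": 2}})  == {'a': 1, 'b': {'a': 1, 'b': 2}}
# render({(BOT, 0): {"a": 1}, ("b", 0): {"c": 1, "b": 2}})  == {'a': 1, 'c': 1, 'b': 2}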
¹ This is an error because the only place where we should ever be transforming a MongoSQL binding-tuple result set to a BSON-document result set is at the root of a query. This condition should never arise unless there is an implementation error.
This is a linked grammar of the entire MongoSQL syntax. Each production links to the section of the document that further elucidates the syntax and semantics of the given construct.
<MongoSQL statement> ::= <select query>
<select query> ::= <set query>
| <select clause> <from clause>?
<where clause>? <group by clause>?
<having clause>? <order by clause>?
<limit/offset clause>?
<limit/offset clause> ::= <limit clause> <offset clause>?
| <offset clause> <limit clause>?
MongoSQL types are the set of BSON types. All of these types can be queried in MongoSQL, though not all of them have literal syntax (deprecated BSON types, for example). See below for the list of all BSON types:
- Double
- String
- Document
- Array
- BinData
- ObjectId
- Boolean
- Date
- Null
- Undefined
- Regex
- DBPointer
- Javascript
- Symbol
- JavascriptWithScope
- Int
- Timestamp
- Long
- Decimal
- MinKey
- MaxKey
Each type in MongoSQL has a name, which is a keyword that can be used to reference the type when necessary (e.g. in an expression like CAST). Some types have one or more aliases in addition to their primary name.
Generally, MongoSQL types use their BSON type names and, for SQL-92 compatibility, their corresponding SQL-92 type names as aliases. For example, the MongoSQL Integer type uses the name INT from BSON and the aliases INTEGER and SMALLINT from SQL-92.
There are two exceptions to this: the BSON Date type and the BSON Timestamp type. The BSON Date type name conflicts with the SQL-92 Date type name. A BSON Date is a datetime, whereas a SQL-92 Date is just a date. The BSON Timestamp type name conflicts with the SQL-92 Timestamp type name. A BSON Timestamp is a special type for MongoDB, whereas a SQL-92 Timestamp is a datetime. Since MongoSQL must be SQL-92 compliant, SQL-92 names take precedence. Therefore, to address the naming conflicts, the MongoSQL Datetime type, which corresponds to the BSON Date type, uses the SQL-92 Timestamp type name, TIMESTAMP, and the MongoSQL Timestamp type, which corresponds to the BSON Timestamp type, uses the name BSON_TIMESTAMP. The MongoSQL Datetime type also uses the alias BSON_DATE. See the Grammar for all type names and aliases.
Type aliases are rewritten to their core names in MongoSQL queries. See the Examples for details.
<type> ::= <double type> | <string type> | <document type>
| <array type> | <binary data type> | <undefined type>
| <objectId type> | <boolean type> | <datetime type>
| <null type> | <regular expression type> | <dbPointer type>
| <javascript type> | <symbol type>
| <javascriptWithScope type> | <32-bit integer type>
| <timestamp type> | <64-bit integer type>
| <decimal128 type> | <min key type> | <max key type>
<double type> ::= DOUBLE PRECISION?
| REAL
| FLOAT ("(" <integer literal> ")")?
<string type> ::= STRING
| VARCHAR ("(" <integer literal> ")")?
| CHAR ("(" <integer literal> ")")?
| CHARACTER ("(" <integer literal> ")")?
| CHAR VARYING ( "(" <integer literal> ")")?
| CHARACTER VARYING ( "(" <integer literal> ")")?
<document type> ::= DOCUMENT
<array type> ::= ARRAY
<binary data type> ::= BINDATA
<undefined type> ::= UNDEFINED
<objectId type> ::= OBJECTID
<boolean type> ::= BOOL | BIT | BOOLEAN
<datetime type> ::= BSON_DATE | TIMESTAMP
<null type> ::= NULL
<regular expression type> ::= REGEX
<dbPointer type> ::= DBPOINTER
<javascript type> ::= JAVASCRIPT
<symbol type> ::= SYMBOL
<javascriptWithScope type> ::= JAVASCRIPTWITHSCOPE
<32-bit integer type> ::= INT | INTEGER | SMALLINT
<timestamp type> ::= BSON_TIMESTAMP
<64-bit integer type> ::= LONG
<decimal128 type> ::= DECIMAL ("(" <integer literal> ("," <integer literal>)? ")")?
| DEC ("(" <integer literal> ("," <integer literal>)? ")")?
| NUMERIC ("(" <integer literal> ("," <integer literal>)? ")")?
<min key type> ::= MINKEY
<max key type> ::= MAXKEY
See spec tests.
The INTERVAL type is not supported by MongoSQL. There are different levels of SQL-92 support and INTERVAL is not part of SQL-92 ENTRY support, so we chose not to include it. Additionally, there is no corresponding BSON type for INTERVAL.
Given the name conflict between the BSON Timestamp and SQL-92 Timestamp types, and the BSON Date and SQL-92 Date types, we chose to prefix the MongoSQL names for the BSON types with "BSON". For consistency across all BSON types, we considered supporting aliases for all BSON types that used the "BSON_" prefix. We ultimately decided against this because (1) it avoids clutter, and (2) consistency across alias names for all BSON types may actually lead to more user confusion as opposed to reducing it.
Type conversions fall into two categories: explicit and implicit. Explicit type conversions are directly invoked by users. Implicit type conversions are performed without users specifically requesting them. In MongoSQL, all type conversions are explicit.
Explicit type conversions are expressed via the CAST scalar function
or the :: operator. The CAST function accepts up to
four arguments. As defined by SQL-92, the base-level invocation of CAST
accepts an operand to cast and a target data type. In addition to that
form, MongoSQL accepts two additional, optional arguments. One
additional argument is an expression to return if the operand is NULL or
MISSING; if omitted, casting NULL or MISSING to anything results in
NULL. The other additional argument is an expression to return if the
conversion produces a runtime error; if omitted, MongoSQL will return
NULL if it encounters a casting error. See the
Grammar for details on how to provide these
extra arguments.
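For illustration (the operand and replacement values here are arbitrary), a full four-argument invocation looks like:

CAST(x AS INT, 0 ON NULL, -1 ON ERROR)

which evaluates to 0 if x is NULL or MISSING, and to -1 if the conversion fails at runtime.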
The :: operator is a shorthand alias for the two-argument form of the
CAST function.
<expr>::<type>
will be rewritten as:
CAST(<expr> AS <type>)
In addition to CAST and ::, implementations may also define
"constructor" scalar functions that alias CAST invocations to certain
types. For example, an implementation may define an
OBJECTID(<expression>) function as shorthand for CAST(<expression>
AS OBJECTID).
Although all type conversions must be explicit, that does not mean type conversions are required in all circumstances. Numeric types are all mutually comparable. Therefore, MongoSQL allows operations between the various numeric types without casting the operands to be the same numeric type. For example, an int can be added to a double, or a long could be compared to a decimal, etc.
The semantics of type conversions in MongoSQL are similar to those of the aggregation pipeline operator $convert. The range of target types is a subset of all MongoSQL types. Valid target types are ARRAY, DOCUMENT, DOUBLE, STRING, OBJECTID, BOOL, BSON_DATE, INT, LONG, and DECIMAL, or any of their corresponding SQL-92 type aliases as shown in the Data Types section. If the provided target type is invalid, a static error is returned.
For all target types except ARRAY and DOCUMENT, MongoSQL type conversion behaves the same as $convert. The exception to this is that MongoSQL CAST returns NULL instead of throwing a runtime error if a conversion error is encountered and no ON ERROR argument is provided. Attempting to CAST to a target type from an incompatible source type, for example BOOL to BSON_DATE, is considered a conversion error and evaluates to ON ERROR or NULL if that is not provided.
Casting to ARRAY behaves as follows:
| Input | Behavior |
|---|---|
| ARRAY | No-op |
| NULL or MISSING | ON NULL expression, if provided; NULL otherwise |
| Any other type | ON ERROR expression, if provided; NULL otherwise |
Casting to DOCUMENT behaves as follows:
| Input | Behavior |
|---|---|
| DOCUMENT | No-op |
| NULL or MISSING | ON NULL expression, if provided; NULL otherwise |
| Any other type | ON ERROR expression, if provided; NULL otherwise |
<cast expression> ::= <expression> "::" <type>
| CAST "(" <expression> AS <type>
("," <expression> ON NULL)?
("," <expression> ON ERROR)?
")"
See query spec tests here and rewrite spec tests here.
SQL-92 specifies that implicit casting is only required within groups of similar types, i.e. numeric types, character types, bit types, datetime types, and interval types, not between them (SQL-92 4.6).
MongoSQL does not have implicit type conversions between different groups of types. For example, <int> + <string> requires that the <string> be explicitly cast to a numeric type. MongoSQL does support some operations between different numeric types, but this is transparent to users. Implicit conversions lead to confusing semantics and potentially unexpected and surprising behavior. Competitors do not support implicit type conversions:
- SQL++ clearly specifies it does not.
- N1QL specifies that it does not on each of its "Functions"/"Operators" docs pages.
- Rocket specifies that it does not on each of its "Functions" docs pages.
- PartiQL does not specify in the docs, but we observed that it does not.
- Presto does not specify in the docs, but we observed that it does not.
Popular SQL dialects MySQL and SQLite do seem to have implicit casting such that <int> + <string> works without an explicit cast.
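For example (collection and field names here are hypothetical), adding a string field s to an int field i requires the conversion to be spelled out:

SELECT CAST(s AS INT) + i FROM foo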
A schema in MongoSQL is a collection of facts about an expression or collection that are known to be true at compile time.
MongoSQL schemas are similar to a structural type system, though we use the word "schema" instead of "type" to avoid confusion with MongoSQL data types. For example, a MongoSQL schema might tell us that "this expression is either a boolean or a document with subfields a and b", or "this expression is either an array of length 1 or a positive integer".
MongoSQL gets schema information from a number of different places:
- The database that is hosting a MongoSQL implementation may provide schema information about some or all collections.
- The types of all other MongoSQL expressions (literals, operators, scalar functions, etc) can be determined at compile time. For example, we can know at compile time that "abc" is a string, and that CASE WHEN x THEN 5 ELSE false END is either an integer or a boolean.
If schema information is not available from the host database, MongoSQL still assigns static types to all expressions (including column references); these types will just be much less constrained than they would be if schema data were available. For example, when no schema data is available for the collection foo, the field a in the query SELECT a FROM foo could be a value of any MongoSQL type, or could be missing.
There are numerous situations where an expression's type needs to be verified at compile time. Every function and operator has constraints on the types of its arguments, and some clauses also have type constraints for expressions in certain positions. Some expressions that would typically check types at runtime (x IS INTEGER, for example) may also be evaluated at compile time if sufficient type information is available.
If a static type constraint is not satisfied, then the query will fail to compile.
To satisfy the type constraints mentioned above when operating without
schema information, users would otherwise need to add CAST expressions to
declare types. To simplify this process and avoid the unnecessary
conversion stages such casts create during translation, MongoSQL provides
the type assertion operator ::!, which causes an expression to be
statically treated as the type appended to the operator. During
evaluation, no conversion is applied to the result of the expression.
When the type assertion operator is applied to an expression whose types are known, a static check is performed to make sure that the target type is among those types. If it is not, a static error is thrown. For example, given the schema
'foo': {
'bsonType': "int"
},
foo::!STRING will fail the static check because STRING is not part of the possible types foo contains.
Given the schema:
'foo': {
'anyOf': [
{ 'bsonType': "int" },
{ 'bsonType': "string" }
]
}
foo::!STRING will pass the static check and be treated as string type in the query.
Be aware that if the column reference contains a type that doesn't work with the expression, a runtime error may be returned when the query is executed, even if it passes the static check. For example, substr(foo::!STRING, 1, 2) will throw a runtime error if foo is not actually a STRING, because substr only accepts a STRING or NULL value.
<type assertion> ::= <expression>::!<type>
There is no additional syntax associated with schema/type constraints. A MongoSQL implementation may want to provide an API for getting type information about things such as collections, result sets, or expressions, but it is intentional that this document specifies neither syntax for accessing that information in a query nor a standard textual representation of a MongoSQL type.
Since there is no new syntax associated with this section, most of the type-related testing will reside with the spec tests for various clauses and expressions. When it is relevant to do so, spec tests for clauses and expressions include:
- Tests validating expression return types
- Tests validating type constraints for arguments
- Tests validating that a specific type inference can be made
- Schemas in PartiQL
  - Though PartiQL is the SQL dialect most similar to MongoSQL, the section of its spec that discusses types is marked as "WIP", and has no useful information. For all intents and purposes, the behavior of types/schemas in PartiQL is unspecified.
- Type metadata in relational databases
  - Most SQL databases provide type information in a queryable format via the INFORMATION_SCHEMA database or DESCRIBE command.
  - There are a few reasons why we choose not to provide this information via the MongoSQL query language:
    - We expect that MongoSQL will be "hosted" by a MongoDB-compatible database server. Unlike many other SQL databases, the SQL interface is not the only way of interacting with the database. Metadata can be made available via a separate command.
    - The data model of MongoSQL is fundamentally different from that of standard SQL, and so we would not be able to expose MongoSQL's richer type information in a standards-compliant manner.
    - Consumers of MongoSQL who need a standards-compatible way of accessing metadata (notably BI Tools) will be able to do so via ODBC or JDBC, as opposed to by writing DESCRIBE or INFORMATION_SCHEMA queries.
The main form of the <select clause> is SELECT VALUES. Other formats of <select clause> are syntactic sugar that can be syntactically rewritten to an equivalent SELECT VALUES query.
MongoSQL allows SELECT VALUE and SELECT VALUES to be used interchangeably. Both are supported to allow for a more natural-reading query in the presence of single or multiple expressions. MongoSQL will rewrite the keyword to match the number of arguments, as shown in the following examples:
SELECT VALUE a.*, b.* FROM a JOIN b
will be rewritten to
SELECT VALUES a.*, b.* FROM a JOIN b
SELECT VALUES {'a': 1} FROM foo
will be rewritten to
SELECT VALUE {'a': 1} FROM foo
SELECT VALUE(S) accepts a list of <select value expr>s, which can be one of two things: an ordinary expression that resolves to a document or a sub-star expression. If an expression provided to SELECT VALUE is not a sub-star expression and cannot be statically determined to be a document, then a static error will be returned.
SELECT VALUE constructs one output binding tuple for each tuple in the SELECTin stream. Each output binding tuple has one binding per <select value expr>. If the <select value expr> is a document expression, then the binding's key is ⟘ and its value is the result of evaluating the document in the local values environment ρ. If the <select value expr> is a sub-star expression, then the binding's key is the identifier preceding the star, and the value is the root value from the datasource referenced by that identifier.
Consider the following query that demonstrates how document expressions are handled:
SELECT VALUE {a: y.a, b: y.c} FROM [{a: 1, c: 2}, {a: 3}] AS y
Since this SELECT query is a top-level query (not a correlated subquery), the initial values environment is empty (i.e. ρ0 = ⟨⟩). For each input tuple, we create the local values environment by concatenating ρ0 with the tuple. In the example above, the document expression will be evaluated twice; first in the environment:
ρrow1 = tupleConcat(ρ0, ⟨y: {a: 1, c: 2}⟩)
= tupleConcat(⟨⟩, ⟨y: {a: 1, c: 2}⟩)
= ⟨y: {a: 1, c: 2}⟩
And then (for the second input tuple) in the environment
ρrow2 = tupleConcat(ρ0, ⟨y: {a: 3}⟩)
= tupleConcat(⟨⟩, ⟨y: {a: 3}⟩)
= ⟨y: {a: 3}⟩
We create one binding tuple per input binding tuple, each mapping the ⟘ key to the value of the document expression. For the example query above, the output stream is:
SELECTout = [⟨⟘ : ρrow1 ⊢ <document expression>⟩,
⟨⟘ : ρrow2 ⊢ <document expression>⟩]
= [⟨⟘ : ρrow1 ⊢ {a: y.a, b: y.c}⟩,
⟨⟘ : ρrow2 ⊢ {a: y.a, b: y.c}⟩]
= [⟨⟘ : ⟨y: {a: 1, c: 2}⟩ ⊢ {a: y.a, b: y.c}⟩,
⟨⟘ : ⟨y: {a: 3}⟩ ⊢ {a: y.a, b: y.c}⟩]
= [⟨⟘ : {a: 1, b: 2}⟩,
⟨⟘ : {a: 3}⟩]
Next, consider a query that demonstrates how sub-star expressions are handled:
SELECT VALUE y.* FROM [{a: 1, c: 2}, {a: 3}] AS y
As in the previous example, the sub-star expression will be evaluated twice; first in the environment:
ρrow1 = ⟨y: {a: 1, c: 2}⟩
And then (for the second input tuple) in the environment
ρrow2 = ⟨y: {a: 3}⟩
Which gives us the output stream
SELECTout = [⟨y : getReference(ρrow1, y)⟩,
⟨y : getReference(ρrow2, y)⟩]
= [⟨y : getReference(⟨y: {a: 1, c: 2}⟩, y)⟩,
⟨y : getReference(⟨y: {a: 3}⟩, y)⟩]
= [⟨y : {a: 1, c: 2}⟩,
⟨y : {a: 3}⟩]
SELECT ...
will be rewritten as:
SELECT ... FROM [{}] AS _dual
SELECT x.y.a, b, ...
will be rewritten as:
SELECT x.y.a AS a, b AS b, ...
SELECT <expression1>, <expression2>, ...
will be rewritten as:
SELECT <expression1> AS _1, <expression2> AS _2
Where _1 and _2 are names generated by the implementation corresponding to position in the resulting documents, starting at 1 because SQL traditionally numbers the first column as 1.
SELECT <expression1> AS x1, <expression2> as x2, ..., t.*, ...
will be rewritten as:
SELECT VALUES {x1: <expression1>, x2: <expression2>}, ..., t.*, ...
SELECT * will return all binding tuples from the input stream unmodified. Other expressions may not be present in the select list alongside *.
The DISTINCT keyword specifies that duplicate rows in the result set will be removed.
Document and array comparison for DISTINCT follows MongoDB's equality semantics, where:
- Document comparison includes field order
- Array comparison requires equal length and equal elements in the same order
<select clause> ::= SELECT <set quantifier>? <value keyword>
<select value list>
| SELECT <set quantifier>? <select list>
<value keyword> ::= VALUE | VALUES
<set quantifier> ::= ALL | DISTINCT
<select value list> ::= <select value expr> (, <select value expr>)*
<select value expr> ::= <expression> | <substar expr>
<select list> ::= <select expr> (, <select expr>)*
<select expr> ::= "*"
| <substar expr>
| <aliased expr>
<alias> ::= AS? <identifier>
<aliased expr> ::= <expression> <alias>?
<substar expr> ::= <identifier> "." "*"
- We chose not to support a DUAL table, which deviates from existing BI Connector and SQL in ADL behavior.
  - The DUAL table was originally added in Oracle, and several other database systems support it, e.g., MySQL.
  - However, many database systems, e.g. SQLite and Postgres, do not have a DUAL table; see https://en.wikipedia.org/wiki/DUAL_table for more information on which database systems have DUAL tables and which just allow SELECT without a table to achieve the same effect.
FROM is the first clause of every MongoSQL query to be evaluated. This makes it a special case, because it does not take a binding-tuple stream as its input. Instead, it generates its output tuple stream from various datasources.
MongoSQL has four categories of datasources:
- Simple datasources
- Compound (Join) datasources
- Unwind datasources
- Flatten datasources
Datasources provide various ways of creating streams of binding tuples. Simple datasources create streams of tuples with a single key, while compound datasources create streams of tuples with multiple keys.
The top-level datasource in a FROM clause forms the clause's output stream.
A collection datasource is composed of a collection reference (qualified or unqualified) and an optional alias. Formally, the collection reference is resolved in the catalog environment using the getReference function defined in the Abstract Model section. Informally, qualified references are treated as <db>.<collection> pairs, while unqualified references are treated as collections in the current database.
Collection datasources without an explicit alias are syntactically rewritten to have an alias. For unqualified identifiers, the whole identifier is used as the alias. For qualified identifiers, the collection part is used as the alias.
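For example, FROM foo is rewritten as FROM foo AS foo, and FROM db.bar is rewritten as FROM db.bar AS bar.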
A collection datasource creates a stream of binding tuples with a single key-value pair. One binding tuple is created per document in the referenced collection. The key of each tuple is the alias name, and the value is the root document.
For example, consider the output of the collection datasource in the following query:
SELECT * FROM collection AS alias
SELECTout = FROMout = [⟨alias: d⟩ for d ∊ collection]
An array datasource is composed of an array literal and an alias. The array's elements must statically evaluate to document values; syntactically, expressions are permitted inside the array, as long as they can be evaluated at compile time.
An array datasource creates a stream of binding tuples with a single key-value pair. One binding tuple is created per value in the array. The key of each tuple is the alias name, and the value is the array element.
For example, consider the output of the array datasource in the following query:
SELECT * FROM [{'a': 1}, {'a': 2}] AS alias
SELECTout = FROMout = [⟨alias: {'a': 1}⟩, ⟨alias: {'a': 2}⟩]
A derived table datasource is made up of a parenthesized MongoSQL query and an alias. Note, unlike a subquery expression, a derived table datasource cannot have correlated fields from outer queries.
A derived table datasource creates a stream of binding tuples as defined by the semantics for the SELECT clause of the derived table datasource query. One new binding tuple is created per binding tuple returned by the SELECT clause of the derived table query. The key of each tuple is the alias name, and the value is the result of merging the values of all bindings in the corresponding input tuple. For example, consider the output of the derived table datasource in the following query:
SELECT * FROM (
SELECT * FROM [{'a': 1}] AS arr1
CROSS JOIN [{'b': 2}, {'b': 3}] AS arr2
) AS derived
SELECTout = FROMout = [
⟨derived: {'a': 1, 'b': 2}⟩,
⟨derived: {'a': 1, 'b': 3}⟩
]
The semantics for derived tables for FROM (q) AS x are thus:
FROMout = [ ⟨x: $mergeObjects(v0, ...,vn)⟩
where d = ⟨y0: v0, ..., yn : vn⟩
for d ∊ q]
where $mergeObjects is the MQL function, which has semantics similar to tupleConcat, but applied to documents rather than binding tuples.
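In the illustrative Python sketch used elsewhere in this document, the merge step behaves like a left-to-right document merge:

def merge_objects(*docs):
    """Like MQL's $mergeObjects: later documents win on duplicate keys."""
    out = {}
    for d in docs:
        out.update(d)
    return out

# A tuple ⟨arr1: {'a': 1}, arr2: {'b': 2}⟩ from the derived query becomes
# ⟨derived: merge_objects({'a': 1}, {'b': 2})⟩, i.e. ⟨derived: {'a': 1, 'b': 2}⟩.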
A join datasource is a compound datasource that combines two other datasources. The binding tuples created by the join contain the keys from the two combined datasources. This requires that the sets of datasource names created by each side of the join must be disjoint. If the same datasource name appears on both sides of a join, the query will fail to compile.
The number and contents of the tuples output by a join datasource depends on the type of join and the join criteria. Behavior for each join type is described below. MongoSQL supports INNER JOIN, (CROSS) JOIN, LEFT OUTER JOIN, and RIGHT OUTER JOIN.
There are two types of JOIN that may be rewritten syntactically. We also rewrite to have aliases as specified in the Collection Datasource section.
CROSS JOIN performs a mathematical cross product of two datasources. For example, consider the output of the join datasource in the following query:
SELECT * FROM A AS a1 CROSS JOIN B AS b1
SELECTout = FROMout = [⟨a1: a, b1: b⟩ for (a, b) ∊ A ⨯ B]
Comma join:
<datasource>, <datasource>
will be rewritten as:
<datasource> CROSS JOIN <datasource>
Semantically, an INNER JOIN is equivalent to a CROSS JOIN filtered by a WHERE clause. For example:
SELECT * FROM X INNER JOIN Y ON <condition>
is equivalent to
SELECT * FROM X CROSS JOIN Y WHERE <condition>
The same predicate typing restrictions apply when using ON as when using WHERE. The only difference between an inner join's ON predicate and a WHERE clause is that the only values in scope for the ON predicate are those in the two datasources being joined.
For example, consider the output of the join datasource in the following query. For the purpose of the formal definition, we consider the join criteria <condition> a function that takes a binding tuple and returns a boolean.
SELECT * FROM A as a1 INNER JOIN B as b1 ON <condition>
SELECTout = FROMout = [tup if <condition>(tup)
where tup = ⟨a1: a, b1: b⟩
for (a, b) ∊ A ⨯ B
]
MongoSQL does not support the USING clause in the initial version (see Future Work).
Like in standard SQL, left outer joins in MongoSQL guarantee that every tuple from the left side of the join appears at least once in the result. The main way in which MongoSQL differs from standard SQL is that we cannot necessarily enumerate all field names in a datasource. So, in the cases where SQL would return null values for all fields on the right side of a join, we instead set the value for all right-side datasource names to the empty document.
SELECT * FROM A AS a1 LEFT OUTER JOIN B AS b1 ON <condition>
SELECTout = FROMout = [..., ⟨a1: {...}, b1: {...}⟩, ⟨a1: {...}, b1: {}⟩, ...]
A right outer join is the inverse of a left outer join. Since MongoSQL does not provide any guarantees about field order, the following queries are semantically equivalent:
SELECT * FROM A AS a1 LEFT OUTER JOIN B AS b1 ON <condition>
SELECT * FROM B AS b1 RIGHT OUTER JOIN A AS a1 ON <condition>
The syntax for unwinding array fields is an UNWIND keyword that can be used in the FROM clause in conjunction with a datasource and options.
UNWIND(<datasource> WITH PATH => <path>,
INDEX => <name>,
OUTER => <bool>)
where
- datasource is the source of the array field
- PATH is a field path that references the field in the datasource to unwind
- INDEX is the name to assign the index column
- OUTER indicates whether documents with null, missing, or empty array values are preserved
The UNWIND datasource "unwinds" an array field into multiple rows–each one corresponding
to a value from the array field. It is a datasource that operates on any valid
MongoSQL datasource (collection, array, join, another unwind, etc.). The UNWIND datasource specifies a datasource and a sequence of named arguments. The only required
argument is a "PATH", which is a compound identifier that references an array field from the datasource. There are optional "INDEX" and "OUTER" arguments that behave similarly to their $unwind counterparts, "includeArrayIndex" and "preserveNullAndEmtpyArrays", respectively. By default, if those arguments are not included, the output does not contain an index field and does not preserve documents for which the unwind field is null, missing, or an empty array.
Note that UNWIND operates on arbitrary datasources, meaning it could operate on simple
or compound datasources. For the latter case, the UNWIND datasource outputs
multiple namespaces. Therefore, the UNWIND datasource cannot be aliased; instead,
it uses the alias(es) of the provided datasource. The field that is unwound is nested
along its field path under the alias of its datasource. See the Examples
for a more concrete view of this.
Generally, an UNWIND datasource can be thought of as a two-step process. First, the inner datasource is evaluated to produce a stream of binding tuples. Then, each binding tuple is sent through the "unwinding" process to produce 0 or more binding tuples, depending on the value of the field specified in the UNWIND. In the abstract model, this is actually modeled in one step as follows:
SELECT * FROM UNWIND(ds WITH PATH => array_path,
INDEX => i)
SELECTout = FROMout = [⟨bk: {...d, array_path: val, i: idx}, ...bt⟩
for (idx, val) ∊ enumerate(d.array_path)
where (bk: d) is the binding in bt that
contains array_path
for bt ∊ ds
]
For each binding tuple bt in datasource ds, determine the binding (bk, d) that
contains the unwind path according to the MongoSQL Scoping Rules
and then "enumerate" d.array_path. Here, enumerate means to range over the values
in the array along with the index for that value. For each index-value pair (idx, val)
in the array, output a binding tuple that binds "bk" to the document containing everything in d
(denoted as ...d) with "array_path" replaced by the unwind value val and an "i"
field added with the index value idx, as well as all other bindings in bt (denoted as ...bt).
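For example, given a hypothetical collection foo containing the single document {'a': 1, 'arr': [10, 20]}, the query

SELECT * FROM UNWIND(foo WITH PATH => arr, INDEX => i)

produces

SELECTout = FROMout = [⟨foo: {a: 1, arr: 10, i: 0}⟩,
⟨foo: {a: 1, arr: 20, i: 1}⟩]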
There are several important notes to point out here. The Examples highlight these points in finer detail.
- If the value at the PATH for a certain document d is not an array and is not NULL or missing, the abstract model enumerate function simply outputs the value at the path and the index value NULL.
- If the value at the PATH for a certain document d is an empty array, NULL, or missing, bt is omitted from the output by default.
  - If the optional argument OUTER is provided and set to TRUE, then bt is included in the output, similar to how MongoDB $unwind works.
  - For brevity, the abstract model equations in this document omit the steps that exclude such values by default.
  - Note that several competitors use JOIN UNNEST syntax to unwind array values from datasources.
    - A CROSS JOIN (the default) omits documents when the unwind path references a null, empty, or missing array.
    - A LEFT JOIN includes such documents.
- If the datasource is a compound datasource that contains multiple datasources, UNWIND follows the MongoSQL Scoping Rules to determine which field from which datasource is being referenced. The unwound data is nested under the appropriate datasource in the output.
  - To make this clear, here is how this looks according to the abstract model. For simplicity of notation in this example, we can assume a and b are both collection datasources:

    SELECT * FROM UNWIND(a JOIN b WITH PATH => a.array_path)
    SELECTout = FROMout = [
      ⟨a: {...da, array_path: val}, b: db⟩
      for (idx, val) ∊ enumerate(da.array_path)
      for (da, db) ∊ a ⨯ b
    ]
  - See the Examples for more details.
- If the INDEX name MAY² or MUST conflict with a field name in the datasource, then the UNWIND source is semantically invalid and an error is produced.
There are three schema satisfaction levels:
- MUST contain f => we have a schema that proves a datasource is guaranteed to have the top-level field f
- CANNOT contain f => we have a schema that proves a datasource cannot contain the top-level field f
- MAY contain f => we cannot prove or disprove the existence of a top-level field f in a datasource
FLATTEN(<datasource> WITH DEPTH => n, SEPARATOR => <some_string>), where
- DEPTH => n
  - n is a nonnegative integer
  - Optional and defaults to flattening all subdocuments
- SEPARATOR => <some_string>
  - Use <some_string> as the delimiter when concatenating field names
  - Optional and defaults to '_'
A FLATTEN datasource operates on any valid MongoSQL datasource (collection, array, join, another flatten, etc.),
some of which are simple datasources, while others are compound. If FLATTEN is operating
on a compound datasource, it will output multiple namespaces. Therefore, the FLATTEN
datasource cannot be aliased; instead, it uses the alias(es) of the provided datasource.
The flattened fields are nested under the alias of their datasource.
Refer to the spec tests for a more concrete view of this.
FLATTEN creates a stream of binding tuples with the same keys as the input tuples; one binding tuple is created per document in the datasource. For example, consider the following query
SELECT * FROM FLATTEN(<input datasource> WITH DEPTH => n,
SEPARATOR => s)
its output is defined as:
SELECTout = FROMout = [flatten_tuple(t, n, s) for t ∊ datasource]
where flatten_tuple(t, n, s) is defined as follows:

for (datasource_name, root_value) in t:
    emit_binding(datasource_name, flatten_doc(root_value, n, s))

and flatten_doc(d, n, s) is the following recursive algorithm that produces a document:

for (k, v) in d:
    if v is not a document or n <= 0:
        emit(k, v)
    else:
        for (sub_k, sub_v) in flatten_doc(v, n-1, s):
            emit(k + s + sub_k, sub_v)

Note: n defaults to ∞ and s defaults to '_'.
In other words, the flattening process will recursively flatten each top-level field until it reaches a non-document subfield or the user-specified document depth. Here, emit_binding(name, doc) means return a binding of name to doc. emit(k, v) means include the key-value pair (k, v) in the result document. If the user sets the value of depth to 0, flatten_doc will be a no-op.
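For reference, here is a runnable rendition of the algorithm above (Python, illustrative only; the choice of implementation is left to the host database):

import math

def flatten_doc(d, n=math.inf, s="_"):
    """Recursively flatten nested documents, joining keys with `s` and
    descending at most `n` levels; n == 0 leaves the document unchanged."""
    out = {}
    for k, v in d.items():
        if not isinstance(v, dict) or n <= 0:
            out[k] = v
        else:
            for sub_k, sub_v in flatten_doc(v, n - 1, s).items():
                out[k + s + sub_k] = sub_v
    return out

# flatten_doc({'a': {'b': {'c': 1}}, 'x': 2}) == {'a_b_c': 1, 'x': 2}
# flatten_doc({'a': {'b': 1}}, n=0)           == {'a': {'b': 1}}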
Flattening will return an error if
- the datasource contains polymorphic object fields
- we cannot enumerate all of the keys for a field given its document schema
- a naming collision MAY or MUST¹ occur.
  - For example, suppose we have the following data:
    { 'foo': { 'a_b': 1, 'a': { 'b': 2 } } }
    SELECT * FROM FLATTEN(foo) would produce 'a_b' the original field, and 'a_b' the flattened field. Those names conflict, so MongoSQL will return an error.
<from clause> ::= FROM <datasource>
<datasource> ::= <simple datasource> | <compound datasource> | <unwind datasource> | <flatten datasource>
<simple datasource> ::= <collection datasource>
| <array datasource>
| <derived table datasource>
<collection datasource> ::= <collection reference> <alias>?
<collection reference> ::= <compound identifier>
<array datasource> ::= <array expression> <alias>
<derived table datasource> ::= "(" <select query> ")" <alias>
<compound datasource> ::= <cross join> | <qualified join>
<join type> ::= INNER
| CROSS?
| LEFT OUTER?
| RIGHT OUTER?
<cross join> ::= <datasource> CROSS? JOIN <datasource>
<qualified join> ::= <datasource> <join type> JOIN <datasource> <join spec>?
<join spec> ::= ON <expression>
<join column list> ::= <compound identifier> ("," <compound identifier>)*
<unwind datasource> ::= UNWIND "(" <datasource>
(WITH <unwind options>)? ")"
<unwind options> ::= <unwind option> ("," <unwind option>)*
<unwind option> ::= <path option>
| <index option>
| <outer option>
<path option> ::= PATH "=>" <compound identifier> ("," <compound identifier>)*
<index option> ::= INDEX "=>" <identifier>
<outer option> ::= OUTER "=>" <boolean literal>
<flatten datasource> ::= FLATTEN "(" <datasource> (WITH <flatten option> ("," <flatten option>)*)? ")"
<flatten option> ::= SEPARATOR "=>" <string literal>
| DEPTH "=>" <integer literal>
A WHERE clause is a filter on the incoming binding tuples. Its output binding tuples are those binding tuples for which the condition is met.
Formally, for WHERE e
error if e(x) IS NOT {BOOLEAN, NULL, MISSING} (statically)
otherwise
WHEREout = [
x if e(x)
for x ∊ WHEREin
]
Of note here is that we report an error if e(x) is not statically guaranteed to be a BOOLEAN. The expression, e, must statically return a boolean, either by using a CAST to BOOLEAN, or by using only expressions that are guaranteed to return BOOLEAN, such as comparison operators, AND, and OR, or a CASE statically determined to always return BOOLEAN. Note that because NULL (and MISSING) are distinct from TRUE, they cause the current document, x, to not be included in the result set. This is consistent with every major implementation of SQL.
<where clause> ::= WHERE <expression>
Many different implementations of SQL automatically coerce WHERE arguments to BOOLEANs. Some, like MySQL, do not even have a BOOLEAN type. The behavior of requiring the expression to be a BOOLEAN is consistent with Postgres and PartiQL, except that PartiQL makes this determination dynamically, while we prefer the route of statically ensuring the argument will be a BOOLEAN (or NULL/MISSING). Having queries fail during execution due to polymorphic data is something we wish to avoid, because analytic queries can often take hours to run.
GROUP BY provides a means for grouping data into equivalence classes. Aggregations can then be done on these equivalence classes. The main form of GROUP BY is as follows:
GROUP BY e1 AS x1, ..., em AS xm AGGREGATE agg_function(e) AS y1, ..., agg_function'(e') AS yn
The expressions e1 ... em, when taken as an array, provide the equivalence class key. For each array value achieved by evaluating these expressions on a document, a separate group is formed. Unless it is a column reference, each element of the equivalence class key must have an alias for reuse purposes after the GROUP stage. Aliases are automatically generated where needed using the alias rules in Section Rewrites for Generation of GROUP Aliases.
The output of the clause contains one binding tuple per unique value of the equivalence class key, with the values of the group equivalence class key and the results of the aggregation functions named as mentioned. Top-level MISSING values are converted to NULL in GROUP BY expressions, as in all arrays. GROUP keys use the semantics of MongoSQL =, except that all NULLs are considered equivalent. Specifically, this means that the double 3.0 and the integer 3 are grouped into the same group. Because of this, GROUP keys must be statically proved to be comparable to each other via equality. When multiple numeric types are grouped into one group, the type chosen for the output group is undefined behavior and implementation specific². The arguments to agg_function can be any expression.
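As an illustration of the grouping step (a Python sketch; key_exprs and agg_fns are illustrative callables, not MongoSQL syntax), note that Python's numeric equality, like MongoSQL's, puts 3 and 3.0 in the same group:

def group_by(key_exprs, agg_fns, tuples):
    """Partition tuples into equivalence classes keyed by the array of
    group-key values, then apply each aggregation function per group.
    key_exprs are expected to convert MISSING to None (NULL)."""
    groups = {}
    for t in tuples:
        key = tuple(e(t) for e in key_exprs)  # the equivalence-class key
        groups.setdefault(key, []).append(t)
    return [{"keys": key, "aggs": [f(members) for f in agg_fns]}
            for key, members in groups.items()]

# group_by([lambda t: t["a"]], [len], [{"a": 3}, {"a": 3.0}, {"a": 4}])
# == [{'keys': (3,), 'aggs': [2]}, {'keys': (4,), 'aggs': [1]}]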
ALL does nothing in agg_functions as that is the default behavior. It is removed during syntactic rewrite. Thus:
agg_function(ALL e)
will be rewritten as:
agg_function(e)
DISTINCT agg_functions only consider distinct elements of the groups over which they are aggregating. Distinctness is defined using the MongoSQL equality operator, except that NULL is considered equivalent to NULL, and MISSING is converted to NULL. Thus, arguments to DISTINCT agg_functions must be statically proved to be comparable to each other via equality.
The GROUP BY clause, like all clauses, outputs a stream of binding tuples. As described above, the output stream contains one tuple for each group key equivalence class. The binding tuples contain only the group keys and aggregates defined in the GROUP BY clause.
Consider the following example query and the output tuples generated by its GROUP clause. In this example, all group keys are aliased, so every field is nested under the ⟘ datasource name in the output tuples:
SELECT * FROM foo GROUP BY foo.a AS key AGGREGATE sum(a) AS sum;
SELECTin = GROUPout = [⟨⟘: {key: <value>, sum: <value>}⟩,
...]
If there are one or more unaliased group keys, those keys are nested under their original namespaces in the binding tuple:
SELECT * FROM foo GROUP BY a AGGREGATE sum(a) AS sum;
SELECTin = GROUPout = [⟨⟘: {sum: <value>}, foo: {a: <value>}⟩,
...]
² mongod 4.2 chooses the first type seen during grouping.
Using an aggregate function in the SELECT clause creates an implicit GROUP BY with only one group. The explicit GROUP BY produced by this rewrite uses NULL as the group key, though grouping by any constant value would be equivalent.
SELECT ..., ... agg_function(x) ... AS y, ... FROM <from item>
will be rewritten as:
SELECT ..., ... agg_function(x) ... AS y, ... FROM <from item> GROUP BY NULL
SELECT ..., ... agg_function(e) ..., ... FROM <from item> GROUP BY ... AGGREGATE ...
will be rewritten as:
SELECT ..., ... _aggN ..., ... FROM <from item> GROUP BY ... AGGREGATE agg_function(e) AS _aggN, ...
Duplicated aggregate function expressions, e.g., two instances of SUM(x + 1), should only be computed once in the AGGREGATE section for efficiency reasons. The same alias can be used in each position.
GROUP BY ... HAVING ... agg_function(e) ...
will be rewritten as:
GROUP BY ... AGGREGATE agg_function(e) AS _aggn HAVING ... _aggn ...
As above, duplicated aggregation function expressions will be rewritten to occur only once in the AGGREGATE phrase of the clause.
When the expressions e1 ... en+m are non-reference expressions
GROUP BY e1, ..., en AGGREGATE en+1, ..., en+m
will be rewritten as:
GROUP BY e1 AS _groupKey1, ..., en AS _groupKeyn
AGGREGATE en+1 AS _agg1, ..., en+m AS _aggm
When an expression in e1 ... en is a reference to a top-level field, we do not generate an alias for the field, so that it can be referenced by its fully qualified name in subsequent clauses. Since we can't distinguish between top-level field references and subpaths during syntactic rewrites, we will skip generating an alias for any GROUP BY key that might resemble a top-level field.
When both SELECT and HAVING contain agg_function expressions, the SELECT agg_function expressions are added to AGGREGATE first, followed by HAVING. If there is duplication of agg_function expressions, the position is determined by the first instance of the duplicated agg_function expression. This necessarily has an impact on alias generation. For example:
SELECT SUM(x), COUNT(y)
FROM foo
GROUP BY z
HAVING COUNT(y) + SUM(x) + SUM(y) > 20
will be rewritten as:
SELECT VALUE {'_1': _agg1, '_2': _agg2}
FROM foo AS foo
GROUP BY z AS z
AGGREGATE SUM(x) AS _agg1, COUNT(y) AS _agg2, SUM(y) AS _agg3
HAVING _agg2 + _agg1 + _agg3 > 20
MongoSQL supports grouping by aliases. For example, MongoSQL can translate the query SELECT calculation AS calc FROM foo GROUP BY calc.
MongoSQL rewrites GROUP BY keys that are aliases in the SELECT clause to the corresponding aliased expression in the SELECT clause.
SELECT e1 AS x1, e2 AS x2, … GROUP BY x1, x2, …
will get rewritten to:
SELECT x1, x2 … GROUP BY e1 AS x1, e2 AS x2, …
The following are the aggregation functions supported by MongoSQL. Each one is evaluated on all of the specified elements from each value in a group as determined by the group key value.
-
ADD_TO_ARRAY - Pushes the argument to the end of an array; the total output of this function will be an array.
- The type of the argument to ADD_TO_ARRAY does not matter.
-
ADD_TO_SET - Pushes the argument to the end of an array, removing duplicates; the total output of this function will be an array with all duplicate items removed. Duplicates are determined using MongoSQL's = operator. Note that ADD_TO_SET(x) is equivalent to ADD_TO_ARRAY(DISTINCT x), and is maintained for compatibility with MQL.
- The type of the argument to ADD_TO_SET does not matter.
-
AVG - Takes the average of all the arguments.
- The argument must be statically typed to a numeric type.
-
COUNT - Counts the number of elements.
-
COUNT(*) counts all values unconditionally.
-
COUNT(<expression>) counts all values for which the expression does not result in NULL or MISSING.
-
COUNT(col1, col2, ...) counts combinations of the specified columns where at least one of the columns has a non-NULL, non-MISSING value.
-
COUNT(DISTINCT *) unconditionally counts the number of distinct documents in the result set.
-
COUNT(DISTINCT <expression>) conditionally counts distinct values of the expression.
-
COUNT(DISTINCT col1, col2, ...) counts distinct combinations of the specified columns.
-
The type of the argument to COUNT does not matter.
Note: COUNT(DISTINCT? col1, col2, ..., coln) is syntactically equivalent to COUNT(DISTINCT? {'_1': col1, '_2': col2, ..., '_n': coln}). Implementations may, but are not required to, implement this as a syntactic rewrite.
-
-
FIRST - Returns the first element in the group. Deterministic only when the input has deterministic order, otherwise undefined.
- The type of the argument to FIRST does not matter.
-
LAST - Returns the last element in the group. Deterministic only when the input has deterministic order, otherwise undefined.
- The type of the argument to LAST does not matter.
-
MAX - Returns the max element as ordered by the MongoSQL > operator.
- The argument must be statically typed to be comparable via the > operator.
-
MERGE_DOCUMENTS - Returns a document formed by successively merging documents, with the previous element used as the left hand side. In the case of duplicate keys, the value of the key in the new element is kept.
- The argument must be statically typed as DOCUMENT, and thus MERGE_DOCUMENTS(DISTINCT x) is not allowed at this time. As with FIRST and LAST, the output is only deterministic when the input has deterministic ordering.
-
MIN - Returns the min element as ordered by the MongoSQL < operator.
- The argument must be statically typed to be comparable via the < operator.
-
STDDEV_POP - Returns the standard deviation of all elements over the entire group population.
- The argument must be statically typed to a numeric type. See more Here.
-
STDDEV_SAMP - Returns the standard deviation of a sample of all elements in the group.
- The argument must be statically typed to a numeric type. See more Here.
-
SUM - Takes the sum of all the arguments.
- The argument must be statically typed to a numeric type.
The HAVING clause operates the same as a WHERE clause, but is applied after the GROUP BY clause, meaning it can reference aliases defined in the GROUP BY and can contain expressions with agg_functions; only aliases defined in the GROUP BY are available to the HAVING clause. This is necessary for filters that need the values computed by the GROUP BY, as the WHERE clause is applied before GROUP BY. Just like a WHERE clause, the HAVING clause takes an expression that must statically have type BOOL or NULL, and may evaluate to MISSING. HAVING agg_function applications will be rewritten as follows:
GROUP BY ... AGGREGATE ... HAVING ... agg_function(e) ...
will be rewritten as:
GROUP BY ... AGGREGATE ... agg_function(e) AS _n ... HAVING ... _n ...
The alias _n is derived numbering left to right as in Section SELECT Clause.
<group by clause> ::= GROUP BY <group key list> <aggregations>?
<having clause> ::= HAVING <expression>
<group key list> ::= <group key> ("," <group key>)*
<group key> ::= <expression> (AS? <identifier>)?
<aggregations> ::= AGGREGATE <aggregation function application> AS? <identifier> ("," <aggregation function application> AS? <identifier>)*
<aggregation function application> ::= <aggregation function> "(" (DISTINCT | ALL)? <expression> ("," <expression>)* ")"
<aggregation function> ::= ADD_TO_ARRAY | ADD_TO_SET | AVG | COUNT
| FIRST | LAST | MAX | MERGE_OBJECTS | MIN | PUSH | STDDEV_POP
| STDDEV_SAMP | SUM
SQL's ORDER BY clause provides a way to order a result set by one or more sort keys. Each sort key can be a column reference, or can be an integer literal referring to a SELECT expression by its position in the select expr list. Sort keys that are column references can be compound identifiers. These compound identifiers can be qualified with datasource names or refer to document subfields. Name resolution follows the Scoping Rules.
The semantics for MongoSQL's ordering are consistent with the behavior described by section 13.1 of the SQL-92 Specification, which we will quote here instead of attempting to rephrase:
- If an <order by clause> is specified, then the ordering of rows of the result is effectively determined by the <order by clause> as follows:
a) Each <sort specification> specifies the sort direction for the corresponding sort key Ki. If DESC is not specified in the i-th <sort specification>, then the sort direction for Ki is ascending and the applicable <comp op> is the <less than operator>. Otherwise, the sort direction for Ki is descending and the applicable <comp op> is the <greater than operator>.
b) Let P be any row of the result table and let Q be any other row of that table, and let PVi and QVi be the values of Ki in these rows, respectively. The relative position of rows P and Q in the result is determined by comparing PVi and QVi according to the rules of Subclause 8.2, "<comparison predicate>", where the <comp op> is the applicable <comp op> for Ki, with the following special treatment of null values. Whether a sort key value that is null is considered greater or less than a non-null value is implementation-defined, but all sort key values that are null shall either be considered greater than all non-null values or be considered less than all non-null values. PVi is said to precede QVi if the value of the <comparison predicate> "PVi <comp op> QVi" is true for the applicable <comp op>.
c) In the result table, the relative position of row P is before row Q if and only if PVn precedes QVn for some n greater than 0 and not greater than the number of <sort specification>s and PVi = QVi for all i < n. The relative order of two rows that are not distinct is implementation-dependent.
The only addition we make to the ordering behavior described in the SQL 92 Specification is to clarify an implementation-defined behavior:
- MongoSQL sorts NULL and MISSING (which are considered equivalent values when sorting) before all other values. This is consistent with the behavior of MongoSQL's less-than operator.
All positional select-expression references in the ORDER BY clauses are rewritten to be expressions. We check each sort key to determine whether it needs to be rewritten; if a sort key is any expression other than an integer literal, no transformation is needed.
If a sort key is an integer literal, then it is treated as a one-indexed reference to the list of select expressions. We perform a syntactic rewrite, substituting the alias of the select expression from the indicated position as the new sort key expression.
For example,
SELECT e1 AS a, e2 AS b FROM foo ORDER BY 1, 2
is rewritten to
SELECT e1 AS a, e2 AS b FROM foo ORDER BY a, b
There are a few circumstances under which this rewrite will fail. Queries that lead to these circumstances are not allowed in MongoSQL:
- A positional sort key is used with SELECT VALUE
- A positional sort key is used with a select list containing a star expression
The default sort key direction is ASC if not specified by the user. This is made explicit by a rewrite, where
... ORDER BY e, ...
is rewritten to:
... ORDER BY e ASC, ...
The ORDER BY clause requires that all possible values in a sort key expression can be statically verified to be comparable via the > and < operators.
<order by clause> ::= ORDER BY <sort specification> ("," <sort specification>)*
<sort specification> ::= <sort key> <sort direction>?
<sort key> ::= <compound identifier> | <integer literal>
<sort direction> ::= ASC | DESC
- Is MongoDB's $sort behavior consistent with its $lt and $gt behavior?
  - Answer: Maybe. We'll have to discover during implementation.
FETCH FIRST is an alias for LIMIT. MongoSQL supports LIMIT and FETCH FIRST interchangeably. Throughout this document, all specifications for LIMIT apply to FETCH FIRST.
LIMIT and OFFSET allow users to retrieve only part of the rows generated by a query. Both the LIMIT and OFFSET arguments must be positive integers. Using LIMIT/OFFSET without ORDER BY does not guarantee a deterministic result: the same query may return different rows, or the same rows in a different order, across executions.
If a LIMIT number is provided, no more than that number of rows will be returned. If an OFFSET number is provided, that number of rows is skipped before returning rows.
When LIMIT and OFFSET are both set, the first OFFSET rows are skipped before returning the rest of the results, which will contain no more than LIMIT rows. LIMIT i, j is a shorthand for LIMIT i OFFSET j.
LIMIT and OFFSET can be used in subqueries, unlike in some SQL dialects.
Formally, for LIMIT i:
LIMITout = [
  x
  for x ∊ [x1, ..., xi]
  where [x1, ..., xi, ..., xj] = LIMITin
]
Formally, for OFFSET i:
OFFSETout = [
  x
  for x ∊ [xi+1, ..., xj]
  where [x1, ..., xi, ..., xj] = OFFSETin
]
<limit clause> ::= LIMIT <integer literal> ("," <integer literal>)?
| FETCH (FIRST | NEXT) <integer literal> ROW(S)? ONLY
<offset clause> ::= OFFSET <integer literal>
LIMIT i, j
will be rewritten as
LIMIT i OFFSET j
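For example (foo is a hypothetical datasource):
SELECT * FROM foo ORDER BY a LIMIT 10, 20
is rewritten to:
SELECT * FROM foo ORDER BY a LIMIT 10 OFFSET 20
which skips the first 20 rows of the sorted result and then returns at most 10 rows.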
The LIMIT and OFFSET clauses are not part of the SQL-92 standard, and different database providers have slightly different implementations. Since SQL:2008, the FETCH FIRST clause has been part of the SQL standard; however, it is not yet the most widely supported syntax.
LIMIT/OFFSET is the best-supported syntax among database providers; beyond those, PartiQL and Presto also support LIMIT and OFFSET. Because we are targeting SQL-92 compatibility, SQL:2008 support is not a requirement, so we choose to support LIMIT/OFFSET syntax over FETCH FIRST.
MongoSQL provides a UNION and UNION ALL set operator for taking the union over the result sets of two select queries. The UNION operator removes duplicate rows from the result set, while the UNION ALL operator does not. The result set returned by these operators does not have a defined order.
MongoSQL does not support the INTERSECT or EXCEPT set operations.
UNION combines the results of two queries and returns all distinct documents from the combined result set
UNION ALL outputs all the documents from each side of the UNION ALL. For example, consider the output of the UNION ALL in the following query:
SELECT * FROM X UNION ALL SELECT * FROM Y
UNIONout = [x for x ∊ X, y for y ∊ Y]
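To make the difference between the two operators concrete, consider a sketch using array datasources (a construct used in examples later in this document):
SELECT * FROM [{a: 1}, {a: 1}] AS x UNION ALL SELECT * FROM [{a: 1}] AS y
returns three documents, while the same query with UNION in place of UNION ALL returns a single document, since the duplicates are removed.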
<set query> ::= <select query> <set operator> <select query>
<set operator> ::= UNION ALL?
- Should we find a way to unify the union set operator and union join in the abstract model? It's a little strange to have two different places in the semantics where we're implementing what is essentially the same operation.
  - Proposed Answer: Yes, that would be a good thing in the long run, but not a priority at the moment. The semantics currently outlined in this section would all be forward-compatible with such a unification.
  - The work for this unification is tracked in SQL-118.
- INTERSECT and EXCEPT
  - Though these are part of the SQL-92 spec, they are not supported by MongoSQL.
  - Since not all SQL databases support them (notably MySQL), it seems unlikely that BI Tools will require them.
  - If needed, they can be added in the future without breaking changes.
  - MQL does not support any set operations other than UNION, so adding support for INTERSECT and EXCEPT would require non-trivial additions to the underlying database engine.
Unless we can statically prove that there are no duplicate top-level keys in the documents to be merged, we disallow the query with a static error. If the keys of the documents in a subquery's output binding tuples cannot be statically enumerated at query planning time (in the case of a SELECT *, for example), we assume that conflicts exist and raise a static error. For example, the query
SELECT * FROM (
SELECT * FROM foo AS foo
CROSS JOIN bar AS bar
) AS derived
will result in an error:
The keys of datasources 'foo', 'bar' are not enumerable, and may be ambiguous. Try replacing 'SELECT *' with direct references, or providing schemata for 'foo', 'bar'.
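A hedged sketch of the suggested fix, replacing SELECT * with direct references (the field names a and b are hypothetical):
SELECT * FROM (
SELECT foo.a AS a, bar.b AS b FROM foo AS foo
CROSS JOIN bar AS bar
) AS derived
Since the derived table's keys are now statically enumerable and provably distinct, no error is raised.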
Different implementations of SQL handle ambiguous references in different ways:
- MySQL reports duplicate key errors in the same cases we will, insofar as all MySQL tables have schemata.
- Postgres allows duplicate keys to be displayed from a SELECT * at the top level; that is, SELECT * FROM (SELECT * FROM foo AS f, foo AS f2) AS derived is allowed. However, there is no way to reference a key from the derived table in the SELECT clause; only SELECT * works. Additionally, top-level queries may reference a column multiple times.
- SQLite also allows duplicate keys, as Postgres does, but the SELECT * results slightly mangle the names of duplicate columns such that no name is strictly duplicated; they are still not accessible.
- PartiQL allows duplicate field names in results, so it does nothing special to handle ambiguous fields in derived tables. If a field is selected by name, the first field with that name from left to right is chosen.
- Presto operates similarly to Postgres, allowing duplicate keys to be displayed from a SELECT * at the top level; that is, SELECT * FROM (SELECT * FROM foo AS f, foo AS f2) AS derived is allowed. However, there is no way to reference a key from the derived table in the SELECT clause; only SELECT * works.
While any of these approaches would be feasible for MongoSQL, we choose to report ambiguity as an error as early as possible (at the point of the derived-table query rather than at the uses of those derived values). This results in an easier implementation and better immediate feedback for users who may be writing queries that take hours or days to run. Additionally, duplicate keys in MongoDB documents are undefined behavior.
<expression> ::= <unary operator expression>
| <binary operator expression>
| <is operator expression>
| <like operator expression>
| <between operator expression>
| <case operator expression>
| <scalar function expression>
| <subquery expression>
| <document expression>
| <document field access expression>
| <array expression>
| <array index expression>
| <array slice expression>
| <identifier>
| <parenthesized expression>
Identifiers in MongoSQL refer to databases, tables, and columns. The SQL-92 spec says that by default identifiers should only contain simple Latin letters, digits, and underscores, but it also explains that in different contexts the character set could be expanded. MongoSQL identifiers support most UTF-8 characters. The only exception is the null character, '\x00', which is disallowed in BSON keys according to the BSON spec.
There may be some semantic meaning associated with certain characters. For example, the grammar proposes using the "." character to create compound identifiers. The rest of this document includes other examples of such characters. SQL-92 defines delimited (quoted) and regular (unquoted) identifiers. In MongoSQL, regular identifiers are restricted to a limited subset of identifiers to avoid conflicts with characters that have other semantic meaning; for an identifier to include such a character, it must be delimited. For example, among other things, an identifier must be delimited if it begins with a digit or if it conflicts with a reserved keyword.
There should exist a bijection (a one-to-one mapping) from MongoSQL identifiers to valid BSON keys. Given the above restrictions on regular identifiers, this bijection exists specifically from delimited MongoSQL identifiers to BSON keys.
Since a valid BSON key could contain the identifier delimiter characters (" and `), MongoSQL identifiers must be able to contain these characters. Of course, to include only double quotes or only backticks, the other delimiter can be used (i.e. `"quoted"` or "`backticked`"). To include a delimiter character in an identifier delimited by that same character, double it3 (i.e. "contains_one""" or `contains_one```, which correspond to contains_one" and contains_one`, respectively).
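For example, assuming fields with these names exist, all of the following are valid delimited identifiers:
SELECT "select", `field.with.dots`, "contains_one""" FROM foo
Here, "select" lets a reserved keyword be used as an identifier, `field.with.dots` prevents the dots from being parsed as compound-identifier separators, and "contains_one""" references the field named contains_one".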
Identifiers should always be case-sensitive, whether delimited or not. This is consistent with MongoDB identifiers for database, collection, and field names.
3 This "double it" practice is used by PostgreSQL and MySQL.
Identifiers are used for all aliases in MongoSQL. In most cases, it is a static semantic error for aliases to be used more than once in the same MongoSQL clause. The exception to this is that aliases can be repeated on both sides of a UNION. This also applies to automatically generated aliases (see Sections SELECT Clause and FROM Clause) conflicting with user aliases or user field names.
MongoSQL keywords (such as SELECT, FROM, JOIN, etc.) are generally not allowed to be used as undelimited identifiers, to simplify lexing/parsing.
We reserve the right to add new keywords to MongoSQL in the future, though we will never introduce a keyword that begins with an underscore. This means that a user who wants an ironclad guarantee that a query will not start failing with a newer version of MongoSQL (because it uses an identifier that has become a keyword) should delimit all identifiers that do not begin with an underscore.
In most cases, however, such a drastic approach is likely unnecessary. Some mitigating factors to consider:
- If a new keyword is introduced, it is very likely that it will be a keyword in another existing SQL dialect, or a stage name in MQL
- When possible, we will allow newly introduced keywords to continue to be used as identifiers
<compound identifier> ::= <identifier> ("." <compound identifier>)?
<identifier> ::= <regular identifier> | <delimited identifier>
<regular identifier> ::= ([A-Za-z] | "_")[A-Za-z0-9_]*
<delimited identifier> ::= " <identifier character>* " | ` <identifier character>* `
<identifier character> ::= [^\x00]
See Identifiers spec tests here and here.
See Aliases spec tests here.
- Should we require that identifiers are valid UTF-8?
  - Proposed Answer: Require UTF-8 identifiers to start, and relax the requirement later on if it becomes an actual problem for customers.
  - Though the BSON spec requires key names to be valid UTF-8, MongoDB does not actually validate this, so collections may contain documents with non-UTF-8 keys.
  - If we require our identifiers to be valid UTF-8, it would mean that we could not construct an identifier to reference non-UTF-8 fields that, despite being invalid BSON, may still exist in a MongoDB collection.
The proposed solution is to use delimited identifiers when semantically significant characters (e.g. ".") are part of an identifier. An alternative to this would be to introduce a mechanism for escaping characters in identifiers, such as an escape character. This is not desirable since there is no precedent for it in other SQL dialects, and it is not required or even considered by the SQL-92 spec. Additionally, there is no equivalent mechanism in MongoDB, so introducing one in MongoSQL seems unnecessary and unreasonable.
SQL-92 specifies how comparisons between identifiers should behave: comparisons between regular identifiers and regular identifiers are case-insensitive; comparisons between regular identifiers and delimited identifiers are case-insensitive; comparisons between delimited identifiers and delimited identifiers are case-sensitive (SQL-92 5.2.10-14). Some other dialects deviate from this behavior:
- PartiQL seems to be case-sensitive
- N1QL is case-sensitive
- SQL++ does not specify
- Presto is case-insensitive, but it looks like they built (or are building?) an API to configure case-sensitivity
- Drill is case-insensitive; however, data sources may be case-sensitive
Given these deviations, in addition to MongoDB's native case-sensitivity, it is reasonable for MongoSQL to always consider all identifiers case-sensitive.
MongoSQL supports literals for booleans, null, numbers, and strings. Booleans, null, and strings are represented as expected, as seen in the Grammar. Importantly, strings are enclosed in single quotes. To include a single quote character in a string literal, double it.4 Numbers are slightly more nuanced: literal integers are typed as INT when possible (i.e. values within the int32 range) and LONG otherwise, and literal floating point numbers or scientific notation numbers are always considered to have type DOUBLE. Note that the grammar does not specify signs for numeric literals. To write a literal negative number, users can use the unary minus operator before the numeric literal (this is effectively the same as supporting literal negative numbers).
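For example, in the hypothetical query below, 1 is typed as INT; 2147483648, which falls outside the int32 range, is typed as LONG; 1.5 and 1e2 are typed as DOUBLE; and 'it''s' is the string it's:
SELECT 1, 2147483648, 1.5, 1e2, 'it''s' FROM foo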
MongoSQL does not support literals for every type, for example OBJECTID, BSON_DATE, and DECIMAL have no literal syntax. For such types, pseudo-literal values can be obtained by using the CAST operator to go from a string or numeric representation of those types to their respective type. Some types may also have "constructor" functions which alias the relevant CAST invocations. See the Type Conversions section for more details.
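For instance, a pseudo-literal for a type with no literal syntax might be obtained as follows (a hedged sketch; the exact conversion rules are specified in the Type Conversions section):
CAST('1.5' AS DECIMAL)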
4 The "double it" practice is used by PostgreSQL and MySQL.
<literal> ::= <null literal> | <boolean literal>
| <string literal> | <numeric literal>
<null literal> ::= NULL
<boolean literal> ::= TRUE | FALSE
<string literal> ::= ' <any utf-8 character>* '
<numeric literal> ::= <integer literal> | <double literal>
<integer literal> ::= 0 | ([1-9] [0-9]*)
<double literal> ::= (<integer literal> "." <exp component>?)
| (<integer literal>? "." [0-9]+ <exp component>?)
| (<integer literal> <exp component>)5
<exp component> ::= (e | E) ("+" | "-")? [0-9]+6
5 Note: MongoDB treats all numbers with exponent components as doubles, so MongoSQL does the same, even when the value is a whole number (e.g. 1.2e2).
6 Leading zeros are intentionally allowed after the "e", as in many SQL dialects.
See spec tests here.
- Should we require that string literals are valid UTF-8?
  - Proposed Answer: Require UTF-8 string literals to start, and relax the requirement later on if it becomes an actual problem for customers.
  - The BSON spec technically requires this, but MongoDB does not check it during validation.
  - We don't have any data on how common it is for customers to have invalid UTF-8 strings in practice.
  - If we do require string literals to be valid UTF-8, it could make it difficult for customers to write certain queries over data containing non-UTF-8 strings.
Extended JSON was considered as an option for representing all BSON/MongoSQL types as literals, however this was ultimately rejected. First, it is not clear that users would benefit from the ability to represent all BSON types as literals. Also, while extended JSON is technically human-readable and writable, it is not an ergonomic experience to type out literals in that form. Lastly, the object notation used in MongoSQL is JSON-like but not guaranteed to be completely JSON-compliant, so interspersing it with extended JSON values could prove to be challenging from a parsing perspective.
MongoSQL does not have a literal syntax for every data type. For example, regular expression literals of the form /pattern/options could be supported instead of or in addition to the REGEX() function.
A parenthesized expression is an expression grouped by parentheses, as in the majority of programming languages and other SQL-92-compliant SQL dialects. Any time infix operators are present, parentheses (or a similar mechanism) may be necessary to distinguish the order of operations. MongoSQL has several infix operators, such as '+' and '::'. For example, the value of 1 + 2 * 3 is 7, while the value of (1 + 2) * 3 is 9.
<parenthesized expression> ::= "(" <expression> ")"
MongoSQL provides several operators for the built-in data types. Specifically, it supports all operators defined in the SQL-92 spec with only one exception. There is no support for the INTERVAL data type in MongoSQL, so arithmetic operators involving BSON_DATE and INTERVAL values are unsupported. See the Data Types section for more details about supported types.
In addition to the SQL-92 operators, MongoSQL also has array and document operators. See the Document and Field-Access Expressions section for details about the document field access operators, and see the Array, Indexing, and Slicing Expressions section for details about the array index access operator.
Note, the SQL-92 spec describes operators as monadic, dyadic, or n-adic, corresponding to operators having one operand, two operands, or a variable number of operands, respectively (SQL-92 3.1). This document uses the adjectives unary, binary, and n-ary to describe operators.
Operators can be divided into several different groups. There are string operators, arithmetic operators, comparison operators, boolean operators, control-flow operators, and type operators. The following sections explain the behavior of each type of operator.
String operators are those which operate on strings. The operands of string operators must statically have type NULL or STRING, and may evaluate to MISSING. If an operand evaluates to NULL or MISSING, the result of the operation is NULL.
The binary string operator || specifies string concatenation. It returns the string made by joining its string operands in the order given. The result of string concatenation either has type NULL or STRING.
The 3-ary string operator LIKE determines whether a string matches a pattern. The third argument is optional. When provided, it specifies an escape character used in the pattern. If the third argument is not specified, the default escape character, '\', is used. In the pattern, an unescaped underscore character '_' represents any single character and an unescaped percent character '%' represents any number of characters, even zero characters.
To paraphrase SQL-92: "If there is not a partitioning of the pattern into substrings such that each substring has length 1 or 2, no substring of length 1 is the escape character, and each substring of length 2 is the escape character followed by either the escape character, an '_', or a '%'," then the result is NULL (SQL-92 8.5).
The first two operands do not need to be string literals; they can be any expressions that statically have type NULL or STRING. If provided, the optional third argument must be a string literal consisting of exactly one character; any other value produces a static error. The result of LIKE has type NULL or BOOL.
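For example (foo and its fields are hypothetical):
SELECT * FROM foo WHERE name LIKE 'A%'
matches values of name that begin with 'A', while
SELECT * FROM foo WHERE code LIKE 'a\_%'
matches values of code that begin with the two literal characters a_, since the default escape character '\' suppresses the special meaning of the underscore.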
The inverse, NOT LIKE, is syntactic sugar for the negation of the result of LIKE. As in,
e1 NOT LIKE e2
is rewritten as:
NOT (e1 LIKE e2)
Arithmetic operators are those which operate on numeric data types. The operands of arithmetic operations must statically have type NULL or a numeric type --- INT, LONG, DOUBLE, or DECIMAL --- and may evaluate to MISSING. If an operand evaluates to NULL or MISSING, the result of the operation is NULL. There are unary and binary arithmetic operators, described below.
The unary arithmetic operators + and - specify unary addition and unary subtraction, respectively. These unary operators are used to specify the sign of their operands. Unary addition does not change its operand; unary subtraction reverses the sign of its operand. The result of a unary arithmetic operation has the same type as its operand.
The binary arithmetic operators +, -, *, and / specify addition, subtraction, multiplication, and division, respectively. If the value of a divisor is zero, then the result of division is NULL. The operands of binary arithmetic operations do not need to have the same type; any combination of valid operand types (as declared at the beginning of this section) is allowed. When both operand types are numeric, the result of a binary arithmetic operation has a type according to the following table:
| Type of Operand 1 | Type of Operand 2 | Result |
|---|---|---|
| INT | INT | INT |
| INT or LONG | LONG | LONG |
| Any numeric non-DECIMAL | DOUBLE | DOUBLE |
| Any numeric | DECIMAL | DECIMAL |
For division operations including only INTs and/or LONGs, the result may not be exactly representable as an INT or LONG and will need to be truncated or rounded. For example, 5 / 2 is mathematically 2.5, but the result must be an INT since both operands are INTs. The choice of whether to round or truncate is implementation-defined.
Note that arithmetic operations that result in overflow or underflow have undefined behavior. For example, if 1 (an INT) is added to the maximum INT value, the result exceeds the bounds of the INT type and therefore cannot be represented as an INT. MongoSQL does not specify behavior for such cases.
Comparison operators are those which compare values. The operands of comparison operations must statically have comparable types. In most cases, that means the operands must have the same type (or NULL). The exceptions to this are numeric types. Any combination of numeric types can appear as operands for comparison operators. If an operand evaluates to NULL or MISSING, the result of the operation is NULL. Note, this deviates from MongoDB's comparison semantics.
The binary comparison operators <, <=, <>, =, >, and >= specify less than, less than or equals, not equals, equals, greater than, and greater than or equals, respectively. Document comparison follows MongoDB's equality semantics, requiring exact field name and value matches, with field order being significant. Array comparison requires equal length and equal elements in the same order. Booleans are compared such that FALSE is less than TRUE. Numbers are compared with respect to their algebraic values. Strings are compared lexicographically. Datetimes are compared as expected. The result of a binary comparison operation has either type NULL or BOOL.
The != operator is a non-standard operator that specifies not equals.
Despite being non-standard, it is supported by many competitors and is
likely expected by many users. MongoSQL supports it as well, though it
is rewritten to the standard <> operator:
a != b
is rewritten as:
a <> b
The 3-ary comparison operator BETWEEN specifies a range comparison. Logically, the expression x BETWEEN y AND z is equivalent to x >= y AND x <= z, though this specification does not require that it be rewritten as such. The same type constraints and comparison behaviors described above for the binary comparison operators apply to BETWEEN. The result of a BETWEEN operation has either type NULL or BOOL.
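For example (foo hypothetical):
SELECT * FROM foo WHERE x BETWEEN 1 AND 10
keeps exactly the documents for which x >= 1 AND x <= 10 evaluates to TRUE.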
The inverse, NOT BETWEEN, is syntactic sugar for the negation of the result of BETWEEN. As in,
e1 NOT BETWEEN e2 AND e3
is rewritten as:
NOT (e1 BETWEEN e2 AND e3)
Boolean operators are those which operate on boolean data types. The operands of boolean operations must statically have type NULL or BOOL, and may evaluate to MISSING. If an operand evaluates to NULL or MISSING, the result of the operation is NULL. There are unary and binary boolean operators, described below.
The semantics of the unary boolean operator NOT are described by the truth table below.
| a | NOT a |
|---|---|
| TRUE | FALSE |
| FALSE | TRUE |
| NULL or MISSING | NULL |
The semantics of the binary boolean operators OR and AND are described by the truth table below.
| a | b | a AND b | a OR b |
|---|---|---|---|
| TRUE | TRUE | TRUE | TRUE |
| TRUE | FALSE | FALSE | TRUE |
| TRUE | NULL or MISSING | NULL | TRUE |
| FALSE | TRUE | FALSE | TRUE |
| FALSE | FALSE | FALSE | FALSE |
| FALSE | NULL or MISSING | FALSE | NULL |
| NULL or MISSING | TRUE | NULL | TRUE |
| NULL or MISSING | FALSE | FALSE | NULL |
| NULL or MISSING | NULL or MISSING | NULL | NULL |
MongoSQL supports the CASE operator for control-flow. In this context, "control-flow" refers to conditionally producing values based on criteria argued to the operator. Concretely, a CASE expression consists of one or more "WHEN clauses," which each specify a boolean condition and a result, and an "ELSE clause," which specifies a default result if none of the conditions are TRUE. The result of a CASE expression is the result of the first (leftmost) WHEN clause whose condition is TRUE, or the default result specified by the ELSE clause if none of the conditions are TRUE. The type of the result of a CASE expression is the union of types from the WHEN and ELSE clauses.
SQL-92 specifies two forms of the CASE expression, simple and searched.
A simple CASE expression has the following form:
CASE co WHEN wo1 THEN r1 WHEN wo2 THEN r2 ... ELSE rd END
In this form, the first expression, co, is the "case operand," the WHEN expressions, woi, are the "when operands," and the THEN and ELSE expressions, ri, are the potential result values ("d" stands for "default"). The result of a simple CASE expression is the ri corresponding to the first (leftmost) woi for which co = woi evaluates to TRUE, or rd if none of the comparisons evaluate to TRUE. This is equivalent to a searched CASE expression where each condition is co = woi, though this specification does not require that simple CASE expressions be rewritten as searched CASE expressions. Since the case operand and when operands are compared using the equals operator, they must follow the type constraint rules described in the Comparison Operators section.
A searched CASE expression has the following form:
CASE WHEN c1 THEN r1 WHEN c2 THEN r2 ... ELSE rd END
In this form, the WHEN expressions, ci, are the "search conditions," and the THEN and ELSE expressions, ri, are the potential result values ("d" stands for "default"). The result of a searched CASE expression is the ri corresponding to the first (leftmost) ci that evaluates to TRUE, or rd if none of the conditions evaluate to TRUE. The search conditions must statically have type NULL or BOOL, and may evaluate to MISSING. Again, note that if a condition evaluates to TRUE, its corresponding result is the result of the expression. Therefore, if a condition evaluates to NULL or MISSING, nothing special happens; the result of the expression is the result corresponding to the next TRUE condition, or the default result if none of the remaining conditions evaluate to TRUE.
For either form of the CASE expression, if an ELSE clause is not provided, then ELSE NULL is implicit. As in,
CASE o WHEN e THEN r END
is rewritten as:
CASE o WHEN e THEN r ELSE NULL END
and
CASE WHEN e THEN r END
is rewritten as:
CASE WHEN e THEN r ELSE NULL END
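As a hedged illustration of a searched CASE expression relying on the implicit ELSE NULL (collection and field names hypothetical):
SELECT CASE WHEN score >= 90 THEN 'A' WHEN score >= 80 THEN 'B' END AS grade FROM tests
Documents whose score is below 80, NULL, or MISSING produce a grade of NULL.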
MongoSQL provides the binary IS operator to check the type of an expression. The left operand can be any expression and the right operand can be any MongoSQL type name or the keyword MISSING. The operator returns TRUE if the argued expression evaluates to a value of the argued type, and FALSE if it evaluates to a value of a different type. If the left operand evaluates to MISSING, the operator returns TRUE if the right operand is the keyword MISSING or NULL, and FALSE otherwise. The result of the IS operator therefore always has type BOOL.
The SQL-92 "null predicate", <expr> IS NOT? NULL, returns TRUE if the expression is a NULL value and FALSE otherwise. In MongoSQL, NULL is a type with a single value (also called NULL), and the value MISSING is more analogous to SQL's NULL. To support the SQL-92 "null predicate", MongoSQL has the following semantics for IS NULL and IS MISSING.
| A | A IS NULL | A IS MISSING |
|---|---|---|
| NULL | TRUE | FALSE |
| MISSING | TRUE | TRUE |
Therefore, in MongoSQL, the operation <expr> IS NOT? NULL is not strictly the type-checking operation described at the beginning of this section; it is more like the value-checking operation defined by SQL-92.
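For example (foo hypothetical):
SELECT * FROM foo WHERE a IS NOT NULL
keeps only documents where a is present and is not the NULL value, since, per the table above, A IS NULL is TRUE for both NULL and MISSING.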
The inverse, IS NOT, is syntactic sugar for the negation of the result of IS. As in,
e IS NOT t
is rewritten as:
NOT (e IS t)
<unary operator expression> ::= <unary operator> <expression>
<unary operator> ::= + | - | NOT
<binary operator expression> ::= <expression> <binary operator> <expression>
<binary operator> ::= -
| *
| /
| +
| ||
| AND
| OR
| <comparison operator>
<comparison operator> ::= <
| <=
| <>
| !=
| =
| >
| >=
<is operator expression> ::= <expression> IS NOT? (<type> | MISSING)
<like operator expression> ::= <expression> NOT? LIKE <expression> (ESCAPE <string literal>)?
<between operator expression> ::= <expression> NOT? BETWEEN <expression> AND <expression>
<case operator expression> ::= CASE <when clause>+ <else clause>? END
| CASE <expression> <when clause>+ <else clause>? END
<when clause> ::= WHEN <expression> THEN <expression>
<else clause> ::= ELSE <expression>
See query tests, rewrite tests, and type constraint tests here.
- What are the semantics for comparing different numeric types? Is NumberDouble(20) the same as NumberDecimal(20)?
  - Proposed Answer: Numeric values will be compared according to their true mathematical value.
- Should we support the SQL-92 <null predicate> operator "<expr> IS [NOT] NULL"?
  - Proposed Answer: We can simply support IS NULL as a type-checking operator, which would work since NULL is a type with a single value (also called NULL).
MongoSQL does not support the SQL-92 <boolean test> operator IS. Specifically, MongoSQL does not support <expr> IS NOT? <bool value>. This operator is not strictly required for SQL compatibility. One benefit of not supporting it is that IS can unambiguously be used as the type-check operator.
Initially, we rejected comparisons for structured data (documents and arrays). There were several questions to work through for structured data comparisons (such as how to handle NULL elements). Since structured data types are not part of SQL-92, we didn't need to address this immediately. However, we have now implemented support for document and array comparisons.
We also rejected polymorphic comparisons. Comparisons are not required to be polymorphic for SQL compatibility and being restrictive now does not preclude us from allowing this in the future. In addition to those reasons, there are other considerations that may impact this decision. For example, if MongoSQL needs to support TIME and DATE types, then cross-type comparisons will not easily be able to follow BSON type ordering. Also, if MongoDB ever introduces type collations then prescribing a type ordering in the spec may prevent us from supporting such a feature in the future. Most competitor documentation does not elaborate on type constraints on comparison operators:
- PrestoDB, N1QL, SQL++, and PartiQL do not specify type constraints
- MySQL allows polymorphic comparisons via implicit conversions
- PostgreSQL does not allow polymorphic comparisons
- Rockset allows polymorphic comparisons but returns null if the two operands have different types
This specification does not require that implementations rewrite BETWEEN operations as ANDs of less-than and greater-than operations. This is left up to implementations since prescribing the rewrite would necessitate that the first operand be copied multiple times (once for each comparison), which could have performance implications that this spec aims to avoid.
Similarly, it does not require that implementations rewrite simple CASE expressions to searched CASE expressions. This is also left up to implementations since prescribing the rewrite would necessitate that the first operand be copied multiple times (once for each WHEN clause).
MongoSQL requires most scalar functions defined in the SQL-92 spec, with a few exceptions highlighted below. In addition to the SQL-92 functions, MongoSQL also specifies a general syntactic construct that implementations can use to introduce additional scalar functions. For example, not all scalar functions from MongoDB aggregation are required by this spec, but any could be included by an implementation.
The required scalar functions are described below. They are separated into groups based on their behavior, input types, or output types. The descriptions cover the basics of what the functions do as well as any static type constraints on their arguments (where relevant); see the spec tests for full behavior details.
Conditional scalar functions are those which specify a conditional value. SQL-92 requires the NULLIF and COALESCE conditional functions. These functions are equivalent to CASE operations, though this specification does not require that they be rewritten as such.
The NULLIF(v1, v2) scalar function returns NULL if v1 = v2 is TRUE, and otherwise returns v1. Since the arguments are compared for equality, they must be statically comparable according to the rules described in the Comparison Operators section. NULLIF(v1, v2) is equivalent to the following searched CASE operation:
CASE WHEN v1 = v2 THEN NULL ELSE v1 END
The COALESCE(v1, v2, ..., vn) scalar function returns the first non-NULL argument, or NULL if there are no non-NULL arguments.
COALESCE(v1, v2) is equivalent to the following searched CASE operation:
CASE WHEN v1 IS NOT NULL THEN v1 ELSE v2 END
and COALESCE(v1, v2, ..., vn) is equivalent to:
CASE WHEN v1 IS NOT NULL THEN v1 ELSE COALESCE(v2, ..., vn) END
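A hedged usage sketch (collection and field names hypothetical): because IS NULL is TRUE for both NULL and MISSING, COALESCE can provide a fallback for fields that are null or absent, as in:
SELECT COALESCE(nickname, name, 'unknown') AS display_name FROM users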
The type conversion scalar function CAST converts an expression to a specified type. See the Type Conversions section for details about the CAST scalar function.
Array scalar functions are those which operate on arrays. MongoSQL specifies two such scalar functions, SLICE and SIZE. See the Array, Indexing, and Slicing Expressions section for details about the SLICE scalar function.
The SIZE(array) scalar function counts and returns the total number of items in an array. The semantics of the SIZE function mostly match those of the aggregation operator $size. Notably, the type constraints are more relaxed than $size. The argument must statically have type ARRAY or NULL, and may be missing. If the argument is NULL or MISSING, the result is NULL.
Numeric value scalar functions are those which return numeric values. SQL-92 requires the POSITION, CHAR_LENGTH, OCTET_LENGTH, BIT_LENGTH, and EXTRACT numeric value functions.
The POSITION(substring IN string) scalar function returns the position of the first occurrence of substring in the string, or -1 if it does not occur. MongoSQL uses 0-indexing, so the first character is at position 0, the second at position 1, and so on. The arguments must statically have type STRING or NULL, and may be missing. If either argument is NULL or MISSING, the result is NULL.
The function CHAR_LENGTH may also be written as CHARACTER_LENGTH. MongoSQL will rewrite the latter into the former, so that CHAR_LENGTH is the canonical form of the function name.
The length functions CHAR_LENGTH(string), OCTET_LENGTH(string), and BIT_LENGTH(string) return the length of string arguments in terms of characters, bytes, and bits, respectively. The semantics of CHAR_LENGTH match those of the aggregation operator $strLenCP; the semantics of OCTET_LENGTH match those of the aggregation operator $strLenBytes; the semantics of BIT_LENGTH also match those of $strLenBytes, except the result is multiplied by 8. For each, the argument must statically have type STRING or NULL, and may be missing. If the argument is NULL or MISSING, the result is NULL.
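For example, for a string containing a multibyte character (assuming UTF-8 encoding): CHAR_LENGTH('é') is 1, OCTET_LENGTH('é') is 2, and BIT_LENGTH('é') is 16.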
The EXTRACT(field FROM source) scalar function extracts a component of a datetime value. The source argument must statically have type TIMESTAMP or NULL, and may be missing. If it is NULL or MISSING, the result is NULL.
The following list of functions operate on numeric values and return numeric values.
All arguments to these functions must statically have type NULL or a numeric type
— INT, LONG, DOUBLE, or DECIMAL — and may evaluate to MISSING. If an
argument evaluates to NULL or MISSING, the result of the operation is NULL.
If an argument resolves to NaN, the result is NaN. Some functions have
additional special behaviors detailed below.
The ABS ( number ) scalar function returns the absolute value of the given number.
The result has the same type as its operand. If the argument is Infinity or
-Infinity, the result is Infinity.
The CEIL ( number ) scalar function returns the nearest whole number of equal or greater value. The result has the same type as its operand. If the argument evaluates to negative or positive Infinity, the result is negative or positive Infinity, respectively.
The COS ( number ) scalar function returns the cosine of an angle. The argument is specified in
radians. If the argument is of type DECIMAL the return type is DECIMAL. All other numeric argument types
return DOUBLE. If the argument evaluates to negative or positive Infinity, the result of the operation
is NULL.
The DEGREES ( number ) scalar function converts a given number in radians to degrees.
If the argument is of type DECIMAL the return type is DECIMAL. All other numeric argument types
return DOUBLE. If the argument evaluates to negative or positive Infinity, the result
is negative or positive Infinity respectively.
The FLOOR ( number ) scalar function returns the nearest whole number of equal or lesser value. The result has the same type as its operand. If the argument evaluates to negative or positive Infinity, the result is negative or positive Infinity, respectively.
The LOG ( number, base ) scalar function returns the logarithm of a number for the given base. The number must be non-negative, and the base must be greater than 1. Arguments outside of the supported ranges will return NULL. If the number or base is of type DECIMAL, the return type is DECIMAL; all other numeric types return DOUBLE. If the number evaluates to Infinity, the result is Infinity. If the base evaluates to Infinity, the result is 0.
The MOD ( number, divisor ) scalar function divides number by divisor and returns the remainder. The result has the same type as its operands, except when it cannot be represented accurately in that type. In these cases:
- INT will be converted to a LONG if the result is representable as a LONG.
- INT will be converted to a DOUBLE if the result is not representable as a LONG.
- LONG will be converted to DOUBLE if the result is not representable as a LONG.
- If the divisor is a DECIMAL, the result has type DECIMAL.
If number evaluates to negative or positive Infinity, the result is NaN. If divisor evaluates to negative or positive Infinity, the result is number. A divisor of 0 will result in NULL.
The POW ( number, exponent ) scalar function returns the number raised to the specified exponent. The result has the same type as its operands, except when it cannot be represented accurately in that type. In these cases:
- INT will be converted to a LONG if the result is representable as a LONG.
- INT will be converted to a DOUBLE if the result is not representable as a LONG.
- LONG will be converted to DOUBLE if the result is not representable as a LONG.
If either the number or the exponent evaluates to Infinity, the result is Infinity. If the exponent evaluates to -Infinity, the result is 0. If the number is -Infinity, the result is Infinity when the exponent is even and -Infinity when the exponent is odd.
The RADIANS ( number ) scalar function returns the given number converted from degrees to radians.
If the argument is of type DECIMAL the return type is DECIMAL. All other numeric argument types
return DOUBLE.
If the argument evaluates to negative or positive Infinity, the result is negative or
positive Infinity respectively.
The ROUND ( number, decimals ) scalar function rounds number to a specified number of digits.
The decimals argument specifies how many decimal points of precision to include in the final result.
The decimals argument must be an integer between -20 and 100. Arguments outside
of the supported ranges will return NULL. The result has the same type as the first operand.
If the first argument resolves to negative or positive infinity, the result is negative or positive infinity respectively.
The SIN( number ) scalar function returns the sine of an angle, specified in radians. The argument must statically have type INT, LONG, DOUBLE, DECIMAL, or NULL, and may evaluate to MISSING. If the argument is NULL or MISSING, the result is NULL. If the argument evaluates to negative or positive Infinity, the result of the operation is NULL.
The SQRT( number ) scalar function returns the square root of a positive number.
If the argument is negative, the operation will return NULL. If the argument evaluates to
Infinity, the result is Infinity.
The TAN( number ) scalar function returns the tangent of an angle, specified in radians.
If the argument is of type DECIMAL the return type is DECIMAL. All other numeric argument types
return DOUBLE. If the argument evaluates to negative or positive Infinity, the result of the operation is NULL.
String value scalar functions are those which return string values. SQL-92 requires the SUBSTRING, UPPER, LOWER, CONVERT, TRANSLATE, and TRIM string value functions; however, MongoSQL does not require CONVERT or TRANSLATE.
The SUBSTRING(string FROM start FOR length) scalar function returns a substring of the first argument, starting from start and extending either the specified length, if provided, or to the end of the string, if length is not provided or is negative. SQL-92 specifies a runtime error when the length is negative; MongoSQL deviates from this to be consistent with MongoDB aggregation by returning the substring extending to the end of the string. Note that MongoSQL uses 0-indexing for consistency with MongoDB, so start should be 0 to start at the first character, 1 to start at the second, and so on. The unit of the start and length arguments is the codepoint, not the byte; this allows SUBSTRING to work in a reasonable way for strings containing multibyte characters. The string argument must statically have type STRING or NULL, and may be missing. The start and length arguments must statically have type INT or NULL, and may be missing. If any argument is NULL or MISSING, the result is NULL.
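For example, with 0-indexing and codepoint units as described above, SUBSTRING('MongoSQL' FROM 0 FOR 5) evaluates to 'Mongo', and SUBSTRING('MongoSQL' FROM 5) evaluates to 'SQL'.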
The UPPER(string) and LOWER(string) scalar functions, collectively known as "fold" functions, change the casing of their arguments to be all uppercase or all lowercase, respectively. Similar to the aggregation operators $toLower and $toUpper, these functions only have well-defined behavior for strings of ASCII characters. For each, the argument must statically have type STRING or NULL, and may be MISSING. If the argument is NULL or MISSING, the result is NULL.
The TRIM(spec charset FROM string) scalar function removes leading and/or trailing spaces, or an optionally specified charset, from the argued string. The first two arguments are optional; by default, they are BOTH and the space character, respectively. The latter two arguments must statically have type STRING or NULL, and may be missing. If either is NULL or MISSING, the result is NULL.
A charset is a string whose characters, in any order, will all be trimmed from the beginning and/or end of the string, as determined by spec.
As noted, the first two arguments for TRIM are optional and have the default values BOTH and the space character, respectively. Invocations with one or both arguments omitted will be rewritten to include the defaults. Specifically:
TRIM(str)
will be rewritten as:
TRIM(BOTH ' ' FROM str)
and
TRIM(charset FROM str)
will be rewritten as:
TRIM(BOTH charset FROM str)
and
TRIM(spec FROM str)
will be rewritten as:
TRIM(spec ' ' FROM str)
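For example, TRIM('  abc  ') evaluates to 'abc' per the defaults; TRIM(LEADING 'x' FROM 'xxabcxx') evaluates to 'abcxx'; and TRIM(BOTH 'xy' FROM 'yxabcxy') evaluates to 'abc', since every leading and trailing character drawn from the charset is removed.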
- SPLIT(string, delimiter, token number) => string
  - Returns a substring from a string, using a delimiter character to divide the string into a sequence of tokens. The string is interpreted as an alternating sequence of delimiters and tokens. So for the string "abc-defgh-i-jkl", where the delimiter character is '-', the tokens are abc, defgh, i, and jkl. Think of these as tokens 1 through 4. SPLIT returns the token corresponding to the token number. When the token number is positive, tokens are counted starting from the left end of the string; when the token number is negative, tokens are counted starting from the right. If the token number is greater than the number of tokens, the result is an empty string. The string argument must statically have type STRING or NULL, and may evaluate to MISSING. The token number must statically have type INT or NULL, and may evaluate to MISSING. If any argument is NULL or MISSING, or the delimiter evaluates to an empty string, the result is NULL.
- REPLACE(string, redex, replacement) => string
  - Searches string for occurrences of redex, and replaces those occurrences with replacement. This is part of the ODBC standard. As with all string functions, redex and replacement are case-sensitive.
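For example, reusing the tokenization described above, SPLIT('abc-defgh-i-jkl', '-', 2) evaluates to 'defgh', and SPLIT('abc-defgh-i-jkl', '-', -1) evaluates to 'jkl'.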
Datetime value scalar functions are those which return datetime values. SQL-92 requires the CURRENT_DATE, CURRENT_TIME, and CURRENT_TIMESTAMP datetime value functions; however, MongoSQL does not require CURRENT_DATE or CURRENT_TIME.
The CURRENT_TIMESTAMP scalar function returns the current timestamp, "with the time zone displacement equal to the current time zone displacement of the SQL-session" (SQL-92 6.8).
The remaining semantics for MongoSQL's CURRENT_TIMESTAMP are consistent with the behavior described by section 6.8 of the SQL-92 Specification, which we will quote here instead of attempting to rephrase:
2) If specified, <timestamp precision> ... determines the precision of the ... timestamp value returned. (Default value is 6 if none is provided). 3) If an SQL-statement generally contains more than one reference to one or more <datetime value function>s, then all such references are effectively evaluated simultaneously. The time of evaluation of the <datetime value function> during the execution of the SQL-statement is implementation-dependent.
- DATEADD(date_part, interval, date) => datetime
  - Returns the specified date with the specified number of interval units added to the specified date_part of that date.
- DATEDIFF(date_part, date1, date2, [start_of_week]) => int
  - Returns the difference between date1 and date2 expressed in units of date_part.
- DATETRUNC(date_part, date, [start_of_week]) => datetime
  - Truncates the specified date to the accuracy specified by date_part, and returns a new date.
The date argument must be a MongoSQL datetime value, and date_part must be a valid token. For DATEDIFF and DATETRUNC, the default start_of_week is "sunday" (case-insensitive). If an argument evaluates to NULL or MISSING, the result of the operation is NULL.
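A hedged usage sketch (the pseudo-literal date is obtained via CAST, as described in the Literals section; foo is hypothetical):
SELECT DATEADD(DAY, 7, CAST('2021-01-01' AS BSON_DATE)) AS next_week FROM foo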
See query tests, rewrite tests, and type constraint tests here.
<scalar function expression> ::= <nullif function>
| <coalesce function> | <size function> | <position function>
| <character length function> | <octet length function>
| <bit length function> | <extract function>
| <substring function> | <fold function> | <trim function>
| <current timestamp function> | <split function>
| <date function> | <abs function>
| <ceiling function> | <cos function> | <degrees function>
| <floor function> | <log function> | <mod function>
| <power function> | <radians function> | <round function>
| <sin function> | <sqrt function> | <tan function>
| <regular identifier> "(" <expression>* ")"
<nullif function> ::= NULLIF "(" <expression> "," <expression> ")"
<coalesce function> ::= COALESCE "(" <expression> ("," <expression>)* ")"
<size function> ::= SIZE "(" <expression> ")"
<position function> ::= POSITION "(" <expression> IN <expression> ")"
<character length function> ::= (CHAR_LENGTH | CHARACTER_LENGTH) "(" <expression> ")"
<octet length function> ::= OCTET_LENGTH "(" <expression> ")"
<bit length function> ::= BIT_LENGTH "(" <expression> ")"
<extract function> ::= EXTRACT "(" <extract field> FROM <expression> ")"
<extract field> ::= TIMEZONE_HOUR | TIMEZONE_MINUTE
| YEAR | MONTH | DAY | HOUR | MINUTE | SECOND
<substring function> ::= SUBSTRING "(" <expression> FROM <expression> (FOR <expression>)? ")"
| SUBSTRING "(" <expression> "," <expression> ("," <expression>)? ")"
<fold function> ::= (UPPER | LOWER) "(" <expression> ")"
<trim function> ::= TRIM "(" <trim options>? <expression> ")"
<trim options> ::= <trim specification>? <expression>? FROM7
<trim specification> ::= LEADING | TRAILING | BOTH
<current timestamp function> ::= CURRENT_TIMESTAMP ("(" <expression> ")")?
<split function> ::= SPLIT "(" <expression> "," <expression> "," <expression> ")"
<abs function> ::= ABS "(" <expression> ")"
<ceiling function> ::= CEIL "(" <expression> ")"
<cos function> ::= COS "(" <expression> ")"
<degrees function> ::= DEGREES "(" <expression> ")"
<floor function> ::= FLOOR "(" <expression> ")"
<log function> ::= LOG "(" <expression> "," <expression> ")"
<mod function> ::= MOD "(" <expression> "," <expression> ")"
<power function> ::= POW "(" <expression> "," <expression> ")"
<radians function> ::= RADIANS "(" <expression> ")"
<round function> ::= ROUND "(" <expression> "," <expression> ")"
<sin function> ::= SIN "(" <expression> ")"
<sqrt function> ::= SQRT "(" <expression> ")"
<tan function> ::= TAN "(" <expression> ")"
<date function> ::= <dateadd> | <datediff> | <datetrunc>
<standard date part> ::= YEAR | MONTH | DAY | HOUR | MINUTE | SECOND | WEEK
<date_part> ::= <standard date part> | QUARTER
<dateadd> ::= DATEADD "(" <date_part> "," <expression> "," <expression> ")"
<datediff> ::= DATEDIFF "(" <date_part> "," <expression> "," <expression> ("," <expression>)? ")"
<datetrunc> ::= DATETRUNC "(" <date_part> "," <expression> ("," <expression>)? ")"
7 Note, the SQL-92 grammar actually specifies that TRIM(FROM <string>) is a syntactically valid invocation of the TRIM function, so we preserve that in MongoSQL grammar (SQL-92 6.7).
- Do we want to support the aggregation functions (AVG, MIN, MAX, COUNT, SUM) as scalar functions that operate on arrays?
  - Proposed Answer: This could be a nice feature to support eventually, but it does not seem necessary at this time since it is outside the scope of SQL-92.
This specification does not require that implementations rewrite NULLIF or COALESCE functions as CASE operations. This is left up to implementations since prescribing the rewrite would necessitate that one or more operands be copied multiple times, which could have performance implications that this spec aims to avoid.
We rejected the SQL-92 specified string value functions CONVERT and TRANSLATE for several reasons, including their conformance-level restrictions in SQL-92 and competitors' lack of support for them.
We also rejected the SQL-92 specified datetime value functions CURRENT_DATE and CURRENT_TIME because MongoSQL does not support the DATE or TIME types at this time.
A subquery is a SQL query within a query.
- A subquery may occur anywhere an expression can be used.
- When executing a subquery, we set ρ0 for the subquery to the current ρ of the outer scope in which the subquery is executed. This means values from the currently executing row are available to the subquery expression.
There are three types of subqueries defined in SQL-92:
- Scalar subquery, which returns a result set with zero or one row and one column. This is the simplest form of a subquery, and can be used in most places a literal or single column value is valid.
- Row subquery, which returns a result set with zero or one row and more than one column.
- Table subquery, which returns zero or more rows and one or more columns.
MongoSQL only supports scalar subqueries and table subqueries. The reason for not supporting row subqueries is described in the section below. Additionally, only traditional SQL-style SELECT is allowed in a subquery; users cannot write (SELECT VALUE ...) in a subquery.
A table subquery may only be used on the right-hand side (RHS) of a multiple-row operator such as IN, ANY, SOME, ALL, or EXISTS. Their behavior is described in the section below.
It must be statically provable that the degree of the subquery equals the degree of the expression on the other side of the operator. The degree of the subquery is derived from its SELECT clause expression list; for a subquery such as (SELECT * FROM foo), it must be derived from foo's schema. Otherwise, compilation will fail.
All other subqueries, where a table subquery is not applicable, are required to be scalar subqueries. For a scalar subquery, a static check is performed to ensure that the degree of the subquery is 1 and that its cardinality is no larger than 1. Otherwise, compilation will fail.
A scalar subquery expression evaluates to the single value returned by the contained query. If the result of the query expression is empty, the subquery will return MISSING.
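For example, the following hypothetical query uses a scalar subquery where a single value is expected; the subquery must statically have degree 1 and cardinality at most 1:
SELECT name, budget - (SELECT SUM(salary) FROM employees) AS remaining FROM departments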
The current environment from the outer query is inherited by the subquery. This is described in the Scoping section. Thus we will be able to support correlated queries (a subquery using values from an outer query) such as:
SELECT employee_number, department
FROM employees AS emp
WHERE salary > (
SELECT AVG(salary)
FROM employees
WHERE department = emp.department
);
SQL-92 clearly defines the resolution strategy for unqualified identifiers in subqueries. In MongoSQL, however, we must also handle queries without a schema. We decided not to allow a subquery to reference an unqualified identifier in the schemaless setting, with the following two exceptions. The reasoning is elaborated in the section below.
- If an unqualified name refers to an identifier in the subquery's local scope, it must be statically proven that the identifier exists in the subquery's data source.
For example:
SELECT * FROM foo WHERE EXISTS(SELECT * FROM [{a: 1, b: 1}] AS bar WHERE bar.b = a)
is allowed because we can statically prove that field "a" exists in datasource bar and, according to the SQL standard, the unqualified identifier resolves to the most local scope that contains it.
SELECT * FROM foo WHERE EXISTS(SELECT * FROM bar WHERE bar.b = a)
will produce a compile-time error because we cannot statically prove that field "a" exists in bar.
- If an unqualified name refers to an identifier in the subquery's outer scope, it must be statically proven that the identifier does not exist in the subquery's local scope and does exist in the subquery's outer scope.
For example:
SELECT * FROM [{a: 1}] AS foo WHERE EXISTS(SELECT * FROM [{b: 1}] AS bar WHERE bar.b = a)
is allowed because we can statically prove that field "a" exists only in datasource foo and not in bar.
SELECT * FROM foo WHERE EXISTS(SELECT * FROM [{b: 1}] AS bar WHERE bar.b = a)
or
SELECT * FROM [{a: 1}] AS foo WHERE EXISTS(SELECT * FROM bar AS bar WHERE bar.b = a)
will produce compile-time errors: in the first query, we cannot statically prove that field "a" exists in foo, and in the second query, we cannot statically prove that field "a" does not exist in bar.
<subquery expression> ::= "(" <select query> ")"
We chose not to support row constructors for now. Some databases, including SQL Server, do not support row constructors, so it is very likely this feature is not required by BI tools. An additional consequence of this decision is that we do not need to distinguish table subqueries from row subqueries, since row subqueries only behave differently when compared to row constructors. It would be easy to add support for row constructors and row subqueries in the future if necessary.
We chose to support unqualified identifiers in subqueries only in the cases listed above, due to the ambiguity created by missing qualifiers. For example, consider the following query and catalog:
SELECT * FROM foo WHERE EXISTS(SELECT * FROM bar WHERE bar.b = a)
ρc = {
test: {
foo: [
{ a: 1}
],
bar: [
{ b: 1 },
{ a: 2, b: 2 }
]
}
}
The "a" in the subquery could be interpreted as either foo.a or bar.a, depending on which record in bar is evaluated. If record { b: 1 } is read first then, per the SQL standard, the unqualified identifier "a" should be resolved to foo.a; otherwise, it should be resolved to bar.a. Such undefined behavior is unacceptable in the MongoSQL design.
An ANY predicate determines whether a target value matches any value in the subquery for a provided comparison operator. The result is equivalent to the logical disjunction of the comparisons between the target value and each value in the subquery result set. If one of the comparison predicates returns TRUE, the result of the ANY predicate is TRUE. Else, if any of the comparison predicates returns NULL, the result is NULL. In all other cases, the result is FALSE.
Note that each comparison follows the same rules as in the Comparison Operators section, which means the types on both sides of the quantifier must be statically proven comparable. This also applies to the other predicates in this section that involve comparison operators.
For example:
1 = ANY(SELECT a FROM [{a: 1}, {a: NULL}] AS arr)
The predicate will be evaluated to
(1 == 1) OR (1 == NULL) -> TRUE OR NULL -> TRUE
1 = ANY(SELECT a FROM [{a: 0}, {a: NULL}] AS arr)
The predicate will be evaluated to
(1 == 0) OR (1 == NULL) -> FALSE OR NULL -> NULL
1 = ANY(SELECT a FROM [{a: 0}, {a: 2}] AS arr)
The predicate will be evaluated to
(1 == 0) OR (1 == 2) -> FALSE OR FALSE -> FALSE
The SOME quantifier is equivalent to ANY. They generate the same result and can be used interchangeably.
An ALL predicate determines whether a target value matches all the values in the subquery for a provided comparison operator. The result is equivalent to the logical conjunction of the comparisons between the target value and each value in the subquery result set. The result of the ALL predicate is TRUE only if all the comparison predicates return TRUE. If any comparison predicate returns FALSE, the result is FALSE. In all other cases, the result is NULL.
For example:
1 = ALL(SELECT a FROM [{a: 1}, {a: NULL}] AS arr)
The predicate will be evaluated to
(1 == 1) AND (1 == NULL) -> TRUE AND NULL -> NULL
1 = ALL(SELECT a FROM [{a: 0}, {a: NULL}] AS arr)
The predicate will be evaluated to
(1 == 0) AND (1 == NULL) -> FALSE AND NULL -> FALSE
1 = ALL(SELECT a FROM [{a: 1}, {a: 1}] AS arr)
The predicate will be evaluated to
(1 == 1) AND (1 == 1) -> TRUE AND TRUE -> TRUE
An IN predicate determines whether a target value equals any value in a subquery or list of expressions. When there is a table subquery on the right-hand side of the predicate, the statement is rewritten to an ANY predicate and has the same behavior.
X IN Y
will be rewritten to
X = ANY(Y)
When there is a comma-separated expression list on the right-hand side of the predicate, the expression list is first rewritten as a subquery, and then the same rewrite as above is applied.
An EXISTS predicate evaluates to TRUE provided that the subquery result set is not empty; otherwise, it evaluates to FALSE. This predicate never evaluates to NULL.
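For example (a sketch using array literal datasources):
EXISTS(SELECT * FROM [{a: NULL}] AS t)
evaluates to TRUE because the result set contains one binding tuple, even though the only field in it is NULL, whereas
EXISTS(SELECT t.a FROM [{a: 1}] AS t WHERE t.a > 1)
evaluates to FALSE because the result set is empty.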
<subquery> ::= "(" <select query>8 ")"
<in predicate> ::= <expression> NOT? IN (<subquery> | "("<in value list>")")
<in value list> ::= <expression> ("," <expression>)*
<any predicate> ::= <expression> <comparison operator> ANY <subquery>
<all predicate> ::= <expression> <comparison operator> ALL <subquery>
<exists predicate> ::= EXISTS <subquery>
8 Subqueries using SELECT VALUE will parse, but will not pass semantic validation at this time.
X IN <subquery>
will be rewritten to
X = ANY <subquery>
X IN (A, B, C)
will be rewritten to
X = ANY(SELECT _1 FROM [{_1: A}, {_1: B}, {_1: C}])
X NOT IN <subquery>
will be rewritten to
X <> ALL(<subquery>)
X NOT IN (A, B, C)
will be rewritten to
X <> ALL(SELECT _1 FROM [{_1: A}, {_1: B}, {_1: C}])
X = SOME(Y)
will be rewritten to
X = ANY(Y)
Documents can be represented with a syntax similar to JSON objects. Keys must be strings and values can have any of the supported types. To access document fields, MongoSQL supports two options: "dot" notation and "bracket" notation.
Dot notation is similar to field access in MongoDB aggregation. For example, if a document doc contains a field f, then the expression doc.f is used to access the value of that field. This type of expression is a compound identifier, discussed in the Identifiers section.
Bracket notation uses square brackets, [ and ], around a field name to access the field with that name. For example, consider the same document described before: doc["f"] is used to access the value of that field.
Document field access <expr>[<subfield>] has the following behavior:
- <expr> must statically have type DOCUMENT or NULL, and may be missing.
  - If it is NULL or missing, the field access expression evaluates to NULL.
  - If it is DOCUMENT, the field access expression evaluates to the value of the subfield, or MISSING if the subfield does not exist.
- <subfield> must statically have type STRING. Users can wrap the argument in CAST or use a type assertion to ensure this. See the Type Conversions section for more details on casting.
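For example (a sketch; doc, f, and k are hypothetical field names in collection test.bar):
SELECT bar.doc['f'] AS v1, bar.doc[CAST(bar.k AS STRING)] AS v2 FROM test.bar AS bar
The first access uses a constant field name; the second computes the field name at runtime, using CAST to satisfy the static STRING requirement.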
<document expression> ::= "{" "}" | "{" <key-value pair> ("," <key-value pair>)* "}"
<key-value pair>9 ::= <string literal> ":" <expression>
<document field access expression> ::= <compound identifier> | <expression> "." <compound identifier>
| <expression> "[" <expression> "]"
9 Note: <key-value pair> is not actually independently part of the grammar; it is displayed as a separate entity here for ease of reading.
See spec tests here.
MongoDB aggregation does not natively support bracket notation for field access. We considered not supporting this syntax in MongoSQL but ultimately decided to include it because 1) it is necessary to support computed field names, and 2) there are viable options that will allow it to be supported at implementation time. The competitor space is mixed on this syntax:
- PartiQL and Presto support it
- N1QL and SQL++ do not
Arrays are ordered lists of values. Elements can have any of the supported types and do not need to all be the same type. To access array elements, MongoSQL supports the common "bracket" notation <array>[<index>] and uses zero-indexing. For example, if an array arr contains the elements 1, 2, and 3 in that order, then the expression arr[0] is used to access the first element, 1. MongoSQL does not use "dot" notation for array element access.
For array element access, the <index> must statically have type INT or NULL, and may evaluate to MISSING. The semantics of array element access match those of the aggregation operator $arrayElemAt. That is:
- If the index is zero or positive, it returns the element at that index, counting up from zero at the start of the array.
- If the index is negative, it returns the element at that index, counting down from the end of the array, where -1 is the last element, -2 the second to last, and so on.
- If the index exceeds the array bounds, it returns MISSING.
- If the index is NULL or MISSING, it returns NULL.
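For example (a sketch using an array literal datasource):
SELECT foo.arr[0] AS first, foo.arr[-1] AS last, foo.arr[5] AS oob
FROM [{arr: [1, 2, 3]}] AS foo
Here first is 1, last is 3, and oob is MISSING, so the oob key is omitted from the result document.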
MongoSQL supports slicing arrays to access subsets of arrays via the SLICE(<array>, <start>, <length>) scalar function. The semantics of array slicing match those of the aggregation operator $slice. Specifically:
- If any of the arguments are NULL or MISSING, the result of the slice operation is NULL.
- The <array> argument is required and must statically have type ARRAY or NULL, and may be missing.
- A <start> argument is optional. If provided, it must statically have type INT or NULL, and may be missing.
  - If positive, the slice starts at that position counting up from the start of the array.
    - If it exceeds the bounds of the array, the resulting slice is empty.
  - If negative, the slice starts at that position counting down from the end of the array.
    - If it exceeds the bounds of the array, the starting position is the start of the array.
- The <length> argument is required and must statically have type INT or NULL, and may be missing. If <start> is provided, <length> must be positive; this will be enforced at runtime, and the result will be NULL if it is not positive.
  - If positive, the slice includes up to the first <length> elements from the <start> position, or from the start of the array if <start> is omitted.
  - If negative, the slice includes up to the last <length> elements of the array.
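For example (a sketch of the $slice-style semantics described above, using an array literal datasource):
SELECT SLICE(foo.arr, 2) AS first_two,
       SLICE(foo.arr, -2) AS last_two,
       SLICE(foo.arr, 1, 2) AS middle
FROM [{arr: [1, 2, 3, 4]}] AS foo
Here first_two is [1, 2], last_two is [3, 4], and middle is [2, 3] (two elements starting at position 1).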
<array expression> ::= "[" "]"
| "[" <expression> ("," <expression>)* "]"
<array index expression> ::= <expression> "[" <expression> "]"
<array slice expression> ::= SLICE "(" <expression> "," (<expression> ",")? <expression> ")"
We chose to zero-index arrays in MongoSQL because that is what the BSON spec/MongoDB do. Competitors are split on this decision:
- PartiQL zero-indexes
- N1QL zero-indexes
- Presto one-indexes
- Rockset one-indexes
We considered three options to support slicing:
- Function: A SLICE scalar function which maps onto $slice.
- Operator:
  a. A slice operator that uses square brackets and a colon: <array>[<start>:<end>]
  b. A slice operator that uses square brackets and an ellipsis: <array>[<start>...<end>]
- Both: The SLICE function and the slice operator.
Typically, the two approaches have different semantics. The function approach usually has <start> and <length> parameters, as the Slicing section describes. The operator approach usually has <start> and <end> parameters, taking the slice from the <start> index to the <end> - 1 index. We chose not to support the operator because 1) it does not map as cleanly onto $slice, 2) it would be unconventional to change it so that it did map cleanly onto $slice, and 3) it is not necessary to have two ways of doing the same thing. Many competitors support array slicing, and each of them only supports it one way:
- N1QL supports it via an operator
- PartiQL does not support it
MongoSQL does not support any implicit array traversal. Implicit array traversal refers to the application of an operation over an array without explicit user-provided syntax to do so. There are many examples of this. One such example is field access on arrays in MongoDB aggregation. For the agg expression "$arr.b", if the field ref $arr resolves to an array, then the subfield access for b is mapped over the array, resulting in an array.
Another is conditional evaluation on arrays like the following:
ρc: {
test: {
foo: [
{'a': [4, 5, 6]}
]
}
}
SELECT a FROM foo WHERE a > 5
Implicit array traversal may apply the > 5 operation to each element of the array a, resulting in true if any element meets the criteria. Alternatively, it could result in true if all elements meet the criteria. Regardless, such implicit traversal is not supported by MongoSQL. If users wish to achieve this behavior, they can use explicit syntax to operate over arrays.
MQL (as well as BSON itself) has a distinction between NULL and MISSING. In the case of NULL there is a field with the literal value NULL, whereas in the case of MISSING, the field is gone. We will maintain this distinction for several reasons:
- Maintaining the distinction simplifies translations, which should result in faster query performance (no conditionals to upconvert missing to NULL like in the BI Connector).
- Upconverting MISSING to NULL for base fields in the collection would require collecting the list of every field in every document for SELECT * cases when no static schema is provided for the collections used in the query.
The specific NULL/MISSING expression semantics will be consistent with SQL-92, for SQL-92 operators.
For all string operators, a MISSING as any argument will be treated as NULL, and the entire result shall be NULL. See Section Semantics of String Operators.
For all arithmetic operators, a MISSING as either argument will be treated as NULL, and the result shall be NULL. See Section Semantics of Arithmetic Operators.
For all SQL-92 comparison operators, a MISSING as either argument will be treated as NULL, and the entire result shall be NULL.
For all boolean operators, a MISSING as either argument will be treated as NULL, and the truth tables from Section Semantics of Boolean Operators will be followed.
There is no special handling for NULL or MISSING in conditions; cases with conditions that evaluate to NULL or MISSING are skipped. See section Control Flow Operators.
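For example (a sketch using an array literal datasource):
SELECT CASE WHEN foo.a > 0 THEN 'positive' ELSE 'other' END AS c FROM [{a: NULL}] AS foo
The condition foo.a > 0 evaluates to NULL, so the WHEN branch is skipped and c is 'other'.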
For the purposes of GROUP BY keys and the DISTINCT keyword, MISSING is converted to NULL, and NULL is considered distinct from all non-NULL values. See this SERVER bug for a case where MQL does not follow these rules at the moment.
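For example (a sketch using an array literal datasource):
SELECT DISTINCT a FROM [{a: 1}, {a: NULL}, {}] AS foo
returns two binding tuples, one where a is 1 and one where a is NULL, because the MISSING a in the third document is converted to NULL and grouped together with the literal NULL.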
Any expression that results in a missing value in a document will remove the associated key from the document. For example:
ρc: ⟨
test: {
bar: [
{'a':41, 'b':42},
{'c':23}
]
}
⟩
SELECT VALUE {'a': a} FROM test.bar AS bar
Will result in:
{'a': 41}
{}
Because the 'a' key is MISSING in the second document.
In array constructors, MISSING is converted to NULL. Consider again the catalog from the Document Constructors example, and the following query:
SELECT VALUE {'a': [a]} FROM test.bar AS bar
Will result in:
{'a': [41]}
{'a': [null]}
Because the MISSING 'a' key in the second document is converted to NULL.
Any MQL scalar functions exposed through MongoSQL will have the same NULL/MISSING semantics as the underlying scalar functions. By translating directly to the MQL scalar functions, we improve query performance.
NULL/MISSING are treated differently in aggregation expressions vs MQL find. A deep dive into this is available in Implementation Considerations for Translation to MQL: NULL/MISSING in Aggregation vs Find.
We could upconvert MISSING into NULL, but doing so would result in more data sent across the wire (potentially exponentially more) and slower queries.
- PartiQL/N1QL/SQL++ preserve the distinction between NULL and MISSING.
- Presto does not have MISSING values.
- Drill returns NULL when MISSING values are accessed.
Next we will show where MongoSQL differs from PartiQL, but is consistent with MQL:
- In PartiQL, a MISSING in an array context results in a literal value MISSING, which is then given to the particular serializer for the implementation (this part is left up to the particular data structures for which PartiQL is defined) to decide how to translate that MISSING. For example:
  SELECT VALUE [foo.a, 43] FROM bar
- There is no literal MISSING value in BSON, so we will convert to NULL when foo.a is MISSING.
- In some sense this is actually consistent with PartiQL, in that our particular serializer (to BSON) upconverts MISSING to NULL, but it makes sense to reference this explicitly here.
As in any programming language, the MongoSQL semantics have to deal with issues of name scope. For example, how are references to foo resolved in the following query with the following catalog:
SELECT foo.hello AS hello
FROM test.bar AS foo
WHERE foo.hello = ANY (SELECT foo.goodbye FROM test.foo AS foo)
ρc = {
test: {
foo: [
{ foo: { bar: 1 }, bar: 2, goodbye: "friend"}
],
bar: [
{ hello: "world" },
{ hello: "friend" }
]
}
}
Generally, for each evaluation of a clause, the values environment tuple is concatenated with the current input tuple. Names that exist only in the values environment are drawn from the values environment; names that exist only in the input tuple are drawn from the input tuple; names that exist in both are drawn from the input tuple, since it has the closer scope.
Back to the example: Since this is a SQL query and MongoSQL is compatible with SQL, the foo in foo.goodbye resolves to the variable foo defined by the inner query's FROM clause. Technically, this scoping rule is captured by the following handling of documents:
The inner query (and, therefore, its FROM clause) is evaluated with a values environment ρ0 = ⟨foo: v⟩; this foo is the one defined by the outer FROM, which is test.bar. Then the inner FROM clause outputs a stream of binding tuples of the form t = ⟨foo: v⟩; this foo is defined by the inner FROM, which is test.foo. Then the expression foo.goodbye is evaluated once for each tuple t in the stream, in the environment ρselect = tupleConcat(ρ0, t).
Because foo appears in both ρ0 and t, the concatenation keeps only the foo value from its right argument (t) as given in the rules in Section Abstract Model. Essentially, by putting t as the right argument of the concatenation, the variables of t have precedence over synonymous variables in the left argument (ρ0).
Qualified references are references formed as q.x, where q is the qualifier. However, as seen in Section Document and Field-Access Expressions, d.x is an access to field x in document d. This makes it very important that we have clear rules to determine whether foo.bar in a given position is a qualified reference or a document field access.
We disambiguate by looking at the data source names in the values environment (at any nesting depth). If foo is one of the data source names, we assume that "foo." is a qualification instead of a field reference in a document named foo.
This raises the question: how does one access the bar field under the foo field in the test.foo collection from ρc in our example above, if foo is currently a qualifier name? There are three options: alias foo to something else in the FROM clause, use foo.foo.bar, or use foo['bar'], because [] field access is not ambiguous.
Example:
SELECT foo.bar FROM foo AS foo
Here, foo.bar is the qualified field bar from foo. So the result is:
{'bar': 2}
If we instead meant to access the bar field nested under the foo field, we could write any of the following:
SELECT foo.bar FROM foo AS f
or
SELECT foo.foo.bar FROM foo AS foo
or
SELECT foo['bar'] FROM foo AS foo
each of which would result in:
{'bar': 1}
Note that a given FROM clause can introduce multiple data source quantifiers:
FROM foo AS foo JOIN bar AS bar introduces both foo and bar as name qualifiers. Additionally, qualifiers can come from smaller scope numbers in the case of correlated subqueries: in SELECT (SELECT * FROM foo WHERE foo.x = bar.x) FROM bar, bar.x is a qualified reference because bar is a data source quantifier with a smaller scope number.
In all cases, the following algorithm is applied statically at query planning time to determine which MongoDB datasource a reference is associated with. No dynamic datasource decisions are made during query execution. If a query passes the query planning stage and a value is missing at runtime, it becomes a MISSING value; if it is missing statically, it is an error during query planning.
Definitions:
- MUST contain f => we have a schema that proves a datasource is guaranteed to have the top-level field f
- CANNOT contain f => we have a schema that proves a datasource cannot contain the top-level field f
- MAY contain f => we can neither prove nor disprove the existence of the top-level field f in a datasource

Qualified Reference Resolution:
- Qualified reference: q.f1(.f2...fn)
- We first need to resolve the qualifier q to a datasource
  - By the definition of a qualified reference, there is guaranteed to be at least one key (n, d) in the environment such that n = q
  - To resolve the qualifier q, we consider all the keys (n, d) with n = q, and choose the one with the largest value for d
- Once we have resolved the datasource, we resolve the field by accessing the subpath f1(.f2...fn) in that datasource's top-level document

Unqualified Reference Resolution:
- Unqualified reference: f1(.f2...fn)
- We first need to resolve the implicit qualifier q to a datasource
  - If there is only a single datasource C such that C MUST or MAY contain f1, we resolve q to that datasource
  - Else, apply the following algorithm, starting with the largest nesting depth d:
    - Consider all bindings at the current nesting depth
    - If all such bindings CANNOT contain f1, continue at the next-smallest nesting depth d-1
    - If there is more than one such binding that MUST or MAY contain f1, error
      - Note: this is the case that covers non-subquery-related ambiguities like SELECT a FROM foo JOIN bar
    - If there is exactly one such binding that MUST contain f1, resolve q to that datasource
    - Else (i.e. if there is exactly one binding that MAY contain f1), error
      - Note: this is the case that covers correlation-related ambiguities. We error here because the field could resolve to multiple different nesting depths.
  - If the above algorithm does not resolve q to any datasource, return a "field not found" error
- Once we have resolved the datasource, we resolve the field by accessing the subpath f1(.f2...fn) in that datasource's top-level document
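As a worked example (a sketch; the datasource names and schema facts are assumptions), consider:
SELECT a FROM customers AS c JOIN orders AS o
If the schema proves that c MUST contain a and o CANNOT contain a, the unqualified a resolves to datasource c. If instead both c and o MAY contain a, more than one binding at the same nesting depth MAY contain a, so compilation fails with an ambiguity error and the user must write c.a or o.a.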
Identifiers that reference collections are distinct from other references, and can be statically disambiguated based on their position in the query: collection references occur only in FROM clauses, and regular references to fields cannot occur in FROM clauses. A collection reference may be a qualified reference or not, with the qualification denoting a database rather than a datasource. Unlike expression positions, this is easy to determine, because there are no field references allowed in FROM clauses: any FROM clause reference containing `.` is qualified. All unqualified references in FROM clauses are turned into qualified references by prefixing the current database.
MongoSQL does not provide any guarantees about the ordering of keys in documents. This means that any document value may be returned from a query with its keys in an arbitrary order, whether that document was created by a literal, selected unmodified from a collection, or otherwise. In practice, document keys will often be returned in the order they are specified in a literal or the order they are stored in the database, but users should not rely on this behavior. Although MongoSQL does not provide any guarantees about the ordering of keys in documents, it does support document comparison. Literal documents are treated as ordered and document comparison follows MongoDB's document comparison rules.
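For example (a sketch; foo is a hypothetical collection, and recall that MongoDB's comparison rules consider document fields in order):
SELECT VALUE {'eq': {'a': 1, 'b': 2} = {'b': 2, 'a': 1}} FROM foo AS foo
Because the two literals list their keys in different orders, they compare as unequal under MongoDB's rules, so eq is false for every document in foo.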
PostgreSQL does not allow comparison between JSON values, but it does allow comparison between JSONB values, and JSONB equality is not affected by key order. JSON and JSONB documents in result sets do not necessarily have the same key order as specified in the query.
MySQL allows comparison between JSON values, and JSON equality is not affected by key order. JSON documents in result sets do not necessarily have the same key order as specified in the query.
PartiQL allows comparisons between tuples, and equality is not affected by key order. PartiQL explicitly defines its tuples as unordered sets of key-value pairs.
Presto allows comparisons between JSON objects, and key order is not considered for equality.
MongoDB differs from IEEE Std 754-2008, Section 5.11, Para. 2 and supports equality comparisons between NaN values (<=, =, !=, >=).
Example:
ρc: ⟨
test: {
nan: [
{'a': NaN, 'b': NaN},
{'a': 0, 'b': 42}
]
}
⟩
SELECT * FROM test.nan AS nan WHERE nan.a = nan.b
SELECT * FROM test.nan AS nan WHERE nan.a = 'NaN'::DOUBLE
Note: 'NaN'::DOUBLE will result in a NaN value
Will both result in:
{'a': NaN, 'b': NaN}
SELECT * FROM test.nan AS nan WHERE nan.a != nan.b
SELECT * FROM test.nan AS nan WHERE nan.a != 'NaN'::DOUBLE
Will both result in:
{'a': 0, 'b': 42}
SELECT VALUE {'c': a = b} FROM test.nan AS nan
Will result in:
{'c': true}
{'c': false}
because NaN = NaN evaluates to true for the first document, while 0 = 42 evaluates to false for the second.
This may be surprising to some users, but is required by MongoDB to support indexing.
MongoSQL has two categories of exceptional behavior:
- Static errors
- Undefined Behavior
A MongoSQL query should never result in a runtime error. Any runtime error that causes a termination of a query shall be considered a bug. It is common in the analytics world for queries to run for hours, or even days, so runtime errors are a particularly undesirable failure mode.
We attempt to detect as many exceptional situations as possible at compile time so that we can return a static error. Static errors immediately halt execution and allow the user to correct the problem with their query. Some examples of static errors:
- Ambiguous references or duplicate binding tuple names
- Scalar subqueries not guaranteed to return 0 or 1 binding tuples
- Arguments to operators and scalar functions that cannot be statically determined to be the expected types
- References known to be MISSING, statically
  - We view MISSING as something that is not intended behavior
Not every observable behavior of a MongoSQL implementation is guaranteed by this specification. Behaviors that are not guaranteed will be referred to as "Undefined Behavior" (UB). UB should not be relied upon, as it may differ between implementations or be changed at any time.
The reason for having UB is to prevent certain kinds of implementation details from leaking into the specification without incurring an unreasonable runtime cost. For example, MQL will upconvert ints to doubles on arithmetic overflow, which is an undesirable behavior that may cause problems for BI Tools. Instead of wrapping every integer addition translation in an expensive and unidiomatic conditional that detects when overflow will occur, we instead declare the integer overflow behavior to be UB.
Comments are sequences of characters within queries that do not impact query execution. MongoSQL supports standard SQL comments and C-style block comments.
Standard SQL comments begin with double dashes and end with a new line:
-- This is a standard SQL comment
Block comments begin with /* and end at the matching occurrence of */. They may span multiple lines and can be nested:
/* This is a multiline comment
* with nesting: /* nested comment */
*/
- How do we expose MongoDB-specific types (like ObjectId) via ODBC/JDBC?
  - One option is custom types in JDBC/ODBC. It will require some experimentation to see whether BI Tools can actually consume custom types.
  - If custom types don't work, a second option is to expose the whole BSON value (including type tag) as binary data.
    - This would require us to distinguish between whole BSON values represented as binary data and actual BSON BinData values.
  - If neither of the above options works, we can fall back to a hack similar to the one we used in the BI Connector: expose ObjectIds to BI Tools as strings, and try to infer when string literals in queries are actually supposed to be ObjectId literals.
    - If we did this, we would want to hide it behind a feature flag that would be opt-in, and it would likely only be used by JDBC and ODBC drivers.
BSON includes a Date type and a Timestamp type. The BSON Date type represents a datetime; i.e., it consists of a date part and a time part. The BSON Timestamp type is a special type intended for internal MongoDB use.
SQL-92 includes a Date type, a Time type, and a Timestamp type. The SQL-92 Date type represents a date without a time; i.e., it only consists of a date part. The SQL-92 Time type represents a time without a date; i.e., it only consists of a time part. The SQL-92 Timestamp type represents a datetime; i.e., it consists of a date part and a time part, meaning it is the same as the BSON Date type.
Since we want our MongoSQL types to be the same as the set of BSON types if possible, we are not currently supporting the SQL DATE or TIME type. We will do so in the future if and only if it becomes necessary to do so in order to make BI Tools work.
Note: the same approach will apply for other places where our type system deviates from the SQL standard. For example, we would consider adding semantic support for parameterized types like VARCHAR(3) if and only if it becomes necessary to do so in order to get BI Tools working.
We may want to consider adding a way to access arbitrary MQL expressions via MongoSQL, so that users can leverage MQL expressions without the need for adding explicit support for each one in MongoSQL itself. This work is tracked in the SQL-70 ticket, and previous design discussions on this topic can be found in this google doc's named version history.
A number of clauses in standard SQL allow users to explicitly specify the collation that should be used for comparisons (ORDER BY and GROUP BY, for example). MongoSQL does not allow this at the moment, though we may want to consider it in the future.
Beyond intra-type comparisons for strings, one could also imagine a future where collations control structured-data comparisons (for example, how key order and duplicate keys are treated in document comparisons) and inter-type comparisons (for example, how a document and a boolean compare).
For now, we only allow ORDER BY sort keys to be column names or integer literals. In the future, we may wish to expand this to arbitrary expressions.
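For example (assuming SQL-92 ordinal semantics for integer literal sort keys):
SELECT a, b FROM foo AS foo ORDER BY 2
sorts the results by the second select-list item, b, and is equivalent to ORDER BY b.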
For now, only documents are returned as the values in binding tuples. We can relax this in the future to allow for something such as SELECT VALUE [a, b] FROM foo. This will require a slight rework of how the binding names are modified in derived tables, because that currently uses $mergeObjects on the underlying values of the original binding tuples from the SELECT clause of the derived table query.
Support for the USING clause in equality join conditions has been deferred until we can find a solution for resolving the namespaces.
In the future, we should be able to support operations currently reserved for subqueries over arrays as well (and vice versa). This could include some or all of the following:
- Support IN, ANY, and ALL expressions with array values on the RHS
- Support a syntax for using a subquery's entire result set as an array value
tupleConcat: Ρ × Ρ → Ρ
tupleConcat is formally defined using the following rules. In the rules, variables x are keys, v are BSON values, x: v is a binding from a binding tuple key x to BSON value v, and variables X are arbitrary sequences of key, value bindings. Note that variables X are allowed to match the empty binding.
- tupleConcat(⟨(x, n): v0, X1⟩, ⟨(x, n): v1, X2⟩) ⇒ tupleConcat(⟨(x, n): v1, X1⟩, ⟨X2⟩)
- tupleConcat(⟨X0⟩, ⟨X1⟩) ⇒ ⟨X0, X1⟩ otherwise
These rules describe how we merge binding tuples: if two keys have the same name between the two binding tuples, we choose the one from the right binding tuple. Duplicate keys within one binding tuple are undefined behavior, and the implementation is free to leave them or remove them, and choose which duplicate to remove, if removal is chosen. The following example shows how these rules are applied:
tupleConcat(⟨(x, 0): 1, (y, 0): 13, (z, 0): 4⟩,
⟨(x, 0): 3, (a, 0): 1, (z, 0): 14⟩)
Apply rule 1:
tupleConcat(⟨(x, 0): 3, (y, 0): 13, (z, 0): 4⟩,
⟨(a, 0): 1, (z, 0): 14⟩ )
Apply rule 1:
tupleConcat(⟨(x, 0): 3, (y, 0): 13, (z, 0): 14⟩,
⟨(a, 0): 1⟩ )
Apply rule 2:
⟨(x, 0): 3, (y, 0): 13, (z, 0): 14, (a, 0): 1⟩
No more rules can be applied, so this is our result.
We also extend tupleConcat over arrays of binding tuples as follows, given an array arr:
tupleConcat(d0, arr) = [tupleConcat(d0, x) for x ∊ arr]
Similarly:
tupleConcat(arr, d0) = [tupleConcat(x, d0) for x ∊ arr]
And:
tupleConcat(arr0, arr1) = [tupleConcat(x,y) for x ∊ arr0, for y ∊ arr1]
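For example, applying the first array extension pointwise:
tupleConcat(⟨(x, 0): 1⟩, [⟨(x, 0): 2⟩, ⟨(y, 0): 3⟩])
= [tupleConcat(⟨(x, 0): 1⟩, ⟨(x, 0): 2⟩), tupleConcat(⟨(x, 0): 1⟩, ⟨(y, 0): 3⟩)]
= [⟨(x, 0): 2⟩, ⟨(x, 0): 1, (y, 0): 3⟩]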
We begin by looking at a non-exhaustive list of runtime errors thrown by MQL, as determined from the MQL documentation; any translation to MQL must handle these errors. The $convert operator can be used to catch errors.
- $acos: throws an error for input values outside of [-1, 1] or non-numeric input
- $acosh: throws an error for input values outside of [1, ∞) or non-numeric input
- $asin: throws an error for input values outside of [-1, 1] or non-numeric input
- $atanh: throws an error for input values of exactly 1 and -1 or non-numeric input
- $binarySize: throws an error for input that is not a string, binary, or BSON null
- $bsonSize: throws an error for input that is not a document or BSON null
- $convert: can throw an error if no onError attribute is specified and the expression to be converted throws an error, or if an error occurs during conversion
  - $convert may be used as an error handler because of this behavior
- $cos: throws an error for input values of -∞ and ∞ or non-numeric input
- $dateFromParts: throws errors for out-of-range parts; for MongoDB 4.4 this means:
  - year: 1-9999
  - isoWeekYear: 1-9999
  - month: 1-12
  - isoWeek: 1-53
  - day: 1-31
  - isoDayOfWeek: 1-7
  - hour: 0-23
  - minute: 0-59
  - second: 0-59
  - millisecond: 0-999
  - any non-conformant timezone
- $dateFromString: a convenience function for $convert; works as $convert and errors as $convert
- $dayOfMonth: throws an error if the argument is not a Mongo date object
- $dayOfWeek: throws an error if the argument is not a Mongo date object
- $dayOfYear: throws an error if the argument is not a Mongo date object
- $first: throws an error if the argument is not an array, MISSING, or BSON null
- $geoNear: throws an error if there is more than one 2d index or more than one 2dsphere index and a key is not provided
- $hour: throws an error if the argument is not a Mongo date object
- $in: throws an error in either of the following cases: if the $in expression is not given exactly two arguments, or if the second argument is not an array
- $indexOfArray: throws an error if the first argument is not an array, BSON null, or MISSING
- $indexOfCP: throws an error if the first argument is not a string, BSON null, or MISSING
- $isoDayOfWeek: throws an error if the argument is not a Mongo date object
- $isoWeek: throws an error if the argument is not a Mongo date object
- $isoWeekYear: throws an error if the argument is not a Mongo date object
- $last: throws an error if the argument is not an array, MISSING, or BSON null
- $millisecond: throws an error if the argument is not a Mongo date object
- $minute: throws an error if the argument is not a Mongo date object
- $month: throws an error if the argument is not a Mongo date object
- $replaceAll: throws an error if any argument is not a string or BSON null (it throws an error if any argument is MISSING)
- $replaceOne: throws an error if any argument is not a string or BSON null (it throws an error if any argument is MISSING)
- $round: throws an error if any argument is not numeric, MISSING, or BSON null
- $second: throws an error if the argument is not a Mongo date object
- $sin: throws an error for input values of -∞ and ∞ or non-numeric input
- $sinh: throws an error for non-numeric input
- $size: throws an error if the input is not an array (it errors for MISSING or BSON null)
- $split: throws an error on non-string/BSON null/MISSING input
- $sqrt: throws an error on non-numeric/BSON null/MISSING input or on negative numeric input
- $substrBytes: throws an error if there are not exactly three arguments, or if either index results in a byte in the middle of a UTF-8 rune. It does not error on inputs that are not strings
- $substrCP: throws an error if there are not exactly three arguments. It does not error on inputs that are not strings
- $switch: throws an error if no default is specified and no case matches
- $tan: throws an error for input values of -∞ and ∞ or non-numeric input
- $tanh: throws an error for non-numeric input
- $toBool, $toDate, $toDecimal, $toDouble, $toInt, $toLong, $toObjectId, $toString: convenience functions for $convert; they work as $convert and error as $convert
- $trunc: throws an error for non-numeric input
- $week: throws an error if the argument is not a Mongo date object
- $year: throws an error if the argument is not a Mongo date object
- $zip: throws an error if any argument is not an array, or if useLongestLength is set to true and the defaults are unspecified or empty
Additionally, we have a list of known errors that are not in the documentation:
- $add: throws an error on non-numeric/BSON null/MISSING input
- $cosh: throws an error for non-numeric input
- $divide: throws an error on non-numeric/BSON null/MISSING input
- $multiply: throws an error on non-numeric/BSON null/MISSING input
- $subtract: throws an error on non-numeric/BSON null/MISSING input
For the purposes of the MQL find language, which is also the default language of the $match aggregation stage, NULL and MISSING are treated the same. In other parts of MQL aggregation, NULL and MISSING are treated as distinct for the purposes of the $eq function.
Example:
> db.nm.find()
{ "_id" : ObjectId("5fd50cdebe80dc7690b03783"), "a" : 1 }
{ "_id" : ObjectId("5fd50ce0be80dc7690b03784"), "a" : 2 }
{ "_id" : ObjectId("5fd50ce3be80dc7690b03785"), "a" : null }
{ "_id" : ObjectId("5fd50ce5be80dc7690b03786") }
> db.nm.find({"a": {"$eq": null}})
{ "_id" : ObjectId("5fd50ce3be80dc7690b03785"), "a" : null }
{ "_id" : ObjectId("5fd50ce5be80dc7690b03786") }
> db.nm.aggregate({"$addFields": {"out": {"$eq": ["$a", null]}}})
{ "_id" : ObjectId("5fd50cdebe80dc7690b03783"), "a" : 1, "out" : false }
{ "_id" : ObjectId("5fd50ce0be80dc7690b03784"), "a" : 2, "out" : false }
{ "_id" : ObjectId("5fd50ce3be80dc7690b03785"), "a" : null, "out" : true }
{ "_id" : ObjectId("5fd50ce5be80dc7690b03786"), "out" : false }MISSING and NULL, however, are also treated the same in the $group stage for the purposes of grouping:
> db.nm.aggregate({"$group": {"_id": "$a"}})
{ "_id" : 2 }
{ "_id" : 1 }
{ "_id" : null }As mentioned in Section GROUP BY keys and DISTINCT, see this SERVER bug for a case where MQL does not currently treat NULL and MISSING the same. Our translations will need to take this bug into account.