This document formally describes the Anvil language grammar and its semantics. Anvil is a dataflow-oriented DSL where programs describe pipelines of tools operating on tabular data. Programs are parsed into an AST and then transformed into an execution graph.
The grammar is implemented using Pest and is reproduced here with explanations and examples.
Whitespace is generally insignificant and may appear between tokens. Comments begin with # and run to the end of the line.
COMMENT = _{ "#" ~ (!NEWLINE ~ ANY)* ~ NEWLINE? }
WHITESPACE = _{ " " | "\t" | NEWLINE | COMMENT }
NEWLINE = _{ "\r\n" | "\n" }
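The comment rule above can be approximated outside the parser. The following is a minimal sketch, not the Pest implementation: `strip_comments` is a hypothetical helper, and it assumes that a `#` inside a single-quoted string is not a comment, which matches how Pest treats atomic rules such as STRING (implicit whitespace and comments are not skipped inside them).

```python
def strip_comments(source: str) -> str:
    """Remove '#' comments while leaving single-quoted strings untouched."""
    out = []
    in_string = False
    i = 0
    while i < len(source):
        ch = source[i]
        if in_string:
            # Inside a string every character is kept, including '#'.
            out.append(ch)
            if ch == "'":
                in_string = False
        elif ch == "'":
            in_string = True
            out.append(ch)
        elif ch == "#":
            # Skip to the end of the line; the newline itself is kept.
            while i < len(source) and source[i] != "\n":
                i += 1
            continue
        else:
            out.append(ch)
        i += 1
    return "".join(out)


cleaned = strip_comments("[print]; # show the frame\n")
```

Here `cleaned` keeps the statement and the trailing newline, and drops only the comment text.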
An Anvil program is a sequence of statements.
PROGRAM = { SOI ~ STATEMENT* ~ EOI }
Each statement represents a complete dataflow expression and must end with a semicolon.
STATEMENT = { FLOW ~ BRANCH_BLOCK? ~ OUTPUT_BINDING? ~ ";" }
A statement may:
- consist solely of a flow
- fan out into named branches
- bind its final output to a variable
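The three statement shapes can be pictured as one record with two optional parts. This is an illustrative sketch only: the `Statement` class and its field names are assumptions made for this example and are not taken from the Anvil AST.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Statement:
    # Steps of the flow in source order, kept as raw text in this sketch.
    flow: List[str]
    # Branch name -> target text; empty when there is no branch block.
    branches: Dict[str, str] = field(default_factory=dict)
    # Variable named after '>'; None when the output is not bound.
    output: Optional[str] = None


# A statement consisting solely of a flow.
flow_only = Statement(flow=["df", "[print]"])

# A statement that binds its final output to a variable.
bound = Statement(
    flow=["[input: './data/file.parquet']", "[select: id='id']"],
    output="df",
)

# A statement that fans out into named branches.
branched = Statement(
    flow=["[input: './data/messy.parquet']", "[filter: '$three == true']"],
    branches={"true": "[print]", "false": "df"},
)
```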
A flow is a left-to-right pipeline of tools and/or variables.
FLOW = { (TOOL_REF | VARIABLE) ~ (PIPE ~ (TOOL_REF | VARIABLE))* }
The pipe operator (|), referenced in the grammar as PIPE, connects the output of one step to the input of the next.
[input: './data/file.parquet'] | [schema] | [print];
Variables may appear anywhere a tool can appear.
df | [select: id='id', name='name'] | [print];
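To illustrate how a flow decomposes into steps, the sketch below splits flow text at top-level pipes. It is a hand-written approximation, not the Pest parser: `split_flow` is a hypothetical helper, and it assumes pipes inside brackets, parentheses, or single-quoted strings belong to the enclosing step.

```python
from typing import List


def split_flow(flow: str) -> List[str]:
    """Split flow text into steps at pipes that are not nested or quoted."""
    steps, current = [], []
    depth = 0          # nesting level of [...] and (...)
    in_string = False  # True while inside a single-quoted string
    for ch in flow:
        if in_string:
            current.append(ch)
            if ch == "'":
                in_string = False
            continue
        if ch == "'":
            in_string = True
        elif ch in "[(":
            depth += 1
        elif ch in "])":
            depth -= 1
        elif ch == "|" and depth == 0:
            # A top-level pipe ends the current step.
            steps.append("".join(current).strip())
            current = []
            continue
        current.append(ch)
    steps.append("".join(current).strip())
    return steps


steps = split_flow("df | [select: id='id', name='name'] | [print]")
```

For the example above, `steps` holds the variable `df` followed by the two tool references, in source order.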
A statement may bind its final output to a variable using the bind operator (>).
OUTPUT_BINDING = { BIND ~ VARIABLE }
BIND = { ">" }
[input: './data/file.parquet'] | [select: id='id'] > df;
Branching allows a tool to emit multiple named outputs.
BRANCH_BLOCK = { ":" ~ BRANCHES }
BRANCHES = { BRANCH ~ ("," ~ BRANCH)* }
BRANCH = { IDENTIFIER ~ "=>" ~ TARGET }
TARGET = { VARIABLE | FLOW ~ OUTPUT_BINDING? }
- Branching always applies to the immediately preceding flow
- Branch names must correspond to outputs produced by the tool
- Branches do not implicitly rejoin
- Merging must be done explicitly via tools such as join, union, or intersect
[input: './data/messy.parquet']
  | [filter: '$three == true']:
    true => [print],
    false => df;
df | [print];
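The branch block in the example can be read as a mapping from branch name to target. The sketch below shows one way to recover that mapping from the text after the colon (without the trailing semicolon). `parse_branches` is a hypothetical helper written for illustration; it is not part of Anvil and does not replace the BRANCHES rule.

```python
from typing import Dict


def parse_branches(block: str) -> Dict[str, str]:
    """Map each branch name to its target text, splitting at top-level commas."""
    parts = []
    depth, in_string, start = 0, False, 0
    for i, ch in enumerate(block):
        if in_string:
            # Stay in the string until the closing quote.
            in_string = ch != "'"
        elif ch == "'":
            in_string = True
        elif ch in "[(":
            depth += 1
        elif ch in "])":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(block[start:i])
            start = i + 1
    parts.append(block[start:])

    branches = {}
    for part in parts:
        # The branch name is everything before the first '=>'.
        name, target = part.split("=>", 1)
        branches[name.strip()] = target.strip()
    return branches


branches = parse_branches("true => [print], false => df")
```

Each entry stays a separate outgoing edge; nothing in the mapping rejoins the branches.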
Tools are invoked using square brackets.
TOOL_REF = { "[" ~ IDENTIFIER ~ (":" ~ TOOL_ARGS)? ~ "]" }
Tools may accept positional arguments, keyword arguments, or both.
TOOL_ARGS = {
    POSITIONAL ~ ("," ~ POSITIONAL)* ~ ("," ~ KEYWORD ~ ("," ~ KEYWORD)*)?
  | KEYWORD ~ ("," ~ KEYWORD)*
}
KEYWORD = { IDENTIFIER ~ "=" ~ VALUE }
POSITIONAL = { !(IDENTIFIER ~ "=") ~ VALUE }
Positional arguments must appear before keyword arguments.
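The ordering constraint is enforced syntactically by TOOL_ARGS. As a rough illustration of the same rule outside the grammar, the sketch below checks an already-split argument list. The `(keyword, value)` pair representation and the `check_arg_order` name are assumptions made for this example.

```python
def check_arg_order(args):
    """Return True if no positional argument follows a keyword argument.

    args is a list of (keyword_or_None, value) pairs in source order;
    a keyword of None marks a positional argument.
    """
    seen_keyword = False
    for key, _value in args:
        if key is not None:
            seen_keyword = True
        elif seen_keyword:
            # A positional argument after a keyword argument is rejected.
            return False
    return True


# Corresponds to [select: id, name='name'].
ok = check_arg_order([(None, "id"), ("name", "'name'")])

# Corresponds to [select: name='name', id], which the grammar does not accept.
bad = check_arg_order([("name", "'name'"), (None, "id")])
```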
Tool arguments may be literals, identifiers, or embedded flows.
VALUE = { LITERAL | IDENTIFIER | "(" ~ FLOW ~ ")" }
Flows wrapped in parentheses are parsed as subflows. These allow tools to accept entire pipelines as inputs.
[join:
  df_lt=(left_df),
  df_rt=([input: './data/right.parquet'] | [select: id, name='name'])
]
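Because a subflow is itself a flow, the parsed structure is naturally recursive. The sketch below models the join example as nested Python objects and walks it. The `Tool` and `Var` classes and the `tool_names` helper are hypothetical, and positional arguments and literal values are left out to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Var:
    name: str


@dataclass
class Tool:
    name: str
    # Keyword name -> value; a list value stands for a parenthesized subflow.
    kwargs: Dict[str, "Value"] = field(default_factory=dict)


Step = Union[Tool, Var]
Flow = List[Step]
Value = Union[str, Flow]


def tool_names(flow):
    """Collect tool names depth-first, descending into subflow arguments."""
    names = []
    for step in flow:
        if isinstance(step, Tool):
            names.append(step.name)
            for value in step.kwargs.values():
                if isinstance(value, list):  # a parenthesized subflow
                    names.extend(tool_names(value))
    return names


join = Tool("join", {
    "df_lt": [Var("left_df")],
    "df_rt": [Tool("input"), Tool("select")],
})
names = tool_names([join])
```

The walk visits `join` first and then the tools inside its `df_rt` subflow, which is how a subflow can contribute nodes to the same graph as its parent.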
LITERAL = { STRING | NUMBER | BOOLEAN }
STRING = @{ "'" ~ (!"'" ~ ANY)* ~ "'" }
NUMBER = @{ "-"? ~ ASCII_DIGIT+ }
BOOLEAN = { "true" | "false" }
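For readers more familiar with regular expressions, the three literal rules can be approximated as shown below. This is a sketch for checking a whole token in isolation, not a description of how the Pest parser matches input; the `classify` helper is hypothetical.

```python
import re

# STRING: a single quote, any characters except a single quote, a single quote.
STRING = re.compile(r"'[^']*'")
# NUMBER: an optional minus sign followed by one or more ASCII digits.
NUMBER = re.compile(r"-?[0-9]+")
# BOOLEAN: exactly "true" or "false".
BOOLEAN = re.compile(r"true|false")


def classify(text):
    """Return the literal kind that matches the whole text, or None."""
    for kind, pattern in (("string", STRING), ("number", NUMBER), ("boolean", BOOLEAN)):
        if pattern.fullmatch(text):
            return kind
    return None


kinds = [classify(t) for t in ["'./data/file.parquet'", "-42", "true", "3.14", "'it's'"]]
```

The last two inputs are not literals under these rules: NUMBER has no fractional part, and STRING has no escape for an embedded single quote.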
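Identifiers and variables share the same lexical form but differ semantically: in the grammar, identifiers name tools, keyword arguments, branches, and bare argument values, while variables appear as flow steps, bind targets, and branch targets.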
IDENTIFIER = @{ (ASCII_ALPHANUMERIC | "_")+ }
VARIABLE = @{ (ASCII_ALPHANUMERIC | "_")+ }
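The shared lexical form can be checked with a single pattern. The sketch below tests a whole token in isolation and says nothing about which rule the parser would pick in context; `is_identifier` is a hypothetical helper.

```python
import re

# One or more ASCII letters, ASCII digits, or underscores.
NAME = re.compile(r"[A-Za-z0-9_]+")


def is_identifier(text):
    """Return True if the whole text has the IDENTIFIER / VARIABLE form."""
    return NAME.fullmatch(text) is not None


results = {t: is_identifier(t) for t in ["df", "left_df", "2nd", "my-var", ""]}
```

Note that the rule as written places no restriction on the first character, so a name such as `2nd` has the identifier form, while a hyphenated name does not.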
The following tools are currently available:
- input: read files into a dataframe
- output: write dataframes to disk
- register: register a dataframe as a SQL table
- schema: produce a dataframe describing schema
- describe: produce dataset metadata
- select: select columns using DataFusion expressions
- filter: filter rows using expressions
- print: write dataframe to stdout
- limit: limit number of rows
- union: union dataframes
- intersect: intersect dataframes
- join: join dataframes
- sort: sort using expressions
- project: compute new columns from expressions
- sql: execute SQL against registered tables
Tool arity and semantics are validated during execution graph construction and execution, not during parsing.
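Since TOOL_REF accepts any identifier as a tool name, an unknown tool parses successfully and can only be rejected later. The sketch below shows what such a post-parse check might look like. It is an assumption about the shape of the check, not Anvil's actual validation code; `KNOWN_TOOLS` simply repeats the tool list from this document, and `unknown_tools` is a hypothetical helper.

```python
# Tool names listed in this document.
KNOWN_TOOLS = {
    "input", "output", "register", "schema", "describe",
    "select", "filter", "print", "limit", "union",
    "intersect", "join", "sort", "project", "sql",
}


def unknown_tools(names):
    """Return the tool names, in order, that are not in the known set."""
    return [name for name in names if name not in KNOWN_TOOLS]


# "frobnicate" is syntactically a valid tool name but is not a known tool.
problems = unknown_tools(["input", "frobnicate", "print"])
```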
- The grammar enforces syntax only; semantic validation is deferred
- Variables remain first-class nodes in the execution graph
- Branching represents fan-out only
- Fan-in must be modeled explicitly with tools
- Parenthesized flows enable graph composition without grammar-level grouping constructs
This grammar is intentionally conservative to keep parsing deterministic and error messages readable while allowing expressive dataflow graphs.