Anvil Grammar Specification

This document formally describes the Anvil language grammar and its semantics. Anvil is a dataflow-oriented DSL where programs describe pipelines of tools operating on tabular data. Programs are parsed into an AST and then transformed into an execution graph.

The grammar is implemented using Pest and is reproduced here with explanations and examples.

Lexical Structure

Whitespace and Comments

Whitespace is generally insignificant and may appear between tokens. Comments begin with # and run to the end of the line.

COMMENT    = _{ "#"  ~ (!NEWLINE ~ ANY)* ~ NEWLINE? }
WHITESPACE = _{ " " | "\t" | NEWLINE | COMMENT }
NEWLINE    = _{ "\r\n" | "\n" }

Programs and Statements

An Anvil program is a sequence of statements.

PROGRAM    = { SOI ~ STATEMENT* ~ EOI }

Each statement represents a complete dataflow expression and must end with a semicolon.

STATEMENT  = { FLOW ~ BRANCH_BLOCK? ~ OUTPUT_BINDING? ~ ";" }

A statement may:

consist solely of a flow
fan out into named branches
bind its final output to a variable

Flows

A flow is a left-to-right pipeline of tools and/or variables.

FLOW = { (TOOL_REF | VARIABLE) ~ (PIPE ~ (TOOL_REF | VARIABLE))* }

The pipe operator (|) connects the output of one step to the input of the next.

Example

[input: './data/file.parquet'] | [schema] | [print];

Variables may appear anywhere a tool can appear.

df | [select: id='id', name='name'] | [print];

Output Binding

A statement may bind its final output to a variable using the bind operator (>).

OUTPUT_BINDING = { BIND ~ VARIABLE }
BIND = { ">" }

Example

[input: './data/file.parquet'] | [select: id='id'] > df;

Branching (Fan-Out)

Branching allows a tool to emit multiple named outputs.

BRANCH_BLOCK  = { ":" ~ BRANCHES }
BRANCHES      = { BRANCH ~ ("," ~ BRANCH)* }
BRANCH        = { IDENTIFIER ~ "=>" ~ TARGET }
TARGET        = { VARIABLE | FLOW ~ OUTPUT_BINDING? }

Semantics

Branching always applies to the immediately preceding flow
Branch names must correspond to outputs produced by the tool
Branches do not implicitly rejoin
Merging must be done explicitly via tools such as join, union, or intersect

Example

[input: './data/messy.parquet']
| [filter: '$three == true']:
    true  => [print],
    false => df;

df | [print];

Tool Invocation

Tools are invoked using square brackets.

TOOL_REF = { "[" ~ IDENTIFIER ~ (":" ~ TOOL_ARGS)? ~ "]" }

Tool Arguments

Tools may accept positional arguments, keyword arguments, or both.

TOOL_ARGS = {
    POSITIONAL ~ ("," ~ POSITIONAL)* ~ ("," ~ KEYWORD ~ ("," ~ KEYWORD)*)? |
    KEYWORD ~ ("," ~ KEYWORD)*
}

KEYWORD    = { IDENTIFIER ~ "=" ~ VALUE }
POSITIONAL = { !(IDENTIFIER ~ "=") ~ VALUE }

Positional arguments must appear before keyword arguments.

Values

Tool arguments may be literals, identifiers, or embedded flows.

VALUE = { LITERAL | IDENTIFIER | "(" ~ FLOW ~ ")" }

Flow Values

Flows wrapped in parentheses are parsed as subflows. These allow tools to accept entire pipelines as inputs.

[join:
  df_lt=(left_df),
  df_rt=([input: './data/right.parquet'] | [select: id, name='name'])
]

Literals and Identifiers

LITERAL     = { STRING | NUMBER | BOOLEAN }
STRING      = @{ "'" ~ (!"'" ~ ANY)* ~ "'" }
NUMBER      = @{ "-"? ~ ASCII_DIGIT+ }
BOOLEAN     = { "true" | "false" }

Identifiers and variables share the same lexical form but differ semantically.

IDENTIFIER  = @{ (ASCII_ALPHANUMERIC | "_")+ }
VARIABLE    = @{ (ASCII_ALPHANUMERIC | "_")+ }

Tools

The following tools are currently available:

input — read files into a dataframe
output — write dataframes to disk
register — register a dataframe as a SQL table
schema — produce a dataframe describing schema
describe — produce dataset metadata
select — select columns using DataFusion expressions
filter — filter rows using expressions
print — write dataframe to stdout
limit — limit number of rows
union — union dataframes
intersect — intersect dataframes
join — join dataframes
sort — sort using expressions
project — compute new columns from expressions
sql — execute SQL against registered tables

Tool arity and semantics are validated during execution graph construction and execution, not during parsing.

Design Notes

The grammar enforces syntax only; semantic validation is deferred
Variables remain first-class nodes in the execution graph
Branching represents fan-out only
Fan-in must be modeled explicitly with tools
Parenthesized flows enable graph composition without grammar-level grouping constructs

This grammar is intentionally conservative to keep parsing deterministic and error messages readable while allowing expressive dataflow graphs.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Anvil Grammar Specification

Lexical Structure

Whitespace and Comments

Programs and Statements

Flows

Example

Output Binding

Example

Branching (Fan-Out)

Semantics

Example

Tool Invocation

Tool Arguments

Values

Flow Values

Literals and Identifiers

Tools

Design Notes

FilesExpand file tree

GRAMMAR.md

Latest commit

History

GRAMMAR.md

File metadata and controls

Anvil Grammar Specification

Lexical Structure

Whitespace and Comments

Programs and Statements

Flows

Example

Output Binding

Example

Branching (Fan-Out)

Semantics

Example

Tool Invocation

Tool Arguments

Values

Flow Values

Literals and Identifiers

Tools

Design Notes