Warning
I'm new to Python internals, so the tutorial may contain mistakes and clunky solutions.
The goal is to add an operator to perform bash-like chain calls to CPython:
>>> [1, 2] |> map(lambda x: x * 2) |> list()
[2, 4]According to the CPython devguide, the compilation of source code involves several steps:
Tokenize the source code
Parse the stream of tokens into an Abstract Syntax Tree.
Transform AST into an instruction sequence
Construct a Control Flow Graph and apply optimizations to it.
Emit bytecode based on the Control Flow Graph.
To implement |>, we are going to change the first three steps: modify the parsing and compilation processes.
Clone the cpython repo and checkout to a new branch.
$ git clone git@github.com:python/cpython.git && cd cpython
$ git checkout tags/v3.12.1 -b pipesLet's start with tokenization. Tokenization is a process of splitting an input string into a sequence of tokens.
Add a new PIPE token to the Grammar/Tokens
ELLIPSIS '...'
COLONEQUAL ':='
EXCLAMATION '!'
+PIPE '|>'Next, run make regen-token to regenerate pycore_token.h, Parser/token.c, Lib/token.py.
For example, look at the changes in Parser/token.c. regen-token added a switch case to the parser code.
case '|':
switch (c2) {
case '=': return VBAREQUAL;
+ case '>': return PIPE;
}
break;
}Next, move to Parser/Python.asdl. ASDL is a blueprint for Python AST nodes. Each AST node is defined by the ASDL.
Let's add a new node – PipeOP expression. It contains two children – right and left expressions.
expr = BoolOp(boolop op, expr* values)
| NamedExpr(expr target, expr value)
| BinOp(expr left, operator op, expr right)
+ | PipeOp(expr left, expr right)
| UnaryOp(unaryop op, expr operand)
| Lambda(arguments args, expr body)
| IfExp(expr test, expr body, expr orelse)Then run make regen-ast to regenerate pycore_ast.h and Python/Python-ast.c.
Look at the changes in Python/Python-ast.c. regen-ast generated the constructor function for the pipe operator expression.
expr_ty
_PyAST_PipeOp(expr_ty left, expr_ty right, int lineno, int col_offset, int
end_lineno, int end_col_offset, PyArena *arena)
{
expr_ty p;
if (!left) {
PyErr_SetString(PyExc_ValueError,
"field 'left' is required for PipeOp");
return NULL;
}
if (!right) {
PyErr_SetString(PyExc_ValueError,
"field 'right' is required for PipeOp");
return NULL;
}
p = (expr_ty)_PyArena_Malloc(arena, sizeof(*p));
if (!p)
return NULL;
p->kind = PipeOp_kind;
p->v.PipeOp.left = left;
p->v.PipeOp.right = right;
p->lineno = lineno;
p->col_offset = col_offset;
p->end_lineno = end_lineno;
p->end_col_offset = end_col_offset;
return p;
}Now let's move to the Grammar/python.gram.
This file contains the whole python grammar. In short: it describes how to construct an abstract syntax tree using the grammar rules.
Add a pipe_op expression definition.
power[expr_ty]:
| a=await_primary '**' b=factor { _PyAST_BinOp(a, Pow, b, EXTRA) }
+ | pipe_op
+
+pipe_op[expr_ty]:
+ | a=pipe_op '|>' b=await_primary { _PyAST_PipeOp(a, b, EXTRA) }
| await_primaryIt means "When matching a await_primary after |> token, construct a AST node using _PyAST_PipeOp function"
Then run make regen-pegen to regenerate Parser/parser.c.
The whole AST parser is already generated by regen-pegen.
We only need to update AST validation. Add a case for PipeOp_kind to Python/ast.c/validate_expr.
+ case PipeOp_kind:
+ if (exp->v.PipeOp.right->kind != Call_kind) {
+ PyErr_SetString(PyExc_TypeError,
+ "Pipe op arg must be a function call");
+ return 0;
+ }
+ ret = validate_expr(state, exp->v.PipeOp.left, Load) &&
+ validate_expr(state, exp->v.PipeOp.right, Load);
+ break;Recompile CPython with make -j4 and test the parser with ast.parse module:
>>> import ast
>>> tree = ast.parse('1 |> f()')
>>> ast.dump(tree)
"Module(body=[Expr(value=PipeOp(left=Constant(value=1), right=Call(func=Name(id='f', ctx=Load()), args=[], keywords=[])))], type_ignores=[])"Parsing is working, so we can move to the compilation part.
The next stage is compilation. Now we need to translate an AST to a sequence of commands for the Python VM.
a |> f(b) is another way of saying f(b, a)
Let's examine how f(b, a) is compiled into bytecode instructions using the dis module:
>>> import dis
>>> dis.dis("f(b, a)")
0 0 RESUME 0
1 2 PUSH_NULL
4 LOAD_NAME 0 (f)
6 LOAD_NAME 1 (b)
8 LOAD_NAME 2 (a)
10 CALL 2
18 RETURN_VALUEThese instructions are telling an interpreter to:
- Load a function to a stack using
LOAD_NAME - Load value of
bto the stack usingLOAD_NAME - Load value of
ato the stack usingLOAD_NAME - Call the function using
CALLwith2arguments.
So, to implement the pipe, we need to add an additional argument to the stack, between a function load and a function call, during compilation.
The starting point of compilation is the Python/compile.c file.
Let's look to the Python/compile.c/compiler_visit_expr1 .
The function describes compilation of an expr_ty AST node to with a simple switch. For example, to compile a binary operation like a + b compiler recursively visits left and right nodes, and adds binary operation to control sequence.
static int
compiler_visit_expr1(struct compiler *c, expr_ty e)
{
location loc = LOC(e);
switch (e->kind) {
case NamedExpr_kind:
VISIT(c, expr, e->v.NamedExpr.value);
ADDOP_I(c, loc, COPY, 1);
VISIT(c, expr, e->v.NamedExpr.target);
break;
case BoolOp_kind:
return compiler_boolop(c, e);
case BinOp_kind:
VISIT(c, expr, e->v.BinOp.left);
VISIT(c, expr, e->v.BinOp.right);
ADDOP_BINARY(c, loc, e->v.BinOp.op);
break;
...Add a new case with for PipeOp_kind. Let's start with a copy of regular function call.
+ case PipeOp_kind:
+ return compiler_call(c, e->>v.PipeOp.right);
case Lambda_kind:
return compiler_lambda(c, e);Recompile Cpython with make -j4 and try new operator:
>>> 1 |> f()
Assertion failed: (scope || PyUnicode_READ_CHAR(name, 0) == '_'), function compiler_nameop, file compile.c, line 4209Seems like compiler can't resolve a f symbol. Let's fix that.
The purpose of a symbol table is to resolving scopes.
Look at Python/symtable.c/symtable_visit_expr. The function recursively visits the AST and builds symbol tables.
static int
symtable_visit_expr(struct symtable *st, expr_ty e)
{
...
switch (e->kind) {
case NamedExpr_kind:
if (!symtable_raise_if_annotation_block(st, "named expression", e)) {
VISIT_QUIT(st, 0);
}
if(!symtable_handle_namedexpr(st, e))
VISIT_QUIT(st, 0);
break;
case BoolOp_kind:
VISIT_SEQ(st, expr, e->v.BoolOp.values);
break;
....Add a case for a PipeOP_kind:
case BinOp_kind:
VISIT(st, expr, e->v.BinOp.left);
VISIT(st, expr, e->v.BinOp.right);
+ break;
+ case PipeOp_kind:
+ VISIT(st, expr, e->v.PipeOp.left);
+ VISIT(st, expr, e->v.PipeOp.right);
break;
case UnaryOp_kind:
VISIT(st, expr, e->v.UnaryOp.operand);Recompile CPython with make -j4 and test symbols with symtables module:
>>> import symtable
>>> st = symtable.symtable('1 |> f()', "example.py", "exec")
>>> [ (symbol.get_name(), symbol.is_global() ) for symbol in st.get_symbols() ]
[('f', True)]Move back to the Python/compile.c/compiler_visit_expr1 and redefine case for PipeOp_kind with compiler_pipe_call:
case PipeOp_kind:
- return compiler_call(c, e->>v.PipeOp.right);
+ return compiler_pipe_call(c, e);
case Lambda_kind:
return compiler_lambda(c, e);And define a new compiler_pipe_call function. Start with a modified copy of compiler_call, which calls the right expression:
static int compiler_pipe_call(struct compiler *c, expr_ty e) {
expr_ty func_e = e->v.PipeOp.right;
expr_ty arg_e = e->v.PipeOp.left;
RETURN_IF_ERROR(validate_keywords(c, func_e->v.Call.keywords));
int ret = maybe_optimize_method_call(c, func_e);
if (ret < 0) {
return ERROR;
}
if (ret == 1) {
return SUCCESS;
}
RETURN_IF_ERROR(check_caller(c, func_e->v.Call.func));
location loc = LOC(func_e->v.Call.func);
ADDOP(c, loc, PUSH_NULL);
VISIT(c, expr, func_e->v.Call.func);
loc = LOC(func_e);
return compiler_call_helper(c, loc, 0,
func_e->v.Call.args,
func_e->v.Call.keywords);
}Recompile CPython and check that nothing is broken:
>>> f = lambda x:x
>>> 1 |> f()
TypeError: <lambda>() missing 1 required positional argument: 'x'Fine. Finally, we need to add e->v.PipeOp.left to v.Call.args
Let's create a new argument sequence using _Py_asdl_expr_seq_new and fill it with original args and left pipe expression:
static int compiler_pipe_call(struct compiler *c, expr_ty e) {
expr_ty func_e = e->v.PipeOp.right;
expr_ty arg_e = e->v.PipeOp.left;
RETURN_IF_ERROR(validate_keywords(c, func_e->v.Call.keywords));
int ret = maybe_optimize_method_call(c, func_e);
if (ret < 0) {
return ERROR;
}
if (ret == 1) {
return SUCCESS;
}
RETURN_IF_ERROR(check_caller(c, func_e->v.Call.func));
location loc = LOC(func_e->v.Call.func);
ADDOP(c, loc, PUSH_NULL);
VISIT(c, expr, func_e->v.Call.func);
loc = LOC(func_e);
+ Py_ssize_t original_len = asdl_seq_LEN(func_e->v.Call.args);
+ asdl_expr_seq *new_args = _Py_asdl_expr_seq_new(
+ original_len+1, c->c_arena);
+ for (Py_ssize_t i = 0; i < original_len; i++) {
+ asdl_seq_SET(new_args, i, asdl_seq_GET(func_e->v.Call.args, i));
+ }
+ asdl_seq_SET(new_args, original_len, arg_e);
return compiler_call_helper(c, loc, 0,
- func_e->v.Call.args,
+ new_args,
func_e->v.Call.keywords);
}
Thats it!
Recompile Cpython with make -j4 and try new operator:
>>> [1, 2] |> map(lambda x: x * 2) |> list()
[2, 4]Yaaaaay!
>>> import dis
>>> dis.dis("a |> f(b)")
0 0 RESUME 0
1 2 PUSH_NULL
4 LOAD_NAME 0 (f)
6 LOAD_NAME 1 (b)
8 LOAD_NAME 2 (a)
10 CALL 2
18 RETURN_VALUEThe bytecode is the same to a f(b, a).