diff --git a/ACTION_PLAN.md b/ACTION_PLAN.md new file mode 100644 index 0000000..cd9e0f4 --- /dev/null +++ b/ACTION_PLAN.md @@ -0,0 +1,322 @@ +# PyJapt v1.0 — Action Plan + +> **Current version:** 0.4.1 +> **Target version:** 1.0.0 +> **Status:** Planning phase + +This document is the authoritative roadmap for bringing PyJapt to a stable v1.0 release. Items are grouped by category and ordered by priority within each category. + +--- + +## 1. Critical Bug Fixes + +These bugs affect correctness and must be resolved before v1. + +### 1.1 Lexer state not reset on repeated calls + +**File:** `pyjapt/lexing.py` — `Lexer.__call__` + +`Lexer.__call__` resets `lineno`, `column`, `position`, `text`, and `token`, but it does **not** reset: +- `self._errors` — errors from a previous run accumulate +- `self.contain_errors` — flag is stale after the first run with errors + +**Fix:** Reset `_errors = []` and `contain_errors = False` at the start of `__call__`. + +--- + +### 1.2 `errors` property signature is broken + +**Files:** `pyjapt/lexing.py:102`, `pyjapt/parsing.py:1072` + +Both `Lexer.errors` and `ShiftReduceParser.errors` are decorated with `@property` but carry a `clean: bool = True` parameter: + +```python +@property +def errors(self, clean: bool = True): # ← wrong: properties don't accept arguments +``` + +Python silently ignores the `clean` parameter and always calls it as a property. The `clean` branch (returning tuples with row/column) is therefore unreachable. 
+ +**Fix:** Replace with two separate accessors: +```python +@property +def errors(self) -> List[str]: + return [m for _, _, m in sorted(self._errors)] + +@property +def errors_with_location(self) -> List[Tuple[int, int, str]]: + return sorted(self._errors) +``` + +--- + +### 1.3 `ShiftReduceParser` class-level mutable state + +**File:** `pyjapt/parsing.py:1032-1033` + +```python +class ShiftReduceParser: + contains_errors: bool = False # ← shared across all instances + current_token: Optional[Token] = None # ← shared across all instances +``` + +Class-level attributes are shared across all instances of a class. Two parsers used in the same program would corrupt each other's state. + +**Fix:** Move both to `__init__`. + +--- + +### 1.4 LALR(1) lookahead algorithm mixes strings and Symbol objects + +**File:** `pyjapt/parsing.py` — `determining_lookaheads` and `build_lalr1_automaton` + +The propagation sentinel `"#"` is a plain string that is mixed into lookahead sets normally occupied by `Symbol` objects. This works coincidentally because `"#"` doesn't collide with any symbol name, but it is fragile and will silently break if a grammar ever names a symbol `#`. + +**Fix:** Use a dedicated `PropagationTerminal` singleton (already defined in the file as `PropagationTerminal`) instead of the magic string, or use a `None` sentinel that is excluded from lookahead propagation explicitly. + +--- + +### 1.5 `Grammar.augmented_grammar` semantic rule is wrong + +**File:** `pyjapt/parsing.py:462` + +```python +new_start_symbol %= start_symbol + grammar.EPSILON, lambda x: x +``` + +The lambda receives a `RuleList`, not the symbol value directly. 
The production `S' -> start_symbol` (with epsilon swallowed) should return `s[1]`, not the `RuleList` itself: + +```python +new_start_symbol %= start_symbol + grammar.EPSILON, lambda s: s[1] +``` + +--- + +### 1.6 `Grammar.__getitem__` returns `None` for missing keys + +**File:** `pyjapt/parsing.py:592-599` + +When a production string references a symbol that doesn't exist, `Grammar.__getitem__` silently returns `None`. This produces a confusing `AttributeError` deep in the call chain rather than a clear `GrammarError`. + +**Fix:** Raise `GrammarError` with the unknown symbol name. + +--- + +## 2. Code Quality & Modernisation + +### 2.1 Move `flake8` to dev dependencies + +**File:** `pyproject.toml` + +`flake8` is listed under `[tool.poetry.dependencies]` (runtime). It is a linter and must move to `[tool.poetry.dev-dependencies]`. + +--- + +### 2.2 Update deprecated Poetry build backend + +**File:** `pyproject.toml` + +```toml +# current (deprecated) +build-backend = "poetry.masonry.api" + +# correct +build-backend = "poetry.core.masonry.api" +``` + +--- + +### 2.3 Add `mkdocs` and `mkdocs-material` as dev dependencies + +**File:** `pyproject.toml` + +Documentation builds are part of the development workflow. + +--- + +### 2.4 Rename `pyjapt/typing.py` + +The module name `typing` shadows Python's standard-library `typing` module inside the package. Rename it to `pyjapt/types.py` or `pyjapt/_types.py` and update the import in `tests/test_arithmetic_grammar.py`. + +--- + +### 2.5 Export `RuleList` and parsers from the top-level `__init__.py` + +**File:** `pyjapt/__init__.py` + +`RuleList` and individual parser classes (`SLRParser`, `LR1Parser`, `LALR1Parser`) are not exported from the package root. Users must import from internal submodules. 
Add them to `__init__.py`: + +```python +from pyjapt.parsing import ( + ShiftReduceParser, SLRParser, LR1Parser, LALR1Parser, Grammar, RuleList +) +``` + +--- + +### 2.6 Add type annotations to public API + +Currently many method signatures lack return type annotations. Add full annotations to: +- `Grammar.get_lexer`, `Grammar.get_parser`, `Grammar.serialize_*` +- `Lexer.__call__`, `Lexer.tokenize` +- `ShiftReduceParser.__call__` +- All `add_*` methods on `Grammar` + +--- + +### 2.7 Replace bare `assert` with proper exceptions + +Bare `assert` statements are silently disabled when Python runs with the `-O` (optimise) flag: + +- `pyjapt/parsing.py:835` — `assert len(grammar.start_symbol.productions) == 1` +- `pyjapt/parsing.py:906` — `assert not lookaheads.contains_epsilon` +- `pyjapt/lexing.py` — several in `Grammar.add_terminal` + +Replace with `if not ...: raise GrammarError(...)`. + +--- + +### 2.8 Serialised parser resets `augmented_grammar` fields + +**File:** `pyjapt/serialization.py` + +The serialised parser template does not call `_build_automaton` or compute `firsts`/`follows`, which is correct. But it also doesn't set `augmented_grammar`, `firsts`, or `follows`, which means the serialised parser cannot be safely extended. Document this limitation and add a guard. + +--- + +### 2.9 CI: test against Python 3.10, 3.11, and 3.12 + +**File:** `.github/workflows/python-test-app.yml` + +Add a matrix strategy to test against all supported Python versions. + +--- + +## 3. Testing Improvements + +### 3.1 Add tests for LR(1) and LALR(1) parsers + +`tests/test_arithmetic_grammar.py` only tests the SLR parser. Add `test_lr1` and `test_lalr1` parameterised over the same set of inputs. 
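The shared-input requirement can be met with one table-driven helper reused by all three variants. The sketch below is self-contained — a plain `eval` stands in for the real lexer→parser pipeline, and the `g.get_parser(...)` wiring named in the comment follows the API described elsewhere in this plan:

```python
# Input/expected pairs shared by the SLR, LR(1), and LALR(1) tests.
CASES = [
    ("1 + 2", 3),
    ("2 * (3 + 4)", 14),
    ("10 - 2 - 3", 5),  # exercises left associativity
]

def check_parser(parse):
    """Run every shared case through a `parse(source) -> value` callable."""
    for source, expected in CASES:
        assert parse(source) == expected, f"failed on {source!r}"

# Stand-in evaluator keeps this sketch runnable on its own. In the real tests,
# `parse` would be built once per variant, e.g.:
#   parse = lambda src: g.get_parser('lalr1')(g.get_lexer()(src))
check_parser(eval)
```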
+ +### 3.2 Add tests for lexer error handling + +- Unknown character → default error handler → `errors` list populated +- Custom `lexical_error` decorator +- `contain_errors` flag is `True` after a failed tokenisation +- Errors reset correctly across multiple calls + +### 3.3 Add tests for parser error handling + +- Syntactic error → `errors` list populated +- `contains_errors` flag is `True` +- Error recovery (`error` terminal / panic mode) +- Custom `parsing_error` decorator + +### 3.4 Add tests for serialisation + +- Round-trip: build grammar → serialise lexer and parser → import → parse identical inputs → same result + +### 3.5 Add edge-case tests + +- Empty grammar (no productions) raises `GrammarError` +- Duplicate terminal/non-terminal name raises immediately +- Production referencing undefined symbol raises `GrammarError` +- Epsilon productions +- Grammars with conflicts produce correct conflict counts + +### 3.6 Measure and enforce coverage + +Add `pytest-cov` and set a minimum coverage threshold (target ≥ 85 %) in CI. + +--- + +## 4. Documentation + +The documentation website is built with **MkDocs + Material theme** and lives under `docs/`. See `mkdocs.yml` for the full configuration. + +| Page | Status | +|------|--------| +| `docs/index.md` | Done | +| `docs/getting-started.md` | Done | +| `docs/defining-grammar.md` | Done | +| `docs/lexer.md` | Done | +| `docs/parser.md` | Done | +| `docs/error-handling.md` | Done | +| `docs/serialization.md` | Done | +| `docs/api-reference.md` | Done | +| `docs/changelog.md` | Done | + +### 4.1 Add `CHANGELOG.md` + +Track every version with date and changes, following [Keep a Changelog](https://keepachangelog.com) format. + +### 4.2 Add `CONTRIBUTING.md` + +Describe: +- How to clone and set up the dev environment +- How to run tests and linting +- Branching and PR conventions +- Code of conduct pointer + +--- + +## 5. 
Packaging & Release + +### 5.1 Update version to `1.0.0` + +**Files:** `pyjapt/__init__.py` and `pyproject.toml` + +### 5.2 Populate package metadata + +**File:** `pyproject.toml` + +Add: +```toml +license = "MIT" +keywords = ["lexer", "parser", "LR", "LALR", "compiler", "grammar"] +classifiers = [ + "Programming Language :: Python :: 3", + "License :: OSI Approved :: MIT License", + "Topic :: Software Development :: Compilers", +] +repository = "https://github.com/alejandroklever/PyJapt" +documentation = "https://alejandroklever.github.io/PyJapt" +``` + +### 5.3 Add a GitHub Actions workflow to build and deploy docs + +Publish the MkDocs site to GitHub Pages on every push to `main`. + +### 5.4 Tag and publish to PyPI + +After all items above are resolved: +1. Bump version to `1.0.0` in `__init__.py` (the `build.py` script syncs `pyproject.toml` automatically). +2. Push a `v1.0.0` git tag. +3. Create a GitHub Release — the existing publish workflow triggers on `release: published`. + +--- + +## 6. 
Future Work (Post-v1) + +The following are explicitly out of scope for v1 but should be tracked: + +| Feature | Notes | +|---------|-------| +| Operator precedence declarations | Resolves SR conflicts declaratively (like `%left`, `%right` in Yacc) | +| LL(1) parser support | Mentioned in README as future work | +| Grammar visualisation | Export automata as DOT / SVG | +| Incremental parsing | Re-lex only changed regions | +| Better conflict reporting | Show the conflicting items and lookaheads in a human-readable table | +| Unicode identifiers in grammars | Non-ASCII symbol names | +| Async tokenisation | Yield tokens lazily for very large inputs | + +--- + +## Priority Order Summary + +| Priority | Item | +|----------|------| +| P0 — Must fix before v1 | 1.1, 1.2, 1.3, 1.4, 1.5, 1.6 | +| P1 — Fix before v1 | 2.1, 2.2, 2.4, 2.5, 3.1, 3.2, 3.3 | +| P2 — Nice-to-have before v1 | 2.3, 2.6, 2.7, 2.8, 2.9, 3.4, 3.5, 3.6, 5.1 – 5.4 | +| P3 — Post-v1 | Section 6 | diff --git a/docs/api-reference.md b/docs/api-reference.md new file mode 100644 index 0000000..7b307f4 --- /dev/null +++ b/docs/api-reference.md @@ -0,0 +1,425 @@ +# API Reference + +This page lists every public class and method exported by PyJapt. + +--- + +## Top-Level Exports + +```python +from pyjapt import ( + Grammar, + Lexer, + Token, + ShiftReduceParser, + SLRParser, + LR1Parser, + LALR1Parser, + RuleList, +) +``` + +--- + +## `Grammar` + +The central object for defining a language. + +```python +from pyjapt import Grammar +g = Grammar() +``` + +### Terminals + +--- + +#### `Grammar.add_terminal(name, regex=None, rule=None) -> Terminal` + +Create and register a terminal symbol. + +| Parameter | Type | Description | +|-----------|------|-------------| +| `name` | `str` | Unique terminal name. Must be a valid string. | +| `regex` | `str \| None` | Regular expression. If `None`, the regex is `re.escape(name)` (literal match). 
| +| `rule` | `Callable[[Lexer], Optional[Token]] \| None` | Rule function invoked when this token is matched. | + +Returns the new `Terminal` object. + +Raises `AssertionError` if `name` is already defined. + +--- + +#### `Grammar.add_terminals(names) -> Tuple[Terminal, ...]` + +Convenience wrapper. Splits `names` on whitespace and calls `add_terminal` for each. + +```python +plus, minus, star = g.add_terminals('+ - *') +``` + +--- + +#### `Grammar.terminal(name, regex) -> Callable` + +Decorator factory. Creates the terminal **and** registers the decorated function as its rule. + +```python +@g.terminal('int', r'\d+') +def int_rule(lexer): + lexer.position += len(lexer.token.lex) + lexer.column += len(lexer.token.lex) + return lexer.token +``` + +--- + +#### `Grammar.add_terminal_error()` + +Registers the built-in `error` terminal for use in error-recovery productions. Call this before writing any production that contains `error`. + +--- + +### Non-Terminals + +--- + +#### `Grammar.add_non_terminal(name, start_symbol=False) -> NonTerminal` + +Create and register a non-terminal symbol. + +| Parameter | Type | Description | +|-----------|------|-------------| +| `name` | `str` | Unique non-terminal name. | +| `start_symbol` | `bool` | Mark as the start symbol. Only one allowed per grammar. | + +Raises `Exception` if a second `start_symbol=True` is provided. + +--- + +#### `Grammar.add_non_terminals(names) -> Tuple[NonTerminal, ...]` + +Splits `names` on whitespace and calls `add_non_terminal` for each. + +```python +stmt, expr, term = g.add_non_terminals('stmt expr term') +``` + +--- + +### Productions + +--- + +#### `Grammar.production(*production_strings) -> Callable` + +Decorator factory that registers the decorated function as the semantic action for one or more productions. + +The production string format is `'head -> body'` where `body` is a space-separated list of symbol names. 
+ +```python +@g.production('expr -> expr + term', 'expr -> expr - term') +def additive(s): + return s[1] + s[3] if s[2] == '+' else s[1] - s[3] +``` + +--- + +#### `NonTerminal.__imod__(other) -> NonTerminal` + +Operator `%=` overload for adding productions to a non-terminal. + +```python +# Unattributed +expr %= 'expr + term' + +# With semantic action +expr %= 'expr + term', lambda s: s[1] + s[3] + +# Epsilon +expr %= '' +``` + +`other` can be: +- A `str` (space-separated symbol names) +- A `Symbol` or `Sentence` (built from Symbol objects with `+`) +- A `tuple` of `(str | Sentence, callable)` for attributed productions +- A `SentenceList` (built with `|`) for multiple alternatives + +--- + +### Error Handlers + +--- + +#### `Grammar.lexical_error(handler) -> handler` + +Decorator. Registers a custom lexical error handler. + +```python +@g.lexical_error +def lex_error(lexer): + lexer.add_error(lexer.lineno, lexer.column, + f'unexpected "{lexer.token.lex}"') + lexer.position += 1 + lexer.column += 1 +``` + +--- + +#### `Grammar.parsing_error(handler) -> handler` + +Decorator. Registers a custom syntactic error handler. + +```python +@g.parsing_error +def parse_error(parser): + tok = parser.current_token + parser.add_error(tok.line, tok.column, f'unexpected "{tok.lex}"') +``` + +--- + +### Generating the Lexer and Parser + +--- + +#### `Grammar.get_lexer() -> Lexer` + +Build and return a `Lexer` for this grammar. + +--- + +#### `Grammar.get_parser(name, verbose=False) -> ShiftReduceParser` + +Build and return a parser. + +| `name` | Parser type | +|--------|-------------| +| `'slr'` | Simple LR | +| `'lalr1'` | LALR(1) | +| `'lr1'` | Canonical LR(1) | + +Raises `ValueError` for unknown names. + +--- + +### Serialization + +--- + +#### `Grammar.serialize_lexer(class_name, grammar_module_name, grammar_variable_name='G')` + +Generate `lexertab.py` in the current working directory. 
+ +--- + +#### `Grammar.serialize_parser(parser_type, class_name, grammar_module_name, grammar_variable_name='G')` + +Generate `parsertab.py` in the current working directory. + +--- + +### Utility + +--- + +#### `Grammar.to_json() -> str` + +Serialize the grammar structure (terminals, non-terminals, productions) to a JSON string. Semantic actions and regexes are **not** included. + +--- + +#### `Grammar.from_json(data) -> Grammar` + +Class method. Reconstruct a grammar from the JSON string produced by `to_json()`. + +--- + +#### `Grammar.__getitem__(item) -> Symbol | Production | None` + +Look up a symbol or production by name/repr-string. + +```python +plus_symbol = g['+'] +production = g['expr -> expr + term'] +``` + +--- + +## `Token` + +```python +class Token: + lex: str # lexeme string + token_type: Any # terminal name (str) or Terminal object + line: int # 1-based line number + column: int # 1-based column number +``` + +### Class methods + +#### `Token.empty() -> Token` + +Return an empty sentinel token `Token('', '', 0, 0)`. + +### Properties + +#### `Token.is_valid -> bool` + +Always `True` for a regular token. (Subclasses may override for error tokens.) + +--- + +## `Lexer` + +```python +class Lexer: + lineno: int # current line (1-based) + column: int # current column (1-based) + position: int # byte offset in input + text: str # full input string + token: Token # token being processed + contain_errors: bool # True after first error +``` + +### `Lexer.__call__(text) -> List[Token]` + +Tokenise `text`. Resets all internal state before each call. Appends an EOF token at the end. + +### `Lexer.tokenize(text) -> Generator[Token, None, None]` + +Low-level generator. Does **not** reset state. Prefer `__call__` for normal use. + +### `Lexer.errors -> List[str]` + +Sorted list of error message strings accumulated during the last call. + +### `Lexer.add_error(line, col, message)` + +Append an error entry. 
Intended for use inside custom terminal rules and error handlers. + +--- + +## `ShiftReduceParser` + +Base class for all three parser variants. Do not instantiate directly; use `Grammar.get_parser`. + +```python +class ShiftReduceParser: + SHIFT = 'SHIFT' + REDUCE = 'REDUCE' + OK = 'OK' +``` + +### `ShiftReduceParser.__call__(tokens) -> Any` + +Parse a list of `Token` objects and return the semantic value of the start symbol, or `None` if parsing failed. + +### `ShiftReduceParser.errors -> List[str]` + +Sorted list of syntactic error messages. + +### `ShiftReduceParser.add_error(line, column, message)` + +Append an error entry from inside a semantic action or error handler. + +### `ShiftReduceParser.contains_errors -> bool` + +`True` if any parsing error has been detected. + +### `ShiftReduceParser.current_token -> Token` + +The token being processed at the time the most recent error occurred. + +### `ShiftReduceParser.conflicts -> List[Tuple]` + +List of detected conflicts, each a `('SR' | 'RR', prod_a, prod_b)` tuple. + +### `ShiftReduceParser.shift_reduce_count -> int` + +Number of shift-reduce conflicts. + +### `ShiftReduceParser.reduce_reduce_count -> int` + +Number of reduce-reduce conflicts. + +--- + +## `SLRParser` + +```python +class SLRParser(ShiftReduceParser): ... +``` + +Uses the LR(0) automaton and Follow sets for lookaheads. + +--- + +## `LR1Parser` + +```python +class LR1Parser(ShiftReduceParser): ... +``` + +Uses the canonical LR(1) automaton with per-item lookaheads. + +--- + +## `LALR1Parser` + +```python +class LALR1Parser(LR1Parser): ... +``` + +Uses the merged LALR(1) automaton. Same states as SLR, same power as LR(1) for most grammars. + +--- + +## `RuleList` + +Passed to every semantic action as `s`. 1-indexed over the production body. + +### `RuleList.__getitem__(index) -> Any` + +`s[0]` — head value (output). +`s[1]` … `s[n]` — body symbol values. 
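The 1-based scheme can be pictured as a plain list whose slot 0 is reserved for the head's value — a conceptual sketch only, not PyJapt's actual `RuleList` class:

```python
# Conceptual model of RuleList's 1-based indexing (not the real implementation).
class RuleListSketch:
    def __init__(self, body_values):
        self._items = [None] + list(body_values)  # slot 0: the head's value

    def __getitem__(self, index):
        return self._items[index]

# Body values for 'expr -> expr + term':
s = RuleListSketch([2, "+", 3])
head_value = s[1] + s[3]  # what the semantic action would return
```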
+ +### `RuleList.add_error(index, message)` + +Report an error at the position of `s[index]` (int) or at an explicit `(line, column)` tuple. + +### `RuleList.force_parsing_error()` + +Mark the parse as failed without adding an error message. + +--- + +## `NonTerminal` + +Represents a grammar non-terminal. + +| Attribute | Type | Description | +|-----------|------|-------------| +| `name` | `str` | Symbol name | +| `productions` | `List[Production]` | Productions where this symbol is the head | + +--- + +## `Terminal` + +Represents a grammar terminal. + +| Attribute | Type | Description | +|-----------|------|-------------| +| `name` | `str` | Symbol name | + +--- + +## `Production` + +| Attribute | Type | Description | +|-----------|------|-------------| +| `left` | `NonTerminal` | Production head | +| `right` | `Sentence` | Production body | +| `rule` | `Callable \| None` | Semantic action | diff --git a/docs/changelog.md b/docs/changelog.md new file mode 100644 index 0000000..9cab546 --- /dev/null +++ b/docs/changelog.md @@ -0,0 +1,92 @@ +# Changelog + +All notable changes to PyJapt are documented here. +This project follows [Semantic Versioning](https://semver.org) and the +[Keep a Changelog](https://keepachangelog.com/en/1.0.0/) format. + +--- + +## [Unreleased] — v1.0.0 + +### Planned — Bug Fixes +- Reset `_errors` and `contain_errors` in `Lexer.__call__` so repeated calls don't accumulate stale errors. +- Fix `errors` property signature on `Lexer` and `ShiftReduceParser` (properties cannot accept arguments). +- Move `contains_errors` and `current_token` from class-level to instance-level in `ShiftReduceParser`. +- Fix `Grammar.augmented_grammar` semantic action (`lambda s: s[1]` instead of `lambda x: x`). +- Fix `s.Name` → `s.name` in `Grammar.to_json()` (case mismatch causes `AttributeError`). +- Raise `GrammarError` in `Grammar.__getitem__` instead of returning `None` for missing symbols. 
+- Replace bare `assert` statements with proper `GrammarError` exceptions. + +### Planned — Improvements +- Move `flake8` from runtime to dev dependencies. +- Update build backend to `poetry.core.masonry.api` (replaces deprecated `poetry.masonry.api`). +- Export `RuleList`, `SLRParser`, `LR1Parser`, `LALR1Parser` from `pyjapt.__init__`. +- Rename `pyjapt/typing.py` to `pyjapt/types.py` to avoid shadowing stdlib `typing`. +- Add full type annotations to the public API. +- Expand CI matrix to Python 3.10, 3.11, and 3.12. + +### Planned — Testing +- Add tests for LR(1) and LALR(1) parsers. +- Add tests for lexer and parser error handling. +- Add tests for serialization round-trips. +- Add edge-case tests (empty grammar, duplicate symbols, epsilon productions). +- Enforce minimum test coverage threshold. + +### Planned — Documentation +- Full MkDocs site with Material theme. +- Getting-started guide and user-guide sections. +- Complete API reference. +- Changelog (this file). + +--- + +## [0.4.1] — 2024-03-25 + +### Fixed +- Updated README with corrected examples and improved prose. + +--- + +## [0.4.0] — 2023-02-17 + +### Added +- GitHub Actions workflow for publishing to PyPI on release. +- `requirements.txt` for legacy `pip install` support. + +--- + +## [0.3.0] — 2021-03-?? + +### Added +- Default error report in the shift-reduce parser (panic-mode recovery). +- Improved `RuleList` error API. + +### Fixed +- Reset lexer parameters when analyzing a new string (`Lexer.__call__`). + +--- + +## [0.2.9] — 2021-??-?? + +### Fixed +- Minor fix in parsing default error detection. + +--- + +## [0.2.x] — 2020 + +### Added +- SLR, LR(1), and LALR(1) parsers. +- Serialization of lexer and parser to Python source files. +- `@g.terminal` decorator for inline rule definition. +- `@g.production` decorator for inline production rules. +- `@g.lexical_error` and `@g.parsing_error` decorators. +- `add_terminal_error()` and error terminal support in productions. 
+- `Grammar.to_json()` / `Grammar.from_json()`. +- JSON grammar import/export. + +--- + +## [0.1.x] — 2020 + +- Initial release with basic lexer and SLR parser. diff --git a/docs/defining-grammar.md b/docs/defining-grammar.md new file mode 100644 index 0000000..e289f3a --- /dev/null +++ b/docs/defining-grammar.md @@ -0,0 +1,227 @@ +# Defining a Grammar + +A `Grammar` object is the single source of truth for your language. This page covers all the ways to build one. + +--- + +## Creating the Grammar + +```python +from pyjapt import Grammar + +g = Grammar() +``` + +--- + +## Non-Terminals + +Non-terminals are the syntactic categories of your language (e.g. `expr`, `statement`, `program`). + +### `add_non_terminal(name, start_symbol=False)` + +```python +program = g.add_non_terminal('program', start_symbol=True) +stmt = g.add_non_terminal('stmt') +expr = g.add_non_terminal('expr') +``` + +- `name` — must be a unique, non-empty string. +- `start_symbol=True` — marks this as the grammar's start symbol. Only one non-terminal can carry this flag. + +Returns a `NonTerminal` object that you use to write productions. + +### `add_non_terminals(names)` + +Convenience method: accepts a space-separated string and returns a tuple of `NonTerminal` objects in the same order. + +```python +stmt, expr, term, fact = g.add_non_terminals('stmt expr term fact') +``` + +--- + +## Terminals + +Terminals are the atomic tokens produced by the lexer. + +### `add_terminal(name, regex=None, rule=None)` + +```python +# Literal terminal — the regex is the escaped name +plus = g.add_terminal('+') +minus = g.add_terminal('-') + +# Terminal with a custom regex +num = g.add_terminal('int', regex=r'\d+') + +# Terminal with a custom regex AND a lexer rule +num = g.add_terminal('int', regex=r'\d+', rule=lambda lexer: ...) +``` + +- When `regex` is `None`, the regular expression used is `re.escape(name)`, so `+` matches the literal character `+`. +- `rule` is a function `(Lexer) -> Optional[Token]`. 
If it returns `None`, the token is discarded. + +### `add_terminals(names)` + +Accepts a space-separated string and returns a tuple. All created terminals use their name as the literal regex. + +```python +plus, minus, star, div, lpar, rpar = g.add_terminals('+ - * / ( )') +``` + +### `@g.terminal(name, regex)` + +A decorator that creates the terminal **and** registers the rule in one step. + +```python +@g.terminal('int', r'\d+') +def int_terminal(lexer): + lexer.column += len(lexer.token.lex) + lexer.position += len(lexer.token.lex) + lexer.token.lex = int(lexer.token.lex) + return lexer.token +``` + +The decorated function receives the `Lexer` instance and must either return the `Token` (possibly modified) or return `None`/nothing to discard it. + +--- + +## Productions + +Productions define how non-terminals are composed from sequences of terminals and non-terminals. + +### Using `%=` with a string (recommended) + +```python +expr %= 'expr + term' # unattributed +expr %= 'expr + term', lambda s: s[1] + s[3] # with semantic action +``` + +The string on the right-hand side is a space-separated list of symbol names. Each name must already be declared in the grammar. + +Inside the semantic action, `s` is a `RuleList`: + +| Index | Meaning | +|-------|---------| +| `s[0]` | The head non-terminal's value (set by returning from the action) | +| `s[1]` | Value of the 1st body symbol | +| `s[2]` | Value of the 2nd body symbol | +| `s[n]` | Value of the nth body symbol | + +For a terminal, the value is the token's lexeme (`str`). +For a non-terminal, the value is whatever its production's semantic action returned. + +### Using `%=` with `Symbol` objects + +```python +expr %= expr + plus + term +expr %= expr + plus + term, lambda s: s[1] + s[3] +``` + +`Symbol` objects support `+` to build `Sentence` objects, so you can construct productions with the original variable references. 
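The `+` chaining can be modelled with two tiny classes — an illustrative sketch of the operator-overloading idea, not PyJapt's real `Symbol`/`Sentence` types:

```python
# Illustrative sketch of symbol concatenation via operator overloading.
class Sym:
    def __init__(self, name):
        self.name = name

    def __add__(self, other):
        return Sent([self]) + other  # promote to a sentence, then extend

class Sent:
    def __init__(self, symbols):
        self.symbols = symbols

    def __add__(self, other):
        return Sent(self.symbols + [other])

    def __repr__(self):
        return " ".join(s.name for s in self.symbols)

expr, plus, term = Sym("expr"), Sym("+"), Sym("term")
body = expr + plus + term  # a Sent holding [expr, +, term]
```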
+ +### Epsilon productions + +```python +expr %= '' # empty string → epsilon production +expr %= g.EPSILON # same thing using the EPSILON symbol directly +``` + +### `@g.production(*production_strings)` + +A decorator alternative to `%=`. It binds the decorated function to one or more production strings. + +```python +@g.production('expr -> expr + term') +def expr_add(s): + return s[1] + s[3] +``` + +The string format is `'head -> body'` where `->` separates the head non-terminal from the body symbols. + +You can attach the same function to multiple productions: + +```python +@g.production( + 'expr -> expr + expr', + 'expr -> expr - expr', + 'expr -> expr * expr', + 'expr -> expr / expr', +) +def binary_op(s): + if s[2] == '+': return s[1] + s[3] + if s[2] == '-': return s[1] - s[3] + if s[2] == '*': return s[1] * s[3] + if s[2] == '/': return s[1] // s[3] +``` + +--- + +## Special Terminals + +### `g.EOF` + +The end-of-file terminal (`$`). It is added automatically; you should not declare it yourself. + +### `g.EPSILON` + +Represents the empty word. Use it to write nullable productions. + +### `g.ERROR` + +A special terminal used for error recovery productions. You must register it explicitly before use: + +```python +g.add_terminal_error() +``` + +See [Error Handling](error-handling.md) for full details. + +--- + +## Inspecting the Grammar + +```python +# All non-terminals +print(g.non_terminals) + +# All terminals +print(g.terminals) + +# All productions +print(g.productions) + +# Look up any symbol by name +sym = g['expr'] + +# Look up a production by repr-string +prod = g['expr -> expr + term'] +``` + +### `Grammar.__str__` + +```python +print(g) +# Non-Terminals: +# expr, term, fact +# Terminals: +# +, -, *, /, (, ), int, whitespace +# Productions: +# [expr -> expr + term, ...] 
+``` + +--- + +## JSON Import / Export + +PyJapt supports a basic JSON representation of the grammar (without semantic actions): + +```python +json_str = g.to_json() + +g2 = Grammar.from_json(json_str) +``` + +!!! note + JSON serialization does not preserve terminal regexes, terminal rules, or semantic actions. It is useful for inspecting grammar structure, not for production use. Use [Python file serialization](serialization.md) for production scenarios. diff --git a/docs/error-handling.md b/docs/error-handling.md new file mode 100644 index 0000000..c07d93f --- /dev/null +++ b/docs/error-handling.md @@ -0,0 +1,177 @@ +# Error Handling + +Good error handling is one of PyJapt's core design goals. This page describes how to report and recover from both lexical and syntactic errors. + +--- + +## Lexical Error Handling + +### Default behavior + +When the lexer encounters a character that matches no terminal pattern, it calls the *lexical error handler*. By default, this adds an error message to the internal errors list and advances past the bad character. + +### Custom handler — `@g.lexical_error` + +Decorate a function with `@g.lexical_error` to replace the default handler: + +```python +@g.lexical_error +def on_lex_error(lexer): + line, col = lexer.lineno, lexer.column + bad_char = lexer.token.lex + + lexer.add_error(line, col, + f'({line}, {col}) - LexicographicError: unexpected character "{bad_char}"') + + # Always advance to avoid an infinite loop + lexer.position += 1 + lexer.column += 1 +``` + +!!! warning "Always advance `lexer.position`" + If your handler does not advance `lexer.position`, the lexer will match the same bad character indefinitely. 
+ +### Reporting errors from a terminal rule + +You can also detect and report errors from inside a terminal rule: + +```python +@g.terminal('comment_error', r'/\*(.|\n)*$') +def eof_in_comment(lexer): + """Match a /* comment that reaches EOF without a closing */""" + lexer.contain_errors = True + lex = lexer.token.lex + for ch in lex: + if ch == '\n': + lexer.lineno += 1 + lexer.column = 1 + else: + lexer.column += 1 + lexer.position += len(lex) + lexer.add_error( + lexer.lineno, lexer.column, + f'({lexer.lineno}, {lexer.column}) - LexicographicError: EOF in comment' + ) +``` + +### Checking lexical errors + +```python +tokens = lexer(source_code) + +if lexer.contain_errors: + for message in lexer.errors: + print(message) +``` + +`lexer.errors` returns a list of error message strings, sorted by position. + +--- + +## Syntactic Error Handling + +### Default behavior + +When the parser cannot find an action for the current `(state, token)` pair it enters *panic-mode recovery*: it calls the error handler and then skips input tokens until it finds one that fits the current state. + +### Custom handler — `@g.parsing_error` + +```python +@g.parsing_error +def on_parse_error(parser): + tok = parser.current_token + parser.add_error( + tok.line, tok.column, + f'({tok.line}, {tok.column}) - SyntacticError: unexpected "{tok.lex}"' + ) +``` + +The handler receives the `ShiftReduceParser` instance. After it returns, the parser automatically skips tokens until it can continue. + +### Error productions + +An *error production* lets you match known error patterns and keep parsing with a valid (possibly incomplete) AST node. This is the most precise error-recovery mechanism. 
+ +**Setup — register the error terminal:** + +```python +g.add_terminal_error() +``` + +**Usage — write productions that include `error`:** + +```python +@g.production('stmt -> let id = expr error') +def missing_semicolon(s): + # s[5] is the Token that triggered the error + s.add_error(5, f'({s[5].line}, {s[5].column}) - SyntacticError: ' + f"expected ';' instead of '{s[5].lex}'") + return LetStatement(s[2], s[4]) +``` + +`s.add_error(index, message)`: + +- If `index` is an `int`, it refers to the position in the rule list — `s[5]` is the token at position 5. +- If `index` is a `(line, column)` tuple, it is used directly as the location. + +When the parser encounters a token that cannot be shifted, and the current state has a transition on the `error` terminal, it replaces the bad token with an `error` token and continues. The `error` token's semantic value is the original `Token` object, so you still have access to `lex`, `line`, and `column`. + +### Forcing a parsing error from a semantic action + +Sometimes you want to mark an input as invalid from inside a semantic action — for example, to reject an empty expression: + +```python +@g.production('expr -> ') +def empty_expr(s): + s.force_parsing_error() + # return nothing or an error sentinel +``` + +`force_parsing_error()` sets `parser.contains_errors = True` without adding an error message. Add an explicit message via `s.add_error(...)` if needed. 
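+
+The two `index` forms accepted by `s.add_error` can be pictured with a small standalone helper (hypothetical code: the names and structure are illustrative, not PyJapt's source):
+
+```python
+from collections import namedtuple
+
+Token = namedtuple('Token', 'lex line column')
+
+def resolve_location(index, body):
+    """Turn add_error's first argument into a (line, column) pair.
+
+    index is either a 1-based position into the production body,
+    or a (line, column) tuple that is used as-is."""
+    if isinstance(index, tuple):
+        return index
+    token = body[index - 1]          # s[1] is the first body symbol
+    return (token.line, token.column)
+
+# Body of 'stmt -> let id = expr error' for input "let x = 42 }"
+body = [Token('let', 1, 1), Token('x', 1, 5), Token('=', 1, 7),
+        Token('42', 1, 9), Token('}', 1, 11)]
+
+print(resolve_location(5, body))        # (1, 11): location of the 5th symbol
+print(resolve_location((3, 4), body))   # (3, 4): used directly
+```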
+
+### Checking syntactic errors
+
+```python
+result = parser(tokens)
+
+if parser.contains_errors:
+    for message in parser.errors:
+        print(message)
+```
+
+---
+
+## Combining Both Error Handlers
+
+A typical setup collects all errors from both the lexer and the parser and prints them, lexical errors first (each list is sorted by position on its own; the concatenation is not globally sorted):
+
+```python
+lexer = g.get_lexer()
+parser = g.get_parser('lalr1')
+
+tokens = lexer(source_code)
+result = parser(tokens)
+
+all_errors = lexer.errors + parser.errors
+
+if all_errors:
+    for msg in all_errors:
+        print(msg)
+```
+
+---
+
+## Error Message Conventions
+
+PyJapt does not impose a specific error format. A common convention used in compilers is:
+
+```
+(line, column) - ErrorType: description
+```
+
+For example:
+
+```
+(3, 12) - LexicographicError: unexpected character "@"
+(5, 1) - SyntacticError: expected ';' instead of '}'
+```
diff --git a/docs/getting-started.md b/docs/getting-started.md
new file mode 100644
index 0000000..8cd87d1
--- /dev/null
+++ b/docs/getting-started.md
@@ -0,0 +1,182 @@
+# Getting Started
+
+This guide walks you through installing PyJapt and building your first working lexer and parser.
+
+---
+
+## Prerequisites
+
+- Python **3.10** or later
+- `pip` (any recent version)
+
+---
+
+## Installation
+
+```sh
+pip install pyjapt
+```
+
+Verify the installation:
+
+```python
+import pyjapt
+print(pyjapt.__version__)  # e.g. 0.4.1
+```
+
+---
+
+## Your First Grammar — Arithmetic Expressions
+
+We will build a complete interpreter for arithmetic expressions that supports `+`, `-`, `*`, `/`, integer literals, and parentheses.
+
+### Step 1 — Create the Grammar object
+
+```python
+from pyjapt import Grammar
+
+g = Grammar()
+```
+
+`Grammar` is the central object. Everything — terminals, non-terminals, productions, and the resulting lexer and parser — comes from this one instance.
+
+---
+
+### Step 2 — Declare non-terminals
+
+```python
+expr = g.add_non_terminal('expr', start_symbol=True)
+term, fact = g.add_non_terminals('term fact')
+```
+
+`add_non_terminal` creates a single non-terminal and returns a `NonTerminal` object.
+Pass `start_symbol=True` to mark it as the grammar's start symbol (only one is allowed).
+
+`add_non_terminals` accepts a space-separated string and returns a tuple.
+
+---
+
+### Step 3 — Declare terminals
+
+```python
+g.add_terminals('+ - / * ( )')        # literal terminals
+g.add_terminal('int', regex=r'\d+')   # terminal with a custom regex
+```
+
+Terminals declared with `add_terminals` match exactly the text of their own name.
+`add_terminal` lets you provide a custom regular expression.
+
+---
+
+### Step 4 — Handle whitespace
+
+Whitespace is not a meaningful token in this grammar, so we skip it by not returning anything from the rule function.
+
+```python
+@g.terminal('whitespace', r' +')
+def whitespace(lexer):
+    lexer.column += len(lexer.token.lex)
+    lexer.position += len(lexer.token.lex)
+    # no return → token is discarded
+```
+
+---
+
+### Step 5 — Write productions with semantic actions
+
+Productions are attached to non-terminal objects using the `%=` operator.
+Its right-hand side is a tuple: first the production body as a string, then a *semantic action* — a function (or lambda) that receives the `RuleList` for that production and returns the production's semantic value.
+
+```python
+# expr → expr + term | expr - term | term
+expr %= 'expr + term', lambda s: s[1] + s[3]
+expr %= 'expr - term', lambda s: s[1] - s[3]
+expr %= 'term', lambda s: s[1]
+
+# term → term * fact | term / fact | fact
+term %= 'term * fact', lambda s: s[1] * s[3]
+term %= 'term / fact', lambda s: s[1] // s[3]
+term %= 'fact', lambda s: s[1]
+
+# fact → ( expr ) | int
+fact %= '( expr )', lambda s: s[2]
+fact %= 'int', lambda s: int(s[1])
+```
+
+Inside a semantic action `s` is a `RuleList`.
+`s[0]` is the synthesised value of the production's *head* (i.e. what you return).
+`s[1]`, `s[2]`, … are the values of each symbol in the production's *body* (1-indexed). + +--- + +### Step 6 — Generate the lexer and parser + +```python +lexer = g.get_lexer() +parser = g.get_parser('slr') # 'slr', 'lr1', or 'lalr1' +``` + +The lexer is a callable that turns a string into a list of `Token` objects. +The parser is a callable that takes that list and applies the grammar rules, returning the final semantic value. + +--- + +### Step 7 — Parse an expression + +```python +tokens = lexer('(2 + 2) * 2 + 2') +result = parser(tokens) +print(result) # 10 +``` + +Or more concisely: + +```python +print(parser(lexer('(2 + 2) * 2 + 2'))) # 10 +``` + +--- + +## Full Source + +```python +from pyjapt import Grammar + +g = Grammar() +expr = g.add_non_terminal('expr', start_symbol=True) +term, fact = g.add_non_terminals('term fact') +g.add_terminals('+ - / * ( )') +g.add_terminal('int', regex=r'\d+') + +@g.terminal('whitespace', r' +') +def whitespace(lexer): + lexer.column += len(lexer.token.lex) + lexer.position += len(lexer.token.lex) + +expr %= 'expr + term', lambda s: s[1] + s[3] +expr %= 'expr - term', lambda s: s[1] - s[3] +expr %= 'term', lambda s: s[1] + +term %= 'term * fact', lambda s: s[1] * s[3] +term %= 'term / fact', lambda s: s[1] // s[3] +term %= 'fact', lambda s: s[1] + +fact %= '( expr )', lambda s: s[2] +fact %= 'int', lambda s: int(s[1]) + +lexer = g.get_lexer() +parser = g.get_parser('slr') + +print(parser(lexer('(2 + 2) * 2 + 2'))) # 10 +print(parser(lexer('1 + 2 * 5 - 4'))) # 7 +print(parser(lexer('((3 + 4) * 5) - 6 / 2'))) # 32 +``` + +--- + +## Next Steps + +- [Defining a Grammar](defining-grammar.md) — all grammar construction options in detail. +- [Configuring the Lexer](lexer.md) — terminal priority, token rules, and ignored tokens. +- [Building a Parser](parser.md) — SLR vs LR(1) vs LALR(1) and how to pick one. +- [Error Handling](error-handling.md) — how to report lexical and syntactic errors. 
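+
+---
+
+As a final sanity check on the grammar's precedence layering (the `expr`/`term`/`fact` stratification), the same three results can be reproduced with a small hand-written recursive-descent evaluator. This is a standalone sketch, independent of PyJapt:
+
+```python
+import re
+
+def evaluate(src):
+    """Evaluate +, -, *, / (integer division, like the grammar's //),
+    and parentheses with expr/term/fact precedence."""
+    toks = re.findall(r'\d+|[+\-*/()]', src)
+    pos = 0
+
+    def peek():
+        return toks[pos] if pos < len(toks) else None
+
+    def expr():                       # expr → expr (+|-) term | term
+        nonlocal pos
+        value = term()
+        while peek() in ('+', '-'):
+            op = toks[pos]
+            pos += 1
+            value = value + term() if op == '+' else value - term()
+        return value
+
+    def term():                       # term → term (*|/) fact | fact
+        nonlocal pos
+        value = fact()
+        while peek() in ('*', '/'):
+            op = toks[pos]
+            pos += 1
+            value = value * fact() if op == '*' else value // fact()
+        return value
+
+    def fact():                       # fact → ( expr ) | int
+        nonlocal pos
+        tok = toks[pos]
+        pos += 1
+        if tok == '(':
+            value = expr()
+            pos += 1                  # consume ')'
+            return value
+        return int(tok)
+
+    return expr()
+
+print(evaluate('(2 + 2) * 2 + 2'))        # 10
+print(evaluate('1 + 2 * 5 - 4'))          # 7
+print(evaluate('((3 + 4) * 5) - 6 / 2'))  # 32
+```
+
+Placing `*` and `/` one level below `+` and `-` in the call chain is exactly what the grammar's layering does for the table-driven parser.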
diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 0000000..6a16168 --- /dev/null +++ b/docs/index.md @@ -0,0 +1,95 @@ +# PyJapt + +**PyJapt** — *Just Another Parsing Tool Written in Python* — is a lexer and LR parser generator that lets you define a language grammar in pure Python and immediately produce a working tokeniser and parser from it. + +
+
+