From fc5f4f858a9e55f7791b73ce94c08ad72155cae4 Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 26 Feb 2026 09:40:09 +0000 Subject: [PATCH 1/6] docs: add v1 action plan and MkDocs documentation site - ACTION_PLAN.md: comprehensive roadmap covering 6 critical bug fixes, code-quality improvements, testing gaps, packaging fixes, and a prioritised delivery order for the v1.0.0 release - mkdocs.yml: Material-theme configuration with tabbed navigation, syntax highlighting, and GitHub Pages publishing settings - docs/index.md: project overview, feature table, and quick-start example - docs/getting-started.md: step-by-step guide for first-time users - docs/defining-grammar.md: complete grammar construction reference - docs/lexer.md: terminal ordering, keyword handling, and Lexer API - docs/parser.md: SLR/LR1/LALR1 comparison, conflicts, semantic actions - docs/error-handling.md: lexical and syntactic error handling patterns - docs/serialization.md: pre-building and caching parsing tables - docs/api-reference.md: full public API documentation - docs/changelog.md: version history and planned v1 changes https://claude.ai/code/session_01Vouz5MejqT7sFvTEy8MXz1 --- ACTION_PLAN.md | 322 +++++++++++++++++++++++++++++ docs/api-reference.md | 424 +++++++++++++++++++++++++++++++++++++++ docs/changelog.md | 92 +++++++++ docs/defining-grammar.md | 227 +++++++++++++++++++++ docs/error-handling.md | 177 ++++++++++++++++ docs/getting-started.md | 182 +++++++++++++++++ docs/index.md | 95 +++++++++ docs/lexer.md | 177 ++++++++++++++++ docs/parser.md | 144 +++++++++++++ docs/serialization.md | 170 ++++++++++++++++ mkdocs.yml | 78 +++++++ 11 files changed, 2088 insertions(+) create mode 100644 ACTION_PLAN.md create mode 100644 docs/api-reference.md create mode 100644 docs/changelog.md create mode 100644 docs/defining-grammar.md create mode 100644 docs/error-handling.md create mode 100644 docs/getting-started.md create mode 100644 docs/index.md create mode 100644 docs/lexer.md create mode 100644 docs/parser.md create mode 100644 docs/serialization.md create mode 100644 mkdocs.yml diff --git a/ACTION_PLAN.md b/ACTION_PLAN.md new file mode 100644 index 0000000..cd9e0f4 --- /dev/null +++ b/ACTION_PLAN.md @@ -0,0 +1,322 @@ +# PyJapt v1.0 — Action Plan + +> **Current version:** 0.4.1 +> **Target version:** 1.0.0 +> **Status:** Planning phase + +This document is the authoritative roadmap for bringing PyJapt to a stable v1.0 release. Items are grouped by category and ordered by priority within each category. + +--- + +## 1. Critical Bug Fixes + +These bugs affect correctness and must be resolved before v1. + +### 1.1 Lexer state not reset on repeated calls + +**File:** `pyjapt/lexing.py` — `Lexer.__call__` + +`Lexer.__call__` resets `lineno`, `column`, `position`, `text`, and `token`, but it does **not** reset: +- `self._errors` — errors from a previous run accumulate +- `self.contain_errors` — flag is stale after the first run with errors + +**Fix:** Reset `_errors = []` and `contain_errors = False` at the start of `__call__`. + +--- + +### 1.2 `errors` property signature is broken + +**Files:** `pyjapt/lexing.py:102`, `pyjapt/parsing.py:1072` + +Both `Lexer.errors` and `ShiftReduceParser.errors` are decorated with `@property` but carry a `clean: bool = True` parameter: + +```python +@property +def errors(self, clean: bool = True): # ← wrong: properties don't accept arguments +``` + +Python silently ignores the `clean` parameter and always calls it as a property. The `clean` branch (returning tuples with row/column) is therefore unreachable. + +**Fix:** Replace with two separate accessors: +```python +@property +def errors(self) -> List[str]: + return [m for _, _, m in sorted(self._errors)] + +@property +def errors_with_location(self) -> List[Tuple[int, int, str]]: + return sorted(self._errors) +``` + +--- + +### 1.3 `ShiftReduceParser` class-level mutable state + +**File:** `pyjapt/parsing.py:1032-1033` + +```python +class ShiftReduceParser: + contains_errors: bool = False # ← shared across all instances + current_token: Optional[Token] = None # ← shared across all instances +``` + +Class-level attributes are shared across all instances of a class. Two parsers used in the same program would corrupt each other's state. + +**Fix:** Move both to `__init__`. + +--- + +### 1.4 LALR(1) lookahead algorithm mixes strings and Symbol objects + +**File:** `pyjapt/parsing.py` — `determining_lookaheads` and `build_lalr1_automaton` + +The propagation sentinel `"#"` is a plain string that is mixed into lookahead sets normally occupied by `Symbol` objects. This works coincidentally because `"#"` doesn't collide with any symbol name, but it is fragile and will silently break if a grammar ever names a symbol `#`. + +**Fix:** Use a dedicated `PropagationTerminal` singleton (already defined in the file as `PropagationTerminal`) instead of the magic string, or use a `None` sentinel that is excluded from lookahead propagation explicitly. + +--- + +### 1.5 `Grammar.augmented_grammar` semantic rule is wrong + +**File:** `pyjapt/parsing.py:462` + +```python +new_start_symbol %= start_symbol + grammar.EPSILON, lambda x: x +``` + +The lambda receives a `RuleList`, not the symbol value directly. The production `S' -> start_symbol` (with epsilon swallowed) should return `s[1]`, not the `RuleList` itself: + +```python +new_start_symbol %= start_symbol + grammar.EPSILON, lambda s: s[1] +``` + +--- + +### 1.6 `Grammar.__getitem__` returns `None` for missing keys + +**File:** `pyjapt/parsing.py:592-599` + +When a production string references a symbol that doesn't exist, `Grammar.__getitem__` silently returns `None`. This produces a confusing `AttributeError` deep in the call chain rather than a clear `GrammarError`. + +**Fix:** Raise `GrammarError` with the unknown symbol name. + +--- + +## 2. Code Quality & Modernisation + +### 2.1 Move `flake8` to dev dependencies + +**File:** `pyproject.toml` + +`flake8` is listed under `[tool.poetry.dependencies]` (runtime). It is a linter and must move to `[tool.poetry.dev-dependencies]`. + +--- + +### 2.2 Update deprecated Poetry build backend + +**File:** `pyproject.toml` + +```toml +# current (deprecated) +build-backend = "poetry.masonry.api" + +# correct +build-backend = "poetry.core.masonry.api" +``` + +--- + +### 2.3 Add `mkdocs` and `mkdocs-material` as dev dependencies + +**File:** `pyproject.toml` + +Documentation builds are part of the development workflow. + +--- + +### 2.4 Rename `pyjapt/typing.py` + +The module name `typing` shadows Python's standard-library `typing` module inside the package. Rename it to `pyjapt/types.py` or `pyjapt/_types.py` and update the import in `tests/test_arithmetic_grammar.py`. + +--- + +### 2.5 Export `RuleList` and parsers from the top-level `__init__.py` + +**File:** `pyjapt/__init__.py` + +`RuleList` and individual parser classes (`SLRParser`, `LR1Parser`, `LALR1Parser`) are not exported from the package root. Users must import from internal submodules. Add them to `__init__.py`: + +```python +from pyjapt.parsing import ( + ShiftReduceParser, SLRParser, LR1Parser, LALR1Parser, Grammar, RuleList +) +``` + +--- + +### 2.6 Add type annotations to public API + +Currently many method signatures lack return type annotations. Add full annotations to: +- `Grammar.get_lexer`, `Grammar.get_parser`, `Grammar.serialize_*` +- `Lexer.__call__`, `Lexer.tokenize` +- `ShiftReduceParser.__call__` +- All `add_*` methods on `Grammar` + +--- + +### 2.7 Replace bare `assert` with proper exceptions + +Bare `assert` statements are silently disabled when Python runs with the `-O` (optimise) flag: + +- `pyjapt/parsing.py:835` — `assert len(grammar.start_symbol.productions) == 1` +- `pyjapt/parsing.py:906` — `assert not lookaheads.contains_epsilon` +- `pyjapt/lexing.py` — several in `Grammar.add_terminal` + +Replace with `if not ...: raise GrammarError(...)`. + +--- + +### 2.8 Serialised parser resets `augmented_grammar` fields + +**File:** `pyjapt/serialization.py` + +The serialised parser template does not call `_build_automaton` or compute `firsts`/`follows`, which is correct. But it also doesn't set `augmented_grammar`, `firsts`, or `follows`, which means the serialised parser cannot be safely extended. Document this limitation and add a guard. + +--- + +### 2.9 CI: test against Python 3.10, 3.11, and 3.12 + +**File:** `.github/workflows/python-test-app.yml` + +Add a matrix strategy to test against all supported Python versions. + +--- + +## 3. Testing Improvements + +### 3.1 Add tests for LR(1) and LALR(1) parsers + +`tests/test_arithmetic_grammar.py` only tests the SLR parser. Add `test_lr1` and `test_lalr1` parameterised over the same set of inputs. + +### 3.2 Add tests for lexer error handling + +- Unknown character → default error handler → `errors` list populated +- Custom `lexical_error` decorator +- `contain_errors` flag is `True` after a failed tokenisation +- Errors reset correctly across multiple calls + +### 3.3 Add tests for parser error handling + +- Syntactic error → `errors` list populated +- `contains_errors` flag is `True` +- Error recovery (`error` terminal / panic mode) +- Custom `parsing_error` decorator + +### 3.4 Add tests for serialisation + +- Round-trip: build grammar → serialise lexer and parser → import → parse identical inputs → same result + +### 3.5 Add edge-case tests + +- Empty grammar (no productions) raises `GrammarError` +- Duplicate terminal/non-terminal name raises immediately +- Production referencing undefined symbol raises `GrammarError` +- Epsilon productions +- Grammars with conflicts produce correct conflict counts + +### 3.6 Measure and enforce coverage + +Add `pytest-cov` and set a minimum coverage threshold (target ≥ 85 %) in CI. + +--- + +## 4. Documentation + +The documentation website is built with **MkDocs + Material theme** and lives under `docs/`. See `mkdocs.yml` for the full configuration. + +| Page | Status | +|------|--------| +| `docs/index.md` | Done | +| `docs/getting-started.md` | Done | +| `docs/defining-grammar.md` | Done | +| `docs/lexer.md` | Done | +| `docs/parser.md` | Done | +| `docs/error-handling.md` | Done | +| `docs/serialization.md` | Done | +| `docs/api-reference.md` | Done | +| `docs/changelog.md` | Done | + +### 4.1 Add `CHANGELOG.md` + +Track every version with date and changes, following [Keep a Changelog](https://keepachangelog.com) format. + +### 4.2 Add `CONTRIBUTING.md` + +Describe: +- How to clone and set up the dev environment +- How to run tests and linting +- Branching and PR conventions +- Code of conduct pointer + +--- + +## 5. Packaging & Release + +### 5.1 Update version to `1.0.0` + +**Files:** `pyjapt/__init__.py` and `pyproject.toml` + +### 5.2 Populate package metadata + +**File:** `pyproject.toml` + +Add: +```toml +license = "MIT" +keywords = ["lexer", "parser", "LR", "LALR", "compiler", "grammar"] +classifiers = [ + "Programming Language :: Python :: 3", + "License :: OSI Approved :: MIT License", + "Topic :: Software Development :: Compilers", +] +repository = "https://github.com/alejandroklever/PyJapt" +documentation = "https://alejandroklever.github.io/PyJapt" +``` + +### 5.3 Add a GitHub Actions workflow to build and deploy docs + +Publish the MkDocs site to GitHub Pages on every push to `main`. + +### 5.4 Tag and publish to PyPI + +After all items above are resolved: +1. Bump version to `1.0.0` in `__init__.py` (the `build.py` script syncs `pyproject.toml` automatically). +2. Push a `v1.0.0` git tag. +3. Create a GitHub Release — the existing publish workflow triggers on `release: published`. + +--- + +## 6. Future Work (Post-v1) + +The following are explicitly out of scope for v1 but should be tracked: + +| Feature | Notes | +|---------|-------| +| Operator precedence declarations | Resolves SR conflicts declaratively (like `%left`, `%right` in Yacc) | +| LL(1) parser support | Mentioned in README as future work | +| Grammar visualisation | Export automata as DOT / SVG | +| Incremental parsing | Re-lex only changed regions | +| Better conflict reporting | Show the conflicting items and lookaheads in a human-readable table | +| Unicode identifiers in grammars | Non-ASCII symbol names | +| Async tokenisation | Yield tokens lazily for very large inputs | + +--- + +## Priority Order Summary + +| Priority | Item | +|----------|------| +| P0 — Must fix before v1 | 1.1, 1.2, 1.3, 1.4, 1.5, 1.6 | +| P1 — Fix before v1 | 2.1, 2.2, 2.4, 2.5, 3.1, 3.2, 3.3 | +| P2 — Nice-to-have before v1 | 2.3, 2.6, 2.7, 2.8, 2.9, 3.4, 3.5, 3.6, 5.1 – 5.4 | +| P3 — Post-v1 | Section 6 | diff --git a/docs/api-reference.md b/docs/api-reference.md new file mode 100644 index 0000000..4fa9a3c --- /dev/null +++ b/docs/api-reference.md @@ -0,0 +1,424 @@ +# API Reference + +This page lists every public class and method exported by PyJapt. + +--- + +## Top-Level Exports + +```python +from pyjapt import ( + Grammar, + Lexer, + Token, + ShiftReduceParser, + SLRParser, + LR1Parser, + LALR1Parser, +) +``` + +--- + +## `Grammar` + +The central object for defining a language. + +```python +from pyjapt import Grammar +g = Grammar() +``` + +### Terminals + +--- + +#### `Grammar.add_terminal(name, regex=None, rule=None) -> Terminal` + +Create and register a terminal symbol. + +| Parameter | Type | Description | +|-----------|------|-------------| +| `name` | `str` | Unique terminal name. Must be a valid string. | +| `regex` | `str \| None` | Regular expression. If `None`, the regex is `re.escape(name)` (literal match). | +| `rule` | `Callable[[Lexer], Optional[Token]] \| None` | Rule function invoked when this token is matched. | + +Returns the new `Terminal` object. + +Raises `AssertionError` if `name` is already defined. + +--- + +#### `Grammar.add_terminals(names) -> Tuple[Terminal, ...]` + +Convenience wrapper. Splits `names` on whitespace and calls `add_terminal` for each. + +```python +plus, minus, star = g.add_terminals('+ - *') +``` + +--- + +#### `Grammar.terminal(name, regex) -> Callable` + +Decorator factory. Creates the terminal **and** registers the decorated function as its rule. + +```python +@g.terminal('int', r'\d+') +def int_rule(lexer): + lexer.position += len(lexer.token.lex) + lexer.column += len(lexer.token.lex) + return lexer.token +``` + +--- + +#### `Grammar.add_terminal_error()` + +Registers the built-in `error` terminal for use in error-recovery productions. Call this before writing any production that contains `error`. + +--- + +### Non-Terminals + +--- + +#### `Grammar.add_non_terminal(name, start_symbol=False) -> NonTerminal` + +Create and register a non-terminal symbol. + +| Parameter | Type | Description | +|-----------|------|-------------| +| `name` | `str` | Unique non-terminal name. | +| `start_symbol` | `bool` | Mark as the start symbol. Only one allowed per grammar. | + +Raises `Exception` if a second `start_symbol=True` is provided. + +--- + +#### `Grammar.add_non_terminals(names) -> Tuple[NonTerminal, ...]` + +Splits `names` on whitespace and calls `add_non_terminal` for each. + +```python +stmt, expr, term = g.add_non_terminals('stmt expr term') +``` + +--- + +### Productions + +--- + +#### `Grammar.production(*production_strings) -> Callable` + +Decorator factory that registers the decorated function as the semantic action for one or more productions. + +The production string format is `'head -> body'` where `body` is a space-separated list of symbol names. + +```python +@g.production('expr -> expr + term', 'expr -> expr - term') +def additive(s): + return s[1] + s[3] if s[2] == '+' else s[1] - s[3] +``` + +--- + +#### `NonTerminal.__imod__(other) -> NonTerminal` + +Operator `%=` overload for adding productions to a non-terminal. + +```python +# Unattributed +expr %= 'expr + term' + +# With semantic action +expr %= 'expr + term', lambda s: s[1] + s[3] + +# Epsilon +expr %= '' +``` + +`other` can be: +- A `str` (space-separated symbol names) +- A `Symbol` or `Sentence` (built from Symbol objects with `+`) +- A `tuple` of `(str | Sentence, callable)` for attributed productions +- A `SentenceList` (built with `|`) for multiple alternatives + +--- + +### Error Handlers + +--- + +#### `Grammar.lexical_error(handler) -> handler` + +Decorator. Registers a custom lexical error handler. + +```python +@g.lexical_error +def lex_error(lexer): + lexer.add_error(lexer.lineno, lexer.column, + f'unexpected "{lexer.token.lex}"') + lexer.position += 1 + lexer.column += 1 +``` + +--- + +#### `Grammar.parsing_error(handler) -> handler` + +Decorator. Registers a custom syntactic error handler. + +```python +@g.parsing_error +def parse_error(parser): + tok = parser.current_token + parser.add_error(tok.line, tok.column, f'unexpected "{tok.lex}"') +``` + +--- + +### Generating the Lexer and Parser + +--- + +#### `Grammar.get_lexer() -> Lexer` + +Build and return a `Lexer` for this grammar. + +--- + +#### `Grammar.get_parser(name, verbose=False) -> ShiftReduceParser` + +Build and return a parser. + +| `name` | Parser type | +|--------|-------------| +| `'slr'` | Simple LR | +| `'lalr1'` | LALR(1) | +| `'lr1'` | Canonical LR(1) | + +Raises `ValueError` for unknown names. + +--- + +### Serialisation + +--- + +#### `Grammar.serialize_lexer(class_name, grammar_module_name, grammar_variable_name='G')` + +Generate `lexertab.py` in the current working directory. + +--- + +#### `Grammar.serialize_parser(parser_type, class_name, grammar_module_name, grammar_variable_name='G')` + +Generate `parsertab.py` in the current working directory. + +--- + +### Utility + +--- + +#### `Grammar.to_json() -> str` + +Serialise the grammar structure (terminals, non-terminals, productions) to a JSON string. Semantic actions and regexes are **not** included. + +--- + +#### `Grammar.from_json(data) -> Grammar` + +Class method. Reconstruct a grammar from the JSON string produced by `to_json()`. + +--- + +#### `Grammar.__getitem__(item) -> Symbol | Production | None` + +Look up a symbol or production by name/repr-string. + +```python +plus_symbol = g['+'] +production = g['expr -> expr + term'] +``` + +--- + +## `Token` + +```python +class Token: + lex: str # lexeme string + token_type: Any # terminal name (str) or Terminal object + line: int # 1-based line number + column: int # 1-based column number +``` + +### Class methods + +#### `Token.empty() -> Token` + +Return an empty sentinel token `Token('', '', 0, 0)`. + +### Properties + +#### `Token.is_valid -> bool` + +Always `True` for a regular token. (Subclasses may override for error tokens.) + +--- + +## `Lexer` + +```python +class Lexer: + lineno: int # current line (1-based) + column: int # current column (1-based) + position: int # byte offset in input + text: str # full input string + token: Token # token being processed + contain_errors: bool # True after first error +``` + +### `Lexer.__call__(text) -> List[Token]` + +Tokenise `text`. Resets all internal state before each call. Appends an EOF token at the end. + +### `Lexer.tokenize(text) -> Generator[Token, None, None]` + +Low-level generator. Does **not** reset state. Prefer `__call__` for normal use. + +### `Lexer.errors -> List[str]` + +Sorted list of error message strings accumulated during the last call. + +### `Lexer.add_error(line, col, message)` + +Append an error entry. Intended for use inside custom terminal rules and error handlers. + +--- + +## `ShiftReduceParser` + +Base class for all three parser variants. Do not instantiate directly; use `Grammar.get_parser`. + +```python +class ShiftReduceParser: + SHIFT = 'SHIFT' + REDUCE = 'REDUCE' + OK = 'OK' +``` + +### `ShiftReduceParser.__call__(tokens) -> Any` + +Parse a list of `Token` objects and return the semantic value of the start symbol, or `None` if parsing failed. + +### `ShiftReduceParser.errors -> List[str]` + +Sorted list of syntactic error messages. + +### `ShiftReduceParser.add_error(line, column, message)` + +Append an error entry from inside a semantic action or error handler. + +### `ShiftReduceParser.contains_errors -> bool` + +`True` if any parsing error has been detected. + +### `ShiftReduceParser.current_token -> Token` + +The token being processed at the time the most recent error occurred. + +### `ShiftReduceParser.conflicts -> List[Tuple]` + +List of detected conflicts, each a `('SR' | 'RR', prod_a, prod_b)` tuple. + +### `ShiftReduceParser.shift_reduce_count -> int` + +Number of shift-reduce conflicts. + +### `ShiftReduceParser.reduce_reduce_count -> int` + +Number of reduce-reduce conflicts. + +--- + +## `SLRParser` + +```python +class SLRParser(ShiftReduceParser): ... +``` + +Uses the LR(0) automaton and Follow sets for lookaheads. + +--- + +## `LR1Parser` + +```python +class LR1Parser(ShiftReduceParser): ... +``` + +Uses the canonical LR(1) automaton with per-item lookaheads. + +--- + +## `LALR1Parser` + +```python +class LALR1Parser(LR1Parser): ... +``` + +Uses the merged LALR(1) automaton. Same states as SLR, same power as LR(1) for most grammars. + +--- + +## `RuleList` + +Passed to every semantic action as `s`. 1-indexed over the production body. + +### `RuleList.__getitem__(index) -> Any` + +`s[0]` — head value (output). +`s[1]` … `s[n]` — body symbol values. + +### `RuleList.add_error(index, message)` + +Report an error at the position of `s[index]` (int) or at an explicit `(line, column)` tuple. + +### `RuleList.force_parsing_error()` + +Mark the parse as failed without adding an error message. + +--- + +## `NonTerminal` + +Represents a grammar non-terminal. + +| Attribute | Type | Description | +|-----------|------|-------------| +| `name` | `str` | Symbol name | +| `productions` | `List[Production]` | Productions where this symbol is the head | + +--- + +## `Terminal` + +Represents a grammar terminal. + +| Attribute | Type | Description | +|-----------|------|-------------| +| `name` | `str` | Symbol name | + +--- + +## `Production` + +| Attribute | Type | Description | +|-----------|------|-------------| +| `left` | `NonTerminal` | Production head | +| `right` | `Sentence` | Production body | +| `rule` | `Callable \| None` | Semantic action | diff --git a/docs/changelog.md b/docs/changelog.md new file mode 100644 index 0000000..72c0293 --- /dev/null +++ b/docs/changelog.md @@ -0,0 +1,92 @@ +# Changelog + +All notable changes to PyJapt are documented here. +This project follows [Semantic Versioning](https://semver.org) and the +[Keep a Changelog](https://keepachangelog.com/en/1.0.0/) format. + +--- + +## [Unreleased] — v1.0.0 + +### Planned — Bug Fixes +- Reset `_errors` and `contain_errors` in `Lexer.__call__` so repeated calls don't accumulate stale errors. +- Fix `errors` property signature on `Lexer` and `ShiftReduceParser` (properties cannot accept arguments). +- Move `contains_errors` and `current_token` from class-level to instance-level in `ShiftReduceParser`. +- Fix `Grammar.augmented_grammar` semantic action (`lambda s: s[1]` instead of `lambda x: x`). +- Fix `s.Name` → `s.name` in `Grammar.to_json()` (case mismatch causes `AttributeError`). +- Raise `GrammarError` in `Grammar.__getitem__` instead of returning `None` for missing symbols. +- Replace bare `assert` statements with proper `GrammarError` exceptions. + +### Planned — Improvements +- Move `flake8` from runtime to dev dependencies. +- Update build backend to `poetry.core.masonry.api` (replaces deprecated `poetry.masonry.api`). +- Export `RuleList`, `SLRParser`, `LR1Parser`, `LALR1Parser` from `pyjapt.__init__`. +- Rename `pyjapt/typing.py` to `pyjapt/types.py` to avoid shadowing stdlib `typing`. +- Add full type annotations to the public API. +- Expand CI matrix to Python 3.10, 3.11, and 3.12. + +### Planned — Testing +- Add tests for LR(1) and LALR(1) parsers. +- Add tests for lexer and parser error handling. +- Add tests for serialisation round-trips. +- Add edge-case tests (empty grammar, duplicate symbols, epsilon productions). +- Enforce minimum test coverage threshold. + +### Planned — Documentation +- Full MkDocs site with Material theme. +- Getting-started guide and user-guide sections. +- Complete API reference. +- Changelog (this file). + +--- + +## [0.4.1] — 2024-03-25 + +### Fixed +- Updated README with corrected examples and improved prose. + +--- + +## [0.4.0] — 2023-02-17 + +### Added +- GitHub Actions workflow for publishing to PyPI on release. +- `requirements.txt` for legacy `pip install` support. + +--- + +## [0.3.0] — 2021-03-?? + +### Added +- Default error report in the shift-reduce parser (panic-mode recovery). +- Improved `RuleList` error API. + +### Fixed +- Reset lexer parameters when analysing a new string (`Lexer.__call__`). + +--- + +## [0.2.9] — 2021-??-?? + +### Fixed +- Minor fix in parsing default error detection. + +--- + +## [0.2.x] — 2020 + +### Added +- SLR, LR(1), and LALR(1) parsers. +- Serialisation of lexer and parser to Python source files. +- `@g.terminal` decorator for inline rule definition. +- `@g.production` decorator for inline production rules. +- `@g.lexical_error` and `@g.parsing_error` decorators. +- `add_terminal_error()` and error terminal support in productions. +- `Grammar.to_json()` / `Grammar.from_json()`. +- JSON grammar import/export. + +--- + +## [0.1.x] — 2020 + +- Initial release with basic lexer and SLR parser. diff --git a/docs/defining-grammar.md b/docs/defining-grammar.md new file mode 100644 index 0000000..68d6459 --- /dev/null +++ b/docs/defining-grammar.md @@ -0,0 +1,227 @@ +# Defining a Grammar + +A `Grammar` object is the single source of truth for your language. This page covers all the ways to build one. + +--- + +## Creating the Grammar + +```python +from pyjapt import Grammar + +g = Grammar() +``` + +--- + +## Non-Terminals + +Non-terminals are the syntactic categories of your language (e.g. `expr`, `statement`, `program`). + +### `add_non_terminal(name, start_symbol=False)` + +```python +program = g.add_non_terminal('program', start_symbol=True) +stmt = g.add_non_terminal('stmt') +expr = g.add_non_terminal('expr') +``` + +- `name` — must be a unique, non-empty string. +- `start_symbol=True` — marks this as the grammar's start symbol. Only one non-terminal can carry this flag. + +Returns a `NonTerminal` object that you use to write productions. + +### `add_non_terminals(names)` + +Convenience method: accepts a space-separated string and returns a tuple of `NonTerminal` objects in the same order. + +```python +stmt, expr, term, fact = g.add_non_terminals('stmt expr term fact') +``` + +--- + +## Terminals + +Terminals are the atomic tokens produced by the lexer. + +### `add_terminal(name, regex=None, rule=None)` + +```python +# Literal terminal — the regex is the escaped name +plus = g.add_terminal('+') +minus = g.add_terminal('-') + +# Terminal with a custom regex +num = g.add_terminal('int', regex=r'\d+') + +# Terminal with a custom regex AND a lexer rule +num = g.add_terminal('int', regex=r'\d+', rule=lambda lexer: ...) +``` + +- When `regex` is `None`, the regular expression used is `re.escape(name)`, so `+` matches the literal character `+`. +- `rule` is a function `(Lexer) -> Optional[Token]`. If it returns `None`, the token is discarded. + +### `add_terminals(names)` + +Accepts a space-separated string and returns a tuple. All created terminals use their name as the literal regex. + +```python +plus, minus, star, div, lpar, rpar = g.add_terminals('+ - * / ( )') +``` + +### `@g.terminal(name, regex)` + +A decorator that creates the terminal **and** registers the rule in one step. + +```python +@g.terminal('int', r'\d+') +def int_terminal(lexer): + lexer.column += len(lexer.token.lex) + lexer.position += len(lexer.token.lex) + lexer.token.lex = int(lexer.token.lex) + return lexer.token +``` + +The decorated function receives the `Lexer` instance and must either return the `Token` (possibly modified) or return `None`/nothing to discard it. + +--- + +## Productions + +Productions define how non-terminals are composed from sequences of terminals and non-terminals. + +### Using `%=` with a string (recommended) + +```python +expr %= 'expr + term' # unattributed +expr %= 'expr + term', lambda s: s[1] + s[3] # with semantic action +``` + +The string on the right-hand side is a space-separated list of symbol names. Each name must already be declared in the grammar. + +Inside the semantic action, `s` is a `RuleList`: + +| Index | Meaning | +|-------|---------| +| `s[0]` | The head non-terminal's value (set by returning from the action) | +| `s[1]` | Value of the 1st body symbol | +| `s[2]` | Value of the 2nd body symbol | +| `s[n]` | Value of the nth body symbol | + +For a terminal, the value is the token's lexeme (`str`). +For a non-terminal, the value is whatever its production's semantic action returned. + +### Using `%=` with `Symbol` objects + +```python +expr %= expr + plus + term +expr %= expr + plus + term, lambda s: s[1] + s[3] +``` + +`Symbol` objects support `+` to build `Sentence` objects, so you can construct productions with the original variable references. + +### Epsilon productions + +```python +expr %= '' # empty string → epsilon production +expr %= g.EPSILON # same thing using the EPSILON symbol directly +``` + +### `@g.production(*production_strings)` + +A decorator alternative to `%=`. It binds the decorated function to one or more production strings. + +```python +@g.production('expr -> expr + term') +def expr_add(s): + return s[1] + s[3] +``` + +The string format is `'head -> body'` where `->` separates the head non-terminal from the body symbols. + +You can attach the same function to multiple productions: + +```python +@g.production( + 'expr -> expr + expr', + 'expr -> expr - expr', + 'expr -> expr * expr', + 'expr -> expr / expr', +) +def binary_op(s): + if s[2] == '+': return s[1] + s[3] + if s[2] == '-': return s[1] - s[3] + if s[2] == '*': return s[1] * s[3] + if s[2] == '/': return s[1] // s[3] +``` + +--- + +## Special Terminals + +### `g.EOF` + +The end-of-file terminal (`$`). It is added automatically; you should not declare it yourself. + +### `g.EPSILON` + +Represents the empty word. Use it to write nullable productions. + +### `g.ERROR` + +A special terminal used for error recovery productions. You must register it explicitly before use: + +```python +g.add_terminal_error() +``` + +See [Error Handling](error-handling.md) for full details. + +--- + +## Inspecting the Grammar + +```python +# All non-terminals +print(g.non_terminals) + +# All terminals +print(g.terminals) + +# All productions +print(g.productions) + +# Look up any symbol by name +sym = g['expr'] + +# Look up a production by repr-string +prod = g['expr -> expr + term'] +``` + +### `Grammar.__str__` + +```python +print(g) +# Non-Terminals: +# expr, term, fact +# Terminals: +# +, -, *, /, (, ), int, whitespace +# Productions: +# [expr -> expr + term, ...] +``` + +--- + +## JSON Import / Export + +PyJapt supports a basic JSON representation of the grammar (without semantic actions): + +```python +json_str = g.to_json() + +g2 = Grammar.from_json(json_str) +``` + +!!! note + JSON serialisation does not preserve terminal regexes, terminal rules, or semantic actions. It is useful for inspecting grammar structure, not for production use. Use [Python file serialisation](serialization.md) for production scenarios. diff --git a/docs/error-handling.md b/docs/error-handling.md new file mode 100644 index 0000000..43262e9 --- /dev/null +++ b/docs/error-handling.md @@ -0,0 +1,177 @@ +# Error Handling + +Good error handling is one of PyJapt's core design goals. This page describes how to report and recover from both lexical and syntactic errors. + +--- + +## Lexical Error Handling + +### Default behaviour + +When the lexer encounters a character that matches no terminal pattern, it calls the *lexical error handler*. By default, this adds an error message to the internal errors list and advances past the bad character. + +### Custom handler — `@g.lexical_error` + +Decorate a function with `@g.lexical_error` to replace the default handler: + +```python +@g.lexical_error +def on_lex_error(lexer): + line, col = lexer.lineno, lexer.column + bad_char = lexer.token.lex + + lexer.add_error(line, col, + f'({line}, {col}) - LexicographicError: unexpected character "{bad_char}"') + + # Always advance to avoid an infinite loop + lexer.position += 1 + lexer.column += 1 +``` + +!!! warning "Always advance `lexer.position`" + If your handler does not advance `lexer.position`, the lexer will match the same bad character indefinitely. + +### Reporting errors from a terminal rule + +You can also detect and report errors from inside a terminal rule: + +```python +@g.terminal('comment_error', r'/\*(.|\n)*$') +def eof_in_comment(lexer): + """Match a /* comment that reaches EOF without a closing */""" + lexer.contain_errors = True + lex = lexer.token.lex + for ch in lex: + if ch == '\n': + lexer.lineno += 1 + lexer.column = 1 + else: + lexer.column += 1 + lexer.position += len(lex) + lexer.add_error( + lexer.lineno, lexer.column, + f'({lexer.lineno}, {lexer.column}) - LexicographicError: EOF in comment' + ) +``` + +### Checking lexical errors + +```python +tokens = lexer(source_code) + +if lexer.contain_errors: + for message in lexer.errors: + print(message) +``` + +`lexer.errors` returns a list of error message strings, sorted by position. + +--- + +## Syntactic Error Handling + +### Default behaviour + +When the parser cannot find an action for the current `(state, token)` pair it enters *panic-mode recovery*: it calls the error handler and then skips input tokens until it finds one that fits the current state. + +### Custom handler — `@g.parsing_error` + +```python +@g.parsing_error +def on_parse_error(parser): + tok = parser.current_token + parser.add_error( + tok.line, tok.column, + f'({tok.line}, {tok.column}) - SyntacticError: unexpected "{tok.lex}"' + ) +``` + +The handler receives the `ShiftReduceParser` instance. After it returns, the parser automatically skips tokens until it can continue. + +### Error productions + +An *error production* lets you match known error patterns and keep parsing with a valid (possibly incomplete) AST node. This is the most precise error-recovery mechanism. + +**Setup — register the error terminal:** + +```python +g.add_terminal_error() +``` + +**Usage — write productions that include `error`:** + +```python +@g.production('stmt -> let id = expr error') +def missing_semicolon(s): + # s[5] is the Token that triggered the error + s.add_error(5, f'({s[5].line}, {s[5].column}) - SyntacticError: ' + f"expected ';' instead of '{s[5].lex}'") + return LetStatement(s[2], s[4]) +``` + +`s.add_error(index, message)`: + +- If `index` is an `int`, it refers to the position in the rule list — `s[5]` is the token at position 5. +- If `index` is a `(line, column)` tuple, it is used directly as the location. + +When the parser encounters a token that cannot be shifted, and the current state has a transition on the `error` terminal, it replaces the bad token with an `error` token and continues. The `error` token's semantic value is the original `Token` object, so you still have access to `lex`, `line`, and `column`. + +### Forcing a parsing error from a semantic action + +Sometimes you want to mark an input as invalid from inside a semantic action — for example, to reject an empty expression: + +```python +@g.production('expr -> ') +def empty_expr(s): + s.force_parsing_error() + # return nothing or an error sentinel +``` + +`force_parsing_error()` sets `parser.contains_errors = True` without adding an error message. Add an explicit message via `s.add_error(...)` if needed. + +### Checking syntactic errors + +```python +result = parser(tokens) + +if parser.contains_errors: + for message in parser.errors: + print(message) +``` + +--- + +## Combining Both Error Handlers + +A typical setup collects all errors from both the lexer and the parser and prints them sorted by line: + +```python +lexer = g.get_lexer() +parser = g.get_parser('lalr1') + +tokens = lexer(source_code) +result = parser(tokens) + +all_errors = lexer.errors + parser.errors + +if all_errors: + for msg in all_errors: + print(msg) +``` + +--- + +## Error Message Conventions + +PyJapt does not impose a specific error format. A common convention used in compilers is: + +``` +(line, column) - ErrorType: description +``` + +For example: + +``` +(3, 12) - LexicographicError: unexpected character "@" +(5, 1) - SyntacticError: expected ';' instead of '}' +``` diff --git a/docs/getting-started.md b/docs/getting-started.md new file mode 100644 index 0000000..8cd87d1 --- /dev/null +++ b/docs/getting-started.md @@ -0,0 +1,182 @@ +# Getting Started + +This guide walks you through installing PyJapt and building your first working lexer and parser. + +--- + +## Prerequisites + +- Python **3.10** or later +- `pip` (any recent version) + +--- + +## Installation + +```sh +pip install pyjapt +``` + +Verify the installation: + +```python +import pyjapt +print(pyjapt.__version__) # e.g. 0.4.1 +``` + +--- + +## Your First Grammar — Arithmetic Expressions + +We will build a complete interpreter for arithmetic expressions that supports `+`, `-`, `*`, `/`, integer literals, and parentheses. + +### Step 1 — Create the Grammar object + +```python +from pyjapt import Grammar + +g = Grammar() +``` + +`Grammar` is the central object. Everything — terminals, non-terminals, productions, and the resulting lexer and parser — comes from this one instance. + +--- + +### Step 2 — Declare non-terminals + +```python +expr = g.add_non_terminal('expr', start_symbol=True) +term, fact = g.add_non_terminals('term fact') +``` + +`add_non_terminal` creates a single non-terminal and returns a `NonTerminal` object. +Pass `start_symbol=True` to mark it as the grammar's start symbol (only one is allowed). + +`add_non_terminals` accepts a space-separated string and returns a tuple. + +--- + +### Step 3 — Declare terminals + +```python +g.add_terminals('+ - / * ( )') # literal terminals +g.add_terminal('int', regex=r'\d+') # terminal with a custom regex +``` + +Terminals declared with `add_terminals` use their name as the regex literally. +`add_terminal` lets you provide a custom regular expression. + +--- + +### Step 4 — Handle whitespace + +Whitespace is not a meaningful token in this grammar, so we skip it by not returning anything from the rule function. + +```python +@g.terminal('whitespace', r' +') +def whitespace(lexer): + lexer.column += len(lexer.token.lex) + lexer.position += len(lexer.token.lex) + # no return → token is discarded +``` + +--- + +### Step 5 — Write productions with semantic actions + +Productions are attached to non-terminal objects using the `%=` operator. +The second element of the tuple is a *semantic action* — a function (or lambda) that receives the `RuleList` for that production and returns the production's semantic value. + +```python +# expr → expr + term | expr - term | term +expr %= 'expr + term', lambda s: s[1] + s[3] +expr %= 'expr - term', lambda s: s[1] - s[3] +expr %= 'term', lambda s: s[1] + +# term → term * fact | term / fact | fact +term %= 'term * fact', lambda s: s[1] * s[3] +term %= 'term / fact', lambda s: s[1] // s[3] +term %= 'fact', lambda s: s[1] + +# fact → ( expr ) | int +fact %= '( expr )', lambda s: s[2] +fact %= 'int', lambda s: int(s[1]) +``` + +Inside a semantic action `s` is a `RuleList`. +`s[0]` is the synthesised value of the production's *head* (i.e. what you return). +`s[1]`, `s[2]`, … are the values of each symbol in the production's *body* (1-indexed). + +--- + +### Step 6 — Generate the lexer and parser + +```python +lexer = g.get_lexer() +parser = g.get_parser('slr') # 'slr', 'lr1', or 'lalr1' +``` + +The lexer is a callable that turns a string into a list of `Token` objects. +The parser is a callable that takes that list and applies the grammar rules, returning the final semantic value. + +--- + +### Step 7 — Parse an expression + +```python +tokens = lexer('(2 + 2) * 2 + 2') +result = parser(tokens) +print(result) # 10 +``` + +Or more concisely: + +```python +print(parser(lexer('(2 + 2) * 2 + 2'))) # 10 +``` + +--- + +## Full Source + +```python +from pyjapt import Grammar + +g = Grammar() +expr = g.add_non_terminal('expr', start_symbol=True) +term, fact = g.add_non_terminals('term fact') +g.add_terminals('+ - / * ( )') +g.add_terminal('int', regex=r'\d+') + +@g.terminal('whitespace', r' +') +def whitespace(lexer): + lexer.column += len(lexer.token.lex) + lexer.position += len(lexer.token.lex) + +expr %= 'expr + term', lambda s: s[1] + s[3] +expr %= 'expr - term', lambda s: s[1] - s[3] +expr %= 'term', lambda s: s[1] + +term %= 'term * fact', lambda s: s[1] * s[3] +term %= 'term / fact', lambda s: s[1] // s[3] +term %= 'fact', lambda s: s[1] + +fact %= '( expr )', lambda s: s[2] +fact %= 'int', lambda s: int(s[1]) + +lexer = g.get_lexer() +parser = g.get_parser('slr') + +print(parser(lexer('(2 + 2) * 2 + 2'))) # 10 +print(parser(lexer('1 + 2 * 5 - 4'))) # 7 +print(parser(lexer('((3 + 4) * 5) - 6 / 2'))) # 32 +``` + +--- + +## Next Steps + +- [Defining a Grammar](defining-grammar.md) — all grammar construction options in detail. +- [Configuring the Lexer](lexer.md) — terminal priority, token rules, and ignored tokens. +- [Building a Parser](parser.md) — SLR vs LR(1) vs LALR(1) and how to pick one. +- [Error Handling](error-handling.md) — how to report lexical and syntactic errors. diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 0000000..6836a2b --- /dev/null +++ b/docs/index.md @@ -0,0 +1,95 @@ +# PyJapt + +**PyJapt** — *Just Another Parsing Tool Written in Python* — is a lexer and LR parser generator that lets you define a language grammar in pure Python and immediately produce a working tokeniser and parser from it. + +

+ PyJapt Logo Banner +

+ +--- + +## Why PyJapt? + +| Feature | Description | +|---------|-------------| +| **Pure Python** | No C extensions, no generated files to check in, no build step. | +| **Three LR parser types** | SLR, LR(1), and LALR(1) — choose the power level you need. | +| **Custom error handling** | Lexical and syntactic error handlers are first-class citizens. | +| **Semantic actions** | Attach a lambda or a decorated function to any production rule. | +| **Serialisation** | Pre-build the parsing tables and serialise them to a Python module for faster startup. | +| **Decorator-based API** | Define terminals and production rules without leaving Python. | + +--- + +## Quick Example + +A complete arithmetic expression parser in under 25 lines: + +```python +from pyjapt import Grammar + +g = Grammar() +expr = g.add_non_terminal('expr', True) +term, fact = g.add_non_terminals('term fact') +g.add_terminals('+ - / * ( )') +g.add_terminal('int', regex=r'\d+') + +@g.terminal('whitespace', r' +') +def whitespace(lexer): + lexer.column += len(lexer.token.lex) + lexer.position += len(lexer.token.lex) + +expr %= 'expr + term', lambda s: s[1] + s[3] +expr %= 'expr - term', lambda s: s[1] - s[3] +expr %= 'term', lambda s: s[1] + +term %= 'term * fact', lambda s: s[1] * s[3] +term %= 'term / fact', lambda s: s[1] // s[3] +term %= 'fact', lambda s: s[1] + +fact %= '( expr )', lambda s: s[2] +fact %= 'int', lambda s: int(s[1]) + +lexer = g.get_lexer() +parser = g.get_parser('slr') + +print(parser(lexer('(2 + 2) * 2 + 2'))) # 10 +``` + +--- + +## Installation + +```sh +pip install pyjapt +``` + +PyJapt requires **Python 3.10** or later and has no runtime dependencies. + +--- + +## How It Works + +PyJapt revolves around the `Grammar` class. You describe your language by: + +1. **Declaring non-terminals** — the syntactic categories of your language. +2. **Declaring terminals** — the tokens produced by the lexer. +3. **Writing productions** — rules that describe how non-terminals are composed, with optional semantic actions. +4. **Generating the lexer and parser** — call `get_lexer()` and `get_parser(type)`. + +``` +Grammar definition + │ + ├─► get_lexer() → Lexer (regex-based tokeniser) + │ + └─► get_parser() → ShiftReduceParser (SLR / LR1 / LALR1) +``` + +--- + +## Next Steps + +- Follow the [Getting Started](getting-started.md) guide to build your first language. +- Learn how to [define a grammar](defining-grammar.md) in detail. +- Read about [error handling](error-handling.md) to build robust parsers. +- Check the [API Reference](api-reference.md) for the complete public API. diff --git a/docs/lexer.md b/docs/lexer.md new file mode 100644 index 0000000..0139fd0 --- /dev/null +++ b/docs/lexer.md @@ -0,0 +1,177 @@ +# Configuring the Lexer + +The lexer produced by `g.get_lexer()` is a regex-based tokeniser. Understanding how it orders and applies patterns is essential for writing grammars with keywords, identifiers, and complex token types. + +--- + +## How the Lexer Works + +When called with a string, the lexer scans from left to right trying to match the current position against a single combined regex. The first alternative in that regex that matches wins. + +The alternatives are ordered as follows: + +1. **Ruled terminals** — terminals declared with `@g.terminal(...)` or via `add_terminal(..., rule=...)`, in the order they were declared. +2. **Non-literal terminals** — terminals with a custom `regex` argument but no rule, sorted longest-regex-first. +3. **Literal terminals** — terminals whose regex is their escaped name (declared via `add_terminal(name)` or `add_terminals(...)`), sorted longest-first. + +This ordering means that custom rule functions are checked before pattern-only terminals, and longer patterns take priority over shorter ones within each group. + +--- + +## The Token Class + +```python +class Token: + lex: str # the matched lexeme string + token_type: Any # the terminal's name (str) or Symbol object + line: int # 1-based line number + column: int # 1-based column number +``` + +--- + +## The Lexer Object Inside a Rule + +When a terminal rule function is called, it receives the `Lexer` instance with the following attributes: + +| Attribute | Type | Description | +|-----------|------|-------------| +| `lexer.token` | `Token` | The token that was just matched | +| `lexer.position` | `int` | Current byte offset in the input string | +| `lexer.lineno` | `int` | Current line number (1-based) | +| `lexer.column` | `int` | Current column number (1-based) | +| `lexer.text` | `str` | The full input string | +| `lexer.contain_errors` | `bool` | Set to `True` if any error has occurred | + +**Important:** you are responsible for advancing `lexer.position` and `lexer.column` inside a rule. If you forget, the lexer will match the same input repeatedly. + +--- + +## Common Terminal Patterns + +### Discarding whitespace + +```python +@g.terminal('whitespace', r' +') +def whitespace(lexer): + lexer.column += len(lexer.token.lex) + lexer.position += len(lexer.token.lex) + # return nothing → token is ignored +``` + +### Tracking newlines + +```python +@g.terminal('newline', r'\n+') +def newline(lexer): + lexer.lineno += len(lexer.token.lex) + lexer.position += len(lexer.token.lex) + lexer.column = 1 +``` + +### Discarding tabs + +```python +@g.terminal('tabulation', r'\t+') +def tab(lexer): + lexer.column += 4 * len(lexer.token.lex) + lexer.position += len(lexer.token.lex) +``` + +### Modifying the lexeme + +```python +@g.terminal('int', r'\d+') +def int_terminal(lexer): + lexer.column += len(lexer.token.lex) + lexer.position += len(lexer.token.lex) + lexer.token.lex = int(lexer.token.lex) # convert to Python int + return lexer.token +``` + +### Single-line comments + +```python +@g.terminal('comment', r'//[^\n]*') +def line_comment(lexer): + lexer.column += len(lexer.token.lex) + lexer.position += len(lexer.token.lex) + # discard — no return +``` + +### Block comments + +```python +@g.terminal('block_comment', r'/\*(.|\n)*?\*/') +def block_comment(lexer): + lex = lexer.token.lex + for ch in lex: + if ch == '\n': + lexer.lineno += 1 + lexer.column = 1 + else: + lexer.column += 1 + lexer.position += len(lex) +``` + +--- + +## Keywords vs Identifiers + +Suppose your language has keywords (`if`, `else`, `while`) and identifiers (`[a-zA-Z_][a-zA-Z0-9_]*`). A naïve approach would match `if` as an identifier because the identifier regex is broader. + +The correct solution is to declare keywords as literal terminals and write a single rule for identifiers that checks whether the matched text is a keyword: + +```python +from pyjapt import Grammar + +g = Grammar() +keywords = g.add_terminals('if else while return true false') +keyword_names = {t.name for t in keywords} + +@g.terminal('id', r'[a-zA-Z_][a-zA-Z0-9_]*') +def id_terminal(lexer): + lexer.column += len(lexer.token.lex) + lexer.position += len(lexer.token.lex) + if lexer.token.lex in keyword_names: + lexer.token.token_type = lexer.token.lex # reclassify as keyword + return lexer.token +``` + +Because `id_terminal` is a *ruled* terminal it runs first. If the lexeme is a keyword name, the token type is changed to the keyword name, so the parser sees the keyword terminal instead of an identifier. + +--- + +## Calling the Lexer + +```python +lexer = g.get_lexer() + +# tokenise a string +tokens = lexer('x + 42') + +for tok in tokens: + print(tok) +# id: x +# +: + +# int: 42 +# $: $ ← EOF token appended automatically +``` + +`Lexer.__call__` resets all internal state (position, line number, column, error list) before each run, so the same instance can be reused safely. + +--- + +## Checking for Lexical Errors + +After tokenisation, check `lexer.contain_errors` and read `lexer.errors`: + +```python +tokens = lexer(source_code) + +if lexer.contain_errors: + for msg in lexer.errors: + print(msg) +``` + +See [Error Handling](error-handling.md) for custom lexical error handlers. diff --git a/docs/parser.md b/docs/parser.md new file mode 100644 index 0000000..2944285 --- /dev/null +++ b/docs/parser.md @@ -0,0 +1,144 @@ +# Building a Parser + +PyJapt provides three LR parser variants. This page explains how they differ, when to use each, and how to work with the parser object. + +--- + +## Choosing a Parser Type + +```python +parser = g.get_parser('slr') # Simple LR +parser = g.get_parser('lr1') # Canonical LR(1) +parser = g.get_parser('lalr1') # LALR(1) +``` + +| Parser | Power | States | Speed | Best For | +|--------|-------|--------|-------|----------| +| `slr` | Weakest | Fewest | Fastest to build | Simple grammars, prototyping | +| `lalr1` | Middle | Fewest (same as SLR) | Fast to build | Most real-world grammars (e.g. C, Python) | +| `lr1` | Strongest | Most | Slowest to build | Grammars that LALR(1) cannot handle | + +**Rule of thumb:** start with `slr`. If you see shift-reduce or reduce-reduce conflicts that your grammar should not have, try `lalr1`. Use `lr1` only when necessary. + +--- + +## How LR Parsing Works + +LR parsers are bottom-up. They maintain a *stack* and follow one of three actions at each step: + +- **Shift** — push the current input token onto the stack. +- **Reduce** — pop symbols matching a production's body, run the semantic action, push the head non-terminal. +- **Accept** — the start symbol covers the entire input; return the top semantic value. + +The parsing tables (ACTION and GOTO) encode which action to take for every (state, token) pair. + +--- + +## Conflicts + +When two actions are valid for the same (state, lookahead) pair, a conflict arises: + +- **Shift-reduce (SR)** — the parser can either shift or reduce. PyJapt resolves SR conflicts in favour of **shift** (same as most tools, because it handles `if-else` correctly). +- **Reduce-reduce (RR)** — two different reductions are possible. PyJapt keeps whichever was registered first. + +Conflicts are printed to `stderr` and stored in `parser.conflicts`: + +```python +parser = g.get_parser('slr') +# Warning: 1 Shift-Reduce Conflicts +# Warning: 0 Reduce-Reduce Conflicts + +print(parser.shift_reduce_count) # 1 +print(parser.reduce_reduce_count) # 0 +print(parser.conflicts) # [('SR', prod_a, prod_b)] +``` + +--- + +## Semantic Actions + +A semantic action is a callable `(RuleList) -> Any` attached to a production. + +```python +fact %= 'int', lambda s: int(s[1]) +``` + +For longer actions, use `@g.production`: + +```python +@g.production('stmt -> let id = expr ;') +def let_stmt(s): + name = s[2] # id lexeme + value = s[4] # expr semantic value + return LetStatement(name, value) +``` + +The `RuleList` `s` is 1-indexed over the body symbols: + +``` +stmt -> let id = expr ; +s[0] s[1] s[2] s[3] s[4] s[5] +(head) +``` + +`s[0]` is set to whatever your action returns. + +--- + +## Calling the Parser + +The parser is callable: + +```python +result = parser(tokens) # tokens: List[Token] +``` + +`tokens` is the list returned by `lexer(text)`. If you use a different tokeniser, ensure each token has `.token_type` set to the terminal name string. + +The return value is the semantic value of the start symbol, or `None` if parsing failed without a recovery path. + +--- + +## Checking for Parsing Errors + +```python +result = parser(tokens) + +if parser.contains_errors: + for msg in parser.errors: + print(msg) +``` + +--- + +## The `verbose` Flag + +Pass `verbose=True` to `get_parser` to print every shift and reduce operation during parsing. Useful for debugging grammars. + +```python +parser = g.get_parser('slr', verbose=True) +parser(lexer('1 + 2')) +# expr <-> 1 + 2 $ +# +# Shift: ('1', 3) +# ... +``` + +--- + +## Parser Internals + +You can inspect the generated tables directly: + +```python +# ACTION table: {(state_id, Terminal): ('SHIFT', next_state) | ('REDUCE', Production) | ('OK', None)} +print(parser.action) + +# GOTO table: {(state_id, NonTerminal): next_state} +print(parser.goto) + +# The augmented grammar used internally +print(parser.augmented_grammar) +``` + +These are Python dicts and can be serialised — see [Serialisation](serialization.md). diff --git a/docs/serialization.md b/docs/serialization.md new file mode 100644 index 0000000..bf7b3b5 --- /dev/null +++ b/docs/serialization.md @@ -0,0 +1,170 @@ +# Serialisation + +For large grammars, building the parsing tables from scratch on every run can take seconds. PyJapt lets you *serialise* the pre-computed tables into plain Python modules so that subsequent runs skip the construction step entirely. + +--- + +## How It Works + +Calling `serialize_lexer` or `serialize_parser` writes a Python source file (`lexertab.py` / `parsertab.py`) that contains the pre-computed tables as dictionaries. On subsequent runs you import the generated module instead of rebuilding. + +The generated classes extend `Lexer` and `ShiftReduceParser` respectively, so they have the full API of their base classes. + +--- + +## Serialising the Lexer + +```python +import inspect +from pyjapt import Grammar + +g = Grammar() +# ... define grammar ... + +if __name__ == '__main__': + module_name = inspect.getmodulename(__file__) + g.serialize_lexer( + class_name='MyLexer', + grammar_module_name=module_name, + grammar_variable_name='g', + ) +``` + +This writes `lexertab.py` in the current working directory. The generated class looks like: + +```python +# lexertab.py (generated — do not edit by hand) +import re +from pyjapt import Token, Lexer +from my_grammar import g + +class MyLexer(Lexer): + def __init__(self): + self.pattern = re.compile(r'...') + self.token_rules = {key: rule for ...} + self.error_handler = g.lexical_error_handler or self.error + ... +``` + +--- + +## Serialising the Parser + +```python +if __name__ == '__main__': + module_name = inspect.getmodulename(__file__) + g.serialize_parser( + parser_type='lalr1', # 'slr', 'lr1', or 'lalr1' + class_name='MyParser', + grammar_module_name=module_name, + grammar_variable_name='g', + ) +``` + +This writes `parsertab.py`: + +```python +# parsertab.py (generated — do not edit by hand) +from abc import ABC +from pyjapt import ShiftReduceParser +from my_grammar import g + +class MyParser(ShiftReduceParser, ABC): + def __init__(self, verbose=False): + self.grammar = g + self.action = self.__action_table() + self.goto = self.__goto_table() + self.error_handler = g.parsing_error_handler or self.error + ... +``` + +--- + +## Using the Generated Classes + +```python +from lexertab import MyLexer +from parsertab import MyParser + +lexer = MyLexer() +parser = MyParser() + +result = parser(lexer(source_code)) +``` + +--- + +## Full Example + +**`grammar.py`** — define the grammar and conditionally serialise: + +```python +import inspect +from pyjapt import Grammar + +g = Grammar() +expr = g.add_non_terminal('expr', start_symbol=True) +term, fact = g.add_non_terminals('term fact') +g.add_terminals('+ - * / ( )') +g.add_terminal('int', regex=r'\d+') + +@g.terminal('whitespace', r' +') +def ws(lexer): + lexer.column += len(lexer.token.lex) + lexer.position += len(lexer.token.lex) + +expr %= 'expr + term', lambda s: s[1] + s[3] +expr %= 'expr - term', lambda s: s[1] - s[3] +expr %= 'term', lambda s: s[1] +term %= 'term * fact', lambda s: s[1] * s[3] +term %= 'term / fact', lambda s: s[1] // s[3] +term %= 'fact', lambda s: s[1] +fact %= '( expr )', lambda s: s[2] +fact %= 'int', lambda s: int(s[1]) + +if __name__ == '__main__': + module = inspect.getmodulename(__file__) + g.serialize_lexer(class_name='ArithLexer', grammar_module_name=module, grammar_variable_name='g') + g.serialize_parser(parser_type='lalr1', + class_name='ArithParser', grammar_module_name=module, grammar_variable_name='g') +``` + +Run once to generate the tables: + +```sh +python grammar.py +``` + +**`main.py`** — import and use: + +```python +from lexertab import ArithLexer +from parsertab import ArithParser + +lexer = ArithLexer() +parser = ArithParser() + +while True: + line = input('> ') + print(parser(lexer(line))) +``` + +--- + +## Regenerating the Tables + +The generated files must be regenerated whenever the grammar changes. A simple convention is to commit the grammar file (`grammar.py`) but add `lexertab.py` and `parsertab.py` to `.gitignore` and generate them as a build step. + +```gitignore +# .gitignore +lexertab.py +parsertab.py +``` + +--- + +## Caveats + +- **Semantic actions are not serialised.** The generated parser still imports the original grammar module (`grammar_module_name`) at runtime to access production rules and semantic actions. +- **The grammar module must be importable.** Make sure `grammar.py` (or whatever you named it) is on the Python path when running the generated classes. +- **Files are written to the current working directory.** Run the serialisation script from the directory where you want the files to be created. diff --git a/mkdocs.yml b/mkdocs.yml new file mode 100644 index 0000000..08000be --- /dev/null +++ b/mkdocs.yml @@ -0,0 +1,78 @@ +site_name: PyJapt +site_description: A lexer and LR parser generator written in Python +site_author: Alejandro Klever +site_url: https://alejandroklever.github.io/PyJapt +repo_name: alejandroklever/PyJapt +repo_url: https://github.com/alejandroklever/PyJapt +edit_uri: edit/main/docs/ + +theme: + name: material + palette: + - scheme: default + primary: deep purple + accent: purple + toggle: + icon: material/brightness-7 + name: Switch to dark mode + - scheme: slate + primary: deep purple + accent: purple + toggle: + icon: material/brightness-4 + name: Switch to light mode + features: + - navigation.tabs + - navigation.sections + - navigation.top + - navigation.footer + - toc.integrate + - search.suggest + - search.highlight + - content.code.annotate + - content.code.copy + icon: + repo: fontawesome/brands/github + +nav: + - Home: index.md + - Getting Started: getting-started.md + - User Guide: + - Defining a Grammar: defining-grammar.md + - Configuring the Lexer: lexer.md + - Building a Parser: parser.md + - Error Handling: error-handling.md + - Serialization: serialization.md + - API Reference: api-reference.md + - Changelog: changelog.md + +markdown_extensions: + - admonition + - pymdownx.details + - pymdownx.superfences: + custom_fences: + - name: mermaid + class: mermaid + format: !!python/name:pymdownx.superfences.fence_code_format + - pymdownx.highlight: + anchor_linenums: true + - pymdownx.inlinehilite + - pymdownx.snippets + - pymdownx.tabbed: + alternate_style: true + - attr_list + - md_in_html + - toc: + permalink: true + +plugins: + - search + +extra: + social: + - icon: fontawesome/brands/github + link: https://github.com/alejandroklever/PyJapt + - icon: fontawesome/brands/python + link: https://pypi.org/project/pyjapt/ + version: + provider: mike From 131f20c398da8590065dbcd51fd7a69755bf4e30 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Thu, 26 Feb 2026 15:59:33 +0000 Subject: [PATCH 2/6] Initial plan From 29b2bcc7c32637e16e6d3db8909e51e610831ad5 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Thu, 26 Feb 2026 16:00:04 +0000 Subject: [PATCH 3/6] Initial plan From e7e830e3cabb2fcaa976eb5eaab34d421575ae4b Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Thu, 26 Feb 2026 16:01:03 +0000 Subject: [PATCH 4/6] Fix British spellings to American in all docs Co-authored-by: alejandroklever <45394625+alejandroklever@users.noreply.github.com> --- docs/api-reference.md | 4 ++-- docs/changelog.md | 6 +++--- docs/defining-grammar.md | 2 +- docs/error-handling.md | 4 ++-- docs/index.md | 2 +- docs/parser.md | 4 ++-- docs/serialization.md | 14 +++++++------- 7 files changed, 18 insertions(+), 18 deletions(-) diff --git a/docs/api-reference.md b/docs/api-reference.md index 4fa9a3c..836ed20 100644 --- a/docs/api-reference.md +++ b/docs/api-reference.md @@ -203,7 +203,7 @@ Raises `ValueError` for unknown names. --- -### Serialisation +### Serialization --- @@ -225,7 +225,7 @@ Generate `parsertab.py` in the current working directory. #### `Grammar.to_json() -> str` -Serialise the grammar structure (terminals, non-terminals, productions) to a JSON string. Semantic actions and regexes are **not** included. +Serialize the grammar structure (terminals, non-terminals, productions) to a JSON string. Semantic actions and regexes are **not** included. --- diff --git a/docs/changelog.md b/docs/changelog.md index 72c0293..9cab546 100644 --- a/docs/changelog.md +++ b/docs/changelog.md @@ -28,7 +28,7 @@ This project follows [Semantic Versioning](https://semver.org) and the ### Planned — Testing - Add tests for LR(1) and LALR(1) parsers. - Add tests for lexer and parser error handling. -- Add tests for serialisation round-trips. +- Add tests for serialization round-trips. - Add edge-case tests (empty grammar, duplicate symbols, epsilon productions). - Enforce minimum test coverage threshold. @@ -62,7 +62,7 @@ This project follows [Semantic Versioning](https://semver.org) and the - Improved `RuleList` error API. ### Fixed -- Reset lexer parameters when analysing a new string (`Lexer.__call__`). +- Reset lexer parameters when analyzing a new string (`Lexer.__call__`). --- @@ -77,7 +77,7 @@ This project follows [Semantic Versioning](https://semver.org) and the ### Added - SLR, LR(1), and LALR(1) parsers. -- Serialisation of lexer and parser to Python source files. +- Serialization of lexer and parser to Python source files. - `@g.terminal` decorator for inline rule definition. - `@g.production` decorator for inline production rules. - `@g.lexical_error` and `@g.parsing_error` decorators. diff --git a/docs/defining-grammar.md b/docs/defining-grammar.md index 68d6459..e289f3a 100644 --- a/docs/defining-grammar.md +++ b/docs/defining-grammar.md @@ -224,4 +224,4 @@ g2 = Grammar.from_json(json_str) ``` !!! note - JSON serialisation does not preserve terminal regexes, terminal rules, or semantic actions. It is useful for inspecting grammar structure, not for production use. Use [Python file serialisation](serialization.md) for production scenarios. + JSON serialization does not preserve terminal regexes, terminal rules, or semantic actions. It is useful for inspecting grammar structure, not for production use. Use [Python file serialization](serialization.md) for production scenarios. diff --git a/docs/error-handling.md b/docs/error-handling.md index 43262e9..c07d93f 100644 --- a/docs/error-handling.md +++ b/docs/error-handling.md @@ -6,7 +6,7 @@ Good error handling is one of PyJapt's core design goals. This page describes ho ## Lexical Error Handling -### Default behaviour +### Default behavior When the lexer encounters a character that matches no terminal pattern, it calls the *lexical error handler*. By default, this adds an error message to the internal errors list and advances past the bad character. @@ -70,7 +70,7 @@ if lexer.contain_errors: ## Syntactic Error Handling -### Default behaviour +### Default behavior When the parser cannot find an action for the current `(state, token)` pair it enters *panic-mode recovery*: it calls the error handler and then skips input tokens until it finds one that fits the current state. diff --git a/docs/index.md b/docs/index.md index 6836a2b..6a16168 100644 --- a/docs/index.md +++ b/docs/index.md @@ -16,7 +16,7 @@ | **Three LR parser types** | SLR, LR(1), and LALR(1) — choose the power level you need. | | **Custom error handling** | Lexical and syntactic error handlers are first-class citizens. | | **Semantic actions** | Attach a lambda or a decorated function to any production rule. | -| **Serialisation** | Pre-build the parsing tables and serialise them to a Python module for faster startup. | +| **Serialization** | Pre-build the parsing tables and serialize them to a Python module for faster startup. | | **Decorator-based API** | Define terminals and production rules without leaving Python. | --- diff --git a/docs/parser.md b/docs/parser.md index 2944285..6ffa344 100644 --- a/docs/parser.md +++ b/docs/parser.md @@ -38,7 +38,7 @@ The parsing tables (ACTION and GOTO) encode which action to take for every (stat When two actions are valid for the same (state, lookahead) pair, a conflict arises: -- **Shift-reduce (SR)** — the parser can either shift or reduce. PyJapt resolves SR conflicts in favour of **shift** (same as most tools, because it handles `if-else` correctly). +- **Shift-reduce (SR)** — the parser can either shift or reduce. PyJapt resolves SR conflicts in favor of **shift** (same as most tools, because it handles `if-else` correctly). - **Reduce-reduce (RR)** — two different reductions are possible. PyJapt keeps whichever was registered first. Conflicts are printed to `stderr` and stored in `parser.conflicts`: @@ -141,4 +141,4 @@ print(parser.goto) print(parser.augmented_grammar) ``` -These are Python dicts and can be serialised — see [Serialisation](serialization.md). +These are Python dicts and can be serialized — see [Serialization](serialization.md). diff --git a/docs/serialization.md b/docs/serialization.md index bf7b3b5..dec32b8 100644 --- a/docs/serialization.md +++ b/docs/serialization.md @@ -1,6 +1,6 @@ -# Serialisation +# Serialization -For large grammars, building the parsing tables from scratch on every run can take seconds. PyJapt lets you *serialise* the pre-computed tables into plain Python modules so that subsequent runs skip the construction step entirely. +For large grammars, building the parsing tables from scratch on every run can take seconds. PyJapt lets you *serialize* the pre-computed tables into plain Python modules so that subsequent runs skip the construction step entirely. --- @@ -12,7 +12,7 @@ The generated classes extend `Lexer` and `ShiftReduceParser` respectively, so th --- -## Serialising the Lexer +## Serializing the Lexer ```python import inspect @@ -48,7 +48,7 @@ class MyLexer(Lexer): --- -## Serialising the Parser +## Serializing the Parser ```python if __name__ == '__main__': @@ -96,7 +96,7 @@ result = parser(lexer(source_code)) ## Full Example -**`grammar.py`** — define the grammar and conditionally serialise: +**`grammar.py`** — define the grammar and conditionally serialize: ```python import inspect @@ -165,6 +165,6 @@ parsertab.py ## Caveats -- **Semantic actions are not serialised.** The generated parser still imports the original grammar module (`grammar_module_name`) at runtime to access production rules and semantic actions. +- **Semantic actions are not serialized.** The generated parser still imports the original grammar module (`grammar_module_name`) at runtime to access production rules and semantic actions. - **The grammar module must be importable.** Make sure `grammar.py` (or whatever you named it) is on the Python path when running the generated classes. -- **Files are written to the current working directory.** Run the serialisation script from the directory where you want the files to be created. +- **Files are written to the current working directory.** Run the serialization script from the directory where you want the files to be created. From b1cc15b38a501c76b0c758288e1a1b8332f5949b Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Thu, 26 Feb 2026 16:01:45 +0000 Subject: [PATCH 5/6] Add RuleList to pyjapt top-level exports and update API reference docs Co-authored-by: alejandroklever <45394625+alejandroklever@users.noreply.github.com> --- docs/api-reference.md | 1 + pyjapt/__init__.py | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/api-reference.md b/docs/api-reference.md index 4fa9a3c..50d83cd 100644 --- a/docs/api-reference.md +++ b/docs/api-reference.md @@ -15,6 +15,7 @@ from pyjapt import ( SLRParser, LR1Parser, LALR1Parser, + RuleList, ) ``` diff --git a/pyjapt/__init__.py b/pyjapt/__init__.py index 3252226..426c9bc 100644 --- a/pyjapt/__init__.py +++ b/pyjapt/__init__.py @@ -1,4 +1,4 @@ from pyjapt.lexing import Lexer, Token -from pyjapt.parsing import ShiftReduceParser, LR1Parser, LALR1Parser, SLRParser, Grammar +from pyjapt.parsing import ShiftReduceParser, LR1Parser, LALR1Parser, SLRParser, Grammar, RuleList __version__ = "0.4.1" From c8a74fe12d9a2f36c1e6983881675c56b6b3d9bb Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Tue, 3 Mar 2026 17:04:22 +0000 Subject: [PATCH 6/6] Initial plan