Milestone 1: Build System, Python 3 Core Modernization, and CLI Migration by morph-modelcode-ai[bot] · Pull Request #5 · schetle/sumy

morph-modelcode-ai · 2026-06-11T20:59:40Z

Status

Milestone successfully completed. All scope items from the Milestone Plan were implemented across 11 commits:

Build system migrated from setup.py/setup.cfg/MANIFEST.in to pyproject.toml
Python 2 compatibility layer (_compat.py) deleted; all __future__ imports and coding declarations removed from every source file
NLTK tokenizer path updated from punkt to punkt_tab
Both CLI entry points (sumy/__main__.py, sumy/evaluation/__main__.py) migrated from docopt to argparse
Modern Python 3.10+ idioms applied across all 38 source files and 17 test files
ruff linting and mypy type checking configured and passing
Obsolete tests (test_compat.py, test_unicode_compatible_class.py) deleted; CLI tests rewritten for argparse

Deviation from Milestone Plan: The breadability dependency is intentionally retained in this milestone, as noted in the Milestone Plan's risk section. The full HTML parser migration to readability-lxml is deferred to Milestone 2. The html.py source file was modernized (removed _compat imports, __future__ imports, updated class syntax) but retains the breadability.readable.Article-based parsing logic.

Feature Overview

This milestone establishes the modernized foundation for the sumy project. After these changes:

Python 3.10+ only: The project targets Python 3.10, 3.11, 3.12, and 3.13. All Python 2 compatibility code has been removed.
Single build configuration: All package metadata, dependencies, tool configs (pytest, ruff, mypy), and entry points are consolidated in pyproject.toml.
Modernized CLI: Users invoke summarization via sumy <method> [options] using argparse (stdlib), removing the dependency on the unmaintained docopt library. The CLI semantics are preserved — e.g., sumy luhn --file <path> --language czech --length 3 produces summarized output.
NLTK 3.9+ compatibility: The tokenizer uses the new punkt_tab resource path, which is required by NLTK 3.9+ and resolves CVE-2024-39705.
Static analysis: ruff (linting) and mypy (type checking) are configured in pyproject.toml and pass cleanly.

Testing

Automated Testing

The existing test suite (55+ tests) validates the modernized codebase:

CLI tests (tests/test_main.py): Rewritten to test argparse-based parsing — valid method invocation, missing arguments (SystemExit), invalid methods, --version flag, handle_arguments with file/stdin/URL inputs, and invalid format rejection.
Tokenizer tests (tests/test_tokenizers.py): Verify punkt_tab tokenizer output. One expected token tuple was updated ('but' now correctly extracted from 'but..' by the punkt_tab tokenizer).
Summarizer tests (tests/test_summarizers/): All seven summarizer algorithms plus Random are tested against existing Czech and English test documents. KL summarizer includes a fix for a KeyError on capitalized words (tested via existing document fixtures).
Parser tests (tests/test_parsers.py): PlaintextParser bytes-decoding fix is covered by existing parser tests.
Model tests (tests/test_models/): DOM and TF model tests pass with modernized __str__ methods (replacing __unicode__).
Evaluation tests (tests/test_evaluation.py): ROUGE, coselection, and content-based metric tests pass unchanged.
Deleted tests: test_compat.py (87 LOC) and test_unicode_compatible_class.py (55 LOC) were removed — they tested the deleted _compat.py module.

Run the test suite:

cd sumy
pip install -e ".[dev]"
python -m nltk.downloader punkt_tab
pytest

Manual Testing

Verify end-to-end CLI summarization:

cd sumy
sumy luhn --file tests/data/snippets/prevko.txt --language czech --length 3

This should output 3 Czech-language summary sentences extracted from the test document. Confirm the output contains coherent Czech text (not error messages or encoding artifacts).

Verify argparse help and version:

sumy --help
sumy --version

--help should display the argparse-generated usage with method choices (luhn, edmundson, lsa, text-rank, lex-rank, sum-basic, kl) and all options. --version should print the version string (sumy 0.3.0).

Architecture

Overview

graph TD
    subgraph Legend
        L1[Modified]
        L2[Deleted]
    end

    BUILD["pyproject.toml\n(Build Config)"]
    CLI["CLI Entry Points\n(__main__.py, eval/__main__.py)"]
    NLP["NLP Pipeline\n(Tokenizer, Stemmers)"]
    PARSERS["Parsers\n(PlaintextParser, HtmlParser)"]
    MODELS["Models / DOM\n(Sentence, Paragraph, Document)"]
    TF["TF Model\n(TfDocumentModel)"]
    SUMMARIZERS["Summarizers\n(7 Algorithms + Random)"]
    EVAL["Evaluation\n(ROUGE, Coselection, Content)"]
    UTILS["Utilities\n(cached_property, stopwords)"]
    COMPAT["_compat.py\n(Python 2 Shim)"]

    CLI --> PARSERS
    CLI --> SUMMARIZERS
    PARSERS --> MODELS
    PARSERS --> NLP
    SUMMARIZERS --> MODELS
    SUMMARIZERS --> TF
    SUMMARIZERS --> NLP
    EVAL --> TF
    MODELS --> UTILS
    CLI --> UTILS

    style BUILD fill:#ffff99,stroke:#999
    style CLI fill:#ffff99,stroke:#999
    style NLP fill:#ffff99,stroke:#999
    style PARSERS fill:#ffff99,stroke:#999
    style MODELS fill:#ffff99,stroke:#999
    style TF fill:#ffff99,stroke:#999
    style SUMMARIZERS fill:#ffff99,stroke:#999
    style EVAL fill:#ffff99,stroke:#999
    style UTILS fill:#ffff99,stroke:#999
    style COMPAT fill:#ff9999,stroke:#999
    style L1 fill:#ffff99,stroke:#999
    style L2 fill:#ff9999,stroke:#999

Changes

Build Config (`pyproject.toml`)

Created pyproject.toml with setuptools build backend, consolidating all metadata from setup.py, tool configs from setup.cfg, and package data rules from MANIFEST.in. Key sections: [project] (name, version, Python 3.10+ classifiers, dependencies), [project.optional-dependencies] (LSA/LexRank numpy extras, dev tools), [project.scripts] (sumy, sumy_eval entry points), [tool.pytest.ini_options], [tool.ruff], [tool.mypy]. Deleted setup.py (105 LOC), setup.cfg (14 LOC), MANIFEST.in (4 LOC). Updated Makefile to remove legacy py.test-2.6/py.test-3.2 targets and old publish/bump commands. Updated .gitignore for modern Python packaging artifacts.

CLI Entry Points (`sumy/__main__.py`, `sumy/evaluation/__main__.py`)

Replaced docopt with argparse. Both CLIs use a single positional method argument with choices to select the summarizer algorithm, preserving the existing sumy <method> [options] invocation syntax. Options (--length, --language, --stopwords, --format, --url, --file, --version) are defined as explicit add_argument calls. The AVAILABLE_METHODS dict maps method names to summarizer classes directly (replacing the old iterate-and-check-boolean-flag pattern). The evaluation CLI adds a reference_summary positional argument and includes random in method choices.
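The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the summarizer classes are empty stand-ins, `build_parser` is a hypothetical helper name, and only a subset of the options is shown. The option names, the `AVAILABLE_METHODS` name, and the `sumy 0.3.0` version string come from this description.

```python
import argparse


class LuhnSummarizer:  # empty stand-in for the real summarizer class
    pass


class LsaSummarizer:  # empty stand-in for the real summarizer class
    pass


# Maps the positional `method` value directly to a summarizer class.
AVAILABLE_METHODS = {
    "luhn": LuhnSummarizer,
    "lsa": LsaSummarizer,
}


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper: a flat parser with a single positional `method`."""
    parser = argparse.ArgumentParser(prog="sumy")
    parser.add_argument("method", choices=sorted(AVAILABLE_METHODS))
    parser.add_argument("--length")
    parser.add_argument("--language")
    parser.add_argument("--file")
    parser.add_argument("--url")
    parser.add_argument("--version", action="version", version="%(prog)s 0.3.0")
    return parser


# Parsed values are namespace attributes, not dict keys as with docopt.
args = build_parser().parse_args(["luhn", "--file", "doc.txt", "--length", "3"])
summarizer_class = AVAILABLE_METHODS[args.method]
```

Because `choices` is derived from the dict keys, an unknown method name is rejected by argparse itself (it prints usage and raises `SystemExit`), so the lookup in `AVAILABLE_METHODS` should not fail for a parsed value.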

NLP Pipeline (`sumy/nlp/tokenizers.py`, `sumy/nlp/stemmers/`)

Updated the tokenizer resource path from tokenizers/punkt/{lang}.pickle to tokenizers/punkt_tab/{lang}.pickle for NLTK 3.9+ compatibility. The _params.abbrev_types API for extra abbreviations remains compatible. Stemmers module modernized (removed _compat imports, modern class syntax) but the NLTK Snowball API is unchanged. Czech stemmer (czech.py) modernized with f-strings and class Foo: syntax.

Parsers (`sumy/parsers/`)

PlaintextParser: Fixed bytes-decoding — when files are opened in binary mode, input_stream.read() returns bytes. The old to_unicode() handled this transparently, but the replacement str(text) produces "b'...'" instead of decoding. Added explicit isinstance(text, bytes) check with .decode("utf-8"). HtmlParser: Modernized imports and class syntax; breadability-based parsing logic retained for Milestone 2 migration. Base DocumentParser class: removed _compat imports, modernized class syntax.
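The bytes-decoding pitfall can be reproduced in isolation. The sketch below is illustrative only: `read_text` is a hypothetical helper, not a function from the sumy codebase, and it assumes UTF-8 input as the description states.

```python
import io


def read_text(input_stream) -> str:
    """Hypothetical helper: return stream content as str, decoding bytes as UTF-8."""
    text = input_stream.read()
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    return text


raw = "Příliš žluťoučký kůň".encode("utf-8")

# The pitfall: str() on bytes returns the repr ("b'...'"), not the decoded text.
broken = str(raw)

# Binary-mode and text-mode streams both yield the same str.
from_binary = read_text(io.BytesIO(raw))
from_text = read_text(io.StringIO("Příliš žluťoučký kůň"))
```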

Models / DOM (`sumy/models/dom/`)

Sentence, Paragraph, ObjectDocumentModel: Replaced @unicode_compatible decorator with direct __str__ methods (renamed from __unicode__). Updated __repr__ to use f-strings. __slots__ retained on Sentence and Paragraph for memory efficiency. Sentence.__hash__ and __eq__ contracts preserved. All to_unicode()/to_string() calls replaced with native str().
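A simplified sketch of the resulting class shape is shown below. It is an assumption-laden stand-in, not the real `Sentence`: the slot names, constructor signature, and the exact `__repr__` format are invented for illustration. Only the general pattern (a direct `__str__`, an f-string `__repr__`, `__slots__`, and paired `__eq__`/`__hash__`) is taken from the description.

```python
class Sentence:
    """Simplified stand-in for the DOM Sentence class (names are assumptions)."""

    __slots__ = ("_text", "_is_heading")

    def __init__(self, text: str, is_heading: bool = False) -> None:
        self._text = str(text).strip()
        self._is_heading = bool(is_heading)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Sentence):
            return NotImplemented
        return (self._is_heading, self._text) == (other._is_heading, other._text)

    def __hash__(self) -> int:
        # Must agree with __eq__: equal sentences hash equally.
        return hash((self._is_heading, self._text))

    def __str__(self) -> str:  # previously __unicode__ behind @unicode_compatible
        return self._text

    def __repr__(self) -> str:
        return f"<{self.__class__.__name__}: {self}>"


sentence = Sentence("  Hello world.  ")
```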

TF Model (`sumy/models/tf.py`)

Fixed from collections import Sequence to from collections.abc import Sequence (the old import was removed in Python 3.10). Modernized class syntax and string formatting.
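For context, a type check of the kind this import typically supports might look like the sketch below. `normalize_words` and its `tokenize` parameter are hypothetical and not taken from `tf.py`; the point is only that `Sequence` must now come from `collections.abc`.

```python
from collections.abc import Sequence  # `from collections import Sequence` fails on 3.10+


def normalize_words(words, tokenize=str.split):
    """Hypothetical helper: accept raw text or a ready-made sequence of words."""
    if isinstance(words, str):  # str is itself a Sequence, so check it first
        words = tokenize(words)
    elif not isinstance(words, Sequence):
        raise ValueError("words must be a string or a sequence of words")
    return [word.lower() for word in words]


from_text = normalize_words("Hello World")
from_tuple = normalize_words(("A", "b"))
```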

Summarizers (`sumy/summarizers/`)

All eight summarizer modules modernized: removed _compat imports, __future__ imports, coding declarations. Applied class Foo: syntax, super() calls, f-strings. AbstractSummarizer._get_best_sentences: converted lambda to named function for ruff E731 compliance. KLSummarizer._compute_ratings: fixed a KeyError bug — raw (unnormalized) summary words were compared against the normalized frequency dictionary; now uses _get_all_content_words_in_doc for consistent normalization. LsaSummarizer and LexRankSummarizer: numpy optional imports restructured with typing.Any annotations for mypy compatibility; import ordering fixed for ruff E402. Class-level _stop_words attributes annotated as frozenset[str] across TextRankSummarizer, LuhnSummarizer, LsaSummarizer, and LexRankSummarizer.
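The KL normalization mismatch can be illustrated with a toy example. The names below (`STOP_WORDS`, `normalize`, `content_words`) are hypothetical stand-ins rather than the summarizer's real methods, and lowercasing is assumed as the normalization step.

```python
from collections import Counter

STOP_WORDS = frozenset({"the", "a", "is"})  # toy stop-word list


def normalize(word: str) -> str:
    return word.lower()


def content_words(words):
    """Normalize words and drop stop words (the consistent path)."""
    normalized = (normalize(word) for word in words)
    return [word for word in normalized if word not in STOP_WORDS]


document = ["The", "Summary", "is", "a", "summary"]
summary = ["Summary"]

# The frequency map is keyed by normalized words only.
word_freq = dict(Counter(content_words(document)))

# Old behavior: a raw, capitalized summary word is not a key in the map.
try:
    word_freq["Summary"]
    raw_lookup_failed = False
except KeyError:
    raw_lookup_failed = True

# Fixed behavior: normalize the summary words the same way before the lookup.
counts = [word_freq[word] for word in content_words(summary)]
```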

Evaluation (`sumy/evaluation/`)

rouge.py: Converted from tab indentation to spaces (fixing ruff W191/E101). Modernized all string formatting and class syntax. content_based.py, coselection.py: Removed _compat imports, applied f-strings. __init__.py: Fixed re-exports with explicit as symbol syntax for ruff F401.

Utilities (`sumy/utils.py`)

Retained the custom cached_property decorator (required for __slots__-based classes) with an explanatory comment noting why functools.cached_property cannot be used. Replaced to_string()/to_unicode() calls with native str() and .decode("utf-8"). ItemsCount and read_stop_words modernized with f-strings and direct str/bytes operations.

Deleted: `_compat.py` (Python 2 Shim)

Deleted sumy/_compat.py (109 LOC). This module provided Python 2/3 dual-compatibility: to_unicode, to_string, to_bytes, unicode_compatible decorator, Counter polyfill, urllib shim, ffilter, string_types, and PY3 flag. All 17 source files that imported from _compat were updated to use native Python 3 equivalents.
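As a rough guide to what the native equivalents look like, the sketch below pairs several of the removed shim names with standard-library replacements. `to_text` is a hypothetical helper showing what a `to_unicode()` call site reduces to when bytes are possible; the PR may inline this differently.

```python
from collections import Counter           # replaces the Counter polyfill
from itertools import filterfalse         # replaces ffilter
from urllib.request import urlopen        # replaces the urllib shim (imported only, not called here)


def to_text(value) -> str:
    """Hypothetical helper: decode bytes as UTF-8, otherwise fall back to str()."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


# string_types -> str
is_text = isinstance("luhn", str)

# ffilter(predicate, items) -> filterfalse(predicate, items)
non_numeric = list(filterfalse(str.isdigit, ["a", "1", "b"]))

# Counter polyfill -> collections.Counter
most_common = Counter(["a", "b", "a"]).most_common(1)
```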

Design Decisions

1. Slot-Compatible Caching Strategy

The custom cached_property decorator in sumy/utils.py was retained rather than replaced with functools.cached_property. Sentence and Paragraph use __slots__ for memory efficiency, and functools.cached_property stores its result in the instance __dict__, which classes defined with __slots__ alone do not have. A single caching implementation is used across all classes for consistency, avoiding behavioral divergence between two different caching mechanisms.
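The incompatibility can be demonstrated with a small sketch. The decorator below is one plausible way to write a slot-compatible cache and is not claimed to match the code in sumy/utils.py; the `_cached_property_<name>` slot naming and both classes are assumptions made for illustration.

```python
import functools


def cached_property(getter):
    """Cache the getter's result in a slot named _cached_property_<name>."""
    key = "_cached_property_" + getter.__name__

    @functools.wraps(getter)
    def wrapper(self):
        # An unset slot raises AttributeError, so hasattr() is False until cached.
        if not hasattr(self, key):
            setattr(self, key, getter(self))
        return getattr(self, key)

    return property(wrapper)


class Sentence:
    # The cache slot must be declared explicitly; there is no instance __dict__.
    __slots__ = ("_text", "_cached_property_words")

    def __init__(self, text):
        self._text = text

    @cached_property
    def words(self):
        return tuple(self._text.split())


class Broken:
    __slots__ = ("_text",)

    def __init__(self, text):
        self._text = text

    @functools.cached_property  # needs an instance __dict__ to store the value
    def words(self):
        return tuple(self._text.split())


sentence = Sentence("a b c")
first = sentence.words
second = sentence.words  # served from the slot, not recomputed
```

Accessing `Broken("a b").words` raises `TypeError` because the instance has no `__dict__` for the standard-library decorator to write into, which is the reason given above for keeping the custom decorator.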

2. argparse CLI Structure

A single positional argument with choices was chosen over subparsers or --method flag. This preserves the existing sumy luhn ... invocation syntax (backward-compatible with docopt-era usage), keeps the CLI flat and simple, and avoids the boilerplate of per-algorithm subparsers. The AVAILABLE_METHODS dict provides a clean mapping from method name to summarizer class.

3. Build Configuration Consolidation

Version-suffixed entry points (sumy-3.4, sumy_eval-3.4) were dropped — these are a Python 2-era convention for side-by-side Python version installations and are unnecessary with Python 3.10+ as the sole target. bumpversion config was not migrated. Only sumy and sumy_eval are registered as console scripts.

4. unicode_compatible Decorator Removal

__unicode__ methods on Sentence, Paragraph, and ObjectDocumentModel were renamed to __str__, and __repr__ methods were updated to use f-strings. No __bytes__ methods were retained — they were a Python 2 artifact. The __str__ contract (returning the text content) is preserved exactly.

Suggested Order of Review

pyproject.toml — Start here to understand the new build configuration, dependencies, and tool settings. This is the foundation for all other changes.
sumy/utils.py — Review the retained cached_property decorator and modernized utility functions. Understanding this is needed for the DOM model changes.
sumy/models/dom/_sentence.py — Core data structure. See __slots__ retention, __str__ replacement for __unicode__, and cached_property usage.
sumy/models/dom/_paragraph.py — Same pattern as _sentence.py.
sumy/models/dom/_document.py — ObjectDocumentModel modernization.
sumy/models/tf.py — collections.abc.Sequence fix and modernized class syntax.
sumy/nlp/tokenizers.py — punkt_tab resource path change (the critical NLTK upgrade).
sumy/nlp/stemmers/__init__.py and sumy/nlp/stemmers/czech.py — Stemmer modernization.
sumy/parsers/plaintext.py — Bytes-decoding fix for binary file input.
sumy/parsers/html.py and sumy/parsers/parser.py — Parser modernization (breadability retained).
sumy/__main__.py — Main CLI rewrite (docopt to argparse). Core behavioral change.
sumy/evaluation/__main__.py — Evaluation CLI rewrite (same pattern as main CLI).
sumy/summarizers/kl.py — KL summarizer KeyError bugfix (normalization mismatch).
sumy/summarizers/lsa.py and sumy/summarizers/lex_rank.py — numpy optional import pattern with mypy annotations.
Remaining sumy/summarizers/ — Bulk modernization across all other summarizer modules.
sumy/evaluation/rouge.py — Tab-to-space conversion and modernization of the largest evaluation module.
tests/test_main.py — Rewritten CLI tests (argparse assertions replacing docopt assertions).
tests/utils.py — Test utility modernization.
Remaining test files — Bulk _compat removal and to_unicode replacement.
Deleted files — Confirm removal of setup.py, setup.cfg, MANIFEST.in, sumy/_compat.py, tests/test_utils/test_compat.py, tests/test_utils/test_unicode_compatible_class.py.

- Create pyproject.toml with setuptools build backend, project metadata, dependencies (nltk, breadability; docopt removed for argparse migration), optional deps (LSA, LexRank, dev), console scripts, and tool configs (pytest, ruff, mypy)
- Update classifiers to Python 3.10-3.13 and Development Status 4 - Beta
- Remove version-suffixed entry points (sumy-X.Y, sumy_eval-X.Y)
- Delete setup.py, setup.cfg, and MANIFEST.in
- Update Makefile with modern pytest target, remove old publish/bump targets
- Update .gitignore with patterns for build artifacts, testing, linting, and type checking caches

- Delete sumy/_compat.py (Python 2/3 compatibility shim)
- Remove all '# -*- coding: utf8 -*-' lines from every source file
- Remove all 'from __future__ import' lines from every source file
- Replace _compat imports with Python 3 native equivalents:
  - to_unicode/to_string -> str() or remove where unnecessary
  - unicode -> str
  - string_types -> str
  - unicode_compatible decorator -> direct __str__ method
  - ffilter -> itertools.filterfalse
  - Counter -> collections.Counter
  - urllib -> urllib.request
  - PY3 branching -> direct Python 3 code
- Modernize class syntax: class Foo(object) -> class Foo
- Modernize super calls: super(Class, self) -> super()
- Convert % string formatting to f-strings where appropriate
- Update collections.Sequence -> collections.abc.Sequence in models/tf.py
- Keep custom cached_property (required for __slots__ classes) with explanatory comment

Replace deprecated tokenizers/punkt/{lang}.pickle path with the new tokenizers/punkt_tab/{lang}.pickle path required by NLTK 3.9+. The _params.abbrev_types API remains compatible.

Replace docopt with argparse in sumy/__main__.py and sumy/evaluation/__main__.py. Both CLIs now use a single positional argument with choices for summarizer selection, preserving the existing `sumy <method> [options]` interface. Key changes:

- Remove docopt imports and docstring-as-usage-spec blocks
- Add argparse.ArgumentParser with explicit argument definitions
- Access parsed args as namespace attributes instead of dict keys
- Look up summarizer via AVAILABLE_METHODS[args.method] instead of iterating and checking boolean flags
- Evaluation CLI includes 'random' in method choices and accepts a positional reference_summary argument

When files are opened in binary mode ("rb"), input_stream.read() returns bytes. The old to_unicode() from _compat handled this, but the modernized str(text) produces "b'...'" instead of decoding. Add explicit bytes decoding in PlaintextParser.__init__.

- Delete test_compat.py and test_unicode_compatible_class.py (tested deleted _compat module)
- Rewrite test_main.py to test argparse-based CLI instead of docopt
- Update tests/utils.py: remove _compat imports, use plain str and bytes decode
- Remove # -*- coding: utf8 -*- and from __future__ imports from all test files
- Replace all to_unicode() calls with str() across 9 test files
- Remove all sumy._compat imports from test files

Converts the warning message in _create_matrix (lsa.py) from % formatting to an f-string. This occurrence was missed during the bulk modernization pass.

The KL summarizer's _compute_ratings method used raw (unnormalized) words from the summary when computing joint frequencies, but compared them against the normalized word frequency dictionary. This caused a KeyError when a capitalized word appeared in the summary but only its lowercase form existed in the term frequency map. Fixed by using _get_all_content_words_in_doc (which normalizes and filters stop words) instead of _get_all_words_in_doc for the summary word list.

Updated test_tokenize_sentence expected tuple to include 'but': the NLTK punkt_tab tokenizer correctly extracts 'but' from 'but..' as a word token, whereas the legacy punkt pickle tokenizer did not.

- Fix F401 re-exports in __init__.py files using explicit `as symbol` syntax (evaluation, models, models/dom, parsers, summarizers)
- Fix W191/E101 tab indentation in evaluation/rouge.py (converted to spaces)
- Fix E731 lambda assignment in summarizers/_summarizer.py (use def instead)
- Fix E701 multiple statements on one line in summarizers/kl.py
- Fix F841 unused variables in tests (test_evaluation.py, test_kl.py)
- Fix I001 unsorted imports via ruff --fix
- Apply ruff format across all source and test files

- Add ignore_missing_imports=true to [tool.mypy] config for untyped third-party libraries (nltk, breadability)
- Add frozenset[str] type annotations to _stop_words class attributes in text_rank.py, luhn.py, lsa.py, and lex_rank.py
- Add frozenset[str] type annotation to _EMPTY_SET in edmundson.py
- Fix numpy/svd optional import typing in lsa.py and lex_rank.py by declaring module-level `Any` annotations before try/except blocks

Move the relative import of AbstractSummarizer above the numpy try/except blocks so all standard imports precede the optional import guards. This resolves ruff E402 (module-level import not at top of file) introduced by the mypy type annotation fix.

mcode-bot@modelcode.ai added 11 commits June 11, 2026 13:59

Update NLTK tokenizer to use punkt_tab resource path

8b47c6c

Convert remaining % string formatting to f-string in lsa.py

a382e52

morph-modelcode-ai[bot] wants to merge 11 commits into morph-main from sumy-milestone_1-ace71f
