Milestone 1: Build System, Python 3 Core Modernization, and CLI Migration #5

Open
morph-modelcode-ai[bot] wants to merge 11 commits into morph-main from sumy-milestone_1-ace71f

@morph-modelcode-ai

Status

Milestone successfully completed. All scope items from the Milestone Plan were implemented across 11 commits:

  • Build system migrated from setup.py/setup.cfg/MANIFEST.in to pyproject.toml
  • Python 2 compatibility layer (_compat.py) deleted; all __future__ imports and coding declarations removed from every source file
  • NLTK tokenizer path updated from punkt to punkt_tab
  • Both CLI entry points (sumy/__main__.py, sumy/evaluation/__main__.py) migrated from docopt to argparse
  • Modern Python 3.10+ idioms applied across all 38 source files and 17 test files
  • ruff linting and mypy type checking configured and passing
  • Obsolete tests (test_compat.py, test_unicode_compatible_class.py) deleted; CLI tests rewritten for argparse

Deviation from Milestone Plan: The breadability dependency is intentionally retained in this milestone, as noted in the Milestone Plan's risk section. The full HTML parser migration to readability-lxml is deferred to Milestone 2. The html.py source file was modernized (removed _compat imports, __future__ imports, updated class syntax) but retains the breadability.readable.Article-based parsing logic.


Feature Overview

This milestone establishes the modernized foundation for the sumy project. After these changes:

  • Python 3.10+ only: The project targets Python 3.10, 3.11, 3.12, and 3.13. All Python 2 compatibility code has been removed.
  • Single build configuration: All package metadata, dependencies, tool configs (pytest, ruff, mypy), and entry points are consolidated in pyproject.toml.
  • Modernized CLI: Users invoke summarization via sumy <method> [options]. Argument parsing now uses argparse from the standard library, which removes the dependency on the unmaintained docopt library. CLI semantics are preserved: for example, sumy luhn --file <path> --language czech --length 3 prints a three-sentence summary.
  • NLTK 3.9+ compatibility: The tokenizer loads the punkt_tab resource, which NLTK 3.9+ requires. NLTK 3.9 replaced the pickled punkt models with punkt_tab to address CVE-2024-39705 (unsafe pickle deserialization).
  • Static analysis: ruff (linting) and mypy (type checking) are configured in pyproject.toml and pass cleanly.

Testing

Automated Testing

The existing test suite (55+ tests) validates the modernized codebase:

  • CLI tests (tests/test_main.py): Rewritten to test argparse-based parsing — valid method invocation, missing arguments (SystemExit), invalid methods, --version flag, handle_arguments with file/stdin/URL inputs, and invalid format rejection.
  • Tokenizer tests (tests/test_tokenizers.py): Verify punkt_tab tokenizer output. One expected token tuple was updated ('but' now correctly extracted from 'but..' by the punkt_tab tokenizer).
  • Summarizer tests (tests/test_summarizers/): All seven summarizer algorithms plus Random are tested against existing Czech and English test documents. KL summarizer includes a fix for a KeyError on capitalized words (tested via existing document fixtures).
  • Parser tests (tests/test_parsers.py): PlaintextParser bytes-decoding fix is covered by existing parser tests.
  • Model tests (tests/test_models/): DOM and TF model tests pass with modernized __str__ methods (replacing __unicode__).
  • Evaluation tests (tests/test_evaluation.py): ROUGE, coselection, and content-based metric tests pass unchanged.
  • Deleted tests: test_compat.py (87 LOC) and test_unicode_compatible_class.py (55 LOC) were removed — they tested the deleted _compat.py module.

Run the test suite:

cd sumy
pip install -e ".[dev]"
python -m nltk.downloader punkt_tab
pytest

Manual Testing

Verify end-to-end CLI summarization:

cd sumy
sumy luhn --file tests/data/snippets/prevko.txt --language czech --length 3

This should output 3 Czech-language summary sentences extracted from the test document. Confirm the output contains coherent Czech text (not error messages or encoding artifacts).

Verify argparse help and version:

sumy --help
sumy --version

--help should display the argparse-generated usage with method choices (luhn, edmundson, lsa, text-rank, lex-rank, sum-basic, kl) and all options. --version should print the version string (sumy 0.3.0).


Architecture

Overview

graph TD
    subgraph Legend
        L1[Modified]
        L2[Deleted]
    end

    BUILD["pyproject.toml\n(Build Config)"]
    CLI["CLI Entry Points\n(__main__.py, eval/__main__.py)"]
    NLP["NLP Pipeline\n(Tokenizer, Stemmers)"]
    PARSERS["Parsers\n(PlaintextParser, HtmlParser)"]
    MODELS["Models / DOM\n(Sentence, Paragraph, Document)"]
    TF["TF Model\n(TfDocumentModel)"]
    SUMMARIZERS["Summarizers\n(7 Algorithms + Random)"]
    EVAL["Evaluation\n(ROUGE, Coselection, Content)"]
    UTILS["Utilities\n(cached_property, stopwords)"]
    COMPAT["_compat.py\n(Python 2 Shim)"]

    CLI --> PARSERS
    CLI --> SUMMARIZERS
    PARSERS --> MODELS
    PARSERS --> NLP
    SUMMARIZERS --> MODELS
    SUMMARIZERS --> TF
    SUMMARIZERS --> NLP
    EVAL --> TF
    MODELS --> UTILS
    CLI --> UTILS

    style BUILD fill:#ffff99,stroke:#999
    style CLI fill:#ffff99,stroke:#999
    style NLP fill:#ffff99,stroke:#999
    style PARSERS fill:#ffff99,stroke:#999
    style MODELS fill:#ffff99,stroke:#999
    style TF fill:#ffff99,stroke:#999
    style SUMMARIZERS fill:#ffff99,stroke:#999
    style EVAL fill:#ffff99,stroke:#999
    style UTILS fill:#ffff99,stroke:#999
    style COMPAT fill:#ff9999,stroke:#999
    style L1 fill:#ffff99,stroke:#999
    style L2 fill:#ff9999,stroke:#999

Changes

Build Config (pyproject.toml)

Created pyproject.toml with setuptools build backend, consolidating all metadata from setup.py, tool configs from setup.cfg, and package data rules from MANIFEST.in. Key sections: [project] (name, version, Python 3.10+ classifiers, dependencies), [project.optional-dependencies] (LSA/LexRank numpy extras, dev tools), [project.scripts] (sumy, sumy_eval entry points), [tool.pytest.ini_options], [tool.ruff], [tool.mypy]. Deleted setup.py (105 LOC), setup.cfg (14 LOC), MANIFEST.in (4 LOC). Updated Makefile to remove legacy py.test-2.6/py.test-3.2 targets and old publish/bump commands. Updated .gitignore for modern Python packaging artifacts.

CLI Entry Points (sumy/__main__.py, sumy/evaluation/__main__.py)

Replaced docopt with argparse. Both CLIs use a single positional method argument with choices to select the summarizer algorithm, preserving the existing sumy <method> [options] invocation syntax. Options (--length, --language, --stopwords, --format, --url, --file, --version) are defined as explicit add_argument calls. The AVAILABLE_METHODS dict maps method names to summarizer classes directly (replacing the old iterate-and-check-boolean-flag pattern). The evaluation CLI adds a reference_summary positional argument and includes random in method choices.
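A minimal sketch of the parser shape described above. The placeholder summarizer classes, the default values, and the build_parser helper name are assumptions for illustration; the real entry points define more options (--stopwords, --format, --url) and more methods.

```python
import argparse


# Placeholder classes standing in for the real summarizers (hypothetical).
class LuhnSummarizer: ...
class LsaSummarizer: ...


# Method name -> summarizer class; replaces iterating over docopt's boolean flags.
AVAILABLE_METHODS = {"luhn": LuhnSummarizer, "lsa": LsaSummarizer}


def build_parser() -> argparse.ArgumentParser:
    """Build a flat parser: one positional `method` plus explicit options."""
    parser = argparse.ArgumentParser(prog="sumy")
    parser.add_argument("--version", action="version", version="%(prog)s 0.3.0")
    parser.add_argument("method", choices=sorted(AVAILABLE_METHODS))
    parser.add_argument("--length", default="20")        # default is assumed
    parser.add_argument("--language", default="english")  # default is assumed
    parser.add_argument("--file")
    return parser


args = build_parser().parse_args(
    ["luhn", "--file", "doc.txt", "--language", "czech", "--length", "3"]
)
# Parsed values are namespace attributes, and the class lookup is a plain dict access.
summarizer_class = AVAILABLE_METHODS[args.method]
print(summarizer_class.__name__, args.language, args.length)
```

Because method is constrained by choices, an unknown method name makes argparse print a usage error and exit with status 2, which is what the rewritten CLI tests assert via SystemExit.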

NLP Pipeline (sumy/nlp/tokenizers.py, sumy/nlp/stemmers/)

Updated the tokenizer resource path from tokenizers/punkt/{lang}.pickle to tokenizers/punkt_tab/{lang}.pickle for NLTK 3.9+ compatibility. The _params.abbrev_types API for extra abbreviations remains compatible. Stemmers module modernized (removed _compat imports, modern class syntax) but the NLTK Snowball API is unchanged. Czech stemmer (czech.py) modernized with f-strings and class Foo: syntax.

Parsers (sumy/parsers/)

PlaintextParser: Fixed bytes-decoding — when files are opened in binary mode, input_stream.read() returns bytes. The old to_unicode() handled this transparently, but the replacement str(text) produces "b'...'" instead of decoding. Added explicit isinstance(text, bytes) check with .decode("utf-8"). HtmlParser: Modernized imports and class syntax; breadability-based parsing logic retained for Milestone 2 migration. Base DocumentParser class: removed _compat imports, modernized class syntax.
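The pitfall and the fix can be reproduced in isolation. The read_text helper below is hypothetical and only mirrors the isinstance check described above; it is not the code in PlaintextParser.

```python
import io


def read_text(input_stream) -> str:
    """Return the stream's content as str, decoding bytes as UTF-8."""
    text = input_stream.read()
    if isinstance(text, bytes):
        return text.decode("utf-8")
    return str(text)


raw = "Příliš žluťoučký kůň".encode("utf-8")

# The pitfall: str() on bytes yields the repr ("b'...'"), not the decoded text.
print(str(raw)[:2])

# The fix: binary and text streams both come back as clean str.
print(read_text(io.BytesIO(raw)))
print(read_text(io.StringIO("plain")))
```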

Models / DOM (sumy/models/dom/)

Sentence, Paragraph, ObjectDocumentModel: Replaced @unicode_compatible decorator with direct __str__ methods (renamed from __unicode__). Updated __repr__ to use f-strings. __slots__ retained on Sentence and Paragraph for memory efficiency. Sentence.__hash__ and __eq__ contracts preserved. All to_unicode()/to_string() calls replaced with native str().
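As a rough illustration of the resulting class shape, here is a stripped-down stand-in for Sentence. The field names and the exact __repr__ format are assumptions; the real class carries more state.

```python
class Sentence:
    # Illustrative stand-in; slot names are assumed, not copied from the PR.
    __slots__ = ("_text", "_is_heading")

    def __init__(self, text: str, is_heading: bool = False) -> None:
        self._text = str(text).strip()
        self._is_heading = bool(is_heading)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Sentence):
            return NotImplemented
        return (self._is_heading, self._text) == (other._is_heading, other._text)

    def __hash__(self) -> int:
        # Hash over the same fields as __eq__ so equal sentences collide.
        return hash((self._is_heading, self._text))

    def __str__(self) -> str:  # formerly __unicode__ behind @unicode_compatible
        return self._text

    def __repr__(self) -> str:
        return f"<{type(self).__name__}: {self}>"


s = Sentence("  Hello world. ")
print(str(s))
print(repr(s))
```

With __str__ defined directly, callers that previously went through to_unicode(sentence) can call str(sentence).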

TF Model (sumy/models/tf.py)

Changed from collections import Sequence to from collections.abc import Sequence. The ABC aliases in collections were removed in Python 3.10, so the old import raises ImportError on every supported version. Modernized class syntax and string formatting.
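A small sketch of the corrected import in use. The describe helper is hypothetical, and it assumes a Sequence check is used to tell token sequences apart from raw text or other inputs.

```python
from collections.abc import Sequence  # `from collections import Sequence` fails on 3.10+


def describe(terms) -> str:
    """Classify an input the way a term-frequency model might (hypothetical helper)."""
    # str is itself a Sequence, so it has to be ruled out first.
    if isinstance(terms, str):
        return "text"
    if isinstance(terms, Sequence):
        return "sequence of terms"
    return "other"


print(describe(["a", "b"]))
print(describe("a b"))
print(describe({"a"}))  # a set is not a Sequence
```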

Summarizers (sumy/summarizers/)

All eight summarizer modules modernized: removed _compat imports, __future__ imports, coding declarations. Applied class Foo: syntax, super() calls, f-strings. AbstractSummarizer._get_best_sentences: converted lambda to named function for ruff E731 compliance. KLSummarizer._compute_ratings: fixed a KeyError bug — raw (unnormalized) summary words were compared against the normalized frequency dictionary; now uses _get_all_content_words_in_doc for consistent normalization. LsaSummarizer and LexRankSummarizer: numpy optional imports restructured with typing.Any annotations for mypy compatibility; import ordering fixed for ruff E402. Class-level _stop_words attributes annotated as frozenset[str] across TextRankSummarizer, LuhnSummarizer, LsaSummarizer, and LexRankSummarizer.
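The KLSummarizer KeyError comes down to a normalization mismatch, which the sketch below reproduces with toy data. The normalize and content_words helpers and the stop-word set are hypothetical; they only stand in for the role that _get_all_content_words_in_doc plays in the fix.

```python
from collections import Counter

STOP_WORDS = frozenset({"the", "a"})


def normalize(word: str) -> str:
    return word.lower()


def content_words(words: list[str]) -> list[str]:
    """Normalize words and drop stop words, matching the frequency table's keys."""
    return [w for w in map(normalize, words) if w not in STOP_WORDS]


document = ["The", "Cat", "sat", "the", "cat", "ran"]
summary = ["The", "Cat", "ran"]

# The frequency table is keyed by normalized content words.
doc_content = content_words(document)
counts = Counter(doc_content)
word_freq = {word: count / len(doc_content) for word, count in counts.items()}

# Before the fix: raw summary words are looked up in the normalized table.
try:
    [word_freq[w] for w in summary]
    raw_lookup_failed = False
except KeyError:
    raw_lookup_failed = True

# After the fix: the summary goes through the same normalization first.
summary_freqs = [word_freq[w] for w in content_words(summary)]
print(raw_lookup_failed, summary_freqs)
```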

Evaluation (sumy/evaluation/)

rouge.py: Converted from tab indentation to spaces (fixing ruff W191/E101). Modernized all string formatting and class syntax. content_based.py, coselection.py: Removed _compat imports, applied f-strings. __init__.py: Fixed re-exports with explicit as symbol syntax for ruff F401.

Utilities (sumy/utils.py)

Retained the custom cached_property decorator (required for __slots__-based classes) with an explanatory comment noting why functools.cached_property cannot be used. Replaced to_string()/to_unicode() calls with native str() and .decode("utf-8"). ItemsCount and read_stop_words modernized with f-strings and direct str/bytes operations.

Deleted: _compat.py (Python 2 Shim)

Deleted sumy/_compat.py (109 LOC). This module provided Python 2/3 dual-compatibility: to_unicode, to_string, to_bytes, unicode_compatible decorator, Counter polyfill, urllib shim, ffilter, string_types, and PY3 flag. All 17 source files that imported from _compat were updated to use native Python 3 equivalents.

Design Decisions

1. Slot-Compatible Caching Strategy

The custom cached_property decorator in sumy/utils.py was retained rather than replaced with functools.cached_property. This is because Sentence and Paragraph use __slots__ for memory efficiency, and functools.cached_property requires __dict__ (incompatible with __slots__). A single caching implementation is used across all classes for consistency, avoiding behavioral divergence between two different caching mechanisms.
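The constraint can be demonstrated with a small sketch. The decorator below is an assumed shape for a slot-compatible cache (a property that stores its value in a dedicated slot) and is not copied from sumy/utils.py. The Broken class shows what happens when functools.cached_property meets __slots__.

```python
import functools


def cached_property(getter):
    """Cache the getter's result in a slot named `_cached_property_<name>`."""
    key = "_cached_property_" + getter.__name__

    @functools.wraps(getter)
    def wrapper(self):
        # An unset slot raises AttributeError, so hasattr() is False until cached.
        if not hasattr(self, key):
            setattr(self, key, getter(self))
        return getattr(self, key)

    return property(wrapper)


class Sentence:
    # The cache needs its own slot because there is no per-instance __dict__.
    __slots__ = ("_text", "_cached_property_words")

    def __init__(self, text: str) -> None:
        self._text = text

    @cached_property
    def words(self) -> tuple[str, ...]:
        return tuple(self._text.split())


class Broken:
    __slots__ = ("_text",)

    def __init__(self, text: str) -> None:
        self._text = text

    @functools.cached_property
    def words(self) -> tuple[str, ...]:
        return tuple(self._text.split())


s = Sentence("a b c")
print(s.words, s.words is s.words)  # second access returns the cached tuple

try:
    Broken("a b c").words
except TypeError as error:
    print("functools.cached_property:", error)
```

The trade-off in this sketch is that every cached attribute needs a matching slot declared by hand.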

2. argparse CLI Structure

A single positional argument with choices was chosen over subparsers or --method flag. This preserves the existing sumy luhn ... invocation syntax (backward-compatible with docopt-era usage), keeps the CLI flat and simple, and avoids the boilerplate of per-algorithm subparsers. The AVAILABLE_METHODS dict provides a clean mapping from method name to summarizer class.

3. Build Configuration Consolidation

Version-suffixed entry points (sumy-3.4, sumy_eval-3.4) were dropped — these are a Python 2-era convention for side-by-side Python version installations and are unnecessary with Python 3.10+ as the sole target. bumpversion config was not migrated. Only sumy and sumy_eval are registered as console scripts.

4. unicode_compatible Decorator Removal

__unicode__ methods on Sentence, Paragraph, and ObjectDocumentModel were renamed to __str__, and __repr__ methods were updated to use f-strings. No __bytes__ methods were retained — they were a Python 2 artifact. The __str__ contract (returning the text content) is preserved exactly.


Suggested Order of Review

  1. pyproject.toml — Start here to understand the new build configuration, dependencies, and tool settings. This is the foundation for all other changes.
  2. sumy/utils.py — Review the retained cached_property decorator and modernized utility functions. Understanding this is needed for the DOM model changes.
  3. sumy/models/dom/_sentence.py — Core data structure. See __slots__ retention, __str__ replacement for __unicode__, and cached_property usage.
  4. sumy/models/dom/_paragraph.py — Same pattern as _sentence.py.
  5. sumy/models/dom/_document.py: ObjectDocumentModel modernization.
  6. sumy/models/tf.py: collections.abc.Sequence fix and modernized class syntax.
  7. sumy/nlp/tokenizers.py: punkt_tab resource path change (the critical NLTK upgrade).
  8. sumy/nlp/stemmers/__init__.py and sumy/nlp/stemmers/czech.py — Stemmer modernization.
  9. sumy/parsers/plaintext.py — Bytes-decoding fix for binary file input.
  10. sumy/parsers/html.py and sumy/parsers/parser.py — Parser modernization (breadability retained).
  11. sumy/__main__.py — Main CLI rewrite (docopt to argparse). Core behavioral change.
  12. sumy/evaluation/__main__.py — Evaluation CLI rewrite (same pattern as main CLI).
  13. sumy/summarizers/kl.py — KL summarizer KeyError bugfix (normalization mismatch).
  14. sumy/summarizers/lsa.py and sumy/summarizers/lex_rank.py — numpy optional import pattern with mypy annotations.
  15. Remaining sumy/summarizers/ — Bulk modernization across all other summarizer modules.
  16. sumy/evaluation/rouge.py — Tab-to-space conversion and modernization of the largest evaluation module.
  17. tests/test_main.py — Rewritten CLI tests (argparse assertions replacing docopt assertions).
  18. tests/utils.py — Test utility modernization.
  19. Remaining test files — Bulk _compat removal and to_unicode replacement.
  20. Deleted files — Confirm removal of setup.py, setup.cfg, MANIFEST.in, sumy/_compat.py, tests/test_utils/test_compat.py, tests/test_utils/test_unicode_compatible_class.py.

mcode-bot@modelcode.ai added 11 commits June 11, 2026 13:59
- Create pyproject.toml with setuptools build backend, project metadata,
  dependencies (nltk, breadability; docopt removed for argparse migration),
  optional deps (LSA, LexRank, dev), console scripts, and tool configs
  (pytest, ruff, mypy)
- Update classifiers to Python 3.10-3.13 and Development Status 4 - Beta
- Remove version-suffixed entry points (sumy-X.Y, sumy_eval-X.Y)
- Delete setup.py, setup.cfg, and MANIFEST.in
- Update Makefile with modern pytest target, remove old publish/bump targets
- Update .gitignore with patterns for build artifacts, testing, linting,
  and type checking caches

- Delete sumy/_compat.py (Python 2/3 compatibility shim)
- Remove all '# -*- coding: utf8 -*-' lines from every source file
- Remove all 'from __future__ import' lines from every source file
- Replace _compat imports with Python 3 native equivalents:
  - to_unicode/to_string -> str() or remove where unnecessary
  - unicode -> str
  - string_types -> str
  - unicode_compatible decorator -> direct __str__ method
  - ffilter -> itertools.filterfalse
  - Counter -> collections.Counter
  - urllib -> urllib.request
  - PY3 branching -> direct Python 3 code
- Modernize class syntax: class Foo(object) -> class Foo
- Modernize super calls: super(Class, self) -> super()
- Convert % string formatting to f-strings where appropriate
- Update collections.Sequence -> collections.abc.Sequence in models/tf.py
- Keep custom cached_property (required for __slots__ classes) with
  explanatory comment

Replace deprecated tokenizers/punkt/{lang}.pickle path with the new
tokenizers/punkt_tab/{lang}.pickle path required by NLTK 3.9+.
The _params.abbrev_types API remains compatible.

Replace docopt with argparse in sumy/__main__.py and
sumy/evaluation/__main__.py. Both CLIs now use a single positional
argument with choices for summarizer selection, preserving the
existing `sumy <method> [options]` interface.

Key changes:
- Remove docopt imports and docstring-as-usage-spec blocks
- Add argparse.ArgumentParser with explicit argument definitions
- Access parsed args as namespace attributes instead of dict keys
- Look up summarizer via AVAILABLE_METHODS[args.method] instead of
  iterating and checking boolean flags
- Evaluation CLI includes 'random' in method choices and accepts a
  positional reference_summary argument

When files are opened in binary mode ("rb"), input_stream.read() returns
bytes. The old to_unicode() from _compat handled this, but the modernized
str(text) produces "b'...'" instead of decoding. Add explicit bytes
decoding in PlaintextParser.__init__.

- Delete test_compat.py and test_unicode_compatible_class.py (tested deleted _compat module)
- Rewrite test_main.py to test argparse-based CLI instead of docopt
- Update tests/utils.py: remove _compat imports, use plain str and bytes decode
- Remove # -*- coding: utf8 -*- and from __future__ imports from all test files
- Replace all to_unicode() calls with str() across 9 test files
- Remove all sumy._compat imports from test files

Missed during the bulk modernization pass. Converts the warning
message in _create_matrix from % formatting to an f-string.

The KL summarizer's _compute_ratings method used raw (unnormalized)
words from the summary when computing joint frequencies, but compared
them against the normalized word frequency dictionary. This caused a
KeyError when a capitalized word appeared in the summary but only its
lowercase form existed in the term frequency map. Fixed by using
_get_all_content_words_in_doc (which normalizes and filters stop words)
instead of _get_all_words_in_doc for the summary word list.

Updated test_tokenize_sentence expected tuple to include 'but' — the
NLTK punkt_tab tokenizer correctly extracts 'but' from 'but..' as a
word token, whereas the legacy punkt pickle tokenizer did not.

- Fix F401 re-exports in __init__.py files using explicit `as symbol` syntax
  (evaluation, models, models/dom, parsers, summarizers)
- Fix W191/E101 tab indentation in evaluation/rouge.py (converted to spaces)
- Fix E731 lambda assignment in summarizers/_summarizer.py (use def instead)
- Fix E701 multiple statements on one line in summarizers/kl.py
- Fix F841 unused variables in tests (test_evaluation.py, test_kl.py)
- Fix I001 unsorted imports via ruff --fix
- Apply ruff format across all source and test files

- Add ignore_missing_imports=true to [tool.mypy] config for untyped
  third-party libraries (nltk, breadability)
- Add frozenset[str] type annotations to _stop_words class attributes
  in text_rank.py, luhn.py, lsa.py, and lex_rank.py
- Add frozenset[str] type annotation to _EMPTY_SET in edmundson.py
- Fix numpy/svd optional import typing in lsa.py and lex_rank.py by
  declaring module-level `Any` annotations before try/except blocks

Move the relative import of AbstractSummarizer above the numpy
try/except blocks so all standard imports precede the optional
import guards. This resolves ruff E402 (module-level import not
at top of file) introduced by the mypy type annotation fix.