Milestone 1: Build System, Python 3 Core Modernization, and CLI Migration#5
Open
morph-modelcode-ai[bot] wants to merge 11 commits into
added 11 commits
June 11, 2026 13:59
- Create pyproject.toml with setuptools build backend, project metadata, dependencies (nltk, breadability; docopt removed for argparse migration), optional deps (LSA, LexRank, dev), console scripts, and tool configs (pytest, ruff, mypy)
- Update classifiers to Python 3.10-3.13 and Development Status 4 - Beta
- Remove version-suffixed entry points (sumy-X.Y, sumy_eval-X.Y)
- Delete setup.py, setup.cfg, and MANIFEST.in
- Update Makefile with modern pytest target, remove old publish/bump targets
- Update .gitignore with patterns for build artifacts, testing, linting, and type checking caches
- Delete sumy/_compat.py (Python 2/3 compatibility shim)
- Remove all '# -*- coding: utf8 -*-' lines from every source file
- Remove all 'from __future__ import' lines from every source file
- Replace _compat imports with Python 3 native equivalents:
  - to_unicode/to_string -> str() or remove where unnecessary
  - unicode -> str
  - string_types -> str
  - unicode_compatible decorator -> direct __str__ method
  - ffilter -> itertools.filterfalse
  - Counter -> collections.Counter
  - urllib -> urllib.request
  - PY3 branching -> direct Python 3 code
- Modernize class syntax: class Foo(object) -> class Foo
- Modernize super calls: super(Class, self) -> super()
- Convert % string formatting to f-strings where appropriate
- Update collections.Sequence -> collections.abc.Sequence in models/tf.py
- Keep custom cached_property (required for __slots__ classes) with explanatory comment
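The last item above can be illustrated with a minimal sketch. The class names, the `_cached_property_` slot prefix, and the decorator body are assumptions made for illustration and are not taken from `sumy/utils.py`; the sketch only shows that `functools.cached_property` needs an instance `__dict__`, which a `__slots__` class does not provide, while a property that writes into a pre-declared slot works.

```python
import functools


def cached_property(getter):
    """Cache the getter's result in a pre-declared slot (hypothetical naming scheme)."""
    key = "_cached_property_" + getter.__name__

    @functools.wraps(getter)
    def wrapper(self):
        # An unset slot raises AttributeError, so hasattr() is False until first use.
        if not hasattr(self, key):
            setattr(self, key, getter(self))
        return getattr(self, key)

    return property(wrapper)


class Sentence:
    # The cache slot has to be declared explicitly because there is no __dict__.
    __slots__ = ("_text", "_cached_property_words")

    def __init__(self, text):
        self._text = text

    @cached_property
    def words(self):
        return tuple(self._text.split())


class BrokenSentence:
    __slots__ = ("_text",)

    def __init__(self, text):
        self._text = text

    @functools.cached_property
    def words(self):
        return tuple(self._text.split())


sentence = Sentence("two words")
print(sentence.words)

try:
    BrokenSentence("two words").words
except TypeError as error:
    # functools.cached_property has nowhere to store the value without __dict__.
    print(f"TypeError: {error}")
```

The trade-off of the slot-based approach is that every cached attribute needs a matching entry in `__slots__`.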
Replace deprecated tokenizers/punkt/{lang}.pickle path with the new
tokenizers/punkt_tab/{lang}.pickle path required by NLTK 3.9+.
The _params.abbrev_types API remains compatible.
Replace docopt with argparse in sumy/__main__.py and sumy/evaluation/__main__.py. Both CLIs now use a single positional argument with choices for summarizer selection, preserving the existing `sumy <method> [options]` interface.
Key changes:
- Remove docopt imports and docstring-as-usage-spec blocks
- Add argparse.ArgumentParser with explicit argument definitions
- Access parsed args as namespace attributes instead of dict keys
- Look up summarizer via AVAILABLE_METHODS[args.method] instead of iterating and checking boolean flags
- Evaluation CLI includes 'random' in method choices and accepts a positional reference_summary argument
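The pattern described above might look roughly like the following sketch. The stand-in summarizer classes, the two-entry `AVAILABLE_METHODS` dict, the `build_parser` helper, and the default values are assumptions for illustration; only the option names and the positional `method` with `choices` come from the commit message.

```python
import argparse


class LuhnSummarizer:
    """Stand-in for a real summarizer class (hypothetical)."""


class LsaSummarizer:
    """Stand-in for a real summarizer class (hypothetical)."""


# Maps the CLI method name directly to the class implementing it.
AVAILABLE_METHODS = {"luhn": LuhnSummarizer, "lsa": LsaSummarizer}


def build_parser():
    parser = argparse.ArgumentParser(prog="sumy")
    # A single positional argument restricted to the known method names.
    parser.add_argument("method", choices=sorted(AVAILABLE_METHODS))
    parser.add_argument("--length", default="20%")  # default is an assumption
    parser.add_argument("--language", default="english")  # default is an assumption
    parser.add_argument("--file")
    parser.add_argument("--url")
    return parser


args = build_parser().parse_args(
    ["luhn", "--file", "document.txt", "--language", "czech", "--length", "3"]
)
# Namespace attributes replace the dict keys that docopt returned.
summarizer_class = AVAILABLE_METHODS[args.method]
print(summarizer_class.__name__, args.language, args.length)
```

An unknown method name makes argparse print a usage error and exit with `SystemExit`, so no manual validation of the method is needed.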
When files are opened in binary mode ("rb"), input_stream.read() returns
bytes. The old to_unicode() from _compat handled this, but the modernized
str(text) produces "b'...'" instead of decoding. Add explicit bytes
decoding in PlaintextParser.__init__.
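A minimal sketch of the problem and the fix follows. The `read_text` helper is hypothetical and stands in for the relevant part of `PlaintextParser.__init__`; the UTF-8 encoding matches the one named in the PR description.

```python
import io


def read_text(input_stream):
    """Return the stream content as str, decoding bytes explicitly."""
    text = input_stream.read()
    if isinstance(text, bytes):
        # str(text) would yield the repr ("b'...'"), not the decoded content.
        text = text.decode("utf-8")
    return text


raw = "Příliš žluťoučký kůň".encode("utf-8")

print(str(b"abc"))  # prints b'abc', i.e. the repr rather than the text
print(read_text(io.BytesIO(raw)))
print(read_text(io.StringIO("already text")))
```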
- Delete test_compat.py and test_unicode_compatible_class.py (tested deleted _compat module)
- Rewrite test_main.py to test argparse-based CLI instead of docopt
- Update tests/utils.py: remove _compat imports, use plain str and bytes decode
- Remove # -*- coding: utf8 -*- and from __future__ imports from all test files
- Replace all to_unicode() calls with str() across 9 test files
- Remove all sumy._compat imports from test files
Missed during the bulk modernization pass. Converts the warning message in _create_matrix from % formatting to an f-string.
The KL summarizer's _compute_ratings method used raw (unnormalized) words from the summary when computing joint frequencies, but compared them against the normalized word frequency dictionary. This caused a KeyError when a capitalized word appeared in the summary but only its lowercase form existed in the term frequency map. Fixed by using _get_all_content_words_in_doc (which normalizes and filters stop words) instead of _get_all_words_in_doc for the summary word list.
Updated test_tokenize_sentence expected tuple to include 'but': the NLTK punkt_tab tokenizer correctly extracts 'but' from 'but..' as a word token, whereas the legacy punkt pickle tokenizer did not.
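The normalization mismatch can be reproduced in a few lines. This is a simplified sketch: the helper functions, the stop-word set, and the sample sentences are hypothetical and only mimic the roles of `_get_all_words_in_doc` and `_get_all_content_words_in_doc`.

```python
STOP_WORDS = frozenset({"the", "a"})


def all_words(sentences):
    """Raw tokens, with no normalization (role of _get_all_words_in_doc)."""
    return [word for sentence in sentences for word in sentence.split()]


def content_words(sentences):
    """Lowercased tokens without stop words (role of _get_all_content_words_in_doc)."""
    normalized = (word.lower() for word in all_words(sentences))
    return [word for word in normalized if word not in STOP_WORDS]


def term_frequency(words):
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return {word: count / len(words) for word, count in counts.items()}


document = ["The cat sat", "a cat ran"]
summary = ["Cat sat"]
doc_freq = term_frequency(content_words(document))  # keys are normalized

try:
    # Before the fix: raw summary words are looked up in the normalized map.
    [doc_freq[word] for word in all_words(summary)]
except KeyError as error:
    print("KeyError:", error)

# After the fix: both sides go through the same normalization.
print([doc_freq[word] for word in content_words(summary)])
```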
- Fix F401 re-exports in __init__.py files using explicit `as symbol` syntax (evaluation, models, models/dom, parsers, summarizers)
- Fix W191/E101 tab indentation in evaluation/rouge.py (converted to spaces)
- Fix E731 lambda assignment in summarizers/_summarizer.py (use def instead)
- Fix E701 multiple statements on one line in summarizers/kl.py
- Fix F841 unused variables in tests (test_evaluation.py, test_kl.py)
- Fix I001 unsorted imports via ruff --fix
- Apply ruff format across all source and test files
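For the E731 item, the change amounts to replacing a lambda bound to a name with a nested `def`. The function below is a hypothetical, simplified stand-in for `AbstractSummarizer._get_best_sentences`, written only to show the shape of that change; it is not the project's implementation.

```python
from operator import itemgetter


def get_best_sentences(sentences, count, rating):
    """Return the `count` best-rated sentences in their original order."""
    if isinstance(rating, dict):
        # Previously a form like `rate = lambda s: rating[s]`, which ruff flags as E731.
        def rate(sentence):
            return rating[sentence]
    else:
        rate = rating

    # A lambda passed as an argument is fine; E731 only concerns lambda assignment.
    ranked = sorted(enumerate(sentences), key=lambda pair: rate(pair[1]), reverse=True)
    best = sorted(ranked[:count], key=itemgetter(0))
    return tuple(sentence for _, sentence in best)


print(get_best_sentences(("a", "b", "c"), 2, {"a": 1, "b": 3, "c": 2}))
print(get_best_sentences(("aaa", "b", "cc"), 2, len))
```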
- Add ignore_missing_imports=true to [tool.mypy] config for untyped third-party libraries (nltk, breadability)
- Add frozenset[str] type annotations to _stop_words class attributes in text_rank.py, luhn.py, lsa.py, and lex_rank.py
- Add frozenset[str] type annotation to _EMPTY_SET in edmundson.py
- Fix numpy/svd optional import typing in lsa.py and lex_rank.py by declaring module-level `Any` annotations before try/except blocks
Move the relative import of AbstractSummarizer above the numpy try/except blocks so all standard imports precede the optional import guards. This resolves ruff E402 (module-level import not at top of file) introduced by the mypy type annotation fix.
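The resulting module layout might look like the sketch below: ordinary imports first, then the `Any` declaration, then the guarded optional import. The `require_numpy` helper and its error message are hypothetical, and the sketch demonstrates only the runtime behavior; it makes no claim about how mypy or ruff treat this exact layout.

```python
# Ordinary imports (standard library, relative imports) come first.
from typing import Any

# Declared as Any so that the name may hold either the module or None.
numpy: Any
try:
    import numpy
except ImportError:
    numpy = None


def require_numpy():
    """Return the numpy module or fail with a clear message (hypothetical helper)."""
    if numpy is None:
        raise ValueError("This summarizer requires NumPy, which is not installed.")
    return numpy


print("NumPy available:", numpy is not None)
```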
Status
Milestone successfully completed. All scope items from the Milestone Plan were implemented across 11 commits:
- `setup.py`/`setup.cfg`/`MANIFEST.in` replaced by `pyproject.toml`
- `_compat.py` deleted; all `__future__` imports and coding declarations removed from every source file
- Tokenizer resource path changed from `punkt` to `punkt_tab`
- `sumy/__main__.py` and `sumy/evaluation/__main__.py` migrated from `docopt` to `argparse`
- `ruff` linting and `mypy` type checking configured and passing
- `test_compat.py` and `test_unicode_compatible_class.py` deleted; CLI tests rewritten for argparse

Deviation from Milestone Plan: The `breadability` dependency is intentionally retained in this milestone, as noted in the Milestone Plan's risk section. The full HTML parser migration to `readability-lxml` is deferred to Milestone 2. The `html.py` source file was modernized (removed `_compat` imports, `__future__` imports, updated class syntax) but retains the `breadability.readable.Article`-based parsing logic.

Feature Overview
This milestone establishes the modernized foundation for the sumy project. After these changes:
- Build configuration lives in `pyproject.toml`.
- The CLI is invoked as `sumy <method> [options]` using argparse (stdlib), removing the dependency on the unmaintained `docopt` library. The CLI semantics are preserved: for example, `sumy luhn --file <path> --language czech --length 3` produces summarized output.
- The tokenizer loads the `punkt_tab` resource path, which is required by NLTK 3.9+ and resolves CVE-2024-39705.
- `ruff` (linting) and `mypy` (type checking) are configured in `pyproject.toml` and pass cleanly.

Testing
Automated Testing
The existing test suite (55+ tests) validates the modernized codebase:
- `tests/test_main.py`: Rewritten to test argparse-based parsing: valid method invocation, missing arguments (`SystemExit`), invalid methods, the `--version` flag, `handle_arguments` with file/stdin/URL inputs, and invalid format rejection.
- `tests/test_tokenizers.py`: Verify `punkt_tab` tokenizer output. One expected token tuple was updated (`'but'` is now correctly extracted from `'but..'` by the `punkt_tab` tokenizer).
- `tests/test_summarizers/`: All seven summarizer algorithms plus Random are tested against existing Czech and English test documents. The KL summarizer includes a fix for a `KeyError` on capitalized words (tested via existing document fixtures).
- `tests/test_parsers.py`: The PlaintextParser bytes-decoding fix is covered by existing parser tests.
- `tests/test_models/`: DOM and TF model tests pass with modernized `__str__` methods (replacing `__unicode__`).
- `tests/test_evaluation.py`: ROUGE, coselection, and content-based metric tests pass unchanged.
- `test_compat.py` (87 LOC) and `test_unicode_compatible_class.py` (55 LOC) were removed because they tested the deleted `_compat.py` module.

Run the test suite:
Manual Testing
Verify end-to-end CLI summarization:
    cd sumy
    sumy luhn --file tests/data/snippets/prevko.txt --language czech --length 3

This should output 3 Czech-language summary sentences extracted from the test document. Confirm the output contains coherent Czech text (not error messages or encoding artifacts).
Verify argparse help and version:
`--help` should display the argparse-generated usage with method choices (`luhn`, `edmundson`, `lsa`, `text-rank`, `lex-rank`, `sum-basic`, `kl`) and all options. `--version` should print the version string (`sumy 0.3.0`).

Architecture
Overview
    graph TD
        subgraph Legend
            L1[Modified]
            L2[Deleted]
        end
        BUILD["pyproject.toml\n(Build Config)"]
        CLI["CLI Entry Points\n(__main__.py, eval/__main__.py)"]
        NLP["NLP Pipeline\n(Tokenizer, Stemmers)"]
        PARSERS["Parsers\n(PlaintextParser, HtmlParser)"]
        MODELS["Models / DOM\n(Sentence, Paragraph, Document)"]
        TF["TF Model\n(TfDocumentModel)"]
        SUMMARIZERS["Summarizers\n(7 Algorithms + Random)"]
        EVAL["Evaluation\n(ROUGE, Coselection, Content)"]
        UTILS["Utilities\n(cached_property, stopwords)"]
        COMPAT["_compat.py\n(Python 2 Shim)"]
        CLI --> PARSERS
        CLI --> SUMMARIZERS
        PARSERS --> MODELS
        PARSERS --> NLP
        SUMMARIZERS --> MODELS
        SUMMARIZERS --> TF
        SUMMARIZERS --> NLP
        EVAL --> TF
        MODELS --> UTILS
        CLI --> UTILS
        style BUILD fill:#ffff99,stroke:#999
        style CLI fill:#ffff99,stroke:#999
        style NLP fill:#ffff99,stroke:#999
        style PARSERS fill:#ffff99,stroke:#999
        style MODELS fill:#ffff99,stroke:#999
        style TF fill:#ffff99,stroke:#999
        style SUMMARIZERS fill:#ffff99,stroke:#999
        style EVAL fill:#ffff99,stroke:#999
        style UTILS fill:#ffff99,stroke:#999
        style COMPAT fill:#ff9999,stroke:#999
        style L1 fill:#ffff99,stroke:#999
        style L2 fill:#ff9999,stroke:#999

Changes
Build Config (`pyproject.toml`)
Created `pyproject.toml` with setuptools build backend, consolidating all metadata from `setup.py`, tool configs from `setup.cfg`, and package data rules from `MANIFEST.in`. Key sections: `[project]` (name, version, Python 3.10+ classifiers, dependencies), `[project.optional-dependencies]` (LSA/LexRank numpy extras, dev tools), `[project.scripts]` (sumy, sumy_eval entry points), `[tool.pytest.ini_options]`, `[tool.ruff]`, `[tool.mypy]`. Deleted `setup.py` (105 LOC), `setup.cfg` (14 LOC), `MANIFEST.in` (4 LOC). Updated `Makefile` to remove legacy `py.test-2.6`/`py.test-3.2` targets and old publish/bump commands. Updated `.gitignore` for modern Python packaging artifacts.

CLI Entry Points (`sumy/__main__.py`, `sumy/evaluation/__main__.py`)
Replaced `docopt` with `argparse`. Both CLIs use a single positional `method` argument with `choices` to select the summarizer algorithm, preserving the existing `sumy <method> [options]` invocation syntax. Options (`--length`, `--language`, `--stopwords`, `--format`, `--url`, `--file`, `--version`) are defined as explicit `add_argument` calls. The `AVAILABLE_METHODS` dict maps method names to summarizer classes directly (replacing the old iterate-and-check-boolean-flag pattern). The evaluation CLI adds a `reference_summary` positional argument and includes `random` in method choices.

NLP Pipeline (`sumy/nlp/tokenizers.py`, `sumy/nlp/stemmers/`)
Updated the tokenizer resource path from `tokenizers/punkt/{lang}.pickle` to `tokenizers/punkt_tab/{lang}.pickle` for NLTK 3.9+ compatibility. The `_params.abbrev_types` API for extra abbreviations remains compatible. Stemmers module modernized (removed `_compat` imports, modern class syntax) but the NLTK Snowball API is unchanged. Czech stemmer (`czech.py`) modernized with f-strings and `class Foo:` syntax.

Parsers (`sumy/parsers/`)
- `PlaintextParser`: Fixed bytes-decoding. When files are opened in binary mode, `input_stream.read()` returns bytes. The old `to_unicode()` handled this transparently, but the replacement `str(text)` produces `"b'...'"` instead of decoding. Added explicit `isinstance(text, bytes)` check with `.decode("utf-8")`.
- `HtmlParser`: Modernized imports and class syntax; `breadability`-based parsing logic retained for Milestone 2 migration.
- Base `DocumentParser` class: removed `_compat` imports, modernized class syntax.

Models / DOM (`sumy/models/dom/`)
`Sentence`, `Paragraph`, `ObjectDocumentModel`: Replaced `@unicode_compatible` decorator with direct `__str__` methods (renamed from `__unicode__`). Updated `__repr__` to use f-strings. `__slots__` retained on `Sentence` and `Paragraph` for memory efficiency. `Sentence.__hash__` and `__eq__` contracts preserved. All `to_unicode()`/`to_string()` calls replaced with native `str()`.

TF Model (`sumy/models/tf.py`)
Fixed `from collections import Sequence` to `from collections.abc import Sequence` (the old import was removed in Python 3.10). Modernized class syntax and string formatting.

Summarizers (`sumy/summarizers/`)
All eight summarizer modules modernized: removed `_compat` imports, `__future__` imports, coding declarations. Applied `class Foo:` syntax, `super()` calls, f-strings.
- `AbstractSummarizer._get_best_sentences`: converted lambda to named function for ruff E731 compliance.
- `KLSummarizer._compute_ratings`: fixed a `KeyError` bug. Raw (unnormalized) summary words were compared against the normalized frequency dictionary; it now uses `_get_all_content_words_in_doc` for consistent normalization.
- `LsaSummarizer` and `LexRankSummarizer`: numpy optional imports restructured with `typing.Any` annotations for mypy compatibility; import ordering fixed for ruff E402.
- Class-level `_stop_words` attributes annotated as `frozenset[str]` across `TextRankSummarizer`, `LuhnSummarizer`, `LsaSummarizer`, and `LexRankSummarizer`.

Evaluation (`sumy/evaluation/`)
- `rouge.py`: Converted from tab indentation to spaces (fixing ruff W191/E101). Modernized all string formatting and class syntax.
- `content_based.py`, `coselection.py`: Removed `_compat` imports, applied f-strings.
- `__init__.py`: Fixed re-exports with explicit `as symbol` syntax for ruff F401.

Utilities (`sumy/utils.py`)
Retained the custom `cached_property` decorator (required for `__slots__`-based classes) with an explanatory comment noting why `functools.cached_property` cannot be used. Replaced `to_string()`/`to_unicode()` calls with native `str()` and `.decode("utf-8")`. `ItemsCount` and `read_stop_words` modernized with f-strings and direct `str`/`bytes` operations.

Deleted: `_compat.py` (Python 2 Shim)
Deleted `sumy/_compat.py` (109 LOC). This module provided Python 2/3 dual-compatibility: `to_unicode`, `to_string`, `to_bytes`, `unicode_compatible` decorator, `Counter` polyfill, `urllib` shim, `ffilter`, `string_types`, and `PY3` flag. All 17 source files that imported from `_compat` were updated to use native Python 3 equivalents.

Design Decisions
1. Slot-Compatible Caching Strategy
The custom `cached_property` decorator in `sumy/utils.py` was retained rather than replaced with `functools.cached_property`. This is because `Sentence` and `Paragraph` use `__slots__` for memory efficiency, and `functools.cached_property` requires `__dict__` (incompatible with `__slots__`). A single caching implementation is used across all classes for consistency, avoiding behavioral divergence between two different caching mechanisms.

2. argparse CLI Structure
A single positional argument with `choices` was chosen over subparsers or a `--method` flag. This preserves the existing `sumy luhn ...` invocation syntax (backward-compatible with docopt-era usage), keeps the CLI flat and simple, and avoids the boilerplate of per-algorithm subparsers. The `AVAILABLE_METHODS` dict provides a clean mapping from method name to summarizer class.

3. Build Configuration Consolidation
Version-suffixed entry points (`sumy-3.4`, `sumy_eval-3.4`) were dropped. These are a Python 2-era convention for side-by-side Python version installations and are unnecessary with Python 3.10+ as the sole target. `bumpversion` config was not migrated. Only `sumy` and `sumy_eval` are registered as console scripts.

4. `unicode_compatible` Decorator Removal
`__unicode__` methods on `Sentence`, `Paragraph`, and `ObjectDocumentModel` were renamed to `__str__`, and `__repr__` methods were updated to use f-strings. No `__bytes__` methods were retained, since they were a Python 2 artifact. The `__str__` contract (returning the text content) is preserved exactly.

Suggested Order of Review
1. `pyproject.toml`: Start here to understand the new build configuration, dependencies, and tool settings. This is the foundation for all other changes.
2. `sumy/utils.py`: Review the retained `cached_property` decorator and modernized utility functions. Understanding this is needed for the DOM model changes.
3. `sumy/models/dom/_sentence.py`: Core data structure. See `__slots__` retention, `__str__` replacement for `__unicode__`, and `cached_property` usage.
4. `sumy/models/dom/_paragraph.py`: Same pattern as `_sentence.py`.
5. `sumy/models/dom/_document.py`: `ObjectDocumentModel` modernization.
6. `sumy/models/tf.py`: `collections.abc.Sequence` fix and modernized class syntax.
7. `sumy/nlp/tokenizers.py`: `punkt_tab` resource path change (the critical NLTK upgrade).
8. `sumy/nlp/stemmers/__init__.py` and `sumy/nlp/stemmers/czech.py`: Stemmer modernization.
9. `sumy/parsers/plaintext.py`: Bytes-decoding fix for binary file input.
10. `sumy/parsers/html.py` and `sumy/parsers/parser.py`: Parser modernization (breadability retained).
11. `sumy/__main__.py`: Main CLI rewrite (docopt to argparse). Core behavioral change.
12. `sumy/evaluation/__main__.py`: Evaluation CLI rewrite (same pattern as main CLI).
13. `sumy/summarizers/kl.py`: KL summarizer `KeyError` bugfix (normalization mismatch).
14. `sumy/summarizers/lsa.py` and `sumy/summarizers/lex_rank.py`: numpy optional import pattern with mypy annotations.
15. `sumy/summarizers/`: Bulk modernization across all other summarizer modules.
16. `sumy/evaluation/rouge.py`: Tab-to-space conversion and modernization of the largest evaluation module.
17. `tests/test_main.py`: Rewritten CLI tests (argparse assertions replacing docopt assertions).
18. `tests/utils.py`: Test utility modernization, `_compat` removal, and `to_unicode` replacement.
19. Deleted files: `setup.py`, `setup.cfg`, `MANIFEST.in`, `sumy/_compat.py`, `tests/test_utils/test_compat.py`, `tests/test_utils/test_unicode_compatible_class.py`.