
Releases: kgruiz/PyTokenCounter

v1.9.0

01 Oct 17:13


Changes

  • New pytokencount CLI entry point
  • Deprecated the legacy tokencount and tc entry points
  • Docs updated for new command
  • Version bumped to 1.9.0

Notes

  • Unit tests currently fail on model metadata; fix planned in next patch

v1.8.2

22 Aug 03:19


  • Fix: Prevent crash when updating progress during nested directory counting. Progress updates now no-op if the task was cleared.
  • Stability: Defensive handling in progress to avoid intermittent ValueErrors.

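The defensive pattern can be sketched as follows; ProgressTracker, Clear, and Update are hypothetical names for illustration, not PyTokenCounter's actual internals:

```python
from typing import Optional


class ProgressTracker:
    """Minimal sketch of a progress wrapper whose updates become
    no-ops once its task has been cleared (hypothetical names)."""

    def __init__(self) -> None:
        self.taskId: Optional[int] = 0
        self.completed = 0

    def Clear(self) -> None:
        # Called when a nested directory finishes and its task is removed.
        self.taskId = None

    def Update(self, advance: int = 1) -> None:
        # Defensive guard: silently ignore updates for a cleared task
        # instead of raising a ValueError.
        if self.taskId is None:
            return
        self.completed += advance
```

With this guard, a stale callback firing after the task is cleared is simply ignored rather than crashing the nested directory count.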
v1.8.1

22 Aug 03:06


  • Fixed CLI import error: Resolved ImportError: cannot import name 'GetModelForEncoding' when invoking tc/tokencount.

What's Changed

  • core:
    • Added GetModelForEncoding (alias of GetModelForEncodingName).
    • Implemented GetModelMappings() to return an OrderedDict of model→encoding.
    • Exposed GetValidModels() and GetValidEncodings() for consumers.
  • version:
    • Bumped to 1.8.1.

Fixes

  • Ensures PyTokenCounter.__init__ exports align with core.py, restoring compatibility with the CLI and importers.
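A minimal sketch of how these helpers could fit together, assuming the mapping is the single source of truth (the model entries shown are a small illustrative subset, not the library's full table):

```python
from collections import OrderedDict

# Illustrative model→encoding table; the real mapping in
# PyTokenCounter.core covers many more models.
MODEL_MAPPINGS = OrderedDict(
    [
        ("gpt-4", "cl100k_base"),
        ("gpt-3.5-turbo", "cl100k_base"),
    ]
)


def GetModelMappings() -> OrderedDict:
    # Return the full model→encoding mapping.
    return MODEL_MAPPINGS


def GetValidModels() -> list:
    # Model names are the mapping's keys.
    return list(MODEL_MAPPINGS.keys())


def GetValidEncodings() -> list:
    # Deduplicate encodings while preserving insertion order.
    return list(OrderedDict.fromkeys(MODEL_MAPPINGS.values()))


def GetModelForEncodingName(encodingName: str) -> list:
    # All models that use the given encoding (return type assumed).
    return [m for m, e in MODEL_MAPPINGS.items() if e == encodingName]


# Backwards-compatible alias restored in this release.
GetModelForEncoding = GetModelForEncodingName
```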

v1.8.0

22 Aug 03:05


Summary
Adds stronger encoding handling, hidden/binary file options, a full pytest suite, and CI with linting.

Changes

  • Encoding: ReadTextFile uses fallbacks (windows-1252, utf-8, latin-1); clearer UnsupportedEncodingError with clickable file paths.
  • File tokenization: hidden/binary file filters; error handling improved with encoding context.
  • CI: new GitHub Actions workflow runs flake8 + pytest on code changes.
  • Config: bump to v1.8.0, add pytest + flake8, add pytest.ini.
  • Tests: new tests/ suite plus Windows-1252 input sample; expanded coverage.
  • Docs: README explains running tests.

Release notes

  • New hidden file toggle
  • New explicit unsupported encoding error
  • Added CI + linting + tests
  • Improved skip messages for binary files
  • Version 1.8.0

v1.7.0

11 Mar 19:27


This update improves file handling in the ReadTextFile function by adding a check for empty or blank files. Previously, chardet would raise an error when attempting to detect the encoding of an empty file, since there was no content to analyze. This fix improves robustness and prevents unnecessary failures in workflows that process many files.

Key Updates & Enhancements

  • Improved File Handling
    • Added a check in ReadTextFile to handle empty or blank files gracefully.
    • Prevents chardet from raising an encoding detection error on empty files.
    • Ensures smoother processing when handling diverse file inputs.

Bug Fixes

  • Fixed an issue where empty files caused encoding detection failures.

v1.6.4

03 Feb 23:05


This update improves token count aggregation in directory and file processing, ensuring accurate and structured JSON output.

Key Updates & Enhancements

Improved Token Count Aggregation

  • GetNumTokenDir now includes a top-level "numTokens" key that correctly sums all files and subdirectories while respecting the recursive parameter.
  • GetNumTokenFiles now properly propagates mapTokens when processing directories, ensuring accurate aggregation of "numTokens" for nested structures, likewise respecting the recursive parameter.
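The aggregation rule can be sketched as a recursive sum over the nested structure; SumNumTokens and the sample tree below are illustrative, not the library's internals:

```python
from collections import OrderedDict


def SumNumTokens(entry) -> int:
    # A file entry carries its own "numTokens"; a directory entry
    # (a mapping of child names) sums its children recursively.
    if "numTokens" in entry:
        return entry["numTokens"]
    return sum(SumNumTokens(child) for child in entry.values())


# Hypothetical per-file results before the top-level key is added:
dirTokens = OrderedDict(
    [
        ("a.txt", OrderedDict(numTokens=3)),
        ("sub", OrderedDict([("b.txt", OrderedDict(numTokens=5))])),
    ]
)

# Directory result with the aggregated top-level "numTokens" key:
dirResult = OrderedDict(
    [("numTokens", SumNumTokens(dirTokens)), ("tokens", dirTokens)]
)
```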

v1.6.3

03 Feb 22:36


This update improves JSON output handling for CLI functions, ensuring consistent and structured formatting across results.

Key Updates & Enhancements

Improved JSON Formatting

  • Functions returning OrderedDict or list are now properly formatted using json.dumps(..., indent=4).
  • Standardized JSON output across CLI functions for improved readability.
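The formatting rule can be sketched as follows; FormatResult is a hypothetical helper name:

```python
import json
from collections import OrderedDict


def FormatResult(result):
    # Structured results are pretty-printed as JSON; other return
    # values pass through without JSON indentation.
    if isinstance(result, (OrderedDict, list)):
        return json.dumps(result, indent=4)
    return str(result)
```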

Bug Fixes

  • Fixed inconsistencies in output formatting.
  • Prevented unnecessary JSON indentation for non-JSON return values.

v1.6.2

03 Feb 22:13


This release reintroduces the list-based exclusion mechanism that was removed in v1.6.1. Files with unsupported extensions are now bypassed for enhanced performance, while decoding error handling remains in place to manage unreadable files.

Key Changes

  • List-Based Exclusion: Quickly skips files with known unsupported extensions.
  • Decoding Error Handling: Continues to safely bypass files that trigger errors.
  • Optimized Performance: Balances efficiency and robustness for improved token counting.

Summary

Enhances processing efficiency while maintaining high reliability.

v1.6.1

03 Feb 21:47


This update improves handling of binary/unreadable files when excludeBinary=True. Instead of relying on a predefined list of extensions, the implementation now catches decoding errors and skips the file if it cannot be read as text.

Key Updates & Enhancements

Improved Binary Handling

  • When excludeBinary=True, files that cannot be decoded are skipped instead of relying on an extension-based exclusion list.
  • Prevents incorrect exclusions and ensures that unreadable binary files are properly skipped.

Bug Fixes

  • Fixed an issue where some files were incorrectly allowed despite being unreadable, causing errors.
  • Resolved cases where certain valid text files were mistakenly excluded.
  • Improved error handling and logging to indicate when a file is skipped due to decoding failure.

v1.6.0

03 Feb 21:14


This release introduces significant output mapping enhancements that provide a more structured, manageable data format for tokenization and token counting, making results easier to integrate into automated workflows.


Key Updates & Enhancements

1. New mapTokens Parameter for Enhanced Output Structure

  • Core Function Integration:

    The boolean parameter mapTokens is now available across the following core functions:

    • Tokenization Functions:

      • TokenizeStr
      • TokenizeFile
      • TokenizeDir
      • TokenizeFiles
    • Counting Functions:

      • GetNumTokenFile
      • GetNumTokenDir
      • GetNumTokenFiles
  • Structured OrderedDict Output:

    When mapTokens is set to True, the functions return a structured OrderedDict that provides detailed information:

    • For Tokenization Functions:
      Returns nested OrderedDicts mapping decoded strings to tokens, or mapping filenames/directory structures to comprehensive token data in a hierarchical format.

    • For Counting Functions:
      Generates nested OrderedDicts mirroring the token structure, with token counts replacing raw token lists. This makes data processing simpler and more intuitive.

  • Directory Output Structure Enhancements:

    For operations involving directories, the output now includes:

    • "tokens" Key:
      Contains a nested OrderedDict with detailed tokenization or counting results for files and subdirectories, organized according to the directory structure.

    • "numTokens" Key:
      Provides the aggregated total token count for the directory (including files and subdirectories when recursive processing is enabled).

  • Updated TokenizeFile Output Format:

    The TokenizeFile function now outputs an OrderedDict with:

    • Top-Level Key:
      The filename of the processed file.

    • Nested Value:
      An OrderedDict containing:

      • "tokens": The complete list of token IDs.
      • "numTokens": The total token count for the file.
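For illustration, the described shape can be constructed directly (the filename and token IDs below are made up):

```python
from collections import OrderedDict

# Hypothetical token IDs for a file named "example.txt"; the shape
# (not the values) matches the TokenizeFile output described above.
tokenIds = [9906, 11, 1917]

result = OrderedDict(
    [
        (
            "example.txt",
            OrderedDict([("tokens", tokenIds), ("numTokens", len(tokenIds))]),
        )
    ]
)
```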

CLI Updates for Operational Efficiency

  • New --mapTokens / -M Command-Line Option:

    The CLI subcommands now include the --mapTokens (or -M) option:

    • Affected subcommands:

      • tokenize-str
      • tokenize-file
      • tokenize-files
      • tokenize-dir
      • count-str
      • count-file
      • count-files
      • count-dir
    • Mapped Output via CLI:

      When used, the --mapTokens option produces the structured OrderedDict output format, offering detailed and organized results directly in the command-line interface.

    • Documentation Enhancements:

      Help messages for all CLI subcommands have been updated to document the --mapTokens option and describe the new output structure, including the "tokens" and "numTokens" keys.


Enhanced Data Management & Workflow Integration

  • Structured Data with Nested OrderedDict:

    The new nested OrderedDict format enables easier parsing, navigation, and use in scripts and programmatic workflows.

  • Increased Detail and Clarity:

    Users now have access to both aggregate token counts and file- or subdirectory-specific token data, resulting in a more precise and clear output format.


PyTokenCounter v1.6.0 improves output capabilities significantly, providing enhanced data management and workflow integration for tokenization and counting tasks. Enjoy the improved structure and functionality in your projects!