
Releases: kgruiz/PyTokenCounter

v1.9.0

01 Oct 17:13


Changes

  • New pytokencount CLI entry point
  • Deprecated the legacy tokencount and tc entry points
  • Docs updated for new command
  • Version bumped to 1.9.0

Notes

  • Unit tests currently fail on model metadata; fix planned in next patch

v1.8.2

22 Aug 03:19


  • Fix: Prevent crash when updating progress during nested directory counting. Progress updates now no-op if the task was cleared.
  • Stability: Defensive handling in progress to avoid intermittent ValueErrors.

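The defensive pattern can be sketched as follows; ProgressTracker, Clear, and Update are hypothetical names for illustration, not PyTokenCounter's actual internals:

```python
from typing import Optional


class ProgressTracker:
    """Minimal sketch of a progress wrapper whose updates become
    no-ops once its task has been cleared (hypothetical names)."""

    def __init__(self) -> None:
        self.taskId: Optional[int] = 0
        self.completed = 0

    def Clear(self) -> None:
        # Called when a nested directory finishes and its task is removed.
        self.taskId = None

    def Update(self, advance: int = 1) -> None:
        # Defensive guard: silently ignore updates for a cleared task
        # instead of raising a ValueError.
        if self.taskId is None:
            return
        self.completed += advance
```

With this guard, a stale callback firing after the task is cleared is simply ignored rather than crashing the nested directory count.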
v1.8.1

22 Aug 03:06


  • Fixed CLI import error: Resolved ImportError: cannot import name 'GetModelForEncoding' when invoking tc/tokencount.

What's Changed

  • core:
    • Added GetModelForEncoding (alias of GetModelForEncodingName).
    • Implemented GetModelMappings() to return an OrderedDict of model→encoding.
    • Exposed GetValidModels() and GetValidEncodings() for consumers.
  • version:
    • Bumped to 1.8.1.

Fixes

  • Ensures PyTokenCounter.__init__ exports align with core.py, restoring compatibility with the CLI and importers.
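A minimal sketch of how these helpers could fit together, assuming the mapping is the single source of truth (the model entries shown are a small illustrative subset, not the library's full table):

```python
from collections import OrderedDict

# Illustrative model→encoding table; the real mapping in
# PyTokenCounter.core covers many more models.
MODEL_MAPPINGS = OrderedDict(
    [
        ("gpt-4", "cl100k_base"),
        ("gpt-3.5-turbo", "cl100k_base"),
    ]
)


def GetModelMappings() -> OrderedDict:
    # Return the full model→encoding mapping.
    return MODEL_MAPPINGS


def GetValidModels() -> list:
    # Model names are the mapping's keys.
    return list(MODEL_MAPPINGS.keys())


def GetValidEncodings() -> list:
    # Deduplicate encodings while preserving insertion order.
    return list(OrderedDict.fromkeys(MODEL_MAPPINGS.values()))


def GetModelForEncodingName(encodingName: str) -> list:
    # All models that use the given encoding (return type assumed).
    return [m for m, e in MODEL_MAPPINGS.items() if e == encodingName]


# Backwards-compatible alias restored in this release.
GetModelForEncoding = GetModelForEncodingName
```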

v1.8.0

22 Aug 03:05


Summary
Adds stronger encoding handling, hidden/binary file options, a full pytest suite, and CI with linting.

Changes

  • Encoding: ReadTextFile uses fallbacks (windows-1252, utf-8, latin-1); clearer UnsupportedEncodingError with clickable file paths.
  • File tokenization: hidden/binary file filters; error handling improved with encoding context.
  • CI: new GitHub Actions workflow runs flake8 + pytest on code changes.
  • Config: bump to v1.8.0, add pytest + flake8, add pytest.ini.
  • Tests: new tests/ suite plus Windows-1252 input sample; expanded coverage.
  • Docs: README explains running tests.

Release notes

  • New hidden file toggle
  • New explicit unsupported encoding error
  • Added CI + linting + tests
  • Improved skip messages for binary files
  • Version 1.8.0

v1.7.0

11 Mar 19:27


This update improves file handling in the ReadTextFile function by adding a check for empty or blank files. Previously, chardet would raise an error when attempting to detect the encoding of an empty file, since there was no content to analyze. This fix improves robustness and prevents unnecessary failures in workflows that process many files.

Key Updates & Enhancements

  • Improved File Handling
    • Added a check in ReadTextFile to handle empty or blank files gracefully.
    • Prevents chardet from raising an encoding detection error on empty files.
    • Ensures smoother processing when handling diverse file inputs.

Bug Fixes

  • Fixed an issue where empty files caused encoding detection failures.

v1.6.4

03 Feb 23:05


This update improves token count aggregation in directory and file processing, ensuring accurate and structured JSON output.

Key Updates & Enhancements

Improved Token Count Aggregation

  • GetNumTokenDir now includes a top-level "numTokens" key that correctly sums all files and subdirectories while respecting the recursive parameter.
  • GetNumTokenFiles now properly propagates mapTokens when processing directories, ensuring accurate aggregation of "numTokens" for nested structures, likewise respecting the recursive parameter.
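The aggregation rule can be sketched as a recursive sum over the nested structure; SumNumTokens and the sample tree below are illustrative, not the library's internals:

```python
from collections import OrderedDict


def SumNumTokens(entry) -> int:
    # A file entry carries its own "numTokens"; a directory entry
    # (a mapping of child names) sums its children recursively.
    if "numTokens" in entry:
        return entry["numTokens"]
    return sum(SumNumTokens(child) for child in entry.values())


# Hypothetical per-file results before the top-level key is added:
dirTokens = OrderedDict(
    [
        ("a.txt", OrderedDict(numTokens=3)),
        ("sub", OrderedDict([("b.txt", OrderedDict(numTokens=5))])),
    ]
)

# Directory result with the aggregated top-level "numTokens" key:
dirResult = OrderedDict(
    [("numTokens", SumNumTokens(dirTokens)), ("tokens", dirTokens)]
)
```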

v1.6.3

03 Feb 22:36


This update improves JSON output handling for CLI functions, ensuring consistent and structured formatting across results.

Key Updates & Enhancements

Improved JSON Formatting

  • Functions returning OrderedDict or list are now properly formatted using json.dumps(..., indent=4).
  • Standardized JSON output across CLI functions for improved readability.
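The formatting rule can be sketched as follows; FormatResult is a hypothetical helper name:

```python
import json
from collections import OrderedDict


def FormatResult(result):
    # Structured results are pretty-printed as JSON; other return
    # values pass through without JSON indentation.
    if isinstance(result, (OrderedDict, list)):
        return json.dumps(result, indent=4)
    return str(result)
```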

Bug Fixes

  • Fixed inconsistencies in output formatting.
  • Prevented unnecessary JSON indentation for non-JSON return values.

v1.6.2

03 Feb 22:13


This release reintroduces the list-based exclusion mechanism that was removed in v1.6.1. Files with unsupported extensions are now bypassed for enhanced performance, while decoding error handling remains in place to manage unreadable files.

Key Changes

  • List-Based Exclusion: Quickly skips files with known unsupported extensions.
  • Decoding Error Handling: Continues to safely bypass files that trigger errors.
  • Optimized Performance: Balances efficiency and robustness for improved token counting.

Summary

Enhances processing efficiency while maintaining high reliability.

v1.6.1

03 Feb 21:47


This update improves handling of binary/unreadable files when excludeBinary=True. Instead of relying on a predefined list of extensions, the implementation now catches decoding errors and skips the file if it cannot be read as text.

Key Updates & Enhancements

Improved Binary Handling

  • When excludeBinary=True, files that cannot be decoded are skipped instead of relying on an extension-based exclusion list.
  • Prevents incorrect exclusions and ensures that unreadable binary files are properly skipped.

Bug Fixes

  • Fixed an issue where some files were incorrectly allowed despite being unreadable, causing errors.
  • Resolved cases where certain valid text files were mistakenly excluded.
  • Improved error handling and logging to indicate when a file is skipped due to decoding failure.

v1.6.0

03 Feb 21:14


This release introduces significant output mapping enhancements that provide a more structured, manageable data format for tokenization and token counting, making results easier to integrate into automated workflows.


Key Updates & Enhancements

1. New mapTokens Parameter for Enhanced Output Structure

  • Core Function Integration:

    The boolean parameter mapTokens is now available across the following core functions:

    • Tokenization Functions:

      • TokenizeStr
      • TokenizeFile
      • TokenizeDir
      • TokenizeFiles
    • Counting Functions:

      • GetNumTokenFile
      • GetNumTokenDir
      • GetNumTokenFiles
  • Structured OrderedDict Output:

    When mapTokens is set to True, the functions return a structured OrderedDict that provides detailed information:

    • For Tokenization Functions:
      Returns nested OrderedDicts mapping decoded strings to tokens, or mapping filenames/directory structures to comprehensive token data in a hierarchical format.

    • For Counting Functions:
      Generates nested OrderedDicts mirroring the token structure, with token counts replacing raw token lists. This makes data processing simpler and more intuitive.

  • Directory Output Structure Enhancements:

    For operations involving directories, the output now includes:

    • "tokens" Key:
      Contains a nested OrderedDict with detailed tokenization or counting results for files and subdirectories, organized according to the directory structure.

    • "numTokens" Key:
      Provides the aggregated total token count for the directory (including files and subdirectories when recursive processing is enabled).

  • Updated TokenizeFile Output Format:

    The TokenizeFile function now outputs an OrderedDict with:

    • Top-Level Key:
      The filename of the processed file.

    • Nested Value:
      An OrderedDict containing:

      • "tokens": The complete list of token IDs.
      • "numTokens": The total token count for the file.
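For illustration, the described shape can be constructed directly (the filename and token IDs below are made up):

```python
from collections import OrderedDict

# Hypothetical token IDs for a file named "example.txt"; the shape
# (not the values) matches the TokenizeFile output described above.
tokenIds = [9906, 11, 1917]

result = OrderedDict(
    [
        (
            "example.txt",
            OrderedDict([("tokens", tokenIds), ("numTokens", len(tokenIds))]),
        )
    ]
)
```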

CLI Updates for Operational Efficiency

  • New --mapTokens / -M Command-Line Option:

    The CLI subcommands now include the --mapTokens (or -M) option:

    • Affected subcommands:

      • tokenize-str
      • tokenize-file
      • tokenize-files
      • tokenize-dir
      • count-str
      • count-file
      • count-files
      • count-dir
    • Mapped Output via CLI:

      When used, the --mapTokens option produces the structured OrderedDict output format, offering detailed and organized results directly in the command-line interface.

    • Documentation Enhancements:

      Help messages for all CLI subcommands have been updated to document the --mapTokens option and describe the new output structure, including the "tokens" and "numTokens" keys.


Enhanced Data Management & Workflow Integration

  • Structured Data with Nested OrderedDict:

    The new nested OrderedDict format enables easier parsing, navigation, and use in scripts and programmatic workflows.

  • Increased Detail and Clarity:

    Users now have access to both aggregate token counts and file- or subdirectory-specific token data, resulting in a more precise and clear output format.


PyTokenCounter v1.6.0 improves output capabilities significantly, providing enhanced data management and workflow integration for tokenization and counting tasks. Enjoy the improved structure and functionality in your projects!