Releases: kgruiz/PyTokenCounter
v1.9.0
v1.8.2
v1.8.1
- Fixed CLI import error: Resolved `ImportError: cannot import name 'GetModelForEncoding'` when invoking `tc`/`tokencount`.
What's changed
- core:
  - Added `GetModelForEncoding` (alias of `GetModelForEncodingName`).
  - Implemented `GetModelMappings()` to return an `OrderedDict` of model→encoding.
  - Exposed `GetValidModels()` and `GetValidEncodings()` for consumers.
- version:
  - Bumped to 1.8.1.
Fixes
- Ensures `PyTokenCounter.__init__` exports align with `core.py`, restoring compatibility with the CLI and importers.
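The shape of the new helpers can be pictured with a minimal, self-contained sketch. The model→encoding pairs below are illustrative stand-ins for the real table in `core.py`, and the function bodies are assumptions, not the library's implementation:

```python
from collections import OrderedDict

# Illustrative model→encoding table; the real mapping lives in core.py.
_MODEL_MAPPINGS = OrderedDict([
    ("gpt-4o", "o200k_base"),
    ("gpt-4", "cl100k_base"),
])

def GetModelForEncodingName(encodingName):
    """Return the first model associated with the given encoding name."""
    for model, encoding in _MODEL_MAPPINGS.items():
        if encoding == encodingName:
            return model
    raise ValueError(f"Unrecognized encoding: {encodingName}")

# Alias added in v1.8.1 so the import path used by the CLI keeps working.
GetModelForEncoding = GetModelForEncodingName

def GetModelMappings():
    """Return an OrderedDict of model → encoding."""
    return OrderedDict(_MODEL_MAPPINGS)

def GetValidModels():
    """List the model names consumers may pass in."""
    return list(_MODEL_MAPPINGS.keys())

def GetValidEncodings():
    """List the encoding names consumers may pass in."""
    return list(_MODEL_MAPPINGS.values())
```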
v1.8.0
Summary
Adds stronger encoding handling, hidden/binary file options, full pytest suite, and CI with linting.
Changes
- Encoding: `ReadTextFile` uses fallbacks (`windows-1252`, `utf-8`, `latin-1`); clearer `UnsupportedEncodingError` with clickable file paths.
- File tokenization: hidden/binary file filters; error handling improved with encoding context.
- CI: new GitHub Actions workflow runs flake8 + pytest on code changes.
- Config: bump to v1.8.0, add pytest + flake8, add `pytest.ini`.
- Tests: new `tests/` suite plus Windows-1252 input sample; expanded coverage.
- Docs: README explains running tests.
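A minimal sketch of the fallback behavior, assuming a plain decode loop in place of the library's detection logic (the real `ReadTextFile` is more involved; the try order and error message here are illustrative). Note that `latin-1` accepts any byte sequence, so it acts as a last resort:

```python
from pathlib import Path

class UnsupportedEncodingError(Exception):
    """Raised when a file cannot be decoded with any candidate encoding."""

# Candidate encodings from the release notes, tried in turn.
_FALLBACK_ENCODINGS = ("utf-8", "windows-1252", "latin-1")

def read_text_file(path):
    raw = Path(path).read_bytes()
    for encoding in _FALLBACK_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Absolute path in the message so terminals render it as a clickable link.
    raise UnsupportedEncodingError(f"Cannot decode file: {Path(path).resolve()}")
```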
Release notes
- New hidden file toggle
- New explicit unsupported encoding error
- Added CI + linting + tests
- Improved skip messages for binary files
- Version 1.8.0
v1.7.0
This update improves file handling within the ReadTextFile function by adding a check for empty or blank files. Previously, if a file was empty, chardet would incorrectly raise an error when attempting to detect the encoding, as the content was None. This fix ensures better robustness and prevents unnecessary failures in workflows processing multiple files.
Key Updates & Enhancements
- Improved File Handling
  - Added a check in `ReadTextFile` to handle empty or blank files gracefully.
  - Prevents `chardet` from raising an encoding detection error on empty files.
  - Ensures smoother processing when handling diverse file inputs.
Bug Fixes
- Fixed an issue where empty files caused encoding detection failures.
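The guard can be sketched as follows; a plain `utf-8` decode stands in for the chardet-based detection the real `ReadTextFile` performs:

```python
from pathlib import Path

def read_text_file(path):
    """Sketch of the v1.7.0 fix: short-circuit on empty or blank files
    before any encoding detection runs."""
    raw = Path(path).read_bytes()
    if not raw.strip():
        # Empty or whitespace-only content: nothing for a detector to
        # inspect, so skip detection entirely instead of letting it fail.
        return raw.decode("utf-8")
    # Encoding detection would happen here; utf-8 decode is a stand-in.
    return raw.decode("utf-8")
```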
v1.6.4
This update improves token count aggregation in directory and file processing, ensuring accurate and structured JSON output.
Key Updates & Enhancements
Improved Token Count Aggregation
- `GetNumTokenDir` now includes a `"numTokens"` key at the top level, correctly summing all files and subdirectories while respecting the `recursive` parameter.
- `GetNumTokenFiles` properly propagates `mapTokens` when processing directories, ensuring accurate aggregation of `"numTokens"` for nested structures, also following the `recursive` parameter logic.
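The aggregation rule can be sketched with a toy tokenizer (one "token" per word; the real library counts tiktoken tokens). Each level of the result carries its own `"numTokens"` total:

```python
from collections import OrderedDict
from pathlib import Path

def count_tokens(text):
    # Stand-in tokenizer: one "token" per whitespace-separated word.
    return len(text.split())

def get_num_token_dir(dirPath, recursive=True):
    """Sketch of the v1.6.4 aggregation: a top-level "numTokens" sums
    every file and, when recursive=True, every subdirectory."""
    result = OrderedDict([("numTokens", 0), ("tokens", OrderedDict())])
    for entry in sorted(Path(dirPath).iterdir()):
        if entry.is_file():
            n = count_tokens(entry.read_text())
            result["tokens"][entry.name] = n
            result["numTokens"] += n
        elif entry.is_dir() and recursive:
            sub = get_num_token_dir(entry, recursive=True)
            result["tokens"][entry.name] = sub
            result["numTokens"] += sub["numTokens"]
    return result
```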
v1.6.3
This update improves JSON output handling for CLI functions, ensuring consistent and structured formatting across results.
Key Updates & Enhancements
Improved JSON Formatting
- Functions returning `OrderedDict` or `list` are now properly formatted using `json.dumps(..., indent=4)`.
- Standardized JSON output across CLI functions for improved readability.
Bug Fixes
- Fixed inconsistencies in output formatting.
- Prevented unnecessary JSON indentation for non-JSON return values.
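The formatting rule amounts to a type check before printing; a minimal sketch (the function name is illustrative, not the library's):

```python
import json
from collections import OrderedDict

def format_cli_output(result):
    """Sketch of the v1.6.3 rule: dict/list results are pretty-printed
    as JSON; anything else (e.g. a bare token count) prints verbatim."""
    if isinstance(result, (dict, list)):
        return json.dumps(result, indent=4)
    return str(result)
```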
v1.6.2
This release reintroduces the list-based exclusion mechanism that was removed in v1.6.1. Files with unsupported extensions are now bypassed for enhanced performance, while decoding error handling remains in place to manage unreadable files.
Key Changes
- List-Based Exclusion: Quickly skips files with known unsupported extensions.
- Decoding Error Handling: Continues to safely bypass files that trigger errors.
- Optimized Performance: Balances efficiency and robustness for improved token counting.
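The two-tier check can be sketched as below. The extension set is hypothetical (the real list is maintained by the library), and the decode attempt stands in for its fuller error handling:

```python
from pathlib import Path

# Hypothetical exclusion list; the real set of unsupported extensions
# is maintained by the library.
UNSUPPORTED_EXTENSIONS = {".png", ".jpg", ".zip", ".exe"}

def should_skip(path):
    """Sketch of the v1.6.2 two-tier check: a fast extension lookup
    first, then a decode attempt for anything not on the list."""
    p = Path(path)
    if p.suffix.lower() in UNSUPPORTED_EXTENSIONS:
        return True  # known-unsupported: skip without reading the file
    try:
        p.read_bytes().decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True  # unreadable as text: skip safely
```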
Summary
Enhances processing efficiency while maintaining high reliability.
v1.6.1
This update improves handling of binary/unreadable files when excludeBinary=True. Instead of relying on a predefined list of extensions, the implementation now catches decoding errors and skips the file if it cannot be read as text.
Key Updates & Enhancements
Improved Binary Handling
- When `excludeBinary=True`, files that cannot be decoded are skipped instead of relying on an extension-based exclusion list.
- Prevents incorrect exclusions and ensures that unreadable binary files are properly skipped.
Bug Fixes
- Fixed an issue where some files were incorrectly allowed despite being unreadable, causing errors.
- Resolved cases where certain valid text files were mistakenly excluded.
- Improved error handling and logging to indicate when a file is skipped due to decoding failure.
v1.6.0
This release introduces significant output mapping enhancements aimed at providing users with a more structured and manageable data format for tokenization and token counting. These improvements facilitate easier integration into automated workflows and better data handling.
Key Updates & Enhancements
1. New `mapTokens` Parameter for Enhanced Output Structure
- Core Function Integration:
  The boolean parameter `mapTokens` is now available across the following core functions:
  - Tokenization Functions: `TokenizeStr`, `TokenizeFile`, `TokenizeDir`, `TokenizeFiles`
  - Counting Functions: `GetNumTokenFile`, `GetNumTokenDir`, `GetNumTokenFiles`
- Structured `OrderedDict` Output:
  When `mapTokens` is set to `True`, the functions return a structured `OrderedDict` that provides detailed information:
  - For Tokenization Functions: Returns nested `OrderedDict`s mapping decoded strings to tokens, or mapping filenames/directory structures to comprehensive token data in a hierarchical format.
  - For Counting Functions: Generates nested `OrderedDict`s mirroring the token structure, with token counts replacing raw token lists. This makes data processing simpler and more intuitive.
- Directory Output Structure Enhancements:
  For operations involving directories, the output now includes:
  - `"tokens"` key: Contains a nested `OrderedDict` with detailed tokenization or counting results for files and subdirectories, organized according to the directory structure.
  - `"numTokens"` key: Provides the aggregated total token count for the directory (including files and subdirectories when recursive processing is enabled).
- Updated `TokenizeFile` Output Format:
  The `TokenizeFile` function now outputs an `OrderedDict` with:
  - Top-level key: The filename of the processed file.
  - Nested value: An `OrderedDict` containing:
    - `"tokens"`: The complete list of token IDs.
    - `"numTokens"`: The total token count for the file.
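The `TokenizeFile` output shape can be sketched with a toy tokenizer (word-index ids stand in for real tiktoken token IDs; `tokenize_file` here is an illustration, not the library's implementation):

```python
from collections import OrderedDict

def tokenize_str(text):
    # Stand-in tokenizer: one integer id per whitespace-separated word.
    vocab = {}
    return [vocab.setdefault(w, len(vocab)) for w in text.split()]

def tokenize_file(fileName, text, mapTokens=True):
    """Sketch of the v1.6.0 TokenizeFile shape when mapTokens=True:
    {filename: {"tokens": [...], "numTokens": N}}."""
    tokens = tokenize_str(text)
    if not mapTokens:
        return tokens  # legacy flat output: just the token list
    return OrderedDict([
        (fileName, OrderedDict([
            ("tokens", tokens),
            ("numTokens", len(tokens)),
        ]))
    ])
```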
2. CLI Updates for Operational Efficiency
- New `--mapTokens` / `-M` Command-Line Option:
  The CLI subcommands now include the `--mapTokens` (or `-M`) option:
  - Affected subcommands: `tokenize-str`, `tokenize-file`, `tokenize-files`, `tokenize-dir`, `count-str`, `count-file`, `count-files`, `count-dir`
- Mapped Output via CLI:
  When used, the `--mapTokens` option produces the structured `OrderedDict` output format, offering detailed and organized results directly in the command-line interface.
- Documentation Enhancements:
  Help messages for all CLI subcommands have been updated to document the `--mapTokens` option and describe the new output structure, including the `"tokens"` and `"numTokens"` keys.
3. Enhanced Data Management & Workflow Integration
- Structured Data with Nested `OrderedDict`:
  The new nested `OrderedDict` format enables easier parsing, navigation, and use in scripts and programmatic workflows.
- Increased Detail and Clarity:
  Users now have access to both aggregate token counts and file- or subdirectory-specific token data, resulting in a more precise and clear output format.
PyTokenCounter v1.6.0 improves output capabilities significantly, providing enhanced data management and workflow integration for tokenization and counting tasks. Enjoy the improved structure and functionality in your projects!