Releases: pipe1os/modelinfo-cli
v1.4.4
ModelInfo CLI (v1.4.4)
Added
- Added the `--batch-size` flag (defaulting to 1) for dynamic KV cache footprint calculations.
- Added the `--timeout` flag (defaulting to 10s) to configure network timeouts for remote Hugging Face fetches.
- Added support for custom Hugging Face endpoints via the `HF_ENDPOINT` environment variable.
- Added auto-discovery and memory capacity mapping for Intel GPUs using `xpu-smi`.
- Added suggestions for similar GPU names using `difflib.get_close_matches` when an unrecognized GPU target is provided.
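The GPU name suggestions can be illustrated with a minimal sketch built on the standard library's `difflib.get_close_matches`. The name list, the function name `suggest_gpu_names`, and the `cutoff` value are assumptions made for illustration, not the project's actual code.

```python
import difflib

# Hypothetical subset of known GPU names; the real table lives in the project.
KNOWN_GPU_NAMES = ["rtx4090", "rtx3090", "a100", "h100", "arc-a770"]


def suggest_gpu_names(target, known=KNOWN_GPU_NAMES, limit=3):
    """Return up to `limit` known names that closely resemble `target`."""
    # cutoff=0.6 is difflib's default similarity threshold (assumed here).
    return difflib.get_close_matches(target.lower(), known, n=limit, cutoff=0.6)


if __name__ == "__main__":
    print(suggest_gpu_names("rtx4009"))
```

Results are ordered from most to least similar, so the first entry is the natural "did you mean" candidate.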
Changed
- Reorganized local GPU discovery helpers in `hardware.py`.
- Cleaned up test parser module imports to resolve E402 warnings.
Fixed
- Propagated timeout values to all remote fetch requests in the Hugging Face parser.
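As a rough illustration of how a custom endpoint and a timeout might be threaded through a remote fetch, the sketch below reads `HF_ENDPOINT` from the environment and passes a timeout to `urllib.request.urlopen`. The helper names (`resolve_endpoint`, `build_url`, `fetch`) are hypothetical, and the `/resolve/<revision>/<filename>` URL layout is an assumption about the Hub's file URLs rather than something these notes specify.

```python
import os
import urllib.request

DEFAULT_ENDPOINT = "https://huggingface.co"


def resolve_endpoint(env=None):
    """Pick the Hub base URL, preferring HF_ENDPOINT when it is set."""
    env = os.environ if env is None else env
    return (env.get("HF_ENDPOINT") or DEFAULT_ENDPOINT).rstrip("/")


def build_url(repo_id, filename, revision="main", env=None):
    """Build a file URL under the resolved endpoint (assumed URL layout)."""
    return f"{resolve_endpoint(env)}/{repo_id}/resolve/{revision}/{filename}"


def fetch(url, timeout=10.0):
    """Fetch a URL, giving up after `timeout` seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

Passing the same `timeout` value to every call site is what keeps a single slow request from stalling the whole inspection.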
What's Changed
- Add CodeRabbit configuration by @pipe1os in #22
- feat(cli): add batch size option by @sarvesh1327 in #24
- test: add hardware module coverage by @Kkkakania in #25
- feat: support custom Hugging Face endpoints via HF_ENDPOINT env var by @molloyzak13 in #31
- feat: add Hugging Face fetch timeout flag by @rupayon123 in #33
- fix(tests): move mid-file import to top of test_parsers.py by @pipe1os in #34
- docs: add missing --timeout flag to command reference table by @pipe1os in #35
- feat: suggest close gpu matches when target name is unknown by @pipe1os in #36
- feat: add intel arc gpu auto-detection via xpu-smi by @pipe1os in #37
- chore: release v1.4.4 by @pipe1os in #38
New Contributors
- @sarvesh1327 made their first contribution in #24
- @Kkkakania made their first contribution in #25
- @molloyzak13 made their first contribution in #31
Full Changelog: v1.4.3...v1.4.4
v1.4.3
ModelInfo CLI (v1.4.3)
Added
- Added the `-v`/`--version` flag to quickly check the installed modelinfo version. The version lookup is lazily evaluated to guarantee sub-100ms CLI startup times.
- Added missing entry-level GPUs to the `KNOWN_GPUS` hardware discovery dictionary.
- Added repository documentation including `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md`, and GitHub issue/PR templates.
Fixed
- Fixed an Out-Of-Memory (OOM) vulnerability during remote inspection by capping the HTTP response read. This protects the CLI from upstream CDN proxies that ignore HTTP `Range` headers.
- Fixed confusing stack traces when a local directory is passed instead of a file by raising an explicit `IsADirectoryError`.
- Fixed the CLI to print user-friendly error messages when attempting to inspect gated or non-existent Hugging Face repositories (401 Unauthorized / 404 Not Found).
- Fixed an issue where the main entry point swallowed exceptions too broadly, obscuring critical stack traces during unexpected failures.
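The idea behind capping the response read can be sketched as follows: never read more than a fixed number of bytes, even if the server ignores the `Range` header and streams the whole file. The function name `read_capped`, the 500 KiB limit, and the chunk size are assumptions for illustration.

```python
import io

MAX_BYTES = 500 * 1024  # assumed cap, not necessarily the project's value


def read_capped(stream, limit=MAX_BYTES, chunk_size=64 * 1024):
    """Read at most `limit` bytes from a file-like object."""
    chunks = []
    remaining = limit
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:  # end of stream reached before the cap
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    oversized = io.BytesIO(b"x" * 1_000_000)
    print(len(read_capped(oversized)))
```

Because the loop stops at the cap regardless of what the server sends, memory use stays bounded by `limit`.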
What's Changed
- docs: add contributing and code of conduct guidelines by @pipe1os in #8
- chore: add issue and PR templates by @pipe1os in #10
- Fix/add entry level gpus by @JeanBiza in #11
- fix: friendlier error messages for missing/gated Hugging Face repos (401/404) by @laishettikarthik-tech in #16
- fix: add explicit IsADirectoryError for local directory paths by @pipe1os in #19
- fix: cap HTTP response read to prevent OOM on ignored Range headers by @pipe1os in #20
- fix: remove broad exception swallowing in main() by @pipe1os in #18
- Add CLI version flag by @rupayon123 in #17
New Contributors
- @JeanBiza made their first contribution in #11
- @laishettikarthik-tech made their first contribution in #16
- @rupayon123 made their first contribution in #17
Full Changelog: v1.4.2...v1.4.3
v1.4.2
ModelInfo CLI (v1.4.2)
Fixed
- Fixed unused imports (`os`, `json`) in architecture parsing logic.
- Fixed a bug where the `--max-vram` argument was ignored when evaluating single models without a target GPU.
- Fixed a bug where the target GPU's memory limit was ignored in favor of the default max VRAM when rendering the multi-model comparison table.
- Prevented a potential `ValueError` crash in the remote fetcher by enforcing a minimum of 1 worker for the `ThreadPoolExecutor`.
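`ThreadPoolExecutor` raises `ValueError` when `max_workers` is 0, which can happen if the worker count is derived from an empty task list. A minimal sketch of the clamp, with the hypothetical helpers `worker_count` and `fetch_all` and an assumed ceiling of 8 workers:

```python
from concurrent.futures import ThreadPoolExecutor


def worker_count(num_tasks, max_workers=8):
    """Clamp the pool size to the range [1, max_workers]."""
    return max(1, min(max_workers, num_tasks))


def fetch_all(tasks, fn):
    """Apply `fn` to every task concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=worker_count(len(tasks))) as pool:
        return list(pool.map(fn, tasks))
```

With the clamp in place, an empty task list yields an empty result instead of a crash.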
v1.4.1
v1.4.0
ModelInfo CLI (v1.4.0)
This release adds multi-GPU hardware topology modeling and a subtractive vLLM memory engine for inference planning. We also overhauled how remote Hub interactions work to speed up metadata fetching.
Added
- Added the `--vllm` flag to switch from additive VRAM checks to subtractive "Serving Capacity" estimates. It calculates PagedAttention block limits based on a configurable `--gpu-util` ratio.
- Topology-Aware Overhead Scaling: Added `--topology` (`nvlink`, `pcie4`, `pcie3`) and `--strategy` (`tp`, `pp`) flags. The calculator now applies NCCL communication penalties directly to weights and activations instead of using a generic fixed multiplier.
- Mapped explicit `ggml_type` enums (0-33) for GGUF files to fix VRAM under-reporting for specific quantization types.
- The CLI now does algorithmic estimation via `index.json` by default. If you need the exact size breakdown of every tensor, pass `--tensors` to force it to fetch all remote shards.
- Added comprehensive `pytest` test coverage for the new vLLM subtractive math engine, topology penalties, and explicit GGUF byte mappings.
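One way to picture the subtractive estimate: start from the VRAM budget allowed by the utilization ratio, subtract weights and overhead, and divide what is left into fixed-size KV cache blocks. The sketch below is an assumption about how such a calculation could look, not the project's formula; the function names, the 16-token block size, and the per-token KV size expression are all illustrative.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache for one token: keys and values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def serving_capacity(total_vram, gpu_util, weight_bytes, overhead_bytes,
                     bytes_per_token, block_size=16):
    """Return (num_blocks, max_tokens) that fit in the leftover budget."""
    budget = int(total_vram * gpu_util)            # memory the server may use
    kv_budget = budget - weight_bytes - overhead_bytes
    if kv_budget <= 0:
        return 0, 0                                # weights alone do not fit
    num_blocks = kv_budget // (block_size * bytes_per_token)
    return num_blocks, num_blocks * block_size
```

The additive check asks "does this context length fit?", while this form answers "how many tokens can be served at once?".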
Changed
- Removed KV cache from the distributed overhead multiplier because Tensor Parallelism partitions context blocks rather than duplicating them.
- Changed the network logic to infer metadata directly from `index.json` and `config.json`. It skips iterative chunk requests for sharded arrays unless `--tensors` is passed.
v1.3.0
ModelInfo CLI (v1.3.0)
This release adds comprehensive hardware fit diagnostics, dynamic GPU scaling, and side-by-side model comparison to instantly evaluate operational deployment trade-offs.
Added
- Hardware Discovery Engine: Added the `--gpu` flag with multi-vendor normalization (NVIDIA, AMD, Intel) to calculate if a model fits within specific hardware constraints. Supports named GPUs (`rtx4090`), explicit sizes (`24`), and native local hardware discovery (`auto`).
- Fragmentation Defense: Implemented a 3-tier UI heuristic (✓ Safe, ⚠ Warning, ✗ Fail) to defend against memory fragmentation and generation-time transient spikes.
- Side-by-Side Comparison: Passing multiple models via the CLI (`modelinfo modelA modelB`) now implicitly triggers a dedicated side-by-side comparison table, surfacing parameter counts, context lengths, and VRAM footprints to evaluate architectural trade-offs.
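A three-tier fit check of this kind could be sketched as below. The 90% threshold separating Safe from Warning is an assumption chosen for illustration, as is the function name `classify_fit`; the release notes do not state the actual cutoffs.

```python
SAFE_RATIO = 0.90  # assumed headroom threshold, not the project's value


def classify_fit(required_gb, available_gb, safe_ratio=SAFE_RATIO):
    """Map an estimated footprint onto Safe / Warning / Fail."""
    if available_gb <= 0 or required_gb > available_gb:
        return "✗ Fail"
    if required_gb / available_gb <= safe_ratio:
        return "✓ Safe"
    # Fits on paper, but leaves little room for fragmentation or spikes.
    return "⚠ Warning"
```

The middle tier exists because an estimate that only just fits can still fail at runtime once allocator fragmentation and transient buffers are counted.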
Changed
- Multi-GPU Overhead Scaling: The CUDA context initialization overhead (600 MB) now dynamically scales based on the detected `gpu_count` to prevent silent prefill OOMs on multi-GPU deployments.
- Mathematical Transparency: Added the `Dtype` column to the comparison UI to show exactly why quantization scales VRAM footprints downward.
v1.2.0
ModelInfo CLI (v1.2.0)
This release adds remote Hugging Face Hub inspection, dynamic VRAM overhead modeling, and sensible context defaults for operational inference planning.
Added
- Remote Hugging Face Hub Support: Inspect any public or gated model directly via its repo ID (e.g., `modelinfo meta-llama/Llama-2-7b-hf`) without downloading the full checkpoint. Uses concurrent `Range` requests (max 8 workers) to extract the first 500KB of safetensors shards to prevent synchronous I/O bottlenecks and bypass CDN rate-limits.
- Framework Overhead Modeling: VRAM estimates now include a static 600 MB CUDA context overhead alongside the model weights and KV cache for operational accuracy.
- Hierarchical VRAM UI: Redesigned the output terminal UI to group memory footprints into Weights, KV Cache, and Overhead.
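The three-part breakdown (Weights, KV Cache, Overhead) can be sketched with the commonly used KV cache size formula. The function names are hypothetical, the 600 MB overhead is taken as 600 MiB here by assumption, and the `batch_size` parameter mirrors the `--batch-size` flag added later in v1.4.4.

```python
CUDA_OVERHEAD_BYTES = 600 * 1024**2  # 600 MB, interpreted as MiB (assumption)


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   dtype_bytes=2, batch_size=1):
    """Keys and values for every layer, head, and token position."""
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * batch_size * dtype_bytes)


def total_vram_bytes(weight_bytes, kv_bytes, overhead_bytes=CUDA_OVERHEAD_BYTES):
    """Additive estimate: weights + KV cache + framework overhead."""
    return weight_bytes + kv_bytes + overhead_bytes
```

For a model with 32 layers, 32 KV heads, a head dimension of 128, and fp16 values, a 4096-token context works out to 2 GiB of KV cache under this formula.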
Changed
- Sane Context Defaults: Hard-capped the default `--context` value at 8192 tokens. Models with extreme architectural boundaries (e.g., 128k) will still read the native limit and print it in the UI, protecting users from unrealistic default memory calculations.
- Authentication Fallback: Remote HTTP fetcher now supports token extraction from the `HF_TOKEN` environment variable, `~/.cache/huggingface/token`, and the legacy `~/.huggingface/token` path.
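The authentication fallback can be read as an ordered lookup: environment variable first, then the two token files. The sketch below assumes that order matches the order listed above; `resolve_hf_token` and its parameters are hypothetical names.

```python
import os
from pathlib import Path


def resolve_hf_token(env=None, home=None):
    """Return the first token found, or None if no source provides one."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    token = env.get("HF_TOKEN", "").strip()
    if token:
        return token

    candidates = (
        home / ".cache" / "huggingface" / "token",  # current location
        home / ".huggingface" / "token",            # legacy location
    )
    for path in candidates:
        try:
            text = path.read_text(encoding="utf-8").strip()
        except OSError:  # missing or unreadable file: try the next one
            continue
        if text:
            return text
    return None
```

Accepting `env` and `home` as parameters keeps the lookup testable without touching the real environment.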
v1.1.0
ModelInfo CLI (v1.1.0)
This release shifts the tool from heuristic guesswork to deterministic binary metadata extraction, introducing native context limit bounds checking and pure, decoupled math engines.
Core Engine
- GGUF Metadata Extraction: The parser now extracts global key-value pairs (e.g., `general.architecture`, `attention.head_count_kv`) from the binary file before parsing the tensor list, guaranteeing accurate architecture mapping.
- Context Limit Warnings: Extracts `max_position_embeddings` (Hugging Face) and `{gen_arch}.context_length` (GGUF) to actively warn users if the requested `--context` exceeds the model's native boundary.
- SafeTensors Config Fallback: Seamlessly reads adjacent `config.json` files for robust fallback parsing of architectures lacking explicit tensor structures.
- Fused Tensor Estimation: Defused mathematical edge cases for GQA, ALiBi, and older MHA models when extracting KV cache dimensionality from fused `qkv_proj.weight` tensors.
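Metadata extraction from a GGUF file starts at the fixed-size header, which records how many key-value pairs and tensors follow. The sketch below parses only that header with `struct`, assuming the little-endian layout used by GGUF version 2 and later (magic, `uint32` version, `uint64` tensor count, `uint64` key-value count). The function name is hypothetical, and decoding the key-value pairs themselves is out of scope here.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 + two uint64 fields


def read_gguf_header(data):
    """Parse the fixed GGUF header from the first bytes of a file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("buffer too short for a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file (bad magic)")
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

The key-value section begins immediately after these 24 bytes, which is why it can be read before the tensor list.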
UI & Architecture
- Decoupled Math Engine: Moved all filesystem operations and config parsing into `cli.py` to maintain Separation of Concerns, keeping `calculator.py` and `architecture.py` as pure, testable math engines.
- Terminal Aesthetics: Solved terminal text-wrapping bugs by constraining `rich` table columns and removed non-standard emojis to enforce a clean, professional Unix aesthetic.
Full Changelog: v1.0.0...v1.1.0
v1.0.0
ModelInfo CLI (v1.0.0)
Initial public release of the modelinfo-cli package.
Core Engine
- Zero-Dependency Parsers: Native standard-library (`os`, `struct`, `json`) binary deserialization for `.safetensors`, `.gguf`, and legacy `.pt` files.
- Sharded Architecture Support: Auto-detects and gracefully parses `model.safetensors.index.json` manifests without crashing on partial downloads.
- Hardware Calculator: Calculates exact VRAM footprints (including dynamic KV cache overhead) using precise GGUF block quantization ratios (Q8 through Q2).
- Restricted Unpickler: Defangs arbitrary code execution when inspecting legacy PyTorch checkpoints.
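To show what standard-library parsing of a `.safetensors` file involves, here is a minimal sketch. It relies on the format's layout of an 8-byte little-endian header length followed by a JSON header describing each tensor. The function names are hypothetical and this is not the project's parser.

```python
import json
import struct


def parse_safetensors_header(data):
    """Decode the JSON header from the leading bytes of a safetensors file."""
    if len(data) < 8:
        raise ValueError("buffer too short for the length prefix")
    (header_len,) = struct.unpack_from("<Q", data, 0)
    end = 8 + header_len
    if end > len(data):
        raise ValueError("header extends past the available bytes")
    return json.loads(data[8:end].decode("utf-8"))


def count_parameters(header):
    """Sum element counts over all tensors, skipping the metadata entry."""
    total = 0
    for name, info in header.items():
        if name == "__metadata__":
            continue
        elements = 1
        for dim in info["shape"]:
            elements *= dim
        total += elements
    return total
```

Since the header sits at the start of the file, tensor shapes and dtypes can be read without touching the weight data.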
CI/CD & Constraints
- Cross-platform testing matrix (`ubuntu`, `macos`, `windows` on Python `3.10`-`3.12`) verifying binary struct unpacking behavior.
- Hardened pipeline constraints enforcing the sub-100ms startup latency by blocking heavy ML ecosystem imports (e.g., `torch`, `transformers`).