fix(deps): update dependency transformers to v4.52.4 #55
Closed
dreadnode-renovate-bot[bot] wants to merge 1 commit into
| datasource | package      | from   | to     |
| ---------- | ------------ | ------ | ------ |
| pypi       | transformers | 4.51.3 | 4.52.4 |
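As a quick sanity check for the bump in the table above, the following minimal sketch compares release strings and reports whether an installed package meets the new floor. The helper names `parse_version` and `installed_satisfies` are hypothetical and not part of Renovate or transformers; the parser only looks at the leading numeric `X.Y.Z` part and ignores any pre-release suffix.

```python
import re
from importlib import metadata


def parse_version(text):
    """Return the leading numeric 'X.Y[.Z]' part of a release string as a tuple."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def installed_satisfies(package, minimum):
    """Return True/False for an installed package, or None if it is not installed."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return parse_version(installed) >= parse_version(minimum)


OLD, NEW = "4.51.3", "4.52.4"  # values taken from the table in this PR
print(parse_version(NEW) > parse_version(OLD))
print(installed_satisfies("transformers", NEW))
```

A `None` result means the package is absent from the current environment, which is distinct from an outdated install.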
86de884 to
ee3f3bc
Compare
This PR contains the following updates:
4.51.3 -> 4.52.4
Release Notes
huggingface/transformers (transformers)
v4.52.4: Patch release: v4.52.4
Compare Source
The following commits are included in that patch release:
v4.52.3: Patch release v4.52.3
Compare Source
Patch release v4.52.3
We had to protect the imports again, a series of bad events.
Here are the two prs for the patch:
v4.52.2: Patch release v4.52.2
Compare Source
Patch release v4.52.2
We had to revert #37877 because of a missing flag that was overriding the device map. We re-introduced the changes because they allow native 3D parallel training in Transformers. Sorry everyone for the troubles! 🤗
v4.52.1: Qwen2.5-Omni, SAM-HQ, GraniteMoeHybrid, D-FINE, CSM, BitNet, LlamaGuard, TimesFM, MLCD, Janus, InternVL
Compare Source
New models
Qwen2.5-Omni
The Qwen2.5-Omni model is a unified multiple modalities model proposed in Qwen2.5-Omni Technical Report from Qwen team, Alibaba Group.
The abstract from the technical report is the following:
SAM-HQ
SAM-HQ (High-Quality Segment Anything Model) was proposed in Segment Anything in High Quality by Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu.
The model is an enhancement to the original SAM model that produces significantly higher quality segmentation masks while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability.
SAM-HQ introduces several key improvements over the original SAM model:
The abstract from the paper is the following:
The recent Segment Anything Model (SAM) represents a big leap in scaling up segmentation models, allowing for powerful zero-shot capabilities and flexible prompting. Despite being trained with 1.1 billion masks, SAM's mask prediction quality falls short in many cases, particularly when dealing with objects that have intricate structures. We propose HQ-SAM, equipping SAM with the ability to accurately segment any object, while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability. Our careful design reuses and preserves the pre-trained model weights of SAM, while only introducing minimal additional parameters and computation. We design a learnable High-Quality Output Token, which is injected into SAM's mask decoder and is responsible for predicting the high-quality mask. Instead of only applying it on mask-decoder features, we first fuse them with early and final ViT features for improved mask details. To train our introduced learnable parameters, we compose a dataset of 44K fine-grained masks from several sources. HQ-SAM is only trained on the introduced dataset of 44k masks, which takes only 4 hours on 8 GPUs.
Tips:
GraniteMoeHybrid
The GraniteMoeHybrid model builds on top of GraniteMoeSharedModel and Bamba. Its decoding layers consist of state space layers or MoE attention layers with shared experts. By default, the attention layers do not use positional encoding.
D-FINE
The D-FINE model was proposed in D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement by Yansong Peng, Hebei Li, Peixi Wu, Yueyi Zhang, Xiaoyan Sun, and Feng Wu.
The abstract from the paper is the following:
We introduce D-FINE, a powerful real-time object detector that achieves outstanding localization precision by redefining the bounding box regression task in DETR models. D-FINE comprises two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD).
FDR transforms the regression process from predicting fixed coordinates to iteratively refining probability distributions, providing a fine-grained intermediate representation that significantly enhances localization accuracy. GO-LSD is a bidirectional optimization strategy that transfers localization knowledge from refined distributions to shallower layers through self-distillation, while also simplifying the residual prediction tasks for deeper layers. Additionally, D-FINE incorporates lightweight optimizations in computationally intensive modules and operations, achieving a better balance between speed and accuracy. Specifically, D-FINE-L / X achieves 54.0% / 55.8% AP on the COCO dataset at 124 / 78 FPS on an NVIDIA T4 GPU. When pretrained on Objects365, D-FINE-L / X attains 57.1% / 59.3% AP, surpassing all existing real-time detectors. Furthermore, our method significantly enhances the performance of a wide range of DETR models by up to 5.3% AP with negligible extra parameters and training costs. Our code and pretrained models: this https URL.
CSM
The Conversational Speech Model (CSM) is the first open-source contextual text-to-speech model released by Sesame. It is designed to generate natural-sounding speech with or without conversational context. This context typically consists of multi-turn dialogue between speakers, represented as sequences of text and corresponding spoken audio.
Model Architecture:
CSM is composed of two LLaMA-style auto-regressive transformer decoders: a backbone decoder that predicts the first codebook token and a depth decoder that generates the remaining tokens. It uses the pretrained codec model Mimi, introduced by Kyutai, to encode speech into discrete codebook tokens and decode them back into audio.
The original csm-1b checkpoint is available under the Sesame organization on Hugging Face.
BitNet
Trained on a corpus of 4 trillion tokens, this model demonstrates that native 1-bit LLMs can achieve performance comparable to leading open-weight, full-precision models of similar size, while offering substantial advantages in computational efficiency (memory, energy, latency).
LlamaGuard
Llama Guard 4 is a new multimodal model designed to detect inappropriate content in images and text, whether used as input or generated as output by the model. It is a dense 12B model pruned from the Llama 4 Scout model, and it can run on a single GPU (24 GB of VRAM). It can evaluate both text-only and image+text inputs, making it suitable for filtering both inputs and outputs of large language models. This enables flexible moderation pipelines where prompts are analyzed before reaching the model, and generated responses are reviewed afterwards for safety. It can also understand multiple languages.
TimesFM
TimesFM (Time Series Foundation Model) is a pretrained time-series foundation model proposed in A decoder-only foundation model for time-series forecasting by Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. It is a decoder-only model that takes non-overlapping patches of time-series data as input and predicts output patches of a given length in an autoregressive fashion.
The abstract from the paper is the following:
Motivated by recent advances in large language models for Natural Language Processing (NLP), we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of public datasets comes close to the accuracy of state-of-the-art supervised forecasting models for each individual dataset. Our model is based on pretraining a patched-decoder style attention model on a large time-series corpus, and can work well across different forecasting history lengths, prediction lengths and temporal granularities.
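The "non-overlapping patches" mentioned in the TimesFM description can be illustrated with a small, model-free sketch. This is not the TimesFM implementation: the helper name `split_into_patches`, the patch length of 3, and the choice to left-pad a short history with zeros are all assumptions made for illustration.

```python
def split_into_patches(series, patch_len, pad_value=0.0):
    """Split a 1-D sequence into consecutive, non-overlapping patches.

    If the length is not a multiple of patch_len, the front of the series is
    padded with pad_value (an assumption for this sketch, not TimesFM's rule).
    """
    if patch_len <= 0:
        raise ValueError("patch_len must be positive")
    values = list(series)
    pad = (-len(values)) % patch_len  # number of values needed to fill the first patch
    padded = [pad_value] * pad + values
    return [padded[i:i + patch_len] for i in range(0, len(padded), patch_len)]


history = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
patches = split_into_patches(history, patch_len=3)
print(patches)  # [[0.0, 0.0, 1.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
```

Each patch would then play the role of one input token for the decoder; the forecasting step itself is outside the scope of this sketch.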
MLCD
The MLCD models were released by the DeepGlint-AI team in unicom, which focuses on building foundational visual models for large multimodal language models using large-scale datasets such as LAION400M and COYO700M, and employs sample-to-cluster contrastive learning to optimize performance. MLCD models are primarily used for multimodal visual large language models, such as LLaVA.
Janus
The Janus Model was originally proposed in Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation by DeepSeek AI team and later refined in Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. Janus is a vision-language model that accepts both images and text as input and can generate both image and text output.
The abstract from the original paper is the following:
In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified model and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.
The abstract from the aforementioned Janus-Pro paper, released afterwards, is the following:
In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.
InternVL
The InternVL3 family of Visual Language Models was introduced in InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models.
The abstract from the paper is the following:
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
Overview of InternVL3 models architecture, which is the same as InternVL2.5. Taken from the original checkpoint.
Comparison of InternVL3 performance on OpenCompass against other SOTA VLLMs. Taken from the original checkpoint.
Kernel integration
We integrate some kernels in the transformers library via the kernels package: https://github.com/huggingface/kernels
We start with some kernels in the Llama model, and we iterate to identify the best performance optimizations.
TP support
In the previous release, we added tensor parallel (TP) support in order to run distributed inference. However, TP is not yet supported for all quantization methods, and we are progressively adding support for more of them. Right now, only compressed-tensors, fp8 and fp8-fbgemm support it.
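A deployment script might want to refuse a TP launch for a quantization method that is not on the list above. The sketch below encodes that list as plain strings spelled as in this release note; the helper `can_use_tp` is hypothetical, the strings are not guaranteed to match the identifiers used inside transformers, and treating an unquantized model (`None`) as eligible is an assumption.

```python
# Methods named in the v4.52.1 release note as TP-capable (spelling as in the note).
TP_READY_QUANT_METHODS = frozenset({"compressed-tensors", "fp8", "fp8-fbgemm"})


def can_use_tp(quant_method):
    """Return True if TP is expected to work for this quantization method.

    quant_method=None stands for an unquantized model (assumed eligible here).
    """
    return quant_method is None or quant_method in TP_READY_QUANT_METHODS


for method in (None, "fp8", "gguf"):
    print(method, can_use_tp(method))
```

Keeping the allow-list in one place makes it easy to extend as later releases add TP support for more methods.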
Quantization
AutoRound
From the AutoRound contributors:
Quantization Documentation
We have added two new sections to better understand and get started with quantization:
GGUF
We've added GGUF support to gemma3 family models.
Fast image processors
Most Vision Models and VLMs in Transformers can now benefit from fast image processors. By utilizing torch/torchvision functional transforms, these processors offer a substantial speedup when processing images compared to PIL/NumPy functions, and support processing on both CPU and CUDA.
AutoDocstring
The new @auto_docstring decorator makes it easier to add proper documentation when contributing a model without bloating the modeling code:
@auto_docstring: AutoDocstring
Custom generate
We now support custom generate methods to be loaded from model.generate. The custom generate methods can be stored on the Hub, enabling quick distribution of experiments regarding new caches, decoding methods, heuristics, ...
You can find the docs here, and all custom generation methods by searching for the custom_generate tag.
Chat CLI
The transformers-cli command is updated to be simpler and cleaner, specifically for its chat variant.
The following is now possible and recommended:
Additionally, almost any generate flag can now be passed as a positional argument, present and future, as opposed to being limited to a set of hardcoded flags, for example:
[chat] generate parameterization powered by GenerationConfig and UX-related changes by @gante in #38047
Breaking changes
Deprecations
The agents folder is finally removed from transformers in favour of using smolagents.
We are moving away from torch 2.0 as it was released more than two years ago.
General bugfixes and improvements
init empty weights without accelerate by @Cyrilvallez in #37337
_init_weights by @Cyrilvallez in #37341
GenerationMixin inheritance by default in PreTrainedModel by @gante in #37173
_pytree._register_pytree_node and torch.cpu.amp.autocast by @bzhong-solink in #37372
kernels to 0.4.3 by @ArthurZucker in #37419
rms_norm_eps for the L2Norm for Llama4 by @ArthurZucker in #37418
tests/models/ by @ydshieh in #37415
fsspec dependency which isn't directly used by transformers by @cyyever in #37318
_init_weights() issues - make it work for composite models by @Cyrilvallez in #37070
num_logits_to_keep by @Cyrilvallez in #37149
from_pretrained by @Cyrilvallez in #37216
attn_temperature_tuning by @gmlwns2000 in #37501
test_offloaded_cache_implementation on XPU by @yao-matrix in #37514
as_tensor) by @ydshieh in #37551
test_can_load_with_global_device_set using a subprocess by @ydshieh in #37553
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Enabled.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR has been generated by Renovate Bot.