This PR contains the following updates:
- `torch`: `==2.9.1` → `==2.11.0`

Release Notes

pytorch/pytorch (torch)

v2.11.0: PyTorch 2.11.0 Release
PyTorch 2.11.0 Release Notes
Highlights
For more details about these highlighted features, see the release blog post. The full release notes for this release follow.
Backwards Incompatible Changes
Release Engineering
Volta (SM 7.0) GPU support removed from CUDA 12.8 and 12.9 binary builds (#172598)
Starting with PyTorch 2.11, the CUDA 12.8 and 12.9 pre-built binaries no longer include support for Volta GPUs (compute capability 7.0, e.g. the V100). This change was necessary to enable updating to cuDNN 9.15.1, which is incompatible with Volta.
Users with Volta GPUs who were using the CUDA 12.8+ binaries should switch to the CUDA 12.6 builds, which continue to include Volta support. Alternatively, build PyTorch from source with Volta included in `TORCH_CUDA_ARCH_LIST`.
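As a rough illustration, a from-source build includes Volta when `TORCH_CUDA_ARCH_LIST` contains a `7.0` entry. The helper below is ours, not a PyTorch API; it only sketches how such a list can be checked (the variable accepts entries separated by semicolons or spaces):

```python
def volta_in_arch_list(arch_list: str) -> bool:
    # TORCH_CUDA_ARCH_LIST looks like "7.0;8.0;9.0" or "7.0 8.0 9.0",
    # possibly with "+PTX" suffixes. Volta is compute capability 7.0.
    archs = arch_list.replace(";", " ").split()
    return any(a.startswith("7.0") for a in archs)
```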
PyPI wheels now ship with CUDA 13.0 instead of CUDA 12.x (#172663, announcement)
Starting with PyTorch 2.11, `pip install torch` from PyPI installs CUDA 13.0 wheels by default for both Linux x86_64 and Linux aarch64. Previously, PyPI wheels shipped with CUDA 12.x, and only Linux x86_64 CUDA wheels were available on PyPI. Users whose systems have only CUDA 12.x drivers installed may encounter errors when running `pip install torch` without specifying an index URL.

Additionally, CUDA 13.0 supports only Turing (SM 7.5) and newer GPU architectures on Linux x86_64; Maxwell and Pascal GPUs are no longer supported under CUDA 13.0. Users with these older GPUs should use the CUDA 12.6 builds instead.

CUDA 12.6 and 12.8 binaries remain available via download.pytorch.org.

Version 2.10:

```shell
# PyPI wheel used CUDA 12.x
pip install torch
```

Version 2.11:

```shell
# PyPI wheel now ships CUDA 13.0
pip install torch
```
Python Frontend
`torch.hub.list()`, `torch.hub.load()`, and `torch.hub.help()` now default the `trust_repo` parameter to `"check"` instead of `None`; the `trust_repo=None` option has been removed. (#174101)

Previously, passing `trust_repo=None` (or relying on the default) would silently download and run code from untrusted repositories with only a warning. The new default `"check"` behavior prompts the user for explicit confirmation before running code from repositories not on the trusted list.

Users who were explicitly passing `trust_repo=None` must update their code. Users who were already passing `trust_repo=True`, `trust_repo=False`, or `trust_repo="check"` are not affected.
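The `"check"` policy described above can be sketched in plain Python. This is a hypothetical model of the behavior, not `torch.hub`'s actual implementation; the function name and the `trust_repo=False` branch are our assumptions:

```python
def should_run_hub_code(trust_repo, repo_is_trusted, confirm):
    # Hypothetical sketch of the trust policy described above.
    # trust_repo: True, False, or "check" (the new default)
    # repo_is_trusted: whether the repo is already on the trusted list
    # confirm: zero-arg callable that asks the user, returning a bool
    if trust_repo is True or repo_is_trusted:
        return True          # explicitly or previously trusted: run without prompting
    if trust_repo == "check":
        return confirm()     # new default: prompt before running untrusted code
    return False             # assumed behavior for untrusted repos otherwise
```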
torch.nn
Add sliding window support to `varlen_attn` via `window_size`, making optional arguments keyword-only (#172238)

The signature of `torch.nn.attention.varlen_attn` has changed: a `*` (keyword-only separator) has been inserted before the optional arguments. Previously, optional arguments such as `is_causal`, `return_aux`, and `scale` could be passed positionally; they must now be passed as keyword arguments. A new `window_size` keyword argument has also been added.

Remove `is_causal` flag from `varlen_attn` (#172245)

The `is_causal` parameter has been removed from `torch.nn.attention.varlen_attn`. Causal attention is now expressed through the `window_size` parameter: use `window_size=(-1, 0)` for causal masking, or `window_size=(W, 0)` for causal attention with a sliding window of size `W`. The default `window_size=(-1, -1)` corresponds to full (non-causal) attention.

Distributed
`DebugInfoWriter` now honors `$XDG_CACHE_HOME` for its cache directory in C++ code, consistent with the Python side. Previously it always used `~/.cache/torch`. (#168232)

This avoids issues where `$HOME` is not set or not writable. Users who relied on `~/.cache/torch` being used regardless of `$XDG_CACHE_HOME` may see debug info written to a different location.
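The XDG-style lookup described above amounts to the following resolution order. This is an illustrative Python sketch (the helper name is ours; the actual change is in C++):

```python
import os

def debug_cache_dir(env):
    # Prefer $XDG_CACHE_HOME when set, otherwise fall back to ~/.cache/torch.
    xdg = env.get("XDG_CACHE_HOME")
    if xdg:
        return os.path.join(xdg, "torch")
    return os.path.join(env.get("HOME", ""), ".cache", "torch")
```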
`DeviceMesh` now stores a process group registry (`_pg_registry`) directly, enabling `torch.compile` to trace through `get_group()`. (#172272)

This may break code that skips `init_process_group`, loads a saved DTensor (constructing a `DeviceMesh` with no process groups), and later creates the process groups separately: during `torch.compile` runtime the process-group lookup will fail. Ensure process groups are initialized before constructing the `DeviceMesh`.
Distributed (DTensor)
`DTensor.to_local()` backward now converts `Partial` placements to `Replicate` by default when `grad_placements` is not provided. (#173454)

Previously, calling `to_local()` on a `Partial` DTensor would preserve the `Partial` placement in the backward gradient, which could produce incorrect gradients when combined with `from_local()`. Now, the backward pass automatically maps `Partial` forward placements to `Replicate` gradient placements, matching the behavior of `from_local()`.

Users who relied on the previous behavior (where `to_local()` backward preserved `Partial` gradients) may see different gradient values. To ensure correctness, explicitly pass `grad_placements` to `to_local()`.
`_PhiloxState.seed` and `_PhiloxState.offset` now return `torch.Tensor` instead of `int` (#173876)

The DTensor RNG internal `_PhiloxState` class changed its `seed` and `offset` properties to return tensors instead of Python ints, and the setters now expect tensors. This makes the RNG state compatible with PT2 tracing (the previous `.item()` calls were not fake-tensor friendly).

Code that directly reads `_PhiloxState.seed` or `_PhiloxState.offset` and treats them as ints will break. Call `.item()` to get the int value. When setting, wrap the value in a tensor.
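The migration pattern can be sketched without torch. Here `_Scalar` is a stand-in for a 0-d `torch.Tensor`, used only to illustrate the read and write sides of the change:

```python
class _Scalar:
    # Minimal stand-in for a 0-d torch.Tensor, for illustration only.
    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value

seed = _Scalar(1234)               # _PhiloxState.seed now behaves like this
seed_int = seed.item()             # reading: call .item() to recover the int
next_seed = _Scalar(seed_int + 1)  # writing: wrap the int back into a tensor-like value
```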
ROCm
caffe2 support is fully removed from ROCm PyTorch's hipify preprocessing. This is known as "hipify v2" behavior. (#174087, #174300, #174388, #174499, #175098)
hipify v1 background
When caffe2 and PyTorch were separate projects, their ROCm support strategies differed. For caffe2, all files and classes were renamed following the pattern CUDA to HIP, Cuda to Hip, cuda to hip, and so on. PyTorch did not rename classes, but created new files following the same renaming pattern (e.g., aten/src/ATen/cuda/CUDABlas.h to aten/src/ATen/hip/HIPBlas.h). As a consequence, caffe2 had a distinct device backend named "HIP" (renamed from "CUDA"), while ROCm PyTorch masquerades as the "cuda" device (`torch.empty(1, device="cuda")`). Once the caffe2 and PyTorch projects were merged, this caused a mismatch: caffe2 expected to use a "HIP" device while PyTorch expected a "cuda" device. To alleviate this mismatch, "Masquerading" classes were created under aten/src/ATen/hip/impl.

These classes were often used transparently during ROCm PyTorch's hipify preprocessing of source files. All files under c10/ and caffe2/ were hipified using the caffe2 renaming behavior, while all other "PyTorch" files used the other strategy. The Masquerading classes would replace their CUDA counterparts during hipify preprocessing. For example, c10/cuda/CUDAStream.h's CUDAStream would be replaced by aten/src/ATen/hip/impl/HIPStreamMasqueradingAsCUDA.h's HIPStreamMasqueradingAsCUDA. These Masquerading classes call the underlying caffe2 code and create "HIP" devices, and the device would be reset to "cuda" by the Masquerading classes.
hipify v2 new behavior
Hipify v2 (#174087, #174300, #174388, #174499, #175098) makes the following changes:
Great care has been taken to make this change backwards compatible. Although PyTorch itself now builds cleanly using the hipify v2 behavior, downstream PyTorch extension projects that explicitly included Masquerading headers or called Masquerading APIs could be affected, resulting in failed builds. For example, before backwards compatibility was achieved, the xformers project failed to build with the hipify v2 changes. A PR demonstrates the changes that were initially necessary to work around the build failures; such changes are no longer necessary after the hipify v2 BC-breaking behavior was improved.
torch.export
`torch.export.export_for_training` has been removed (#171714)

`export_for_training` was previously available as a separate API for exporting models while preserving training semantics. This function has been removed. Users should use `torch.export.export` instead, which returns the same graph as the previous `export_for_training`.

ONNX
Remove the `fallback` option from `torch.onnx.export` (#173189)

The `fallback` parameter has been removed from `torch.onnx.export()`. Previously, when `fallback=True`, the exporter would automatically fall back to the legacy TorchScript-based exporter if the dynamo exporter failed. This fallback was removed because it was overly complicated, required different inputs, produced different models, and hid errors from the new exporter.

Migration: remove `fallback=True` (or `fallback=False`) from your `torch.onnx.export()` calls. If you need fallback behavior, implement it explicitly in your own code by catching exceptions and calling the legacy exporter separately.

Remove overload matching logic from the ONNX dispatcher (#165083)
The `custom_translation_table` parameter in `torch.onnx.export()` no longer accepts a list of functions for each torch op. Previously, users could pass a list of overloaded ONNX functions (e.g., one for float tensors, another for bool tensors), and the dispatcher would automatically select the correct overload based on input types. This complex type-matching logic has been removed because torchlib no longer uses overloads for the same opset version.

The type of `custom_translation_table` changed from `dict[Callable, Callable | Sequence[Callable]]` to `dict[Callable, Callable]`. Passing a `Sequence` as a value now raises a `TypeError`.

Migration: provide a single function per operator instead of a list of overloads. If you need type-dependent behavior, handle it inside the single function.
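The migration can be sketched as follows. The translation function below is hypothetical (the op and the ONNX node names are made up); the point is only that the dtype branch that the dispatcher used to resolve across overloads now lives inside one function:

```python
def my_op_to_onnx(x, dtype):
    # Hypothetical single translation function: type-dependent behavior
    # is handled inside the function instead of via a list of overloads.
    if dtype == "bool":
        return ("Not", x)   # made-up bool-specific lowering
    return ("Neg", x)       # made-up default lowering for numeric dtypes
```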
Quantization
The PT2E quantization flow (`torch.ao.quantization.pt2e` and `torch.ao.quantization.quantizer`) has been removed from PyTorch and migrated to torchao. (#169151)

The following modules and classes have been removed:

- `torch.ao.quantization.pt2e` (including `DuplicateDQPass`, `PortNodeMetaForQDQ`, export utils, graph utils, numeric debugger, lowering utilities)
- `torch.ao.quantization.quantizer` (including `ComposableQuantizer`, `EmbeddingQuantizer`, `X86InductorQuantizer`, `XPUInductorQuantizer`, `XNNPACKQuantizer`, `QuantizationSpec`, `QuantizationAnnotation`, `QuantizationConfig`, etc.)

Users relying on the PT2E quantization flow should migrate to the `torchao` package, which now hosts these APIs.
Deprecations
Linear Algebra
The MAGMA backend for linear algebra operations is now deprecated and will be removed in a future release. Setting `torch.backends.cuda.preferred_linalg_library("magma")` or retrieving a previously-set MAGMA preference will now issue a deprecation warning. cuSOLVER remains the default backend. (#172823)

If you see any errors when using cuSOLVER that did not occur with MAGMA, please file an issue on GitHub. To silence the warning, stop explicitly selecting the MAGMA backend.
`torch.linalg.svd` no longer dispatches to MAGMA. The MAGMA backend is deprecated and cuSOLVER is now used unconditionally, providing significant speedups (2x–400x depending on matrix size and batch dimensions). (#172824)

Previously, setting `torch.backends.cuda.preferred_linalg_library("magma")` would route SVD through MAGMA. This setting is now ignored for SVD, and cuSOLVER is always used.
`torch.linalg.solve_triangular` and `torch.triangular_solve` no longer dispatch to MAGMA on CUDA. cuBLAS is now used unconditionally, providing speedups of 2x–24x for most matrix sizes (small matrices may see minor regressions of ~0.6x). (#174109)
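For reference, the computation that `solve_triangular` performs can be sketched in plain Python with forward substitution for a lower-triangular system (a teaching sketch, not the cuBLAS path):

```python
def solve_lower_triangular(L, b):
    # Solve L @ x = b where L is lower triangular, via forward substitution:
    # each x[i] depends only on the already-computed x[0..i-1].
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - s) / L[i][i]
    return x
```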
`torch.linalg.lstsq` no longer dispatches to MAGMA. cuSOLVER/cuBLAS are now used unconditionally, providing speedups of 1.7x–620x depending on matrix size and batch dimensions. (#174779)
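As a reminder of the problem `lstsq` solves, here is a tiny pure-Python least-squares fit of a line via the 2x2 normal equations. This is illustrative only (the helper is ours; real code should call `torch.linalg.lstsq`, which is also numerically more robust than normal equations):

```python
def lstsq_line(xs, ys):
    # Minimize sum((a*x + b - y)**2) over a, b by solving the 2x2
    # normal equations explicitly.
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx
    a = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return a, b
```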
Distributed
`torch.distributed.symmetric_memory.enable_symm_mem_for_group` is deprecated. The store can be retrieved directly via `ProcessGroup.getStore()` in C++, making this call unnecessary. (#172163) In 2.11, simply drop the call:

```python
# No longer needed: the store is accessed directly from the ProcessGroup.
```

New features
Python Frontend
- Added `native_handle` property to `torch.Stream`, providing a unified way to retrieve the backend-specific opaque stream handle (e.g., `cudaStream_t` for CUDA, `sycl::queue*` for XPU). This is useful for passing stream handles to third-party libraries such as Triton. (#171040)

Autograd

- `Function.clear_saved_tensors_on_access` class attribute to automatically free saved tensors after they are accessed (#173833)

torch.nn

- `activate_flash_attention_impl` (#169866)
- `scale` for softmax to varlen attn (#171199)

Distributed

- `start_method` option to `torch.distributed.debug.start_debug_server` to select the multiprocessing start method (`fork`, `spawn`, or `forkserver`), enabling CUDA-safe server startup (#173196)
- `torch.distributed.debug` (#174808)
- Distributed collectives (such as `torch.distributed.all_gather`) now automatically work with `FakeTensorMode`; meta implementations are registered at `import torch` time (#162119)
- `SymmetricMemory` as a torch class for use in op definitions (#174019)
- torchcomms `_BackendWrapper` shim layer in c10d (#174202)

CUDA
MPS
ROCm
- `clock_rate`, `memory_clock_rate`, `memory_bus_width`, `memory_per_block`, `shared_memory_per_block`. (#170572)
- `TORCH_USE_HIP_DSA`. (#172679)

XPU
torch.compile
Dynamo
- `torch.compile` now supports tracing through `contextlib.ExitStack` and `contextlib.suppress` context managers, allowing code that uses these patterns to be compiled without graph breaks (#146506, #147990)
- `torch._dynamo.config.ignore_logging_functions` config to skip arbitrary logging callables during tracing without causing graph breaks. Add functions to this set to have Dynamo treat them as no-ops during compilation (#168913)
- `TORCH_DYNAMO_AUTOMATIC_DYNAMIC_SHAPES=0` environment variable to globally disable automatic dynamic shapes without modifying Python code (#172334)
- `TORCH_COMPILE_OVERRIDE_BACKENDS` environment variable for per-graph backend override, enabling binary search to find problematic compiled graphs. Supports filter syntax like `">10:eager"` or `"0-5:aot_eager;6-10:inductor"` (#172411)
- `torch._dynamo.decorators.leaf_function`, which allows annotating functions as leaf operations that Dynamo and AOTAutograd will not trace into (#170471)
- Fixed a case where `register_hook` on non-leaf tensors would fail under `torch.compile` (#172126)

Inductor
- `torch.cond` dispatch generation (#167617)
- `ldexp` lowering with `libdevice.ldexp` (CUDA) and `std::ldexp` (CPU) codegen (#171721)
- `pin_memory` for `torch.empty` (#172578)
- `triton_meta` to TritonTemplate `maybe_append_choice` API for custom template development (#174292)
- `max-autotune-gemm`, which overlaps autotuning with lowering/scheduling in a subprocess to reduce compilation overhead (#170407)
- `(BlockWise128x128, BlockWise1x128)` scaling support in Inductor Triton templates (#170748)

torch.export
- `torch.export` (#174720)

ONNX
- `ExportableModule` wrapper for ONNX export (#170810)
- `InputObserver` to infer dynamic shapes for export (#172838)
- `InputObserver` for multimodal LLM export (#174964)

Foreach
- Added `torch.linalg._powsum` and `torch._foreach_powsum` as fused kernels that compute `sum(abs(x)**ord)` (equivalent to `vector_norm` without the root extraction) (#172685)

Improvements
Release Engineering
Python Frontend
- `torch.load` now produces clearer error messages when encountering miniz errors from `PyTorchStreamReader`, explicitly indicating that the checkpoint file is likely corrupt (#170244)
- `torch.load(map_location='meta')` no longer reads storage data from the filesystem, improving performance when loading checkpoints onto the meta device (#170619)

Composability
- Added `check_out_variant` and `to_out_variant` utilities for custom operator out variant validation. `check_out_variant` verifies that a custom op's out variant is compatible with Inductor's out_variant pass, and `to_out_variant` converts an `OpOverload` to its out variant. (#174473)

torch.nn
- Added `remove_duplicate` parameter to the `nn.Module.modules()` function (#174383)
- `nn.attention.flex_attention` (#171744)

C++ Frontend
- Added `Float8_e8m0fnu` and `Float4_e2m1fn_x2` dtypes to the stable ABI (#173669)
- `torch::stable::Tensor::layout()` (#174735)

Distributed
- Made `context_parallel_shard` more general (#170200)
- `get_offset` for symmetric memory (#172044)
- `ProcessGroupNCCL`: workaround for `reduce_scatter` with `world_size=1` (#170922)
- `ProcessGroupWrapper` (#171920)
- Import `pdb` only when the user calls `breakpoint()` in `torch.distributed` (#171818)
- `ProcessGroupNCCL`: use lowest rank as split color (#173687)

DTensor
CPU
CUDA
- `accscalar_t` for interpolation accumulators (#170661)

cuDNN
MPS
- `index_fill` backward pass (#174238)
- `baddbmm` and `addbmm` extended to integer and complex types (#170895)
- `torch.special.erfcx` (scaled complementary error function) (#172910)

ROCm
- `addmm` behavior now takes into account the preferred BLAS backend instead of forcing hipblaslt. (#174350)

Sparse Frontend
- `torch.view_as_real` and `torch.view_as_complex` now support sparse tensors (#164964)

XPU
- `torch.xpu._dump_snapshot` API (#170186)
- `torch.xpu._record_memory_history` API (#169559)
- `torch.xpu.memory_snapshot` (#169442)
- Added `local_mem_size` to XPU device properties (#172314)
- `torch.accelerator.get_device_capability` on XPU (#170747)
- `aot_inductor.emit_multi_arch_kernel` on XPU (#171432)
- `decompose_k` choice for XPU (#170541)

Profiler
- …a long time to load (#174717).
- The memory profiler now exposes a new `skip_actions` flag to filter out specific events (#168183).
- Added `post_process_timeout_s` field to prevent post-processing from blocking further execution (#173957).
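The kind of filtering a `skip_actions` flag implies can be sketched as below. The event model here is hypothetical (plain dicts with an `"action"` key), not the profiler's internal representation:

```python
def filter_events(events, skip_actions):
    # Drop any event whose action name appears in skip_actions.
    skip = set(skip_actions)
    return [e for e in events if e["action"] not in skip]
```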
torch.compile
Dynamo
- `fullgraph=True` now recursively disables dynamo on compiled code to prevent unintentional re-invocation of `torch.compile` (#173080)
- `Enum.__contains__` and constants (#173223)
- `kwargs=True` (#172519)
- `object` type in dynamo tracing (#171457)
- `allow_in_graph` (#173611)

Inductor
- …when a `LoweringException` occurs, making debugging easier (#171846)
- …`120a` for `.ptx` to `.fatbin` compilation and cpp codegen (#174162, #172263)
- `cvt_e8m0_rceil` prim with PTX lowering for SM100+ GPUs (#172497)
- `launch_cooperative_grid` flag for cooperative reduction kernels (#167800)
- …`torch.float8_e5m2`, enabling mixed FP8 (e4m3fn x e5m2) matrix multiplication (#171167)
- `scaled_dot_product_flash_attention.low_p` overload (#172622)
- Replace `record_function` with `_RecordFunctionFast` in CompiledFxGraph for reduced profiling overhead (#163976)
- …`mutated_inputs`, allowing more flexible template usage (#170721)
- `combo_kernels_pointwise_only` config option to exclude reduction kernels from combo kernel fusion (#174894)

torch.fx
- `torch.fx.symbolic_trace` now supports tracing `HigherOrderOperator`s that do not take callable arguments (#173839)
- Renamed `hint_int` to `size_hint`; support `size_hint` in user code. (#171944)
- Added `_disable_torch_fn_metadata_mode` option to `make_fx` and `aot_export_joint_with_descriptors` (#172087)

torch.export
- `from_node` provenance information is now preserved when serializing exported programs (#171726)

Quantization
- Use `expm1` for computing quantized ELU, improving numerical stability (#173968)

ONNX
Optimizer
DevX
- The `spin lint` command now supports pass-through arguments to lintrunner, including the `--take`, `--skip`, and `--tee-json` flags, giving developers more control over which linters run (#169373)

Ahead-Of-Time Inductor (AOTI)
- Add `cpp_kernel_name` to the public API to match AOTI shim gen; add `mm_type_out` to the AOTI fallback kernel (#174489)

Bug fixes
Release Engineering
Python Frontend
- `torch.load` with `FakeTensorMode` or `skip_data` context would compute incorrect storage sizes (#170618)
- Fixed `torch.ops.aten.index.Tensor` to properly raise an `IndexError` when called with an empty indices list, instead of producing undefined behavior (#174009)

Autograd
- `torch.autograd.gradcheck` when `fast_mode=True` (#166386)

Complex Frontend
- `torch.view_as_complex()` not working on the memory layout produced by `.contiguous()` after `.transpose()` (#169780)

Composability
- `torch.bucketize` crash during `torch.export` when `test_elements` is a scalar (#170751)
- `MaxUnpool` crash when input tensors are small (#169359)

Dataloader
- `__getitem__` in `Subset` subclasses (#163961)

Nested Tensor (NJT)
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.