
Refactor arithmetic ops to use std::variant + std::visit#723

Open
jysh1214 wants to merge 7 commits into master from dev/alexchiang/refactor-arithop

Conversation

@jysh1214
Collaborator

@jysh1214 jysh1214 commented Dec 1, 2025

Remove Arithmetic_internal.hpp/cpp and Ari_ii from linalg_internal_interface
@jysh1214 jysh1214 requested a review from IvanaGyro December 1, 2025 14:35
@jysh1214
Collaborator Author

jysh1214 commented Dec 1, 2025

Not sure what happened in the GitHub action:

Collecting package metadata (repodata.json): failed
CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/noarch/repodata.json>
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
'https//conda.anaconda.org/conda-forge/noarch'

https://github.com/Cytnx-dev/Cytnx/actions/runs/19826283869/job/56800387981?pr=723

@codecov

codecov bot commented Dec 1, 2025

Codecov Report

❌ Patch coverage is 15.72387% with 879 lines in your changes missing coverage. Please review.
✅ Project coverage is 36.18%. Comparing base (e87c3a2) to head (d60ea7f).
⚠️ Report is 32 commits behind head on master.

Files with missing lines Patch % Lines
src/linalg/Div.cpp 4.09% 232 Missing and 2 partials ⚠️
src/linalg/Sub.cpp 7.37% 221 Missing and 5 partials ⚠️
src/linalg/Mod.cpp 3.58% 188 Missing ⚠️
src/linalg/Cpr.cpp 0.00% 108 Missing ⚠️
src/linalg/Mul.cpp 25.18% 96 Missing and 5 partials ⚠️
src/linalg/Add.cpp 80.95% 3 Missing and 9 partials ⚠️
src/linalg/iArithmetic_visit.hpp 80.95% 6 Missing and 2 partials ⚠️
src/linalg/iDiv.cpp 66.66% 1 Missing ⚠️
src/linalg/iMul.cpp 66.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #723      +/-   ##
==========================================
+ Coverage   32.35%   36.18%   +3.83%     
==========================================
  Files         215      213       -2     
  Lines       36363    30579    -5784     
  Branches    14597    12655    -1942     
==========================================
- Hits        11764    11065     -699     
+ Misses      22659    17679    -4980     
+ Partials     1940     1835     -105     

☔ View full report in Codecov by Sentry.

@ianmccul
Collaborator

ianmccul commented Dec 1, 2025

This is looking good - you've removed about 20,000 lines of code in the last couple of weeks? 😁

There is still a lot of repetition with the Tensor Add<T>(const T &lc, const Tensor &Rt) templates in src/linalg/Add.cpp; is it possible to replace that with something like:

    template <typename T>
    Tensor Add(const T &lc, const Tensor &Rt) {
      auto type = cy_typeid_v<T>;
      Storage Cnst(1, type);
      Cnst.at<T>(0) = lc;
      Tensor out;
      out._impl = Rt._impl->_clone_meta_only();
      out._impl->storage() =
        Storage(Rt._impl->storage().size(), std::min(type, Rt.dtype()), Rt.device());

      if (Rt.device() == Device.cpu) {
        std::visit(
          [&](auto *rptr) {
            using TR = std::remove_pointer_t<decltype(rptr)>;
            cytnx::linalg_internal::AddInternalImpl<type, TR>(
              out._impl->storage()._impl, Cnst._impl, Rt._impl->storage()._impl,
              Rt._impl->storage()._impl->size(), {}, {}, {});
          },
          Rt.ptr());
      } else {
  #ifdef UNI_GPU
        checkCudaErrors(cudaSetDevice(Rt.device()));
        linalg_internal::lii.cuAri_ii[type][Rt.dtype()](
          out._impl->storage()._impl, Cnst._impl, Rt._impl->storage()._impl,
          Rt._impl->storage()._impl->size(), {}, {}, {}, 0);
  #else
        cytnx_error_msg(true, "[Add] fatal error, the tensor is on GPU without CUDA support.%s",
                        "\n");
  #endif
      }
      return out;
    }

Now the type of out looks very suspicious. The old code (equivalent to the std::min() I used above) is not identical to the result of the type_promote() function, which is most likely a bug¹. Although they are both broken (e.g. complex<float> + double would end up as complex<float> under either mechanism), it is absolutely essential that the type selected here matches the output type used in the linalg_internal function, since this is just assumed and there is no checking.

Since AddInternalImpl defines using TOut = cytnx::Scalar_init_interface::type_promote_t<TLin, TRin> the same function must be used here as well, otherwise there will be a mismatch. This would be detectable with mixed types involving unsigned, eg type_promote(uint32, int16) is int32, but min(uint32, int16) is uint32. So addition of Tensor of uint32 and int16 will incorrectly give a result tensor that is an unsigned type. (Note that int32 is also the wrong type here, since converting uint32 to int32 might overflow. It ought to be int64, but that is not fixable until all of the implicitly assumed return types are fixed...)
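The uint32/int16 pitfall can be reproduced with plain C++ arithmetic, since the language's usual arithmetic conversions behave exactly like the min() rule. A minimal sketch (the helper names here are just for illustration, not cytnx code):

```cpp
#include <cstdint>

// C++'s usual arithmetic conversions mirror the min() bug: uint32_t mixed
// with int16_t yields an unsigned result, so a negative operand wraps.
inline uint32_t mixed_add_wrapping(uint32_t u, int16_t s) {
  return u + s;  // int16_t promotes to int, then converts to unsigned int
}

// A range-safe promotion needs a signed type wide enough for both operand
// ranges, i.e. the int64 result argued for above.
inline int64_t mixed_add_promoted(uint32_t u, int16_t s) {
  return static_cast<int64_t>(u) + static_cast<int64_t>(s);
}
```

So adding 1u and int16_t(-2) wraps to UINT32_MAX under the min()-style rule, while the promoted version gives -1.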

But cytnx::linalg_internal::AddInternalImpl is now a template anyway, so why convert to a type-erased interface, only to convert back again? Better would be to construct the output tensor and pass that in to a typed interface:

cytnx::linalg_internal::AddInternalImpl<lhs_type, rhs_type, out_type>(lhs, rhs, out);

This involves either a triple dispatch of std::variant (OK, but lots of combinations!), or a double dispatch on lhs and rhs plus a runtime assert that out.dtype() matches type_promote(lhs, rhs). Or alternatively, move the implementation of AddInternalImpl into Add, since both functions are now fairly simple. (That is the better option I think, because then the output type is localized to the Add function and does not need to be kept in synchronization with anything else.)
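The variant dispatch plus runtime type check could look roughly like this toy sketch with only two element types (ElemPtr and add_dispatch are made-up names, and std::common_type_t stands in for cytnx's type_promote_t):

```cpp
#include <cstddef>
#include <type_traits>
#include <variant>

// Toy sketch only: two supported element types, illustrative names.
using ElemPtr = std::variant<int*, double*>;

// Visits all three variants together; returns false when the caller
// allocated `out` with a type that does not match the promoted type of
// lhs and rhs (the runtime contract check described above).
inline bool add_dispatch(ElemPtr out, ElemPtr lhs, ElemPtr rhs, std::size_t n) {
  return std::visit(
      [n](auto* o, auto* l, auto* r) {
        using TL = std::remove_pointer_t<decltype(l)>;
        using TR = std::remove_pointer_t<decltype(r)>;
        using TOut = std::remove_pointer_t<decltype(o)>;
        if (!std::is_same_v<TOut, std::common_type_t<TL, TR>>) return false;
        for (std::size_t i = 0; i < n; ++i) o[i] = static_cast<TOut>(l[i] + r[i]);
        return true;
      },
      out, lhs, rhs);
}
```

With cytnx's full dtype list the combination count grows quickly, which is the trade-off against folding AddInternalImpl into Add directly.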

Now, for the GPU version, I just noticed:

void cuAdd_internal_u32ti16(boost::intrusive_ptr<Storage_base> &out,
                            boost::intrusive_ptr<Storage_base> &Lin,
                            boost::intrusive_ptr<Storage_base> &Rin,
                            const unsigned long long &len,
                            const std::vector<cytnx_uint64> &shape,
                            const std::vector<cytnx_uint64> &invmapper_L,
                            const std::vector<cytnx_uint64> &invmapper_R) {
  cytnx_uint32 *_out = (cytnx_uint32 *)out->data();
  cytnx_uint32 *_Lin = (cytnx_uint32 *)Lin->data();
  cytnx_int16 *_Rin = (cytnx_int16 *)Rin->data();

ARGH: the output type is hard-coded to be uint32! This is different from what is given by type_promote(uint32, int16) and is just broken². So for the time being, I'd try

template <typename T>
Tensor Add(const T &lc, const Tensor &Rt) {
  // CPU path: use the templated AddInternalImpl directly, with the output type
  // determined by type_promote().
  if (Rt.device() == Device.cpu) {
    const auto lhs_type = cy_typeid_v<T>;
    const auto rhs_type = Rt.dtype();
    const auto out_type = cytnx::Scalar_init_interface::type_promote(lhs_type, rhs_type);

    // Construct constant tensor and output tensor with correct promoted dtype
    Tensor Cnst({1}, lhs_type);  // one element, rank 1 tensor
    Cnst.at<T>(0) = lc;
    Tensor out;
    out._impl = Rt._impl->_clone_meta_only();
    out._impl->storage() = Storage(Rt._impl->storage().size(), out_type, Rt.device());

    // Dispatch to fully-typed templated kernel
    cytnx::linalg_internal::AddInternalImpl(Cnst, Rt, out);

    return out;
  } else {
#ifdef UNI_GPU
    // GPU path: the existing cuAdd kernels assume the output dtype is
    // std::min(lhs_type, rhs_type), so we must match this until the kernels
    // are regenerated with correct type promotion semantics.
    const auto lhs_type = cy_typeid_v<T>;
    const auto rhs_type = Rt.dtype();
    const auto out_type = std::min(lhs_type, rhs_type);

    Storage Cnst(1, lhs_type);
    Cnst.at<T>(0) = lc;
    Tensor out;
    out._impl = Rt._impl->_clone_meta_only();
    out._impl->storage() = Storage(Rt._impl->storage().size(), out_type, Rt.device());

    checkCudaErrors(cudaSetDevice(Rt.device()));
    linalg_internal::lii.cuAri_ii[lhs_type][rhs_type](
      out._impl->storage()._impl, Cnst._impl, Rt._impl->storage()._impl,
      Rt._impl->storage()._impl->size(), {}, {}, {}, 0);

    return out;
#else
    cytnx_error_msg(true, "[Add] fatal error: GPU tensor without CUDA support.%s", "\n");
#endif
  }
}

There is nothing we can do about the GPU version; in order to call the existing linalg_internal::lii.cuAri_ii we must use the incorrect type given by min(lhs, rhs). Replacing the GPU linalg_internal::lii.cuAri_ii functions with type-safe templates should be done soon! And in the process, relying on Unified Memory to pass a single number into a CUDA kernel isn't great (paging a 4K block from the CPU to GPU) -- it would be much better to write a separate kernel to add a constant to a tensor, and pass the constant as a parameter to the kernel!
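A constant-add kernel that takes the scalar by value, instead of through a Unified Memory buffer, could be sketched roughly like this (illustrative only: cuAddConst_kernel and the launch configuration are not existing cytnx names):

```cuda
// Hypothetical kernel: the scalar travels in the kernel argument buffer,
// so no page migration is needed just to read one number.
template <typename TOut, typename TIn>
__global__ void cuAddConst_kernel(TOut* out, const TIn* in, TIn c,
                                  unsigned long long len) {
  unsigned long long i = blockIdx.x * (unsigned long long)blockDim.x + threadIdx.x;
  if (i < len) out[i] = static_cast<TOut>(in[i] + c);
}

// launch sketch:
// cuAddConst_kernel<<<(len + 255) / 256, 256>>>(out, in, c, len);
```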

It is also obvious that there are no unit tests covering tricky cases involving type promotions and mixed signed/unsigned etc. Someone should add some, with annotations marking which corner cases are currently expected to fail. For example, addition of a Tensor<uint32> and a Tensor<int16> will give Tensor<int32> if running on a CPU, but Tensor<uint32> if running on a GPU, whereas the correct result should be Tensor<int64>. Also, Tensor<complex<float>> + double should give Tensor<complex<double>> but currently gives Tensor<complex<float>>.
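A toy promotion rule pinning down just the cases named above could be a starting point for those tests (the DT enum and promote() here are illustrative only, not cytnx's Type API):

```cpp
// Illustrative-only sketch of the promotion semantics the tests should
// enforce; cytnx's real dtype system is richer than this.
enum class DT { Int16, Int32, Int64, Uint32, Float, Double, CplxFloat, CplxDouble };

// Handles only the two cases discussed above; everything else falls through.
inline DT promote(DT a, DT b) {
  auto pair_is = [&](DT x, DT y) { return (a == x && b == y) || (a == y && b == x); };
  // mixed signed/unsigned: next wider signed type that covers both ranges
  if (pair_is(DT::Uint32, DT::Int16)) return DT::Int64;
  // complex dominates, and its precision must cover the real operand
  if (pair_is(DT::CplxFloat, DT::Double)) return DT::CplxDouble;
  return a == b ? a : DT::Double;  // placeholder for the rest of the table
}
```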

Footnotes

  1. For some archaeology, my guess is that originally the code used PyTorch-style min(lhs, rhs) for type promotions, but PyTorch historically never had unsigned types (it does now, I think, though incompletely). When unsigned int was added, or it was realized that min(lhs, rhs) doesn't work properly with unsigned, the type_promote mechanism was added, but some of the kernels involving unsigned were never updated...

  2. Well, C/C++ is broken in the same way...

@IvanaGyro
Collaborator

> Not sure what happened in the GitHub action:

For HTTP error, it is fine to just rerun the failed test.

@jysh1214
Collaborator Author

jysh1214 commented Dec 6, 2025

> It is also obvious that there are no unit tests covering tricky cases involving type promotions and mixed signed/unsigned etc.

Yeah, you’re right, it’s a critical problem.
I’ve sent PR #730 for the mixed-type unit tests. Please help review it. Thx.

I will address the other issues later.

@ianmccul
Collaborator

@codex review

I think the best solution here is a larger patch that completely gets rid of the function dispatch tables and replaces it with templates. Then we are guaranteed that the result type of each operation agrees with the type_promote function (which could then be modified, potentially). This would also facilitate adding new types, eg low-precision GPU types, quad precision, etc. Doing it bit by bit is ultimately much more work.

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Swish!


@ianmccul
Collaborator

Well, @codex wasn't much use here. Why didn't it at least say something about the comment I made earlier?

@chatgpt-codex-connector

To use Codex here, create an environment for this repo.

@yingjerkao
Collaborator

yingjerkao commented Mar 30, 2026

Addressed Ian’s comment #2 in commit bc6f6d92.

What changed:

  • Refactored src/linalg/Add.cpp to remove per-type scalar overload boilerplate and use a single templated scalar+tensor path with std::visit dispatch on CPU.
  • Added explicit template instantiations for supported scalar types instead of hand-written specializations.
  • Added a shared output-dtype selector in Add:
    • CPU path now uses Type.type_promote(lhs, rhs).
    • GPU path intentionally keeps legacy min(lhs, rhs) behavior to stay compatible with current cuAri_ii kernels.
  • Added regression tests in tests/linalg_test/linalg_test.cpp:
    • Tensor_Add_mixed_dtype_type_promote_cpu
    • Tensor_Add_scalar_mixed_dtype_type_promote_cpu
      These fail under old min(lhs, rhs) CPU allocation and pass with this patch.
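The shared output-dtype selector described above might look roughly like this (select_out_dtype and Device are illustrative stand-ins, and the promote callback stands in for Type.type_promote; dtype ids are plain unsigned here, where in cytnx's historical ordering a smaller id means a "larger" type, which is why the legacy GPU rule is std::min):

```cpp
#include <algorithm>

enum class Device { Cpu, Gpu };

// Hypothetical sketch, not the actual commit's code: CPU gets correct
// promotion semantics, GPU keeps the legacy min() rule so it stays
// compatible with the current cuAri_ii kernels.
inline unsigned select_out_dtype(unsigned lhs, unsigned rhs, Device dev,
                                 unsigned (*promote)(unsigned, unsigned)) {
  if (dev == Device::Cpu) return promote(lhs, rhs);  // correct semantics
  return std::min(lhs, rhs);  // legacy rule matching current GPU kernels
}
```

Centralizing the choice in one helper keeps the CPU/GPU divergence explicit and makes it easy to delete once the GPU kernels are regenerated.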

Validation run:

  • cmake --build --preset openblas-cpu --target test_main -- -j8
  • ./build/openblas-cpu/tests/test_main --gtest_filter=DenseUniTensorTest.Add_UT_UT:DenseUniTensorTest.Add_UT_Scalar:DenseUniTensorTest.operatorAdd_UT_Scalar:BlockUniTensorTest.Add:linalg_Test.Tensor_Add_mixed_dtype_type_promote_cpu:linalg_Test.Tensor_Add_scalar_mixed_dtype_type_promote_cpu

All listed tests pass.

This gives the template-driven direction on Add and removes a large chunk of dispatch-like boilerplate. The remaining step for Ian’s full request is replacing the GPU lii.cuAri_ii tables with typed kernels (so GPU can also use type_promote semantics).

@ianmccul
Collaborator

@codex environment created

@chatgpt-codex-connector

Summary

  • Added a new regression test Tensor_Add_scalar_object_mixed_dtype_type_promote_cpu to cover the Scalar overloads of linalg::Add (Scalar + Tensor and Tensor + Scalar) for mixed dtypes on CPU. The test validates both promoted dtype selection (Type.type_promote) and numeric results. tests/linalg_test/linalg_test.cpp L344-L356
  • This complements the existing tensor/tensor and primitive-scalar mixed-dtype checks and helps guard against future regressions in Add dtype promotion behavior. tests/linalg_test/linalg_test.cpp L321-L342

Testing

  • ⚠️ cmake --preset openblas-cpu (failed in this environment because Boost CMake package files were not found: BoostConfig.cmake / boost-config.cmake).
  • ⚠️ cmake --build --preset openblas-cpu --target test_main -- -j8 (could not run because configure step did not complete, so build/openblas-cpu was not created).


@yingjerkao yingjerkao requested a review from chiamin March 31, 2026 02:09