Closes #5089: Add a `balancedSegStringRetrieval` function (#5090)
Conversation
While this is now passing the CI, it does seem to be a little hungry on RAM, and I did hit an instantiation limit at one point (on my laptop, doing multi-dim). I don't think it actually fails.
e-kayrakli left a comment:
Looks good, with the caveat that the logic was a bit hard to follow. That's not to say you should change something in the logic, just that I don't think I can catch logical issues; as long as testing is clean, it should be good enough.
There are some inline comments that are mostly stylistic and performance-related. Otherwise, looks good!
`arkouda/numpy/strings.py` (outdated):

```python
        return Strings.from_return_msg(cast(str, rep_msg))

    @staticmethod
    def concatenate_uniquely2(strings: List[Strings]) -> Strings:
```
Could you find a more self-documenting name rather than the `2` suffix?
`src/ConcatenateMsg.chpl` (outdated):

```chapel
}
registerFunction("concatenateUniquely", concatenateUniqueStrMsg, getModuleName());

proc concatenateUniqueStrMsg2(cmd: string, msgArgs: borrowed MessageArgs, st: borrowed SymTab): MsgTuple throws {
```
ditto the name
`src/ConcatenateMsg.chpl` (outdated):

```chapel
// Reduce to unique strings within this locale
forall str in myStrings with (+ reduce locSet) {
  locSet.add(str);
}
```
This is an interesting pattern!
It looks like, in some tests at least, using `set(string, parSafe=true)` instead, and not doing things through the `+ reduce` intent, results in faster performance.
Credit goes to Shreyas for his quick test after I asked the general question to the Chapel team.
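For illustration, a minimal sketch of the two patterns side by side (the array and names here are hypothetical, not the PR's code):

```chapel
use Set;

var myStrings: [0..#6] string = ["a", "b", "a", "c", "b", "a"];

// Pattern from the PR: each task reduces into a private shadow set, and the
// per-task sets are merged with set union (+) when the forall ends.
var reduceSet: set(string);
forall str in myStrings with (+ reduce reduceSet) do
  reduceSet.add(str);

// Suggested alternative: a single shared, lock-protected set.
var parSafeSet: set(string, parSafe=true);
forall str in myStrings do
  parSafeSet.add(str);

writeln(reduceSet.size == parSafeSet.size);  // true: both hold {a, b, c}
```

The reduce-intent version gives each task a private set and merges them at the end; the `parSafe=true` version serializes access to one shared set, which apparently came out faster in the quick test mentioned above.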
```chapel
var firstOffsetPerLoc: [0..#numLocales] int = -1;

// First pass: for each locale, record its local ranges in the global bytes and offsets.
forall loc in 0..#numLocales {
```
Right now, this loop runs entirely on locale 0 (though in parallel). It looks like this could actually be done in parallel by locales simultaneously, if I am not mistaken. If that's the case, `coforall loc in Locales.domain do on Locales[loc] {` could be a better loop+on combo here.
I don't know how much this matters, since the loop should execute pretty quickly either way, but which way would these accesses be local?

`const currSubdomain = strBytes.localSubdomain(Locales[loc]);`

I may be splitting hairs, but if the domain data all lives on locale 0, then doing a `coforall` here adds in some remote accesses, right?
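A minimal sketch of the two loop shapes under discussion, with a hypothetical block-distributed byte array standing in for the real one:

```chapel
use BlockDist;

const D = blockDist.createDomain(0..#1024);
var strBytes: [D] uint(8);
var firstIdxPerLoc: [0..#numLocales] int;

// Current shape: parallel tasks that all run on locale 0; the subdomain
// queries reach out to each remote locale, but the writes stay local.
forall loc in 0..#numLocales do
  firstIdxPerLoc[loc] = strBytes.localSubdomain(Locales[loc]).low;

// Suggested shape: each iteration runs on its own locale, so the subdomain
// query is local, but the write into the locale-0 array becomes remote.
coforall loc in Locales.domain do on Locales[loc] do
  firstIdxPerLoc[loc] = strBytes.localSubdomain().low;
```

Which shape wins depends on whether the per-locale subdomain queries or the writes back into the locale-0 array dominate, which is the tradeoff being weighed here.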
```chapel
// Second pass: for each locale, determine where the strings that *start* on this
// locale end in the global byte buffer.
forall loc in 0..#numLocales {
```
ditto
All the data here lives on locale 0, and I could split up the arrays to have one value per locale, but it feels like that might incur some extra remote accesses. I don't know which way would be faster, though.
```chapel
// Record the global string index for the first string on locale i
firstStringIndices[i] = idx;
firstStringStartOffsets[i] = myOffsets[idx];
```
Are these not race conditions?
The `if` that these are wrapped inside of makes sure that they are not race conditions: each locale checks whether it knows what these values should be set to, and only one locale will see that it does.
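As a toy version of that guarded single-writer pattern (the predicate here is a hypothetical stand-in, not the PR's actual condition):

```chapel
// Hypothetical stand-in predicate: exactly one loc (loc == i) claims slot i.
proc ownsSlot(loc: int, i: int): bool { return loc == i; }

var firstWriter: [0..#numLocales] int = -1;
forall loc in 0..#numLocales {
  for i in 0..#numLocales {
    // The guard holds for exactly one loc per i, so every element has a
    // unique writer and the assignments cannot race.
    if ownsSlot(loc, i) then
      firstWriter[i] = loc;
  }
}
```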
```chapel
// Full size of the string in bytes
const size = endStr - myOffsets[idx];
firstStringSizes[i] = size;
```
Another potential race condition.
```chapel
// Portion of this string that resides on locale i's byte range
// (truncate end to not go past that locale's last byte + 1)
const endStrThisLoc = min(endStr, myLastIdxPerLoc[i] + 1);
firstStringNumBytesEachLocale[i] = endStrThisLoc - myFirstIdxPerLoc[i];
```
And another one.
These could be benign write-after-write hazards, though I still wanted to check.
```chapel
// Given a SegString (global offsets + global byte buffer), it figures out for each
// locale which *parts* of which strings intersect that locale's local chunk of bytes,
// and collects bookkeeping needed to rebuild those pieces into local strings.
proc balancedSegStringRetrieval(const ref strings: SegString):
```
Even if there is no reuse opportunity, would it be easy to chop this into multiple smaller functions for easier code maintenance?
I will take another look, but I'm not sure because a lot of the variables get used throughout...
```chapel
var strSizes: [0..#strArray.size] int;

// Compute size of each string in bytes, including the null terminator
forall (i, str) in zip(0..#strArray.size, strArray) {
```
Consider `forall (size, str) in zip(strSizes, strArray)`.
I think I need a value for `i`, so I changed it to a 3-long tuple.
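For reference, a small sketch of both zip shapes (the array contents are hypothetical):

```chapel
var strArray: [0..#3] string = ["ab", "c", "defg"];
var strSizes: [0..#strArray.size] int;

// Two-way zip, as suggested: no explicit indexing needed.
forall (size, str) in zip(strSizes, strArray) do
  size = str.numBytes + 1;  // +1 for the null terminator

// Three-way zip, for when the index i is still needed in the body.
forall (i, size, str) in zip(0..#strArray.size, strSizes, strArray) do
  size = str.numBytes + 1;
```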
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##           main    #5090   +/-  ##
======================================
  Coverage      ?  100.00%
======================================
  Files         ?        5
  Lines         ?      115
  Branches      ?        0
======================================
  Hits          ?      115
  Misses        ?        0
  Partials      ?        0
```
Follow-on ideas documented in #5401.
Previously it seemed like the way to divide bytes up appropriately in a SegString was to do something like this:
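A minimal sketch of that pattern, using hypothetical stand-ins (`segBytes`, `segOffsets`) for the SegString's distributed byte and offset arrays:

```chapel
use BlockDist;

// Hypothetical stand-ins for a SegString's two block-distributed arrays.
const nBytes = 1024, nStrings = 64;
var segBytes = blockDist.createArray(0..#nBytes, uint(8));
var segOffsets = blockDist.createArray(0..#nStrings, int);

coforall loc in Locales do on loc {
  // Each locale walks only the offsets stored locally...
  forall i in segOffsets.localSubdomain() {
    const start = segOffsets[i];
    const stop = if i < nStrings - 1 then segOffsets[i + 1] else nBytes;
    // ...but the bytes those offsets point at may live on other locales,
    // so this copy can trigger remote transfers.
    var myString: [0..#(stop - start)] uint(8) = segBytes[start..<stop];
  }
}
```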
As you can see, each locale is just grabbing the strings corresponding to the offsets that are local. This is rather easy to do, but maybe not the best way to do things.
This PR makes it easy to do things a little bit differently: try to grab mostly the bytes that are already on the current locale.
It also turns everything into an `innerArray` of strings; you can see how simple this makes the `concatenateUniqueStrMsg2` function. Also added `segStringFromInnerArray` to convert back into a `SegString` easily.

The way that it figures out which strings end up on which locales is as follows: if a string is entirely on one locale, then it belongs to that locale. If it overlaps two or more locales, then the locale that ends up owning it is the one with the most bytes of it. If there is a tie, then of the locales with the most bytes of that string, the one with the lowest ID gets it.
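As a toy version of that ownership rule (the function and array here are illustrative, not the PR's actual code):

```chapel
// bytesOnLocale[l] = how many bytes of one string land on locale l.
proc owningLocale(bytesOnLocale: [] int): int {
  var best = bytesOnLocale.domain.low;
  for l in bytesOnLocale.domain do
    // A strict '>' means a tie keeps the earlier (lowest-ID) locale.
    if bytesOnLocale[l] > bytesOnLocale[best] then best = l;
  return best;
}

writeln(owningLocale([0, 5, 5, 2]));  // the tie between the two 5s goes low
```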
I have done some benchmarking on this PR and I found that an `innerArray` of strings was bad. I think this is largely because strings are probably objects which have a pointer to the bytes, along with some other useful things for the object. So when you try to bulk transfer an array of strings, it can't just transfer the bytes smoothly. So I reverted back to `innerArray`s of `uint(8)` bytes and `int(64)` offsets.

I also added my own hashing function. The reason for this is that I can pass in the array, where to start hashing, and where to stop. I think this is not insanely better than using `borrow=true` in `interpretAsString`, but there's some additional code surrounding that, so maybe it helps performance just a bit. But who knows, I could be wrong.
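To illustrate the kind of signature being described, a minimal range-bounded hash sketch, assuming FNV-1a (the PR's actual hash function may differ):

```chapel
// Hash only bytes[start..stop], without materializing a substring first.
proc hashRange(const ref bytes: [] uint(8), start: int, stop: int): uint(64) {
  var h = 0xcbf29ce484222325: uint(64);     // FNV-1a offset basis
  for i in start..stop {
    h ^= bytes[i]: uint(64);                // fold in one byte
    h *= 0x100000001b3: uint(64);           // FNV-1a prime
  }
  return h;
}

var buf: [0..#8] uint(8) = [104, 105, 32, 116, 104, 101, 114, 101]: uint(8);
writeln(hashRange(buf, 2, 5));              // hashes bytes 2 through 5 only
```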
I also changed it so that rather than doing something like `hash % numLocales`, it does `((hash >> 32) * numLocales) >> 32`. The reason is that modulo relies on division, which can be slow; a multiply is fast and this basically gets the job done. Roughly speaking, this is almost equivalent to `(hash * numLocales) >> 64`, except that that version would likely overflow the 64-bit size. So we sacrifice a tiny bit of precision for an easier way to do the calculation.
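A quick sanity check of that mapping, assuming 64-bit hashes and a locale count far below 2^32:

```chapel
proc localeFor(hash: uint(64), nLoc: uint(64)): uint(64) {
  // Take the high 32 bits of the hash, scale by the locale count, then take
  // the high 32 bits of the product: a division-free stand-in for hash % nLoc.
  return ((hash >> 32) * nLoc) >> 32;
}

// (hash >> 32) < 2**32, so the product is < nLoc * 2**32 and the result
// always lands in 0..nLoc-1.
writeln(localeFor(0xDEADBEEFCAFEF00D, 4));
writeln(localeFor(0x0123456789ABCDEF, 4));
```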
I opted not to use `balancedSegStringRetrieval` in the `concatenate_uniquely` code because I'm ultimately not really sure how much time it actually saves. That being said, I think the idea is useful. Essentially, it lends itself really well to cutting out the initial bulk transfer into `innerArray`s: just bulk transfer the surrounding bytes from other locales (according to the same rules as `balancedSegStringRetrieval`), hash, and then bulk transfer out of the local slice to other locales into their `innerArray`s.

Performance stacks up surprisingly well (at least to me) against `unique(concatenate(...))`. For smaller quantities of strings, `uniquely_concatenate` wins. At a certain point, `uniquely_concatenate` starts losing, but it takes a while to get there. Moreover, I documented some ideas for improving performance in #5401 that may help get performance up there.

Closes #5089: Add a `balancedSegStringRetrieval` function