Closes #5089: Add a `balancedSegStringRetrieval` function (#5090)
Conversation
While this is now passing the CI, it does seem to be a little hungry on RAM, and I did hit an instantiation limit at one point (on my laptop, doing multi-dim). I don't think it actually fails.
e-kayrakli left a comment:
Looks good, with the caveat that the logic was a bit hard to follow. That's not to say you should change something in the logic, just that I don't think I can catch logical issues; as long as testing is clean, it should be good enough.
There are some inline comments that are mostly stylistic and performance-related. Otherwise, looks good!
`arkouda/numpy/strings.py` (outdated):

```python
        return Strings.from_return_msg(cast(str, rep_msg))

    @staticmethod
    def concatenate_uniquely2(strings: List[Strings]) -> Strings:
```
Could you find a more self-documenting name rather than the `2` suffix?
`src/ConcatenateMsg.chpl` (outdated):

```chapel
}
registerFunction("concatenateUniquely", concatenateUniqueStrMsg, getModuleName());

proc concatenateUniqueStrMsg2(cmd: string, msgArgs: borrowed MessageArgs, st: borrowed SymTab): MsgTuple throws {
```
ditto the name
`src/ConcatenateMsg.chpl` (outdated):

```chapel
// Reduce to unique strings within this locale
forall str in myStrings with (+ reduce locSet) {
  locSet.add(str);
}
```
This is an interesting pattern!
It looks like, in some tests at least, using `set(string, parSafe=true)` instead, and not doing things through the `+ reduce` intent, results in faster performance.
Credit goes to Shreyas for his quick test after I asked the general question to the Chapel team.
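For illustration, a minimal sketch of the two patterns side by side (the array and names here are hypothetical, not the PR's code):

```chapel
use Set;

var myStrings: [0..#6] string = ["a", "b", "a", "c", "b", "a"];

// Pattern from the PR: each task reduces into a private shadow set, and the
// per-task sets are merged with set union (+) when the forall ends.
var reduceSet: set(string);
forall str in myStrings with (+ reduce reduceSet) do
  reduceSet.add(str);

// Suggested alternative: a single shared, lock-protected set.
var parSafeSet: set(string, parSafe=true);
forall str in myStrings do
  parSafeSet.add(str);

writeln(reduceSet.size == parSafeSet.size);  // true: both hold {a, b, c}
```

The reduce-intent version gives each task a private set and merges them at the end; the `parSafe=true` version serializes access to one shared set, which apparently came out faster in the quick test mentioned above.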
```chapel
var firstOffsetPerLoc: [0..#numLocales] int = -1;

// First pass: for each locale, record its local ranges in the global bytes and offsets.
forall loc in 0..#numLocales {
```
Right now, this loop runs entirely on locale 0 (though in parallel). It looks like this could actually be done in parallel by locales simultaneously, if I am not mistaken. If that's the case, `coforall loc in Locales.domain do on Locales[loc] {` could be a better loop+on combo here.
I don't know how much this matters, since the loop should execute pretty quickly either way, but which way would these accesses be local?

`const currSubdomain = strBytes.localSubdomain(Locales[loc]);`

I may be splitting hairs, but if the domain data all lives on locale 0, then doing a `coforall` here adds in some remote accesses, right?
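A minimal sketch of the two loop shapes under discussion, with a hypothetical block-distributed byte array standing in for the real one:

```chapel
use BlockDist;

const D = blockDist.createDomain(0..#1024);
var strBytes: [D] uint(8);
var firstIdxPerLoc: [0..#numLocales] int;

// Current shape: parallel tasks that all run on locale 0; the subdomain
// queries reach out to each remote locale, but the writes stay local.
forall loc in 0..#numLocales do
  firstIdxPerLoc[loc] = strBytes.localSubdomain(Locales[loc]).low;

// Suggested shape: each iteration runs on its own locale, so the subdomain
// query is local, but the write into the locale-0 array becomes remote.
coforall loc in Locales.domain do on Locales[loc] do
  firstIdxPerLoc[loc] = strBytes.localSubdomain().low;
```

Which shape wins depends on whether the per-locale subdomain queries or the writes back into the locale-0 array dominate, which is the tradeoff being weighed here.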
```chapel
// Second pass: for each locale, determine where the strings that *start* on this
// locale end in the global byte buffer.
forall loc in 0..#numLocales {
```
ditto
All the data here lives on locale 0, and I could split up the arrays to have one value per locale, but it feels like that might incur some extra remote accesses. I don't know which way would be faster, though.
```chapel
// Record the global string index for the first string on locale i
firstStringIndices[i] = idx;
firstStringStartOffsets[i] = myOffsets[idx];
```
Are these not race conditions?
The `if` that these are wrapped inside of makes sure that they are not race conditions: each locale checks whether it knows what these values should be set to, and only one locale will see that it does.
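As a toy version of that guarded single-writer pattern (the predicate here is a hypothetical stand-in, not the PR's actual condition):

```chapel
// Hypothetical stand-in predicate: exactly one loc (loc == i) claims slot i.
proc ownsSlot(loc: int, i: int): bool { return loc == i; }

var firstWriter: [0..#numLocales] int = -1;
forall loc in 0..#numLocales {
  for i in 0..#numLocales {
    // The guard holds for exactly one loc per i, so every element has a
    // unique writer and the assignments cannot race.
    if ownsSlot(loc, i) then
      firstWriter[i] = loc;
  }
}
```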
```chapel
// Full size of the string in bytes
const size = endStr - myOffsets[idx];
firstStringSizes[i] = size;
```
Another potential race condition.
```chapel
// Portion of this string that resides on locale i's byte range
// (truncate end to not go past that locale's last byte + 1)
const endStrThisLoc = min(endStr, myLastIdxPerLoc[i] + 1);
firstStringNumBytesEachLocale[i] = endStrThisLoc - myFirstIdxPerLoc[i];
```
And another one.
These could be benign write-after-write hazards, though I still wanted to check.
```chapel
// Given a SegString (global offsets + global byte buffer), it figures out for each
// locale which *parts* of which strings intersect that locale's local chunk of bytes,
// and collects bookkeeping needed to rebuild those pieces into local strings.
proc balancedSegStringRetrieval(const ref strings: SegString):
```
Even if there is no reuse opportunity, would it be easy to chop this into multiple smaller functions for easier code maintenance?
I will take another look, but I'm not sure because a lot of the variables get used throughout...
```chapel
var strSizes: [0..#strArray.size] int;

// Compute size of each string in bytes, including the null terminator
forall (i, str) in zip(0..#strArray.size, strArray) {
```
Consider `forall (size, str) in zip(strSizes, strArray)`.
I think I need a value for `i`, so I changed it to a 3-long tuple.
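For reference, a small sketch of both zip shapes (the array contents are hypothetical):

```chapel
var strArray: [0..#3] string = ["ab", "c", "defg"];
var strSizes: [0..#strArray.size] int;

// Two-way zip, as suggested: no explicit indexing needed.
forall (size, str) in zip(strSizes, strArray) do
  size = str.numBytes + 1;  // +1 for the null terminator

// Three-way zip, for when the index i is still needed in the body.
forall (i, size, str) in zip(0..#strArray.size, strSizes, strArray) do
  size = str.numBytes + 1;
```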
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##           main    #5090   +/-  ##
======================================
  Coverage      ?  100.00%
======================================
  Files         ?        5
  Lines         ?      115
  Branches      ?        0
======================================
  Hits          ?      115
  Misses        ?        0
  Partials      ?        0
```
Follow-on ideas documented in #5401.
Previously it seemed like the way to divide bytes up appropriately in a SegString was to do something like this:
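A minimal sketch of that pattern, using hypothetical stand-ins (`segBytes`, `segOffsets`) for the SegString's distributed byte and offset arrays:

```chapel
use BlockDist;

// Hypothetical stand-ins for a SegString's two block-distributed arrays.
const nBytes = 1024, nStrings = 64;
var segBytes = blockDist.createArray(0..#nBytes, uint(8));
var segOffsets = blockDist.createArray(0..#nStrings, int);

coforall loc in Locales do on loc {
  // Each locale walks only the offsets stored locally...
  forall i in segOffsets.localSubdomain() {
    const start = segOffsets[i];
    const stop = if i < nStrings - 1 then segOffsets[i + 1] else nBytes;
    // ...but the bytes those offsets point at may live on other locales,
    // so this copy can trigger remote transfers.
    var myString: [0..#(stop - start)] uint(8) = segBytes[start..<stop];
  }
}
```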
As you can see, each locale is just grabbing the strings corresponding to the offsets that are local. This is rather easy to do, but maybe not the best way to do things.
This PR makes it easy to do things a little bit differently: try to grab mostly the bytes that are already on the current locale.
It also turns everything into an `innerArray` of strings; you can see how simple this makes the `concatenateUniqueStrMsg2` function. Also added `segStringFromInnerArray` to convert back into a `SegString` easily.

The way that it figures out which strings end up on which locales is as follows: if a string is entirely on one locale, then it belongs to that locale. If it overlaps two or more locales, then the locale that ends up owning it is the one with the most bytes of it. If there is a tie, then of the locales with the most bytes of that string, the one with the lowest ID gets it.
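As a toy version of that ownership rule (the function and array here are illustrative, not the PR's actual code):

```chapel
// bytesOnLocale[l] = how many bytes of one string land on locale l.
proc owningLocale(bytesOnLocale: [] int): int {
  var best = bytesOnLocale.domain.low;
  for l in bytesOnLocale.domain do
    // A strict '>' means a tie keeps the earlier (lowest-ID) locale.
    if bytesOnLocale[l] > bytesOnLocale[best] then best = l;
  return best;
}

writeln(owningLocale([0, 5, 5, 2]));  // the tie between the two 5s goes low
```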
I have done some benchmarking on this PR and I found that an `innerArray` of strings was bad. I think this is largely because strings are probably objects which have a pointer to the bytes, along with some other useful things for the object. So when you try to bulk transfer an array of strings, it can't just transfer the bytes smoothly. So I reverted back to `innerArray`s of `uint(8)` bytes and `int(64)` offsets.

I also added my own hashing function. The reason for this is that I can pass in the array, where to start hashing, and where to stop. I think this is not insanely better than using `borrow=true` in `interpretAsString`, but there's some additional code surrounding that, so maybe it helps performance just a bit. But who knows, I could be wrong.
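To illustrate the kind of signature being described, a minimal range-bounded hash sketch, assuming FNV-1a (the PR's actual hash function may differ):

```chapel
// Hash only bytes[start..stop], without materializing a substring first.
proc hashRange(const ref bytes: [] uint(8), start: int, stop: int): uint(64) {
  var h = 0xcbf29ce484222325: uint(64);     // FNV-1a offset basis
  for i in start..stop {
    h ^= bytes[i]: uint(64);                // fold in one byte
    h *= 0x100000001b3: uint(64);           // FNV-1a prime
  }
  return h;
}

var buf: [0..#8] uint(8) = [104, 105, 32, 116, 104, 101, 114, 101]: uint(8);
writeln(hashRange(buf, 2, 5));              // hashes bytes 2 through 5 only
```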
I also changed it so that rather than doing something like `hash % numLocales`, it does `((hash >> 32) * numLocales) >> 32`. The reason is that modulo relies on division, which can be slow; a multiply is fast and this basically gets the job done. Roughly speaking, this is almost equivalent to `(hash * numLocales) >> 64`, except that that version would likely overflow the 64-bit size. So we sacrifice a tiny bit of precision for an easier way to do the calculation.
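A quick sanity check of that mapping, assuming 64-bit hashes and a locale count far below 2^32:

```chapel
proc localeFor(hash: uint(64), nLoc: uint(64)): uint(64) {
  // Take the high 32 bits of the hash, scale by the locale count, then take
  // the high 32 bits of the product: a division-free stand-in for hash % nLoc.
  return ((hash >> 32) * nLoc) >> 32;
}

// (hash >> 32) < 2**32, so the product is < nLoc * 2**32 and the result
// always lands in 0..nLoc-1.
writeln(localeFor(0xDEADBEEFCAFEF00D, 4));
writeln(localeFor(0x0123456789ABCDEF, 4));
```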
I opted not to use `balancedSegStringRetrieval` in the `concatenate_uniquely` code because I'm ultimately not really sure how much time it actually saves. That being said, I think the idea is useful. Essentially, it lends itself really well to cutting out the initial bulk transfer into `innerArray`s: just bulk transfer the surrounding bytes from other locales (according to the same rules as `balancedSegStringRetrieval`), hash, and then bulk transfer out of the local slice to other locales into their `innerArray`s.

Performance stacks up surprisingly well (at least to me) against `unique(concatenate(...))`. For smaller quantities of strings, `uniquely_concatenate` wins. At a certain point, `uniquely_concatenate` starts losing, but it takes a while to get there. Moreover, I documented some ideas for improving performance in #5401 that may help get performance up there.

Closes #5089: Add a `balancedSegStringRetrieval` function