Skip to content

Remove duplicate throughput call and change string literal check to allow negative offsets#120

Merged
JanLJL merged 4 commits into
RRZE-HPC:masterfrom
stefandesouza:master
Jan 23, 2026
Merged

Remove duplicate throughput call and change string literal check to allow negative offsets#120
JanLJL merged 4 commits into
RRZE-HPC:masterfrom
stefandesouza:master

Conversation

@stefandesouza
Copy link
Copy Markdown
Contributor

@stefandesouza stefandesouza commented Jan 19, 2026

The optimal throughput calculation is invoked twice during the inspect() routine which seems to be unecessary.

In the x86 Intel parser code, there seems to be a string literal check that should instead be a dictionary value conditional
if displacement and "operator_disp" == "-":
This results in instructions with negative memory offsets being parsed incorrectly, for example
"\tlea\trcx, OFFSET FLAT:??_R0M@8-4"

@codecov
Copy link
Copy Markdown

codecov Bot commented Jan 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.48%. Comparing base (60d725b) to head (e0d6273).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #120      +/-   ##
==========================================
- Coverage   90.48%   90.48%   -0.01%     
==========================================
  Files          25       25              
  Lines        3668     3667       -1     
==========================================
- Hits         3319     3318       -1     
  Misses        349      349              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@pleroy
Copy link
Copy Markdown
Contributor

pleroy commented Jan 19, 2026

Could we have a test for the parser change? Something that doesn't parse before and parses after.

@stefandesouza
Copy link
Copy Markdown
Contributor Author

stefandesouza commented Jan 20, 2026

Sure, that's fair. I added two tests of type identifier-offset and identifier+ -offset that fail to parse correctly before and successfully now.

@stefandesouza stefandesouza changed the title Remove duplicate throughput call and change string literal check Remove duplicate throughput call and change string literal check to allow negative offsets Jan 20, 2026
@pleroy
Copy link
Copy Markdown
Contributor

pleroy commented Jan 20, 2026

Thanks for adding the test.

I am a tad concerned by the removal of that call to assign_optimal_throughput. I think it would need a deeper analysis: that function adds semantic attributes to the instructions, and it's entirely conceivable that it would take two passes to do the full analysis.

I am not familiar with this part of the code, but I note that the second call was added by @JanLJL in commit 7724ce2 and the same commit added something call "multiple assignments". Do multiple assignments require multiples passes? Is the removal of the second call going to break Zen3?

Of course, if the repeated call was intentional, it would have been good to have a comment, but here we are. I hope that Jan remembers what is going on here.

@JanLJL JanLJL self-assigned this Jan 22, 2026
@JanLJL
Copy link
Copy Markdown
Collaborator

JanLJL commented Jan 22, 2026

Hi guys and sorry for not replying earlier!
The second assign_optimal_throughput was in fact there for a reason.
As OSACA increases or decreases the port pressure per instruction and without looking ahead, it might not get the optimal case in one go. A second run of (re-)distributing the throughput numbers is supposed to help balancing the overall throughput better and get to the global minima.
However, I tried various kernels in the past two days and could not find a single example in my code snippet folder, where the removal of the second assign run made a difference to the bottleneck, just some kernels where the port pressure of non-limiting ports (i.e., not performance relevant ports) changed by 0.01 cycles. So even though there was some thought behind it, I do not have a proof that it is still necessary for optimal port distribution. As for now, I would like to keep it in and do some more tests before removing it for good.

What Pascal mentioned with "multiple assignments" becomes relevant when we have different options for a port assigned, like an OR in the port mapping (contrary to the ANDs we usually see). This is written as dict with different (irrelevant) keys, e.g., smlal for A64FX in line 2600 of its machine model:

- name: [smlal, smlal2]
  operands:
  ...
  throughput: 8.0
  latency: 18.0
  port_pressure: {0: [[8, '0']], 1: [[8, '2']]}

This means, OSACA will schedule either full 8 cycles on port 0 or full 8 cycles on port 2.
This could be also modeled as port_pressure: [[16, '02']], but if there is already high pressure on one of the two ports this would effectively double the actual runtime and could lead to an incorrect shift of the bottleneck. However, while investigating this together with the two-times assign_optimal_throughput() call, I realized there was a bug in the way OSACA searches in a DFS-manner for the best solution, so thank you for bringing this up! I will commit this separately to this PR.

Finally, for the parsing of memory offsets, this looks very good to me, thank you for adding the test cases!
If it's OK for you and as mentioned above, I'd like to merge the parser changes and reject the throughput assignment change (for now).

@stefandesouza
Copy link
Copy Markdown
Contributor Author

Sure, please go ahead:)

@JanLJL JanLJL merged commit ae54403 into RRZE-HPC:master Jan 23, 2026
6 checks passed
JanLJL added a commit that referenced this pull request Jan 23, 2026
While I couldn't find an example where the second call is still necessary for better port balancing, we keep it in for now to check this more closely (see #120)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants