bazel/macos - fix linker problems, swig and regression for macos#9480
Conversation
There was a problem hiding this comment.
Code Review
This pull request successfully migrates the Tcl dependency from rules_hdl to the Bazel Central Registry (BCR), a significant step in modernizing the build system. It also introduces several fixes to enable building on macOS, including necessary linker flags for Python extensions and compiler definitions. The changes are extensive but largely consist of mechanical updates to BUILD files, replacing the old Tcl dependency. Additionally, a test script has been updated for better cross-platform compatibility. My review includes a suggestion to further improve the robustness of this test script's file handling.
| set f0 [open $out_def r] | ||
| set def0 [read $f0] | ||
| close $f0 | ||
|
|
||
| set f1 [open $unzipped_def r] | ||
| set def1 [read $f1] | ||
| close $f1 |
There was a problem hiding this comment.
The change to compare file contents directly is great for cross-platform compatibility. However, the file handling can be made more robust. If an error occurs during the read command, the corresponding close command will not be executed, leading to a resource leak. It's better to use try...finally blocks to ensure files are always closed.
set f0 [open $out_def r]
try {
set def0 [read $f0]
} finally {
close $f0
}
set f1 [open $unzipped_def r]
try {
set def1 [read $f1]
} finally {
close $f1
}
|
Thanks for contributing. Commits need to be signed to pass DCO (use |
|
clang-tidy review says "All clean, LGTM! 👍" |
65ed7ae to
5a7349e
Compare
I pushed the signed versions. Thanks for working on OpenROAD! |
|
Thanks for contributing! I think last time I checked, the BCR tcl version does not work fully as it fails to find its The BCR version needs to be changed to find these, and I am playing with some options how to do that (mostly involving resolving runfiles) to then upstream to BCR. Maybe we should await that fix (but I will not get to it before the weekend). Maybe the issues you see in the failures can be explained with that ? |
|
I was a little surprised about the regression differences. The lowest hanging fruit to make arm behave like x86 is: This also works the other way around: |
|
Maybe also some rounding or stdlib differences ? |
|
clang-tidy review says "All clean, LGTM! 👍" |
Thanks for looking into this build. I had these problems with missing tcl runtimes like init.tcl on MacOS but i fixed them. The runtime files are adressed in
Thanks for looking at my patch! I had the missing runtime files like init.tcl also in my build. The runtimes are provided with the changes in bazel/BUILD with the data section and bazel/InitRunFiles.cpp. I looked at some of the tests and with something missing like init.tcl more or less all tests fail. |
I work in parallel on building OpenROAD with bazel on debian 13 in an arm64 virtual machine on my Macbook. There the regression fails for two tests "//src/psm/test:gcd_all_vss-tcl_test" and "//src/psm/test:gcd_em_test_vdd-tcl_test" out of 1155. So there are less tests failing and something must be different. |
|
I now used the bazel-macos branch to build on debian 13 aarch64 in a virtual machine on my macbook. Without bazel-macos changes, two tests fail on linux. With the changes from bazel-macos that affect all builds like the tcl from bazel repository the same two tests are failing on linux. So it must be something macos specific and not the parts that affect all builds that result in regression differences. The failed builds on debian 13 are: |
|
since macOS is using clang, could you try this ?
|
| @@ -55,7 +55,7 @@ class BazelInitializer | |||
| } | |||
|
|
|||
| // Set the TCL_LIBRARY environment variable | |||
| const std::string tcl_path = runfiles->Rlocation("tk_tcl/library/"); | |||
| const std::string tcl_path = runfiles->Rlocation("tcl_lang+/library/"); | |||
There was a problem hiding this comment.
the + here looks like an artifact to match how bazel 8 happens to name the directories.
I tested locally and the canonical tcl_lang/library/ also seems to work, so that is probably more robust to use.
|
Thanks, yes the fix in the runfile section in The PR also touches submodules ( Some changes (like the MacOS refinements) are somewhat independent of the tcl change. The following is a suggestion (I am not owner of this project, so I can only recommend, @maliberty is the person to ask) how to split this into smaller changes: Change to tcl in BCRIn this first change, add the tcl from BCR, but under the old name , remove the Change the
|
5a7349e to
f6157c0
Compare
|
I made a new PR #9490 which only replaces the tcl library from rules_hdl to the bcr version. This needs changes in the OpenSTA submodule. The PR there is The-OpenROAD-Project/OpenSTA#290. I will adapt the PR here with the MacOS specific changes. |
I tried that but it made no difference. I have the same 61 failing regression test. |
f6157c0 to
00943ad
Compare
|
@stefanottili - Here is some analysis why MacOS and Linux builds produce different results which result in regression test failures. With the fixes openroad produces identical results on MacOS and Linux. Linux only tested on debian trixie aarch64. This is from claude code. Platform Compatibility: macOS vs Linux Regression Test DeterminismOverviewWhen running OpenROAD regression tests on macOS (arm64) with a hermetic LLVM Test Environment
Results Before and After Fixes
The remaining 52 shared failures are due to the golden files being generated on Root Causes1. System Math Library (libm) DifferencesImpact: ~44 GPL tests, 2 PSM tests, 3 RSZ tests, 2 GRT tests, 1 PDN test The hermetic LLVM toolchain provides the compiler but links against the These tiny differences are amplified through:
Fix: Integrate OpenLibm (v0.8.7), Files:
2. Apple's
|
| Seed | glibc first values | Apple libc first values |
|---|---|---|
| 42 | 71876166, 708592740, ... | 705894, 1126542223, ... |
| 1 | 1804289383, 846930886, ... | 16807, 282475249, ... |
glibc uses a TYPE_3 trinomial feedback shift register with 31 words of
state, while Apple uses a simpler algorithm. This affects:
- ABC logic synthesis (64+ source files use
rand()) — different random
simulation patterns in fraig, different optimization paths - MBFF clustering (
src/gpl/src/mbff.cpp:1326,1347) —std::rand()used
for K-Means++ initialization and cluster assignment - Nesterov placement (
src/gpl/src/nesterovBase.cpp:1914) — random offsets
during cell placement
Fix: Two-part fix:
- Portable
rand()/srand()— a drop-in replacement implementing glibc's
exact TYPE_3 algorithm (31-word state, separation 3, 310 warmup iterations).
Compiled withalwayslink = Trueso the symbols override systemrandin
the final binary. - Replace
rand()withstd::mt19937innesterovBase.cpp— the
Mersenne Twister is specified by the C++ standard and produces identical
sequences on all platforms.
Files:
bazel/openlibm/portable_rand.c— glibc TYPE_3 rand/srand/rand_rbazel/openlibm/BUILD— cc_library withalwayslink = TrueBUILD.bazel— added//bazel/openlibm:portable_randdependencysrc/gpl/src/nesterovBase.cpp— replacedsrand(42); rand()with
std::mt19937 offsetRng(42)
4. qsort() Ordering Differences for Equal Elements
Impact: 8 RMP tests, 1 PDN test, 1 GPL clustering test, 1 MPL test
The C standard does not guarantee qsort() stability — when elements compare
equal, their relative order is implementation-defined. glibc uses a
merge sort (which is stable), while Apple's libc uses a different algorithm
that produces different orderings:
Input: {5,0} {3,1} {3,2} {1,3} {4,4} {2,5} {3,6} {5,7} {1,8}
glibc: (1,3) (1,8) (2,5) (3,1) (3,2) (3,6) (4,4) (5,0) (5,7)
Apple: (1,3) (1,8) (2,5) (3,6) (3,1) (3,2) (4,4) (5,7) (5,0)
^^^^^^^^^^^ ^^^^^^^^^^^
ABC has 90+ qsort() calls. Different orderings of equal elements lead to
different optimization paths in fraig, rewriting, and technology mapping,
producing different gate counts and circuit structures.
Fix: Provide a portable qsort() implementation matching glibc's merge
sort algorithm on macOS. This is compiled only on macOS (Linux already uses
glibc's qsort natively).
Files:
bazel/openlibm/portable_qsort.c— merge sort with insertion sort fallbackbazel/openlibm/BUILD— included viaselect({"@platforms//os:macos": ...})
5. -ffp-contract=off (Pre-existing)
Impact: FFT-based placement tests
FP contraction allows the compiler to fuse multiply-add operations into FMA
instructions, which can produce slightly different results. This flag was
already set in .bazelrc before this work.
Files:
.bazelrc:29—build --cxxopt "-ffp-contract=off"
Summary of All Changes
| File | Change |
|---|---|
MODULE.bazel |
Added OpenLibm http_archive (v0.8.7) |
BUILD.bazel |
Added @openlibm and //bazel/openlibm:portable_rand deps |
bazel/openlibm/bundled.BUILD.bazel |
Full BUILD for OpenLibm with platform selects |
bazel/openlibm/BUILD |
Package with portable_rand and portable_qsort targets |
bazel/openlibm/portable_rand.c |
glibc TYPE_3 rand/srand implementation |
bazel/openlibm/portable_qsort.c |
glibc merge-sort qsort (macOS only) |
.bazelrc |
-fno-builtin (global), -fmath-errno (macOS) |
src/gpl/src/nesterovBase.cpp |
rand() replaced with std::mt19937(42) |
Remaining Failures
Shared Failures (52 tests, both macOS and Linux)
These fail on aarch64 because golden files were generated on x86_64 Linux. The
nesterovBase.cpp change from rand() to mt19937 also changes placement
results vs the x86_64 golden files. These require golden file regeneration.
GPL — Global Placement (45 tests)
| Test | Type |
|---|---|
//src/gpl/test:ar01 |
DEF comparison |
//src/gpl/test:ar01-tcl_test |
Log comparison |
//src/gpl/test:ar02 |
DEF comparison |
//src/gpl/test:ar02-tcl_test |
Log comparison |
//src/gpl/test:cluster_place01-tcl_test |
Log comparison |
//src/gpl/test:convergence01 |
DEF comparison |
//src/gpl/test:convergence01-tcl_test |
Log comparison |
//src/gpl/test:core01 |
DEF comparison |
//src/gpl/test:core01-tcl_test |
Log comparison |
//src/gpl/test:diverge01-tcl_test |
Log comparison |
//src/gpl/test:error01-tcl_test |
Log comparison |
//src/gpl/test:incremental01 |
DEF comparison |
//src/gpl/test:incremental01-tcl_test |
Log comparison |
//src/gpl/test:nograd01 |
DEF comparison |
//src/gpl/test:nograd01-tcl_test |
Log comparison |
//src/gpl/test:simple01 |
DEF comparison |
//src/gpl/test:simple01-obs |
DEF comparison |
//src/gpl/test:simple01-obs-tcl_test |
Log comparison |
//src/gpl/test:simple01-rd-tcl_test |
Log comparison |
//src/gpl/test:simple01-ref |
DEF comparison |
//src/gpl/test:simple01-ref-tcl_test |
Log comparison |
//src/gpl/test:simple01-skip-io |
DEF comparison |
//src/gpl/test:simple01-skip-io-tcl_test |
Log comparison |
//src/gpl/test:simple01-tcl_test |
Log comparison |
//src/gpl/test:simple01-td |
DEF comparison |
//src/gpl/test:simple01-td-tcl_test |
Log comparison |
//src/gpl/test:simple01-td-tune |
DEF comparison |
//src/gpl/test:simple01-td-tune-tcl_test |
Log comparison |
//src/gpl/test:simple01-uniform |
DEF comparison |
//src/gpl/test:simple01-uniform-tcl_test |
Log comparison |
//src/gpl/test:simple02 |
DEF comparison |
//src/gpl/test:simple02-rd-tcl_test |
Log comparison |
//src/gpl/test:simple02-tcl_test |
Log comparison |
//src/gpl/test:simple03 |
DEF comparison |
//src/gpl/test:simple03-rd-tcl_test |
Log comparison |
//src/gpl/test:simple03-tcl_test |
Log comparison |
//src/gpl/test:simple04 |
DEF comparison |
//src/gpl/test:simple04-rd-tcl_test |
Log comparison |
//src/gpl/test:simple04-tcl_test |
Log comparison |
//src/gpl/test:simple07 |
DEF comparison |
//src/gpl/test:simple07-tcl_test |
Log comparison |
//src/gpl/test:simple08 |
DEF comparison |
//src/gpl/test:simple08-tcl_test |
Log comparison |
//src/gpl/test:simple10 |
DEF comparison |
//src/gpl/test:simple10-tcl_test |
Log comparison |
GRT — Global Routing (2 tests)
| Test | Type |
|---|---|
//src/grt/test:congestion2-tcl_test |
Log comparison |
//src/grt/test:congestion5-tcl_test |
Log comparison |
PSM — Power (2 tests)
| Test | Type |
|---|---|
//src/psm/test:gcd_all_vss-tcl_test |
Log comparison |
//src/psm/test:gcd_em_test_vdd-tcl_test |
Log comparison |
RSZ — Resizer (3 tests)
| Test | Type |
|---|---|
//src/rsz/test:buffer_ports10-tcl_test |
Log comparison |
//src/rsz/test:buffer_ports8-tcl_test |
Log comparison |
//src/rsz/test:buffer_ports9-tcl_test |
Log comparison |
macOS-Only Failure (1 test)
| Test | Root Cause |
|---|---|
//src/utl/test:utl_cfile_unittest |
Utl.metrics_server_responds_with_basic_metric fails with "shutdown: Socket is not connected". This is a macOS socket behavior difference (not numerical). |
Tests Fixed by This Work (previously macOS-only, now passing)
| Test | Root Cause | Fix |
|---|---|---|
//src/gpl/test:clust02-tcl_test |
rand(), qsort() |
portable_rand, portable_qsort |
//src/mpl/test:fixed_macros2-tcl_test |
libm, qsort() |
OpenLibm, portable_qsort |
//src/pdn/test:pads_black_parrot_flipchip_connect_overpads-tcl_test |
qsort() |
portable_qsort |
//src/rmp/test:aes_annealing-tcl_test |
qsort() |
portable_qsort |
//src/rmp/test:aes_asap7-tcl_test |
qsort() |
portable_qsort |
//src/rmp/test:aes_dontuse_nangate45-tcl_test |
qsort() |
portable_qsort |
//src/rmp/test:aes_nangate45-tcl_test |
qsort() |
portable_qsort |
//src/rmp/test:const_cell_removal-tcl_test |
qsort() |
portable_qsort |
//src/rmp/test:gcd_annealing2-tcl_test |
qsort() |
portable_qsort |
//src/rmp/test:gcd_asap7-tcl_test |
qsort() |
portable_qsort |
//src/rmp/test:gcd_restructure-tcl_test |
qsort() |
portable_qsort |
Why -ffp-contract=off Alone Was Not Sufficient
FP contraction (-ffp-contract) controls whether the compiler fuses
multiply-add operations into FMA instructions. While this is important for
reproducibility, it only affects compiler-level transformations. The
dominant sources of cross-platform divergence are:
- C library function implementations —
sin(),cos(),exp()etc.
return different results at ULP level between glibc and Apple libm. No
compiler flag changes this. - C library
rand()algorithm — completely different PRNG algorithms
between glibc and Apple libc. - C library
qsort()stability — different sort algorithms produce
different orderings for equal elements.
These require library-level fixes (OpenLibm, portable_rand,
portable_qsort), not compiler flags.
Architecture
┌─────────────────────┐
│ openroad binary │
└─────┬───────────────┘
│ deps
┌─────────────┼──────────────────┐
│ │ │
┌────────▼───────┐ ┌─▼──────────┐ ┌───▼────────────┐
│ OpenLibm │ │ portable │ │ portable │
│ (all platforms)│ │ rand │ │ qsort │
│ │ │ (all plat) │ │ (macOS only) │
│ sin,cos,exp, │ │ │ │ │
│ log,pow,sqrt.. │ │ rand() │ │ qsort() │
│ │ │ srand() │ │ qsort_r() │
│ alwayslink=True│ │ rand_r() │ │ │
└────────────────┘ │ │ │ alwayslink= │
│ alwayslink=│ │ True │
│ True │ └───────────────┘
└────────────┘
All three libraries use alwayslink = True which forces the linker to load
all symbols from the static archive into the final binary, overriding the
corresponding system library functions.
2cdcb25 to
a443776
Compare
|
Thanks for your extensive investigation! The changes from |
|
Thanks for this work! Quick question, we've talked about this issue before, although none of us have investigated it to the level you've done here 🎉, I worry that to achieve true portability (at least on Linux/unix) we need to remove all dependence on cmath, or do you think what you've done here is sufficient? Even if OpenRoad removes cmath as a dependency, it seems like it would be hard to do the same for ABC. |
Added -undefined dynamic_lookup linkopts on macOS for Python extension .so targets. The reference symbols are resolved at load time not link time. Without this the linker errors on undefined Python symbols. The change affects only the macos build. Signed-off-by: Friedrich Beckmann <friedrich.beckmann@tha.de>
Moved %include "Exception-py.i" before %template declarations so the Python exception handler is active when SWIG generates template wrapper code. Without this the python tests in the regression fail because the ask for tcl symbols. Signed-off-by: Friedrich Beckmann <friedrich.beckmann@tha.de>
I rewrote the test to use pure tcl file reads and string comparison. Now the test also works on macos. Signed-off-by: Friedrich Beckmann <friedrich.beckmann@tha.de>
Signed-off-by: Friedrich Beckmann <friedrich.beckmann@tha.de>
I extended the mechanism that is used to supply the defines for the python module. Now tha linker options are set in one place. Signed-off-by: Friedrich Beckmann <friedrich.beckmann@tha.de>
|
That makes sense so I just rebased it and it looks to build now. |
|
clang-tidy review says "All clean, LGTM! 👍" |
|
clang-tidy review says "All clean, LGTM! 👍" |
Head branch was pushed to by a user without write access
632cb2f to
6d77389
Compare
|
clang-tidy review says "All clean, LGTM! 👍" |
I already made the tcl changes in the third-party/abc module. They are commited in abc. So lets track them. Signed-off-by: Friedrich Beckmann <friedrich.beckmann@tha.de>
|
clang-tidy review says "All clean, LGTM! 👍" |
|
Sorry - I didn't see the automerge activity. I have rebased this and updated the third-party/abc submodule to track the abc tcl merge. |
|
currently, we're not using the third-party/abc but directly the one from BCR ... so you might not see any difference. |
yes i noticed but i had made the changes before main branch switched to the abc bazel repo. So this is just in case somebody wants to build agains the local abc again. |
|
@fredowski: Thanks for working on this. I gave it a try on a 16GB MacBook M1, MacOS Tahoe. Compared to a cmake build:
|
|
regarding the long build times
(I think the bazel gui build is not working yet on macos) |
|
I just merged the PR to enable qt6 on mac. |
With my Macbook Pro M4 i need 22minutes to compile after cleaning with
I have also around 13 GB in /private/var/tmp but only 3,4G in ~/.cache/bazel-disk-cache
Yes, this are the 62 tests which are failing initially.
This is correct. I should work after the #9568 is now merged. |
|
@stefanottili With a clean build you're paying for compiling all the dependencies of OpenROAD including QT from source, and statically linking them in the OpenROAD binary. On the next successive build you will only be paying for changes to the OpenROAD source code. Bazel is all all about incrementally building hermetically. If you change a single cc file in the OR repo the recompile should be quick. The other thing that's nice about compiling everything from source is that while it might take a while the first time it'll also pass the first time. So we're trading a bit of compile time upfront against the time it takes you to wrestle with apt-get/brew annoyance. |
|
@QuantamHD I take your points, but let me play devils advocate for a sec here: a) In 20 years of compiling klayout with qt4/5/6 on rhel linux and 4 years of doing that and building OR on macOS: Any package manager/build system has its quirks but over time things tend to get fixed. b) /opt/brew 9GB + 2GB cmake build dir vs 18GB .cache + 11GB /private/var/tmp/_bazel... @fredowski Many thanks for getting OR to compile with qt6, it would be nice if cmake + brew qt@5 and qt@6 would work. |
|
qt is using a prebuilt library, no? (https://github.com/The-OpenROAD-Project/qt_bazel_prebuilts) |
|
That repo is named that, but it is fact building from source now. There's no way to reliably ship compatible binaries on linux so I just built it from source. Which mirrors what we do at Google. |
|
Perhaps we will need to have a public mac bazel cache too then. |
@stefanottili I get it, I do. But let me try to give you the background on why we're going this route. Every week or so I see a number of people who file an issue that they were not able to compile OpenROAD on their machine for one reason or another (You personally have filed ~13 such issues over the years ), and for every one of the people who files an issue, I assume a 100 or more just give up. User adoption is a funnel, and building OpenROAD is the widest part of that funnel, which means making it more reliable is the best way to increase the number of user conversions. Bazel does take longer the first time, and it does take more disk, but what it aims to be more than anything is reliable. I want it to work the first every time for every users on the planet no matter what weird linux or macos system they're on, and unfortunately the best way to ensure that is to make everything compile from source using a hermetic compiler toolchain. Some of the build time issues as Matt suggests can be offset by populating a public cache that bazel can pull from as is already done on linux platforms. |
I fixed some parts of the bazel build on macos to be able to build the openroad commandline version and to run the regression.
bazelisk run //:openroadI start the regression with
bazelisk test //src/...and have
Executed 1117 out of 1157 tests: 1096 tests pass and 61 fail locally.Most failures are in gpl (33), rmp (8) with some (small) deviations in numbers. This is what claude says:
I can also run the flow regression with
and see some parts failing. I run this on a MacbookPro M4 with macos 15.7.3 (24G419)
I made some changes for the tcl migration on the sta and the abc submodule. In the meantime abc has changed and it works right away with the new abc version. OpenSTA needs to choose the MachineApple.cc on MacOS. The gui build on MacOS does not work. So this is only commandline and regression.
The OpenSTA changes are here: https://github.com/fredowski/OpenSTA/tree/bazel-macos
abc changes for the OpenRoad abc version: https://github.com/fredowski/abc/tree/bazel-macos (not needed)