Dynamic build support with CI matrix#45
Conversation
4de99d0 to
23dd6fa
Compare
|
Interestingly, this now seems to fail due to the missing rts sublib issues. That are fixed in #44. Specifically this: |
dabc558 to
bbb815a
Compare
a10be51 to
d3180ae
Compare
d9f06c2 to
a1edbb5
Compare
7c8212f to
e74ec54
Compare
b78b210 to
1967f08
Compare
| sanitized_hc := $(subst $(space),_,$(subst :,_,$(subst /,_,$(subst \,_,$(TEST_HC))))) | ||
| test_hc_hash := $(shell \ | ||
| if command -v openssl >/dev/null 2>&1; then \ | ||
| openssl dgst -sha256 $(TEST_HC) | awk '{print substr($$2, 1, 8)}'; \ | ||
| elif command -v sha256sum >/dev/null 2>&1; then \ | ||
| sha256sum $(TEST_HC) | awk '{print substr($$1, 1, 8)}'; \ | ||
| elif command -v shasum >/dev/null 2>&1; then \ | ||
| shasum -a 256 $(TEST_HC) | awk '{print substr($$1, 1, 8)}'; \ | ||
| else \ | ||
| echo "no_hash"; \ | ||
| fi) | ||
| ghc_config_mk = $(TOP)/mk/$(test_hc_hash)_ghcconfig$(sanitized_hc).mk |
There was a problem hiding this comment.
I should add a comment here why we do this. It adds a hash to the ghc config file. This ensure we always read the correct ghcconfig, if the ghc changed, the hash changed, and we recompute.
| (NoArg (setGeneralFlag Opt_NoHsMain)) | ||
| , make_ord_flag defGhcFlag "no-rts" | ||
| (NoArg (setGeneralFlag Opt_NoRts)) | ||
| , make_ord_flag defGhcFlag "no-ghc-internal" |
There was a problem hiding this comment.
We have the following dependeny as visible to cabal:
ghc-internal
+ rts
however, ghc will try to insert:
ghc-internal
+ rts-sublib (based on the -threaded / -debug flag)
+ rts
If we try to build the rts-sublib with ghc, we can end up trying to load ghc-internal, due to auto-injection of libraries.
Maybe a better solution is to add a flag to outright disable ghc's auto population of libs, instead of having separate ones for each lib 😅
8c7a84c to
3e7c948
Compare
2536c91 to
d16343c
Compare
e808dde to
537b3a7
Compare
a6aa186 to
1fe3235
Compare
|
On ARM64 alpine musl linked: |
I'll look at this, why are we running dyn on static musl/alpine? That's odd, but probably something we should either prohibit by construction or fix. |
3ae8429 to
9cfb121
Compare
| PATCHELF ?= patchelf | ||
| INSTALL_NAME_TOOL ?= install_name_tool |
There was a problem hiding this comment.
I wonder if we really need these still?
| ifeq ($(DYNAMIC),1) | ||
| $(SED) -i -e 's/"RTS ways","/"RTS ways","dyn debug_dyn thr_dyn thr_debug_dyn /' $@ | ||
| endif |
| # $1 = TIPLET | ||
| define build_cross | ||
| GHC=$(GHC) HADRIAN_SETTINGS='$(call HADRIAN_SETTINGS)' \ | ||
| LD_LIBRARY_PATH=$(LD_LIBRARY_PATH) GHC=$(GHC) HADRIAN_SETTINGS='$(call HADRIAN_SETTINGS)' \ |
There was a problem hiding this comment.
The need for all these LD_LIBRARY_PATH stuff, feels more like a hack. Something we ought to fix properly.
| fi ; \ | ||
| done ; \ | ||
| if [ $${ffi_incdir} != "none" ] ; then $(call copy_headers,ffitarget.h,$(CURDIR)/$${ffi_incdir},libffi-clib,$(CURDIR)/_build/bindist/bin/$1-ghc-pkg$(EXE_EXT)) ; fi | ||
| if [ $${ffi_incdir} != "none" ] ; then $(call copy_headers,ffitarget.h,$(CURDIR)/$${ffi_incdir},libffi-clib,LD_LIBRARY_PATH=$(LD_LIBRARY_PATH) $(CURDIR)/_build/bindist/bin/$1-ghc-pkg$(EXE_EXT)) ; fi |
There was a problem hiding this comment.
I believe this is fixed in cabal now, the copy-headers stuff?
| # $1 = rpath | ||
| # $2 = binary | ||
| # set rpath relative to the current executable | ||
| # TODO: on darwin, this doesn't overwrite rpath, but just adds to it, | ||
| # so we'll have the old rpaths from the build host in there as well | ||
| # set_rpath: Add rpath to binary. On Darwin, check if rpath already exists | ||
| # before adding (install_name_tool fails if rpath is duplicate). | ||
| define set_rpath | ||
| $(if $(filter Darwin,$(UNAME)), \ | ||
| if ! otool -l "$(2)" 2>/dev/null | grep -A2 'LC_RPATH' | grep -q "@executable_path/$(1)"; then \ | ||
| $(INSTALL_NAME_TOOL) -add_rpath "@executable_path/$(1)" "$(2)"; \ | ||
| fi, \ | ||
| $(PATCHELF) --force-rpath --set-rpath "\$$ORIGIN/$(1)" "$(2)") |
There was a problem hiding this comment.
Again, this feels like a massive hack, we shouldn't need and that GHC should just do on it's own properly.
| /* | ||
| * Note [Boot library symbol visibility] | ||
| * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| * Boot library promotion to RTLD_GLOBAL is now handled in RTS initialization. | ||
| * See Note [Promoting Boot Libraries to RTLD_GLOBAL] in rts/RtsStartup.c. | ||
| * | ||
| * This ensures all GHC API programs (not just ghc-iserv) properly export | ||
| * boot library symbols before any user code is loaded via dlopen. | ||
| */ | ||
|
|
There was a problem hiding this comment.
I don't think this comment is necessary anymore 🤔
4b6d197 to
5f99a8c
Compare
| -- addresses directly from the running RTS. This is called at runtime after | ||
| -- promoteBootLibrariesToGlobal() has run, ensuring we get the correct RTS. | ||
| -- | ||
| -- The lookup is cached in an IORef for efficiency. |
There was a problem hiding this comment.
You don't need an IORef: a CAF would be enough. On first entry it would perform the unsafePerformIO stuff.
| -- | RTS function to get stg_interp_constr*_entry address. | ||
| -- This returns the address directly from the running RTS, avoiding any | ||
| -- symbol resolution issues from library loading order. | ||
| foreign import ccall unsafe "getInterpConstrEntryAddr" |
There was a problem hiding this comment.
In rts/includes/stg/MiscClosures.h there is this comment:
// RTS_FUN(stg_interp_constr1_entry);
// RTS_FUN(stg_interp_constr2_entry);
// RTS_FUN(stg_interp_constr3_entry);
// RTS_FUN(stg_interp_constr4_entry);
// RTS_FUN(stg_interp_constr5_entry);
// RTS_FUN(stg_interp_constr6_entry);
// RTS_FUN(stg_interp_constr7_entry);
//
// This is referenced using the FFI in the compiler (GHC.ByteCode.InfoTable),
// so we can't give it the correct type here because the prototypes
// would clash (FFI references are always declared with type StgWord[]
// in the generated C code). You should now be allowed to declare the correct types here and to remove the comment I think.
| * fit in a 32-bit slot. When a reference to a symbol in a shared library | ||
| * exceeds this range, we use "jump islands" (trampolines) for code references | ||
| * via makeSymbolExtra(). | ||
| * | ||
| * We unfortunately can't tell whether symbol references are to code | ||
| * or data. So for now we assume they are code (the vast majority | ||
| * are), and allocate jump-table slots. Unfortunately this will | ||
| * SILENTLY generate crashing code for data references. This hack is | ||
| * enabled by X86_64_ELF_NONPIC_HACK. | ||
| * For FUNCTION symbols (STT_FUNC), jump islands work correctly: the call | ||
| * bounces through the trampoline to reach the real function. | ||
| * | ||
| * One workaround is to use shared Haskell libraries. This is the case | ||
| * when dynamically-linked GHCi is used. | ||
| * For DATA symbols (STT_OBJECT, STT_NOTYPE), jump islands are INCORRECT: | ||
| * the jump island address would be embedded as a data pointer (e.g., an info | ||
| * table pointer in a closure), causing the GC to interpret jump island memory | ||
| * as a valid info table — leading to "strange closure type" crashes. | ||
| * | ||
| * Another workaround is to keep the static libraries but compile them | ||
| * with -fPIC -fexternal-dynamic-refs, because that will generate PIC | ||
| * references to data which can be relocated. This is the case when | ||
| * +RTS -xp is passed. | ||
| * We detect data references using ELF_ST_TYPE(sym.st_info) from the symbol | ||
| * table. When a non-function symbol overflows 32-bit range, we emit a clear | ||
| * error message instead of silently creating a corrupt jump island. | ||
| * | ||
| * Workarounds for data reference overflow: | ||
| * - Use shared Haskell libraries (dynamically-linked GHCi) | ||
| * - Compile with -fPIC -fexternal-dynamic-refs (generates GOT-based | ||
| * relocations that can handle arbitrary distances) | ||
| * - Use +RTS -xp (maps loaded objects into low memory) | ||
| * | ||
| * This hack is enabled by X86_64_ELF_NONPIC_HACK. |
There was a problem hiding this comment.
| * fit in a 32-bit slot. When a reference to a symbol in a shared library | |
| * exceeds this range, we use "jump islands" (trampolines) for code references | |
| * via makeSymbolExtra(). | |
| * | |
| * We unfortunately can't tell whether symbol references are to code | |
| * or data. So for now we assume they are code (the vast majority | |
| * are), and allocate jump-table slots. Unfortunately this will | |
| * SILENTLY generate crashing code for data references. This hack is | |
| * enabled by X86_64_ELF_NONPIC_HACK. | |
| * For FUNCTION symbols (STT_FUNC), jump islands work correctly: the call | |
| * bounces through the trampoline to reach the real function. | |
| * | |
| * One workaround is to use shared Haskell libraries. This is the case | |
| * when dynamically-linked GHCi is used. | |
| * For DATA symbols (STT_OBJECT, STT_NOTYPE), jump islands are INCORRECT: | |
| * the jump island address would be embedded as a data pointer (e.g., an info | |
| * table pointer in a closure), causing the GC to interpret jump island memory | |
| * as a valid info table — leading to "strange closure type" crashes. | |
| * | |
| * Another workaround is to keep the static libraries but compile them | |
| * with -fPIC -fexternal-dynamic-refs, because that will generate PIC | |
| * references to data which can be relocated. This is the case when | |
| * +RTS -xp is passed. | |
| * We detect data references using ELF_ST_TYPE(sym.st_info) from the symbol | |
| * table. When a non-function symbol overflows 32-bit range, we emit a clear | |
| * error message instead of silently creating a corrupt jump island. | |
| * | |
| * Workarounds for data reference overflow: | |
| * - Use shared Haskell libraries (dynamically-linked GHCi) | |
| * - Compile with -fPIC -fexternal-dynamic-refs (generates GOT-based | |
| * relocations that can handle arbitrary distances) | |
| * - Use +RTS -xp (maps loaded objects into low memory) | |
| * | |
| * This hack is enabled by X86_64_ELF_NONPIC_HACK. | |
| * fit in a 32-bit slot. When a reference to a symbol in a shared library | |
| * exceeds this range, we can use a hack (enabled by X86_64_ELF_NONPIC_HACK) | |
| * which consists in generating "jump islands" (trampolines) for code references | |
| * (this happens in makeSymbolExtra()). | |
| * | |
| * For FUNCTION symbols (STT_FUNC), jump islands work correctly: the call | |
| * bounces through the trampoline to reach the real function. | |
| * | |
| * For DATA symbols (STT_OBJECT, STT_NOTYPE), jump islands are INCORRECT: | |
| * the jump island address would be embedded as a data pointer (e.g., an info | |
| * table pointer in a closure), causing the GC to interpret jump island memory | |
| * as a valid info table — leading to "strange closure type" crashes. | |
| * | |
| * We detect data references using ELF_ST_TYPE(sym.st_info) from the symbol | |
| * table. When a non-function symbol overflows 32-bit range, we emit a clear | |
| * error message instead of silently creating a corrupt jump island. | |
| * | |
| * Workarounds for data reference overflow: | |
| * - Use shared Haskell libraries (dynamically-linked GHCi) | |
| * - Compile with -fPIC -fexternal-dynamic-refs (generates GOT-based | |
| * relocations that can handle arbitrary distances) | |
| * - Use +RTS -xp (maps loaded objects into low memory) | |
| * |
| { | ||
| StgInt64 off = value - P; | ||
| if (off != (Elf64_Sword)off && X86_64_ELF_NONPIC_HACK) { | ||
| /* Check symbol type: jump islands only work for code (functions). |
There was a problem hiding this comment.
Nice! We should probably upstream this.
Is it an issue for tables-next-to-code where a symbol is both data and code (entry code)? Shouldn't we use two symbols instead of one to trigger this error when appropriate?
There was a problem hiding this comment.
I just remember debugging this and scratching my head wtf was going on. Hence this at least gives a bit clearer error message. Two symbols might be nice, although it would lead to even more symbol explosion. Windows also struggles with this data/code issue massively. Ultimately, I'm not sure TNTC is a feature worth keeping. (it also inflates object sizes quite a bit by putting a lot of redundant data into the objects).
| ghc-options: -this-unit-id rts -ghcversion-file=include/ghcversion.h -optc-DFS_NAMESPACE=rts | ||
| cmm-options: -this-unit-id rts | ||
|
|
||
| -- [The AutoApply story] |
There was a problem hiding this comment.
Sub-libraries are such a kludge. I'm not convinced they are worth all the added complexity
There was a problem hiding this comment.
I'm open to other solutions that play nice with cabal. I'm not super happy about the sublibs, but they seem to be the least worst of the pack.
| # Wrap path resolution to avoid passing None/invalid paths to Path APIs. | ||
| def current(_way): | ||
| p = path_func() | ||
| if p is None: | ||
| raise StatsException("No path returned for size collection") | ||
| # If p looks absolute, use it directly; else resolve relative to testdir | ||
| pth = Path(p) | ||
| if not pth.is_absolute(): | ||
| pth = in_testdir(p) | ||
| if not pth.exists(): | ||
| raise StatsException(f"Path not found for size collection: {pth}") | ||
| return os.path.getsize(pth) | ||
| return collect_generic_stat ( 'size', deviation, current ) |
There was a problem hiding this comment.
Do you remember why it failed? We need some motivation to upstream it. It looks reasonable but a concrete example would help.
| return collect_size_func(deviation, lambda: find_non_inplace_so(library)) | ||
| try: | ||
| return collect_size_func(deviation, lambda: find_non_inplace_so(library)) | ||
| except Exception as _: |
There was a problem hiding this comment.
Ah, I guess the code above is to handle this case. Maybe not good for upstreaming then.
There was a problem hiding this comment.
Not sure. Making it more robust might be a good idea (and reduce the stable-haskell ghc/upstream drift).
| -- Export RTS symbols to dynamically loaded libraries | ||
| -- See Note [ghc-iserv and dynamic symbol export] | ||
| if os(linux) || os(freebsd) | ||
| ghc-options: -rdynamic | ||
| if os(osx) || os(darwin) | ||
| ghc-options: -optl -Wl,-flat_namespace | ||
| -- Note: Windows has a hard limit of 65535 symbol exports (16-bit index). | ||
| -- We cannot use --export-all-symbols here as we exceed that limit. | ||
|
|
There was a problem hiding this comment.
The Darwin/OSX -flat_namespace option should be passed in GHC.Runtime.Interpreter.C too.
(we already pass -Wl,--export-dynamic on ELF targets there)
| BINDIST_EXECTUABLES := \ | ||
| ghc$(EXE_EXT) \ | ||
| ghc-iserv$(EXE_EXT) \ | ||
| ghc-iserv-dyn$(EXE_EXT) \ |
There was a problem hiding this comment.
Do we still need to distribute iserv programs now that GHC can build them on demand?
There was a problem hiding this comment.
For now yes. In the future I hope not.
| if os(osx) || os(darwin) | ||
| ghc-options: -optl -Wl,-flat_namespace | ||
| -- Note: Windows has a hard limit of 65535 symbol exports (16-bit index). | ||
| -- We cannot use --export-all-symbols here as we exceed that limit. |
There was a problem hiding this comment.
does this mean that on Windows the dynamic build doesn't work?
There was a problem hiding this comment.
I don't think it has ever? And yes, this is a stupid limitation. But the issue is more about how many symbols we actually try to have visible.
This commit fixes several critical issues with the RTS object linker that
prevented dynamic GHC builds from loading code correctly.
Key fixes:
1. Detect data vs code references in X86_64_ELF_NONPIC_HACK (Elf.c)
- The jump island mechanism was incorrectly applied to data references
- For info table pointers (_con_info symbols), embedding a jump island
address caused GC crashes ("strange closure type")
- Now distinguishes R_X86_64_PLT32 (code) from R_X86_64_PC32 (data)
- Data references use GOT-style indirection through extra->addr instead
2. Preserve dlerror for linker script fallback handling (LoadNativeObjPosix.c)
- dlerror() clears after first call, losing error context
- Now saves error string before retry logic
- Fixes misleading error messages when loading fails
3. Promote boot libraries to RTLD_GLOBAL for dynamic code loading (RtsStartup.c)
- Boot libraries loaded with RTLD_LOCAL weren't visible to dlsym
- Dynamic object loading failed to resolve symbols from boot libs
- Now re-opens boot libraries with RTLD_GLOBAL flag at startup
4. Skip loading libc/libm already linked into process (LoadNativeObjPosix.c)
- Avoids redundant loading and symbol conflicts
- Checks if library is already resident before calling dlopen
5. Dynamic lookup of stg_interp_constr entry points via dlsym (RtsSymbols.c)
- Interpreter constructor symbols need runtime resolution
- Adds dynamic fallback when static symbols unavailable
6. Remove residual debug instrumentation
- Cleans up debugging code from Evac.c and LoadNativeObjPosix.c
This commit adds infrastructure for RTS sublibrary loading in dynamic builds, enabling the split RTS architecture to work with shared library linking. Key changes: 1. RTS sublibrary infrastructure (rts/rts.cabal) - Define separate sublibraries for RTS components - Add proper library dependencies and visibility - Configure shared library generation for RTS parts 2. Configure support for dynamic builds (rts/configure.ac) - Detect platform-specific dynamic linking requirements - Set appropriate linker flags for each sublibrary - Handle symbol visibility for exported functions 3. API updates for sublibrary boundaries (rts/include/RtsAPI.h) - Adjust exported symbol declarations - Ensure proper visibility across sublibrary boundaries 4. AutoApply support for interpreter (rts/AutoApply*.cmm) - Add AutoApply.cmm and vector variants (V16, V32, V64) - Required for dynamic bytecode interpreter operation 5. Cabal project configuration - cabal.project.stage1: Add no-ghc-internal flag for stage1 builds - cabal.project.stage2: Configure full RTS with all sublibraries 6. Thread infrastructure (rts/Threads.h) - Updates for sublibrary thread handling
This commit addresses compiler and driver issues specific to dynamic GHC builds, ensuring proper code generation and linking behavior. Key changes: 1. Make Opt_ExternalDynamicRefs default on all PIC platforms (DynFlags.hs) - Previously only enabled for specific configurations - Dynamic builds require external dynamic references for proper GOT usage - Prevents relocation issues with large code models 2. Pipeline and session handling updates - Driver/Pipeline.hs: Handle dynamic linking in compilation pipeline - Driver/Session.hs: Session configuration for dynamic builds - Driver/Flags.hs: Flag handling for dynamic mode 3. Linker updates for dynamic mode - Linker/Executable.hs: Executable linking for dynamic builds - Linker/Static.hs: Static linking coordination - ByteCode/Linker.hs: Bytecode linker for dynamic interpreter 4. Unit state and GHCi support - Unit/State.hs: Package database handling for dynamic libs - GHCi/InfoTable.hsc: Info table generation for dynamic mode - Tc/Gen/Splice.hs: Template Haskell splice handling
GHC and ghc-iserv load Haskell shared libraries dynamically for Template Haskell and GHCi. These libraries reference RTS symbols (e.g., stg_INTLIKE_closure) that are linked into the executable. Without special linker flags, those symbols aren't visible to dlopen'd libraries. This commit adds platform-specific linker flags to export these symbols: - Linux/FreeBSD: -rdynamic (passes --export-dynamic to ld) - macOS: -flat_namespace (makes all symbols visible across namespaces) - Windows: Cannot use --export-all-symbols due to 65535 symbol limit See Note [ghc-iserv and dynamic symbol export] in ghc-iserv.cabal.in for detailed explanation of the approach and alternatives considered.
This commit adds build system support for creating dynamic GHC builds, including Makefile targets, bindist generation, and utility configurations. Key changes: 1. Makefile enhancements - Add DYNAMIC=1 build variable support - Create dylib symlinks for macOS dynamic builds - Use concrete file target for testsuite-timeout - Include ghc-iserv-dyn in tarballs for all targets - Proper bindist generation for dynamic builds 2. Utility cabal files (hp2ps.cabal, unlit.cabal) - Configure for dynamic linking support - Ensure utilities work with dynamic GHC 3. ghc-iserv infrastructure (iservmain.c) - Updates for dynamic interpreter server - Proper initialization for dynamic linking context 4. Test expectations for Stable Haskell - Update bug report URL in test expectations Usage: make DYNAMIC=1 _build/bindist # Build dynamic GHC bindist
This commit updates the testsuite to handle the split RTS architecture and dynamic GHC build configuration. Key changes: 1. testlib.py improvements - More robust test driver for dynamic builds - Better handling of shared library paths - Improved error detection and reporting 2. Test infrastructure (boilerplate.mk) - Configure tests for dynamic linking environment - Set proper library paths for test execution 3. Test adjustments for RTS split - T18072debug: Update grep to match cabal-based RTS naming - T23142.hs: Revert module name to fix -Di debug output test - keep-cafs-fail.stdout: Update expected output 4. Dynamic linking test updates - ghci/linking/dyn/all.T: Adjust for dynamic GHC - T2228: Restore expect_broken(7298) for dynamic builds - T11531.stderr: Update expected error messages 5. Platform-specific adjustments - T10458: Skip on musl with dynamic GHC - T11223 tests: Update stderr expectations for Windows 6. Test configuration - .gitignore: Add patterns for dynamic test artifacts - dynlibs/Makefile: Update for dynamic build testing - perf/size/all.T: Adjust size expectations
This commit extends the CI/CD pipeline to build and test dynamic GHC configurations alongside the existing static builds. Key changes: 1. ci.yml - Main CI workflow - Add DYNAMIC=1 to build matrix - Configure dynamic build jobs for Linux and macOS - Run ghci-ext tests on dynamic builds (require interpreter) - Parallel execution of static and dynamic builds 2. reusable-release.yml - Release workflow - Add dynamic GHC builds to release artifacts - Generate separate bindists for dynamic configuration - Include ghc-iserv-dyn in release tarballs - Re-enable release workflow on pull requests for testing The dynamic build matrix allows testing of: - Template Haskell with dynamic code loading - GHCi interactive features - Dynamic library loading and linking - Interpreter-based test suites (ghci-ext) Build configurations: - Static (default): DYNAMIC=0 or unset - Dynamic: DYNAMIC=1
5f99a8c to
0be09ea
Compare
Summary
This PR adds comprehensive dynamic build support to the stable-haskell GHC build, building on the foundation from PR #137.
Changes
CI/Build System:
RTS/Linker:
Testsuite:
Based on
Test plan