fix(docker): set musl pthread stack via PT_GNU_STACK (real fix for #60)#61
Merged
Conversation
…x for #60) #60 diagnosed the SIGSEGV correctly — musl's 128 KB default pthread stack overflowed on the first BLAS3 call from MUMPS factorization — but patched the wrong place: the sed inserted pthread_attr_setstacksize into a block guarded by `#ifdef NEED_STACKATTR`, which blas_server.c itself `#undef`s unconditionally on Linux (line 103). The injected code compiled out. `strings volca | grep pthread_attr_setstacksize` on the merged image returns zero matches. User-confirmed: `OPENBLAS_NUM_THREADS=1` avoids the crash (no worker threads → no 128 KB stacks). Real fix: bake the desired default into the ELF's PT_GNU_STACK header via `-Wl,-z,stack-size=8388608`. musl reads p_memsz at process start and uses it as __default_stacksize, so every pthread created with `NULL` attr — OpenBLAS workers, GHC RTS capabilities, anything else — starts with 8 MB. No source patching of OpenBLAS needed.
PR #60 shipped broken because nothing in the build pipeline verified the injected pthread_attr_setstacksize call survived compilation; the silent regression was only caught by a production SIGSEGV. The fix in this branch (-Wl,-z,stack-size=8388608) replaces that with a different invariant — "PT_GNU_STACK->p_memsz == 0x800000 in the shipped ELF" — which is again only checked manually. Add a readelf-based assertion right after UPX so a future linker-flag refactor (or a UPX-side header rewrite) fails the image build with a diagnostic naming the actual value found, instead of recurring as a runtime crash on a customer VM. The check runs on /build/output/volca (post-strip, post-UPX) because that's the binary that actually runs. binutils — which provides readelf — is already in the build-stage apk add.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
pthread_attr_setstacksize(&attr, 8 << 20)sits inside#ifdef NEED_STACKATTR, andblas_server.citself does#undef NEED_STACKATTRunconditionally on Linux (line 103). Verified post-merge:strings /usr/local/bin/volca | grep pthread_attr_setstacksizereturns 0 matches in thevolca-with-frontend:latestbuilt from fix(docker): unblock 8 GB VM target — OpenBLAS musl pthread + RTS memory discipline #60.exit 139at the same point —[MATRIX] Pre-computing factorization for database 'agribalyse-3-2'— on a 16 GB host running the freshly-pruned-and-rebuilt image.OPENBLAS_NUM_THREADS=1works around it (single-threaded path skipspthread_create), confirming the root cause is still musl's default thread stack.-Wl,-z,stack-size=8388608. musl readsPT_GNU_STACK->p_memszat process start and uses it as__default_stacksize. Every pthread the binary creates withNULLattr — OpenBLAS workers, GHC RTS capabilities, anything else — starts with 8 MB. No OpenBLAS source patching, no entrypoint env var, no thread-count downgrade.docker/Dockerfile.Why not patch OpenBLAS properly instead
Two reasons to prefer the linker flag:
NEED_STACKATTRcan't simply be#define'd: the secondpthread_create(&blas_threads[i], &attr, …)site ingoto_set_num_threadsreferences anattrthat isn't declared in that function's scope — defining the macro turns latent dead code into a compile error.+RTS -N) inherit musl's default too, so a Haskell-side BLAS call from a non-OpenBLAS worker would still have been at risk.Test plan
./volca/docker-build.sh --with-frontendfromvolca-deploy/produces avolca-with-frontendimage.readelf -l dist/volca | grep -A1 GNU_STACK(or on the image's/usr/local/bin/volca) showsMemSiz=0x800000(8 MiB).docker run --rm -p 8080:8080 -v /home/ecobalyse/volca_data:/data volca-with-frontend --config /data/volca.toml server --password …no longer crashes at MUMPS factorization foragribalyse-3-2withoutOPENBLAS_NUM_THREADS=1.