perf: move EG() and CG() in ZTS builds into __thread storage#22231
perf: move EG() and CG() in ZTS builds into __thread storage#22231henderkes wants to merge 3 commits into
Conversation
|
Meant to write comments explaining a few peculiar decisions... but got carried away in the evening. Will get to it and tag relevant reviewers in the morning. |
|
|
||
| static void ts_free_resources(tsrm_tls_entry *thread_resources) | ||
| { | ||
| bool own_thread = thread_resources->thread_id == tsrm_thread_id(); |
There was a problem hiding this comment.
Here it's possible that a thread id was recycled, but the tls points to a now obsolete one
| } | ||
|
|
||
| if (!resource_types_table[i].fast_offset) { | ||
| if (!resource_types_table[i].fast_offset && !resource_types_table[i].tls_addr) { |
There was a problem hiding this comment.
we can't manually free __thread storage
| /* allocates a resource id whose per-thread storage is a native __thread block */ | ||
| TSRM_API ts_rsrc_id ts_allocate_tls_id(ts_rsrc_id *rsrc_id, void *(*tls_addr)(void), size_t size, ts_allocate_ctor ctor, ts_allocate_dtor dtor) | ||
| {/*{{{*/ | ||
| TSRM_ERROR((TSRM_ERROR_LEVEL_CORE, "Obtaining a new TLS resource id, %d bytes", size)); |
There was a problem hiding this comment.
function largely copied from above, looking at it now I see that size_t should be printed as %zu.
| # define TSRM_TLS_MODEL_ATTR | ||
| # define TSRM_TLS_MODEL_DEFAULT | ||
| #elif __PIC__ | ||
| #elif __PIC__ && !defined(__PIE__) |
There was a problem hiding this comment.
a PIE program can use local exec if it's the main executable. Only shared libraries (embed, extensions) need to fall back to initial-exed.
This alone would already be a small speedup (one fewer instruction per access)
| AS_VAR_APPEND([CFLAGS], [" -DZEND_EG_TLS"]) | ||
|
|
||
| dnl -mtls-size=12 drops the dead high-bits offset add from TLS access, | ||
| dnl valid while the thread-local block stays under 4 KiB. |
There was a problem hiding this comment.
This would produce linker errors if tls size exceeded 4kb, but I did a test atatic build with 100 extensions statically compiled in (what a terrible idea) and tls size ended up at 3.7kb. I can't think of a way to test whether a link would succeed before compiling, so this is unconditional.
|
@arnaud-lb I also had a different idea on achieving the same thing by pinning resolved EG and CG to a register on gcc. It worked well, but
Perhaps you have a different idea here? |
There was a problem hiding this comment.
This is a good idea, but you will need to check that it works in various configurations:
- Accessing TLS variables from other modules (shared extensions) used to be broken on some platforms (module wouldn't link or wouldn't load). I don't know if that's still the case, but that's one reason why extensions have their own
_tsrm_ls_cacheinstead of accessing the main one. - local-exec / initial-exec TLS when php is itself a shared library (apache module, embed) is allocated from the TLS surplus which is non portable and limited in size. The size of EG+CG might be larger than surplus. I don't remember why we specify the model explicitly but the loader patches the code to use more efficient models when applicable, so this may not be necessary.
Register pinning on the hybrid VM would work, but it may increase register pressure. I've tried adding new VM arguments recently for another reason, and it resulted in a performance regression. Global registers would work too, but only in VM code as the register can be clobbered when the variable is not in scope.
Some idea: EG()/CG() and a few others use ZEND_TSRMG_FAST(), which is _tsrm_ls_cache+some_offset. So EG(foo) is load(_tsrm_ls_cache)+load(some_offset)+offsetof(zend_executor_globals,foo). We could try to make the offset a const expr so that it folds to load(_tsrm_ls_cache)+const:
- Create a struct that contains all fast TLS entries (see php_reserve_tsrm_memory)
- Replace
php_reserve_tsrm_memory()bytsrm_reserve(sizeof(the_struct)). #define EG(v) ZEND_TSRMG_FAST(offsetof(the_struct, eg), zend_executor_globals *, v)
| ZEND_API TSRM_TLS TSRM_TLS_MODEL_ATTR zend_executor_globals executor_globals_tls; | ||
| ZEND_API TSRM_TLS TSRM_TLS_MODEL_ATTR zend_compiler_globals compiler_globals_tls; |
There was a problem hiding this comment.
Possibly these could be embeded in the main _tsrm_ls_cache (TSRMLS_MAIN_CACHE_DEFINE) :
ZEND_API TSRM_TLS TSRM_TLS_MODEL_ATTR struct {
void *cache;
zend_executor_globals eg;
zend_compiler_globals cg;
} _tsrm_ls_cache;
This would simplify JIT changes as we can access eg/cg at an offset from jit->tls.
Also the compiler may generate better code when both EG and CG are used in the same function.
So, I was chasing what caused the much higher instruction count on aarch64 for clang vs gcc and while figuring that out, got side tracked to figuring out that ZTS builds also have a ton of extra instructions compared to NTS builds. Turned out it was mostly just from constantly accessing executor and compiler globals when running phpstan. Each access became pseudo assembly:
Here we're slashing one extra load and one add by moving EG and CG onto actual thread storage. Benchmarks are using phpstan on full laravel codebase (full symfony codebase showed the same, but I added the LE optimization after and don't want to rerun 6 hours of benchmarks) and phoronix-test-suite phpbench.
aarch64:
x86_64:
LLM disclosure
claude