Conversation
commit 7ad4e68338a55eb2dba48e7de5ae48b582732681
Author: Chandrasekaran <amrish.chandrasekaran@intel.com>
Date: Thu Jan 3 12:32:13 2019 -0600
Modified mlp_tbb.cc to try different configs. Added python script for analysing load imbalance.
For promotion to github repo. See merge request DeveloperProducts/Runtimes/Threading/customer-samples/mlp!1
7df27f8 to 24392b1
| CFLAGS = -DMKL_ILP64 -m64 -I${MKLROOT}/include -I${NUMAROOT}/include -I${TBBROOT}/include -mavx2 -mfma -mf16c -fopenmp -mavx512f -Wall #-march=skylake
| SP ?=1
| UBN ?=0
So USE_BROADCAST_NODE doesn't give a speedup, right?
If there was an improvement, it was very small. It basically removes the overhead of spawning a task for the very first node only.
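For context, here is a minimal sketch of what such a toggle could look like. The graph wiring, node names, and the assumption that the UBN make variable maps to a USE_BROADCAST_NODE preprocessor flag are illustrative, not the actual mlp_tbb.cc code: with the flag on, a tbb::flow::broadcast_node serves as the entry point and fans the start message out to the first layer; with it off, the message is pushed into the first node directly.

```cpp
// Hypothetical sketch of a USE_BROADCAST_NODE toggle -- not the actual mlp_tbb.cc wiring.
#include <tbb/flow_graph.h>

#ifndef USE_BROADCAST_NODE
#define USE_BROADCAST_NODE 0  // assumed to be set from the UBN make variable
#endif

int main() {
  tbb::flow::graph g;

  tbb::flow::continue_node<tbb::flow::continue_msg> layer0(
      g, [](const tbb::flow::continue_msg&) { /* first layer's work */ });
  tbb::flow::continue_node<tbb::flow::continue_msg> layer1(
      g, [](const tbb::flow::continue_msg&) { /* next layer's work */ });
  tbb::flow::make_edge(layer0, layer1);

#if USE_BROADCAST_NODE
  // UBN=1: a broadcast_node forwards the start message to the first layer's node(s).
  tbb::flow::broadcast_node<tbb::flow::continue_msg> start(g);
  tbb::flow::make_edge(start, layer0);
  start.try_put(tbb::flow::continue_msg());
#else
  // UBN=0: kick off the graph by putting the message straight into the first node.
  layer0.try_put(tbb::flow::continue_msg());
#endif

  g.wait_for_all();
  return 0;
}
```

Either way the difference only affects how the very first node is started, which is consistent with the small-to-negligible improvement reported above.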
| UBN ?=0
| TI ?=0
| FG ?=1
| NB ?=0
Does NUMA_BIND hurt performance?
| #seq = y
| CFLAGS = -DMKL_ILP64 -m64 -I${MKLROOT}/include -I${NUMAROOT}/include -I${TBBROOT}/include -mavx2 -mfma -mf16c -fopenmp -mavx512f -Wall #-march=skylake
| SP ?=1
SP stands for "split"; it is the factor used to create more chunks for the parallel_for, so you get SP * nthreads_per_socket chunks. We left the default at 1, although our tests did show improvements with SP=2.
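To make the chunking concrete, here is a minimal sketch of the idea. The function name, the row-slicing math, and the per-row body are assumptions for illustration, not the exact mlp_tbb.cc code: the parallel_for range has SP * nthreads_per_socket chunks, so with SP=2 each thread has on average two smaller tasks available, which helps smooth out load imbalance.

```cpp
// Illustrative only: shows how SP multiplies the number of parallel_for chunks.
#include <tbb/parallel_for.h>
#include <cstddef>

void run_layer(std::size_t rows, int nthreads_per_socket, int SP) {
  const std::size_t nchunks =
      static_cast<std::size_t>(SP) * static_cast<std::size_t>(nthreads_per_socket);
  tbb::parallel_for(std::size_t(0), nchunks, [&](std::size_t task_id) {
    // Each task owns a contiguous slice of the rows; more chunks means smaller
    // slices and more opportunity for an idle thread to steal work.
    const std::size_t begin = task_id * rows / nchunks;
    const std::size_t end = (task_id + 1) * rows / nchunks;
    for (std::size_t i = begin; i < end; ++i) {
      // ... per-row work for this layer ...
    }
  });
}
```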
| CXX ?= g++
| #CXX ?= g++
| CXX = icpc
Can we please test with gcc, to be on the same page? We're using gcc 5.5.0.
| #CC = /usr/local/opt/gcc/bin/g++-7 -std=c++11
| LDFLAGS = -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_gnu_thread -lpthread -lm -ldl ${NUMAROOT}/lib/libnuma.a ${TBBROOT}/lib/libtbb.a
| LDFLAGS = -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -L${NUMAROOT}/lib -lnuma -L${TBBROOT}/lib -ltbb -L/usr/lib -liomp5
Can we please not use Intel OpenMP, to be on the same page? And is using mkl_sequential needed? I basically want to know which changes are needed for better performance and which are not. It would be great if we could identify the minimal set of changes that gets good perf.
| public:
|   pinning_observer(tbb::task_arena& arena, int numa_node_id)
|       : tbb::task_scheduler_observer(arena), numa_node_id_(numa_node_id) {
|       : tbb::task_scheduler_observer(arena), arena_(arena), numa_node_id_(numa_node_id) {
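The hunk above only shows the constructor; for readers following along, here is a rough sketch of what an arena-scoped pinning observer generally looks like. The on_scheduler_entry body, the observe(true) call, and the use of libnuma's numa_run_on_node are assumptions for illustration, not the code from this merge request.

```cpp
// Illustrative sketch only -- the real mlp_tbb.cc observer may differ.
#define TBB_PREVIEW_LOCAL_OBSERVER 1  // older TBB requires this for arena-scoped observers
#include <tbb/task_arena.h>
#include <tbb/task_scheduler_observer.h>
#include <numa.h>  // libnuma (assumed pinning mechanism)

class pinning_observer : public tbb::task_scheduler_observer {
 public:
  pinning_observer(tbb::task_arena& arena, int numa_node_id)
      : tbb::task_scheduler_observer(arena), arena_(arena), numa_node_id_(numa_node_id) {
    observe(true);  // start observing threads entering/leaving this arena
  }

  // Called on each thread as it joins the arena: bind it to the observer's
  // NUMA node so its work and allocations stay local to that socket.
  void on_scheduler_entry(bool /*is_worker*/) override {
    numa_run_on_node(numa_node_id_);
  }

 private:
  tbb::task_arena& arena_;
  int numa_node_id_;
};
```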
| SP*nthreads_per_socket,
| [&](size_t task_id) {
|   double sgst = dsecnd();
|   int tid = numa_node_id_ * nthreads_per_socket + task_id;
If a thread grabs more than one task, we want to attribute the execution time of all of those tasks to that thread. This is why I kept track of execution time based on current_thread_index. Please let me know your thoughts.
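For reference, a minimal sketch of that accounting scheme. It uses tbb::tick_count instead of MKL's dsecnd() and invents the per_thread_seconds vector, so it is an illustration of the idea rather than this MR's code: each task's time is summed into the slot for the executing thread's current_thread_index, so a thread that grabs several of the SP * nthreads_per_socket tasks is charged for all of them.

```cpp
// Illustrative sketch: per-thread time accounting keyed by current_thread_index.
#include <tbb/parallel_for.h>
#include <tbb/task_arena.h>
#include <tbb/tick_count.h>
#include <cstddef>
#include <vector>

void timed_layer(std::size_t ntasks, std::vector<double>& per_thread_seconds) {
  // per_thread_seconds must be sized to at least the arena's max concurrency.
  tbb::parallel_for(std::size_t(0), ntasks, [&](std::size_t task_id) {
    const tbb::tick_count t0 = tbb::tick_count::now();
    (void)task_id;  // placeholder for the real per-task body
    const tbb::tick_count t1 = tbb::tick_count::now();

    // Attribute this task's time to the executing thread, not to task_id:
    // if one thread steals and runs several tasks, they all land in its slot.
    const int tidx = tbb::this_task_arena::current_thread_index();
    per_thread_seconds[tidx] += (t1 - t0).seconds();
  });
}
```

Accounting by thread index rather than by task id is what makes the load-imbalance analysis meaningful when tasks outnumber threads.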