
Intel's modifications #1

Open
anton-malakhov wants to merge 13 commits into jspark1105:master from anton-malakhov:master

Conversation

@anton-malakhov

No description provided.

anton-malakhov and others added 12 commits January 24, 2019 19:01
commit 7ad4e68338a55eb2dba48e7de5ae48b582732681
Author: Chandrasekaran <amrish.chandrasekaran@intel.com>
Date:   Thu Jan 3 12:32:13 2019 -0600

    Modified mlp_tbb.cc to try different configs. Added python script for analysing load imbalance
For promotion to github repo

See merge request DeveloperProducts/Runtimes/Threading/customer-samples/mlp!1
Comment thread: Makefile

CFLAGS = -DMKL_ILP64 -m64 -I${MKLROOT}/include -I${NUMAROOT}/include -I${TBBROOT}/include -mavx2 -mfma -mf16c -fopenmp -mavx512f -Wall #-march=skylake
SP ?=1
UBN ?=0
Owner

So USE_BROADCAST_NODE doesn't give a speedup, right?


If there was an improvement, it was very small: USE_BROADCAST_NODE only removes the overhead of spawning a task for the very first node.

Comment thread: Makefile
UBN ?=0
TI ?=0
FG ?=1
NB ?=0
Owner

Does NUMA_BIND hurt performance?

Comment thread: Makefile
#seq = y

CFLAGS = -DMKL_ILP64 -m64 -I${MKLROOT}/include -I${NUMAROOT}/include -I${TBBROOT}/include -mavx2 -mfma -mf16c -fopenmp -mavx512f -Wall #-march=skylake
SP ?=1
Owner

What does SP stand for?


SP stands for "split": it is the factor used to create more chunks for the parallel_for, so you get SP * nthreads_per_socket chunks. We left the default at 1, although our tests did show improvements with SP=2.

Comment thread: Makefile

CXX ?= g++
#CXX ?= g++
CXX = icpc
Owner

Can we please test with gcc so we're on the same page? We're using gcc 5.5.0.

Comment thread: Makefile
#CC = /usr/local/opt/gcc/bin/g++-7 -std=c++11

LDFLAGS = -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_gnu_thread -lpthread -lm -ldl ${NUMAROOT}/lib/libnuma.a ${TBBROOT}/lib/libtbb.a
LDFLAGS = -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -L${NUMAROOT}/lib -lnuma -L${TBBROOT}/lib -ltbb -L/usr/lib -liomp5
Owner

@jspark1105 jspark1105 Feb 1, 2019

Can we please not use Intel OpenMP, to be on the same page? And is linking mkl_sequential needed? I basically want to know which changes are needed for better performance and which are not. It would be great to know the minimal set of changes that gets good performance.
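For reference, a minimal link line along the lines being requested here might look like the fragment below. This is a hedged sketch only, not the PR's actual change: it assumes sequential MKL is sufficient, drops the Intel OpenMP (-liomp5) dependency, and reuses the MKLROOT/TBBROOT/NUMAROOT layout already present in this Makefile.

```makefile
# Hypothetical minimal variant for apples-to-apples testing:
# gcc instead of icpc, sequential MKL, no Intel OpenMP runtime.
CXX     = g++
LDFLAGS = -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_sequential \
          -L${NUMAROOT}/lib -lnuma -L${TBBROOT}/lib -ltbb -lpthread -lm -ldl
```

Diffing performance between this baseline and the PR's icpc/iomp5 line would isolate which of the two changes (compiler vs. runtime libraries) actually matters.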

Comment thread: mlp_tbb.cc
public:
pinning_observer(tbb::task_arena& arena, int numa_node_id)
: tbb::task_scheduler_observer(arena), numa_node_id_(numa_node_id) {
: tbb::task_scheduler_observer(arena), arena_(arena), numa_node_id_(numa_node_id) {
Owner

Where are we using arena_?

Comment thread: mlp_tbb.cc
SP*nthreads_per_socket,
[&](size_t task_id) {
double sgst = dsecnd();
int tid = numa_node_id_ * nthreads_per_socket + task_id;
Owner

If a thread grabs more than one task, we want to account the execution time of all of those tasks to that thread. This is the reason why I kept track of execution time based on current_thread_index. Please let me know your thoughts.



4 participants