Replies: 4 comments 7 replies
As you say, clearly there is some instability here - the 20% -ve slowdown
for L3 logging is particularly weird. I suspect it has to do with
timekeeping, e.g. NTP making the clock run slower.
I think we'll need to do multiple runs (8?) and take the median. I realise
that's painful, but I can't think of an alternative.
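For what it's worth, the take-the-median step could be sketched like this (a C++ sketch only; `median` is a hypothetical helper, not part of the existing test.sh / perf-report.py tooling):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Median of N throughput samples (e.g. K ops/sec from N benchmark runs).
// With an even sample count this takes the mean of the two middle values.
double median(std::vector<double> samples) {
    std::sort(samples.begin(), samples.end());
    std::size_t n = samples.size();
    if (n % 2 == 1) {
        return samples[n / 2];
    }
    return (samples[n / 2 - 1] + samples[n / 2]) / 2.0;
}
```

The median is preferred over the mean here because a single run perturbed by NTP or thermal throttling shifts the mean but usually leaves the median untouched.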
On Thu, 13 Jun 2024 at 02:53, Aditya P. Gurajada wrote:
Hi, @gregthelaw --
I now have the test.sh and report generation productized under the 2
commits under PR #75.
I ran a perf-test on an AWS bare-metal instance with 72 cores.
Test configuration: 32 clients, 1 million messages per client; server
thread counts: 1, 2, 4, 8.
Here are the post-processed results.
I was expecting to see increasing degradation due to concurrent threads
logging but the results don't reflect this behaviour.
Below each chunk of output, I have annotated my observations. Comments
welcome:
*Performance comparison for NumClients=32, NumOps=32000000 (32 Million),
NumThreads=1*
+-------------------------------+-------------------+----------+-------------------+----------+---------------+
| Run-Type | Server throughput | Srv:Drop | Client throughput | Cli:Drop | NumOps/thread |
+-------------------------------+-------------------+----------+-------------------+----------+---------------+
| Baseline - No logging | ~295.85 K ops/sec | 0.00 % | ~9.79 K ops/sec | 0.00 % | 32 Million |
| L3-logging (no LOC) | ~355.17 K ops/sec | 20.05 % | ~11.83 K ops/sec | 20.85 % | 32 Million |
| L3-fast logging (no LOC) | ~296.56 K ops/sec | 0.24 % | ~9.73 K ops/sec | -0.58 % | 32 Million |
| L3-fprintf() logging (no LOC) | ~270.67 K ops/sec | -8.51 % | ~8.87 K ops/sec | -9.44 % | 32 Million |
| L3-write() logging (no LOC) | ~188.06 K ops/sec | -36.43 % | ~6.09 K ops/sec | -37.77 % | 32 Million |
| L3-logging default LOC | ~296.63 K ops/sec | 0.26 % | ~9.76 K ops/sec | -0.31 % | 32 Million |
| L3-logging LOC-ELF | ~297.14 K ops/sec | 0.44 % | ~9.77 K ops/sec | -0.19 % | 32 Million |
| spdlog-logging | ~284.75 K ops/sec | -3.75 % | ~9.36 K ops/sec | -4.36 % | 32 Million |
| spdlog-backtrace-logging | ~283.64 K ops/sec | -4.13 % | ~9.31 K ops/sec | -4.90 % | 32 Million |
+-------------------------------+-------------------+----------+-------------------+----------+---------------+
   - L3-logging is 20% faster than baseline: this is totally unexpected
   and unheard of. I attribute it to some machine instability [?]
   - The remaining %-drop numbers are in line with what we've been seeing
   elsewhere.
*Performance comparison for NumClients=32, NumOps=32000000 (32 Million),
NumThreads=2*
+-------------------------------+-------------------+----------+-------------------+----------+---------------+
| Run-Type | Server throughput | Srv:Drop | Client throughput | Cli:Drop | NumOps/thread |
+-------------------------------+-------------------+----------+-------------------+----------+---------------+
| Baseline - No logging | ~457.74 K ops/sec | 0.00 % | ~15.41 K ops/sec | 0.00 % | 16 Million |
| L3-logging (no LOC) | ~442.93 K ops/sec | -3.24 % | ~14.87 K ops/sec | -3.46 % | 16 Million |
| L3-fast logging (no LOC) | ~437.40 K ops/sec | -4.44 % | ~14.68 K ops/sec | -4.75 % | 16 Million |
| L3-fprintf() logging (no LOC) | ~451.28 K ops/sec | -1.41 % | ~15.31 K ops/sec | -0.62 % | 16 Million |
| L3-write() logging (no LOC) | ~238.83 K ops/sec | -47.82 % | ~7.75 K ops/sec | -49.70 % | 16 Million |
| L3-logging default LOC | ~439.79 K ops/sec | -3.92 % | ~14.76 K ops/sec | -4.20 % | 16 Million |
| L3-logging LOC-ELF | ~439.76 K ops/sec | -3.93 % | ~14.76 K ops/sec | -4.20 % | 16 Million |
| spdlog-logging | ~406.27 K ops/sec | -11.24 % | ~13.63 K ops/sec | -11.56 % | 16 Million |
| spdlog-backtrace-logging | ~457.64 K ops/sec | -0.02 % | ~15.63 K ops/sec | 1.43 % | 16 Million |
+-------------------------------+-------------------+----------+-------------------+----------+---------------+
   - For 2 threads, the overall percentage drops seem reasonable / as
   expected.
   - Between 1 and 2 server threads, one would expect to see greater
   degradation for fprintf(), but it has gone down.
   - write() perf has worsened - I guess this could be due to concurrent
   threads logging [?]
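One plausible explanation for the write() numbers: each write() is a separate syscall, and concurrent writers serialize in the kernel on the shared file offset, whereas fprintf() accumulates in a user-space stdio buffer and only enters the kernel when the buffer fills. A minimal illustration (hypothetical helpers, not the L3 implementation):

```cpp
#include <cstdio>
#include <cstring>
#include <unistd.h>

// One write(2) syscall per message: every log line pays a kernel
// crossing, and concurrent writers contend on the shared file offset.
void log_write(int fd, const char *msg) {
    write(fd, msg, strlen(msg));
}

// fprintf() appends to a user-space stdio buffer; the kernel is entered
// only when the buffer fills (or on fflush), amortizing the syscall cost.
void log_fprintf(FILE *fp, const char *msg) {
    fprintf(fp, "%s", msg);
}
```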
*Performance comparison for NumClients=32, NumOps=32000000 (32 Million),
NumThreads=4*
+-------------------------------+-------------------+----------+-------------------+----------+---------------+
| Run-Type | Server throughput | Srv:Drop | Client throughput | Cli:Drop | NumOps/thread |
+-------------------------------+-------------------+----------+-------------------+----------+---------------+
| Baseline - No logging | ~695.46 K ops/sec | 0.00 % | ~24.44 K ops/sec | 0.00 % | 8 Million |
| L3-logging (no LOC) | ~673.46 K ops/sec | -3.16 % | ~23.68 K ops/sec | -3.11 % | 8 Million |
| L3-fast logging (no LOC) | ~658.55 K ops/sec | -5.31 % | ~22.99 K ops/sec | -5.94 % | 8 Million |
| L3-fprintf() logging (no LOC) | ~580.25 K ops/sec | -16.57 % | ~19.95 K ops/sec | -18.35 % | 8 Million |
| L3-write() logging (no LOC) | ~314.35 K ops/sec | -54.80 % | ~10.34 K ops/sec | -57.69 % | 8 Million |
| L3-logging default LOC | ~681.19 K ops/sec | -2.05 % | ~23.89 K ops/sec | -2.24 % | 8 Million |
| L3-logging LOC-ELF | ~664.07 K ops/sec | -4.51 % | ~23.27 K ops/sec | -4.79 % | 8 Million |
| spdlog-logging | ~504.25 K ops/sec | -27.49 % | ~17.06 K ops/sec | -30.18 % | 8 Million |
| spdlog-backtrace-logging | ~626.13 K ops/sec | -9.97 % | ~21.84 K ops/sec | -10.62 % | 8 Million |
+-------------------------------+-------------------+----------+-------------------+----------+---------------+
   - With 4 (and, below, 8) threads, write() and spdlog logging performance
   seems to go down quite a bit. This is consistent with the hypothesis that
   concurrent logging degrades these schemes even more.
   - But fprintf() degradation is fluctuating -- this may be due to
   file-system / glibc caching?
*Performance comparison for NumClients=32, NumOps=32000000 (32 Million),
NumThreads=8*
+-------------------------------+-------------------+----------+-------------------+----------+---------------+
| Run-Type | Server throughput | Srv:Drop | Client throughput | Cli:Drop | NumOps/thread |
+-------------------------------+-------------------+----------+-------------------+----------+---------------+
| Baseline - No logging | ~744.79 K ops/sec | 0.00 % | ~26.40 K ops/sec | 0.00 % | 4 Million |
| L3-logging (no LOC) | ~734.90 K ops/sec | -1.33 % | ~26.08 K ops/sec | -1.22 % | 4 Million |
| L3-fast logging (no LOC) | ~719.97 K ops/sec | -3.33 % | ~25.50 K ops/sec | -3.42 % | 4 Million |
| L3-fprintf() logging (no LOC) | ~681.84 K ops/sec | -8.45 % | ~23.86 K ops/sec | -9.61 % | 4 Million |
| L3-write() logging (no LOC) | ~293.85 K ops/sec | -60.55 % | ~9.63 K ops/sec | -63.51 % | 4 Million |
| L3-logging default LOC | ~720.68 K ops/sec | -3.24 % | ~25.44 K ops/sec | -3.66 % | 4 Million |
| L3-logging LOC-ELF | ~726.58 K ops/sec | -2.44 % | ~25.68 K ops/sec | -2.75 % | 4 Million |
| spdlog-logging | ~581.29 K ops/sec | -21.95 % | ~19.93 K ops/sec | -24.50 % | 4 Million |
| spdlog-backtrace-logging | ~758.00 K ops/sec | 1.77 % | ~27.20 K ops/sec | 3.01 % | 4 Million |
+-------------------------------+-------------------+----------+-------------------+----------+---------------+
*Summary Conclusions*:
   - Overall, with increased concurrency the other logging schemes show
   greater degradation.
   - We are not seeing the ~1% performance degradation with L3-logging;
   it's always 3+%. Maybe a measurement anomaly [?]
   - spdlog-backtrace logging is competitive with L3 and L3-fast logging
   in terms of degradation, but you cannot unpack the log-info after the run.
   (Backtrace only dumps while the server is running.)
   - L3-fast logging still has some non-trivial degradation. You need to
   investigate and address open issue #76 (L3-Fast logging interface
   improvements).
-
All very odd, isn't it? The numthreads=1 and numthreads=4 results are reasonable. numthreads=2 is disappointing because, as you say, spdlog appears to be competitive. But on numthreads=8 we once again see spdlog and L3 all have negative slowdown - i.e. a speedup compared to baseline. This implies that we are still not able to measure accurately.
Did you say the machine you're using has 72 physical cores? That is presumably a NUMA architecture, and weird shit can indeed happen there. Can you try with a smaller machine, specifically one that is not NUMA? And/or we may need to experiment with pinning threads to CPU cores.
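A thread-pinning experiment could look something like this on Linux (a sketch only; `pin_to_core` is a hypothetical helper built on the Linux-specific sched_setaffinity(2), not something the benchmark currently does):

```cpp
#include <sched.h>   // sched_setaffinity, cpu_set_t, CPU_ZERO, CPU_SET (Linux)

// Pin the calling thread to a single CPU core so the scheduler cannot
// migrate it across sockets mid-run (a NUMA-noise suspect).
// Returns 0 on success, -1 on failure (errno set).
int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);  // 0 = calling thread
}
```

Each server thread would call this with a distinct core number before entering its work loop; comparing pinned vs unpinned runs would show whether cross-socket migration explains the noise.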
-
I experimented with my laptop. I carefully shut down all other programs, plugged in, and deliberately throttled CPUs to try to avoid weird thermal/throttling issues. There is some bizarre effect with spdlog and two threads; everything else looks vaguely sane. There are a few -ve slowdowns, but all under 1%, so I think we conclude "no measurable difference if the change against baseline is < 1% either way." But spdlog multi-threaded is just weird: 4 and 8 threads still show less slowdown than 1, which makes no sense: there is a mutex, and it will always be uncontended in the single-threaded case.
-
Oh wait, hang on, it looks like maybe you need to configure spdlog differently for multithreaded. Do we do that?
-
Hi, @gregthelaw --
Updated 15.Jun.2024:
I now have a new commit under PR #85 that can be used to run multiple iterations of client-server performance tests. I had to massage perf-report.py to find the median value of metrics and generate the comparison report.
Here are the reports:
Perf run parameters: 32 clients, 1 million msgs/client, run 5 iterations with different server-thread configurations.
Observations: degradation for write() logging and spdlog logging does go up.
Performance comparison for 32 clients, 32000000 (32 Million) msgs, num_threads=1
Performance comparison for 32 clients, 32000000 (32 Million) msgs, num_threads=2
**** (Using median value of metric across 5 iterations) ****
Performance comparison for 32 clients, 32000000 (32 Million) msgs, num_threads=4
**** (Using median value of metric across 5 iterations) ****
Performance comparison for 32 clients, 32000000 (32 Million) msgs, num_threads=8
**** (Using median value of metric across 5 iterations) ****