Parallelisation performance reflection #236

@XingerTang

Description

In summary, the main-programme parallelisation is working well, while the IO parallelisation is not working at all, and it cannot be fixed immediately.

AlphaPeel has two options for parallelisation:

Multithreading Options:
  -n_io_thread N_IO_THREAD
                        Number of threads to use for input and output. Default: 1.
  -n_thread N_THREAD    Maximum number of threads to use for analysis. Default: 1.

Parallel IO

The n_io_thread option controls the parallelisation of the input and output processing. In the actual implementation, it is realised not with multi-threading but with multi-processing via concurrent.futures.ProcessPoolExecutor. One possible reason for the mismatch between the name and the implementation is that the developer originally intended a multi-threaded approach, but Python's GIL prevents threads from running Python bytecode in parallel, so no performance gain was possible. The code was later changed to multiple processes, but the option name was not updated accordingly.
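To make the current approach concrete, here is a minimal, self-contained sketch of ProcessPoolExecutor-based parallel reading. The read_genotype_file worker and the file format are invented for illustration and are not the actual tinyhouse code; the fork start method is forced only to keep the sketch short (under the default spawn method, the executor would need an `if __name__ == "__main__"` guard):

```python
import multiprocessing
import tempfile
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def read_genotype_file(path):
    """Hypothetical worker: parse one whitespace-separated file into a dict.

    The parsed result must be pickled and shipped back to the parent process,
    which is where the inter-process overhead comes from.
    """
    records = {}
    for line in Path(path).read_text().splitlines():
        ident, *values = line.split()
        records[ident] = [int(v) for v in values]
    return records

# Write two small demo files standing in for real genotype input.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.txt").write_text("id1 0 1 2\nid2 2 1 0\n")
(tmp / "b.txt").write_text("id3 1 1 1\n")

ctx = multiprocessing.get_context("fork")
with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as pool:
    results = list(pool.map(read_genotype_file, sorted(tmp.glob("*.txt"))))

# Merge the per-file dicts returned by the worker processes.
merged = {k: v for part in results for k, v in part.items()}
```

Every value in `results` crosses a process boundary via pickle, so the cost grows with the size of the parsed data rather than with the size of the raw files.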

However, the multi-process implementation still performs poorly.

In the following table, 0001 represents 1 process, 0002 represents 2 processes, and 0003 represents 5 processes. All run the same file-reading task; only the number of processes differs.

-------------------------------------------------------------------------------------------- benchmark: 3 tests -------------------------------------------------------------------------------------------
Name (time in ms)                      Min                   Max                Mean              StdDev              Median                 IQR            Outliers      OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_iothreads (0001_d5197e0)      89.6345 (1.0)        138.7888 (1.0)       96.3105 (1.0)       12.8109 (2.20)      93.0399 (1.0)        0.9166 (1.0)           1;2  10.3831 (1.0)          13           1
test_iothreads (0002_d5197e0)     654.5294 (7.30)       667.5440 (4.81)     661.3113 (6.87)       5.8327 (1.0)      659.3465 (7.09)      10.3068 (11.24)         3;0   1.5121 (0.15)          5           1
test_iothreads (0003_d5197e0)     860.1779 (9.60)     1,182.9155 (8.52)     950.9395 (9.87)     140.6704 (24.12)    863.3150 (9.28)     175.4821 (191.44)        1;0   1.0516 (0.10)          5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

We can see that the more processes we use, the longer it takes, which is the opposite of what we want. This is most likely the overhead of serialising and transmitting the parsed data between processes. IO-bound work usually fares better with multi-threading, where memory is shared and nothing needs to be copied.
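To make that overhead concrete, a small sketch can measure just the serialisation cost that a worker's result incurs when crossing the process boundary. The payload below is an invented stand-in for a parsed genotype table, not real AlphaPeel data:

```python
import pickle
import time

# Multi-megabyte payload standing in for a parsed genotype table; every result
# returned from a worker process must cross this serialisation boundary twice
# (pickled in the child, unpickled in the parent).
payload = {f"id{i}": list(range(25)) for i in range(100_000)}

t0 = time.perf_counter()
blob = pickle.dumps(payload)
restored = pickle.loads(blob)
elapsed = time.perf_counter() - t0

print(f"round-tripped {len(blob) / 1e6:.1f} MB of pickle in {elapsed * 1000:.1f} ms")
```

On top of the pickle time itself, the bytes are pushed through an inter-process pipe, so the measured cost is a lower bound on the real transfer overhead.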

If I change the code to do the multi-threading, we can see the following results. Here, 0004 represents 1 thread, 0005 represents 2 threads, and 0006 represents 5 threads.

----------------------------------------------------------------------------------------- benchmark: 3 tests -----------------------------------------------------------------------------------------
Name (time in ms)                      Min                 Max                Mean             StdDev              Median                IQR            Outliers     OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_iothreads (0004_d5197e0)      93.3218 (1.0)      196.6089 (1.57)     107.6778 (1.0)      30.4942 (3.06)      94.7627 (1.0)      14.9402 (1.0)           1;1  9.2870 (1.0)          11           1
test_iothreads (0006_d5197e0)      98.6850 (1.06)     125.3731 (1.0)      109.0491 (1.01)     10.3442 (1.04)     103.2473 (1.09)     18.2262 (1.22)          4;0  9.1702 (0.99)         11           1
test_iothreads (0005_d5197e0)     101.4286 (1.09)     126.7332 (1.01)     108.5702 (1.01)      9.9609 (1.0)      103.0935 (1.09)     15.0986 (1.01)          3;0  9.2106 (0.99)         11           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The code is available at XingerTang/tinyhouse@6f888ea. The runtime no longer grows with the number of threads, since multi-threading avoids the inter-process transfer overhead. But we do not gain performance either, because of Python's GIL, which only becomes optional with the free-threaded builds arriving around Python 3.13/3.14.
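For reference, a minimal sketch of what the thread-based variant looks like; the function and file names are invented for the example, and the real change is in the linked commit:

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def count_lines(path):
    # Threads share memory, so results come back with no pickling overhead,
    # but the GIL still serialises the pure-Python parsing work.
    return len(Path(path).read_text().splitlines())

# Two tiny demo files standing in for real pedigree/genotype input.
tmp = Path(tempfile.mkdtemp())
(tmp / "ped.txt").write_text("id1 0 1\nid2 1 0\n")
(tmp / "geno.txt").write_text("id1 2\n")

paths = sorted(tmp.glob("*.txt"))
with ThreadPoolExecutor(max_workers=2) as pool:
    counts = dict(zip((p.name for p in paths), pool.map(count_lines, paths)))
```

The GIL is released while a thread waits on the underlying read() syscall, which is why the thread counts in the table neither help nor hurt: the waiting overlaps, the parsing does not.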

One way to release the GIL is to move the IO into a C extension for CPython that releases it explicitly; an example is https://github.com/XingerTang/Multithreads-CPython-Example. Cython can achieve the same thing in a similar way with its nogil blocks. But both require substantial changes to the structure of tinyhouse, and tinyhouse would then need a compilation step before the C or Cython extension could be called.

Another way is what the main programme already does: numba's JIT compilation with nogil=True. Unfortunately, the IO code does not fit numba's JIT constraints (it relies on Python objects, strings, and file handling that numba cannot compile), so this is not usable here.
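To illustrate the main-programme style, here is a sketch of a nogil-compiled kernel driven by plain threads. The dot_sum kernel is invented for the example, and a no-op fallback decorator is included (an assumption, not part of tinyhouse) so the sketch still runs where numba is not installed:

```python
import threading
import numpy as np

try:
    from numba import njit
except ImportError:  # fallback so the sketch runs without numba; no speedup then
    def njit(**kwargs):
        def wrap(func):
            return func
        return wrap

@njit(nogil=True)
def dot_sum(a, b):
    # Numeric kernel in the style numba can compile; with nogil=True the
    # compiled code drops the GIL, so several threads can run it concurrently.
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total

a = np.ones(1000)
b = np.full(1000, 2.0)
results = []
threads = [threading.Thread(target=lambda: results.append(dot_sum(a, b)))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

This is exactly why peeling parallelises but reading does not: the hot loop is numeric and compilable, whereas file parsing never leaves the interpreter.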

Parallel Peeling

n_thread controls the number of threads running in the main programme.

Simple benchmarks of parallel peeling were also run to confirm that the multi-threading in the main programme is working well.

Here, the simple accuracy test is benchmarked; the results are shown in the following table, where 0001 refers to 1 thread, 0002 to 2 threads, and 0003 to 5 threads.

------------------------------------------------------------------------------------------------------------ benchmark: 3 tests -----------------------------------------------------------------------------------------------------------
Name (time in s)                                                                 Min                Max               Mean            StdDev             Median               IQR            Outliers     OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_accu[single-None-None-None-None-None-None-None-None] (0003_d466daf)     24.9094 (1.0)      29.3099 (1.02)     26.2840 (1.0)      1.7987 (2.43)     25.4170 (1.0)      2.0700 (1.71)          1;0  0.0380 (1.0)           5           1
test_accu[single-None-None-None-None-None-None-None-None] (0002_d466daf)     27.0001 (1.08)     28.7417 (1.0)      27.7603 (1.06)     0.7404 (1.0)      27.9412 (1.10)     1.2135 (1.0)           2;0  0.0360 (0.95)          5           1
test_accu[single-None-None-None-None-None-None-None-None] (0001_d466daf)     33.8098 (1.36)     37.8417 (1.32)     35.2298 (1.34)     1.7202 (2.32)     34.6285 (1.36)     2.6724 (2.20)          1;0  0.0284 (0.75)          5           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

With an increasing number of threads the runtime is reduced, but only modestly (about 1.36x on the medians with 5 threads), showing there is room for further optimisation.
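As a back-of-envelope check, assuming Amdahl's law applies, the median timings from the table above imply how much of the run actually parallelises:

```python
# Median runtimes from the table above (seconds), 1 thread vs 5 threads.
t1, t5 = 34.6285, 25.4170
speedup = t1 / t5  # observed speedup with 5 threads

# Amdahl's law: S(n) = 1 / ((1 - p) + p / n); solve for the parallel fraction p.
n = 5
p = (1 - 1 / speedup) / (1 - 1 / n)

print(f"speedup with 5 threads: {speedup:.2f}x")
print(f"implied parallel fraction: {p:.0%}")
```

Under this (rough) model only about a third of the runtime is spent in the parallel region, so either the serial portions (setup, IO, merging) dominate or the threads contend somewhere; profiling would tell which.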

My questions

  • Should we remove the parallel IO, given that it is not working?
  • Or should we try to integrate a C extension or Cython for better performance?
