Parallelisation performance reflection #236

@XingerTang

Description

In summary, the main-programme parallelisation is working well, while the IO parallelisation is not working at all, and it cannot be fixed immediately.

AlphaPeel has two options for parallelisation:

Multithreading Options:
  -n_io_thread N_IO_THREAD
                        Number of threads to use for input and output. Default: 1.
  -n_thread N_THREAD    Maximum number of threads to use for analysis. Default: 1.

Parallel IO

The n_io_thread option controls the parallelisation of the input and output processing. In the actual implementation, it is realised not with multi-threading but with multi-processing via concurrent.futures.ProcessPoolExecutor. One possible reason for the mismatch between the name and the implementation is that the developer originally intended a multi-threaded approach, but Python's GIL prevents threads from running Python bytecode in parallel, so no performance gain was possible. The code was later changed to multiple processes, but the option name was not updated accordingly.
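To make the current approach concrete, here is a minimal, self-contained sketch of ProcessPoolExecutor-based parallel reading. The read_genotype_file worker and the file format are invented for illustration and are not the actual tinyhouse code; the fork start method is forced only to keep the sketch short (under the default spawn method, the executor would need an `if __name__ == "__main__"` guard):

```python
import multiprocessing
import tempfile
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def read_genotype_file(path):
    """Hypothetical worker: parse one whitespace-separated file into a dict.

    The parsed result must be pickled and shipped back to the parent process,
    which is where the inter-process overhead comes from.
    """
    records = {}
    for line in Path(path).read_text().splitlines():
        ident, *values = line.split()
        records[ident] = [int(v) for v in values]
    return records

# Write two small demo files standing in for real genotype input.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.txt").write_text("id1 0 1 2\nid2 2 1 0\n")
(tmp / "b.txt").write_text("id3 1 1 1\n")

ctx = multiprocessing.get_context("fork")
with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as pool:
    results = list(pool.map(read_genotype_file, sorted(tmp.glob("*.txt"))))

# Merge the per-file dicts returned by the worker processes.
merged = {k: v for part in results for k, v in part.items()}
```

Every value in `results` crosses a process boundary via pickle, so the cost grows with the size of the parsed data rather than with the size of the raw files.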

However, the multi-process implementation still performs poorly.

In the following table, 0001 represents 1 process, 0002 represents 2 processes, and 0003 represents 5 processes. All run the same file-reading task; only the number of processes differs.

-------------------------------------------------------------------------------------------- benchmark: 3 tests -------------------------------------------------------------------------------------------
Name (time in ms)                      Min                   Max                Mean              StdDev              Median                 IQR            Outliers      OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_iothreads (0001_d5197e0)      89.6345 (1.0)        138.7888 (1.0)       96.3105 (1.0)       12.8109 (2.20)      93.0399 (1.0)        0.9166 (1.0)           1;2  10.3831 (1.0)          13           1
test_iothreads (0002_d5197e0)     654.5294 (7.30)       667.5440 (4.81)     661.3113 (6.87)       5.8327 (1.0)      659.3465 (7.09)      10.3068 (11.24)         3;0   1.5121 (0.15)          5           1
test_iothreads (0003_d5197e0)     860.1779 (9.60)     1,182.9155 (8.52)     950.9395 (9.87)     140.6704 (24.12)    863.3150 (9.28)     175.4821 (191.44)        1;0   1.0516 (0.10)          5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

We can see that the more processes we use, the longer it takes, which is the opposite of what we want. This is most likely the overhead of serialising and transmitting the parsed data between processes. IO-bound work usually fares better with multi-threading, where memory is shared and nothing needs to be copied.
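To make that overhead concrete, a small sketch can measure just the serialisation cost that a worker's result incurs when crossing the process boundary. The payload below is an invented stand-in for a parsed genotype table, not real AlphaPeel data:

```python
import pickle
import time

# Multi-megabyte payload standing in for a parsed genotype table; every result
# returned from a worker process must cross this serialisation boundary twice
# (pickled in the child, unpickled in the parent).
payload = {f"id{i}": list(range(25)) for i in range(100_000)}

t0 = time.perf_counter()
blob = pickle.dumps(payload)
restored = pickle.loads(blob)
elapsed = time.perf_counter() - t0

print(f"round-tripped {len(blob) / 1e6:.1f} MB of pickle in {elapsed * 1000:.1f} ms")
```

On top of the pickle time itself, the bytes are pushed through an inter-process pipe, so the measured cost is a lower bound on the real transfer overhead.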

If I change the code to do the multi-threading, we can see the following results. Here, 0004 represents 1 thread, 0005 represents 2 threads, and 0006 represents 5 threads.

----------------------------------------------------------------------------------------- benchmark: 3 tests -----------------------------------------------------------------------------------------
Name (time in ms)                      Min                 Max                Mean             StdDev              Median                IQR            Outliers     OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_iothreads (0004_d5197e0)      93.3218 (1.0)      196.6089 (1.57)     107.6778 (1.0)      30.4942 (3.06)      94.7627 (1.0)      14.9402 (1.0)           1;1  9.2870 (1.0)          11           1
test_iothreads (0006_d5197e0)      98.6850 (1.06)     125.3731 (1.0)      109.0491 (1.01)     10.3442 (1.04)     103.2473 (1.09)     18.2262 (1.22)          4;0  9.1702 (0.99)         11           1
test_iothreads (0005_d5197e0)     101.4286 (1.09)     126.7332 (1.01)     108.5702 (1.01)      9.9609 (1.0)      103.0935 (1.09)     15.0986 (1.01)          3;0  9.2106 (0.99)         11           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The code is available at XingerTang/tinyhouse@6f888ea. The runtime no longer grows with the number of threads, since multi-threading avoids the inter-process transfer overhead. But we do not gain performance either, because of Python's GIL, which only becomes optional with the free-threaded builds arriving around Python 3.13/3.14.
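For reference, a minimal sketch of what the thread-based variant looks like; the function and file names are invented for the example, and the real change is in the linked commit:

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def count_lines(path):
    # Threads share memory, so results come back with no pickling overhead,
    # but the GIL still serialises the pure-Python parsing work.
    return len(Path(path).read_text().splitlines())

# Two tiny demo files standing in for real pedigree/genotype input.
tmp = Path(tempfile.mkdtemp())
(tmp / "ped.txt").write_text("id1 0 1\nid2 1 0\n")
(tmp / "geno.txt").write_text("id1 2\n")

paths = sorted(tmp.glob("*.txt"))
with ThreadPoolExecutor(max_workers=2) as pool:
    counts = dict(zip((p.name for p in paths), pool.map(count_lines, paths)))
```

The GIL is released while a thread waits on the underlying read() syscall, which is why the thread counts in the table neither help nor hurt: the waiting overlaps, the parsing does not.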

One way to release the GIL is to move the IO into a C extension for CPython that releases it explicitly; an example is https://github.com/XingerTang/Multithreads-CPython-Example. Cython can achieve the same thing in a similar way with its nogil blocks. But both require substantial changes to the structure of tinyhouse, and tinyhouse would then need a compilation step before the C or Cython extension could be called.

Another way is what the main programme already does: numba's JIT compilation with nogil=True. Unfortunately, the IO code does not fit numba's JIT constraints (it relies on Python objects, strings, and file handling that numba cannot compile), so this is not usable here.
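To illustrate the main-programme style, here is a sketch of a nogil-compiled kernel driven by plain threads. The dot_sum kernel is invented for the example, and a no-op fallback decorator is included (an assumption, not part of tinyhouse) so the sketch still runs where numba is not installed:

```python
import threading
import numpy as np

try:
    from numba import njit
except ImportError:  # fallback so the sketch runs without numba; no speedup then
    def njit(**kwargs):
        def wrap(func):
            return func
        return wrap

@njit(nogil=True)
def dot_sum(a, b):
    # Numeric kernel in the style numba can compile; with nogil=True the
    # compiled code drops the GIL, so several threads can run it concurrently.
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total

a = np.ones(1000)
b = np.full(1000, 2.0)
results = []
threads = [threading.Thread(target=lambda: results.append(dot_sum(a, b)))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

This is exactly why peeling parallelises but reading does not: the hot loop is numeric and compilable, whereas file parsing never leaves the interpreter.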

Parallel Peeling

n_thread controls the number of threads running in the main programme.

Simple benchmarks of parallel peeling were also run to confirm that the multi-threading in the main programme is working well.

Here, the simple accuracy test is benchmarked; the results are shown in the following table, where 0001 refers to 1 thread, 0002 to 2 threads, and 0003 to 5 threads.

------------------------------------------------------------------------------------------------------------ benchmark: 3 tests -----------------------------------------------------------------------------------------------------------
Name (time in s)                                                                 Min                Max               Mean            StdDev             Median               IQR            Outliers     OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_accu[single-None-None-None-None-None-None-None-None] (0003_d466daf)     24.9094 (1.0)      29.3099 (1.02)     26.2840 (1.0)      1.7987 (2.43)     25.4170 (1.0)      2.0700 (1.71)          1;0  0.0380 (1.0)           5           1
test_accu[single-None-None-None-None-None-None-None-None] (0002_d466daf)     27.0001 (1.08)     28.7417 (1.0)      27.7603 (1.06)     0.7404 (1.0)      27.9412 (1.10)     1.2135 (1.0)           2;0  0.0360 (0.95)          5           1
test_accu[single-None-None-None-None-None-None-None-None] (0001_d466daf)     33.8098 (1.36)     37.8417 (1.32)     35.2298 (1.34)     1.7202 (2.32)     34.6285 (1.36)     2.6724 (2.20)          1;0  0.0284 (0.75)          5           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

With an increasing number of threads the runtime is reduced, but only modestly (about 1.36x on the medians with 5 threads), showing there is room for further optimisation.
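As a back-of-envelope check, assuming Amdahl's law applies, the median timings from the table above imply how much of the run actually parallelises:

```python
# Median runtimes from the table above (seconds), 1 thread vs 5 threads.
t1, t5 = 34.6285, 25.4170
speedup = t1 / t5  # observed speedup with 5 threads

# Amdahl's law: S(n) = 1 / ((1 - p) + p / n); solve for the parallel fraction p.
n = 5
p = (1 - 1 / speedup) / (1 - 1 / n)

print(f"speedup with 5 threads: {speedup:.2f}x")
print(f"implied parallel fraction: {p:.0%}")
```

Under this (rough) model only about a third of the runtime is spent in the parallel region, so either the serial portions (setup, IO, merging) dominate or the threads contend somewhere; profiling would tell which.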

My questions

  • Should we remove the parallel IO, given that it is not working?
  • Or should we try to integrate a C extension or Cython for better performance?
