This repository was archived by the owner on Dec 11, 2023. It is now read-only.

Pandas out_flavor for better ctable performance #176

@ARF1

In this issue I want to make the case for extending the effect of out_flavor to __getitem__() (and related functions) and for introducing a pandas out_flavor. While I appreciate the rationale for limiting bcolz to the numpy data model, I believe the possible performance improvements with pandas merit consideration.

I would be very interested to know whether this has any chance of inclusion in bcolz, since the effort for a clean implementation would be non-trivial.

Executive Summary

  • bcolz is a columnar (albeit chunked) data container
  • thus, after decompression of each chunk, the data for each column is (chunk-)contiguous
  • currently, however, this column-contiguous data is forced into a row-major memory structure (a numpy structured array)
  • this has two consequences:
    1. the contiguous column data has to be copied in row-sized chunks, which is much slower than en-bloc copying of contiguous data,
    2. the data has to be copied at all (rather than being decompressed directly into the "proper" place)
  • moving to a column-major memory layout (e.g. a pandas DataFrame) would:
    1. make assignment of data to columns (a very common operation with ctable) much faster,
    2. open the possibility of avoiding repeated copy operations in the first place,
    3. let ctable really leverage multi-core machines; currently the numpy structured array is a bottleneck for __getitem__() and the like: Numpy bottleneck with ctable #174,
    4. improve down-stream performance, since much data analysis is carried out along columns rather than along rows
  • due to the similar data access models of numpy structured arrays and pandas DataFrames, most code could probably remain unchanged once the output data object is initialised as either numpy or pandas
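The memory-layout point above is easy to verify directly: in a column-major (Fortran-ordered) numpy array each column is one contiguous block of memory, while in the row-major default it is strided (a minimal check, not bcolz code):

```python
import numpy as np

A_c = np.zeros((1000, 10), dtype='f4')             # row-major (C order, the default)
A_f = np.zeros((1000, 10), dtype='f4', order='F')  # column-major (Fortran order)

# a column of the F-ordered array is one contiguous block of memory...
print(A_f[:, 0].flags['C_CONTIGUOUS'])  # True
# ...while the same column of the C-ordered array is strided
print(A_c[:, 0].flags['C_CONTIGUOUS'])  # False
```

Contiguous columns are what make en-bloc copies (and direct decompression targets) possible in the first place.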

Evidence of column assignment bottleneck

Note: for the sake of simplicity, when I use the term "column" in relation to numpy structured arrays, I refer to the fields of the (single column) structured array that bcolz uses to store its output column.

Numpy structured arrays (as used by bcolz) are inherently row-major; as far as I can see this cannot be changed. This means that column assignment is fairly slow:

```python
In [1]: import numpy as np

In [2]: M,N=int(1e7),10

# make a row-major numpy array
In [4]: A1=np.zeros((M,N),'f')

In [9]: dt=np.dtype(','.join(['f' for _ in range(N)]))

# make a structured numpy array
In [10]: A2=np.zeros((M,),dtype=dt)

In [11]: X=np.arange(M+0.0)

### simulate ctable __getitem__() operation:
# assignment to a row-major numpy array column by column
In [13]: %timeit for n in range(N):A1[:,n]=X
1 loops, best of 3: 2.36 s per loop

# assignment to a structured array field by field
In [15]: %timeit for n in dt.names: A2[n]=X
1 loops, best of 3: 2.36 s per loop
```

A significant speedup (a factor of 6.5!) can be achieved by moving to a column-major memory layout for the numpy array:

```python
In [1]: import numpy as np

In [2]: M,N=int(1e7),10

# make a column-major numpy array
In [3]: A1=np.zeros((M,N),'f', 'F')

In [4]: dt=np.dtype(','.join(['f' for _ in range(N)]))

# make a structured numpy array
In [5]: A2=np.zeros((M,),dtype=dt)

In [6]: X=np.arange(M+0.0)

### simulate ctable __getitem__() operation:
# assignment to a column-major numpy array column by column
In [8]: %timeit for n in range(N):A1[:,n]=X
1 loops, best of 3: 374 ms per loop

# assignment to a structured array field by field
In [9]: %timeit for n in dt.names: A2[n]=X
1 loops, best of 3: 2.43 s per loop
```

Why Pandas DataFrames would help

While moving to column-major numpy arrays would be the ideal solution, this is obviously not an option: they require the entire array to have a homogeneous dtype.

Pandas DataFrames, however, support columns of different dtypes and by design store data in column-major order. (As I remember, this is because Wes McKinney said that most of his data analysis happened along columns, though I cannot find the reference. That said, I think the following article, in which he explains his reasons for not choosing numpy structured arrays, is interesting: Wes McKinney: A Roadmap for Rich Scientific Data Structures in Python.)

In addition, the choice of column-major ordering permits size-mutability: columns can be added without copying the data. This fits well with the ctable column-store philosophy. With numpy structured arrays, the entire array has to be copied (almost entry by entry, as far as I can see) to add a new column.
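To illustrate the size-mutability point (a minimal sketch, not bcolz code): appending a column to a DataFrame simply registers a new block alongside the existing ones, whereas "adding a field" to a structured array means allocating a new array and copying every existing field:

```python
import numpy as np
import pandas as pd

M = 1000
df = pd.DataFrame({'a': np.zeros(M), 'b': np.ones(M)})

# DataFrame: a new column is attached as a new block;
# the existing columns are not rewritten at assignment time
df['c'] = np.arange(M, dtype='f8')

# structured array: "adding a field" requires a fresh allocation
# plus a field-by-field copy of all existing data
arr = np.zeros(M, dtype=[('a', 'f8'), ('b', 'f8')])
new_dt = np.dtype([('a', 'f8'), ('b', 'f8'), ('c', 'f8')])
arr2 = np.empty(M, dtype=new_dt)
for name in arr.dtype.names:       # copy old fields one by one
    arr2[name] = arr[name]
arr2['c'] = np.arange(M, dtype='f8')
```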

Due to these advantages, column-major ordering is used by many well-known dedicated number-crunching environments, among others Fortran, MATLAB and R.

Note: Making a pandas DataFrame from an already existing numpy structured array returned by ctable is not an option either: again, effectively element-by-element copying of the data is required.
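This can be checked directly (a small demonstration, assuming current pandas behaviour): the DataFrame built from a structured array does not share memory with it, so the data has effectively been copied field by field:

```python
import numpy as np
import pandas as pd

arr = np.zeros(5, dtype=[('f0', 'f4'), ('f1', 'f4')])
df = pd.DataFrame(arr)      # fields are extracted and copied into blocks

arr['f0'][0] = 99.0         # mutate the source after construction...
print(df['f0'].iloc[0])     # ...the DataFrame still holds the old value (0.0)
```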

Evidence of performance improvements with Pandas

Instantiation of pandas DataFrames is admittedly an issue with smaller datasets. Leaving this issue aside for the moment, one can see that assignment to an (already instantiated) DataFrame is much faster:

```python
In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: M,N=int(1e7),10

In [4]: A1=np.zeros((M,N),'f', 'F')

In [5]: dt=np.dtype(','.join(['f' for _ in range(N)]))

In [6]: A2=np.zeros((M,),dtype=dt)

In [7]: X=np.arange(M+0.0)

# make a pandas DataFrame
In [8]: df_A1 = pd.DataFrame(index=xrange(M), columns=range(N), dtype='f')

### simulate ctable __getitem__() operation:
# assignment to a column-major numpy array column by column
In [9]: %timeit for n in range(N): A1[:,n]=X
1 loops, best of 3: 369 ms per loop

# assignment to a structured numpy array field by field
In [10]: %timeit for n in dt.names: A2[n]=X
1 loops, best of 3: 2.36 s per loop

# (optimized) assignment to a pandas DataFrame column by column
In [8]: %timeit for n in range(N): df_A1._data.blocks[0].values[n,:]=X
1 loops, best of 3: 364 ms per loop
```

Instantiation of the DataFrame will probably be an issue. My gut reaction is that it should be possible to instantiate (and cache) an empty DataFrame with the correct structure, and then, for each __getitem__() call, shallow-copy this template and assign the new data arrays. This should help with the instantiation overhead, since the column makeup of a ctable instance usually does not change often during a program run.
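A rough sketch of that caching idea (the names TemplateCache and getitem_as_df are purely illustrative, not bcolz API): allocate one writable buffer per column once, let each call refill the buffers, and rebuild the DataFrame around them:

```python
import numpy as np
import pandas as pd

class TemplateCache:
    """Cache one writable buffer per column (hypothetical helper)."""
    def __init__(self, names, dtypes, nrows):
        # these buffers would be the decompression targets for each chunk
        self._cols = {n: np.empty(nrows, dtype=dt)
                      for n, dt in zip(names, dtypes)}

    def getitem_as_df(self, data):
        # simulate "decompressing" new values into the cached buffers
        for name, buf in self._cols.items():
            buf[:] = data[name]
        # wrap the cached buffers; pandas may still copy while
        # consolidating same-dtype columns into blocks
        return pd.DataFrame(self._cols, copy=False)

cache = TemplateCache(['a', 'b'], ['f4', 'f4'], nrows=4)
df = cache.getitem_as_df({'a': np.arange(4), 'b': np.ones(4)})
```

Whether the final construction is truly copy-free depends on pandas' block consolidation; even so, it avoids re-allocating the column buffers on every call.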

Down the road

In the long run it might be worth getting rid of the memory copies altogether for the DataFrame out_flavor and decompressing the chunks directly into the arrays backing the DataFrame. This would likely lead to further performance improvements (though smaller ones, as the remaining memory copies would already be efficient en-bloc copies).
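As a rough sketch of what that could look like (assuming pandas keeps the buffer as a view, which it does for a homogeneous-dtype 2-D array with copy=False): allocate the backing array first, wrap it in a DataFrame, and write the "decompressed" chunks straight into it:

```python
import numpy as np
import pandas as pd

M, N = 1000, 4
# Fortran order: each column is contiguous, a natural decompression target
buf = np.zeros((M, N), dtype='f4', order='F')
df = pd.DataFrame(buf, copy=False)  # wraps the buffer without copying

# stand-in for "decompress a chunk directly into the backing array"
buf[:, 0] = np.arange(M, dtype='f4')

print(df[0].iloc[5])  # the DataFrame sees the in-place write: 5.0
```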
