This repository was archived by the owner on Dec 11, 2023. It is now read-only.
In this issue I want to make the case for extending the effect of out_flavor to __getitem__() (and related functions) and for introducing a pandas out_flavor. While I appreciate the rationale for limiting bcolz to the numpy data model, I believe the possible performance improvements with pandas merit consideration.
I would be very interested to know whether this has any chance of inclusion in bcolz, since the effort for a clean implementation would be non-trivial.
Executive Summary
- bcolz is a columnar (albeit chunked) data container,
- thus after decompression of each chunk, the data for each column is (chunk-)contiguous,
- currently, however, this column-contiguous data is forced into a row-major memory structure (a numpy structured array), which has two consequences:
  - the contiguous column data has to be copied in row-sized pieces, which is much slower than en-bloc copying of contiguous data,
  - the data has to be copied at all (rather than being decompressed directly into its final place).
- Moving to a column-major memory layout (e.g. a pandas DataFrame) would:
  - make assignment of data to columns (a very common operation with ctable) much faster,
  - open the possibility of avoiding repeated copy operations in the first place,
  - let ctable really leverage multi-core machines; currently the numpy structured array is a bottleneck for __getitem__() and the like: Numpy bottleneck with ctable #174
  - improve downstream performance of data analysis, since much data analysis is carried out along columns rather than along rows.
- Due to the similar data access model of numpy structured arrays and pandas DataFrames, most code could probably remain unchanged once the output object is initialised as either numpy or pandas.
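To illustrate the last point, here is a minimal sketch (the column names and data are made up) showing that field access on a structured array and column access on a DataFrame are spelled identically, so downstream code need not care which container it received:

```python
import numpy as np
import pandas as pd

# Build the same three-column table as a structured array and as a DataFrame.
dt = np.dtype([('a', 'f4'), ('b', 'f4'), ('c', 'f4')])
arr = np.zeros(5, dtype=dt)
arr['a'] = np.arange(5)

df = pd.DataFrame({name: arr[name] for name in dt.names})

# Column access has the same spelling for both containers,
# so code like this works unchanged with either out_flavor:
def column_sum(table, name):
    return table[name].sum()

print(column_sum(arr, 'a'), column_sum(df, 'a'))  # both print 10.0
```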
Evidence of column assignment bottleneck
Note: for the sake of simplicity, when I use the term "column" in relation to numpy structured arrays, I refer to the fields of the (single column) structured array that bcolz uses to store its output column.
Numpy structured arrays (as used by bcolz) are inherently row-major. It is impossible to change this as far as I can see. This means that column assignment is fairly slow:
In [1]: import numpy as np
In [2]: M,N=int(1e7),10
# make a row-major numpy array
In [4]: A1=np.zeros((M,N),'f')
In [9]: dt=np.dtype(','.join(['f' for _ in range(N)]))
# make a structured numpy array
In [10]: A2=np.zeros((M,),dtype=dt)
In [11]: X=np.arange(M+0.0)
### simulate ctable __getitem__() operation:
# assignment to a row-major numpy array column for column
In [13]: %timeit for n in range(N):A1[:,n]=X
1 loops, best of 3: 2.36 s per loop
# assignment of structured array field for field
In [15]: %timeit for n in dt.names: A2[n]=X
1 loops, best of 3: 2.36 s per loop
Significant speedup (factor 6.5!) can be achieved by moving to a column-major memory layout for the numpy array:
In [1]: import numpy as np
In [2]: M,N=int(1e7),10
# make a column-major numpy array
In [3]: A1=np.zeros((M,N),'f', 'F')
In [4]: dt=np.dtype(','.join(['f' for _ in range(N)]))
# make a structured numpy array
In [5]: A2=np.zeros((M,),dtype=dt)
In [6]: X=np.arange(M+0.0)
### simulate ctable __getitem__() operation:
# assignment to a column-major numpy array column for column
In [8]: %timeit for n in range(N):A1[:,n]=X
1 loops, best of 3: 374 ms per loop
# assignment of structured array field for field
In [9]: %timeit for n in dt.names: A2[n]=X
1 loops, best of 3: 2.43 s per loop
Why Pandas DataFrames would help
While moving to column-major numpy arrays would be the ideal solution, this is obviously not an option: a plain ndarray requires a homogeneous dtype for the entire array.
A pandas DataFrame, however, supports columns of different dtypes and by design stores data in column-major order. (As I remember, this is because Wes McKinney said that most of his data analysis happened along columns, though I cannot find the reference. That said, I think the following article, in which he explains his reasons for not choosing numpy structured arrays, is interesting: Wes McKinney: A Roadmap for Rich Scientific Data Structures in Python)
In addition, the choice of column-major ordering permits size-mutability: columns can be added without copying the data. This fits well with the ctable column-store philosophy. With numpy structured arrays, the entire array has to be copied (almost entry by entry, as far as I can see) to add a new column.
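A minimal sketch of the contrast (sizes and names are made up; exactly when pandas touches existing buffers on column insertion depends on its internals, so this only illustrates the API-level difference):

```python
import numpy as np
import pandas as pd

M = 1_000
df = pd.DataFrame({'a': np.zeros(M, 'f4')})

# Adding a column to a DataFrame is a single assignment; the existing
# column's buffer need not be rebuilt by the caller:
df['b'] = np.ones(M, 'f4')

# Adding a "column" (field) to a structured array forces a full copy:
arr = np.zeros(M, dtype=[('a', 'f4')])
new = np.zeros(M, dtype=[('a', 'f4'), ('b', 'f4')])
for name in arr.dtype.names:     # copy every existing field ...
    new[name] = arr[name]
new['b'] = np.ones(M, 'f4')      # ... before the new field can be filled
```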
Due to these advantages, column-major ordering is used by many well-known dedicated number-crunching environments, among others Fortran, MATLAB and R.
Note: Making a pandas DataFrame from an already existing numpy structured array returned by ctable is not an option either: again, effectively element-by-element copying of the data is required.
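This is easy to verify: a field of a structured array is only a strided view into the row-major buffer, so building a contiguous pandas column from it necessarily copies one element per row. A small check (the f0…fN field names are numpy's defaults for comma-separated dtypes):

```python
import numpy as np

M, N = 1_000, 4
dt = np.dtype(','.join(['f4'] * N))   # four f4 fields -> 16-byte rows
arr = np.zeros(M, dtype=dt)

# A field of a row-major structured array is a strided view:
col = arr['f0']
print(col.strides)                    # (16,): elements sit one full row apart
print(col.flags['C_CONTIGUOUS'])      # False: a contiguous copy must be made
```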
Evidence of performance improvements with Pandas
Instantiation of pandas DataFrames is admittedly an issue with smaller datasets. Leaving this aside for the moment, one can see that assignment to an (already instantiated) DataFrame is much faster:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: M,N=int(1e7),10
In [4]: A1=np.zeros((M,N),'f', 'F')
In [5]: dt=np.dtype(','.join(['f' for _ in range(N)]))
In [6]: A2=np.zeros((M,),dtype=dt)
In [7]: X=np.arange(M+0.0)
# make a pandas DataFrame
In [8]: df_A1 = pd.DataFrame(index=range(M), columns=range(N), dtype='f')
### simulate ctable __getitem__() operation:
# assignment to a column-major numpy array column for column
In [9]: %timeit for n in range(N): A1[:,n]=X
1 loops, best of 3: 369 ms per loop
# assignment to a structured numpy array field for field
In [10]: %timeit for n in dt.names: A2[n]=X
1 loops, best of 3: 2.36 s per loop
# (optimized) assignment to a pandas DataFrame column for column
In [8]: %timeit for n in range(N): df_A1._data.blocks[0].values[n,:]=X
1 loops, best of 3: 364 ms per loop
Instantiation of the DataFrame will probably be an issue. My gut reaction is that it should be possible to instantiate (and cache) an empty DataFrame with the correct structure and then for each __getitem__() call shallow-copy this template and assign new data arrays. This should help with the instantiation overhead, since the column makeup of a ctable instance usually does not change too often during a program.
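A rough sketch of that idea, with entirely hypothetical names: the column makeup is derived once and cached, and each __getitem__() call then only wires freshly decompressed column arrays into a new DataFrame (decompress_column stands in for bcolz's real per-column decompression):

```python
import numpy as np
import pandas as pd

# Cached once, since the column makeup of a ctable rarely changes
# during a program run (names are made up for illustration):
_template_columns = ['a', 'b']

def decompress_column(name, n):
    # Placeholder for bcolz decompressing one column's chunks.
    return np.arange(n, dtype='f4')

def getitem_as_dataframe(n):
    # Only the cheap dict-of-arrays assembly happens per call;
    # the structure itself is not re-derived.
    data = {name: decompress_column(name, n) for name in _template_columns}
    return pd.DataFrame(data, copy=False)

df = getitem_as_dataframe(8)
```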
Down the road
In the long run it might be worth getting rid of the memory copies altogether for the DataFrame out_flavor and decompressing the chunks directly into the arrays backing the DataFrame. This would likely lead to further performance improvements (though smaller ones, as the remaining memory copies would already be efficient en-bloc copies).
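The shape of that zero-extra-copy path might look as follows (decompress_chunk_into is hypothetical; a real version would have blosc write straight into the given buffer). Because the destination is column-major, each chunk's slice of a column is contiguous memory:

```python
import numpy as np

M, N, chunklen = 1_000, 3, 250
# Column-major backing array that the DataFrame would wrap:
dest = np.zeros((M, N), dtype='f4', order='F')

def decompress_chunk_into(out):
    # Stand-in for a real blosc decompress-into-buffer call.
    out[:] = 1.0

for col in range(N):
    for start in range(0, M, chunklen):
        # Each chunk lands directly in its final, contiguous location,
        # so no intermediate row-major copy is needed.
        decompress_chunk_into(dest[start:start + chunklen, col])
```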