This repository was archived by the owner on Dec 11, 2023. It is now read-only.
In this issue I want to make the case for extending the effect of out_flavor to __getitem__() (and related functions) and for introducing a pandas out_flavor. While I appreciate the rationale for limiting bcolz to the numpy data model, I believe the possible performance improvements with pandas merit consideration.
I would be very interested to know whether this has any chance of inclusion in bcolz, since the effort for a clean implementation would be non-trivial.
Executive Summary
- bcolz is a columnar (albeit chunked) data container,
- thus after decompression of each chunk, the data for each column is (chunk-)contiguous,
- currently, however, this column-contiguous data is forced into a row-major memory structure (a numpy structured array), which has two consequences:
  - the contiguous column data has to be copied in row-sized pieces, which is much slower than en-bloc copying of contiguous data,
  - the data has to be copied at all (rather than being decompressed directly into its final place).
- Moving to a column-major memory layout (e.g. a pandas DataFrame) would:
  - make assignment of data to columns (a very common operation with ctable) much faster,
  - open the possibility of avoiding repeated copy operations in the first place,
  - let ctable really leverage multi-core machines; currently the numpy structured array is a bottleneck for __getitem__() and the like: Numpy bottleneck with ctable #174
  - improve downstream performance of data analysis, since much data analysis is carried out along columns rather than along rows.
- Due to the similar data access model of numpy structured arrays and pandas DataFrames, most code could probably remain unchanged once the output object is initialised as either numpy or pandas.
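To illustrate the last point, here is a minimal sketch (the column names and data are made up) showing that field access on a structured array and column access on a DataFrame are spelled identically, so downstream code need not care which container it received:

```python
import numpy as np
import pandas as pd

# Build the same three-column table as a structured array and as a DataFrame.
dt = np.dtype([('a', 'f4'), ('b', 'f4'), ('c', 'f4')])
arr = np.zeros(5, dtype=dt)
arr['a'] = np.arange(5)

df = pd.DataFrame({name: arr[name] for name in dt.names})

# Column access has the same spelling for both containers,
# so code like this works unchanged with either out_flavor:
def column_sum(table, name):
    return table[name].sum()

print(column_sum(arr, 'a'), column_sum(df, 'a'))  # both print 10.0
```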
Evidence of column assignment bottleneck
Note: for the sake of simplicity, when I use the term "column" in relation to numpy structured arrays, I refer to the fields of the (single column) structured array that bcolz uses to store its output column.
Numpy structured arrays (as used by bcolz) are inherently row-major. It is impossible to change this as far as I can see. This means that column assignment is fairly slow:
In [1]: import numpy as np
In [2]: M,N=int(1e7),10
# make a row-major numpy array
In [4]: A1=np.zeros((M,N),'f')
In [9]: dt=np.dtype(','.join(['f' for _ in range(N)]))
# make a structured numpy array
In [10]: A2=np.zeros((M,),dtype=dt)
In [11]: X=np.arange(M+0.0)
### simulate ctable __getitem__() operation:
# assignment to a row-major numpy array column for column
In [13]: %timeit for n in range(N):A1[:,n]=X
1 loops, best of 3: 2.36 s per loop
# assignment of structured array field for field
In [15]: %timeit for n in dt.names: A2[n]=X
1 loops, best of 3: 2.36 s per loop
Significant speedup (factor 6.5!) can be achieved by moving to a column-major memory layout for the numpy array:
In [1]: import numpy as np
In [2]: M,N=int(1e7),10
# make a column-major numpy array
In [3]: A1=np.zeros((M,N),'f', 'F')
In [4]: dt=np.dtype(','.join(['f' for _ in range(N)]))
# make a structured numpy array
In [5]: A2=np.zeros((M,),dtype=dt)
In [6]: X=np.arange(M+0.0)
### simulate ctable __getitem__() operation:
# assignment to a column-major numpy array column for column
In [8]: %timeit for n in range(N):A1[:,n]=X
1 loops, best of 3: 374 ms per loop
# assignment of structured array field for field
In [9]: %timeit for n in dt.names: A2[n]=X
1 loops, best of 3: 2.43 s per loop
Why Pandas DataFrames would help
While moving to column-major numpy arrays would be the ideal solution, this is obviously not an option: a plain ndarray requires a homogeneous dtype for the entire array.
A pandas DataFrame, however, supports columns of different dtypes and by design stores data in column-major order. (As I remember, this is because Wes McKinney said that most of his data analysis happened along columns, though I cannot find the reference. That said, I think the following article, in which he explains his reasons for not choosing numpy structured arrays, is interesting: Wes McKinney: A Roadmap for Rich Scientific Data Structures in Python)
In addition, the choice of column-major ordering permits size-mutability: columns can be added without copying the data. This fits well with the ctable column-store philosophy. With numpy structured arrays, the entire array has to be copied (almost entry by entry, as far as I can see) to add a new column.
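A minimal sketch of the contrast (sizes and names are made up; exactly when pandas touches existing buffers on column insertion depends on its internals, so this only illustrates the API-level difference):

```python
import numpy as np
import pandas as pd

M = 1_000
df = pd.DataFrame({'a': np.zeros(M, 'f4')})

# Adding a column to a DataFrame is a single assignment; the existing
# column's buffer need not be rebuilt by the caller:
df['b'] = np.ones(M, 'f4')

# Adding a "column" (field) to a structured array forces a full copy:
arr = np.zeros(M, dtype=[('a', 'f4')])
new = np.zeros(M, dtype=[('a', 'f4'), ('b', 'f4')])
for name in arr.dtype.names:     # copy every existing field ...
    new[name] = arr[name]
new['b'] = np.ones(M, 'f4')      # ... before the new field can be filled
```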
Due to these advantages, column-major ordering is used by many well-known dedicated number-crunching environments, among others Fortran, MATLAB and R.
Note: Making a pandas DataFrame from an already existing numpy structured array returned by ctable is not an option either: again, effectively element-by-element copying of the data is required.
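This is easy to verify: a field of a structured array is only a strided view into the row-major buffer, so building a contiguous pandas column from it necessarily copies one element per row. A small check (the f0…fN field names are numpy's defaults for comma-separated dtypes):

```python
import numpy as np

M, N = 1_000, 4
dt = np.dtype(','.join(['f4'] * N))   # four f4 fields -> 16-byte rows
arr = np.zeros(M, dtype=dt)

# A field of a row-major structured array is a strided view:
col = arr['f0']
print(col.strides)                    # (16,): elements sit one full row apart
print(col.flags['C_CONTIGUOUS'])      # False: a contiguous copy must be made
```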
Evidence of performance improvements with Pandas
Instantiation of pandas DataFrames is admittedly an issue with smaller datasets. Leaving this aside for the moment, one can see that assignment to an (already instantiated) DataFrame is much faster:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: M,N=int(1e7),10
In [4]: A1=np.zeros((M,N),'f', 'F')
In [5]: dt=np.dtype(','.join(['f' for _ in range(N)]))
In [6]: A2=np.zeros((M,),dtype=dt)
In [7]: X=np.arange(M+0.0)
# make a pandas DataFrame
In [8]: df_A1 = pd.DataFrame(index=range(M), columns=range(N), dtype='f')
### simulate ctable __getitem__() operation:
# assignment to a column-major numpy array column for column
In [9]: %timeit for n in range(N): A1[:,n]=X
1 loops, best of 3: 369 ms per loop
# assignment to a structured numpy array field for field
In [10]: %timeit for n in dt.names: A2[n]=X
1 loops, best of 3: 2.36 s per loop
# (optimized) assignment to a pandas DataFrame column for column
In [8]: %timeit for n in range(N): df_A1._data.blocks[0].values[n,:]=X
1 loops, best of 3: 364 ms per loop
Instantiation of the DataFrame will probably be an issue. My gut reaction is that it should be possible to instantiate (and cache) an empty DataFrame with the correct structure and then for each __getitem__() call shallow-copy this template and assign new data arrays. This should help with the instantiation overhead, since the column makeup of a ctable instance usually does not change too often during a program.
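A rough sketch of that idea, with entirely hypothetical names: the column makeup is derived once and cached, and each __getitem__() call then only wires freshly decompressed column arrays into a new DataFrame (decompress_column stands in for bcolz's real per-column decompression):

```python
import numpy as np
import pandas as pd

# Cached once, since the column makeup of a ctable rarely changes
# during a program run (names are made up for illustration):
_template_columns = ['a', 'b']

def decompress_column(name, n):
    # Placeholder for bcolz decompressing one column's chunks.
    return np.arange(n, dtype='f4')

def getitem_as_dataframe(n):
    # Only the cheap dict-of-arrays assembly happens per call;
    # the structure itself is not re-derived.
    data = {name: decompress_column(name, n) for name in _template_columns}
    return pd.DataFrame(data, copy=False)

df = getitem_as_dataframe(8)
```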
Down the road
In the long run it might be worth getting rid of the memory copies altogether for the DataFrame out_flavor and decompressing the chunks directly into the arrays backing the DataFrame. This would likely lead to further performance improvements (though smaller ones, as the remaining memory copies would already be efficient en-bloc copies).
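The shape of that zero-extra-copy path might look as follows (decompress_chunk_into is hypothetical; a real version would have blosc write straight into the given buffer). Because the destination is column-major, each chunk's slice of a column is contiguous memory:

```python
import numpy as np

M, N, chunklen = 1_000, 3, 250
# Column-major backing array that the DataFrame would wrap:
dest = np.zeros((M, N), dtype='f4', order='F')

def decompress_chunk_into(out):
    # Stand-in for a real blosc decompress-into-buffer call.
    out[:] = 1.0

for col in range(N):
    for start in range(0, M, chunklen):
        # Each chunk lands directly in its final, contiguous location,
        # so no intermediate row-major copy is needed.
        decompress_chunk_into(dest[start:start + chunklen, col])
```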