A Python package to compress pandas DataFrames, akin to Stata's compress command. It may prove particularly helpful to those dealing with large datasets.
You can install df-compress by running the following command:
pip install df_compress
After installing the package, use the following import:
from df_compress import compress
A reproducible example of df-compress usage follows:
from df_compress import compress
import pandas as pd
import numpy as np

size = 1_000_000
df = pd.DataFrame(columns=["Year", "State", "Value", "Int_value"])
df["Year"] = np.random.randint(low=2000, high=2023, size=size).astype(str)  # years stored as strings
df["State"] = np.random.choice(["RJ", "SP", "ES", "MT"], size=size)         # low-cardinality strings
df["Value"] = np.random.rand(size)                                          # floats in [0, 1)
df["Int_value"] = df["Value"] * 10 // 1                                     # integer-valued floats
compress(df, show_conversions=True, parallel=False)  # modifies the original DataFrame in place, no reassignment needed
This will print the conversions performed and the memory saved:
Initial memory usage: 114.44 MB
Final memory usage: 7.63 MB
Memory reduced by: 106.81 MB (93.3%)
Variable type conversions:
column from to memory saved (MB)
Year object int16 48.637264
State object category 47.683231
Value float64 float32 3.814571
Int_value float64 int8 6.675594
The function has four optional parameters (arguments):

- convert_strings (bool): whether to attempt to parse object columns as numbers. Defaults to True.
- numeric_threshold (float): the proportion of valid numeric entries required to convert a string column to numeric. Defaults to 0.999.
- show_conversions (bool): whether to report the changes made column by column. Defaults to False.
- parallel (bool): whether to compress the columns in parallel. Defaults to False.
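As a quick illustration of how these arguments fit together, here is a small, hypothetical sketch. The data and column names are made up, and it assumes the conversion behaves as the numeric_threshold description above suggests:

import numpy as np
import pandas as pd
from df_compress import compress

n = 100_000
df = pd.DataFrame({
    "amount": np.random.randint(0, 100, size=n).astype(str),    # numbers stored as strings
    "state": np.random.choice(["RJ", "SP", "ES", "MT"], size=n),
})
df.loc[:20, "amount"] = "n/a"  # a handful of invalid entries (~0.02%), so the column is still above the 0.999 threshold

# Try to parse object columns as numbers, converting a column only if at least
# 99.9% of its entries are valid numbers, and report every conversion made.
compress(df, convert_strings=True, numeric_threshold=0.999, show_conversions=True, parallel=False)
print(df.dtypes)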
The parallelization is implemented using Dask with a local client, and the work is split by column. Thus, opting for parallel compression does not guarantee performance improvements and should be a conscious decision taken on a case-by-case basis. To prove this point, the example above runs significantly slower when opting for parallel compression (a 0.29x "speedup").
As far as I can tell, the reason parallelization does not guarantee efficiency gains is overhead: before the work can run in parallel, it has to be scheduled and distributed, which takes time. If the gains from parallelizing the operation do not cover that overhead, you incur a net slowdown. Therefore, my recommendation is to opt for parallel compression only when you have a DataFrame with many columns.
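For intuition, here is a minimal sketch of the column-wise Dask pattern described above. This is not df_compress's actual implementation, just an illustration of how a local client plus one task per column introduces scheduling overhead that small DataFrames cannot amortize (downcast_column is a toy stand-in for the real per-column compression logic):

import numpy as np
import pandas as pd
from dask import delayed, compute
from dask.distributed import Client

def downcast_column(s: pd.Series) -> pd.Series:
    # Toy per-column "compression": downcast numeric dtypes, leave everything else alone.
    if pd.api.types.is_float_dtype(s):
        return pd.to_numeric(s, downcast="float")
    if pd.api.types.is_integer_dtype(s):
        return pd.to_numeric(s, downcast="integer")
    return s

if __name__ == "__main__":
    client = Client()  # local cluster; starting it is a fixed cost of this approach
    df = pd.DataFrame({f"num_{i}": np.random.randn(100_000) for i in range(8)})

    # One delayed task per column: the work is split across columns, so with few
    # or cheap columns the scheduling overhead dominates the actual compression.
    tasks = [delayed(downcast_column)(df[col]) for col in df.columns]
    compressed = pd.concat(compute(*tasks), axis=1)
    print(compressed.dtypes)
    client.close()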
Below is a quick benchmark, run on a 12-CPU machine, to give you some perspective on when to use parallel compression:
import pandas as pd
from df_compress import compress
import sys, os
import numpy as np
from time import time
class HiddenPrints:
    """Context manager that silences compress()'s console output during timing."""
    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')
    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout

def timereps(reps, func):
    """Return the average runtime of func over reps repetitions."""
    start = time()
    for i in range(reps):
        func()
    end = time()
    return (end - start) / reps

def benchmark_compression(df):
    print("Running benchmark on DataFrame with shape:", df.shape, "\n")

    print("Testing non-parallel compression...")
    with HiddenPrints():
        time_non_parallel = timereps(10, lambda: compress(df.copy(deep=True), parallel=False, show_conversions=False))
    print(f"Non-parallel time: {time_non_parallel:.2f} seconds\n")

    print("Testing parallel compression...")
    with HiddenPrints():
        time_parallel = timereps(10, lambda: compress(df.copy(deep=True), parallel=True, show_conversions=False))
    print(f"Parallel time: {time_parallel:.2f} seconds\n")

    speedup = time_non_parallel / time_parallel if time_parallel > 0 else float('inf')
    print(f"Parallel speedup: {speedup:.2f}x")

def generate_test_dataframe(n_rows=1_000_000, n_object_cols=10, n_numeric_cols=10):
    """Build a synthetic DataFrame with low-cardinality string columns and float columns."""
    data = {}
    for i in range(n_object_cols):
        data[f"obj_{i}"] = np.random.choice(['A', 'B', 'C', 'D', 'E'], size=n_rows)
    for i in range(n_numeric_cols):
        data[f"num_{i}"] = np.random.randn(n_rows)
    return pd.DataFrame(data)
When testing on a 40-column DataFrame (benchmark_compression(generate_test_dataframe(n_object_cols=20, n_numeric_cols=20))), I find that:
Running benchmark on DataFrame with shape: (1000000, 40)
Testing non-parallel compression...
Non-parallel time: 17.60 seconds
Testing parallel compression...
Parallel time: 12.06 seconds
Parallel speedup: 1.46x
That said, a known issue is that parallel compression breaks down on very large datasets (for example, 56M rows and 50+ columns). Addressing this is a top priority on the to-do list.