Version: 0.2.0
Complete API documentation for PipeFrame - the modern, intuitive data manipulation library for Python.
- Core Classes
- Data Manipulation Verbs
- Aggregation Verbs
- Reshape Operations
- Conditional Functions
- Column Selection Helpers
- I/O Operations
- Utility Functions
The main data structure in PipeFrame, wrapping pandas DataFrame with enhanced functionality.
DataFrame(data=None, index=None, columns=None, **kwargs)
Parameters:
data: dict, list, ndarray, DataFrame, or Series - Data to construct DataFrame from
index: Index or array-like - Index to use for resulting frame
columns: Index or array-like - Column labels
**kwargs: Additional arguments passed to pandas DataFrame
Returns:
DataFrame: PipeFrame DataFrame object
Examples:
from pipeframe import DataFrame
# From dictionary
df = DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'salary': [50000, 60000, 70000]
})
# From pandas DataFrame
import pandas as pd
pdf = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
df = DataFrame(pdf)
# From list of dictionaries
df = DataFrame([
{'name': 'Alice', 'age': 25},
{'name': 'Bob', 'age': 30}
])
Chain operations using the >> operator.
df >> operation
Parameters:
operation: Callable - Function to apply to the DataFrame
Returns:
DataFrame or GroupBy: Result of the operation
Examples:
result = (df
>> filter('age > 25')
>> define(bonus='salary * 0.1')
>> select('name', 'bonus')
)
shape: tuple - Dimensions of the DataFrame (rows, columns)
columns: Index - Column labels
index: Index - Row labels
dtypes: Series - Data types of each column
empty: bool - Whether DataFrame is empty
size: int - Number of elements
head(n=5)
- Return first n rows
- Returns: DataFrame
tail(n=5)
- Return last n rows
- Returns: DataFrame
copy()
- Make a deep copy
- Returns: DataFrame
to_pandas()
- Convert to pandas DataFrame
- Returns: pd.DataFrame
I/O Methods:
to_csv(path, **kwargs) - Write to CSV file
to_excel(path, **kwargs) - Write to Excel file
to_json(path, **kwargs) - Write to JSON file
to_parquet(path, **kwargs) - Write to Parquet file
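The >> chaining described above relies on Python's operator overloading. The following is a minimal sketch, assuming the pipe is built on `__rshift__`; the `MiniFrame` class and `keep_rows` verb are hypothetical stand-ins on plain row dicts, not PipeFrame's actual code.

```python
class MiniFrame:
    """Hypothetical stand-in for a pipeable frame: wraps a list of row dicts."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __rshift__(self, operation):
        # `frame >> operation` calls the operation with the frame as its argument.
        return operation(self)


def keep_rows(predicate):
    """Hypothetical verb: returns a callable, like the verbs in this reference."""
    def apply(frame):
        return MiniFrame(row for row in frame.rows if predicate(row))
    return apply


people = MiniFrame([
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": 30},
    {"name": "Charlie", "age": 35},
])
older = people >> keep_rows(lambda row: row["age"] > 25)
names = [row["name"] for row in older.rows]
```

Because each verb returns a callable that yields a new frame, several steps can be chained left to right without nesting.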
One-dimensional labeled array, similar to pandas Series.
Series(data=None, index=None, name=None, **kwargs)
Parameters:
data: array-like, dict, or scalar - Data for the Series
index: array-like or Index - Index labels
name: str - Name for the Series
**kwargs: Additional arguments
Examples:
from pipeframe import Series
# From list
s = Series([1, 2, 3, 4, 5])
# From dict
s = Series({'a': 1, 'b': 2, 'c': 3})
# With index and name
s = Series([10, 20, 30], index=['x', 'y', 'z'], name='values')
to_frame(name=None)
- Convert Series to DataFrame
- Parameters: name (str) - Column name for resulting DataFrame
- Returns: DataFrame
to_pandas()
- Convert to pandas Series
- Returns: pd.Series
Grouped data structure for split-apply-combine operations.
summarize(**aggregations)
- Aggregate grouped data
- See summarize() for details
ungroup()
- Remove grouping
- Returns: DataFrame
head(n=5)
- First n rows of each group
- Returns: DataFrame
tail(n=5)
- Last n rows of each group
- Returns: DataFrame
Filter rows based on conditions.
filter(condition: str) -> Callable
Parameters:
condition: str - Boolean expression to filter rows
Returns:
Callable: Function that filters a DataFrame
Examples:
# Single condition
df >> filter('age > 30')
# Multiple conditions with &
df >> filter('age > 30 & salary > 50000')
# Using OR
df >> filter('(age < 25) | (age > 60)')
# String matching
df >> filter('name.str.contains("Alice")')
# Not null
df >> filter('email.notna()')
Notes:
- Uses pandas eval() for safe expression evaluation
- Supports all pandas comparison operators: >, <, >=, <=, ==, !=
- Use & for AND, | for OR, ~ for NOT
- Parentheses group conditions
Create or modify columns.
define(**columns) -> Callable
Parameters:
**columns: str or Callable - Column definitions as keyword arguments
- Values can be string expressions or callable functions
Returns:
Callable: Function that adds/modifies columns in a DataFrame
Examples:
# Simple arithmetic
df >> define(total='price * quantity')
# Multiple columns
df >> define(
revenue='price * quantity',
profit='revenue - cost',
margin='profit / revenue * 100'
)
# Using pandas functions
df >> define(
month='pd.to_datetime(date).dt.month',
year='pd.to_datetime(date).dt.year'
)
# String operations
df >> define(
upper_name='name.str.upper()',
email='name.str.lower() + "@example.com"'
)
# Callable functions
df >> define(
category=lambda df: df['value'].apply(lambda x: 'high' if x > 100 else 'low')
)
# Conditional with if_else
df >> define(
status=if_else('age >= 18', 'adult', 'minor')
)
Supported in Expressions:
- Pandas module functions: pd.to_datetime(), pd.cut(), etc.
- NumPy functions: np.sqrt(), np.log(), etc.
- DataFrame column operations
- String methods: .str.upper(), .str.contains(), etc.
- DateTime methods: .dt.year, .dt.month, etc.
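The multi-column example for define() builds profit from revenue within the same call, which suggests that definitions are applied in keyword order. The plain-Python sketch below illustrates that behavior under that assumption; the `define_rows` helper is hypothetical and works on row dicts with callables rather than on a PipeFrame DataFrame with string expressions.

```python
def define_rows(rows, **columns):
    """Hypothetical helper: add columns to each row dict, one definition at a time."""
    result = []
    for row in rows:
        new_row = dict(row)
        # Keyword arguments keep their call order, so later definitions
        # can read columns created by earlier ones.
        for name, compute in columns.items():
            new_row[name] = compute(new_row)
        result.append(new_row)
    return result


orders = [{"price": 10.0, "quantity": 3, "cost": 12.0}]
enriched = define_rows(
    orders,
    revenue=lambda r: r["price"] * r["quantity"],
    profit=lambda r: r["revenue"] - r["cost"],
    margin=lambda r: r["profit"] / r["revenue"] * 100,
)
```

Reordering the keywords so that profit precedes revenue would raise a KeyError in this sketch, which is why definition order matters when columns depend on each other.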
Choose columns to keep.
select(*columns) -> Callable
Parameters:
*columns: str - Column names or selection helpers
Returns:
Callable: Function that selects columns from a DataFrame
Examples:
# Basic selection
df >> select('name', 'age', 'salary')
# Using helpers
df >> select(
'id',
starts_with('date_'),
ends_with('_amount'),
contains('price')
)
# Column ranges
df >> select('name:salary') # All columns from 'name' to 'salary'
# Regex patterns
df >> select('id', matches(r'Q\d_sales'))
# Everything except
df >> select('-age', '-salary') # Remove these columns
Selection Helpers:
starts_with(prefix) - Columns starting with prefix
ends_with(suffix) - Columns ending with suffix
contains(pattern) - Columns containing pattern
matches(pattern) - Columns matching regex pattern
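To make the selection rules concrete, here is a rough sketch of how plain names, '-name' exclusions, 'first:last' ranges, and helpers could be resolved against a list of column names. Everything in it is an assumption for illustration: `resolve_columns` is a hypothetical function, and the helpers are modeled as predicates for simplicity, whereas this reference documents them as returning a list.

```python
import re


def resolve_columns(all_columns, *selectors):
    """Hypothetical resolver: turn selectors into an ordered list of column names."""
    selected, excluded = [], set()
    for selector in selectors:
        if callable(selector):            # helper such as starts_with(...)
            matched = [c for c in all_columns if selector(c)]
        elif selector.startswith("-"):    # '-col' drops a column
            excluded.add(selector[1:])
            continue
        elif ":" in selector:             # 'first:last' is an inclusive range
            first, last = selector.split(":")
            start, stop = all_columns.index(first), all_columns.index(last)
            matched = all_columns[start:stop + 1]
        else:
            matched = [selector]
        for column in matched:
            if column not in selected:
                selected.append(column)
    if not selected:                      # only exclusions: start from everything
        selected = list(all_columns)
    return [c for c in selected if c not in excluded]


# Hypothetical helper factories, each returning a predicate on a column name.
def starts_with(prefix):
    return lambda c: c.startswith(prefix)


def ends_with(suffix):
    return lambda c: c.endswith(suffix)


def contains(pattern):
    return lambda c: pattern in c


def matches(pattern):
    return lambda c: re.search(pattern, c) is not None


columns = ["id", "date_start", "date_end", "net_amount", "price_usd", "Q1_sales", "notes"]
picked = resolve_columns(columns, "id", starts_with("date_"), ends_with("_amount"),
                         contains("price"), matches(r"Q\d_sales"))
dropped = resolve_columns(columns, "-notes", "-id")
ranged = resolve_columns(columns, "date_start:net_amount")
```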
Sort rows by column values.
arrange(*columns) -> Callable
Parameters:
*columns: str - Column names to sort by
- Prefix with - for descending order
Returns:
Callable: Function that sorts a DataFrame
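The '-' prefix convention can be illustrated with a small multi-key sort. This is only a sketch on row dicts, assuming the sort keys are applied with the first column as the primary key; `sort_rows` is a hypothetical helper, not part of PipeFrame.

```python
def sort_rows(rows, *columns):
    """Hypothetical helper: sort row dicts by several keys, '-' meaning descending."""
    ordered = list(rows)
    # Python's sort is stable, so sorting by the last key first and the
    # first key last yields a correct multi-key ordering.
    for column in reversed(columns):
        descending = column.startswith("-")
        key = column[1:] if descending else column
        ordered.sort(key=lambda row: row[key], reverse=descending)
    return ordered


staff = [
    {"department": "Sales", "salary": 50000, "name": "Bob"},
    {"department": "Engineering", "salary": 70000, "name": "Alice"},
    {"department": "Sales", "salary": 60000, "name": "Charlie"},
    {"department": "Engineering", "salary": 70000, "name": "Aaron"},
]
ranked = sort_rows(staff, "department", "-salary", "name")
order = [row["name"] for row in ranked]
```

Within Engineering both salaries tie, so the third key (name, ascending) decides the order.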
Examples:
# Single column ascending
df >> arrange('age')
# Descending
df >> arrange('-salary')
# Multiple columns
df >> arrange('department', '-salary', 'name')
# Mixed ascending/descending
df >> arrange('city', '-age', 'name')
Rename columns.
rename(**renames) -> Callable
Parameters:
**renames: str - Mapping of new_name=old_name
Returns:
Callable: Function that renames columns in a DataFrame
Examples:
# Single column
df >> rename(customer_name='name')
# Multiple columns
df >> rename(
emp_id='id',
emp_name='name',
emp_age='age'
)
# Chain with other operations
df >> rename(total_sales='sales') >> filter('total_sales > 1000')
Keep only unique rows.
distinct(*columns, keep='first') -> Callable
Parameters:
*columns: str, optional - Columns to consider for uniqueness
- If empty, considers all columns
keep: {'first', 'last', False}, default 'first' - Which duplicate to keep
Returns:
Callable: Function that removes duplicates from a DataFrame
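The three keep options differ in which of the repeated rows survive. The sketch below shows the intended semantics on row dicts, assuming they mirror pandas' drop_duplicates; `distinct_rows` is a hypothetical helper written for illustration.

```python
from collections import Counter


def distinct_rows(rows, *columns, keep="first"):
    """Hypothetical helper mirroring the keep='first' | 'last' | False options."""
    def key_of(row):
        names = columns or tuple(row)      # no columns given: compare whole rows
        return tuple(row[name] for name in names)

    if keep is False:                      # drop every row whose key repeats
        counts = Counter(key_of(row) for row in rows)
        return [row for row in rows if counts[key_of(row)] == 1]

    seen, kept = set(), []
    source = rows if keep == "first" else reversed(rows)
    for row in source:
        if key_of(row) not in seen:
            seen.add(key_of(row))
            kept.append(row)
    # Restore the original row order when scanning from the end.
    return kept if keep == "first" else kept[::-1]


visits = [
    {"email": "a@example.com", "visit": 1},
    {"email": "b@example.com", "visit": 2},
    {"email": "a@example.com", "visit": 3},
]
first = [r["visit"] for r in distinct_rows(visits, "email")]
last = [r["visit"] for r in distinct_rows(visits, "email", keep="last")]
unique_only = [r["visit"] for r in distinct_rows(visits, "email", keep=False)]
```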
Examples:
# All columns
df >> distinct()
# Specific columns
df >> distinct('customer_id')
# Multiple columns
df >> distinct('product', 'store', 'date')
# Keep last occurrence
df >> distinct('email', keep='last')
# Drop every row whose id appears more than once
df >> distinct('id', keep=False)
Group data by column values.
group_by(*columns) -> Callable
Parameters:
*columns: str - Column names to group by
Returns:
Callable: Function that creates a GroupBy object
Examples:
# Single column
df >> group_by('category')
# Multiple columns
df >> group_by('category', 'region', 'year')
# Combined with summarize
df >> group_by('department') >> summarize(
avg_salary='mean(salary)',
count='count()'
)
Aggregate grouped data.
summarize(**aggregations) -> Callable
Parameters:
**aggregations: str - Aggregation specifications as keyword arguments
- Format: new_col='function(column)'
Returns:
Callable: Function that aggregates a GroupBy or DataFrame
Aggregation Functions:
count() - Count rows
sum(col) - Sum values
mean(col) - Average
median(col) - Median
min(col) - Minimum
max(col) - Maximum
std(col) - Standard deviation
var(col) - Variance
first(col) - First value
last(col) - Last value
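A 'function(column)' specification has to be split into a function name and a column before it can be applied to each group. The sketch below shows one way this could work on row dicts; the regular expression, the `AGGREGATORS` table (a subset of the functions listed above), and `summarize_groups` are all assumptions made for illustration.

```python
import re
from statistics import mean

# Hypothetical subset of the aggregation functions listed above.
AGGREGATORS = {"sum": sum, "mean": mean, "min": min, "max": max}
SPEC = re.compile(r"^(\w+)\((\w*)\)$")


def summarize_groups(rows, by, **aggregations):
    """Hypothetical helper: group row dicts by one column, then aggregate."""
    groups = {}
    for row in rows:
        groups.setdefault(row[by], []).append(row)

    result = []
    for group_key, members in groups.items():
        out = {by: group_key}
        for new_col, spec in aggregations.items():
            func_name, column = SPEC.match(spec).groups()
            if func_name == "count":       # count() takes no column
                out[new_col] = len(members)
            else:
                values = [m[column] for m in members]
                out[new_col] = AGGREGATORS[func_name](values)
        result.append(out)
    return result


sales = [
    {"category": "A", "sales": 100, "price": 10},
    {"category": "B", "sales": 50, "price": 5},
    {"category": "A", "sales": 200, "price": 20},
]
summary = summarize_groups(sales, "category",
                           total="sum(sales)", avg="mean(price)", count="count()")
```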
Examples:
# Basic aggregation
df >> group_by('category') >> summarize(
total='sum(sales)',
avg='mean(price)',
count='count()'
)
# Multiple aggregations
df >> group_by('region', 'product') >> summarize(
total_revenue='sum(revenue)',
total_units='sum(units)',
avg_price='mean(price)',
max_price='max(price)',
min_price='min(price)',
num_transactions='count()'
)
# Without grouping (overall statistics)
df >> summarize(
total_sales='sum(sales)',
avg_price='mean(price)'
)
# With post-aggregation calculations
df >> group_by('category') >> summarize(
total='sum(sales)',
count='count()'
) >> define(
avg_per_item='total / count'
)
Remove grouping from a GroupBy object.
ungroup() -> Callable
Returns:
Callable: Function that converts GroupBy to DataFrame
Examples:
df >> group_by('category') >> ungroup()
# After aggregation
result = (df
>> group_by('dept')
>> summarize(avg='mean(salary)')
>> ungroup()
>> arrange('-avg')
)
Convert wide data to long format.
pivot_longer(
cols,
names_to='name',
values_to='value',
names_prefix=None
) -> Callable
Parameters:
cols: list of str - Columns to pivot longer
names_to: str, default 'name' - Name for the new column containing column names
values_to: str, default 'value' - Name for the new column containing values
names_prefix: str, optional - Prefix to remove from column names
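To show what these parameters do to the data, here is a minimal wide-to-long sketch on row dicts. `lengthen` is a hypothetical helper; it emits rows grouped by input row, so the row order may differ from what PipeFrame or pandas produce.

```python
def lengthen(rows, cols, names_to="name", values_to="value", names_prefix=None):
    """Hypothetical helper: emit one output row per (input row, pivoted column)."""
    long_rows = []
    for row in rows:
        # Columns not being pivoted are repeated on every output row.
        identifiers = {k: v for k, v in row.items() if k not in cols}
        for col in cols:
            label = col
            if names_prefix and col.startswith(names_prefix):
                label = col[len(names_prefix):]
            long_rows.append({**identifiers, names_to: label, values_to: row[col]})
    return long_rows


wide = [{"store": "north", "revenue_2021": 10, "revenue_2022": 12}]
long_rows = lengthen(wide, cols=["revenue_2021", "revenue_2022"],
                     names_to="year", values_to="revenue", names_prefix="revenue_")
```

Note that the stripped labels remain strings ('2021', not 2021) in this sketch.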
Examples:
# Basic pivot
df >> pivot_longer(
cols=['Q1', 'Q2', 'Q3', 'Q4'],
names_to='quarter',
values_to='sales'
)
# Remove prefix
df >> pivot_longer(
cols=['revenue_2021', 'revenue_2022', 'revenue_2023'],
names_to='year',
values_to='revenue',
names_prefix='revenue_'
)
Convert long data to wide format.
pivot_wider(
id_cols=None,
names_from='name',
values_from='value',
values_fill=None
) -> Callable
Parameters:
id_cols: str or list of str - Columns to use as identifiers
names_from: str, default 'name' - Column to get new column names from
values_from: str, default 'value' - Column to get values from
values_fill: scalar, optional - Value to fill for missing combinations
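The role of values_fill is easiest to see when some identifier lacks a value for one of the new columns. The following sketch reshapes row dicts from long to wide; `widen` is a hypothetical helper that assumes id_cols is passed as a list.

```python
def widen(rows, id_cols, names_from="name", values_from="value", values_fill=None):
    """Hypothetical helper: one output row per identifier combination."""
    new_columns = []
    for row in rows:
        if row[names_from] not in new_columns:
            new_columns.append(row[names_from])

    wide = {}
    for row in rows:
        key = tuple(row[c] for c in id_cols)
        # Start each identifier combination with every new column set to the fill value.
        target = wide.setdefault(
            key, {**dict(zip(id_cols, key)), **{c: values_fill for c in new_columns}}
        )
        target[row[names_from]] = row[values_from]
    return list(wide.values())


grades = [
    {"student": "Ann", "subject": "math", "grade": 90},
    {"student": "Ann", "subject": "art", "grade": 85},
    {"student": "Ben", "subject": "math", "grade": 70},
]
table = widen(grades, id_cols=["student"], names_from="subject",
              values_from="grade", values_fill=0)
```

Ben has no art grade, so that cell receives the fill value 0 instead of staying empty.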
Examples:
# Basic pivot
df >> pivot_wider(
id_cols='student',
names_from='subject',
values_from='grade'
)
# With fill value
df >> pivot_wider(
id_cols=['year', 'month'],
names_from='category',
values_from='sales',
values_fill=0
)
Unpivot DataFrame (similar to pivot_longer).
melt(
id_vars=None,
value_vars=None,
var_name='variable',
value_name='value'
) -> Callable
Parameters:
id_vars: list of str, optional - Columns to keep as identifiers
value_vars: list of str, optional - Columns to unpivot
var_name: str, default 'variable' - Name for variable column
value_name: str, default 'value' - Name for value column
Examples:
df >> melt(
id_vars=['id', 'name'],
value_vars=['math', 'science', 'english'],
var_name='subject',
value_name='score'
)
Split one column into multiple columns.
separate(
col,
into,
sep='_',
remove=True
) -> Callable
Parameters:
col: str - Column to separate
into: list of str - Names for new columns
sep: str or regex, default '_' - Separator pattern
remove: bool, default True - Whether to remove original column
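As a rough illustration of how col, into, and sep relate, the sketch below splits a single string value into named pieces. `separate_value` is a hypothetical helper; it uses a literal separator only, and raising on a piece-count mismatch is an assumption, since this reference does not say how PipeFrame handles that case.

```python
def separate_value(value, into, sep="_"):
    """Hypothetical helper: split one string into named pieces."""
    parts = value.split(sep)
    if len(parts) != len(into):
        raise ValueError(f"expected {len(into)} pieces, got {len(parts)}: {value!r}")
    return dict(zip(into, parts))


pieces = separate_value("2024-03-15", into=["year", "month", "day"], sep="-")
```

The resulting pieces are strings, so numeric columns would still need a conversion step afterwards.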
Examples:
# Split on a space
df >> separate('full_name', into=['first', 'last'], sep=' ')
# Split on a hyphen
df >> separate('date', into=['year', 'month', 'day'], sep='-')
# Keep original column
df >> separate('code', into=['prefix', 'number'], sep='_', remove=False)
Combine multiple columns into one.
unite(
col,
cols,
sep='_',
remove=True,
na_rm=False
) -> Callable
Parameters:
col: str - Name for new combined column
cols: list of str - Columns to combine
sep: str, default '_' - Separator to use
remove: bool, default True - Whether to remove original columns
na_rm: bool, default False - Whether to remove NaN values before uniting
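The effect of na_rm can be sketched on a single row. In this illustration missing values are represented as None (an assumption; PipeFrame works with NaN), and `unite_values` is a hypothetical helper.

```python
def unite_values(row, cols, sep="_", na_rm=False):
    """Hypothetical helper: join several fields of a row dict into one string."""
    values = [row[c] for c in cols]
    if na_rm:
        # Assumption: missing values are represented as None in this sketch.
        values = [v for v in values if v is not None]
    return sep.join(str(v) for v in values)


person = {"first": "Ada", "middle": None, "last": "Lovelace"}
kept = unite_values(person, ["first", "middle", "last"], sep=" ")
dropped = unite_values(person, ["first", "middle", "last"], sep=" ", na_rm=True)
```

Without na_rm the missing value is stringified into the result; with it, the gap disappears and no doubled separator is left behind.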
Examples:
# Combine columns
df >> unite('full_name', ['first', 'last'], sep=' ')
# Custom separator
df >> unite('date', ['year', 'month', 'day'], sep='-')
# Keep original columns
df >> unite('full_address', ['street', 'city', 'zip'], sep=', ', remove=False)
Vectorized if-else condition.
if_else(condition, true_value, false_value)
Parameters:
condition: str - Boolean expression
true_value: scalar or str - Value when condition is True
false_value: scalar or str - Value when condition is False
Returns:
- Expression that can be used in define()
Examples:
df >> define(
status=if_else('age >= 18', 'adult', 'minor'),
category=if_else('salary > 60000', 'high', 'standard'),
eligible=if_else('score >= 70', True, False)
)
Multiple conditional statements (like SQL CASE WHEN).
case_when(*conditions, default=None)
Parameters:
*conditions: tuple - Tuples of (condition, value)
default: scalar, optional - Default value when no conditions match
Returns:
- Expression for use in define()
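Since case_when is described as working like SQL CASE WHEN, the order of the conditions matters: the first one that holds decides the value. The sketch below shows that first-match behavior for a single row, with predicates in place of string expressions; `case_when_value` and `grade` are hypothetical names.

```python
def case_when_value(row, *conditions, default=None):
    """Hypothetical helper: return the value of the first condition that holds."""
    for predicate, value in conditions:
        if predicate(row):
            return value
    return default


def grade(score):
    return case_when_value(
        {"score": score},
        (lambda r: r["score"] >= 90, "A"),
        (lambda r: r["score"] >= 80, "B"),
        (lambda r: r["score"] >= 70, "C"),
        default="F",
    )


grades = [grade(s) for s in (95, 85, 72, 40)]
```

A score of 95 also satisfies the later conditions, but it is graded 'A' because that condition is listed first.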
Examples:
df >> define(
grade=case_when(
('score >= 90', 'A'),
('score >= 80', 'B'),
('score >= 70', 'C'),
('score >= 60', 'D'),
default='F'
)
)
df >> define(
segment=case_when(
('revenue > 10000 & age < 30', 'Young High Value'),
('revenue > 10000 & age >= 30', 'Mature High Value'),
('revenue <= 10000 & age < 30', 'Young Standard'),
default='Mature Standard'
)
)
Select columns starting with a prefix.
starts_with(prefix: str) -> list
Select columns ending with a suffix.
ends_with(suffix: str) -> list
Select columns containing a pattern.
contains(pattern: str) -> list
Select columns matching a regex pattern.
matches(pattern: str) -> list
Examples:
df >> select(
'id',
starts_with('date_'),
ends_with('_total'),
contains('revenue'),
matches(r'Q[1-4]_\d{4}')
)
read_csv(filepath, **kwargs)
- Read CSV file
- Returns: DataFrame
read_excel(filepath, **kwargs)
- Read Excel file
- Returns: DataFrame
read_json(filepath, **kwargs)
- Read JSON file
- Returns: DataFrame
read_parquet(filepath, **kwargs)
- Read Parquet file
- Returns: DataFrame
read_sql(query, connection, **kwargs)
- Read from SQL database
- Returns: DataFrame
read_clipboard(**kwargs)
- Read from clipboard
- Returns: DataFrame
Examples:
from pipeframe import read_csv, read_excel, read_json
# CSV
df = read_csv('data.csv')
df = read_csv('data.csv', sep=';', encoding='utf-8')
# Excel
df = read_excel('data.xlsx', sheet_name='Sheet1')
df = read_excel('data.xlsx', sheet_name=0)
# JSON
df = read_json('data.json', orient='records')
# Parquet
df = read_parquet('data.parquet')
# SQL
from sqlalchemy import create_engine
engine = create_engine('sqlite:///database.db')
df = read_sql('SELECT * FROM users', engine)
# Clipboard
df = read_clipboard() # Paste data from Excel/spreadsheet
All write methods are available on DataFrame objects:
df.to_csv(filepath, **kwargs)
df.to_excel(filepath, **kwargs)
df.to_json(filepath, **kwargs)
df.to_parquet(filepath, **kwargs)
Examples:
# CSV
df.to_csv('output.csv', index=False)
# Excel
df.to_excel('report.xlsx', sheet_name='Results', index=False)
# JSON
df.to_json('data.json', orient='records', lines=True)
# Parquet
df.to_parquet('data.parquet', compression='gzip')
Quick look at intermediate data in pipelines.
peek(n=5, where='head') -> Callable
Parameters:
n: int, default 5 - Number of rows to display
where: {'head', 'tail', 'sample'}, default 'head' - Which rows to show
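What makes a step like peek usable mid-pipeline is that it shows some rows and then hands the data on unchanged. The sketch below illustrates that pass-through idea on a plain list; `peek_rows` and its `out` parameter are hypothetical, and the 'sample' option is left out for brevity.

```python
def peek_rows(n=5, where="head", out=print):
    """Hypothetical pass-through step: show a few rows, return the data unchanged."""
    def apply(rows):
        shown = rows[:n] if where == "head" else rows[-n:]
        for row in shown:
            out(row)
        return rows        # returning the input keeps the pipeline going
    return apply


captured = []
data = [{"x": i} for i in range(10)]
result = peek_rows(3, out=captured.append)(data)
```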
Examples:
result = (df
>> filter('age > 30')
>> peek(3) # Shows first 3 rows
>> define(bonus='salary * 0.1')
>> peek(5, where='tail') # Shows last 5 rows
)
Capture and compare DataFrame states.
from pipeframe.utils import Snapshot
snap = Snapshot()
result = (df
>> snap.capture('before_filter')
>> filter('x > 50')
>> snap.capture('after_filter')
>> group_by('category')
)
# Compare snapshots
stats = snap.compare('before_filter', 'after_filter')
@timer
- Time function execution
@catch_empty(default=None)
- Handle empty DataFrames gracefully
@validate_columns(*required_cols)
- Validate required columns exist
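A decorator that takes arguments, such as validate_columns, is a function returning a decorator. The sketch below shows that pattern under stated assumptions: it is not PipeFrame's implementation, it only requires the input to expose a `columns` attribute, and the ValueError is a placeholder for whatever exception the library actually raises.

```python
import functools
from types import SimpleNamespace


def validate_columns(*required_cols):
    """Hypothetical version of the decorator: fail fast when columns are missing."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(df, *args, **kwargs):
            missing = [c for c in required_cols if c not in df.columns]
            if missing:
                raise ValueError(f"missing required columns: {missing}")
            return func(df, *args, **kwargs)
        return wrapper
    return decorator


@validate_columns("name", "age")
def count_columns(df):
    return len(df.columns)


# SimpleNamespace stands in for a DataFrame; only `.columns` is used here.
ok = count_columns(SimpleNamespace(columns=["name", "age", "salary"]))
try:
    count_columns(SimpleNamespace(columns=["name"]))
    error = None
except ValueError as exc:
    error = str(exc)
```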
Examples:
from pipeframe.utils import timer, catch_empty, validate_columns
@timer
def process_data(df):
    return df >> filter('x > 100') >> group_by('category')
@catch_empty(default=DataFrame())
def get_filtered_data(df):
    return df >> filter('value > 1000')
@validate_columns('name', 'age', 'salary')
def process_employees(df):
    return df >> filter('age > 30')
PipeFrame provides custom exceptions for clear error messages:
PipeFrameError - Base exception
PipeFrameTypeError - Type-related errors
PipeFrameValueError - Invalid values
PipeFrameExpressionError - Expression evaluation errors
PipeFrameIOError - I/O operation errors
Example:
from pipeframe.exceptions import PipeFrameExpressionError
try:
    df >> define(bad="__import__('os').system('ls')")
except PipeFrameExpressionError as e:
    print(f"Dangerous expression blocked: {e}")
PipeFrame validates all expressions to prevent code injection:
Blocked patterns:
__import__
exec()
eval()
compile()
open()
file()
Safe evaluation:
- Uses pandas eval() with restricted environment
- Validates expression syntax
- Limited namespace access
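A deny-list check of the kind described above might look roughly like the following. This is a sketch, not PipeFrame's validator: the regular expression, the `check_expression` name, and the use of ValueError are assumptions, and a pattern check on its own should not be treated as a complete defense against code injection.

```python
import re

# Hypothetical deny-list built from the blocked patterns listed above.
# The word boundary keeps names such as profile( from matching file(.
BLOCKED = re.compile(r"__import__|\b(?:exec|eval|compile|open|file)\s*\(")


def check_expression(expression):
    """Hypothetical pre-check: reject expressions containing a blocked pattern."""
    match = BLOCKED.search(expression)
    if match:
        raise ValueError(f"blocked pattern in expression: {match.group(0)!r}")
    return expression


safe = check_expression("salary * 0.1")
try:
    check_expression("__import__('os').system('ls')")
    rejected = False
except ValueError:
    rejected = True
```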
# ✅ Good - readable pipeline
result = (df
>> filter('age > 25')
>> define(bonus='salary * 0.1')
>> select('name', 'salary', 'bonus')
>> arrange('-bonus')
)
# ❌ Avoid - nested functions
result = arrange(
select(
define(
filter(df, 'age > 25'),
bonus='salary * 0.1'
),
'name', 'salary', 'bonus'
),
'-bonus'
)
# Filter and transform
processed = (df
>> filter('date >= "2024-01-01"')
>> define(
revenue='price * quantity',
profit='revenue - cost'
)
)
# Aggregate
summary = (processed
>> group_by('category')
>> summarize(
total_revenue='sum(revenue)',
total_profit='sum(profit)'
)
)
result = (df
>> filter('value > 0')
>> peek(3) # Debug checkpoint
>> group_by('category')
>> peek(5, where='tail') # Inspect grouped data before aggregating
>> summarize(total='sum(value)')
)
# ✅ Readable and concise
df >> filter('age > 30 & salary > 50000')
# ❌ More verbose
df >> filter(lambda df: (df['age'] > 30) & (df['salary'] > 50000))
from pipeframe import *
# Load data
sales = read_csv('sales_data.csv')
# Full analysis pipeline
analysis = (sales
# Data cleaning
>> filter('order_date >= "2024-01-01" & revenue > 0')
>> define(
quarter='pd.to_datetime(order_date).dt.quarter',
profit='revenue - cost',
margin='(profit / revenue) * 100'
)
# Categorization
>> define(
customer_segment=case_when(
('total_orders > 10 & revenue > 10000', 'VIP'),
('total_orders > 5', 'Regular'),
default='New'
)
)
# Aggregation
>> group_by('product_category', 'quarter', 'customer_segment')
>> summarize(
total_revenue='sum(revenue)',
total_profit='sum(profit)',
avg_margin='mean(margin)',
num_orders='count()',
unique_customers='nunique(customer_id)'
)
# Post-aggregation
>> define(
profit_per_order='total_profit / num_orders',
revenue_per_customer='total_revenue / unique_customers'
)
# Final formatting
>> arrange('-total_revenue')
>> select(
'product_category',
'quarter',
'customer_segment',
'total_revenue',
'total_profit',
'avg_margin',
'num_orders'
)
)
# Export results
analysis.to_excel('quarterly_analysis.xlsx', sheet_name='Summary')
See CHANGELOG.md for detailed version history.
- GitHub Issues: https://github.com/Yasser03/pipeframe/issues
- Discussions: https://github.com/Yasser03/pipeframe/discussions
- Email: yasser.mustafan@gmail.com
Last Updated: 2026-02-15
License: MIT