diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
new file mode 100644
index 0000000..b458216
--- /dev/null
+++ b/.github/workflows/docs.yml
@@ -0,0 +1,50 @@
+name: Docs
+
+on:
+  push:
+    branches: [main]
+  workflow_dispatch:
+
+permissions:
+  contents: read
+  pages: write
+  id-token: write
+
+concurrency:
+  group: "pages"
+  cancel-in-progress: false
+
+jobs:
+  build:
+    runs-on: ubuntu-22.04
+    steps:
+      - uses: actions/checkout@v5
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          enable-cache: true
+          cache-dependency-glob: pyproject.toml
+          python-version: "3.12"
+
+      - name: Install dependencies
+        run: uv sync --group dev
+
+      - name: Build docs
+        run: uv run mkdocs build
+
+      - name: Upload artifact
+        uses: actions/upload-pages-artifact@v3
+        with:
+          path: site/
+
+  deploy:
+    needs: build
+    runs-on: ubuntu-22.04
+    environment:
+      name: github-pages
+      url: ${{ steps.deployment.outputs.page_url }}
+    steps:
+      - name: Deploy to GitHub Pages
+        id: deployment
+        uses: actions/deploy-pages@v4
diff --git a/.gitignore b/.gitignore
index 2dd2c76..887ef2e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -7,3 +7,4 @@ sandbox/
 __pycache__
 *.pyc
 .coverage
+site/
diff --git a/site/404.html b/site/404.html
deleted file mode 100644
index ea35fd1..0000000
--- a/site/404.html
+++ /dev/null
@@ -1,929 +0,0 @@
# Chunked Processing

Process large Parquet files that exceed available RAM by reading and transforming data in chunks.

When working with files larger than available memory, use `process_chunked()` instead of `process()`. This method reads the source in chunks, applies the pipeline to each chunk, and returns the combined result together with a `ChunkedProtocol` carrying per-chunk statistics:

```python
from transformplan import TransformPlan, Col

plan = (
    TransformPlan()
    .col_rename(column="PatientID", new_name="patient_id")
    .rows_filter(Col("age") >= 18)
    .rows_unique(columns=["patient_id", "visit_date"])
)

# Process a large file in chunks
result, protocol = plan.process_chunked(
    source="patients_10gb.parquet",
    partition_key="patient_id",  # Keep patient rows together
    chunk_size=100_000,
)

protocol.print()
```
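Chunk boundaries must respect the partition key so that all rows sharing a key land in the same chunk. A minimal pure-Python sketch of that boundary logic over plain dicts (the `partition_chunks` helper is illustrative, not the library's internals):

```python
from itertools import groupby

def partition_chunks(rows, key, chunk_size):
    """Yield chunks of roughly chunk_size rows, never splitting a
    partition: all rows sharing the same key stay in one chunk."""
    chunk = []
    # groupby requires rows already grouped/sorted by the partition key
    for _, group in groupby(rows, key=lambda r: r[key]):
        group = list(group)
        if chunk and len(chunk) + len(group) > chunk_size:
            yield chunk
            chunk = []
        chunk.extend(group)
    if chunk:
        yield chunk

# Nine rows, three per patient
rows = [{"patient_id": i // 3, "visit": i % 3} for i in range(9)]
chunks = list(partition_chunks(rows, "patient_id", chunk_size=4))
# Each patient's rows end up in exactly one chunk
assert all(len({r["patient_id"] for r in c}) == 1 for c in chunks)
```

A chunk may overshoot `chunk_size` when a single partition is larger than the target; that is the unavoidable trade-off of keeping groups intact.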
## Operation compatibility

Not all operations can be used with chunked processing. Operations are classified into three categories.

### Chunk-safe operations

These operations process each row independently and work with any chunking:

| Category | Operations |
|---|---|
| Column | `col_drop`, `col_rename`, `col_cast`, `col_reorder`, `col_select`, `col_duplicate`, `col_fill_null`, `col_drop_null`, `col_drop_zero`, `col_add`, `col_add_uuid`, `col_hash`, `col_coalesce` |
| Math | `math_add`, `math_subtract`, `math_multiply`, `math_divide`, `math_clamp`, `math_abs`, `math_round`, `math_set_min`, `math_set_max`, `math_add_columns`, `math_subtract_columns`, `math_multiply_columns`, `math_divide_columns`, `math_percent_of` |
| String | `str_replace`, `str_slice`, `str_truncate`, `str_lower`, `str_upper`, `str_strip`, `str_pad`, `str_split`, `str_concat`, `str_extract` |
| Datetime | `dt_year`, `dt_month`, `dt_day`, `dt_week`, `dt_quarter`, `dt_year_month`, `dt_quarter_year`, `dt_calendar_week`, `dt_parse`, `dt_format`, `dt_diff_days`, `dt_age_years`, `dt_is_between`, `dt_truncate` |
| Map | `map_values`, `map_discretize`, `map_bool_to_int`, `map_null_to_value`, `map_value_to_null`, `map_case`, `map_from_column` |
| Rows | `rows_filter`, `rows_drop`, `rows_flag`, `rows_explode`, `rows_drop_nulls`, `rows_melt` |
### Group-dependent operations

These operations need all rows of a group together. They work with chunked processing only when `partition_key` includes their grouping columns:

| Operation | Group parameter | Requirement |
|---|---|---|
| `rows_unique` | `columns` | `partition_key` must contain `columns` |
| `rows_deduplicate` | `columns` | `partition_key` must contain `columns` |
| `math_cumsum` | `group_by` | `partition_key` must contain `group_by` |
| `math_rank` | `group_by` | `partition_key` must contain `group_by` |

Example: to use `rows_unique(columns=["patient_id"])`, you must set `partition_key="patient_id"` (or a list containing it).
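The requirement exists because applying a group-dependent operation per chunk is only equivalent to applying it globally when no group straddles a chunk boundary. A small self-contained illustration in plain Python (the `dedupe` helper stands in for `rows_unique`):

```python
def dedupe(rows, key):
    """Keep the first row per key value (stand-in for rows_unique)."""
    seen, out = set(), []
    for r in rows:
        if r[key] not in seen:
            seen.add(r[key])
            out.append(r)
    return out

rows = [{"patient_id": p, "n": i} for i, p in enumerate([1, 1, 2, 2, 3, 3])]

# Chunks aligned with patient_id: per-chunk dedupe equals global dedupe
good = [rows[0:2], rows[2:4], rows[4:6]]
per_chunk = [r for c in good for r in dedupe(c, "patient_id")]
assert per_chunk == dedupe(rows, "patient_id")

# Chunks that split patient 2 across a boundary leak a duplicate
bad = [rows[0:3], rows[3:6]]
leaked = [r for c in bad for r in dedupe(c, "patient_id")]
assert len(leaked) > len(dedupe(rows, "patient_id"))
```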
### Global operations

These operations require the full dataset and cannot be used with chunked processing:

- `rows_sort` - requires global ordering
- `rows_pivot` - needs all values to determine the output columns
- `rows_sample` - random sampling requires the full dataset
- `rows_head` - requires global ordering
- `rows_tail` - requires global ordering

Attempting to use any of these operations will raise a `ChunkingError`.
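To see why an operation like `rows_sort` is global: sorting each chunk independently does not produce a globally sorted result, no matter how the chunks are cut, unless a final merge pass sees everything. A quick demonstration:

```python
data = [5, 1, 4, 2, 3, 0]
chunks = [data[0:3], data[3:6]]  # [5, 1, 4] and [2, 3, 0]

# Per-chunk sort: each chunk is ordered, but the concatenation is not
per_chunk = [x for c in chunks for x in sorted(c)]
assert per_chunk == [1, 4, 5, 0, 2, 3]
assert per_chunk != sorted(data)  # a global sort needs all rows at once
```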
## Validating a pipeline

Before processing, validate that your pipeline is compatible with chunked processing:

```python
# Validate without processing
validation = plan.validate_chunked(
    schema={"patient_id": pl.Utf8, "age": pl.Int64, "visit_date": pl.Date},
    partition_key="patient_id",
)

print(validation)
# Pipeline is compatible with chunked processing.

# Or validate with a sample DataFrame
validation = plan.validate_chunked(data=sample_df, partition_key="patient_id")

if not validation.is_valid:
    for error in validation.errors:
        print(f"Error: {error}")
```
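Conceptually, validation walks the pipeline and classifies each operation against the three categories above. A minimal sketch of that classification pass (the operation sets are abbreviated and the `validate` helper is illustrative; the real library covers the full tables above):

```python
GLOBAL_OPS = {"rows_sort", "rows_pivot", "rows_sample", "rows_head", "rows_tail"}
GROUP_OPS = {"rows_unique", "rows_deduplicate", "math_cumsum", "math_rank"}

def validate(ops, partition_key):
    """ops: list of (name, group_columns_or_None). Returns (is_valid, errors)."""
    keys = set([partition_key] if isinstance(partition_key, str) else partition_key)
    errors = []
    for name, group_cols in ops:
        if name in GLOBAL_OPS:
            errors.append(f"{name} requires the full dataset")
        elif name in GROUP_OPS and not set(group_cols or []) <= keys:
            errors.append(f"{name} groups by columns outside the partition key")
    return (not errors, errors)

ok, errs = validate(
    [("col_rename", None), ("rows_unique", ["patient_id"])],
    partition_key="patient_id",
)
assert ok and errs == []

ok, errs = validate([("rows_sort", None)], partition_key="patient_id")
assert not ok
```

Chunk-safe operations pass unconditionally; only global and group-dependent operations can invalidate a pipeline.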
## ChunkedProtocol

Protocol for tracking chunked processing with per-chunk information. Tracks the overall processing as well as individual chunk statistics. Source: `transformplan/chunking.py`. `ChunkedProtocol()` initializes an empty protocol.

**Attributes and properties**

| Name | Description |
|---|---|
| `VERSION` | Protocol version string. |
| `chunks` | List of chunk information. |
| `total_input_rows` | Total rows across all input chunks. |
| `total_output_rows` | Total rows across all output chunks. |
| `total_elapsed_seconds` | Total processing time across all chunks. |
| `num_chunks` | Number of chunks processed. |
| `metadata` | Protocol metadata. |

**Methods**

| Method | Description |
|---|---|
| `set_source(...)` | Set source file information. |
| `set_operations(...)` | Set the recorded pipeline operations. |
| `set_metadata(...)` | Set protocol metadata. |
| `add_chunk(...)` | Add information for a processed chunk. |
| `output_hash()` | Compute a combined hash of all output chunk hashes; returns a 16-character hex hash. |
| `to_dict()` | Serialize the protocol to a dictionary (`dict[str, Any]`). |
| `from_dict(data)` | Classmethod; deserialize a `ChunkedProtocol` from a dictionary. |
| `to_json(path=None, indent=2)` | Serialize to JSON; optionally write to `path`. Returns the JSON string. |
| `from_json(source)` | Classmethod; deserialize from a JSON string or a path to a JSON file. |
| `summary()` | Generate a human-readable summary of the chunked processing. |
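The combined output hash can be sketched as hashing the concatenation of the per-chunk hashes. Assuming SHA-256 truncated to 16 hex characters (the library's exact scheme may differ, so treat this as a sketch of the idea):

```python
import hashlib

def combined_hash(chunk_hashes):
    """Combine per-chunk output hashes into one 16-character hex digest."""
    h = hashlib.sha256()
    for ch in chunk_hashes:
        h.update(ch.encode("utf-8"))
    return h.hexdigest()[:16]

digest = combined_hash(["a1b2c3d4e5f67890", "b2c3d4e5f6789012"])
assert len(digest) == 16
# Order matters: reordering chunks changes the combined hash
assert digest != combined_hash(["b2c3d4e5f6789012", "a1b2c3d4e5f67890"])
```

Folding hashes in chunk order makes the digest sensitive to both chunk content and chunk sequence, which is what an audit trail wants.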
## ChunkValidationResult

Result of validating a pipeline for chunked processing. A dataclass:

```python
ChunkValidationResult(
    is_valid: bool,
    errors: list[str] = list(),
    warnings: list[str] = list(),
    global_operations: list[str] = list(),
    group_dependent_ops: list[tuple[str, list[str] | None]] = list(),
)
```

| Attribute | Type | Description |
|---|---|---|
| `is_valid` | `bool` | Whether the pipeline can be processed in chunks. |
| `errors` | `list[str]` | Error messages explaining incompatibilities. |
| `warnings` | `list[str]` | Warning messages (non-blocking). |
| `global_operations` | `list[str]` | Names of operations that require the full dataset. |
| `group_dependent_ops` | `list[tuple[str, list[str] \| None]]` | `(operation, columns)` pairs for group-dependent operations. |

## ChunkingError

```python
ChunkingError(message: str, validation_result: ChunkValidationResult | None = None)
```

Bases: `Exception`. Raised when a pipeline is incompatible with chunked processing.

| Attribute | Description |
|---|---|
| `validation_result` | The validation result containing error details. |

## Complete example
```python
import polars as pl

from transformplan import TransformPlan, Col, ChunkingError

# Build a pipeline with a group-dependent operation
plan = (
    TransformPlan()
    .col_rename(column="PatientID", new_name="patient_id")
    .dt_age_years(birth_column="date_of_birth", new_column="age")
    .rows_filter(Col("age") >= 18)
    .rows_unique(columns=["patient_id", "visit_date"])  # Needs partition key
    .col_drop("date_of_birth")
)

# Validate first
validation = plan.validate_chunked(
    schema={
        "PatientID": pl.Utf8,
        "date_of_birth": pl.Date,
        "visit_date": pl.Date,
        "diagnosis": pl.Utf8,
    },
    partition_key="PatientID",
)

if validation.is_valid:
    # Process the large file
    result, protocol = plan.process_chunked(
        source="patients_archive.parquet",
        partition_key="PatientID",
        chunk_size=50_000,
    )

    # View processing summary
    protocol.print()

    # Save audit trail
    protocol.to_json("chunked_audit.json")
else:
    print("Pipeline incompatible with chunking:")
    for error in validation.errors:
        print(f" - {error}")
```
A successful run prints a protocol summary like:

```text
======================================================================
CHUNKED PROCESSING PROTOCOL
======================================================================
Source: patients_archive.parquet
Partition key: ['PatientID']
Target chunk size: 50,000
----------------------------------------------------------------------
Chunks processed: 24
Total input rows: 1,187,432
Total output rows: 892,156
Row change: -295,276
Total time: 12.4523s
Avg time per chunk: 0.5188s
Output hash: 7a3b2c1d4e5f6789
----------------------------------------------------------------------

#    Input    Output   Change    Time     Hash
----------------------------------------------------------------------
0    49,832   37,291   -12,541   0.4821s  a1b2c3d4e5f67890
1    50,127   38,456   -11,671   0.5123s  b2c3d4e5f6789012
2    49,956   37,892   -12,064   0.4956s  c3d4e5f678901234
...
======================================================================
```
# Filters

Serializable filter expressions for row filtering operations.

The filter system provides a way to build complex filter conditions that can be serialized to JSON and deserialized back. This enables reproducible pipelines that can be saved and shared.

```python
from transformplan import Col, Filter

# Build a filter
filter_expr = (Col("age") >= 18) & (Col("status") == "active")

# Use it in a pipeline
plan = TransformPlan().rows_filter(filter_expr)

# Serialize
filter_dict = filter_expr.to_dict()

# Deserialize
restored = Filter.from_dict(filter_dict)
```
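The serialization contract can be illustrated with a stripped-down stand-in covering just two filter types (the real `Filter` hierarchy, documented below, covers many more; the class and function names here are illustrative):

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Eq:
    column: str
    value: Any
    def to_dict(self):
        return {"type": "eq", "column": self.column, "value": self.value}

@dataclass
class And:
    left: Any
    right: Any
    def to_dict(self):
        return {"type": "and",
                "left": self.left.to_dict(),
                "right": self.right.to_dict()}

def from_dict(d):
    """Rebuild a filter tree from its dictionary form."""
    if d["type"] == "eq":
        return Eq(d["column"], d["value"])
    if d["type"] == "and":
        return And(from_dict(d["left"]), from_dict(d["right"]))
    raise ValueError(f"unknown filter type: {d['type']}")

f = And(Eq("status", "active"), Eq("age", 18))
assert from_dict(f.to_dict()) == f  # lossless round trip
```

Because every node carries a `"type"` key and only plain values, the whole tree survives a trip through `json.dumps`/`json.loads` unchanged.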
## Col

Column reference for building filter expressions.

`Col` provides a fluent interface for creating filter conditions on DataFrame columns. Use comparison operators and methods to build filters that can be combined using `&` (and), `|` (or), and `~` (not).

`Col(name)` takes one required parameter: `name` (`str`), the name of the column to reference.

**Comparison operators**

```python
Col("age") >= 18
Col("status") == "active"
Col("price") < 100
```

**String methods**

```python
Col("email").str_contains("@company.com")
Col("name").str_starts_with("A")
```

**Null checks**

```python
Col("optional").is_null()
Col("required").is_not_null()
```

**Membership**

```python
Col("country").is_in(["US", "CA", "MX"])
Col("age").between(18, 65)
```

**Combining conditions**

```python
(Col("age") >= 18) & (Col("status") == "active")
(Col("role") == "admin") | (Col("role") == "moderator")
```
### Col methods

All methods live in `transformplan/filters.py`. Each takes the value(s) to compare against and returns a `Filter` instance of the named type.

| Method | Returns | Description |
|---|---|---|
| `__eq__(value)` | `Eq` | Equality filter (column == value). |
| `__ne__(value)` | `Ne` | Inequality filter (column != value). |
| `__gt__(value)` | `Gt` | Greater-than filter (column > value). |
| `__ge__(value)` | `Ge` | Greater-or-equal filter (column >= value). |
| `__lt__(value)` | `Lt` | Less-than filter (column < value). |
| `__le__(value)` | `Le` | Less-or-equal filter (column <= value). |
| `is_in(values)` | `IsIn` | Membership filter; `values` is a `Sequence[Any]`. Example: `Col("status").is_in(["active", "pending"])` |
| `is_null()` | `IsNull` | Null check. Example: `Col("optional_field").is_null()` |
| `is_not_null()` | `IsNotNull` | Not-null check. Example: `Col("required_field").is_not_null()` |
| `str_contains(pattern, literal=True)` | `StrContains` | Contains check; `literal=False` treats `pattern` as a regex. Example: `Col("description").str_contains(r"\d+", literal=False)` |
| `str_starts_with(prefix)` | `StrStartsWith` | Starts-with check. Example: `Col("code").str_starts_with("PRD-")` |
| `str_ends_with(suffix)` | `StrEndsWith` | Ends-with check. Example: `Col("filename").str_ends_with(".csv")` |
| `between(lower, upper)` | `Between` | Inclusive range filter (lower <= column <= upper). Example: `Col("age").between(18, 65)` |
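The fluent operators work because `Col` overloads Python's comparison dunders to return filter objects rather than booleans. A minimal sketch of the mechanism (the `Eq`/`Ge` dataclasses here are illustrative stand-ins, not the library's classes):

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Eq:
    column: str
    value: Any

@dataclass
class Ge:
    column: str
    value: Any

class Col:
    def __init__(self, name: str):
        self.name = name

    # Comparison dunders build filter nodes instead of returning bools
    def __eq__(self, value):  # type: ignore[override]
        return Eq(self.name, value)

    def __ge__(self, value):
        return Ge(self.name, value)

f = Col("age") >= 18
assert isinstance(f, Ge) and f.column == "age" and f.value == 18
```

Overriding `__eq__` this way makes `Col` instances unhashable and unusable in plain equality checks; that trade-off is what buys the `Col("age") >= 18` syntax.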
## Filter

Bases: `ABC`. Abstract base class for all filter expressions.

Filters are composable, serializable expressions that define row-selection criteria. They can be combined using the logical operators `&`, `|`, and `~`, and serialized to dictionaries for storage and transmission.

```python
filter1 = Col("age") >= 18
filter2 = Col("status") == "active"
combined = filter1 & filter2  # And filter
inverted = ~filter1           # Not filter
```

| Method | Description |
|---|---|
| `to_expr()` | Abstract. Convert to a Polars expression (`Expr`) usable with `DataFrame.filter()`. |
| `to_dict()` | Abstract. Serialize to a dictionary for JSON storage. The dictionary includes a `'type'` key identifying the filter class, plus any parameters needed to reconstruct the filter. |
| `from_dict(data)` | Classmethod. Deserialize a filter from a dictionary, using the `'type'` key to determine which filter class to instantiate. Raises `ValueError` if `'type'` is missing or unknown. Example: `Filter.from_dict({"type": "eq", "column": "status", "value": "active"})` |
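One common way to implement this `'type'`-keyed dispatch is a registry mapping type names to constructors. A sketch of the pattern (the registry and decorator names are hypothetical, not the library's):

```python
from typing import Any, Callable

_REGISTRY: dict[str, Callable[[dict], Any]] = {}

def register(type_name: str):
    """Class decorator registering a filter under its 'type' key."""
    def wrap(cls):
        _REGISTRY[type_name] = lambda d: cls(
            **{k: v for k, v in d.items() if k != "type"}
        )
        return cls
    return wrap

@register("eq")
class Eq:
    def __init__(self, column, value):
        self.column, self.value = column, value

def from_dict(data: dict) -> Any:
    try:
        return _REGISTRY[data["type"]](data)
    except KeyError as exc:
        raise ValueError(f"unknown or missing filter type: {data.get('type')}") from exc

f = from_dict({"type": "eq", "column": "status", "value": "active"})
assert (f.column, f.value) == ("status", "active")
```

New filter classes then plug in with a one-line decorator, and a missing or unknown `'type'` key surfaces as the documented `ValueError`.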
## Filter classes

All concrete filters are dataclasses with base `Filter`.

**Comparison filters.** Each has attributes `column: str` (name of the column to compare) and `value: Any` (value to compare against):

| Class | Meaning |
|---|---|
| `Eq` | column == value |
| `Ne` | column != value |
| `Gt` | column > value |
| `Ge` | column >= value |
| `Lt` | column < value |
| `Le` | column <= value |

**Other column filters**

| Class | Attributes | Meaning |
|---|---|---|
| `IsIn` | `column: str`, `values: Sequence[Any]` | Column value is in the given sequence. |
| `Between` | `column: str`, `lower: Any`, `upper: Any` | lower <= column <= upper (both bounds inclusive). |
| `IsNull` | `column: str` | Column is null. |
| `IsNotNull` | `column: str` | Column is not null. |
| `StrContains` | `column: str`, `pattern: str`, `literal: bool` | Column contains the pattern; `literal=False` treats it as a regex. |
| `StrStartsWith` | `column: str`, `prefix: str` | Column starts with the prefix. |
| `StrEndsWith` | `column: str`, `suffix: str` | Column ends with the suffix. |

**Logical combinators**

| Class | Attributes | Meaning | Typically created with |
|---|---|---|---|
| `And` | `left: Filter`, `right: Filter` | Both conditions must be true. | `(Col("age") >= 18) & (Col("status") == "active")` |
| `Or` | `left: Filter`, `right: Filter` | At least one condition must be true. | `(Col("role") == "admin") \| (Col("role") == "moderator")` |
| `Not` | `operand: Filter` | Inverts the condition. | `~(Col("deleted") == True)` |
## Examples

```python
from transformplan import Col

# Numeric comparisons
Col("age") >= 18
Col("price") < 100
Col("quantity") == 0

# String equality
Col("status") == "active"
Col("country") != "US"

# Contains substring
Col("email").str_contains("@company.com")

# Starts/ends with
Col("code").str_starts_with("PRD-")
Col("filename").str_ends_with(".csv")

# Check whether a value is in a list
Col("status").is_in(["active", "pending"])

# Range check
Col("age").between(18, 65)

# AND: both conditions must be true
(Col("age") >= 18) & (Col("status") == "active")

# OR: at least one condition must be true
(Col("role") == "admin") | (Col("role") == "moderator")

# NOT: invert a condition
~(Col("deleted") == True)

# Complex combinations
(
    (Col("age") >= 18)
    & (Col("country").is_in(["US", "CA"]))
    & ~(Col("status") == "banned")
)
```
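When applied, these combined expressions follow ordinary boolean semantics row by row. A tiny evaluator over plain dicts illustrates the intended behavior of the last combination above (`evaluate` and the dict specs are illustrative, mirroring the `to_dict` shape rather than the library's internals):

```python
def evaluate(row, spec):
    """Evaluate a nested filter spec (dicts with a 'type' key) on one row."""
    t = spec["type"]
    if t == "ge":
        return row[spec["column"]] >= spec["value"]
    if t == "eq":
        return row[spec["column"]] == spec["value"]
    if t == "and":
        return evaluate(row, spec["left"]) and evaluate(row, spec["right"])
    if t == "not":
        return not evaluate(row, spec["operand"])
    raise ValueError(f"unknown filter type: {t}")

# (age >= 18) & ~(status == "banned")
spec = {
    "type": "and",
    "left": {"type": "ge", "column": "age", "value": 18},
    "right": {"type": "not",
              "operand": {"type": "eq", "column": "status", "value": "banned"}},
}

rows = [
    {"age": 20, "status": "active"},
    {"age": 20, "status": "banned"},
    {"age": 15, "status": "active"},
]
kept = [r for r in rows if evaluate(r, spec)]
assert kept == [{"age": 20, "status": "active"}]
```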
# API Reference

This section provides detailed API documentation for all TransformPlan classes and functions.

## Core classes

| Class | Description |
|---|---|
| `TransformPlan` | Main class for building transformation pipelines |
| `Protocol` | Audit trail capturing transformation history |
| `Col` | Column reference for building filter expressions |
| `Filter` | Base class for serializable filter expressions |

## Validation classes

| Class | Description |
|---|---|
| `ValidationResult` | Result of schema validation |
| `DryRunResult` | Preview of pipeline execution |
| `SchemaValidationError` | Exception raised on validation failure |

## Chunked processing classes

| Class | Description |
|---|---|
| `ChunkedProtocol` | Protocol for tracking chunked file processing |
| `ChunkValidationResult` | Result of validating a pipeline for chunked processing |
| `ChunkingError` | Exception raised when a pipeline is incompatible with chunking |

## Operation categories

TransformPlan provides operations organized by category:

| Category | Description | Examples |
|---|---|---|
| Column operations | Add, drop, rename, cast columns | `col_drop`, `col_rename`, `col_cast` |
| Math operations | Arithmetic on numeric columns | `math_add`, `math_multiply`, `math_round` |
| Row operations | Filter, sort, deduplicate rows | `rows_filter`, `rows_sort`, `rows_unique` |
| String operations | Text manipulation | `str_replace`, `str_lower`, `str_split` |
| Datetime operations | Date and time extraction | `dt_year`, `dt_month`, `dt_parse` |
| Map operations | Value mapping and discretization | `map_values`, `map_discretize` |
## Operations at a glance

All TransformPlan operations at a glance. Click method names for detailed documentation.

### Column operations

| Method | Description |
|---|---|
| `col_drop` | Drop a column from the DataFrame |
| `col_rename` | Rename a column |
| `col_cast` | Cast a column to a different dtype |
| `col_reorder` | Reorder columns (drops unlisted) |
| `col_select` | Keep only the specified columns |
| `col_duplicate` | Duplicate a column under a new name |
| `col_fill_null` | Fill null values in a column |
| `col_drop_null` | Drop rows with null values in specified columns |
| `col_drop_zero` | Drop rows where the specified column is zero |
| `col_add` | Add a new column with a constant value or expression |
| `col_add_uuid` | Add a column with unique random identifiers |
| `col_hash` | Hash one or more columns into a new column |
| `col_coalesce` | Take the first non-null value across multiple columns |

### Math operations

| Method | Description |
|---|---|
| `math_add` | Add a scalar value to a column |
| `math_subtract` | Subtract a scalar value from a column |
| `math_multiply` | Multiply a column by a scalar value |
| `math_divide` | Divide a column by a scalar value |
| `math_clamp` | Clamp column values to a range |
| `math_abs` | Take the absolute value of a column |
| `math_round` | Round a column to specified decimal places |
| `math_set_min` | Set a minimum value for a column |
| `math_set_max` | Set a maximum value for a column |
| `math_add_columns` | Add two columns together into a new column |
| `math_subtract_columns` | Subtract one column from another |
| `math_multiply_columns` | Multiply two columns together |
| `math_divide_columns` | Divide one column by another |
| `math_percent_of` | Calculate the percentage of one column relative to another |
| `math_cumsum` | Calculate a cumulative sum (optionally grouped) |
| `math_rank` | Calculate the rank of values |

### Row operations

| Method | Description |
|---|---|
| `rows_filter` | Filter rows using a Filter expression |
| `rows_drop` | Drop rows matching a filter |
| `rows_drop_nulls` | Drop rows with null values |
| `rows_flag` | Add a flag column based on a filter condition |
| `rows_unique` | Keep unique rows based on specified columns |
| `rows_deduplicate` | Deduplicate by keeping first/last based on sort order |
| `rows_sort` | Sort rows by one or more columns |
| `rows_head` | Keep only the first n rows |
| `rows_tail` | Keep only the last n rows |
| `rows_sample` | Sample rows from the DataFrame |
| `rows_explode` | Explode a list column into multiple rows |
| `rows_melt` | Unpivot from wide to long format |
| `rows_pivot` | Pivot from long to wide format |

### String operations

| Method | Description |
|---|---|
| `str_lower` | Convert a string column to lowercase |
| `str_upper` | Convert a string column to uppercase |
| `str_strip` | Strip leading and trailing characters |
| `str_pad` | Pad a string column to a specified length |
| `str_slice` | Extract a substring from a string column |
| `str_truncate` | Truncate strings to a maximum length |
| `str_replace` | Replace occurrences of a pattern |
| `str_extract` | Extract a substring using a regex capture group |
| `str_split` | Split a string column by a separator |
| `str_concat` | Concatenate multiple string columns |

### Datetime operations

| Method | Description |
|---|---|
| `dt_year` | Extract the year from a datetime column |
| `dt_month` | Extract the month from a datetime column |
| `dt_day` | Extract the day from a datetime column |
| `dt_week` | Extract the ISO week number |
| `dt_quarter` | Extract the quarter (1-4) |
| `dt_year_month` | Create a year-month string |
| `dt_quarter_year` | Create a quarter-year string (e.g., 'Q1-2024') |
| `dt_calendar_week` | Create a year-week string (e.g., '2024-W05') |
| `dt_format` | Format a datetime column as a string |
| `dt_parse` | Parse a string column into a datetime |
| `dt_diff_days` | Calculate the difference in days between two dates |
| `dt_age_years` | Calculate age in years from a birth date |
| `dt_truncate` | Truncate a datetime to a specified precision |
| `dt_is_between` | Check whether a date falls within a range |

### Map operations

| Method | Description |
|---|---|
| `map_values` | Map values in a column using a dictionary |
| `map_case` | Apply case-when logic to a column |
| `map_from_column` | Map values using another column as a lookup |
| `map_discretize` | Discretize a numeric column into bins |
| `map_bool_to_int` | Convert boolean to integer (True=1, False=0) |
| `map_null_to_value` | Replace null values with a specific value |
| `map_value_to_null` | Replace a specific value with null |

### Functions

| Function | Description |
|---|---|
| `frame_hash` | Compute a deterministic hash of a DataFrame |
| `validate_chunked_pipeline` | Validate pipeline compatibility with chunked processing |
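Deterministic hashing of the kind `frame_hash` provides can be sketched as hashing a canonical serialization of the rows. Assuming JSON with sorted keys over a list-of-dicts "frame" (the library's exact scheme over real DataFrames may differ):

```python
import hashlib
import json

def frame_hash(rows):
    """Deterministic 16-character hash of a list-of-dicts frame."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

a = [{"id": 1, "x": 2.5}, {"id": 2, "x": 3.0}]
b = [{"x": 2.5, "id": 1}, {"x": 3.0, "id": 2}]  # same data, different key order

assert frame_hash(a) == frame_hash(b)                   # key order is canonicalized away
assert frame_hash(a) != frame_hash(list(reversed(a)))   # row order still matters
```

Canonicalizing before hashing is what makes the digest reproducible across runs and machines, which is the property an audit trail depends on.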
-
-
# Column Operations

Operations for adding, dropping, renaming, and transforming columns.

Column operations modify the structure of a DataFrame by adding, removing, or transforming columns. All operations return the TransformPlan instance for method chaining.

```python
import polars as pl

from transformplan import TransformPlan

plan = (
    TransformPlan()
    .col_rename("old_name", "new_name")
    .col_drop("temp_column")
    .col_cast("price", pl.Float64)
    .col_add("status", value="active")
)
```
- ColumnOps
-
-
-¶Mixin providing column-level operations.
- - - - - - - - - - - - col_drop
-
-
-¶ col_rename
-
-
-¶ col_cast
-
-
-¶ col_reorder
-
-
-¶Reorder columns. Unlisted columns are dropped.
- - - - col_select
-
-
-¶Keep only the specified columns (order preserved).
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- columns
- |
-
- Sequence[str]
- |
-
-
-
- Columns to keep. - |
- - required - | -
transformplan/ops/column.py col_duplicate
-
-
-¶Duplicate a column under a new name.
- - - - col_fill_null
-
-
-¶Fill null values in a column.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Column to fill. - |
- - required - | -
- value
- |
-
- Any
- |
-
-
-
- Value to fill nulls with (if strategy is None). - |
-
- None
- |
-
- strategy
- |
-
- str | None
- |
-
-
-
- Fill strategy - 'forward', 'backward', 'mean', 'min', 'max', 'zero', 'one'. - |
-
- None
- |
-
transformplan/ops/column.py col_drop_null
-
-
-¶Drop rows with null values in specified columns.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- columns
- |
-
- str | Sequence[str] | None
- |
-
-
-
- Column(s) to check for nulls. If None, checks all columns. - |
-
- None
- |
-
transformplan/ops/column.py col_drop_zero
-
-
-¶ col_add
-
-
-¶Add a new column with a constant value or expression.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- new_column
- |
-
- str
- |
-
-
-
- Name of the new column. - |
- - required - | -
- expr
- |
-
- str | int | float | None
- |
-
-
-
- Column name to copy from, or None for constant value. - |
-
- None
- |
-
- value
- |
-
- Any
- |
-
-
-
- Constant value to fill the column with. - |
-
- None
- |
-
transformplan/ops/column.py col_add_uuid
-
-
-¶Add a column with unique random identifiers.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Name of the new column. - |
- - required - | -
- length
- |
-
- int
- |
-
-
-
- Length of the identifier string. - |
-
- 16
- |
-
transformplan/ops/column.py col_hash
-
-
-¶Hash one or more columns into a new column.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- columns
- |
-
- str | Sequence[str]
- |
-
-
-
- Column(s) to hash. - |
- - required - | -
- new_column
- |
-
- str
- |
-
-
-
- Name for the hash column. - |
- - required - | -
- salt
- |
-
- str
- |
-
-
-
- Optional salt to add to the hash. - |
-
- ''
- |
-
transformplan/ops/column.py col_coalesce
-
-
-¶Take the first non-null value across multiple columns.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- columns
- |
-
- Sequence[str]
- |
-
-
-
- Columns to coalesce (in priority order). - |
- - required - | -
- new_column
- |
-
- str
- |
-
-
-
- Name for the result column. - |
- - required - | -
transformplan/ops/column.py# Drop a column
-plan = TransformPlan().col_drop("temp")
-
-# Rename a column
-plan = TransformPlan().col_rename("old", "new")
-
-# Cast to a different type
-plan = TransformPlan().col_cast("price", pl.Float64)
-# Keep only specific columns (in order)
-plan = TransformPlan().col_select(["id", "name", "value"])
-
-# Reorder columns (drops unlisted columns)
-plan = TransformPlan().col_reorder(["value", "name", "id"])
-# Add column with constant value
-plan = TransformPlan().col_add("status", value="pending")
-
-# Copy from existing column
-plan = TransformPlan().col_add("price_backup", expr="price")
-
-# Add unique identifiers
-plan = TransformPlan().col_add_uuid("row_id", length=16)
-# Fill nulls with a value
-plan = TransformPlan().col_fill_null("score", value=0)
-
-# Fill with strategy
-plan = TransformPlan().col_fill_null("value", strategy="forward")
-
-# Drop rows with nulls
-plan = TransformPlan().col_drop_null(columns=["required_field"])
-# Create hash from multiple columns
-plan = TransformPlan().col_hash(
- columns=["first_name", "last_name", "email"],
- new_column="user_hash",
- salt="my_salt"
-)
-
-# Take first non-null from multiple columns
-plan = TransformPlan().col_coalesce(
- columns=["primary_email", "secondary_email", "backup_email"],
- new_column="contact_email"
-)
-
-# Duplicate a column
-plan = TransformPlan().col_duplicate("original", "copy")
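The exact digest scheme behind `col_hash` is not documented here; a common approach, shown below as an assumed pure-Python sketch (not transformplan's actual hashing), is to join the column values with the salt before digesting:

```python
import hashlib

def salted_hash(values: tuple[str, ...], salt: str = "") -> str:
    """Illustrative only: join the values with the salt, then digest."""
    payload = "|".join(values) + salt
    return hashlib.sha256(payload.encode()).hexdigest()

# Rows with identical inputs collide (by design, enabling joins on the hash);
# changing the salt shifts every hash, which supports per-project pseudonyms.
h1 = salted_hash(("Ada", "Lovelace", "ada@example.com"), salt="my_salt")
h2 = salted_hash(("Ada", "Lovelace", "ada@example.com"), salt="my_salt")
h3 = salted_hash(("Ada", "Lovelace", "ada@example.com"), salt="other_salt")
```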
-
-
-
-
- Date and time extraction and manipulation operations.
-Datetime operations allow you to extract components from date/datetime columns, parse date strings, and perform date arithmetic.
-from transformplan import TransformPlan
-
-plan = (
- TransformPlan()
- .dt_parse("date_string", fmt="%Y-%m-%d")
- .dt_year("order_date", new_column="order_year")
- .dt_diff_days("end_date", "start_date", new_column="duration")
-)
- DatetimeOps
-
-
-¶Mixin providing datetime operations on columns.
- - - - - - - - - - - - dt_year
-
-
-¶Extract year from a datetime column.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Source datetime column. - |
- - required - | -
- new_column
- |
-
- str | None
- |
-
-
-
- Name for result column (None = modify in place). - |
-
- None
- |
-
transformplan/ops/datetime.py dt_month
-
-
-¶Extract month from a datetime column.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Source datetime column. - |
- - required - | -
- new_column
- |
-
- str | None
- |
-
-
-
- Name for result column (None = modify in place). - |
-
- None
- |
-
transformplan/ops/datetime.py dt_day
-
-
-¶Extract day from a datetime column.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Source datetime column. - |
- - required - | -
- new_column
- |
-
- str | None
- |
-
-
-
- Name for result column (None = modify in place). - |
-
- None
- |
-
transformplan/ops/datetime.py dt_week
-
-
-¶Extract ISO week number from a datetime column.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Source datetime column. - |
- - required - | -
- new_column
- |
-
- str | None
- |
-
-
-
- Name for result column (None = modify in place). - |
-
- None
- |
-
transformplan/ops/datetime.py dt_quarter
-
-
-¶Extract quarter (1-4) from a datetime column.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Source datetime column. - |
- - required - | -
- new_column
- |
-
- str | None
- |
-
-
-
- Name for result column (None = modify in place). - |
-
- None
- |
-
transformplan/ops/datetime.py dt_year_month
-
-
-¶Create a year-month string from a datetime column.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Source datetime column. - |
- - required - | -
- new_column
- |
-
- str
- |
-
-
-
- Name for result column. - |
- - required - | -
- fmt
- |
-
- str
- |
-
-
-
- Output format string. - |
-
- '%Y-%m'
- |
-
transformplan/ops/datetime.py dt_quarter_year
-
-
-¶Create a quarter-year string (e.g., 'Q1-2024') from a datetime column.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Source datetime column. - |
- - required - | -
- new_column
- |
-
- str
- |
-
-
-
- Name for result column. - |
- - required - | -
transformplan/ops/datetime.py dt_calendar_week
-
-
-¶Create a year-week string (e.g., '2024-W05') from a datetime column.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Source datetime column. - |
- - required - | -
- new_column
- |
-
- str
- |
-
-
-
- Name for result column. - |
- - required - | -
transformplan/ops/datetime.py dt_parse
-
-
-¶Parse a string column into a datetime.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Source string column. - |
- - required - | -
- fmt
- |
-
- str
- |
-
-
-
- Date format string. - |
-
- '%Y-%m-%d'
- |
-
- new_column
- |
-
- str | None
- |
-
-
-
- Name for result column (None = modify in place). - |
-
- None
- |
-
transformplan/ops/datetime.py dt_format
-
-
-¶Format a datetime column as a string.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Source datetime column. - |
- - required - | -
- fmt
- |
-
- str
- |
-
-
-
- Output format string. - |
- - required - | -
- new_column
- |
-
- str | None
- |
-
-
-
- Name for result column (None = modify in place). - |
-
- None
- |
-
transformplan/ops/datetime.py dt_diff_days
-
-
-¶Calculate difference in days between two date columns (a - b).
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column_a
- |
-
- str
- |
-
-
-
- First date column. - |
- - required - | -
- column_b
- |
-
- str
- |
-
-
-
- Second date column. - |
- - required - | -
- new_column
- |
-
- str
- |
-
-
-
- Name for result column. - |
- - required - | -
transformplan/ops/datetime.py dt_age_years
-
-
-¶dt_age_years(
- birth_column: str,
- reference_column: str | None = None,
- new_column: str = "age",
-) -> Self
-Calculate age in years from a birth date.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- birth_column
- |
-
- str
- |
-
-
-
- Column containing birth dates. - |
- - required - | -
- reference_column
- |
-
- str | None
- |
-
-
-
- Column containing reference dates (None = today). - |
-
- None
- |
-
- new_column
- |
-
- str
- |
-
-
-
- Name for result column. - |
-
- 'age'
- |
-
transformplan/ops/datetime.py dt_truncate
-
-
-¶Truncate datetime to a specified precision.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Source datetime column. - |
- - required - | -
- every
- |
-
- str
- |
-
-
-
- Truncation interval ('1d', '1mo', '1y', '1h', etc.). - |
- - required - | -
- new_column
- |
-
- str | None
- |
-
-
-
- Name for result column (None = modify in place). - |
-
- None
- |
-
transformplan/ops/datetime.py dt_is_between
-
-
-¶dt_is_between(
- column: str, start: str, end: str, new_column: str, closed: str = "both"
-) -> Self
-Check if date falls within a range.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Source datetime column. - |
- - required - | -
- start
- |
-
- str
- |
-
-
-
- Start date (string, will be parsed). - |
- - required - | -
- end
- |
-
- str
- |
-
-
-
- End date (string, will be parsed). - |
- - required - | -
- new_column
- |
-
- str
- |
-
-
-
- Name for boolean result column. - |
- - required - | -
- closed
- |
-
- str
- |
-
-
-
- Which endpoints to include ('both', 'left', 'right', 'none'). - |
-
- 'both'
- |
-
transformplan/ops/datetime.py# Extract year
-plan = TransformPlan().dt_year("date", new_column="year")
-
-# Extract month
-plan = TransformPlan().dt_month("date", new_column="month")
-
-# Extract day
-plan = TransformPlan().dt_day("date", new_column="day")
-
-# Extract week number
-plan = TransformPlan().dt_week("date", new_column="week")
-
-# Extract quarter
-plan = TransformPlan().dt_quarter("date", new_column="quarter")
-# Year-month string (e.g., "2024-01")
-plan = TransformPlan().dt_year_month("date", new_column="year_month")
-
-# Quarter-year string (e.g., "Q1-2024")
-plan = TransformPlan().dt_quarter_year("date", new_column="quarter_year")
-
-# Calendar week string (e.g., "2024-W05")
-plan = TransformPlan().dt_calendar_week("date", new_column="calendar_week")
-# Parse string to date
-plan = TransformPlan().dt_parse(
- column="date_string",
- fmt="%Y-%m-%d",
- new_column="date"
-)
-
-# Format date to string
-plan = TransformPlan().dt_format(
- column="date",
- fmt="%B %d, %Y",
- new_column="formatted_date"
-)
-# Calculate difference in days
-plan = TransformPlan().dt_diff_days(
- column_a="end_date",
- column_b="start_date",
- new_column="duration_days"
-)
-
-# Calculate age in years
-plan = TransformPlan().dt_age_years(
- birth_column="birth_date",
- new_column="age"
-)
-
-# Age relative to reference column
-plan = TransformPlan().dt_age_years(
- birth_column="birth_date",
- reference_column="event_date",
- new_column="age_at_event"
-)
-# Truncate to month start
-plan = TransformPlan().dt_truncate("timestamp", every="1mo")
-
-# Truncate to day
-plan = TransformPlan().dt_truncate("timestamp", every="1d")
-
-# Truncate to year
-plan = TransformPlan().dt_truncate("timestamp", every="1y")
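`dt_is_between` has no example above; its `closed` parameter controls which endpoints count as inside the range. A plain-Python illustration of the documented semantics (a sketch, not the library's code):

```python
from datetime import date

def is_between(d: date, start: date, end: date, closed: str = "both") -> bool:
    """Check a date against a range; `closed` picks the included endpoints."""
    if closed == "both":
        return start <= d <= end
    if closed == "left":
        return start <= d < end
    if closed == "right":
        return start < d <= end
    if closed == "none":
        return start < d < end
    raise ValueError(f"unknown closed mode: {closed!r}")

jan1 = date(2024, 1, 1)
dec31 = date(2024, 12, 31)
```

With `closed="both"` (the default), both `jan1` and `dec31` themselves fall inside the range.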
-
-
-
-
- Value mapping, discretization, and transformation operations.
-Map operations transform column values using dictionaries, bins, or other columns. They're useful for categorization, value replacement, and data normalization.
-from transformplan import TransformPlan
-
-plan = (
- TransformPlan()
- .map_values("status", {"A": "Active", "I": "Inactive"})
-    .map_discretize("age", bins=[35, 55], labels=["Young", "Adult", "Senior"])
-)
- MapOps
-
-
-¶Mixin providing value mapping and transformation operations.
- - - - - - - - - - - - map_values
-
-
-¶map_values(
- column: str,
- mapping: dict[Any, Any],
- default: Any = None,
- keep_unmapped: bool = True,
-) -> Self
-Map values in a column using a dictionary.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Column to transform. - |
- - required - | -
- mapping
- |
-
- dict[Any, Any]
- |
-
-
-
- Dictionary mapping old values to new values. - |
- - required - | -
- default
- |
-
- Any
- |
-
-
-
- Default value for unmapped values (if keep_unmapped=False). - |
-
- None
- |
-
- keep_unmapped
- |
-
- bool
- |
-
-
-
- If True, keep original value when not in mapping. - |
-
- True
- |
-
transformplan/ops/map.py map_discretize
-
-
-¶map_discretize(
- column: str,
- bins: Sequence[float],
- labels: Sequence[str] | None = None,
- new_column: str | None = None,
- right: bool = True,
-) -> Self
-Discretize a numeric column into bins/categories.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Column to discretize. - |
- - required - | -
- bins
- |
-
- Sequence[float]
- |
-
-
-
- Bin edges (e.g., [0, 18, 65, 100] creates 5 bins: one below 0, three between the edges, one above 100). - |
- - required - | -
- labels
- |
-
- Sequence[str] | None
- |
-
-
-
- Labels for each bin (must be len(bins)+1 if provided). - |
-
- None
- |
-
- new_column
- |
-
- str | None
- |
-
-
-
- Name for result column (None = modify in place). - |
-
- None
- |
-
- right
- |
-
- bool
- |
-
-
-
- If True, bins are (left, right]. If False, [left, right). - |
-
- True
- |
-
transformplan/ops/map.py map_case
-
-
-¶map_case(
- column: str,
- cases: list[tuple[Any, Any]],
- default: Any = None,
- new_column: str | None = None,
-) -> Self
-Apply case-when logic to a column.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Column to evaluate. - |
- - required - | -
- cases
- |
-
- list[tuple[Any, Any]]
- |
-
-
-
- List of (condition_value, result_value) tuples. - |
- - required - | -
- default
- |
-
- Any
- |
-
-
-
- Default value if no case matches. - |
-
- None
- |
-
- new_column
- |
-
- str | None
- |
-
-
-
- Name for result column (None = modify in place). - |
-
- None
- |
-
.map_case('score', [(90, 'A'), (80, 'B'), (70, 'C')], default='F') -Maps: >= 90 -> A, >= 80 -> B, >= 70 -> C, else F
-transformplan/ops/map.py map_bool_to_int
-
-
-¶Convert a boolean column to integer (True=1, False=0).
- - - - map_null_to_value
-
-
-¶Replace null values with a specific value.
- - - - map_value_to_null
-
-
-¶Replace a specific value with null.
- - - - map_from_column
-
-
-¶map_from_column(
- column: str,
- lookup_column: str,
- value_column: str,
- new_column: str | None = None,
- default: Any = None,
-) -> Self
-Map values using another column as lookup (like vlookup).
-This maps values from column using lookup_column -> value_column mapping
-from the same DataFrame. Useful for denormalization.
Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Column containing keys to look up. - |
- - required - | -
- lookup_column
- |
-
- str
- |
-
-
-
- Column containing lookup keys. - |
- - required - | -
- value_column
- |
-
- str
- |
-
-
-
- Column containing values to map to. - |
- - required - | -
- new_column
- |
-
- str | None
- |
-
-
-
- Name for result column (None = modify in place). - |
-
- None
- |
-
- default
- |
-
- Any
- |
-
-
-
- Default value if lookup fails. - |
-
- None
- |
-
transformplan/ops/map.py# Map values using a dictionary
-plan = TransformPlan().map_values(
- column="country_code",
- mapping={"US": "United States", "CA": "Canada", "MX": "Mexico"}
-)
-
-# With default for unmapped values
-plan = TransformPlan().map_values(
- column="status",
- mapping={"A": "Active", "I": "Inactive"},
- default="Unknown",
- keep_unmapped=False
-)
-# Discretize numeric values into categories
-plan = TransformPlan().map_discretize(
- column="age",
-    bins=[18, 35, 55],
- labels=["Child", "Young Adult", "Adult", "Senior"],
- new_column="age_group"
-)
-
-# Auto-generated labels
-plan = TransformPlan().map_discretize(
- column="score",
- bins=[0, 50, 75, 100],
- new_column="score_band"
-)
-# Apply case-when transformations
-plan = TransformPlan().map_case(
- column="score",
- cases=[
- (90, "A"),
- (80, "B"),
- (70, "C"),
- (60, "D"),
- ],
- default="F",
- new_column="grade"
-)
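Per the docstring's grading example, the case list is evaluated top-down with `>=` comparisons and the first matching threshold wins. A plain-Python illustration of that assumed semantics (not the library's implementation):

```python
def case_when(value: float, cases: list[tuple[float, str]], default: str) -> str:
    """Illustrative only: first threshold the value meets (in order) wins."""
    for threshold, result in cases:
        if value >= threshold:
            return result
    return default

# 95 -> "A" (>= 90), 72 -> "C" (>= 70 but < 80), 58 -> default "F"
grades = [case_when(s, [(90, "A"), (80, "B"), (70, "C"), (60, "D")], "F")
          for s in (95, 72, 58)]
```

Because evaluation stops at the first match, the cases must be listed from the highest threshold down.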
-# Replace null with a value
-plan = TransformPlan().map_null_to_value("status", "Unknown")
-
-# Replace a value with null
-plan = TransformPlan().map_value_to_null("status", "N/A")
-# Convert boolean to integer
-plan = TransformPlan().map_bool_to_int("is_active")
-# True -> 1, False -> 0
-# Map using values from other columns (vlookup-style)
-plan = TransformPlan().map_from_column(
- column="category_id",
- lookup_column="category_id",
- value_column="category_name",
- new_column="category_label",
- default="Unknown"
-)
-# Income brackets
-plan = TransformPlan().map_discretize(
- column="income",
-    bins=[30000, 60000, 100000, 200000],
- labels=["Low", "Lower-Middle", "Middle", "Upper-Middle", "High"],
- new_column="income_bracket"
-)
-# Standardize department codes
-plan = TransformPlan().map_values(
- column="dept",
- mapping={
- "ENG": "Engineering",
- "MKT": "Marketing",
- "SAL": "Sales",
- "HR": "Human Resources"
- }
-)
-
-
-
-
- Arithmetic and numeric operations on DataFrame columns.
-Math operations perform arithmetic on numeric columns. They support both scalar operations (column with a constant) and column-wise operations (column with column).
-from transformplan import TransformPlan
-
-plan = (
- TransformPlan()
- .math_multiply("price", 1.1) # 10% increase
- .math_round("price", decimals=2)
- .math_add_columns("subtotal", "tax", "total")
-)
- MathOps
-
-
-¶Mixin providing mathematical operations on columns.
- - - - - - - - - - - - math_add
-
-
-¶ math_subtract
-
-
-¶ math_multiply
-
-
-¶ math_divide
-
-
-¶ math_clamp
-
-
-¶Clamp column values to a range.
- - -transformplan/ops/math.py math_set_min
-
-
-¶Set a minimum value for a column (values below are raised to min).
- - -transformplan/ops/math.py math_set_max
-
-
-¶Set a maximum value for a column (values above are lowered to max).
- - -transformplan/ops/math.py math_abs
-
-
-¶ math_round
-
-
-¶Round a column to specified decimal places.
- - - - math_add_columns
-
-
-¶Add two columns together into a new column.
- - -transformplan/ops/math.py math_subtract_columns
-
-
-¶Subtract column_b from column_a into a new column.
- - -transformplan/ops/math.py math_multiply_columns
-
-
-¶Multiply two columns together into a new column.
- - -transformplan/ops/math.py math_divide_columns
-
-
-¶Divide column_a by column_b into a new column.
- - -transformplan/ops/math.py math_percent_of
-
-
-¶math_percent_of(
- column: str, total_column: str, new_column: str, multiply_by: float = 100.0
-) -> Self
-Calculate percentage of one column relative to another.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Numerator column. - |
- - required - | -
- total_column
- |
-
- str
- |
-
-
-
- Denominator column. - |
- - required - | -
- new_column
- |
-
- str
- |
-
-
-
- Name for result column. - |
- - required - | -
- multiply_by
- |
-
- float
- |
-
-
-
- Multiplier (default 100 for percentage). - |
-
- 100.0
- |
-
transformplan/ops/math.py math_cumsum
-
-
-¶math_cumsum(
- column: str,
- new_column: str | None = None,
- group_by: str | list[str] | None = None,
-) -> Self
-Calculate cumulative sum.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Column to sum. - |
- - required - | -
- new_column
- |
-
- str | None
- |
-
-
-
- Name for result column (None = modify in place). - |
-
- None
- |
-
- group_by
- |
-
- str | list[str] | None
- |
-
-
-
- Optional column(s) to group by. - |
-
- None
- |
-
transformplan/ops/math.py math_rank
-
-
-¶math_rank(
- column: str,
- new_column: str,
- method: str = "ordinal",
- descending: bool = False,
- group_by: str | list[str] | None = None,
-) -> Self
-Calculate rank of values.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Column to rank. - |
- - required - | -
- new_column
- |
-
- str
- |
-
-
-
- Name for result column. - |
- - required - | -
- method
- |
-
- str
- |
-
-
-
- Ranking method ('ordinal', 'dense', 'min', 'max', 'average'). - |
-
- 'ordinal'
- |
-
- descending
- |
-
- bool
- |
-
-
-
- Rank in descending order. - |
-
- False
- |
-
- group_by
- |
-
- str | list[str] | None
- |
-
-
-
- Optional column(s) to group by. - |
-
- None
- |
-
transformplan/ops/math.py# Add to every value
-plan = TransformPlan().math_add("price", 10)
-
-# Subtract from every value
-plan = TransformPlan().math_subtract("score", 5)
-
-# Multiply every value
-plan = TransformPlan().math_multiply("quantity", 1.5)
-
-# Divide every value
-plan = TransformPlan().math_divide("total", 100)
-# Add two columns into a new column
-plan = TransformPlan().math_add_columns("base", "bonus", "total")
-
-# Subtract columns
-plan = TransformPlan().math_subtract_columns("revenue", "cost", "profit")
-
-# Multiply columns
-plan = TransformPlan().math_multiply_columns("price", "quantity", "total")
-
-# Divide columns
-plan = TransformPlan().math_divide_columns("score", "max_score", "percentage")
-# Clamp to range
-plan = TransformPlan().math_clamp("score", lower=0, upper=100)
-
-# Set minimum value
-plan = TransformPlan().math_set_min("quantity", min_value=0)
-
-# Set maximum value
-plan = TransformPlan().math_set_max("discount", max_value=50)
-# Absolute value
-plan = TransformPlan().math_abs("difference")
-
-# Round to decimal places
-plan = TransformPlan().math_round("price", decimals=2)
-# Calculate percentage
-plan = TransformPlan().math_percent_of(
- column="part",
- total_column="whole",
- new_column="percentage",
- multiply_by=100 # default
-)
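`math_cumsum` and `math_rank` are not covered in the examples above. Their per-group accumulation and 'dense' ranking behavior, as described in the parameter tables, can be sketched in pure Python (illustrative, not the library's code):

```python
from collections import defaultdict

def grouped_cumsum(groups: list[str], values: list[float]) -> list[float]:
    """Running total that restarts per group key, preserving row order."""
    totals: dict[str, float] = defaultdict(float)
    out = []
    for g, v in zip(groups, values):
        totals[g] += v
        out.append(totals[g])
    return out

def dense_rank(values: list[float], descending: bool = False) -> list[int]:
    """'dense' ranking: equal values share a rank, with no gaps after ties."""
    order = sorted(set(values), reverse=descending)
    rank_of = {v: i + 1 for i, v in enumerate(order)}
    return [rank_of[v] for v in values]

cs = grouped_cumsum(["a", "a", "b", "a"], [1.0, 2.0, 5.0, 3.0])
rk = dense_rank([10, 30, 30, 20], descending=True)
```

Here the sum over group "a" runs 1.0, 3.0, 6.0 while "b" starts fresh at 5.0, and the two tied 30s share rank 1.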
-
-
-
-
- Operations for filtering, sorting, and transforming rows.
-Row operations modify which rows are included in the DataFrame and how they are ordered. Use the Col class to build filter expressions.
from transformplan import TransformPlan, Col
-
-plan = (
- TransformPlan()
- .rows_filter(Col("status") == "active")
- .rows_sort("created_at", descending=True)
- .rows_unique(columns=["email"])
-)
- RowOps
-
-
-¶Mixin providing row-level operations.
- - - - - - - - - - - - rows_filter
-
-
-¶rows_filter(filter: Filter | dict) -> Self
-Filter rows using a serializable Filter expression.
- - -from transformplan.filters import Col
-.rows_filter(Col("age") > 18) -.rows_filter((Col("status") == "active") & (Col("score") >= 50))
-transformplan/ops/rows.py rows_drop
-
-
-¶rows_drop(filter: Filter | dict) -> Self
-Drop rows matching a filter (inverse of rows_filter).
- - -.rows_drop(Col("status") == "deleted")
-transformplan/ops/rows.py rows_drop_nulls
-
-
-¶Drop rows with null values in specified columns (or any column if None).
- - -transformplan/ops/rows.py rows_unique
-
-
-¶rows_unique(
- columns: str | Sequence[str] | None = None,
- keep: Literal["first", "last", "any", "none"] = "first",
-) -> Self
-Keep unique rows based on specified columns.
- - -transformplan/ops/rows.py rows_deduplicate
-
-
-¶rows_deduplicate(
- columns: str | Sequence[str],
- sort_by: str,
- keep: Literal["first", "last"] = "first",
- descending: bool = False,
-) -> Self
-Deduplicate rows by keeping first/last based on sort order.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- columns
- |
-
- str | Sequence[str]
- |
-
-
-
- Columns that define duplicates. - |
- - required - | -
- sort_by
- |
-
- str
- |
-
-
-
- Column to sort by before deduplication. - |
- - required - | -
- keep
- |
-
- Literal['first', 'last']
- |
-
-
-
- Keep 'first' or 'last' after sorting. - |
-
- 'first'
- |
-
- descending
- |
-
- bool
- |
-
-
-
- Sort in descending order. - |
-
- False
- |
-
transformplan/ops/rows.py rows_sort
-
-
-¶Sort rows by one or more columns.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- by
- |
-
- str | Sequence[str]
- |
-
-
-
- Column(s) to sort by. - |
- - required - | -
- descending
- |
-
- bool | Sequence[bool]
- |
-
-
-
- Sort direction (single bool or list matching columns). - |
-
- False
- |
-
transformplan/ops/rows.py rows_flag
-
-
-¶rows_flag(
- filter: Filter | dict,
- new_column: str,
- true_value: Any = True,
- false_value: Any = False,
-) -> Self
-Add a flag column based on a filter condition (without dropping rows).
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- filter
- |
-
- Filter | dict
- |
-
-
-
- Filter condition. - |
- - required - | -
- new_column
- |
-
- str
- |
-
-
-
- Name for the flag column. - |
- - required - | -
- true_value
- |
-
- Any
- |
-
-
-
- Value when condition is True. - |
-
- True
- |
-
- false_value
- |
-
- Any
- |
-
-
-
- Value when condition is False. - |
-
- False
- |
-
transformplan/ops/rows.py rows_head
-
-
-¶ rows_tail
-
-
-¶ rows_sample
-
-
-¶rows_sample(
- n: int | None = None, fraction: float | None = None, seed: int | None = None
-) -> Self
-Sample rows from the DataFrame.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- n
- |
-
- int | None
- |
-
-
-
- Number of rows to sample. - |
-
- None
- |
-
- fraction
- |
-
- float | None
- |
-
-
-
- Fraction of rows to sample (0.0 to 1.0). - |
-
- None
- |
-
- seed
- |
-
- int | None
- |
-
-
-
- Random seed for reproducibility. - |
-
- None
- |
-
transformplan/ops/rows.py rows_explode
-
-
-¶ rows_melt
-
-
-¶rows_melt(
- id_columns: Sequence[str],
- value_columns: Sequence[str],
- variable_name: str = "variable",
- value_name: str = "value",
-) -> Self
-Unpivot a DataFrame from wide to long format.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- id_columns
- |
-
- Sequence[str]
- |
-
-
-
- Columns to keep as identifiers. - |
- - required - | -
- value_columns
- |
-
- Sequence[str]
- |
-
-
-
- Columns to unpivot. - |
- - required - | -
- variable_name
- |
-
- str
- |
-
-
-
- Name for the variable column. - |
-
- 'variable'
- |
-
- value_name
- |
-
- str
- |
-
-
-
- Name for the value column. - |
-
- 'value'
- |
-
transformplan/ops/rows.py rows_pivot
-
-
-¶rows_pivot(
- index: str | Sequence[str],
- columns: str,
- values: str,
- aggregate_function: str = "first",
-) -> Self
-Pivot from long to wide format.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- index
- |
-
- str | Sequence[str]
- |
-
-
-
- Column(s) to use as row identifiers. - |
- - required - | -
- columns
- |
-
- str
- |
-
-
-
- Column whose unique values become new columns. - |
- - required - | -
- values
- |
-
- str
- |
-
-
-
- Column containing values to fill. - |
- - required - | -
- aggregate_function
- |
-
- str
- |
-
-
-
- How to aggregate ('first', 'sum', 'mean', 'count', etc.). - |
-
- 'first'
- |
-
transformplan/ops/rows.pyfrom transformplan import Col
-
-# Keep rows matching condition
-plan = TransformPlan().rows_filter(Col("age") >= 18)
-
-# Drop rows matching condition
-plan = TransformPlan().rows_drop(Col("status") == "deleted")
-
-# Complex filters
-plan = TransformPlan().rows_filter(
- (Col("score") >= 50) & (Col("active") == True)
-)
-Add a boolean column based on a condition without removing rows:
-plan = TransformPlan().rows_flag(
- filter=Col("score") >= 90,
- new_column="is_excellent",
- true_value=True,
- false_value=False
-)
-# Sort by single column
-plan = TransformPlan().rows_sort("name")
-
-# Sort descending
-plan = TransformPlan().rows_sort("score", descending=True)
-
-# Sort by multiple columns
-plan = TransformPlan().rows_sort(
- by=["category", "price"],
- descending=[False, True]
-)
-# Keep first occurrence of each unique value
-plan = TransformPlan().rows_unique(columns=["email"])
-
-# Keep last occurrence
-plan = TransformPlan().rows_unique(columns=["user_id"], keep="last")
-
-# Deduplicate with specific sort order
-plan = TransformPlan().rows_deduplicate(
- columns=["user_id"],
- sort_by="updated_at",
- keep="last",
- descending=True
-)
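The deduplication above sorts first and then keeps one row per key; a plain-Python sketch of that documented behavior (illustrative only, not transformplan's code):

```python
def deduplicate(rows: list[dict], key: str, sort_by: str, keep: str = "first",
                descending: bool = False) -> list[dict]:
    """Sort, then keep the first or last row seen for each key value."""
    ordered = sorted(rows, key=lambda r: r[sort_by], reverse=descending)
    seen: dict = {}
    for row in ordered:
        k = row[key]
        if keep == "first":
            seen.setdefault(k, row)
        else:  # "last"
            seen[k] = row
    return list(seen.values())

rows = [
    {"user_id": 1, "updated_at": "2024-01-01", "plan": "free"},
    {"user_id": 1, "updated_at": "2024-03-01", "plan": "pro"},
    {"user_id": 2, "updated_at": "2024-02-01", "plan": "free"},
]
# Descending sort + keep="first" retains each user's most recent row.
latest = deduplicate(rows, key="user_id", sort_by="updated_at",
                     keep="first", descending=True)
```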
-# Drop rows with nulls in any column
-plan = TransformPlan().rows_drop_nulls()
-
-# Drop rows with nulls in specific columns
-plan = TransformPlan().rows_drop_nulls(columns=["required_field"])
-# Keep first n rows
-plan = TransformPlan().rows_head(10)
-
-# Keep last n rows
-plan = TransformPlan().rows_tail(10)
-
-# Random sample
-plan = TransformPlan().rows_sample(n=100, seed=42)
-plan = TransformPlan().rows_sample(fraction=0.1, seed=42)
-# Explode list column into multiple rows
-plan = TransformPlan().rows_explode("tags")
-
-# Unpivot from wide to long format
-plan = TransformPlan().rows_melt(
- id_columns=["id", "name"],
- value_columns=["q1", "q2", "q3", "q4"],
- variable_name="quarter",
- value_name="sales"
-)
-
-# Pivot from long to wide format
-plan = TransformPlan().rows_pivot(
- index=["id"],
- columns="quarter",
- values="sales",
- aggregate_function="sum"
-)
-
-
-
-
- Text manipulation operations on string columns.
-String operations allow you to transform text data in DataFrame columns. Operations include case conversion, trimming, splitting, concatenation, and pattern matching.
-from transformplan import TransformPlan
-
-plan = (
- TransformPlan()
- .str_lower("email")
- .str_strip("name")
- .str_replace("phone", "-", "")
-)
- StrOps
-
-
-¶Mixin providing string operations on columns.
- - - - - - - - - - - - str_replace
-
-
-¶Replace occurrences of a pattern in a string column.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Column to modify. - |
- - required - | -
- pattern
- |
-
- str
- |
-
-
-
- Pattern to search for. - |
- - required - | -
- replacement
- |
-
- str
- |
-
-
-
- String to replace with. - |
- - required - | -
- literal
- |
-
- bool
- |
-
-
-
- If True, treat pattern as literal string. If False, treat as regex. - |
-
- True
- |
-
transformplan/ops/string.py str_slice
-
-
-¶Extract a substring from a string column.
- - -Parameters:
-| Name | -Type | -Description | -Default | -
|---|---|---|---|
- column
- |
-
- str
- |
-
-
-
- Column to modify. - |
- - required - | -
- offset
- |
-
- int
- |
-
-
-
- Start position (0-indexed, negative counts from end). - |
- - required - | -
- length
- |
-
- int | None
- |
-
-
-
- Number of characters to extract (None = to end). - |
-
- None
- |
-
transformplan/ops/string.py str_truncate
-
-
-¶Truncate strings to a maximum length with optional suffix.
-Parameters:
-
-| Name | Type | Description | Default |
-|---|---|---|---|
-| column | str | Column to modify. | required |
-| max_length | int | Maximum length of the string (including suffix). | required |
-| suffix | str | Suffix to append to truncated strings. | '...' |
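Note that max_length includes the suffix, so the result never exceeds the limit. A per-value sketch of that rule (hypothetical helper, not the library's code):

```python
def truncate(value: str, max_length: int, suffix: str = "...") -> str:
    """Truncate so the result, suffix included, never exceeds max_length."""
    if len(value) <= max_length:
        return value  # short values pass through unchanged
    return value[: max_length - len(suffix)] + suffix

print(truncate("A very long product description", 14))  # A very long...
```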
-Source: transformplan/ops/string.py
-
- str_lower ¶
-
-Convert a string column to lowercase.
-
- str_upper ¶
-
-Convert a string column to uppercase.
-
- str_strip ¶
-
-Strip leading and trailing characters from a string column.
-Parameters:
-
-| Name | Type | Description | Default |
-|---|---|---|---|
-| column | str | Column to modify. | required |
-| chars | str \| None | Characters to strip (None = whitespace). | None |
-Source: transformplan/ops/string.py
-
- str_pad ¶
-
-Pad a string column to a specified length.
-Parameters:
-
-| Name | Type | Description | Default |
-|---|---|---|---|
-| column | str | Column to modify. | required |
-| length | int | Target length. | required |
-| fill_char | str | Character to pad with. | ' ' |
-| side | str | 'left' or 'right'. | 'left' |
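Per-value, this corresponds to Python's `str.rjust`/`str.ljust`: padding on the left right-justifies the value. A sketch (hypothetical helper, not the library's code):

```python
def pad(value: str, length: int, fill_char: str = " ", side: str = "left") -> str:
    """Pad one value: side='left' prepends fill, side='right' appends it."""
    if side == "left":
        return value.rjust(length, fill_char)
    return value.ljust(length, fill_char)

# Zero-pad an id to a fixed width, matching the str_pad example below
print(pad("42", 10, fill_char="0", side="left"))  # 0000000042
```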
-Source: transformplan/ops/string.py
-
- str_split ¶
-
-str_split(
-    column: str,
-    separator: str,
-    new_columns: list[str] | None = None,
-    keep_original: bool = False,
-) -> Self
-
-Split a string column by a separator.
-Parameters:
-
-| Name | Type | Description | Default |
-|---|---|---|---|
-| column | str | Column to split. | required |
-| separator | str | String to split on. | required |
-| new_columns | list[str] \| None | Names for the resulting columns. If None, explodes into rows. | None |
-| keep_original | bool | Whether to keep the original column. | False |
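The split-into-columns mode can be sketched on a single row represented as a dict (hypothetical helper, not the library's code). Capping `maxsplit` at `len(new_columns) - 1` keeps any extra separators inside the last field:

```python
def split_to_columns(row: dict, column: str, separator: str,
                     new_columns: list[str], keep_original: bool = False) -> dict:
    """Split one field into several named fields on a dict-shaped row."""
    parts = row[column].split(separator, len(new_columns) - 1)
    out = dict(row) if keep_original else {k: v for k, v in row.items() if k != column}
    out.update(zip(new_columns, parts))
    return out

row = {"id": 1, "full_name": "Ada Lovelace"}
print(split_to_columns(row, "full_name", " ", ["first_name", "last_name"]))
# {'id': 1, 'first_name': 'Ada', 'last_name': 'Lovelace'}
```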
-Source: transformplan/ops/string.py
-
- str_concat ¶
-
-Concatenate multiple string columns into one.
-Parameters:
-
-| Name | Type | Description | Default |
-|---|---|---|---|
-| columns | list[str] | Columns to concatenate. | required |
-| new_column | str | Name for the new column. | required |
-| separator | str | Separator between values. | '' |
-Source: transformplan/ops/string.py
-
- str_extract ¶
-
-str_extract(
-    column: str,
-    pattern: str,
-    group_index: int = 1,
-    new_column: str | None = None,
-) -> Self
-
-Extract a substring using a regex capture group.
-Parameters:
-
-| Name | Type | Description | Default |
-|---|---|---|---|
-| column | str | Column to extract from. | required |
-| pattern | str | Regex pattern with capture group(s). | required |
-| group_index | int | Which capture group to extract (1-indexed). | 1 |
-| new_column | str \| None | Name for the result column (None = modify in place). | None |
-Source: transformplan/ops/string.py
-
-# Convert to lowercase
-plan = TransformPlan().str_lower("email")
-
-# Convert to uppercase
-plan = TransformPlan().str_upper("code")
-# Strip whitespace
-plan = TransformPlan().str_strip("name")
-
-# Strip specific characters
-plan = TransformPlan().str_strip("code", chars="-_")
-
-# Pad to fixed length
-plan = TransformPlan().str_pad("id", length=10, fill_char="0", side="left")
-# Replace literal string
-plan = TransformPlan().str_replace("phone", "-", "")
-
-# Replace with regex
-plan = TransformPlan().str_replace(
- column="text",
- pattern=r"\s+",
- replacement=" ",
- literal=False
-)
-# Extract substring by position
-plan = TransformPlan().str_slice("code", offset=0, length=3)
-
-# Truncate with suffix
-plan = TransformPlan().str_truncate("description", max_length=100, suffix="...")
-# Split into rows (explode)
-plan = TransformPlan().str_split("tags", separator=",")
-
-# Split into columns
-plan = TransformPlan().str_split(
- column="full_name",
- separator=" ",
- new_columns=["first_name", "last_name"],
- keep_original=False
-)
-# Concatenate columns
-plan = TransformPlan().str_concat(
- columns=["first_name", "last_name"],
- new_column="full_name",
- separator=" "
-)
-
-
-
-
- The main class for building and executing transformation pipelines.
-TransformPlan uses a deferred execution model: operations are registered via method chaining, then executed together when you call process(), validate(), or dry_run().
from transformplan import TransformPlan, Col
-
-plan = (
- TransformPlan()
- .col_drop("temp_column")
- .math_multiply("price", 1.1)
- .rows_filter(Col("active") == True)
-)
-
-# Execute
-df_result, protocol = plan.process(df)
- TransformPlan ¶
-
- Bases: TransformPlanBase, ColumnOps, DatetimeOps, MapOps, MathOps, RowOps, StrOps
-
-Data processor with tracked transformations.
-
-result, protocol = (
-    TransformPlan()
-    .col_drop("temp")
-    .math_multiply("price", 1.1)
-    .rows_filter(Col("active") == True)
-    .process(df)
-)
-Source: transformplan/core.py
-
- process ¶
-
-Execute all registered operations and return the transformed data with an audit protocol.
-
- validate ¶
-
-Validate operations against the DataFrame schema without executing.
-
- dry_run ¶
-
-Preview what the pipeline will do without executing it.
- -For large Parquet files that exceed available RAM, use chunked processing methods.
-Process a large Parquet file in chunks, optionally keeping related rows together.
-result, protocol = plan.process_chunked(
- source="large_file.parquet",
- partition_key="patient_id", # Keep patient rows together
- chunk_size=100_000,
-)
-protocol.print()
-See Chunked Processing for details on operation compatibility.
-Validate that a pipeline is compatible with chunked processing before executing.
-validation = plan.validate_chunked(
- schema={"id": pl.Int64, "name": pl.Utf8},
- partition_key="id"
-)
-if not validation.is_valid:
- print(validation.errors)
-Pipelines can be saved and loaded as JSON:
-# Save
-plan.to_json("pipeline.json")
-
-# Load
-loaded = TransformPlan.from_json("pipeline.json")
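Since a plan is conceptually an ordered list of operations with their parameters, JSON round-tripping amounts to serializing that list. A stdlib sketch of the idea (the record shape shown is an assumption, not the library's actual on-disk format):

```python
import json

# A plan as an ordered list of (operation, params) records -- illustrative shape
steps = [
    {"op": "col_drop", "params": {"column": "temp"}},
    {"op": "math_multiply", "params": {"column": "price", "value": 1.1}},
]

text = json.dumps(steps, indent=2)   # save
loaded = json.loads(text)            # load
assert loaded == steps               # lossless round trip
```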
-Pipelines can also be exported as executable Python code.
-
-
-
- The Protocol class captures transformation history for auditability and reproducibility.
-When you process data with a TransformPlan, you receive both the transformed DataFrame and a Protocol object. The protocol contains:
-from transformplan import TransformPlan
-
-plan = TransformPlan().col_drop("temp").math_multiply("price", 1.1)
-df_result, protocol = plan.process(df)
-
-# View the protocol
-protocol.print()
-
-# Save for audit
-protocol.to_json("audit_trail.json")
- Protocol ¶
-
-Captures the transformation history for auditability.
-
-Source: transformplan/protocol.py
-
- input_hash (property) ¶
-
-Hash of the input DataFrame.
-
- output_hash (property) ¶
-
-Hash of the final output DataFrame.
-
- metadata (property) ¶
-
-Protocol metadata.
-
- set_input ¶
-
-Set the hash and shape of the input DataFrame.
-
- set_metadata ¶
-
-Set arbitrary metadata on the protocol.
-
-protocol.set_metadata(author="alice", project="analysis-v2")
- add_step ¶
-
-add_step(
-    operation: str,
-    params: dict[str, Any],
-    old_shape: tuple[int, int],
-    new_shape: tuple[int, int],
-    elapsed: float,
-    output_hash: str,
-) -> None
-
-Record a single transformation step in the protocol.
-
-Source: transformplan/protocol.py
-
- to_dataframe ¶
-Convert the protocol steps to a DataFrame.
-
-Source: transformplan/protocol.py
-
- to_csv ¶
-
-Write the protocol to a CSV file.
-Params are serialized as JSON strings to avoid nested data issues.
-Parameters:
-
-| Name | Type | Description | Default |
-|---|---|---|---|
-| path | str \| Path | File path to write to. | required |
-Source: transformplan/protocol.py
-
- to_dict ¶
-
-Serialize the protocol to a dictionary.
-Source: transformplan/protocol.py
-
- from_dict (classmethod) ¶
-
-from_dict(data: dict[str, Any]) -> Protocol
-Deserialize protocol from a dictionary.
-Source: transformplan/protocol.py
-
- to_json ¶
-
-Serialize the protocol to JSON.
-Parameters:
-
-| Name | Type | Description | Default |
-|---|---|---|---|
-| path | str \| Path \| None | Optional file path to write to. | None |
-| indent | int | JSON indentation level. | 2 |
-
-Returns:
-
-| Type | Description |
-|---|---|
-| str | JSON string. |
-Source: transformplan/protocol.py
-
- from_json (classmethod) ¶
-
-from_json(source: str | Path) -> Protocol
-
-Deserialize a protocol from JSON.
-Parameters:
-
-| Name | Type | Description | Default |
-|---|---|---|---|
-| source | str \| Path | Either a JSON string or a path to a JSON file. | required |
-
-Returns:
-
-| Type | Description |
-|---|---|
-| Protocol | Protocol instance. |
-Source: transformplan/protocol.py
-
- summary ¶
-
-Generate a clean, human-readable summary of the protocol.
-Parameters:
-
-| Name | Type | Description | Default |
-|---|---|---|---|
-| show_params | bool | Whether to include operation parameters. | True |
-
-Returns:
-
-| Type | Description |
-|---|---|
-| str | Formatted string summary. |
-Source: transformplan/protocol.py
- print ¶
-
-Print the protocol summary to stdout.
-Parameters:
-
-| Name | Type | Description | Default |
-|---|---|---|---|
-| show_params | bool | Whether to include operation parameters. | True |
- frame_hash ¶
-
-Compute a deterministic hash of a DataFrame.
-
-The hash is:
-
-- Row-order invariant (sorted row hashes)
-- Column-order invariant (columns sorted before hashing)
-- Content-sensitive (any value change = different hash)
-Parameters:
-
-| Name | Type | Description | Default |
-|---|---|---|---|
-| df | DataFrame | The DataFrame to hash. | required |
-
-Returns:
-
-| Type | Description |
-|---|---|
-| str | A 16-character hex string. |
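The three invariance properties can be sketched in pure Python over a column-dict table (illustrative `table_hash` helper using stdlib `hashlib`; the library's frame_hash operates on real DataFrames and may differ in detail):

```python
import hashlib

def table_hash(data: dict[str, list]) -> str:
    """Order-invariant hash sketch: sort columns, hash rows, sort row hashes."""
    cols = sorted(data)                              # column-order invariant
    rows = zip(*(data[c] for c in cols))
    row_hashes = sorted(                             # row-order invariant
        hashlib.sha256(repr((cols, r)).encode()).hexdigest() for r in rows
    )
    return hashlib.sha256("".join(row_hashes).encode()).hexdigest()[:16]

a = {"x": [1, 2], "y": ["a", "b"]}
b = {"y": ["b", "a"], "x": [2, 1]}   # same content, rows and columns reordered
assert table_hash(a) == table_hash(b)
```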
-Source: transformplan/protocol.py
-
-The print() method generates a formatted summary:
======================================================================
-TRANSFORM PROTOCOL
-======================================================================
-Input: 1000 rows x 5 cols [a1b2c3d4e5f6g7h8]
-Output: 850 rows x 4 cols [h8g7f6e5d4c3b2a1]
-Total time: 0.0234s
-----------------------------------------------------------------------
-
-# Operation Rows Cols Time Hash
-----------------------------------------------------------------------
-0 input 1000 5 - a1b2c3d4e5f6g7h8
-1 col_drop 1000 4 (-1) 0.0012s b2c3d4e5f6g7h8a1
- -> column='temp'
-2 math_multiply 1000 4 0.0008s c3d4e5f6g7h8a1b2
- -> column='price', value=1.1
-3 rows_filter 850 (-150) 4 0.0214s h8g7f6e5d4c3b2a1
- -> filter=(age >= 18)
-======================================================================
-The frame_hash function computes a deterministic hash that is:
-This enables verification that the same pipeline on the same input produces identical results.
-
-
-
- Schema validation and dry-run preview for TransformPlan pipelines.
-TransformPlan validates operations against DataFrame schemas before execution. This catches errors like:
-from transformplan import TransformPlan
-
-plan = TransformPlan().col_drop("nonexistent")
-result = plan.validate(df)
-
-if not result.is_valid:
- for error in result.errors:
- print(error)
- # Step 1 (col_drop): Column 'nonexistent' does not exist
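The core idea, walking the planned operations over a schema dict before touching any data, can be sketched in plain Python (hypothetical `validate_plan` helper with a toy operation list, not the library's validator):

```python
def validate_plan(schema: dict[str, str], ops: list[tuple[str, str]]) -> list[str]:
    """Walk operations over a schema copy, collecting errors instead of raising."""
    cols = dict(schema)
    errors = []
    for step, (op, column) in enumerate(ops, start=1):
        if column not in cols:
            errors.append(f"Step {step} ({op}): Column '{column}' does not exist")
        elif op == "col_drop":
            del cols[column]  # later steps see the updated schema
    return errors

ops = [("col_drop", "age"), ("rows_filter", "age")]  # filtering a dropped column
print(validate_plan({"name": "str", "age": "int"}, ops))
# ["Step 2 (rows_filter): Column 'age' does not exist"]
```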
- ValidationResult ¶
-
-Result of schema validation.
-
-Source: transformplan/validation.py
-
- ValidationError (dataclass) ¶
-
-A single validation error.
-
- SchemaValidationError ¶
-
- Bases: Exception
-
-Raised when schema validation fails.
-
- DryRunResult ¶
-
-DryRunResult(
-    input_schema: dict[str, DataType],
-    steps: list[DryRunStep],
-    validation: ValidationResult,
-)
-
-Result of a dry run showing what a pipeline will do.
-
-Source: transformplan/validation.py
-
- is_valid (property) ¶
-
-Whether the pipeline passed validation.
-
- errors (property) ¶
-
-errors: list[ValidationError]
-
-Validation errors.
-
- steps (property) ¶
-
-steps: list[DryRunStep]
-
-List of dry run steps.
-
- input_schema (property) ¶
-
-Input schema.
-
- output_schema (property) ¶
-
-Predicted output schema after all operations.
-
- input_columns (property) ¶
-
-Input column names.
-
- output_columns (property) ¶
-
-Predicted output column names.
-
- summary ¶
-
-Generate a human-readable summary.
-Parameters:
-
-| Name | Type | Description | Default |
-|---|---|---|---|
-| show_params | bool | Whether to show operation parameters. | True |
-| show_schema | bool | Whether to show full schema at each step. | False |
-
-Returns:
-
-| Type | Description |
-|---|---|
-| str | Formatted string. |
-Source: transformplan/validation.py
- print ¶
-
-Print the summary to stdout.
-
- DryRunStep (dataclass) ¶
-
-DryRunStep(
-    step: int,
-    operation: str,
-    params: dict[str, Any],
-    schema_before: dict[str, str],
-    schema_after: dict[str, str],
-    columns_added: list[str],
-    columns_removed: list[str],
-    columns_modified: list[str],
-    error: str | None = None,
-)
-
-A single step in a dry run.
-import polars as pl
-
-from transformplan import TransformPlan, Col
-
-df = pl.DataFrame({
- "name": ["Alice", "Bob"],
- "age": [25, 30],
- "salary": [50000, 60000]
-})
-
-plan = (
- TransformPlan()
- .col_drop("age")
- .rows_filter(Col("age") > 18) # Error: age was dropped!
-)
-
-result = plan.validate(df)
-print(result)
-# ValidationResult(valid=False, errors=1)
-
-for error in result.errors:
- print(error)
-# Step 2 (rows_filter): Column 'age' does not exist
-plan = (
- TransformPlan()
- .col_drop("temp")
- .col_add("bonus", value=1000)
- .math_multiply("salary", 1.1)
-)
-
-preview = plan.dry_run(df)
-preview.print()
-Output:
-======================================================================
-DRY RUN PREVIEW
-======================================================================
-Validation: PASSED
-----------------------------------------------------------------------
-Input: 3 columns
-----------------------------------------------------------------------
-
-# Operation Columns Changes
-----------------------------------------------------------------------
-1 col_drop 2 -['temp']
- -> column='temp'
-2 col_add 3 +['bonus']
- -> new_column='bonus', value=1000
-3 math_multiply 3 ~['salary']
- -> column='salary', value=1.1
-======================================================================
-Output: 3 columns
-Validation includes type checking for operations that require specific types:
-| Operation Type | Required Column Type |
-|---|---|
-| math_* | Numeric (Int, Float) |
-| str_* | String (Utf8) |
-| dt_* | Datetime (Date, Datetime, Time) |