1 change: 1 addition & 0 deletions README.md
@@ -95,6 +95,7 @@ Total time: 0.0247s
| **str_** | String operations | `str_lower`, `str_upper`, `str_strip`, `str_replace`, `str_split` |
| **dt_** | Datetime operations | `dt_year`, `dt_month`, `dt_parse`, `dt_age_years`, `dt_diff_days` |
| **map_** | Value mapping | `map_values`, `map_discretize`, `map_case`, `map_from_column` |
| **enc_** | Categorical encoding | `enc_onehot`, `enc_ordinal`, `enc_label` |

## Installation

187 changes: 187 additions & 0 deletions docs/api/ops/encoding.md
@@ -0,0 +1,187 @@
# Encoding Operations

Categorical encoding operations for machine learning preparation.

## Overview

Encoding operations transform categorical columns into numeric representations suitable for machine learning models. They support one-hot encoding, ordinal encoding, and label encoding.

```python
from transformplan import TransformPlan

plan = (
    TransformPlan()
    .enc_onehot("color", categories=["red", "green", "blue"], drop="first")
    .enc_ordinal("size", categories=["small", "medium", "large"])
)
```

## Class Reference

::: transformplan.ops.encoding.EncodingOps
    options:
      show_root_heading: true
      members:
        - enc_onehot
        - enc_ordinal
        - enc_label

## Examples

### One-Hot Encoding

Creates binary indicator columns (0/1) for each category.

```python
# Basic one-hot encoding
plan = TransformPlan().enc_onehot(
    column="color",
    categories=["red", "green", "blue"],
)
# Creates columns: color_red, color_green, color_blue

# Drop first category to avoid multicollinearity (for regression models)
plan = TransformPlan().enc_onehot(
    column="color",
    categories=["red", "green", "blue"],
    drop="first",
)
# Creates columns: color_green, color_blue (drops color_red)

# Drop last category
plan = TransformPlan().enc_onehot(
    column="color",
    categories=["red", "green", "blue"],
    drop="last",
)
# Creates columns: color_red, color_green (drops color_blue)

# Drop specific category
plan = TransformPlan().enc_onehot(
    column="color",
    categories=["red", "green", "blue"],
    drop="green",
)
# Creates columns: color_red, color_blue (drops color_green)

# Custom prefix for new columns
plan = TransformPlan().enc_onehot(
    column="color",
    categories=["red", "green", "blue"],
    prefix="c",
)
# Creates columns: c_red, c_green, c_blue

# Keep original column
plan = TransformPlan().enc_onehot(
    column="color",
    categories=["red", "green", "blue"],
    drop_original=False,
)
# Keeps color column alongside color_red, color_green, color_blue
```
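
To make the column layouts above concrete, here is a minimal plain-Python sketch of the behavior described in the comments. It does not call the library; `onehot_sketch` is a hypothetical helper, and its handling of `drop` and of unknown values is an assumption based only on the comments in this page.

```python
def onehot_sketch(values, categories, drop=None, prefix="color"):
    """Hypothetical helper (not part of transformplan): build 0/1 indicator lists."""
    kept = list(categories)
    if drop == "first":
        kept = kept[1:]
    elif drop == "last":
        kept = kept[:-1]
    elif drop is not None:
        # Assumed: any other value names the category to drop.
        kept = [c for c in kept if c != drop]
    # One indicator list per kept category; values outside `kept` yield 0 everywhere.
    return {f"{prefix}_{c}": [1 if v == c else 0 for v in values] for c in kept}


encoded = onehot_sketch(
    ["red", "blue", "purple"],
    categories=["red", "green", "blue"],
    drop="first",
)
print(encoded)
# {'color_green': [0, 0, 0], 'color_blue': [0, 1, 0]}
```

In this sketch, a row holding the dropped category (`red`) and a row holding an unknown value (`purple`) both come out as all zeros, which is worth keeping in mind when combining `drop` with data that may contain unseen categories.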

### Ordinal Encoding

Maps categories to integers based on explicit ordering (first=0, second=1, etc.).

```python
# Ordinal encoding with meaningful order
plan = TransformPlan().enc_ordinal(
    column="size",
    categories=["small", "medium", "large"],
)
# Maps: small -> 0, medium -> 1, large -> 2

# Output to new column
plan = TransformPlan().enc_ordinal(
    column="size",
    categories=["small", "medium", "large"],
    new_column="size_encoded",
)

# Custom unknown value
plan = TransformPlan().enc_ordinal(
    column="size",
    categories=["small", "medium", "large"],
    unknown_value=-1,  # Default
)
# Values not in categories get -1
```
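
The mapping described above can be sketched in a few lines of plain Python. This is an illustration of the documented semantics only; `ordinal_sketch` is a hypothetical name and not the library's implementation.

```python
def ordinal_sketch(values, categories, unknown_value=-1):
    """Hypothetical helper: map each value to its position in `categories`."""
    index = {category: position for position, category in enumerate(categories)}
    # Values missing from `categories` fall back to `unknown_value`.
    return [index.get(value, unknown_value) for value in values]


codes = ordinal_sketch(
    ["medium", "small", "huge", "large"],
    categories=["small", "medium", "large"],
)
print(codes)
# [1, 0, -1, 2]
```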

### Label Encoding

Assigns each category an integer, using alphabetical order by default. It is similar to ordinal encoding, but the resulting integers carry no semantic ordering.

```python
# Label encoding (alphabetically sorted)
plan = TransformPlan().enc_label(column="department")
# Maps alphabetically: Engineering -> 0, HR -> 1, Sales -> 2

# With explicit categories
plan = TransformPlan().enc_label(
    column="department",
    categories=["HR", "Engineering", "Sales"],
)
# Maps: HR -> 0, Engineering -> 1, Sales -> 2
```
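
As a rough plain-Python sketch of the two mappings above: `label_sketch` is a hypothetical helper, and it assumes "alphabetically sorted" behaves like Python's `sorted()` on strings (which orders by code point, so uppercase letters sort before lowercase ones). The library's actual sort may differ.

```python
def label_sketch(values, categories=None, unknown_value=-1):
    """Hypothetical helper: integer codes from sorted unique values or explicit categories."""
    if categories is None:
        # Assumed default: sorted unique values from the data.
        categories = sorted(set(values))
    index = {category: code for code, category in enumerate(categories)}
    return [index.get(value, unknown_value) for value in values]


departments = ["Sales", "HR", "Engineering", "HR"]

print(label_sketch(departments))
# [2, 1, 0, 1]

print(label_sketch(departments, categories=["HR", "Engineering", "Sales"]))
# [2, 0, 1, 0]
```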

## Use Cases

### Preparing Data for Machine Learning

```python
# One-hot encode categorical features, dropping first to avoid multicollinearity
plan = (
    TransformPlan()
    .enc_onehot("color", categories=["red", "green", "blue"], drop="first")
    .enc_onehot("size", categories=["S", "M", "L", "XL"], drop="first")
    .enc_ordinal("quality", categories=["low", "medium", "high"])
)
```

### Handling Unknown Categories

```python
# Unknown values get all zeros (one-hot)
plan = TransformPlan().enc_onehot(
    column="color",
    categories=["red", "green", "blue"],
    unknown_value="all_zero",  # Default
)

# Unknown values get -1 (ordinal/label)
plan = TransformPlan().enc_ordinal(
    column="size",
    categories=["small", "medium", "large"],
    unknown_value=-1,
)
```

### Deriving Categories from Data

When categories are not specified, they are derived from the data (sorted alphabetically):

```python
# Categories derived from data
plan = TransformPlan().enc_onehot("color")
# Uses sorted unique values from the column

# Note: For reproducibility, explicitly specify categories
plan = TransformPlan().enc_onehot(
    column="color",
    categories=["blue", "green", "red"],  # Explicit is better
)
```
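
The reproducibility concern can be illustrated without the library. In the sketch below, `derived_columns` is a hypothetical helper that assumes derived categories are simply the sorted unique values of whatever data is passed in, so two datasets with different contents produce different column sets.

```python
def derived_columns(values, prefix):
    """Hypothetical helper: column names implied by sorted unique values."""
    return [f"{prefix}_{category}" for category in sorted(set(values))]


train_cols = derived_columns(["red", "green", "blue"], "color")
batch_cols = derived_columns(["green", "red"], "color")  # no "blue" in this batch

print(train_cols)
# ['color_blue', 'color_green', 'color_red']
print(batch_cols)
# ['color_green', 'color_red']

# With an explicit list, the layout no longer depends on the data seen.
fixed = ["blue", "green", "red"]
fixed_cols = [f"color_{category}" for category in fixed]
print(fixed_cols)
# ['color_blue', 'color_green', 'color_red']
```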

## Multicollinearity Note

When using one-hot encoding for linear models (regression, logistic regression), you should drop one category to avoid the [dummy variable trap](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)). Use the `drop` parameter:

```python
# For regression models, drop one category
plan = TransformPlan().enc_onehot("color", drop="first")

# Tree-based models (random forest, XGBoost) don't require this
plan = TransformPlan().enc_onehot("color") # Keep all
```
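
A small plain-Python check shows why the full set of indicator columns is redundant alongside an intercept. The data here is made up for illustration: with every category kept, each row's indicators sum to 1, which exactly reproduces a constant intercept column; dropping one category removes that exact linear dependence.

```python
colors = ["red", "green", "blue", "green"]
categories = ["red", "green", "blue"]

# Full one-hot matrix: one row per value, one 0/1 entry per category.
full = [[1 if value == category else 0 for category in categories] for value in colors]
row_sums = [sum(row) for row in full]
print(row_sums)
# [1, 1, 1, 1]

# Equivalent of drop="first": remove the first indicator from every row.
reduced = [row[1:] for row in full]
reduced_sums = [sum(row) for row in reduced]
print(reduced_sums)
# [0, 1, 1, 1]
```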
1 change: 1 addition & 0 deletions docs/index.md
@@ -94,6 +94,7 @@ Total time: 0.0247s
| **str_** | String operations | `str_lower`, `str_upper`, `str_strip`, `str_replace`, `str_split` |
| **dt_** | Datetime operations | `dt_year`, `dt_month`, `dt_parse`, `dt_age_years`, `dt_diff_days` |
| **map_** | Value mapping | `map_values`, `map_discretize`, `map_case`, `map_from_column` |
| **enc_** | Categorical encoding | `enc_onehot`, `enc_ordinal`, `enc_label` |


## Getting Started
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -91,3 +91,4 @@ nav:
- String Operations: api/ops/string.md
- Datetime Operations: api/ops/datetime.md
- Map Operations: api/ops/map.md
- Encoding Operations: api/ops/encoding.md