Numeric & Borrower Preprocessor

## 🧑‍💻 Subset 1 Numeric & Borrower Preprocessor

**Build a functional + OOP numeric/borrower cleaning & transformation block (subset 1)**

You will work with these **feature columns**:


```
SUBSET1_COLS = [
    "id",
    "member_id",
    # numeric – amounts / risk basics
    "loan_amnt",
    "funded_amnt",
    "funded_amnt_inv",
    "int_rate",
    "installment",
    "annual_inc",
    "dti",
    "delinq_2yrs",
    "inq_last_6mths",
    "mths_since_last_delinq",
    "mths_since_last_record",
    "open_acc",
    "pub_rec",
    "revol_bal",
    "revol_util",
    "total_acc",
    "out_prncp",
    # categorical / string
    "term",
    "grade",
    "sub_grade",
    "emp_title",
    "emp_length",
    "home_ownership",
    "verification_status",
    "issue_d",
]
```


> Treat `loan_status` as the **target (y)**, not as an input feature here.

---

### 🎯 Goal

Create a **numeric + basic borrower transformer** that:

- Cleans numeric values (invalid, missing, outliers) for subset 1 numeric columns  
- Converts some string fields (`term`, `emp_length`, `grade`) into useful numeric form  
- Is implemented as a **sklearn-compatible transformer** (OOP)  
- Uses **pure functions** internally for the cleaning logic (FP)

---

### ✅ Tasks

- [ ] In `src/cleaning_numeric.py`, implement pure numeric cleaning functions:


```
def clean_invalid_numeric(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    # Replace impossible numeric values (e.g. annual_inc <= 0, dti < 0 or > 200) with NaN.
    ...

def handle_missing_numeric(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    # Fill NaNs using median per column.
    ...

def clip_numeric_outliers(
    df: pd.DataFrame,
    cols: list[str],
    upper_quantile: float = 0.99,
) -> pd.DataFrame:
    # Clip numeric features at the given quantile.
    ...
```
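One possible way to fill in these stubs is sketched below. The `INVALID_RULES` mapping is a hypothetical helper, and its thresholds are taken only from the examples in the stub comment (`annual_inc <= 0`, `dti < 0 or > 200`); other columns would need their own rules. Each function copies its input so that it stays pure.

```python
import numpy as np
import pandas as pd

# Hypothetical validity rules (assumption): thresholds come from the stub comment.
INVALID_RULES = {
    "annual_inc": lambda s: s <= 0,
    "dti": lambda s: (s < 0) | (s > 200),
}


def clean_invalid_numeric(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    out = df.copy()  # never mutate the caller's frame
    for col in cols:
        out[col] = pd.to_numeric(out[col], errors="coerce")
        rule = INVALID_RULES.get(col)
        if rule is not None:
            # mask() replaces rows where the rule is True with NaN
            out[col] = out[col].mask(rule(out[col]))
    return out


def handle_missing_numeric(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    out = df.copy()
    for col in cols:
        out[col] = out[col].fillna(out[col].median())
    return out


def clip_numeric_outliers(
    df: pd.DataFrame, cols: list[str], upper_quantile: float = 0.99
) -> pd.DataFrame:
    out = df.copy()
    for col in cols:
        out[col] = out[col].clip(upper=out[col].quantile(upper_quantile))
    return out


# Small demonstration on made-up values.
df = pd.DataFrame(
    {"annual_inc": [50000.0, -1.0, 70000.0], "dti": [10.0, 250.0, np.nan]}
)
cols = ["annual_inc", "dti"]
cleaned = clean_invalid_numeric(df, cols)
filled = handle_missing_numeric(cleaned, cols)
clipped = clip_numeric_outliers(filled, cols, upper_quantile=0.5)
```

Note that these standalone versions compute the median and quantile from whatever frame they receive. Inside the transformer, those statistics should come from the training data only.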


- [ ] In `src/cleaning_borrower.py`, implement pure “string → numeric” functions:


```
def normalize_term(df: pd.DataFrame) -> pd.DataFrame:
    # Convert 'term' like '36 months' -> numeric 36.
    ...

def normalize_emp_length(df: pd.DataFrame) -> pd.DataFrame:
    # Map '10+ years', '< 1 year', etc. to numeric years (float).
    ...

def encode_grade_ordinal(df: pd.DataFrame) -> pd.DataFrame:
    # Map grade A–G to 1–7; optionally create sub_grade_ord too.
    ...
```
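A minimal sketch of the three conversions follows. Several choices here are assumptions rather than requirements of this issue: `'< 1 year'` is mapped to `0.0`, unparseable values such as `'n/a'` become NaN, and the ordinal encodings are written to new columns named `grade_ord` and `sub_grade_ord` (the latter computed as `(grade - 1) * 5 + digit`, assuming sub-grades run from 1 to 5).

```python
import pandas as pd

# A -> 1, ..., G -> 7
GRADE_ORDER = {g: i for i, g in enumerate("ABCDEFG", start=1)}


def normalize_term(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    digits = out["term"].astype(str).str.extract(r"(\d+)", expand=False)
    out["term"] = pd.to_numeric(digits, errors="coerce")
    return out


def normalize_emp_length(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    raw = out["emp_length"].astype(str).str.strip()
    years = pd.to_numeric(
        raw.str.extract(r"(\d+)", expand=False), errors="coerce"
    ).astype(float)
    # Assumption: anything starting with '<' (e.g. '< 1 year') counts as 0 years.
    out["emp_length"] = years.mask(raw.str.startswith("<"), 0.0)
    return out


def encode_grade_ordinal(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["grade_ord"] = out["grade"].map(GRADE_ORDER)
    if "sub_grade" in out.columns:
        sub = out["sub_grade"].astype(str)
        letter = sub.str[0].map(GRADE_ORDER)
        digit = pd.to_numeric(sub.str[1:], errors="coerce")
        out["sub_grade_ord"] = (letter - 1) * 5 + digit
    return out


# Small demonstration on made-up values.
raw_df = pd.DataFrame(
    {
        "term": [" 36 months", " 60 months", " 36 months"],
        "emp_length": ["10+ years", "< 1 year", "n/a"],
        "grade": ["A", "C", "G"],
        "sub_grade": ["A1", "C4", "G5"],
    }
)
borrower = encode_grade_ordinal(normalize_emp_length(normalize_term(raw_df)))
```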


- [ ] Implement a functional composition helper:


```
def apply_numeric_steps(
    df: pd.DataFrame,
    cols: list[str],
    steps: list,
) -> pd.DataFrame:
    # Apply a list of functions(df, cols) -> df in sequence.
    for step in steps:
        df = step(df, cols)
    return df
```
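The helper expects every step to have the signature `(df, cols) -> df`, so a step with extra parameters needs them bound first, for example with `functools.partial`. The sketch below repeats the helper so it runs on its own and uses three toy steps (`negatives_to_nan`, `fill_median`, `clip_upper`) that are hypothetical stand-ins for the real cleaning functions.

```python
from functools import partial

import pandas as pd


def apply_numeric_steps(df, cols, steps):
    # Apply a list of functions(df, cols) -> df in sequence.
    for step in steps:
        df = step(df, cols)
    return df


# Toy steps (hypothetical stand-ins for the real cleaning functions).
def negatives_to_nan(df, cols):
    out = df.copy()
    out[cols] = out[cols].mask(out[cols] < 0)
    return out


def fill_median(df, cols):
    out = df.copy()
    for col in cols:
        out[col] = out[col].fillna(out[col].median())
    return out


def clip_upper(df, cols, upper_quantile=0.99):
    out = df.copy()
    for col in cols:
        out[col] = out[col].clip(upper=out[col].quantile(upper_quantile))
    return out


toy = pd.DataFrame({"dti": [5.0, -1.0, 15.0, 400.0]})
result = apply_numeric_steps(
    toy,
    ["dti"],
    [
        negatives_to_nan,
        fill_median,
        # Bind the extra argument so the step matches (df, cols) -> df.
        partial(clip_upper, upper_quantile=0.75),
    ],
)
```

Because every step returns a new frame, the order of the list is the only thing that determines the result, and `toy` itself is left unchanged.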


- [ ] In `src/transformers.py`, implement `Subset1NumericBorrowerTransformer`:

```
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted
import pandas as pd

# Adjust this import path to match how `src/` is packaged in your project.
from src.cleaning_numeric import clean_invalid_numeric


class Subset1NumericBorrowerTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, numeric_cols: list[str], upper_quantile: float = 0.99):
        # sklearn convention: __init__ only stores hyperparameters;
        # learned state (trailing underscore) is created in fit().
        self.numeric_cols = numeric_cols
        self.upper_quantile = upper_quantile

    def fit(self, X, y=None):
        df = pd.DataFrame(X, columns=self.numeric_cols).copy()
        # Blank out invalid values first so they do not skew the learned statistics.
        df = clean_invalid_numeric(df, self.numeric_cols)
        # Learn per-column medians and upper clipping bounds from the training data.
        self.medians_ = {col: df[col].median() for col in self.numeric_cols}
        self.quantiles_ = {
            col: df[col].quantile(self.upper_quantile) for col in self.numeric_cols
        }
        return self

    def transform(self, X):
        check_is_fitted(self, ["medians_", "quantiles_"])
        df = pd.DataFrame(X, columns=self.numeric_cols).copy()
        # 1) clean invalid values
        df = clean_invalid_numeric(df, self.numeric_cols)
        for col in self.numeric_cols:
            # 2) fill NaNs using stored medians, then 3) clip using stored quantiles
            df[col] = df[col].fillna(self.medians_[col]).clip(upper=self.quantiles_[col])
        return df.to_numpy()
```

- [ ] Add a small example (notebook/script):


```
subset1_numeric_cols = [
    "loan_amnt", "funded_amnt", "funded_amnt_inv", "int_rate", "installment",
    "annual_inc", "dti", "delinq_2yrs", "inq_last_6mths", "mths_since_last_delinq",
    "mths_since_last_record", "open_acc", "pub_rec", "revol_bal", "revol_util",
    "total_acc", "out_prncp",
]

transformer = Subset1NumericBorrowerTransformer(numeric_cols=subset1_numeric_cols)
transformer.fit(X_train[subset1_numeric_cols])
X_train_num = transformer.transform(X_train[subset1_numeric_cols])
```


- [ ] Add tests in `tests/test_subset1_transformer.py`:
  - After `transform`, no NaNs in subset-1 numeric columns  
  - No negative `annual_inc`, `dti` within a reasonable range  
  - If you expose them, `term`/`emp_length` are numeric
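
The first two checks could be sketched as below. To keep the snippet runnable on its own, it uses `StandInTransformer`, a hypothetical placeholder with the same `fit`/`transform` contract; in the real test file you would import `Subset1NumericBorrowerTransformer` from `src/transformers.py` instead. The fixture values are made up.

```python
import numpy as np
import pandas as pd

COLS = ["annual_inc", "dti"]


class StandInTransformer:
    """Hypothetical placeholder; replace with the real transformer import."""

    def __init__(self, numeric_cols):
        self.numeric_cols = numeric_cols

    def _clean(self, X):
        df = pd.DataFrame(X, columns=self.numeric_cols).astype(float)
        df["annual_inc"] = df["annual_inc"].mask(df["annual_inc"] <= 0)
        df["dti"] = df["dti"].mask((df["dti"] < 0) | (df["dti"] > 200))
        return df

    def fit(self, X, y=None):
        self.medians_ = self._clean(X).median()
        return self

    def transform(self, X):
        return self._clean(X).fillna(self.medians_).to_numpy()


def make_frame():
    # One invalid income, one invalid dti, and one NaN in each column.
    return pd.DataFrame(
        {
            "annual_inc": [50_000, -10, 70_000, np.nan],
            "dti": [12.0, 250.0, np.nan, 18.0],
        }
    )


def test_no_nans_after_transform():
    out = StandInTransformer(COLS).fit(make_frame()).transform(make_frame())
    assert not np.isnan(out).any()


def test_invalid_values_are_replaced():
    arr = StandInTransformer(COLS).fit(make_frame()).transform(make_frame())
    out = pd.DataFrame(arr, columns=COLS)
    assert (out["annual_inc"] > 0).all()
    assert out["dti"].between(0, 200).all()
```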

---

### ✅ Acceptance Criteria

- Subset-1 numeric & borrower-related features are cleaned and transformed  
- Implementation uses **pure functions + `Subset1NumericBorrowerTransformer` OOP**  
- Tests cover at least:
  - 1 invalid value scenario  
  - 1 NaN scenario  
  - 1 string→numeric conversion

