Numeric & Borrower Preprocessor #6

@RigelAlgebar

🧑‍💻 Subset 1 Numeric & Borrower Preprocessor

**Build a functional + OOP numeric/borrower cleaning & transformation block (subset 1)**

You will work with these feature columns:

SUBSET1_COLS = [
    "id",
    "member_id",
    # numeric – amounts / risk basics
    "loan_amnt",
    "funded_amnt",
    "funded_amnt_inv",
    "int_rate",
    "installment",
    "annual_inc",
    "dti",
    "delinq_2yrs",
    "inq_last_6mths",
    "mths_since_last_delinq",
    "mths_since_last_record",
    "open_acc",
    "pub_rec",
    "revol_bal",
    "revol_util",
    "total_acc",
    "out_prncp",
    # categorical / string
    "term",
    "grade",
    "sub_grade",
    "emp_title",
    "emp_length",
    "home_ownership",
    "verification_status",
    "issue_d",
]

Treat loan_status as the target (y), not as an input feature here.


🎯 Goal

Create a numeric + basic borrower transformer that:

  • Cleans numeric values (invalid, missing, outliers) for subset 1 numeric columns
  • Converts some string fields (term, emp_length, grade) into useful numeric form
  • Is implemented as a sklearn-compatible transformer (OOP)
  • Uses pure functions internally for the cleaning logic (FP)
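
As an illustration of the second goal, the string fields could be converted along these lines. This is only a sketch under assumptions: raw values are assumed to look like " 36 months", "10+ years", "< 1 year" and "n/a"; "< 1 year" is mapped to 0.0 (0.5 would be an equally defensible choice); and the output column names `grade_ord` and `sub_grade_ord`, as well as the 1 to 35 scale for sub-grades, are illustrative choices rather than requirements.

```python
import pandas as pd

# Ordinal positions for grades A..G (A -> 1, ..., G -> 7).
GRADE_ORDER = {g: i for i, g in enumerate("ABCDEFG", start=1)}


def _first_number(series: pd.Series) -> pd.Series:
    """Extract the first run of digits from each value as float (NaN if none)."""
    digits = series.astype(str).str.extract(r"(\d+)", expand=False)
    return pd.to_numeric(digits, errors="coerce").astype(float)


def normalize_term(df: pd.DataFrame) -> pd.DataFrame:
    """' 36 months' -> 36.0; values without digits become NaN."""
    out = df.copy()
    out["term"] = _first_number(out["term"])
    return out


def normalize_emp_length(df: pd.DataFrame) -> pd.DataFrame:
    """'10+ years' -> 10.0, '3 years' -> 3.0, '< 1 year' -> 0.0 (assumption)."""
    out = df.copy()
    text = out["emp_length"].astype(str).str.strip()
    years = _first_number(text)
    # Anything starting with '<' means "less than one year".
    out["emp_length"] = years.mask(text.str.startswith("<", na=False), 0.0)
    return out


def encode_grade_ordinal(df: pd.DataFrame) -> pd.DataFrame:
    """Add grade_ord (1-7) and, if sub_grade exists, sub_grade_ord (1-35)."""
    out = df.copy()
    out["grade_ord"] = out["grade"].map(GRADE_ORDER)
    if "sub_grade" in out.columns:
        sub = out["sub_grade"].astype(str)
        letter = sub.str[0].map(GRADE_ORDER)
        digit = pd.to_numeric(sub.str[1:], errors="coerce")
        out["sub_grade_ord"] = (letter - 1) * 5 + digit
    return out


demo = pd.DataFrame(
    {
        "term": [" 36 months", " 60 months", None],
        "emp_length": ["10+ years", "< 1 year", "n/a"],
        "grade": ["A", "G", "C"],
        "sub_grade": ["A1", "G5", "C3"],
    }
)
converted = (
    demo.pipe(normalize_term)
    .pipe(normalize_emp_length)
    .pipe(encode_grade_ordinal)
)
```

Each function copies its input and returns a new frame, so the steps chain with `DataFrame.pipe` and leave `demo` untouched.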

✅ Tasks

  • In src/cleaning_numeric.py, implement pure numeric cleaning functions:
def clean_invalid_numeric(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    # Replace impossible numeric values (e.g. annual_inc <= 0, dti < 0 or > 200) with NaN.
    ...

def handle_missing_numeric(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    # Fill NaNs using median per column.
    ...

def clip_numeric_outliers(
    df: pd.DataFrame,
    cols: list[str],
    upper_quantile: float = 0.99,
) -> pd.DataFrame:
    # Clip numeric features at the given quantile.
    ...
  • In src/cleaning_borrower.py, implement pure “string → numeric” functions:
def normalize_term(df: pd.DataFrame) -> pd.DataFrame:
    # Convert 'term' like '36 months' -> numeric 36.
    ...

def normalize_emp_length(df: pd.DataFrame) -> pd.DataFrame:
    # Map '10+ years', '< 1 year', etc. to numeric years (float).
    ...

def encode_grade_ordinal(df: pd.DataFrame) -> pd.DataFrame:
    # Map grade A–G to 1–7; optionally create sub_grade_ord too.
    ...
  • Implement a functional composition helper:
def apply_numeric_steps(
    df: pd.DataFrame,
    cols: list[str],
    steps: list,
) -> pd.DataFrame:
    # Apply a list of functions(df, cols) -> df in sequence.
    for step in steps:
        df = step(df, cols)
    return df
  • In src/transformers.py, implement Subset1NumericBorrowerTransformer:
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

# Adjust this import to the project's package layout.
from cleaning_numeric import clean_invalid_numeric

class Subset1NumericBorrowerTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, numeric_cols: list[str], upper_quantile: float = 0.99):
        # Only store hyperparameters here; learned state is set in fit().
        self.numeric_cols = numeric_cols
        self.upper_quantile = upper_quantile

    def fit(self, X, y=None):
        df = pd.DataFrame(X, columns=self.numeric_cols).copy()
        # Clean invalid values first so they do not skew the learned statistics.
        df = clean_invalid_numeric(df, self.numeric_cols)
        # Learn per-column medians and upper quantiles from the cleaned data.
        self.medians_ = {}
        self.quantiles_ = {}
        for col in self.numeric_cols:
            self.medians_[col] = df[col].median()
            self.quantiles_[col] = df[col].quantile(self.upper_quantile)
        return self

    def transform(self, X):
        df = pd.DataFrame(X, columns=self.numeric_cols).copy()
        # 1) clean invalid values
        df = clean_invalid_numeric(df, self.numeric_cols)
        # 2) fill NaNs using stored medians
        for col in self.numeric_cols:
            df[col] = df[col].fillna(self.medians_[col])
        # 3) clip using stored quantiles
        for col in self.numeric_cols:
            df[col] = df[col].clip(upper=self.quantiles_[col])
        return df.to_numpy()
  • Add a small example (notebook/script):
subset1_numeric_cols = [
    "loan_amnt", "funded_amnt", "funded_amnt_inv", "int_rate", "installment",
    "annual_inc", "dti", "delinq_2yrs", "inq_last_6mths", "mths_since_last_delinq",
    "mths_since_last_record", "open_acc", "pub_rec", "revol_bal", "revol_util",
    "total_acc", "out_prncp",
]

transformer = Subset1NumericBorrowerTransformer(numeric_cols=subset1_numeric_cols)
transformer.fit(X_train[subset1_numeric_cols])
X_train_num = transformer.transform(X_train[subset1_numeric_cols])
  • Add tests in tests/test_subset1_transformer.py:
    • After transform, no NaNs in subset-1 numeric columns
    • No annual_inc <= 0, and dti within [0, 200] (the bounds used by clean_invalid_numeric)
    • If you expose them, term/emp_length are numeric
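
The numeric stubs and the composition helper from the tasks above could be filled in roughly as follows. This is a minimal sketch, not a reference solution: only the `annual_inc` and `dti` rules come from the stub comments, no other validity rules are assumed, and the demo frame and the 0.75 quantile are made up for illustration.

```python
from functools import partial

import pandas as pd


def clean_invalid_numeric(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Replace impossible values with NaN (annual_inc <= 0, dti outside [0, 200])."""
    out = df.copy()
    if "annual_inc" in cols:
        out["annual_inc"] = out["annual_inc"].where(out["annual_inc"] > 0)
    if "dti" in cols:
        out["dti"] = out["dti"].where(out["dti"].between(0, 200))
    return out


def handle_missing_numeric(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Fill NaNs with the median of each column in this frame."""
    out = df.copy()
    out[cols] = out[cols].fillna(out[cols].median())
    return out


def clip_numeric_outliers(
    df: pd.DataFrame,
    cols: list[str],
    upper_quantile: float = 0.99,
) -> pd.DataFrame:
    """Cap each column at its own upper quantile."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].clip(upper=out[col].quantile(upper_quantile))
    return out


def apply_numeric_steps(df: pd.DataFrame, cols: list[str], steps: list) -> pd.DataFrame:
    """Apply a list of functions (df, cols) -> df in sequence."""
    for step in steps:
        df = step(df, cols)
    return df


raw = pd.DataFrame(
    {
        "annual_inc": [50000.0, -1.0, 60000.0, 1_000_000.0],
        "dti": [10.0, 250.0, float("nan"), 20.0],
    }
)
steps = [
    clean_invalid_numeric,
    handle_missing_numeric,
    # partial() binds the extra argument so the step keeps the (df, cols) shape.
    partial(clip_numeric_outliers, upper_quantile=0.75),
]
cleaned = apply_numeric_steps(raw, ["annual_inc", "dti"], steps)
```

These functions compute medians and quantiles from whatever frame they receive, which suits exploration. For train/test consistency, the statistics should be learned once in `fit` and reused in `transform`, as the transformer does.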

✅ Acceptance Criteria

  • Subset-1 numeric & borrower-related features are cleaned and transformed
  • Implementation combines pure functions with the OOP Subset1NumericBorrowerTransformer class
  • Tests cover at least:
    • 1 invalid value scenario
    • 1 NaN scenario
    • 1 string→numeric conversion
