Hello CPI team,
This is such a fantastic package and I'm so happy that I found it (and the suite of papers your team has written about state-of-the-art random forests for inference).
This may be a question that points to a broader theoretical question, but the {cpi} package is how I came to it, so I'm starting here...
tl;dr
What are the implications of using log loss as the loss function when the tuned classification threshold is not at 0.5? I'm sure someone has thought/written about this, but I'm having trouble finding those papers!
Example
A model prediction of 0.6 for an observation in the positive class seems like it should carry different information about loss depending on whether the optimal classification threshold for the model is 0.5 or 0.2 (for instance). In both cases, a prediction of 0.6 would correctly classify the observation as belonging to the positive class. But if the tuned classification threshold is 0.5, a prediction of 0.6 is barely over that threshold, whereas if the tuned threshold is 0.2, the prediction clears it by a wide margin. Yet the log loss would be identical in each case. Naively, I would have expected a loss function to show more loss for a prediction of 0.6 when the classification threshold is 0.5, and less loss when the classification threshold is 0.2.
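A quick numeric illustration of what I mean (log loss for a positive-class observation depends only on the predicted probability, so the tuned threshold never enters the calculation):

```r
# Log loss for a single positive-class observation: -log(p_hat)
log_loss_pos <- function(p_hat) -log(p_hat)

# Same prediction, two hypothetical tuned thresholds (0.5 and 0.2)
p_hat <- 0.6
log_loss_pos(p_hat)  # 0.511 -- identical whichever threshold was tuned
```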
Possible solution?
Do we need to rescale the model's probability predictions prior to calculating log loss, to take into account a tuned classification threshold that isn't at 0.5? That is, all predictions below the classification threshold get rescaled into [0, 0.5] and all predictions above the classification threshold get rescaled into [0.5, 1]?
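Here's a minimal sketch of what I have in mind (the piecewise-linear map and the name `rescale_to_threshold` are just my illustration, not anything from {cpi}):

```r
# Piecewise-linear rescaling of predicted probabilities around a tuned
# threshold t: [0, t] maps to [0, 0.5] and [t, 1] maps to [0.5, 1],
# so the tuned threshold lands at 0.5 before log loss is computed.
rescale_to_threshold <- function(p_hat, threshold) {
  ifelse(
    p_hat < threshold,
    p_hat * 0.5 / threshold,
    0.5 + (p_hat - threshold) * 0.5 / (1 - threshold)
  )
}

# With a tuned threshold of 0.2, a prediction of 0.6 now looks more
# confident, so its log loss (for a positive observation) shrinks:
-log(rescale_to_threshold(0.6, 0.5))  # 0.511 (unchanged at t = 0.5)
-log(rescale_to_threshold(0.6, 0.2))  # 0.288 (0.6 rescales to 0.75)
```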
Eventual goal
I have a highly unbalanced binary classification problem with multicollinearity among the features, mostly continuous features (just one categorical feature with five levels), and a desire to better understand which features are important and how (i.e., the shape of their relationship to the target).
What I've tried
I've played around with modifying the {cpi} package to calculate loss at an aggregated scale (i.e., per test data set) rather than a per-observation scale, using measures more robust to class imbalance (the Matthews correlation coefficient). The actual implementation of that modification to {cpi} is here. In that case, I relied on the repeated spatial cross-validation for "significance" of the CPI for each feature, since the implemented statistical tests rely on having CPI on a per-observation scale (before taking the mean to report a per-feature CPI value). But this strikes me as perhaps being overly conservative, so I'm revisiting using the default CPI loss functions.
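For reference, the aggregated loss I used is just the Matthews correlation coefficient computed once per test fold, something like this (a hand-rolled sketch of the idea, not the actual code from my fork):

```r
# Matthews correlation coefficient from hard class predictions (0/1),
# computed once per test fold rather than per observation.
mcc_fold <- function(truth, predicted) {
  tp <- sum(truth == 1 & predicted == 1)
  tn <- sum(truth == 0 & predicted == 0)
  fp <- sum(truth == 0 & predicted == 1)
  fn <- sum(truth == 1 & predicted == 0)
  denom <- sqrt(tp + fp) * sqrt(tp + fn) * sqrt(tn + fp) * sqrt(tn + fn)
  if (denom == 0) return(0)  # conventional fallback when a margin is empty
  (tp * tn - fp * fn) / denom
}

# Per-fold CPI for a feature is then the drop in MCC after replacing that
# feature with its knockoff -- one value per fold, so there is no
# per-observation statistic, only the distribution across repeated CV folds:
# cpi_fold <- mcc_fold(truth, pred_full) - mcc_fold(truth, pred_knockoff)
```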