Questions about evaluation protocol, prediction normalization, and data augmentation in the released training pipeline #42

@yujunzhao2003

Hi, thank you for releasing the SCSegamba codebase. I am trying to reproduce and understand the reported results, but I found several points in the released implementation that seem unclear or potentially inconsistent with standard binary segmentation evaluation. I would appreciate clarification.

1. Prediction output is saved after per-image max normalization instead of sigmoid

In the testing/evaluation pipeline, the model output appears to be saved as:

out = out[0, 0, ...].cpu().numpy()
out = 255 * (out / np.max(out))
cv2.imwrite(..., out)

However, during training the model uses BCEWithLogitsLoss, which suggests that the model output should be treated as logits. In a standard binary segmentation pipeline, the evaluation would usually be:

prob = sigmoid(logits)
pred = prob > threshold

or equivalently, for threshold 0.5:

pred = logits > 0

The current per-image normalization changes the meaning of the threshold. For example, if two pixels have logits 8 and 20, both correspond to sigmoid probabilities close to 1.0. But if 20 is the maximum logit in that image, out / np.max(out) maps the logit-8 pixel to 8 / 20 = 0.4, which falls below a 0.5 threshold. Thresholding the normalized map is therefore not equivalent to thresholding probabilities. The normalization is also ill-defined when the maximum logit of an image is zero or negative.
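
To make the discrepancy concrete, here is a minimal sketch with toy logits of my own (not values or code from the repository). It compares a sigmoid threshold of 0.5 against the same threshold applied to the max-normalized map; the `sigmoid` helper and the array values are assumptions for illustration only.

```python
import numpy as np

def sigmoid(x):
    """Plain logistic function, standing in for torch.sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

# Toy logits for a 2x2 "image" (hypothetical values).
logits = np.array([[-6.0, 8.0],
                   [20.0, 0.5]])

# Standard pipeline: threshold probabilities (same as logits > 0).
prob_mask = sigmoid(logits) > 0.5

# Per-image max normalization, as in the quoted evaluation snippet.
norm_map = logits / np.max(logits)
norm_mask = norm_map > 0.5

print(prob_mask)   # the logit-8 pixel is foreground here
print(norm_mask)   # the logit-8 pixel (8 / 20 = 0.4) is background here
```

In this toy case the two masks disagree on two of the four pixels, even though the sigmoid-based mask is the one that matches the BCEWithLogitsLoss training objective.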

Could you clarify why the raw logits are normalized by the maximum value of each image instead of applying sigmoid?

2. Precision, Recall, and F1 seem to be reported at threshold = 0.0

In evaluate.py, cal_prf_metrics() sweeps thresholds from 0.0 up to (but not including) 1.0:

for thresh in np.arange(0.0, 1.0, thresh_step):
    pred_img = (pred / 255 > thresh).astype('uint8')
    ...

However, in eval(), the reported values are:

F1 = F_list[0]
Precision = Precision_list[0]
Recall = Recall_list[0]

F_list[0] is the first entry of that sweep, so the reported Precision/Recall/F1 correspond to threshold = 0.0, not threshold = 0.5 and not the best-F1 threshold.
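
As a rough illustration of how much the choice of index can matter, the sketch below runs a threshold sweep on a toy 2x2 prediction and reads F1 at index 0, at threshold 0.5, and at the best threshold. The `prf` helper, the toy arrays, and `thresh_step = 0.01` are my own assumptions, not the repository's code or settings.

```python
import numpy as np

def prf(pred, gt, thresh):
    """Precision, recall, F1 for a uint8 prediction map binarized at `thresh`."""
    pred_bin = pred / 255 > thresh
    tp = np.sum(pred_bin & gt)
    fp = np.sum(pred_bin & ~gt)
    fn = np.sum(~pred_bin & gt)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1

# Toy saved prediction (0..255) and ground truth (hypothetical values).
pred = np.array([[10, 200], [120, 0]], dtype=np.uint8)
gt = np.array([[False, True], [True, False]])

thresh_step = 0.01  # assumed step size
thresholds = np.arange(0.0, 1.0, thresh_step)
f1_list = np.array([prf(pred, gt, t)[2] for t in thresholds])

idx_05 = int(np.argmin(np.abs(thresholds - 0.5)))  # index closest to 0.5
best_idx = int(np.argmax(f1_list))                 # first best-F1 index

print("F1 at index 0 (threshold 0.0):", f1_list[0])
print("F1 at threshold 0.5:", f1_list[idx_05])
print("best F1:", f1_list[best_idx], "at threshold", thresholds[best_idx])
```

On this toy input the three readings differ, which is why it matters which one the reported numbers use.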

Could you confirm whether this is intended? If so, should the reported F1 be interpreted as the result of pred / 255 > 0.0 rather than a standard probability threshold?

3. mIoU is computed using threshold sweeping and includes the background class

The mIoU implementation appears to compute:

IoU_crack = TP / (TP + FP + FN)
IoU_bg = TN / (TN + FP + FN)
mIoU = (IoU_crack + IoU_bg) / 2

and then takes the maximum value over all thresholds:

mIoU = np.max(final_iou)

Therefore, the reported mIoU seems to be:

max_threshold mean_image((IoU_crack + IoU_bg) / 2)

rather than foreground/crack IoU at a fixed threshold.
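
The gap between the two definitions can be large for thin structures such as cracks, because the background class dominates the pixel count. The sketch below uses the formulas quoted above on a toy 10x10 mask of my own; the `iou_terms` helper and the arrays are assumptions for illustration, not the repository's implementation.

```python
import numpy as np

def iou_terms(pred_bin, gt):
    """Return (IoU_crack, IoU_bg, two-class mIoU) for boolean masks."""
    tp = np.sum(pred_bin & gt)
    fp = np.sum(pred_bin & ~gt)
    fn = np.sum(~pred_bin & gt)
    tn = np.sum(~pred_bin & ~gt)
    iou_crack = tp / (tp + fp + fn) if tp + fp + fn > 0 else 0.0
    iou_bg = tn / (tn + fp + fn) if tn + fp + fn > 0 else 0.0
    return iou_crack, iou_bg, (iou_crack + iou_bg) / 2

# Toy masks: a 4-pixel crack, with the prediction shifted by 2 pixels.
gt = np.zeros((10, 10), dtype=bool)
gt[0, 0:4] = True
pred_bin = np.zeros((10, 10), dtype=bool)
pred_bin[0, 2:6] = True

iou_crack, iou_bg, miou = iou_terms(pred_bin, gt)
print("IoU_crack:", iou_crack)
print("IoU_bg:", iou_bg)
print("two-class mIoU:", miou)
```

Here only half of the crack pixels are recovered, yet the two-class mIoU is roughly twice the crack IoU because the background term is close to 1.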

Could you clarify whether the reported mIoU in the paper/table is the two-class mIoU including background, and whether it is selected using the best threshold on the evaluation set?

4. Best checkpoint appears to be selected by mIoU, not F1

From the training pipeline, the best checkpoint seems to be saved based on:

if max_mIoU < metrics['mIoU']:
    save checkpoint_best.pth

So the selected best model is based on the best-threshold two-class mIoU, not F1.

Could you confirm whether the reported best model is selected by mIoU or by F1?

5. Data augmentation does not seem to be used in the actual training dataset

I noticed that there are augmentation-related utility functions in the repository, but in the actual CrackDataset.__getitem__() pipeline, the data processing seems to consist mainly of:

  • image/mask loading
  • resizing
  • mask thresholding
  • ToTensor
  • normalization

I could not find random crop, random flip, random rotation, random affine, or other online augmentation being applied in the actual training path.
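
For reference, this is the kind of online augmentation I was expecting to find in the training path: a minimal NumPy sketch that applies the same random flips and 90-degree rotation to an image and its mask. The `augment_pair` function is hypothetical and does not exist in the repository.

```python
import numpy as np

def augment_pair(image, mask, rng):
    """Apply identical random flips and a 90-degree rotation to image and mask."""
    if rng.random() < 0.5:
        image, mask = image[:, ::-1], mask[:, ::-1]   # horizontal flip
    if rng.random() < 0.5:
        image, mask = image[::-1, :], mask[::-1, :]   # vertical flip
    k = int(rng.integers(0, 4))                       # number of 90-degree turns
    image = np.rot90(image, k, axes=(0, 1))
    mask = np.rot90(mask, k, axes=(0, 1))
    return np.ascontiguousarray(image), np.ascontiguousarray(mask)

rng = np.random.default_rng(0)
# Toy H x W x C image and a mask derived from its first channel.
image = np.arange(4 * 4 * 3).reshape(4, 4, 3).astype(np.uint8)
mask = image[..., 0] > 20

aug_image, aug_mask = augment_pair(image, mask, rng)
print(aug_image.shape, aug_mask.shape)
```

The key property is that the image and the mask receive the same spatial transform, so pixel-level labels stay aligned.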

Could you clarify whether online data augmentation is used in the released training code? If not, were the reported results obtained using only resizing and normalization, or was offline augmentation applied before training?

Summary

To make the evaluation protocol easier to reproduce, it would be helpful to clarify the following:

  1. Whether predictions should be evaluated from raw logits, sigmoid probabilities, or per-image max-normalized maps.
  2. Whether reported Precision/Recall/F1 use threshold 0.0, threshold 0.5, or best threshold.
  3. Whether reported mIoU includes background and whether it is selected by threshold sweeping.
  4. Whether the best checkpoint is selected by mIoU or F1.
  5. Whether online data augmentation is actually enabled in the released training pipeline.

Thank you.
