Hi, thank you for releasing the SCSegamba codebase. I am trying to reproduce and understand the reported results, but I found several points in the released implementation that seem unclear or potentially inconsistent with standard binary segmentation evaluation. I would appreciate clarification.
1. Prediction output is saved after per-image max normalization instead of sigmoid
In the testing/evaluation pipeline, the model output appears to be saved as:
out = out[0, 0, ...].cpu().numpy()
out = 255 * (out / np.max(out))
cv2.imwrite(..., out)
However, during training the model uses BCEWithLogitsLoss, which suggests that the model output should be treated as logits. In a standard binary segmentation pipeline, the evaluation would usually be:
prob = sigmoid(logits)
pred = prob > threshold
or equivalently, for threshold 0.5 (since sigmoid(0) = 0.5):
pred = logits > 0
The current per-image normalization changes the meaning of the threshold. For example, if two pixels have logits 8 and 20, both correspond to sigmoid probabilities close to 1.0. But after out / max(out), the logit 8 pixel becomes 0.4 if the maximum logit in that image is 20. Therefore, thresholding the normalized map is no longer equivalent to thresholding probabilities.
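To make the example above concrete, here is a minimal sketch of my own (not code from the repository). The logit values are hypothetical and chosen only to match the numbers in the paragraph above; `sigmoid` is a small helper I define here.

```python
import numpy as np

def sigmoid(x):
    # Standard logistic function mapping logits to probabilities.
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical logits for one image: two confident crack pixels, one background pixel.
logits = np.array([8.0, 20.0, -5.0])

prob = sigmoid(logits)              # probabilities in (0, 1)
max_norm = logits / np.max(logits)  # per-image max normalization, as in the saving code

# The same threshold value selects different pixels under the two conventions.
pred_from_prob = prob > 0.5
pred_from_norm = max_norm > 0.5

print(pred_from_prob)  # the logit-8 pixel is kept
print(pred_from_norm)  # the logit-8 pixel is dropped (8 / 20 = 0.4)
```

With these values, the pixel with logit 8 is classified as crack under sigmoid thresholding but as background after max normalization, even though its probability is close to 1.0.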
Could you clarify why the raw logits are normalized by the maximum value of each image instead of applying sigmoid?
2. Precision, Recall, and F1 seem to be reported at threshold = 0.0
In evaluate.py, cal_prf_metrics() scans thresholds from 0.0 up to (but not including) 1.0:
for thresh in np.arange(0.0, 1.0, thresh_step):
    pred_img = (pred / 255 > thresh).astype('uint8')
    ...
However, in eval(), the reported values are:
F1 = F_list[0]
Precision = Precision_list[0]
Recall = Recall_list[0]
This means that the reported Precision/Recall/F1 correspond to threshold = 0.0, not threshold = 0.5, and not the best F1 threshold.
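As a sketch of the distinction I am asking about, the snippet below shows which list index corresponds to each candidate threshold. This is my own illustration, not repository code: the value of `thresh_step` is an assumption, and `f_list` is a made-up F1 curve used only to show the indexing.

```python
import numpy as np

thresh_step = 0.01  # assumed value; the repository's step may differ
thresholds = np.arange(0.0, 1.0, thresh_step)

# Hypothetical F1 curve for illustration only, peaking near threshold 0.3.
f_list = 1.0 - (thresholds - 0.3) ** 2

idx_zero = 0                                          # what F_list[0] selects
idx_half = int(np.argmin(np.abs(thresholds - 0.5)))   # fixed threshold 0.5
idx_best = int(np.argmax(f_list))                     # best-F1 threshold

print(thresholds[idx_zero], thresholds[idx_half], thresholds[idx_best])
```

Indexing with 0 always returns the entry for the first threshold in the sweep, so the three conventions generally give three different reported values.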
Could you confirm whether this is intended? If so, should the reported F1 be interpreted as the result of pred / 255 > 0.0 rather than a standard probability threshold?
3. mIoU is computed using threshold sweeping and includes the background class
The mIoU implementation appears to compute:
IoU_crack = TP / (TP + FP + FN)
IoU_bg = TN / (TN + FP + FN)
mIoU = (IoU_crack + IoU_bg) / 2
and then takes the maximum value over all thresholds. Therefore, the reported mIoU seems to be:
max_threshold mean_image((IoU_crack + IoU_background) / 2)
rather than foreground/crack IoU at a fixed threshold.
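To show how much the two definitions can differ, here is a small sketch of my own using the formulas quoted above on a toy mask. The helper name `iou_pair` and the toy arrays are hypothetical, and the sketch does not handle the case where a denominator is zero.

```python
import numpy as np

def iou_pair(pred, gt):
    """Return (IoU_crack, IoU_bg) for binary masks, per the formulas quoted above."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    iou_crack = tp / (tp + fp + fn)
    iou_bg = tn / (tn + fp + fn)
    return float(iou_crack), float(iou_bg)

# Toy 10x10 example: 4 ground-truth crack pixels; the prediction hits 2 of them
# and adds 2 false positives.
gt = np.zeros((10, 10), dtype=np.uint8)
gt[0, 0:4] = 1
pred = np.zeros((10, 10), dtype=np.uint8)
pred[0, 2:6] = 1

iou_crack, iou_bg = iou_pair(pred, gt)
two_class_miou = (iou_crack + iou_bg) / 2
print(iou_crack, iou_bg, two_class_miou)
```

Here the crack IoU is 2/6 while the background IoU is 94/98, so the two-class mean is roughly twice the foreground IoU. Because cracks occupy few pixels, the background term tends to dominate the average.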
Could you clarify whether the reported mIoU in the paper/table is the two-class mIoU including background, and whether it is selected using the best threshold on the evaluation set?
4. Best checkpoint appears to be selected by mIoU, not F1
From the training pipeline, the best checkpoint seems to be saved based on:
if max_mIoU < metrics['mIoU']:
    save checkpoint_best.pth
So the selected best model is based on the best-threshold two-class mIoU, not F1.
Could you confirm whether the reported best model is selected by mIoU or by F1?
5. Data augmentation does not seem to be used in the actual training dataset
I noticed that there are augmentation-related utility functions in the repository, but in the actual CrackDataset.__getitem__() pipeline, the data processing seems to consist mainly of:
- image/mask loading
- resizing
- mask thresholding
- ToTensor
- normalization
I could not find random crop, random flip, random rotation, random affine, or other online augmentation being applied in the actual training path.
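For reference, this is the kind of online augmentation I was looking for in the training path. It is a minimal sketch of my own, not the repository's code: `random_flip_pair` is a hypothetical helper, and it assumes the image is an H x W x C array and the mask an H x W array.

```python
import numpy as np

def random_flip_pair(image, mask, rng):
    """Apply the same random horizontal/vertical flip to an image and its mask."""
    if rng.random() < 0.5:
        # Horizontal flip: reverse the width axis of both arrays.
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:
        # Vertical flip: reverse the height axis of both arrays.
        image, mask = image[::-1, :], mask[::-1, :]
    # Copy to contiguous memory so later tensor conversion does not see negative strides.
    return np.ascontiguousarray(image), np.ascontiguousarray(mask)

rng = np.random.default_rng(0)
image = np.arange(12).reshape(3, 4, 1).repeat(3, axis=2)  # toy H x W x C image
mask = image[..., 0].copy()                               # toy mask aligned with the image
aug_image, aug_mask = random_flip_pair(image, mask, rng)
```

The point is that the image and mask receive the same transform on every call, which is the step I could not locate in CrackDataset.__getitem__().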
Could you clarify whether online data augmentation is used in the released training code? If not, were the reported results obtained using only resizing and normalization, or was offline augmentation applied before training?
Summary
To make the evaluation protocol easier to reproduce, it would be helpful to clarify the following:
- Whether predictions should be evaluated from raw logits, sigmoid probabilities, or per-image max-normalized maps.
- Whether reported Precision/Recall/F1 use threshold 0.0, threshold 0.5, or best threshold.
- Whether reported mIoU includes background and whether it is selected by threshold sweeping.
- Whether the best checkpoint is selected by mIoU or F1.
- Whether online data augmentation is actually enabled in the released training pipeline.
Thank you.