I trained and validated on a subset of scenes from the nuScenes dataset. To compare the contribution of each modality, I set the features of one modality to 0 during evaluation, so that the model predicts from the other modality alone; everything else is unchanged. The results are as follows:
fusion mAP:0.1936
fusion NDS:0.2288
lidar_only mAP:0.1933
lidar_only NDS:0.2275
camera_only mAP:0.0000
camera_only NDS:0.0000
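For reference, the ablation described above amounts to something like the following sketch. NumPy arrays stand in for the PyTorch tensors, and the key names camera_bev and lidar_bev are placeholders I chose for illustration, not the names used in the repository:

```python
import numpy as np

def ablate_modality(batch_dict, key):
    """Return a shallow copy of batch_dict with batch_dict[key] replaced by zeros."""
    ablated = dict(batch_dict)  # leave the caller's dict untouched
    ablated[key] = np.zeros_like(batch_dict[key])
    return ablated

# Toy BEV features of shape (batch, channels, H, W); key names are hypothetical.
rng = np.random.default_rng(0)
batch = {
    "camera_bev": rng.standard_normal((1, 4, 8, 8)),
    "lidar_bev": rng.standard_normal((1, 4, 8, 8)),
}

lidar_only = ablate_modality(batch, "camera_bev")   # camera features zeroed
camera_only = ablate_modality(batch, "lidar_bev")   # LiDAR features zeroed
```

The rest of the evaluation pipeline is then run on lidar_only or camera_only exactly as it is on the unmodified batch.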
These results show that the model's detection performance comes almost entirely from the LiDAR features. Checking the code, I found that the spatial_features passed to downstream tasks in GlobalAlign are computed as follows:
deformed_feature = self.deform_conv(lidar_bev * deform_weight)
batch_dict['spatial_features'] = deformed_feature
Is it intended that mm-bev, which fuses the camera features and LiDAR features in the code, is never used to compute spatial_features?
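A simple way to test whether an output depends on the camera branch at all is to perturb only the camera input and compare the outputs. The sketch below is a toy illustration with NumPy: lidar_only_head and fused_head are hypothetical stand-ins I wrote for this post, not the real GlobalAlign, and the key names are placeholders.

```python
import numpy as np

def output_depends_on(fn, inputs, key, seed=0):
    """Return True if perturbing inputs[key] changes fn's output."""
    rng = np.random.default_rng(seed)
    perturbed = dict(inputs)
    perturbed[key] = inputs[key] + rng.standard_normal(inputs[key].shape)
    return not np.allclose(fn(**inputs), fn(**perturbed))

def lidar_only_head(camera_bev, lidar_bev):
    # Toy stand-in: as in the quoted lines, only lidar_bev reaches the output.
    return lidar_bev * 0.5

def fused_head(camera_bev, lidar_bev):
    # Toy stand-in for a head that consumes a fused camera + LiDAR feature.
    return np.concatenate([camera_bev, lidar_bev], axis=1)

rng = np.random.default_rng(1)
feats = {
    "camera_bev": rng.standard_normal((1, 4, 8, 8)),
    "lidar_bev": rng.standard_normal((1, 4, 8, 8)),
}

print(output_depends_on(lidar_only_head, feats, "camera_bev"))  # False
print(output_depends_on(fused_head, feats, "camera_bev"))       # True
```

Applied to the real module, the same check would wrap the GlobalAlign forward pass and compare batch_dict['spatial_features'] before and after perturbing the camera features.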