Thanks for sharing. Table 2 in the paper shows that NeRAF's joint audio-visual training improves vision performance. After reading the code, I have a question about how this happens.
I noticed that the 3D grid in NeRAF is initialized as a plain tensor rather than as a parameter that requires gradients:
```python
def reset_grid(self, device=None):
    device = self.device if device is None else device
    self.grid = torch.zeros((7, int((self.grid_size[1] - self.grid_size[0]) / self.grid_step),
                             int((self.grid_size[3] - self.grid_size[2]) / self.grid_step),
                             int((self.grid_size[5] - self.grid_size[4]) / self.grid_step)),
                            dtype=torch.float32, device=device)
    # Add coordinates
    grid_coordinates = torch.meshgrid(
        torch.arange(self.grid_size[0] + self.grid_step / 2, self.grid_size[1], self.grid_step),
        torch.arange(self.grid_size[2] + self.grid_step / 2, self.grid_size[3], self.grid_step),
        torch.arange(self.grid_size[4] + self.grid_step / 2, self.grid_size[5], self.grid_step),
        indexing='ij')
    grid_coordinates = torch.stack(grid_coordinates, dim=0)
    self.grid[4:, :, :, :] = grid_coordinates
```
Then, when updating 3D grid values during training, the 3D grid is detached from the computation graph:
```python
self.grid = self.grid.detach()
self.grid[0, xs, ys, zs] = color[:, 0].float().squeeze()
self.grid[1, xs, ys, zs] = color[:, 1].float().squeeze()
self.grid[2, xs, ys, zs] = color[:, 2].float().squeeze()
self.grid[3, xs, ys, zs] = alpha.float().squeeze()
```
Although this 3D grid guides the acoustic modeling, it seems that the audio loss cannot be backpropagated to NeRF. So why does NeRAF's joint audio-visual training improve vision performance?
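To make the question concrete, here is a minimal sketch of how this could be checked in isolation. It is a toy example, not NeRAF code: the function name `grads_reach_source`, the grid shape, the indices, and the sum used as a stand-in loss are all my own assumptions. It writes values into a detached plain-tensor grid, following the same pattern as the update above, and reports whether a loss computed from the grid produces gradients on the source tensor.

```python
import torch


def grads_reach_source(detach_values: bool) -> bool:
    """Toy check: does a loss on the grid produce gradients on the written values?"""
    # Stand-in for NeRF color outputs (hypothetical, not the NeRAF model).
    color = torch.rand(5, 3, requires_grad=True)
    values = color.detach() if detach_values else color

    # Plain tensor grid, detached before the in-place write, as in the update above.
    grid = torch.zeros(7, 4, 4, 4)
    grid = grid.detach()
    xs = torch.tensor([0, 1, 2, 3, 0])
    ys = torch.tensor([0, 1, 2, 3, 1])
    zs = torch.tensor([0, 1, 2, 3, 2])
    grid[0, xs, ys, zs] = values[:, 0]

    # Stand-in for an audio loss that reads the grid.
    loss = grid.sum()
    if not loss.requires_grad:
        return False
    loss.backward()
    return color.grad is not None and bool(color.grad.abs().sum() > 0)


print(grads_reach_source(detach_values=False))
print(grads_reach_source(detach_values=True))
```

In plain PyTorch, the first call returns `True` and the second returns `False`: `detach()` cuts the history the grid had before that point, but an in-place write of a tensor that requires gradients re-attaches the written entries. So whether the audio loss can reach NeRF appears to depend on whether `color` and `alpha` still carry gradients when they are written into the grid (for example, whether they are computed under `torch.no_grad()`), which I could not determine from the snippets above.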