Questions about "NeRAF audio-visual joint training improves vision performance"

Thanks for sharing. Table 2 in the paper shows that NeRAF's audio-visual joint training improves vision performance. After reading the code, I have the following question.

The [3D grid in NeRAF](https://github.com/AmandineBtto/NeRAF/blob/0c162b8268d0b3835ab098a67d2924787bb2ec2f/NeRAF/NeRAF_model.py#L271) is initialized as a plain tensor rather than as an `nn.Parameter` that requires gradients:

```python
def reset_grid(self,device=None):
        device = self.device if device is None else device
        self.grid = torch.zeros((7, int((self.grid_size[1] - self.grid_size[0]) / self.grid_step),
                                 int((self.grid_size[3] - self.grid_size[2]) / self.grid_step),
                                 int((self.grid_size[5] - self.grid_size[4]) / self.grid_step)),dtype=torch.float32,device=device)
        # Add coordinates
        grid_coordinates = torch.meshgrid(torch.arange(self.grid_size[0]+self.grid_step/2, self.grid_size[1], self.grid_step), torch.arange(self.grid_size[2]+self.grid_step/2, self.grid_size[3], self.grid_step), torch.arange(self.grid_size[4]+self.grid_step/2, self.grid_size[5], self.grid_step), indexing='ij')
        grid_coordinates = torch.stack(grid_coordinates, dim=0)
        self.grid[4:,:,:,:] = grid_coordinates
```
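To illustrate the distinction drawn above, here is a minimal sketch with a hypothetical `ToyField` module (not part of NeRAF; the names `grid` and `learned` are only for illustration). It assumes nothing beyond standard PyTorch behavior: a tensor assigned as a plain attribute is not registered as a parameter, so an optimizer built from `parameters()` never sees it.

```python
import torch
from torch import nn


class ToyField(nn.Module):
    """Hypothetical module for illustration only, not NeRAF code."""

    def __init__(self):
        super().__init__()
        # Plain tensor attribute: not registered, no gradient tracking.
        self.grid = torch.zeros(7, 2, 2, 2)
        # Registered parameter: tracked by autograd and visible to optimizers.
        self.learned = nn.Parameter(torch.zeros(3))


field = ToyField()
param_names = [name for name, _ in field.named_parameters()]
print(param_names)               # only the nn.Parameter is listed
print(field.grid.requires_grad)  # the plain tensor does not track gradients
```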

Then, when [updating the 3D grid values during training](https://github.com/AmandineBtto/NeRAF/blob/0c162b8268d0b3835ab098a67d2924787bb2ec2f/NeRAF/NeRAF_model.py#L395), the 3D grid is detached from the computation graph:

```python
self.grid = self.grid.detach()

self.grid[0, xs, ys, zs] = color[:, 0].float().squeeze()
self.grid[1, xs, ys, zs] = color[:, 1].float().squeeze()
self.grid[2, xs, ys, zs] = color[:, 2].float().squeeze()
self.grid[3, xs, ys, zs] = alpha.float().squeeze()
```

Although this 3D grid guides the acoustic modeling, it seems that the audio loss cannot be backpropagated to the NeRF. So why does NeRAF's audio-visual joint training improve vision performance?
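The reasoning can be checked in isolation with a toy sketch. Everything here is hypothetical (`nerf_weight`, `audio_weight`, and the 1D `grid` are stand-ins, not NeRAF code). The sketch assumes the values written into the grid carry no gradient, which the excerpt above does not show; for contrast, it also runs the case where `color` is still attached to the graph.

```python
import torch


def nerf_receives_audio_grad(detach_color: bool) -> bool:
    """Toy check: does an audio-style loss reach the weight that produced `color`?"""
    # Hypothetical stand-ins for the NeRF and the acoustic network.
    nerf_weight = torch.ones(4, requires_grad=True)
    audio_weight = torch.ones(4, requires_grad=True)

    color = nerf_weight * 0.5  # "NeRF output", attached to the autograd graph
    if detach_color:
        # Assumption: the grid is filled with values that carry no gradient.
        color = color.detach()

    grid = torch.zeros(4)  # plain tensor, as in reset_grid
    grid = grid.detach()   # mirrors self.grid = self.grid.detach()
    grid[:] = color        # in-place write, like self.grid[0, xs, ys, zs] = ...

    # The "audio loss" reads the grid and depends on the audio weight.
    audio_loss = (grid * audio_weight).sum()
    audio_loss.backward()
    return nerf_weight.grad is not None


blocked = nerf_receives_audio_grad(detach_color=True)
attached = nerf_receives_audio_grad(detach_color=False)
print(blocked, attached)
```

In this toy setting the NeRF-side weight receives no gradient when `color` is detached before the write. Calling `detach()` on the grid alone does not cut the path if the written values still track gradients, so whether the audio loss reaches the NeRF in NeRAF depends on how `color` and `alpha` are computed at that point.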
