
Concept and Math Notes

IHaveThatPower edited this page Jun 24, 2023 · 13 revisions

Goal

Given a sufficient collection of (u,v) coordinates in several different images that correspond to matching (x,y,z) coordinates in a 3D space, we should be able to derive a description of the exact camera properties (position and rotation, or "pose", as well as intrinsic properties like focal length). With the camera properties derived, we can place reconstructed cameras into a 3D scene with the 2D images as background planes that can then be used as off-axis, non-orthographic references for 3D modeling.

Dense mesh reconstruction is not the goal. There are plenty of options that attempt dense mesh reconstruction using similar principles, including automatic (and dense) feature extraction, but that is contrary to the approach desired here. We want to specify common points manually. We do not want an automatic reconstruction of the subject; we want to model the subject ourselves, with confidence that our references are posed to give us an accurate basis to do so.

Projection Equations

The fundamental concept of 3D projection is well described on Wikipedia in 3D Projection and Camera resectioning, and is reproduced here for reference.

Approach 1

  • $\mbox{a}_\left(x,y,z\right)$ - the feature coordinate in 3D space
  • $\mbox{c}_\left(x,y,z\right)$ - the camera coordinate in 3D space
  • $\theta_\left(x,y,z\right)$ - the camera rotation as Tait-Bryan/Euler angles
  • $\mbox{e}_\left(x,y,z\right)$ - the display surface position relative to $\mbox{c}$
    • Convention usually treats $\mbox{e}_z$ as positive; a negative value is more physically accurate, but it flips the image
    • These values describe focal length and other camera intrinsics
  • $\mbox{b}_\left(u,v\right)$ - the feature coordinate in image space

To compute $\mbox{b}_\left(u,v\right)$ using the first formulation, we need to find $\mbox{d}_\left(x,y,z\right)$, which is the position of $\mbox{a}_\left(x,y,z\right)$ with respect to the camera's coordinate system. We do this by subtracting $\mbox{c}$ from $\mbox{a}$ and then applying a rotation of $-\theta$ to the result:

$$\begin{bmatrix}\mbox{d}_x\\ \mbox{d}_y\\ \mbox{d}_z\end{bmatrix} = \begin{bmatrix}1 & 0 & 0\\ 0 & \cos\left(\theta_x\right) & \sin\left(\theta_x\right)\\ 0 & -\sin\left(\theta_x\right) & \cos\left(\theta_x\right)\end{bmatrix} \begin{bmatrix}\cos\left(\theta_y\right) & 0 & -\sin\left(\theta_y\right)\\ 0 & 1 & 0\\ \sin\left(\theta_y\right) & 0 & \cos\left(\theta_y\right)\end{bmatrix} \begin{bmatrix}\cos\left(\theta_z\right) & \sin\left(\theta_z\right) & 0\\ -\sin\left(\theta_z\right) & \cos\left(\theta_z\right) & 0\\ 0 & 0 & 1\end{bmatrix} \left(\begin{bmatrix}\mbox{a}_x\\ \mbox{a}_y\\ \mbox{a}_z\\ \end{bmatrix} - \begin{bmatrix}\mbox{c}_x\\ \mbox{c}_y\\ \mbox{c}_z\end{bmatrix}\right)$$

Which in turn can be used to find $\mbox{b}$:

$$\begin{bmatrix} \mbox{b}_u\\ \mbox{b}_v \end{bmatrix} = \frac{\mbox{e}_z}{\mbox{d}_z} \begin{bmatrix} \mbox{d}_x\\ \mbox{d}_y \end{bmatrix} + \begin{bmatrix} \mbox{e}_x\\ \mbox{e}_y \end{bmatrix}$$
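The two steps above can be sketched in code (a minimal sketch assuming numpy; `project_point` is an illustrative name, not something defined elsewhere in this project):

```python
import numpy as np

def project_point(a, c, theta, e):
    """Project world point a through a camera at c with rotation theta.

    a, c  - 3D world coordinates of the feature and the camera
    theta - Tait-Bryan/Euler angles (theta_x, theta_y, theta_z)
    e     - display surface position relative to the camera (e_z ~ focal length)
    Returns b = (b_u, b_v), the feature's image-plane coordinate.
    """
    sx, cx = np.sin(theta[0]), np.cos(theta[0])
    sy, cy = np.sin(theta[1]), np.cos(theta[1])
    sz, cz = np.sin(theta[2]), np.cos(theta[2])
    # The three rotation matrices exactly as written in the equation above
    Rx = np.array([[1, 0, 0], [0, cx, sx], [0, -sx, cx]])
    Ry = np.array([[cy, 0, -sy], [0, 1, 0], [sy, 0, cy]])
    Rz = np.array([[cz, sz, 0], [-sz, cz, 0], [0, 0, 1]])
    # d: the feature position in the camera's coordinate system
    d = Rx @ Ry @ Rz @ (np.asarray(a, dtype=float) - np.asarray(c, dtype=float))
    # Perspective divide and principal-point offset give the image coordinate
    return (e[2] / d[2]) * d[:2] + np.asarray(e[:2], dtype=float)
```

For example, with the camera at the origin, no rotation, and $\mbox{e} = (0, 0, 1)$, the point $(1, 2, 4)$ projects to $(0.25, 0.5)$.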

Approach 2

$$z_c\begin{bmatrix} u\\ v\\ 1 \end{bmatrix} = \mbox{K} \begin{bmatrix} \mbox{R} & \mbox{T}\\ 0_{1\times3} & 1 \end{bmatrix} \begin{bmatrix} x_w\\ y_w\\ z_w\\ 1 \end{bmatrix}$$

Where:

  • $z_c$ - the $z$-coordinate of the feature point in the camera's coordinate system, i.e. its depth along the camera's viewing axis
  • $\left(u, v\right)$ - the image coordinate
  • $\left(x_w, y_w, z_w\right)$ - the feature coordinate in world space
  • $\mbox{K}$ - the camera's intrinsic properties
    • $\alpha_x$ and $\alpha_y$ represent the focal length in pixels and can be expressed as $f \cdot m_x$ and $f \cdot m_y$, where $m_x$ and $m_y$ are the inverses of a pixel's width and height on the projection plane and $f$ is the focal length as a physical distance
    • $\lambda$ is the skew coefficient between the x and y axes, usually 0
    • $u_0, v_0$ represent the principal point, ideally the center of the image

$$\mbox{K} = \begin{bmatrix} \alpha_x & \lambda & u_0 & 0\\ 0 & \alpha_y & v_0 & 0\\ 0 & 0 & 1 & 0 \end{bmatrix}$$

Note that $\mbox{K}$ is $3\times4$: multiplied by the $4\times4$ extrinsic matrix and the homogeneous world coordinate, it yields the $3\times1$ vector $z_c\left(u, v, 1\right)^\top$.

  • $\mbox{R}$ - the camera's 3x3 rotation matrix
    • This is identical in form to the three expanded rotation matrices in Approach 1, above.
  • $\mbox{T}$ - the camera's position (translation) matrix
    • This is the position of the origin of world coordinates expressed in coordinates of the camera-centered coordinate system, not the coordinates of the camera in world space. The latter would be $C = -R^{-1}T$.
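Approach 2 can be sketched the same way (assuming numpy; `camera_matrix` and `project` are illustrative names, with $\mbox{K}$ in its $3\times4$ form):

```python
import numpy as np

def camera_matrix(alpha_x, alpha_y, skew, u0, v0):
    """Build the 3x4 intrinsic matrix K from focal lengths in pixels,
    skew coefficient, and the principal point (u0, v0)."""
    return np.array([[alpha_x, skew, u0, 0.0],
                     [0.0, alpha_y, v0, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])

def project(K, R, T, p_world):
    """Project a world point using z_c * (u, v, 1)^T = K [R T; 0 1] p_h."""
    # Assemble the 4x4 extrinsic matrix [R T; 0 1]
    E = np.eye(4)
    E[:3, :3] = R
    E[:3, 3] = T
    # Homogeneous world coordinate (x_w, y_w, z_w, 1)
    p_h = np.append(np.asarray(p_world, dtype=float), 1.0)
    uvw = K @ E @ p_h
    # Divide out z_c to recover the pixel coordinate
    return uvw[:2] / uvw[2]
```

With $\alpha_x = \alpha_y = 500$, principal point $(320, 240)$, identity rotation, and zero translation, the world point $(1, 2, 4)$ lands at pixel $(445, 490)$.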

A Note On Camera Distortion

Lens-based cameras (as opposed to pinhole cameras) invariably introduce some image distortion. Camera software sometimes corrects for this in the captured image. It is possible to estimate and account for this distortion, but doing so significantly complicates the above equations.

For now, images are treated as undistorted. As such, users should pre-correct their images, when possible, before using them. Perhaps in the future, this will be revisited. For further reading on distortion, see this paper.

Rotation Matrix

Using $s$ and $c$ in lieu of $\sin\left(\theta\right)$ and $\cos\left(\theta\right)$ respectively, we can condense $\mbox{R}$ thus:

$$ \mbox{R} = \begin{bmatrix} 1 & 0 & 0\\ 0 & c_x & s_x\\ 0 & -s_x & c_x \end{bmatrix} \begin{bmatrix} c_y & 0 & -s_y\\ 0 & 1 & 0\\ s_y & 0 & c_y \end{bmatrix} \begin{bmatrix} c_z & s_z & 0\\ -s_z & c_z & 0\\ 0 & 0 & 1 \end{bmatrix}$$

$$ \mbox{R} = \begin{bmatrix} c_y & 0 & -s_y\\ s_x s_y & c_x & s_x c_y\\ c_x s_y & -s_x & c_x c_y \end{bmatrix} \begin{bmatrix} c_z & s_z & 0\\ -s_z & c_z & 0\\ 0 & 0 & 1 \end{bmatrix} $$

$$ \mbox{R} = \begin{bmatrix} c_y c_z & c_y s_z & -s_y\\ s_x s_y c_z - c_x s_z & s_x s_y s_z + c_x c_z & s_x c_y\\ c_x s_y c_z + s_x s_z & c_x s_y s_z - s_x c_z & c_x c_y \end{bmatrix} $$
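The condensed form can be checked numerically against the product of the three elementary matrices (a numpy sketch; `rotation_xyz` is an illustrative name):

```python
import numpy as np

def rotation_xyz(tx, ty, tz):
    """Condensed rotation matrix R = R_x(tx) @ R_y(ty) @ R_z(tz),
    written out term-by-term as in the final expansion above."""
    sx, cx = np.sin(tx), np.cos(tx)
    sy, cy = np.sin(ty), np.cos(ty)
    sz, cz = np.sin(tz), np.cos(tz)
    return np.array([
        [cy * cz,                cy * sz,                -sy],
        [sx * sy * cz - cx * sz, sx * sy * sz + cx * cz, sx * cy],
        [cx * sy * cz + sx * sz, cx * sy * sz - sx * cz, cx * cy],
    ])
```

For any angles, this matches the explicit three-matrix product and is orthonormal ($\mbox{R}\mbox{R}^\top = \mbox{I}$), a useful sanity check when implementing it.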

Solving for the coordinate

See Eight-point algorithm

$$ z_c \begin{bmatrix} u\\ v\\ 1 \end{bmatrix} = \mbox{K}_{3\times3} \begin{bmatrix} \mbox{R} & \mbox{T} \end{bmatrix} \begin{bmatrix} x_w\\ y_w\\ z_w\\ 1 \end{bmatrix} $$

Here $\mbox{K}_{3\times3}$ is $\mbox{K}$ with its final column of zeros dropped, and the world coordinate is again homogeneous so that $\begin{bmatrix} \mbox{R} & \mbox{T} \end{bmatrix}$ can apply the translation.
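One linear way to recover the combined $3\times4$ projection matrix $\mbox{P} = \mbox{K}\begin{bmatrix}\mbox{R} & \mbox{T}\end{bmatrix}$ from manually specified 3D-to-2D correspondences is a direct linear transform (DLT), which is the same style of homogeneous least-squares solve the eight-point algorithm uses for the fundamental matrix. This is a hedged sketch of that idea (assuming numpy; `solve_projection_dlt` is an illustrative name, and it needs at least six non-coplanar correspondences):

```python
import numpy as np

def solve_projection_dlt(world_pts, image_pts):
    """Estimate the 3x4 projection matrix P (up to scale) from
    (x, y, z) world points and their matching (u, v) image points.

    Each correspondence contributes two rows to a homogeneous system
    A p = 0, derived from z_c * (u, v, 1)^T = P (x, y, z, 1)^T by
    eliminating z_c; the solution is the right singular vector of A
    with the smallest singular value."""
    A = []
    for (x, y, z), (u, v) in zip(world_pts, image_pts):
        A.append([x, y, z, 1, 0, 0, 0, 0, -u * x, -u * y, -u * z, -u])
        A.append([0, 0, 0, 0, x, y, z, 1, -v * x, -v * y, -v * z, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    return Vt[-1].reshape(3, 4)
```

The recovered $\mbox{P}$ is only defined up to scale, so it is verified by reprojection: applying it to the world points should reproduce the original image coordinates after the perspective divide. In practice, normalizing the input coordinates before the solve improves conditioning.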
