# Concept and Math Notes
Given a sufficient collection of (u,v) coordinates in several different images that correspond to matching (x,y,z) coordinates in a 3D space, we should be able to derive a description of the exact camera properties (position and rotation, or "pose", as well as intrinsic properties like focal length). With the camera properties derived, we can place reconstructed cameras into a 3D scene with the 2D images as background planes that can then be used as off-axis, non-orthographic references for 3D modeling.
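As a sketch of how this derivation can work in practice, the Direct Linear Transform (DLT) estimates a camera's 3&times;4 projection matrix from six or more manually specified 2D&ndash;3D correspondences. The following minimal numpy example (function names are illustrative, not part of any particular tool) is one such approach under ideal, noise-free assumptions:

```python
import numpy as np

def dlt_resection(world_pts, image_pts):
    """Estimate the 3x4 projection matrix P from >= 6 correspondences of
    world (x, y, z) and image (u, v) points via the Direct Linear
    Transform: each correspondence contributes two rows to A in A.p = 0,
    solved in the least-squares sense with SVD."""
    A = []
    for (x, y, z), (u, v) in zip(world_pts, image_pts):
        X = np.array([x, y, z, 1.0])
        A.append([*X, 0, 0, 0, 0, *(-u * X)])
        A.append([0, 0, 0, 0, *X, *(-v * X)])
    # The solution (up to scale) is the right singular vector
    # associated with the smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 4)

def project(P, pt):
    """Project a world point with P and dehomogenize to (u, v)."""
    u, v, w = P @ [*pt, 1.0]
    return u / w, v / w
```

Given exact correspondences, the recovered matrix reprojects every input point onto its original image coordinate; with real, hand-clicked points, a least-squares refinement over more than the minimum six correspondences absorbs measurement error.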
Dense mesh reconstruction is not the goal. There are plenty of options that attempt dense mesh reconstruction using similar principles, including automatic (and dense) feature extraction, but that is contrary to the desired approach. We want to specify common points manually. We do not want an automatic reconstruction of the subject; we want to model the subject ourselves, with confidence that our references are posed to give us an accurate basis to do so.
The fundamental concept of 3D projection is well described on Wikipedia in *3D projection* and *Camera resectioning*, but it is reproduced here for reference.
- $\mbox{a}_{(x,y,z)}$ - the feature coordinate in 3D space
- $\mbox{c}_{(x,y,z)}$ - the camera coordinate in 3D space
- $\theta_{(x,y,z)}$ - the camera rotation as Tait-Bryan/Euler angles
- $\mbox{e}_{(x,y,z)}$ - the display surface relative to $c$
  - These values describe focal length and other camera intrinsics
  - Convention usually treats $z$ as positive; negative is more physically correct, but will flip the image
- $\mbox{b}_{(u,v)}$ - the feature coordinate in image space
To compute the point's coordinates $\mbox{d}_{(x,y,z)}$ in camera space:

$$\begin{bmatrix} d_x \\ d_y \\ d_z \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta_x & \sin\theta_x \\ 0 & -\sin\theta_x & \cos\theta_x \end{bmatrix} \begin{bmatrix} \cos\theta_y & 0 & -\sin\theta_y \\ 0 & 1 & 0 \\ \sin\theta_y & 0 & \cos\theta_y \end{bmatrix} \begin{bmatrix} \cos\theta_z & \sin\theta_z & 0 \\ -\sin\theta_z & \cos\theta_z & 0 \\ 0 & 0 & 1 \end{bmatrix} \left(\begin{bmatrix} a_x \\ a_y \\ a_z \end{bmatrix} - \begin{bmatrix} c_x \\ c_y \\ c_z \end{bmatrix}\right)$$

Which in turn can be used to find the image coordinates:

$$b_u = \frac{e_z}{d_z} d_x + e_x \qquad b_v = \frac{e_z}{d_z} d_y + e_y$$
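This transform can be sketched in a few lines of numpy (names and argument conventions are illustrative; angles are assumed to be radians, applied in the $X \cdot Y \cdot Z$ order):

```python
import numpy as np

def rotation_matrix(theta_x, theta_y, theta_z):
    """Compose the three axis rotations (Tait-Bryan angles, radians)."""
    cx, sx = np.cos(theta_x), np.sin(theta_x)
    cy, sy = np.cos(theta_y), np.sin(theta_y)
    cz, sz = np.cos(theta_z), np.sin(theta_z)
    Rx = np.array([[1, 0, 0], [0, cx, sx], [0, -sx, cx]])
    Ry = np.array([[cy, 0, -sy], [0, 1, 0], [sy, 0, cy]])
    Rz = np.array([[cz, sz, 0], [-sz, cz, 0], [0, 0, 1]])
    return Rx @ Ry @ Rz

def project_point(a, c, theta, e):
    """Transform world point a into camera space (d), then project it
    onto the display surface at e to get image coordinates (b_u, b_v)."""
    d = rotation_matrix(*theta) @ (np.asarray(a) - np.asarray(c))
    return (e[2] / d[2]) * d[0] + e[0], (e[2] / d[2]) * d[1] + e[1]
```

For an unrotated camera at the origin with the display surface one unit along $z$, a point at $(1, 2, 4)$ projects to $(0.25, 0.5)$: depth scales the $x$ and $y$ coordinates down by $d_z$.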
The camera resectioning form expresses this as a single matrix equation:

$$z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \mbox{K} \begin{bmatrix} \mbox{R} & \mbox{T} \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}$$

Where:
- $z_c$ - the $z$-coordinate of the feature point in camera coordinates (its depth along the camera's optical axis)
- $\left(u, v\right)$ - the image coordinate
- $\left(x_w, y_w, z_w\right)$ - the feature coordinate in world space
- $\mbox{K}$ - the camera's intrinsic matrix:

  $$\mbox{K} = \begin{bmatrix} \alpha_x & \lambda & u_0 \\ 0 & \alpha_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}$$

  - $\alpha_x$ and $\alpha_y$ represent the focal length in terms of pixels and can be expressed instead as $f \cdot m_x$ and $f \cdot m_y$, where $m_x$ and $m_y$ are the inverses of the width and height of a pixel on the projection plane and $f$ is the focal length in terms of distance
  - $\lambda$ is the skew coefficient between the $x$ and $y$ axes, usually 0
  - $u_0, v_0$ represent the principal point, ideally the center of the image
- $\mbox{R}$ - the camera's 3&times;3 rotation matrix
  - This is identical in form to the three expanded rotation matrices in Approach 1, above.
- $\mbox{T}$ - the camera's position (translation) matrix
  - This is the position of the origin of world coordinates expressed in coordinates of the camera-centered coordinate system, not the coordinates of the camera in world space. The latter would be $C = -R^{-1}T$.
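To make the relationship concrete, a small numpy sketch (the intrinsic and extrinsic values are purely illustrative) applies the resectioning equation and recovers the camera's world position via $C = -R^{-1}T$:

```python
import numpy as np

# Illustrative intrinsics: 800 px focal length, zero skew,
# principal point at the center of a 640x480 image.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

# Illustrative extrinsics: identity rotation, world origin
# 5 units in front of the camera.
R = np.eye(3)
T = np.array([0.0, 0.0, 5.0])

def world_to_image(K, R, T, pw):
    """Apply z_c [u, v, 1]^T = K [R|T] [x_w, y_w, z_w, 1]^T,
    then divide out z_c to dehomogenize to pixel coordinates."""
    uvw = K @ (R @ np.asarray(pw) + T)
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

# The camera's position in world space, per C = -R^{-1} T.
camera_center = -np.linalg.inv(R) @ T
```

With these values, the world origin lands exactly on the principal point $(320, 240)$, and the camera sits at $(0, 0, -5)$ in world space, five units behind the origin along the optical axis.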
Images from lens-based cameras (as opposed to ideal pinhole cameras) invariably exhibit distortion. Capture software sometimes corrects for this in the captured image. It is possible to estimate and account for this distortion, but it significantly complicates the above equations.

For now, images will be treated as undistorted. As such, users should pre-correct images, if able, before using them. Perhaps in the future, this will be revisited. For further reading on distortion, see this paper.
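For context, the dominant component of lens distortion is usually radial, commonly modeled (as in the Brown&ndash;Conrady model) by a polynomial in the squared radius. A minimal sketch of the forward model, with made-up coefficients, shows why it complicates the linear equations above:

```python
def distort_radial(x, y, k1, k2):
    """Apply the radial term of a Brown-Conrady-style model to a point
    in normalized image coordinates (relative to the principal point,
    divided by focal length). k1 and k2 are lens-specific coefficients."""
    r2 = x * x + y * y
    # The nonlinearity: the scale factor depends on the point itself.
    scale = 1.0 + k1 * r2 + k2 * r2 * r2
    return x * scale, y * scale
```

Because the scale factor depends on the point's own radius, the mapping is nonlinear and cannot be folded into the matrix $\mbox{K}$; undistorting requires inverting this polynomial numerically, which is why pre-corrected images are assumed here.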
Using