Deriving the View Matrix
Computer graphics is less about knowing when to apply algorithms and more about knowing where. There are many possible wheres. Our 3D models are born in model space. To situate models in the virtual world, their model space coordinates are transformed into world space. Many lighting algorithms are applied in camera or eye space. From eye space, we pass on into clip space, normalized device space, and screen space.
In this post, we’ll restrict our discussion to the transformation that turns world space coordinates into eye space coordinates. We’ll call the matrix that performs this transformation the view matrix $\mathbf{V}$. What does $\mathbf{V}$ look like? When I first learned computer graphics, the structure of the view matrix was just handed down to me from on high. But you can only do so much with someone else’s magic.
We have the following four constraints to help us figure out what $\mathbf{V}$ looks like.
- The viewer’s right arm should extend along the x-axis in eye space.
- The viewer should be standing along the y-axis in eye space.
- The viewer should be looking along the z-axis in eye space.
- The viewer should be standing at the origin in eye space.
Let’s make our constraints a bit more precise by considering how the transformation turns world space coordinates into eye space coordinates.
- The vector leading out the viewer’s right arm in world space is $\mathbf{r}$. When we use $\mathbf{V}$ to transform $\mathbf{r}$, we should get the x-axis, like this: $$\mathbf{V} \times \begin{bmatrix} r_x \\r_y \\r_z \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\0 \\0 \\0 \end{bmatrix}$$
- The vector leading up out of the viewer’s head in world space is $\mathbf{u}$. When we use $\mathbf{V}$ to transform $\mathbf{u}$, we should get the y-axis, like this: $$\mathbf{V} \times \begin{bmatrix} u_x \\u_y \\u_z \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\1 \\0 \\0 \end{bmatrix}$$
- The vector leading along the viewer’s focal or forward direction in world space is $\mathbf{f}$. When we use $\mathbf{V}$ to transform $\mathbf{f}$, we should get the z-axis, like this: $$\mathbf{V} \times \begin{bmatrix} f_x \\f_y \\f_z \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\0 \\1 \\0 \end{bmatrix}$$
- The viewer’s position in world space is $\mathbf{p}$. When we use $\mathbf{V}$ to transform $\mathbf{p}$, we should get the origin, like this: $$\mathbf{V} \times \begin{bmatrix} p_x \\p_y \\p_z \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\0 \\0 \\1 \end{bmatrix}$$
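We can put these four matrix-vector multiplications together into a single matrix-matrix multiplication by lining the four vectors up as the columns of one matrix:
$$\mathbf{V} \times \begin{bmatrix} r_x & u_x & f_x & p_x \\ r_y & u_y & f_y & p_y \\ r_z & u_z & f_z & p_z \\ 0 & 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$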
Notice the structure of our matrix of viewer parameters. It is the product of a translation matrix $\mathbf{T}$ and a rotation matrix $\mathbf{R}$. Additionally, the right-hand side is the identity matrix.
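Spelled out, the factorization might look like the following. This is a sketch that assumes the column layout implied by the four constraints: $\mathbf{T}$ carries $\mathbf{p}$ in its last column, and $\mathbf{R}$ carries $\mathbf{r}$, $\mathbf{u}$, and $\mathbf{f}$ in its first three.

$$\underbrace{\begin{bmatrix} r_x & u_x & f_x & p_x \\ r_y & u_y & f_y & p_y \\ r_z & u_z & f_z & p_z \\ 0 & 0 & 0 & 1 \end{bmatrix}}_{\text{viewer parameters}} = \underbrace{\begin{bmatrix} 1 & 0 & 0 & p_x \\ 0 & 1 & 0 & p_y \\ 0 & 0 & 1 & p_z \\ 0 & 0 & 0 & 1 \end{bmatrix}}_{\mathbf{T}} \times \underbrace{\begin{bmatrix} r_x & u_x & f_x & 0 \\ r_y & u_y & f_y & 0 \\ r_z & u_z & f_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}}_{\mathbf{R}}$$

The order matters. With $\mathbf{T}$ on the left, $\mathbf{p}$ lands in the last column untouched; with $\mathbf{R}$ on the left, the last column would instead hold $\mathbf{p}$ rotated by $\mathbf{R}$.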
Let’s solve for $\mathbf{V}$. We pull a few identities out of our hat to make this happen. First, multiplying a matrix by its inverse yields the identity matrix. Second, multiplying a matrix by the identity yields the original matrix. Third, the inverse of a product is the product of its operands’ inverses in reverse order.
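Applied one at a time, the three identities give a chain like the one below. It is a worked sketch rather than the only route, and it writes the identity matrix as $\mathbf{I}$, a symbol not used elsewhere in this post.

$$\begin{aligned} \mathbf{V} \times (\mathbf{T} \times \mathbf{R}) &= \mathbf{I} \\ \mathbf{V} \times (\mathbf{T} \times \mathbf{R}) \times (\mathbf{T} \times \mathbf{R})^{-1} &= \mathbf{I} \times (\mathbf{T} \times \mathbf{R})^{-1} \\ \mathbf{V} \times \mathbf{I} &= (\mathbf{T} \times \mathbf{R})^{-1} \\ \mathbf{V} &= (\mathbf{T} \times \mathbf{R})^{-1} \\ \mathbf{V} &= \mathbf{R}^{-1} \times \mathbf{T}^{-1} \end{aligned}$$

The second line multiplies both sides on the right by the inverse of the parameter matrix. The third line uses the first identity on the left side and the second identity on the right side. The last line uses the third identity, which is why $\mathbf{R}^{-1}$ ends up in front of $\mathbf{T}^{-1}$.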
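Separating our parameter matrix into $\mathbf{T}$ and $\mathbf{R}$ makes finding the inverses easier. The job of the inverse of a transformation is to undo the transformation. The inverse of a translation, then, is just a translation by the negative offsets. The inverse of a rotation matrix “unrotates,” and the inverse just happens to be the transpose, though I do not find this obvious. It holds because the columns of $\mathbf{R}$ are unit length and mutually perpendicular: each entry of $\mathbf{R}^T \times \mathbf{R}$ is the dot product of two columns, which is 1 on the diagonal and 0 elsewhere. Therefore, we have the following construction for $\mathbf{V}$:
$$\mathbf{V} = \mathbf{R}^{-1} \times \mathbf{T}^{-1} = \begin{bmatrix} r_x & r_y & r_z & 0 \\ u_x & u_y & u_z & 0 \\ f_x & f_y & f_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \times \begin{bmatrix} 1 & 0 & 0 & -p_x \\ 0 & 1 & 0 & -p_y \\ 0 & 0 & 1 & -p_z \\ 0 & 0 & 0 & 1 \end{bmatrix}$$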
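Carrying out the product of the transposed rotation and the negated translation collapses the construction into a single matrix. This multiplied-out form is a worked step, not something the derivation needs. The dot products appear because each row of the transpose is multiplied against the last column of the translation, which holds $-\mathbf{p}$:

$$\mathbf{V} = \begin{bmatrix} r_x & r_y & r_z & -\mathbf{r} \cdot \mathbf{p} \\ u_x & u_y & u_z & -\mathbf{u} \cdot \mathbf{p} \\ f_x & f_y & f_z & -\mathbf{f} \cdot \mathbf{p} \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

As a quick check against the fourth constraint, transforming $\mathbf{p}$ with this matrix gives $\mathbf{r} \cdot \mathbf{p} - \mathbf{r} \cdot \mathbf{p} = 0$ in the first row, and likewise in the second and third.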
One could also arrive at this matrix by thinking of it as a translation of the viewer to the origin followed by a rotation of the viewer’s frame to align with the standard axes. I prefer the construction above because I don’t trust myself to understand rotation matrices.
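Either way of thinking can be checked numerically. The sketch below builds $\mathbf{V}$ as translate-then-rotate in plain Python and tests it against the four constraints. The helper names (`mat_mul`, `transform`, `view_matrix`) and the example viewer are made up for illustration, and the code assumes $\mathbf{r}$, $\mathbf{u}$, and $\mathbf{f}$ are unit length and mutually perpendicular. It follows this post's convention of sending $\mathbf{f}$ to the positive z-axis; OpenGL-style pipelines usually have the viewer look down the negative z-axis instead.

```python
def mat_mul(a, b):
    """Multiply two 4x4 matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transform(m, v):
    """Multiply a 4x4 matrix by a 4-component column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]


def view_matrix(r, u, f, p):
    """Translate the viewer to the origin, then rotate its frame onto the axes.

    Assumes r, u, f are unit length and mutually perpendicular.
    """
    translate_to_origin = [
        [1, 0, 0, -p[0]],
        [0, 1, 0, -p[1]],
        [0, 0, 1, -p[2]],
        [0, 0, 0, 1],
    ]
    # Transpose of R: the viewer's axes become the rows.
    rotate_to_axes = [
        [r[0], r[1], r[2], 0],
        [u[0], u[1], u[2], 0],
        [f[0], f[1], f[2], 0],
        [0, 0, 0, 1],
    ]
    # The translation is applied first, so it sits on the right.
    return mat_mul(rotate_to_axes, translate_to_origin)


# Hypothetical viewer frame: orthonormal, with r x u = f so R is a proper rotation.
r = [0, 0, -1]
u = [0, 1, 0]
f = [1, 0, 0]
p = [3, 2, 5]

V = view_matrix(r, u, f, p)

# Directions get w = 0, the position gets w = 1, as in the constraints.
print(transform(V, r + [0]))  # [1, 0, 0, 0]
print(transform(V, u + [0]))  # [0, 1, 0, 0]
print(transform(V, f + [0]))  # [0, 0, 1, 0]
print(transform(V, p + [1]))  # [0, 0, 0, 1]
```

The example uses integer components so the comparisons are exact. With a real camera the vectors would be floating point, and the results would match the axes only up to rounding error.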