teaching machines

CS 488: Lecture 21 – Perspective

April 12, 2021. Filed under graphics-3d, lectures, spring-2021.

Dear students:

Why do things farther away appear smaller? And how do we achieve that effect in our renderers? We answer these questions today as we explore the perspective matrix.

How Real Eyes Work

Our eyes are complex organs, and we will reduce them to an abstraction that is easier to model in code. In the front, eyes have a lens through which light enters. We’ll assume the lens is the size of a pinhole, though our real lenses have area. In the back, the eyes have an image plane on which the light lands. In truth, eyes have a curved retina, but the curvature isn’t helpful.

Some ancient thinkers proposed that sight is achieved by a force leaving our eyes and going out to collect color from the surfaces it sees. When we blink, they reasoned, the world darkened. These days we understand that photons are emitted from light sources and hit the surfaces of the world. Photons of certain frequencies are absorbed. Photons of other frequencies are reflected or refracted. Some of these redirected photons bounce into our eyes and stimulate rod and cone cells in our retina. Some photons hit other objects and then bounce into our eyes. Some things we never see.

We can figure out where a photon from an object will land on our retina by drawing a line from the object through the lens. In the absence of black holes, photons travel in a straight line. Where that line hits the image plane is where the photon will land. An object that is farther away will have a line with a smaller slope, and will therefore project to a smaller area of the image plane. If our retina had as much depth as the world, we wouldn’t experience this shrinking. How would we operate in such a world?

Because of the arrangement of the image plane and lens, the world that we see projects upside down. The image does not need flipping, though. Our brain associates excited cells at the bottom of the retina with stimuli at the top of our field of view.

How Fake Eyes Work

In computer graphics, we take on the role of filling that image plane manually. The image plane is our framebuffer. We need to take the world and cast it toward our fake eye, recording what colors pass through the image plane and where.

Earlier in the semester we decided to introduce a notion of eye space. In eye space, the eye is the center of the universe, positioned at $[0, 0, 0]$ and looking down $[0, 0, -1]$. These conventions still stand in our discussion today.

Viewing Frustum

Since we’re not dealing with reality, we can take some liberties with how eyes work. Our first liberty will be to move the image plane in front of the lens, which totally wouldn’t work in reality but which will make things easier for us. We call the distance from the eye to the image plane $\mathrm{near}$. The image plane is no longer flipped.

Our second liberty will be to constrain how far we can see by establishing a $\mathrm{far}$ distance. Any object past this far clipping plane will not be projected on the image plane. This is not how the physical world behaves. In the physical world, things just get really small or disappear behind the curvature of Earth. In graphics, we implement far clipping mostly as an optimization. If something’s barely visible, we want to spend little time processing it. However, we must be careful. If $\mathrm{far}$ is too small, the user will notice objects popping into view at the horizon.

With the near and far clipping planes pinned down, we must determine how much of the world we want to see. There are various ways we can establish this periphery. Our approach will be to define the vertical field of view from the eye to the image plane. Once we have $\mathrm{near}$ and $\mathrm{fov}_y$, we can determine the eye space coordinate of the image plane’s top:

$$\begin{aligned}\frac{\mathrm{top}}{\mathrm{near}} &= \tan \frac{\mathrm{fov}_y}{2} \\\mathrm{top} &= \tan \frac{\mathrm{fov}_y}{2} \cdot \mathrm{near}\end{aligned}$$

How much of the world can we see to the right? Since the image plane will fit inside a window with a fixed aspect ratio, we want our visual field to conform to that same aspect ratio.

$$\begin{aligned}\frac{\mathrm{right}}{\mathrm{top}} &= \mathrm{aspect\ ratio} \\\mathrm{right} &= \mathrm{aspect\ ratio} \cdot \mathrm{top} \\\end{aligned}$$

The visual field is centered around the eye’s line of sight. We get the other edges of the image plane for free.

$$\begin{aligned}\mathrm{bottom} &= -\mathrm{top} \\\mathrm{left} &= -\mathrm{right}\end{aligned}$$
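The three formulas above can be sketched in a few lines of code. This is only an illustration in plain Python, not the renderer's actual code. The function name `frustum_extents` and the choice to accept the field of view in degrees are assumptions made for this example.

```python
import math

def frustum_extents(fov_y_degrees, aspect_ratio, near):
    """Return (left, right, bottom, top) of the image plane in eye space."""
    # top / near = tan(fov_y / 2)
    half_angle = math.radians(fov_y_degrees) / 2
    top = math.tan(half_angle) * near
    # right / top = aspect ratio
    right = aspect_ratio * top
    # The visual field is centered on the line of sight.
    return -right, right, -top, top

# A 90-degree vertical field of view, a window twice as wide as it is
# tall, and an image plane 1 unit in front of the eye.
left, right, bottom, top = frustum_extents(90, 2.0, 1.0)
print(left, right, bottom, top)
```

With these inputs the image plane spans roughly $[-2, 2]$ horizontally and $[-1, 1]$ vertically, give or take floating-point error in the tangent.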

The four numbers $\mathrm{near}$, $\mathrm{far}$, $\mathrm{fov}_y$, and $\mathrm{aspect\ ratio}$ are all we need to establish the chunk of the world that we want to see. This chunk is called the viewing frustum. To avoid judgment, be careful not to put an extra R in “frustum.”

Plane Projection

With our frustum defined, we are ready to figure out where on the image plane each vertex projects. Let’s assume we have the vertex’s eye space position stored in $\mathbf{p}_\mathrm{eye}$. We want the projected position $\mathbf{p}_\mathrm{plane}$.

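We know the z-component without much effort. The image plane sits $\mathrm{near}$ units in front of the eye, and the eye looks down the negative z-axis:

$$z_\mathrm{plane} = -\mathrm{near}$$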

By setting up similar triangles, we figure out the y-component:

$$\begin{aligned}\frac{y_\mathrm{plane}}{\mathrm{near}} &= \frac{y_\mathrm{eye}}{-z_\mathrm{eye}} \\y_\mathrm{plane} &= \mathrm{near} \cdot \frac{y_\mathrm{eye}}{-z_\mathrm{eye}} \\\end{aligned}$$

The x-component is computed similarly:

$$\begin{aligned}x_\mathrm{plane} &= \mathrm{near} \cdot \frac{x_\mathrm{eye}}{-z_\mathrm{eye}} \\\end{aligned}$$
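To see the similar triangles at work, here is a minimal sketch of the x- and y-projection in plain Python. The function name `project_to_image_plane` and the guard against points behind the eye are assumptions added for this example.

```python
def project_to_image_plane(p_eye, near):
    """Project an eye space point onto the image plane; return (x, y)."""
    x_eye, y_eye, z_eye = p_eye
    # The eye looks down the negative z-axis, so visible points have z < 0.
    if z_eye >= 0:
        raise ValueError("point must be in front of the eye (negative z)")
    scale = near / -z_eye
    return x_eye * scale, y_eye * scale

# The same x and y at two different depths.
closer = project_to_image_plane((1.0, 2.0, -2.0), 1.0)   # (0.5, 1.0)
farther = project_to_image_plane((1.0, 2.0, -8.0), 1.0)  # (0.125, 0.25)
print(closer, farther)
```

The point that is four times farther away lands four times closer to the center of the image plane. That's the shrinking we were after.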

Normalized Device Coordinates

Recall that WebGL cares nothing of our notional spaces. It only knows about the $[-1, 1]$ space. We need to fit our $\mathbf{p}_\mathrm{plane}$ into this normalized space. We want coordinates at the top of the frustum to map to 1. We want coordinates at the right of the frustum to map to 1. We know what the coordinates of these edges are; we computed them earlier as $\mathrm{top}$ and $\mathrm{right}$. To normalize, we divide by these values.

$$\begin{aligned}x_\mathrm{ndc} &= \frac{\mathrm{near}}{\mathrm{right}} \cdot \frac{x_\mathrm{eye}}{-z_\mathrm{eye}} \\y_\mathrm{ndc} &= \frac{\mathrm{near}}{\mathrm{top}} \cdot \frac{y_\mathrm{eye}}{-z_\mathrm{eye}} \\\end{aligned}$$
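As a quick sanity check, we can code up these two formulas and feed them a point that sits on the top-right edge of the frustum. This is a sketch in plain Python, and the function name `eye_to_ndc_xy` is made up for this example.

```python
def eye_to_ndc_xy(p_eye, near, right, top):
    """Map an eye space point to its x and y normalized device coordinates."""
    x_eye, y_eye, z_eye = p_eye
    x_ndc = (near / right) * (x_eye / -z_eye)
    y_ndc = (near / top) * (y_eye / -z_eye)
    return x_ndc, y_ndc

# Image plane at distance 1 with right = 2 and top = 1. At depth 4 the
# frustum is four times as big, so its top-right edge is at (8, 4).
corner = eye_to_ndc_xy((8.0, 4.0, -4.0), near=1.0, right=2.0, top=1.0)
center = eye_to_ndc_xy((0.0, 0.0, -4.0), near=1.0, right=2.0, top=1.0)
print(corner, center)  # (1.0, 1.0) (0.0, 0.0)
```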

We ignore the z-component for the moment because it’s messy.

Matrix and Perspective Divide

Our transformation pipeline is built around matrices. We want to find a matrix that transforms our eye space coordinates into normalized device coordinates. We want this to happen:

$$\begin{bmatrix}? & ? & ? & ? \\? & ? & ? & ? \\? & ? & ? & ? \\? & ? & ? & ? \\\end{bmatrix} \cdot \begin{bmatrix}x_\mathrm{eye} \\y_\mathrm{eye} \\z_\mathrm{eye} \\1\end{bmatrix} = \begin{bmatrix}\frac{\mathrm{near}}{\mathrm{right}} \cdot \frac{x_\mathrm{eye}}{-z_\mathrm{eye}} \\\frac{\mathrm{near}}{\mathrm{top}} \cdot \frac{y_\mathrm{eye}}{-z_\mathrm{eye}} \\? \\1\end{bmatrix}$$

What row when dotted with the eye space position will produce $x_\mathrm{ndc}$? None. A dot product only sums scaled components, so it’s not possible to bring both $x_\mathrm{eye}$ and $z_\mathrm{eye}$ into the same term, let alone divide one by the other. What will we do? Our matrix system is broken.

Never fear. The GPU designers snuck in a hack. They decided that instead of targeting NDC space directly, we can target an intermediate space. After we emit a position in this intermediate space, the GPU will divide all components of the position by the w-component. Since our normalized device coordinates have $-z_\mathrm{eye}$ in their denominator, that’s what we want to drop into our w-component.

$$\begin{bmatrix}? & ? & ? & ? \\? & ? & ? & ? \\? & ? & ? & ? \\? & ? & ? & ? \\\end{bmatrix} \cdot \begin{bmatrix}x_\mathrm{eye} \\y_\mathrm{eye} \\z_\mathrm{eye} \\1\end{bmatrix} = \begin{bmatrix}\frac{\mathrm{near}}{\mathrm{right}} \cdot x_\mathrm{eye} \\\frac{\mathrm{near}}{\mathrm{top}} \cdot y_\mathrm{eye} \\? \\-z_\mathrm{eye}\end{bmatrix}$$

The division by $w$ is called the perspective divide. It lands us at the normalized device coordinates that we want:

$$\begin{bmatrix}\frac{\mathrm{near}}{\mathrm{right}} \cdot x_\mathrm{eye} \\\frac{\mathrm{near}}{\mathrm{top}} \cdot y_\mathrm{eye} \\? \\-z_\mathrm{eye}\end{bmatrix} \cdot \frac{1}{-z_\mathrm{eye}} = \begin{bmatrix}\frac{\mathrm{near}}{\mathrm{right}} \cdot \frac{x_\mathrm{eye}}{-z_\mathrm{eye}} \\\frac{\mathrm{near}}{\mathrm{top}} \cdot \frac{y_\mathrm{eye}}{-z_\mathrm{eye}} \\? \\1\end{bmatrix}$$

The intermediate space right before the perspective divide is called clip space. It’s in clip space that the GPU performs clipping, deciding whether or not a vertex is in the viewing frustum.
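The GPU does both of these jobs for us, but a rough sketch may make them less mysterious. The code below assumes the usual OpenGL and WebGL convention that a clip space position is inside the frustum when its x-, y-, and z-components all lie in $[-w, w]$. The function names are made up, and real hardware clips whole primitives rather than testing lone vertices.

```python
def is_inside_clip_volume(p_clip):
    """Test a clip space position against -w <= x, y, z <= w."""
    x, y, z, w = p_clip
    return all(-w <= component <= w for component in (x, y, z))

def perspective_divide(p_clip):
    """Divide a clip space position by w to land in NDC space."""
    x, y, z, w = p_clip
    return x / w, y / w, z / w

inside = (1.0, -2.0, 0.5, 4.0)
outside = (5.0, 0.0, 0.0, 4.0)  # x is bigger than w

print(is_inside_clip_volume(inside))   # True
print(is_inside_clip_volume(outside))  # False
print(perspective_divide(inside))      # (0.25, -0.5, 0.125)
```

Testing against $w$ before dividing is equivalent to testing against $[-1, 1]$ after dividing, at least when $w$ is positive.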

The perspective divide frees us up to deduce a few rows of our perspective transformation matrix. The x- and y-components are scaled, and the bottom row selects out and negates the z-component to form the correct $w$:

$$\begin{bmatrix}\frac{\mathrm{near}}{\mathrm{right}} & 0 & 0 & 0 \\0 & \frac{\mathrm{near}}{\mathrm{top}} & 0 & 0 \\? & ? & ? & ? \\0 & 0 & -1 & 0 \\\end{bmatrix} \cdot \begin{bmatrix}x_\mathrm{eye} \\y_\mathrm{eye} \\z_\mathrm{eye} \\1\end{bmatrix} = \begin{bmatrix}\frac{\mathrm{near}}{\mathrm{right}} \cdot x_\mathrm{eye} \\\frac{\mathrm{near}}{\mathrm{top}} \cdot y_\mathrm{eye} \\? \\-z_\mathrm{eye}\end{bmatrix}$$

Row 3

All that’s left is row 3 of the matrix. We know that this dot product operation is going to happen:

$$\begin{bmatrix}? & ? & ? & ?\end{bmatrix} \cdot \begin{bmatrix}x_\mathrm{eye} &y_\mathrm{eye} &z_\mathrm{eye} &1\end{bmatrix} = z_\mathrm{clip}$$

But we must reason out what the unknowns should be. A position’s $z_\mathrm{clip}$ does not depend on its x- or y-components, so we can fill in a couple values:

$$\begin{bmatrix}0 & 0 & ? & ?\end{bmatrix} \cdot \begin{bmatrix}x_\mathrm{eye} &y_\mathrm{eye} &z_\mathrm{eye} &1\end{bmatrix} = z_\mathrm{clip}$$

It’s not clear what the other two unknowns should be. Let’s name them so we can do some algebra:

$$\begin{bmatrix}0 & 0 & a & b\end{bmatrix} \cdot \begin{bmatrix}x_\mathrm{eye} &y_\mathrm{eye} &z_\mathrm{eye} &1\end{bmatrix} = z_\mathrm{clip}$$

Let’s multiply the dot product through to simplify:

$$a \cdot z_\mathrm{eye} + b = z_\mathrm{clip}$$

The perspective divide will get applied. The divide lands us in NDC space. Let’s apply this division:

$$\frac{a \cdot z_\mathrm{eye} + b}{-z_\mathrm{eye}} = z_\mathrm{ndc}$$

Our two unknowns are still unknown. However, we have a couple of truths that will help us resolve them. We know what $z_\mathrm{ndc}$ should be at the two clipping planes. At $z_\mathrm{eye} = -\mathrm{near}$, it should be $-1$. At $z_\mathrm{eye} = -\mathrm{far}$, it should be $1$:

$$\begin{aligned}\frac{a \cdot -\mathrm{near} + b}{\mathrm{near}} &= -1 \\\frac{a \cdot -\mathrm{far} + b}{\mathrm{far}} &= 1 \\\end{aligned}$$

Two equations with two unknowns is a linear system that we can solve. Let’s solve the first equation for $b$:

$$\begin{aligned}\frac{a \cdot -\mathrm{near} + b}{\mathrm{near}} &= -1 \\a \cdot -\mathrm{near} + b &= -\mathrm{near} \\b &= -\mathrm{near} - a \cdot -\mathrm{near} \\ &= a \cdot \mathrm{near} - \mathrm{near} \\\end{aligned}$$

Now we drop this expression for $b$ into the second equation and solve for $a$:

$$\begin{aligned}\frac{a \cdot -\mathrm{far} + b}{\mathrm{far}} &= 1 \\\frac{a \cdot -\mathrm{far} + a \cdot \mathrm{near} - \mathrm{near}}{\mathrm{far}} &= 1 \\a \cdot -\mathrm{far} + a \cdot \mathrm{near} - \mathrm{near} &= \mathrm{far} \\a \cdot -\mathrm{far} + a \cdot \mathrm{near} &= \mathrm{near} + \mathrm{far} \\a(\mathrm{near} - \mathrm{far}) &= \mathrm{near} + \mathrm{far} \\a &= \frac{\mathrm{near} + \mathrm{far}}{\mathrm{near} - \mathrm{far}} \\\end{aligned}$$

We drop this expression for $a$ back into the equation for $b$ and simplify:

$$\begin{aligned}b &= a \cdot \mathrm{near} - \mathrm{near} \\ &= \frac{\mathrm{near} + \mathrm{far}}{\mathrm{near} - \mathrm{far}} \cdot \mathrm{near} - \mathrm{near} \\ &= \mathrm{near} \cdot \left(\frac{\mathrm{near} + \mathrm{far}}{\mathrm{near} - \mathrm{far}} - 1\right) \\ &= \mathrm{near} \cdot \left(\frac{\mathrm{near} + \mathrm{far}}{\mathrm{near} - \mathrm{far}} - \frac{\mathrm{near} - \mathrm{far}}{\mathrm{near} - \mathrm{far}}\right)\\ &= \mathrm{near} \cdot \frac{\mathrm{near} + \mathrm{far} - \mathrm{near} + \mathrm{far}}{\mathrm{near} - \mathrm{far}} \\ &= \mathrm{near} \cdot \frac{2 \cdot \mathrm{far}}{\mathrm{near} - \mathrm{far}} \\ &= \frac{2 \cdot \mathrm{near} \cdot \mathrm{far}}{\mathrm{near} - \mathrm{far}} \\\end{aligned}$$
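Algebra this long deserves a numeric check. The sketch below, in plain Python with made-up function names, computes $a$ and $b$ and confirms that the near plane lands at $-1$ and the far plane at $1$. The values $\mathrm{near} = 1$ and $\mathrm{far} = 9$ are arbitrary picks that keep the arithmetic exact.

```python
def depth_coefficients(near, far):
    """Return the two unknowns (a, b) from row 3 of the matrix."""
    a = (near + far) / (near - far)
    b = 2 * near * far / (near - far)
    return a, b

def z_ndc(z_eye, near, far):
    """Apply row 3 and then the perspective divide by -z_eye."""
    a, b = depth_coefficients(near, far)
    z_clip = a * z_eye + b
    return z_clip / -z_eye

near, far = 1.0, 9.0
print(depth_coefficients(near, far))  # (-1.25, -2.25)
print(z_ndc(-near, near, far))        # -1.0
print(z_ndc(-far, near, far))         # 1.0
```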

Whew. That algebra plugs in the last two holes in our matrix. Altogether, our perspective transformation looks like this:

$$\begin{bmatrix}\frac{\mathrm{near}}{\mathrm{right}} & 0 & 0 & 0 \\0 & \frac{\mathrm{near}}{\mathrm{top}} & 0 & 0 \\0 & 0 & \frac{\mathrm{near} + \mathrm{far}}{\mathrm{near} - \mathrm{far}} & \frac{2 \cdot \mathrm{near} \cdot \mathrm{far}}{\mathrm{near} - \mathrm{far}} \\0 & 0 & -1 & 0 \\\end{bmatrix} \cdot \begin{bmatrix}x_\mathrm{eye} \\y_\mathrm{eye} \\z_\mathrm{eye} \\1\end{bmatrix} = \begin{bmatrix}x_\mathrm{clip} \\y_\mathrm{clip} \\z_\mathrm{clip} \\-z_\mathrm{eye}\end{bmatrix}$$
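Here is one way the whole matrix might be assembled and applied. Treat it as a sketch under assumptions rather than a library's real API: the function names are invented, the matrix is a row-major list of lists, and the perspective divide is done by hand even though the GPU normally does it for us. A matrix headed for a WebGL uniform would need to be flattened in column-major order.

```python
import math

def perspective_matrix(fov_y_degrees, aspect_ratio, near, far):
    """Build the 4x4 perspective matrix as a row-major list of rows."""
    top = math.tan(math.radians(fov_y_degrees) / 2) * near
    right = aspect_ratio * top
    return [
        [near / right, 0, 0, 0],
        [0, near / top, 0, 0],
        [0, 0, (near + far) / (near - far), 2 * near * far / (near - far)],
        [0, 0, -1, 0],
    ]

def transform(matrix, p):
    """Multiply a 4x4 matrix by a 4-vector, one dot product per row."""
    return [sum(m * c for m, c in zip(row, p)) for row in matrix]

def eye_to_ndc(matrix, p_eye):
    """Go from eye space to clip space, then divide by w."""
    x, y, z, w = transform(matrix, [*p_eye, 1.0])
    return x / w, y / w, z / w

eye_to_clip = perspective_matrix(90, 1.0, 1.0, 9.0)
print(eye_to_ndc(eye_to_clip, (0.0, 0.0, -1.0)))  # on the near plane, z is about -1
print(eye_to_ndc(eye_to_clip, (0.0, 0.0, -9.0)))  # on the far plane, z is about 1
print(eye_to_ndc(eye_to_clip, (2.0, 2.0, -2.0)))  # top-right edge, x and y are about 1
```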


The perspective matrix is complex enough that many of us would prefer to jam it into a library and only ever revisit it as a client. However, knowledge of its construction and behavior is useful at times.

Most importantly, some knowledge of the perspective matrix is essential for our upcoming discussion of projective texturing and shadows.

See you next time.


P.S. It’s time for a haiku!

It’s shrinking each year
The Washington Monument
From all the pinching