Intro
Matrices are part of the bedrock of any game engine, and if you’re a game developer or graphics engineer, you’ve likely needed to manipulate, construct, and leverage matrices of many forms. Homogeneous matrices, in particular, are really worth having a complete grasp of. As with many things in math, there are many levels of understanding when it comes to matrices and linear algebra. This post is billed as a “tips and tricks” type of post, but the real goal here is to build the intuition behind these tricks.
A quick word on notation
I need to clarify some things about notation before actually diving into the math to make sure we’re on the same page.
Take this matrix for example:
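$$\begin{bmatrix} a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \\ c_1 & c_2 & c_3 \end{bmatrix}$$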
If I asked almost any math student in the world to identify the rows and columns of this matrix, they would be able to do so without any hesitation. In this case, the rows are, well, rows (consisting here of elements with the same letter), and the columns consist of vertical elements with the same subscript.
In the game dev community, however, some folks would ask strange questions like “are our matrices row-major or column-major” or “is this a DirectX codebase or an OpenGL codebase” and so on. The idea being, of course, that something about the storage convention might flip the notion of “rows” and “columns” about.
Rows are rows! Columns are columns! How I choose to lay out these matrix elements in memory has no bearing on what the rows and columns are. Furthermore, what I choose to represent within a matrix row or a matrix column also has no relation to the storage. Your storage convention may change the way matrix elements are presented in a debugger, perhaps, but this doesn’t affect your choice of what a matrix row or column means in your engine. I don’t blame folks for this confusion, but it’s unfortunate that a lot of posts I’ve seen in the wild actually write out the matrix “transposed” (the way you might see values presented in a debugger if you leverage a row-major storage convention), which certainly confuses matters.
Also, an m×n matrix is a matrix that has m rows and n columns. The convention here is to describe matrix shape by specifying the number of rows first, and the number of columns after. This is the case regardless of whether you choose to use a row-major or column-major storage convention. The math notation (thankfully) doesn’t care about the order in which you lay out bits in your address space.
With that out of the way, let’s get into some math!
Submatrices
Before diving into any specific tips related to homogeneous matrices, let’s start simple. If I asked you to multiply a 2x2 matrix and a vector, you’d probably be able to come up with the following easily enough.
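$$\begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} ax + by \\ cx + dy \end{bmatrix}$$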
Let me ask you though, what are the types of $a$, $b$, and all the other elements there? I imagine most people would assume these matrix elements are scalar numbers. What if I told you that all of these elements were complex numbers? It’s probably not difficult to see that the “answer” doesn’t change.
What if $a$, $b$, $c$, and $d$ were actually matrices? Does this still work?
Well, provided that $x$ and $y$ have the right shape, we should still expect this to work also. By “right shape” here, I really mean that $x$ and $y$ are allowed to be any matrix with two rows for this product to make sense. If we suppose then that $a$, $b$, $c$, and $d$ are themselves 2x2 matrices, and $x$ and $y$ are both two-component column vectors, our initial unassuming matrix product now looks like this:
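$$\begin{bmatrix} \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} & \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} \\ \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix} & \begin{bmatrix} d_{11} & d_{12} \\ d_{21} & d_{22} \end{bmatrix} \end{bmatrix} \begin{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \\ \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \end{bmatrix} = \begin{bmatrix} \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \\ \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} d_{11} & d_{12} \\ d_{21} & d_{22} \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \end{bmatrix}$$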
This certainly looks strange, but if you squint, you can still make out that the rough structure of the matrix multiplication is the same as before. If we write out the entire expression without the internal brackets, in fact, we see that what we have written is very similar to a more familiar 4x4 matrix-vector product, with “normal” scalar elements instead of matrix elements:
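$$\begin{bmatrix} a_{11} & a_{12} & b_{11} & b_{12} \\ a_{21} & a_{22} & b_{21} & b_{22} \\ c_{11} & c_{12} & d_{11} & d_{12} \\ c_{21} & c_{22} & d_{21} & d_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ y_1 \\ y_2 \end{bmatrix}$$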
I’ll leave it to you to confirm that performing this product will net you the same final results as the nested matrix version (albeit with fewer brackets).
“OK this is all very cute,” you might say (and indeed, it’s quite cute), “but who cares?”
Well, this sort of viewpoint turns out to be surprisingly handy! The main insight is that the matrix product really functions irrespective of what the elements are, provided that all nested operations performed between elements are well-defined. Matrices encode linear combinations, and if the elements behave, this sort of matrix nesting works.
To see one such application, let’s take a detour into homogeneous coordinates.
Homogeneous coordinate space recap
I’m going to assume you have at least a passing familiarity with homogeneous coordinates. By way of a brief recap, N-dimensional vectors can be thought of as arrows emanating from the origin. Given a 3D vector (a vector with 3 components) and a 3x3 matrix that acts on it, the result must always be another 3D vector, meaning that while we can scale the vector and move its tip around the origin, we can’t “move” the vector in space. There’s no place to stuff that information. As such, graphics programmers often operate on 4D vectors and 4x4 transformations, which enable us to encode translations (among other types of transformations).
Continuing our review, let’s take two vectors like the following:
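$$\begin{bmatrix} x \\ y \\ z \\ 0 \end{bmatrix} \qquad \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}$$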
In a standard homogeneous space, which vector represents a position, and which vector represents a direction?
I like to think of the w coordinate (4th coordinate) as the value that scales any translational effect applied to the vector prior to the projective divide. In this example, the left vector represents a direction, because we expect it to be invariant under any translation, and the right vector should be a position, because any translation applied has a direct impact on its final location. Let’s leverage our submatrix trick to see why.
Decomposing a 4x4 matrix
Let’s start with a 4x4 matrix that only performs rotation and translation, and consider applying it to a position.
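$$\begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} p_x \\ p_y \\ p_z \\ 1 \end{bmatrix}$$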
With some movie math magic, let’s turn this into something a bit simpler to work with:
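$$\begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix} \begin{bmatrix} \mathbf{p} \\ 1 \end{bmatrix}$$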
Here, $\mathbf{R}$ is a rotation matrix, $\mathbf{t}$ is a 3D column vector, $\mathbf{0}^\mathsf{T}$ is a row vector of zeroes, $\mathbf{p}$ is a 3D column vector, and the $1$ values are just scalars. Using our “trick” wherein matrix multiplication does not care about the types of the elements provided the operations performed are valid, we can now write the following:
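$$\begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix} \begin{bmatrix} \mathbf{p} \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{R}\mathbf{p} + \mathbf{t} \cdot 1 \\ \mathbf{0}^\mathsf{T}\mathbf{p} + 1 \cdot 1 \end{bmatrix} = \begin{bmatrix} \mathbf{R}\mathbf{p} + \mathbf{t} \\ 1 \end{bmatrix}$$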
Performing the same computation on a direction yields the following:
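$$\begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix} \begin{bmatrix} \mathbf{v} \\ 0 \end{bmatrix} = \begin{bmatrix} \mathbf{R}\mathbf{v} + \mathbf{t} \cdot 0 \\ 0 \end{bmatrix} = \begin{bmatrix} \mathbf{R}\mathbf{v} \\ 0 \end{bmatrix}$$

where $\mathbf{v}$ is the 3D part of the direction.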
Hopefully now, you can see why I say that the w coordinate in a homogeneous vector “scales any translation present” in a matrix that operates on it. You can literally see it doing just that above. Note that a general value in the w coordinate doesn’t quite behave as a translation scale once the perspective divide is performed (it actually scales the position before applying the rotation and translation; can you see why?). The point, though, is that being able to view matrices as smaller matrices with richer elements allowed us to arrive at this conclusion and perform the computation much more quickly.
Fast matrix inversions
A crime against matrix math I sometimes see is a general matrix inversion routine applied to a matrix product that admits a very straightforward direct inverse.
Suppose we have a transform that we know only has rotation and translation (scale is easy to handle too, but we’ll ignore it for now to keep the example simpler). How should we compute the inverse of this matrix $\mathbf{M}$?
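$$\mathbf{M} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix}$$

with $\mathbf{R}$ and $\mathbf{t}$ playing the same roles as before.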
Well, assuming you’ve memorized the common formula for a 2x2 matrix inverse, you might be tempted to write down the following almost immediately.
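$$\mathbf{M}^{-1} \overset{?}{=} \frac{1}{\mathbf{R} \cdot 1 - \mathbf{t}\,\mathbf{0}^\mathsf{T}} \begin{bmatrix} 1 & -\mathbf{t} \\ -\mathbf{0}^\mathsf{T} & \mathbf{R} \end{bmatrix}$$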
This result isn’t correct, because the standard formula you might have memorized for a 2x2 matrix was derived with scalar values and relies on operations that aren’t valid if the elements themselves aren’t all scalars.
As a reminder, in most contexts, the inverse of a 2x2 matrix is computed by swapping the diagonal elements, negating the off-diagonal elements, and then scaling the whole result by the inverse of the determinant. It's worth sitting down and actually working this out on paper every so often (there are several ways to arrive at this answer, and each approach conveys different insight).
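In symbols:

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$$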
OK, so that didn’t work, but how should we proceed? Here’s one way to do it. Start by writing down our identity invariant with some unknowns:
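$$\begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix} \begin{bmatrix} \mathbf{A} & \mathbf{b} \\ \mathbf{c} & d \end{bmatrix} = \begin{bmatrix} \mathbf{I} & \mathbf{0} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix}$$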
Here, $\mathbf{A}$, $\mathbf{b}$, $\mathbf{c}$, and $d$ are unknowns that constitute our desired inverse. Your job is to work out what these unknowns are. Feel free to try to work this out on paper before reading on!
I have been careful here to bold the values that are not actually scalar quantities. It's worth pausing for a moment to infer what the dimensionalities of each of the unknown elements are. There is a single unique choice that gives the equation as written meaning.
… elevator music
… ok back? Let’s see how you did. The equation written above can be distilled down into four equations in four unknowns.
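$$\begin{aligned} \mathbf{R}\mathbf{A} + \mathbf{t}\mathbf{c} &= \mathbf{I} && (1) \\ \mathbf{R}\mathbf{b} + \mathbf{t}d &= \mathbf{0} && (2) \\ \mathbf{0}^\mathsf{T}\mathbf{A} + \mathbf{c} &= \mathbf{0}^\mathsf{T} && (3) \\ \mathbf{0}^\mathsf{T}\mathbf{b} + d &= 1 && (4) \end{aligned}$$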
First, notice that $\mathbf{A}$ must be a 3x3 matrix, $\mathbf{b}$ must be a column vector, $\mathbf{c}$ is a row vector, and $d$ is a scalar value. In other words, the original matrix, the inverse, and the final identity matrix all have the same shape. This isn’t an accident, but is worth mulling over.
Equations (3) and (4) tell us that $\mathbf{c}$ is a zero vector and $d$ is one, so we can substitute these values in the first two equations and simplify:
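$$\mathbf{R}\mathbf{A} = \mathbf{I} \qquad \mathbf{R}\mathbf{b} = -\mathbf{t}$$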
Recalling that the inverse of a rotation matrix is just its transpose, we can now substitute our variables into the answer we were looking for:
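$$\mathbf{M}^{-1} = \begin{bmatrix} \mathbf{R}^\mathsf{T} & -\mathbf{R}^\mathsf{T}\mathbf{t} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix}$$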
Because our original matrix had no scale, we know that its determinant must have been 1. But this means the determinant of the inverse matrix must also be one. Can you confirm this?
Making observations like this as you go is a good habit to have, because each of these observations keep your calculations on track. If you happen to make an incorrect observation, you end up learning something along the way when trying to uncover your mistake. This can often be an even more valuable experience than the situation where everything just goes swimmingly.
Suffice it to say, computing the inverse of a rigid transformation in this way is both more numerically stable and faster than computing a general matrix inverse. To test your understanding, try to repeat the process, but replace $\mathbf{R}$ with a more general invertible 3x3 matrix and compute the inverse. What changes?
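To make this concrete, here’s a minimal sketch of such a routine in C++. The float3x4 struct and its row-oriented layout below are illustrative stand-ins, not the Rookery’s actual types:

// Invert a rigid rotation + translation transform stored as a
// 3x4 matrix, using the block result above: inverse = [R^T | -R^T t].
struct float3x4 {
    float m[3][4]; // three rows; column 3 of each row holds the translation
};

float3x4 rigid_inverse(const float3x4& in) {
    float3x4 out{};
    // The rotation block inverts by transposition.
    for (int r = 0; r < 3; ++r) {
        for (int c = 0; c < 3; ++c) {
            out.m[r][c] = in.m[c][r];
        }
    }
    // The translation column becomes -R^T * t.
    for (int r = 0; r < 3; ++r) {
        out.m[r][3] = -(out.m[r][0] * in.m[0][3] +
                        out.m[r][1] * in.m[1][3] +
                        out.m[r][2] * in.m[2][3]);
    }
    return out;
}

Note that there isn’t a determinant or a pivot anywhere in sight, which is exactly where the speed and stability win comes from.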
A different approach to the same inverse
Let’s compute that inverse again using a different mechanism. Let’s start by asking what the matrix looks like when we compose two separate operations: a rotation, followed by a translation.
A rotation without any translation would be encoded in the following way:
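$$\begin{bmatrix} \mathbf{R} & \mathbf{0} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix}$$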
And a translation without any rotation looks like:
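$$\begin{bmatrix} \mathbf{I} & \mathbf{t} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix}$$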
where the upper-left rotation was replaced with the identity matrix. The matrix that first rotates, and then translates is:
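$$\begin{bmatrix} \mathbf{I} & \mathbf{t} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix} \begin{bmatrix} \mathbf{R} & \mathbf{0} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix}$$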
Because we are representing our homogeneous vectors as columns, the rightmost matrix factor in a matrix product chain is the transform that is applied first. It’s sort of like how in code, two functions f and g compose from the inside out. g(f(x)) applies f first, and g second, so if this confuses you, think “inside-out” and not “left-to-right” or “right-to-left.”
From the expression above, let’s take the inverses of both sides:
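$$\begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix}^{-1} = \left( \begin{bmatrix} \mathbf{I} & \mathbf{t} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix} \begin{bmatrix} \mathbf{R} & \mathbf{0} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix} \right)^{-1}$$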
A super important fact to know (deep down in your soul) is that the inverse of a matrix product is the reverse-product of their inverses, since
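$$(\mathbf{A}\mathbf{B})(\mathbf{B}^{-1}\mathbf{A}^{-1}) = \mathbf{A}(\mathbf{B}\mathbf{B}^{-1})\mathbf{A}^{-1} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$$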
This means we can manipulate the product inverse to try to arrive at the entire inverse:
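$$\left( \begin{bmatrix} \mathbf{I} & \mathbf{t} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix} \begin{bmatrix} \mathbf{R} & \mathbf{0} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix} \right)^{-1} = \begin{bmatrix} \mathbf{R} & \mathbf{0} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{I} & \mathbf{t} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix}^{-1} = \begin{bmatrix} \mathbf{R}^\mathsf{T} & \mathbf{0} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix} \begin{bmatrix} \mathbf{I} & -\mathbf{t} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix} = \begin{bmatrix} \mathbf{R}^\mathsf{T} & -\mathbf{R}^\mathsf{T}\mathbf{t} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix}$$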
To nobody’s surprise, this is precisely the same result as before! Of course, it’s a useful confirmation, and yet another reminder that splitting a larger problem into smaller sub-problems through the use of identities is a useful technique.
Your engine should have a 3x4 “matrix” type
All this is to say, matrices that have a $\begin{bmatrix} 0 & 0 & 0 & 1 \end{bmatrix}$ row vector on the bottom really can be thought of in terms of two separate pieces:
- The upper-left 3x3 linear transformation matrix (the non-translational part)
- The upper-right 3x1 translation column vector
Both of these elements live comfortably in the context of a nice and simple 3x4 “super matrix”. As this matrix shape is so common, actually storing those zeroes and the one is fairly wasteful, and having these values be implicit in a representation is an immediate 25% reduction in memory used. I did this exact optimization in a previous shipped game and immediately recovered nearly 150 megabytes of memory at one point in production! This was, no doubt, a very extreme case since the savings scale with scene size, but I would say it’s non-optional for any engine that has a lot of stuff to render.
The Rookery has a float3x4 type that performs matrix operations, determinants, inverses, and so on with this submatrix view in mind. It also supports operations between float3x4 matrices and the more general float4x4 matrix by “promoting” the float3x4 to a full float4x4 (the optimizer realizes where the zeroes and ones are and can cut the final operation down a bit).
You might be curious whether I chose to adopt a column-major storage convention or a row-major one for my matrices. I chose a column-major storage convention (so, columns are stored contiguously in memory, with earlier columns present at lower memory addresses).
This convention is really handy because I often want to do operations on the 3x3 submatrix of a 3x4, and also often want to extract or assign a specific column far more often than a specific row. Promoting a 3x3 matrix to a 3x4 matrix or a 3x4 matrix to a 4x4 matrix is also a single memcpy.
Rookery is right-handed-z-up-column-major gang 😎
It doesn't really matter at the end of the day of course, since all those values will end up in CPU or GPU registers one way or another, but I have appreciated the small affordances this choice has given me. The choice to work primarily with column or row subspaces is a personal one. However, I do encourage you to be deliberate about this choice, whatever it ends up being.
Let’s make a look_at function
Deriving the formula for a “look at” camera matrix is a good exercise that uses what we’ve covered so far.
The look_at function accepts the following arguments:
- View subject position
- View camera position
- Up direction
For our purposes, we’ll suppose that “up” is generally oriented along the positive unit z-vector, and the coordinate system is right-handed. These three arguments uniquely determine the three directions of our camera coordinate frame $\mathbf{f}$, $\mathbf{r}$, and $\mathbf{u}$, which are the forward, right, and up directions, given in world-space coordinates still. Let’s also refer to the camera position as a lowercase $\mathbf{e}$ (for “eye”).
Every time I set up a new renderer, I go through the exercise of setting up view and camera projection matrices. I actually think it's worth going through this process manually, as opposed to blindly following any blog post, including this one. There are quite a few posts where there are a bunch of sign flips in a bunch of mysterious places and the math "works out" at the end, although I wouldn't say the steps are sufficiently motivated. It's your engine, and your coordinate spaces. Treat them with care!
From those arguments, the look_at function computes a camera matrix that has the following actions:
- The product with the forward direction is the positive unit z-vector direction.
- The product with the up direction is the positive unit y-vector direction.
- The product with the right direction is the positive unit x-vector direction.
- The product with the camera position itself is the zero vector.
I’m continuing to say “direction” and “position” so that we can be explicit about the intended value of the homogeneous coordinate when we work this out. Of course, you can set up your world or view coordinate system differently if you like. The setup above is just one of many choices.
In math notation, the matrix $\mathbf{M}$ needs to satisfy the following four equations.
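$$\mathbf{M}\begin{bmatrix} \mathbf{f} \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix} \qquad \mathbf{M}\begin{bmatrix} \mathbf{u} \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix} \qquad \mathbf{M}\begin{bmatrix} \mathbf{r} \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} \qquad \mathbf{M}\begin{bmatrix} \mathbf{e} \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}$$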
Let’s consolidate the four equations above into one matrix equation instead:
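$$\mathbf{M}\begin{bmatrix} \mathbf{r} & \mathbf{u} & \mathbf{f} & \mathbf{e} \\ 0 & 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$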
If rearranging expressions into matrices like this doesn’t feel super comfortable, you likely think of matrix products in an element-by-element fashion, where each element is the dot product between a row vector from the first matrix factor and a column vector from the second matrix factor. This isn’t a wrong interpretation, and definitely produces the correct result. However, a view that is often more helpful is one where each column of the matrix product is a linear combination of the columns of the first matrix factor, using the corresponding column of the second matrix factor as the weights of said linear combination.
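For a matrix-vector product (the single-column case), that view reads:

$$\begin{bmatrix} \mathbf{c}_1 & \mathbf{c}_2 & \mathbf{c}_3 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix} = w_1 \mathbf{c}_1 + w_2 \mathbf{c}_2 + w_3 \mathbf{c}_3$$

where the $\mathbf{c}_i$ are the columns of the first factor. Each equation in our consolidated system above picks out exactly one such combination.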
This interpretation of what matrix multiplication “does” isn’t unique to me, but is the view encouraged by several prominent linear algebra texts. I definitely don’t expect this appeal to authority to convince you, however, so this may be the subject of a future post on matrices. In the meantime, I encourage you to try and weigh this interpretation of matrix multiplication against the element-by-element one. Being able to swap between these views depending on which one is more useful in the moment is an extremely useful skill.
We’re just one hop away from solving for $\mathbf{M}$ now. Since the right-hand-side of the equation is just the identity, we can just multiply both sides of the equation on the right by the inverse of that second factor.
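$$\mathbf{M} = \begin{bmatrix} \mathbf{r} & \mathbf{u} & \mathbf{f} & \mathbf{e} \\ 0 & 0 & 0 & 1 \end{bmatrix}^{-1}$$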
Recognizing that the matrix $\begin{bmatrix} \mathbf{r} & \mathbf{u} & \mathbf{f} \end{bmatrix}$ has orthonormal columns (so, like a rotation matrix, its inverse is its transpose) and calling it $\mathbf{C}$, we now know we want the inverse of the same rotation/translation transform we’ve been working with this whole time! Writing down “the answer” one last time for good measure:
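$$\mathbf{M} = \begin{bmatrix} \mathbf{C} & \mathbf{e} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix}^{-1} = \begin{bmatrix} \mathbf{C}^\mathsf{T} & -\mathbf{C}^\mathsf{T}\mathbf{e} \\ \mathbf{0}^\mathsf{T} & 1 \end{bmatrix}$$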
In code, this is what this routine looks like in the Rookery:
// This version assumes a fixed world_up direction,
// and stipulates that the look direction is not
// colinear with world_up.
float3 forward = normalize(subject - eye);
float3 right = normalize(cross(forward, world_up));
float3 up = cross(right, forward);
// Assign the transpose of our coordinate frame to
// the upper-left 3x3 submatrix.
data.world_to_view_[0].x = right.x;
data.world_to_view_[0].y = up.x;
data.world_to_view_[0].z = forward.x;
data.world_to_view_[1].x = right.y;
data.world_to_view_[1].y = up.y;
data.world_to_view_[1].z = forward.y;
data.world_to_view_[2].x = right.z;
data.world_to_view_[2].y = up.z;
data.world_to_view_[2].z = forward.z;
// NOTE: The "principal submatrix" in this engine is the
// upper-left 3x3 matrix of a 3x4 or 4x4 matrix object.
data.world_to_view_[3] = mul(
data.world_to_view_.principal_submatrix(),
    -eye);
This wasn’t a lot of “work” to arrive at, but it’s worth mentioning that when I first sat down to work this out decades ago, it was really not that easy. The main thing that changed between then and now is the development of an intuition and “feel” for matrices that goes beyond “matrices are bags of numbers.” Math can sometimes be like that – excessive amounts of effort needed to build a foundation, but a huge payoff where what seemed like a hard problem at one point in time suddenly becomes solvable with a few scribbles.
My hope here is that the main thing you take away from this section is that matrices are very often not bags of numbers. They have an inherent structure, and as you develop an understanding of that structure, you will find manipulating matrices more natural and less error-prone. There is certainly an upfront cost to learning that structure, but this initial investment pays dividends!
To test your understanding, here are some things to consider:
- Our submatrix view was useful in computing the inverse and products. Does this help for computing determinants also?
- How does the inverse result we computed before change if the upper-left 3x3 submatrix isn’t a rotation matrix?
- Try implementing a float3x4 matrix abstraction that implicitly represents a 4x4 matrix with [0, 0, 0, 1] as the bottom row.
Recognizing matrices at a glance
Shifting gears a bit, another really useful skill for game devs and graphics programmers is the ability to recognize matrices in the various contexts they show up. While it isn’t helpful to think of matrices as “bags of numbers,” that’s certainly what the CPU and GPU view them as. When examining a graphics capture in RenderDoc, PIX, or any other debugger, it can be useful to look at bags of numbers and have some heuristics in mind to determine if various matrices appear plausibly correct or not.
Recognizing rotation matrices
Given the following matrix, how would you determine if it encodes a rigid rotation?
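Here’s one such bag of numbers (values chosen purely for illustration):

$$\begin{bmatrix} 0.866 & -0.5 & 0 \\ 0.5 & 1.25 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$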
If you spotted that this is not a rotation, great! Some key facts worth knowing about rotation matrices are:
- All column vectors have unit length.
- All row vectors have unit length.
- Each column vector is orthogonal to every other column vector.
- Each row vector is orthogonal to every other row vector.
- The inverse of a rotation matrix is equal to the transpose of the rotation matrix.
In this case, the middle element is greater than $1$, so we can immediately see that this matrix couldn’t possibly represent a rigid rotation.
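Now consider this one (values again chosen for illustration):

$$\begin{bmatrix} 0.866 & -0.5 & 0 \\ 0.5 & 0.866 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$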
This matrix, on the other hand, is a valid rotation. Feel free to check some of the properties mentioned previously.
Recognizing projection matrices
A projection matrix transforms view space to clip space, where (after the perspective divide) clip space spans the interval $[-1, 1]$ in the $x$ and $y$ directions, and the interval $[0, 1]$ in the $z$ direction (OpenGL spans $[-1, 1]$ in the $z$ direction also, but this is a very ill-advised choice due to how it shapes z-precision).
Consider the following matrix. Is it an orthographic projection matrix, or a perspective projection matrix?
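(Entries chosen just for illustration.)

$$\begin{bmatrix} 1.358 & 0 & 0 & 0 \\ 0 & 2.414 & 0 & 0 \\ 0 & 0 & 1.001 & -0.1 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$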
The easiest “trick” is to look at how the matrix produces the w coordinate. In this case, the bottom row is $\begin{bmatrix} 0 & 0 & 1 & 0 \end{bmatrix}$, so the matrix maps the z coordinate to the w coordinate, and we expect that this is a perspective projection matrix of some kind. Intuitively, we know that objects farther in the distance should appear smaller, and the perspective divide will divide the x and y coordinates by w.
An orthographic projection matrix, on the other hand, will keep w fixed at $1$ because objects should appear to be the same size at any distance from the camera. Ergo, for an orthographic projection matrix, the final matrix row would be $\begin{bmatrix} 0 & 0 & 0 & 1 \end{bmatrix}$.
Closing Thoughts
Matrices in games generally have some semantic meaning. They do things, like moving or rotating objects in space, changing object sizes, transferring coordinate systems and so on. Because these matrices do things, there is an inherent structure to them, and the matrices we work with actually have some very nice properties. While we can use a very general set of linear algebra routines to do our job, this is no substitute for a deeper appreciation for that structure.
I think what makes matrix math in games frustrating is that when things don’t work or there are bugs, the results tend to be catastrophic. Your first triangle doesn’t show up on screen at all perhaps. Or you’ve introduced NaN values or infinities somewhere. Debugging is quite tricky because you know the bag-of-numbers in your capture or in-memory is wrong somehow, but you aren’t quite sure how to fix it.
When I get stuck, I generally go back to the basics. It’s important to understand how the operations work on tiny examples, similar to how we set up the construction of the look_at camera matrix. Practice makes perfect, and matrix math and linear algebra are no exceptions. It’s tempting to just flip signs or operand orders and such until things appear to work, but this type of YOLO approach will definitely invite trouble down the road. Be patient, stop, and do things correctly before proceeding. That’s how the really complicated stuff gets put together.
If you’d like to brush up or learn linear algebra in a much more complete sense, here are some books I’ve leveraged in the past:
- Numerical Linear Algebra by Trefethen and Bau is an old book but an amazing reference I still refer to from time to time. It covers a very applied view of linear algebra that considers algorithm stability and numerical precision, in addition to performance. The book is from 1997, but still very relevant today!
- Linear Algebra Done Wrong by Sergei Treil (pdf link) is another delightful book. Compared to Numerical Linear Algebra, this book is more proof-based, so a bit “heavier” mathematically, while focusing on developing an intuition that leans more towards geometry as opposed to algebra. If the first book was what I leveraged to gain a practical working knowledge of the algorithms in LA, this is one of the books that I used to develop a good mathematical foundation.
- Linear Algebra Done Right by Sheldon Axler (pdf link) should be mentioned as well. LADR, compared to LADW, takes a different viewpoint wherein determinants are introduced very late, and most of the theoretical structure is built without determinants altogether. I don’t actually think determinants are bad; they are quite useful, in fact (the author would be inclined to agree, I believe). However, LADR is an excellent resource because it presents a familiar topic with a completely different viewpoint. As an aside, this is how I believe math should be studied. Many different lenses, many different perspectives, all consolidated together to produce a firm understanding.