We traditionally think of images as a rectangular grid, where a color is specified at every grid point, or pixel. While a user might notice each individual pixel at smaller scales, at larger scales these pixels are virtually invisible, as shown in the example below. We’ll talk about some of the artifacts this results in later on in the course. (Interesting reading: "A Pixel Is Not a Little Square" by Alvy Ray Smith)


How do we actually represent this data in our computer? Colors are almost always represented in RGB format - many colors are representable by combinations of red, green, and blue channels. You might also occasionally see a fourth alpha channel that represents transparency, but that isn’t quite as important in the context of this course.

Aside from RGB, there are a number of other color representations, such as HSV or CIELAB, that claim to better represent the human visual system. However, we’ll mostly ignore these other formats because our displays operate directly on RGB - every pixel on your screen is made up of a red subpixel, a green subpixel, and a blue subpixel.

As we’ll see later on in the class, it turns out that RGB actually isn’t quite enough to represent all the colors that a human eye can see, so people over the years have tried introducing an additional yellow channel, including building displays that also have yellow subpixels. Unfortunately, not enough content was created that used the extra yellow channel, so most of these efforts ultimately failed.

Numerical Data Types

Traditionally, we restrict each of our color channels to the floating point range [0 … 1]. However, many image file formats and displays only support 8 bits, or 1 byte, per channel. We thus often represent each color channel as an 8 bit unsigned integer in the range [0 … 255]. In C++, we write this with the data type uint8_t (declared in the <cstdint> header). Be aware of what data type you're working with at any given moment!
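To make the two representations concrete, here is a minimal sketch of converting between them. The function names float_to_byte and byte_to_float are made up for illustration, and clamping out-of-range inputs and rounding to the nearest integer are assumptions; your project code may handle these cases differently. The sketch assumes C++17 (for std::clamp) and that the input is not NaN.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Convert a floating point channel value in [0, 1] to an 8 bit value in [0, 255].
// Assumption: out-of-range inputs are clamped rather than wrapped.
inline std::uint8_t float_to_byte(float value) {
    float clamped = std::clamp(value, 0.0f, 1.0f);
    // Scale to [0, 255] and round to the nearest integer.
    long rounded = std::lround(clamped * 255.0f);
    assert(rounded >= 0 && rounded <= 255);
    return static_cast<std::uint8_t>(rounded);
}

// Convert an 8 bit channel value in [0, 255] back to a float in [0, 1].
inline float byte_to_float(std::uint8_t value) {
    return static_cast<float>(value) / 255.0f;
}
```

For example, float_to_byte(1.0f) gives 255 and byte_to_float(255) gives 1.0f. Going from float to byte and back generally does not return the original float, since only 256 distinct values survive the conversion.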

It turns out that linearly quantizing color data to 8 bits and displaying that on a screen loses a lot of fine details that our human eyes would normally be able to distinguish. As a result, people have started building high dynamic range, or HDR, displays that are capable of displaying more than 8 bits per channel. On the flip side, people have also developed HDR tone-mapping techniques that aim to intelligently map our floating point color data to 8 bits, to more accurately capture the contrast that our human eyes can see.
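As one concrete illustration of the tone-mapping idea, the sketch below uses the simple global Reinhard operator, x / (1 + x), which compresses any non-negative value into [0, 1) before quantizing to 8 bits. These notes do not name a specific technique, so this choice, along with the function names reinhard and tone_map_to_byte, is an assumption made purely for illustration. Practical tone mappers are considerably more involved.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Simple global Reinhard operator: maps [0, infinity) into [0, 1).
// Assumption: the input is a non-negative linear intensity.
inline float reinhard(float hdr) {
    assert(hdr >= 0.0f);
    return hdr / (1.0f + hdr);
}

// Tone map a non-negative floating point value, then quantize to 8 bits.
inline std::uint8_t tone_map_to_byte(float hdr) {
    return static_cast<std::uint8_t>(std::lround(reinhard(hdr) * 255.0f));
}
```

Unlike a plain clamp, which would send every value above 1 to 255, this mapping keeps bright values distinguishable from one another, at the cost of darkening mid-range values.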


Indexing into Images

With that in mind, how do we actually represent the whole grid of colors that composes the entire image? To demonstrate, imagine the image as a 2D coordinate system, where [0, 0] is the top left corner (this is common practice - don’t use the bottom left!). We can think of row 0 as the first row from the top, row 1 as the second row, and so forth. Similarly, we can think of column 0 as the first column from the left, column 1 as the second column, and so forth. Thus, a pixel at x = 50, y = 100 would be the pixel at column 50 from the left, row 100 from the top. Don’t get confused - x = column, y = row.

[Figure: coordinate system]

An intuitive way to represent this is as a 2D array with dimensions [height][width] - the first dimension corresponds to the row (y), and the second dimension corresponds to the column (x). We call this row-major format. Some programs instead use column-major format, where the first dimension corresponds to the column (x), and the second dimension corresponds to the row (y). Be careful about which format your libraries and programs use!

However, the memory inside your computer is fundamentally 1 dimensional, not 2 dimensional. Thus, another common practice is to flatten this 2D data structure into a one dimensional array of length width * height. As we’ll talk about in discussion, we place each row directly after the previous one in this flattened data structure. Thus, a pixel at position [x, y] is located at index y * width + x in our buffer. If we incorporate the 3 RGB color channels at 8 bits each, every pixel takes 3 bytes to represent and the buffer has length width * height * 3. If we assume R = 0, G = 1, and B = 2, a given color channel of the pixel is located at (y * width + x) * 3 + channel. If we include the fourth alpha channel, it is located at (y * width + x) * 4 + channel instead.
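The indexing formula above can be sketched as a small C++ struct. The Image type and its member names are hypothetical and made up for illustration; they are not taken from any course code or library. The sketch assumes a row-major buffer with interleaved 8 bit channels, as described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical image container, for illustration only.
struct Image {
    std::size_t width;
    std::size_t height;
    std::size_t channels;            // 3 for RGB, 4 for RGBA
    std::vector<std::uint8_t> data;  // flattened buffer, width * height * channels bytes

    Image(std::size_t w, std::size_t h, std::size_t c)
        : width(w), height(h), channels(c), data(w * h * c, std::uint8_t{0}) {}

    // Offset of one channel of the pixel at column x, row y.
    std::size_t index(std::size_t x, std::size_t y, std::size_t channel) const {
        assert(x < width && y < height && channel < channels);
        return (y * width + x) * channels + channel;
    }

    // Read/write access to one channel of the pixel at column x, row y.
    std::uint8_t& at(std::size_t x, std::size_t y, std::size_t channel) {
        return data[index(x, y, channel)];
    }
};
```

For a 4 by 3 RGB image, the green channel (channel 1) of the pixel at x = 2, y = 1 sits at (1 * 4 + 2) * 3 + 1 = 19. Note that the arguments are ordered x, then y, even though rows are laid out first in memory; mixing these up is a common source of bugs.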

One thing to keep in mind during your projects in this class is that images are huge. A 1920 by 1280 image contains 2,457,600 pixels. With three color channels, that’s 7,372,800 values, which is a whole load of data. Thus, when working on your projects, try to start with smaller examples, to make it easier to debug your code.
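The arithmetic above is easy to check in code. The helper image_bytes below is a hypothetical name used only for this sketch, and the 4 bytes per channel figure for floating point data assumes a typical 32 bit float.

```cpp
#include <cstddef>

// Number of bytes needed to store an image with the given dimensions.
constexpr std::size_t image_bytes(std::size_t width, std::size_t height,
                                  std::size_t channels,
                                  std::size_t bytes_per_channel) {
    return width * height * channels * bytes_per_channel;
}

// 1920 x 1280 RGB at 1 byte per channel: 7,372,800 bytes.
static_assert(image_bytes(1920, 1280, 3, 1) == 7372800, "8 bit RGB size");
// The same image with 4 byte float channels: 29,491,200 bytes.
static_assert(image_bytes(1920, 1280, 3, 4) == 29491200, "float RGB size");
// A tiny 4 x 3 RGB test image, by contrast, is only 36 bytes.
static_assert(image_bytes(4, 3, 3, 1) == 36, "tiny test image size");
```

A 36 byte test image is small enough to print and inspect by hand, which is much harder to do with millions of values.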