How JPEG or JPG Image Compression Works

Archisman Karmakar
Dec 22, 2022

Updated: Dec 23, 2022


Content

Introduction
Working
Deep Dive

JPEG is a commonly used method of lossy compression for digital images, particularly for those images produced by digital photography. The degree of compression can be adjusted, allowing a selectable trade-off between storage size and image quality. JPEG typically achieves 10:1 compression with little perceptible loss in image quality. Since its introduction in 1992, JPEG has been the most widely used image compression standard in the world, and the most widely used digital image format, with several billion JPEG images produced every day as of 2015.

The term "JPEG" is an acronym for the Joint Photographic Experts Group, which created the standard in 1992. JPEG was largely responsible for the proliferation of digital images and digital photos across the Internet, and later social media.

JPEG compression is used in several image file formats. JPEG/Exif is the most common image format used by digital cameras and other photographic image capture devices; along with JPEG/JFIF, it is the most common format for storing and transmitting photographic images on the World Wide Web. These format variations are often not distinguished and are simply called JPEGs.

The MIME media type for JPEG is image/jpeg, except in older Internet Explorer versions, which provide a MIME type of image/pjpeg when uploading JPEG images. JPEG files usually have a filename extension of .jpg or .jpeg. JPEG/JFIF supports a maximum image size of 65,535×65,535 pixels, hence up to 4 gigapixels for an aspect ratio of 1:1. In 2000, the JPEG group introduced a format intended to be a successor, JPEG 2000, but it was unable to replace the original JPEG as the dominant image standard.

Working Principle

Lossy compression techniques like JPEG are designed to reduce the file size of digital images by discarding some of the data that is deemed less important to human perception. This is achieved by applying various techniques to transform and analyze the image data, and then selectively discarding the data that is less important.

JPEG compression is based on two psychovisual principles: changes in brightness are more important than changes in color, and low-frequency changes are more important than high-frequency changes. These principles are used to prioritize the data that is retained during the compression process. The human retina has about 120 million rod cells, which sense brightness, and only about 6 million cone cells, which sense color. As a result, a black-and-white image still looks detailed to us, but an image whose luminance has been removed, leaving only the color (chrominance) information, looks odd.

The black-and-white image looks as detailed as the color image.

But if the brightness (luminance) component is removed, the image looks odd.

To implement these principles, JPEG compression uses a series of steps to transform and analyze the image data, and then selectively discard the data that is deemed less important. First, the image is transformed from the RGB color space to the YCbCr color space, which separates the luminance (brightness) information from the chrominance (color) information. Then, the image is divided into small blocks of pixels called "DCT blocks," and the DCT (Discrete Cosine Transform) is applied to each block to transform it from the spatial domain to the frequency domain. This allows the image data to be analyzed based on its frequency content, with the low-frequency components being more important than the high-frequency components. Finally, the data is quantized and entropy-coded to reduce the file size. The resulting compressed data is then stored in the JPEG file format.

When the JPEG image is decompressed and displayed, the process is reversed: the compressed data is decompressed, and the image is transformed back to the RGB color space for display on the monitor. Despite the loss of some data during the compression process, JPEG is able to achieve a good balance between file size and image quality, making it a widely used image format for digital photography and the internet.

Deep Dive

Sampling

JPEG uses the YCbCr color space to represent images, which separates the luminance (brightness) and chrominance (color) information. The Y component represents the luminance (brightness) of the image, while the Cb and Cr components represent the chrominance (color) information.

In the RGB color space, each pixel is represented by three color channels: red, green, and blue. These channels are combined to create the full range of colors that can be displayed on a monitor. However, human eyes are more sensitive to changes in brightness than they are to changes in color, so JPEG uses the YCbCr color space to allocate more bits to the luminance (Y) component and fewer bits to the chrominance (Cb and Cr) components. This allows JPEG to compress the image more efficiently while still preserving most of the visual quality.

The YCbCr color space is a variant of the YUV color space, which is commonly used in video and image processing. It is based on the way that the human eye perceives color and luminance, and it is designed to be more efficient for image and video compression.

Typical computer images are stored in the RGB color space, with each pixel represented by three 8-bit values for the red, green, and blue components. JPEG instead encodes and compresses images in a different color space called YCbCr, which is often loosely referred to as YUV.

YCbCr is a "luminance-chrominance" color space, which means that it separates the brightness (luminance) information from the color (chrominance) information. The Y component represents the luminance or brightness of the image, while the Cb and Cr components represent the chrominance or color information.

To convert an image from the RGB color space to the YCbCr color space, JPEG uses a mathematical transformation that separates the brightness and color information. This allows JPEG to allocate more bits to represent the luminance component, which is more important for image quality, while using fewer bits for the chrominance components. The YCbCr image then goes through the lossy steps described in the following sections (downsampling, DCT, and quantization), followed by lossless entropy coding, to reduce the file size as much as possible. When the image is decompressed and displayed, the YCbCr components are converted back to the RGB color space for display on the monitor.

Converting from RGB to YCbCr:

Y  =    0.299*R  + 0.587*G  + 0.114*B
Cb = - 0.1687*R - 0.3313*G    + 0.5*B + 128
Cr =      0.5*R - 0.4187*G - 0.0813*B + 128

Converting from YCbCr to RGB:

R = Y                      + 1.402*(Cr-128)
G = Y - 0.34414*(Cb-128) - 0.71414*(Cr-128)
B = Y   + 1.772*(Cb-128)

Note: These values must be rounded and clamped to the range 0–255.
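The two conversions can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the blue weight of Y is taken as 0.114 so that the three Y weights add up to 1.

```python
def clamp_byte(value):
    """Round to the nearest integer and clamp to the 0-255 range."""
    return max(0, min(255, int(round(value))))


def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to 8-bit YCbCr."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.1687 * r - 0.3313 * g + 0.5 * b + 128
    cr = 0.5 * r - 0.4187 * g - 0.0813 * b + 128
    return clamp_byte(y), clamp_byte(cb), clamp_byte(cr)


def ycbcr_to_rgb(y, cb, cr):
    """Convert one 8-bit YCbCr pixel back to 8-bit RGB."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.34414 * (cb - 128) - 0.71414 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp_byte(r), clamp_byte(g), clamp_byte(b)


print(rgb_to_ycbcr(255, 255, 255))  # white: (255, 128, 128)
print(rgb_to_ycbcr(255, 0, 0))      # pure red: (76, 85, 255)
print(ycbcr_to_rgb(*rgb_to_ycbcr(200, 120, 40)))  # close to (200, 120, 40)
```

Because both directions round to whole numbers, a round trip can be off by a unit or so per channel. Pure red shows why the clamp is needed: its Cr value comes out slightly above 255 before clamping.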

Scaling Down (Down Sampling)

After transforming RGB values into YCbCr values, Cb and Cr are downsampled by a factor of 2 (or 4) in each dimension. That means every group of 4 pixels (2×2), or 16 pixels (4×4), is averaged into one pixel.

This down sampling is almost unnoticeable to the human eye.

This results in color channels that are only 25% (or 6.25%) of their original size. Since the two color channels make up two-thirds of the raw YCbCr data, this reduces the size of the image data by 50% (or 62.5%).
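Here is a small sketch of the averaging and of the size arithmetic. It assumes a channel stored as a list of rows with even width and height; the helper names are illustrative only.

```python
def downsample_2x(plane):
    """Average each 2x2 group of samples into one (assumes even width and height)."""
    height, width = len(plane), len(plane[0])
    return [
        [
            (plane[y][x] + plane[y][x + 1] + plane[y + 1][x] + plane[y + 1][x + 1]) / 4
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


def remaining_fraction(factor):
    """Fraction of the raw YCbCr data left when Cb and Cr are
    downsampled by `factor` in each dimension (Y is kept at full size)."""
    return (1 + 2 / factor**2) / 3


cb = [
    [100, 102, 50, 52],
    [104, 106, 54, 56],
    [200, 200, 10, 30],
    [200, 200, 20, 40],
]
small = downsample_2x(cb)
print(small)                  # [[103.0, 53.0], [200.0, 25.0]]
print(remaining_fraction(2))  # 0.5   -> a 50% reduction
print(remaining_fraction(4))  # 0.375 -> a 62.5% reduction
```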

Discrete Cosine Transform (DCT)

The Discrete Cosine Transform (DCT) is a mathematical transformation that is used in JPEG compression to separate the image data into elements of different frequencies. This is done by representing the image as a sum of cosine functions of different frequencies, which can be used to identify the parts of the image that contain small details (high frequencies) that are less important for image quality.

The DCT is a type of Fourier-related transform, which means that it is based on the principles of the Fourier transform, a mathematical tool used to represent functions as a sum of sine and cosine functions. By representing the image as a sum of cosine functions, the DCT allows JPEG to identify the parts of the image that contain small details and discard them, while preserving the larger, more important details.

Overall, the goal of the DCT is to prepare the image for size reduction. The transform does not discard anything by itself; it separates the small details from the larger, more important ones so that a later step (quantization) can discard the small details. This is done in a way that minimizes the impact on the perceived quality of the image, allowing JPEG to achieve a good balance between image quality and file size.

The idea behind the Cosine Transform is to approximate functions using cosine functions. If we consider the following triangular wave:

We can approximate it using a single cosine function:

1*cos(2*pi*t)

Approximated triangular wave using one harmonic.

While this approximation is very good, we can do much better by summing two cosine functions together.

0.9*cos(2*pi*t) + 0.1*cos(2*pi*3*t)

Summing these two cosine functions, we get:

Approximated triangular wave using two harmonics.

We can approximate this function using many more cosine functions to get even better results. With these 5 cosine functions:

We get:

Approximated triangular wave using 8 harmonics.

In the above examples we approximated a function f(t) using n cosine functions:
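Written out, and assuming the convention of the examples above where the k-th term has frequency k, such a sum has the form below. In the two-harmonic example, a_1 = 0.9, a_3 = 0.1, and a_2 = 0.

```latex
f(t) \approx \sum_{k=1}^{n} a_k \cos(2 \pi k t)
```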

The Cosine Transform is all about finding these coefficients a_k. These coefficients are found mathematically using this infinite integral formula:

For the triangular wave above we can calculate these coefficients:
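As a rough numerical check, the coefficients can also be estimated in code. The sketch below makes two assumptions that are not spelled out in the text: the triangular wave has period 1 and swings between +1 (at t = 0) and -1 (at t = 0.5), and a_k is computed as twice the integral of f(t)*cos(2*pi*k*t) over one period.

```python
import math


def triangle(t):
    """Assumed triangular wave: period 1, +1 at t = 0 and -1 at t = 0.5."""
    t = t % 1.0
    return 1 - 4 * t if t < 0.5 else 4 * t - 3


def cosine_coefficient(f, k, steps=10_000):
    """Estimate a_k = 2 * integral over one period of f(t) * cos(2*pi*k*t) dt
    using the midpoint rule."""
    dt = 1.0 / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        total += f(t) * math.cos(2 * math.pi * k * t)
    return 2 * total * dt


coefficients = [cosine_coefficient(triangle, k) for k in range(1, 6)]
for k, a_k in enumerate(coefficients, start=1):
    print(f"a_{k} = {a_k:.4f}")
```

Under these assumptions the even coefficients vanish, and a_1 and a_3 come out in a 9:1 ratio, the same ratio as the 0.9 and 0.1 weights used in the two-harmonic example.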

The cosine sum equation discussed before is, in effect, the inverse of the cosine transform: it rebuilds the function from its coefficients.

The Discrete Cosine Transform does the same thing but for finite and discrete functions (sequences). In this case, it’s a 2-dimensional function (the image).

To approximate an 8×8 square of pixels (64 pixels), we use these 64 basis functions:

An 8×8 block of pixels is represented as a weighted sum of these basis functions, which requires 64 coefficients, usually written as an 8×8 matrix. For example:

[[ 191    1 -225  -31  128  104  -45 -170]
 [ -50   27   46  -54    0   36  -19   -5]
 [-152  -14  186   55 -118 -113   51  160]
 [  88  -47  -82   96    0  -64   34    9]
 [ -64   34   59  -69    0   46  -24   -7]
 [ -18    9   16  -19    0   13   -7   -2]
 [ 132  -31 -141   52   49   -3    5  -59]
 [ -75   40   69  -81    0   54  -29   -8]]

Here are a couple of examples of 8×8 blocks and their DCT.

The top-left coefficient is called the DC coefficient (it is proportional to the average value of the block), while the others are AC coefficients.
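To make this concrete, here is a direct, unoptimized sketch of the 2-D DCT of an 8×8 block. The function name and the orthonormal scaling are choices made for this example; real JPEG codecs also subtract 128 from every sample first and use much faster factorizations, both of which are skipped here.

```python
import math

N = 8


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block (direct formula, O(N^4))."""
    def scale(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

    coefficients = [[0.0] * N for _ in range(N)]
    for u in range(N):              # vertical frequency
        for v in range(N):          # horizontal frequency
            total = 0.0
            for x in range(N):      # pixel row
                for y in range(N):  # pixel column
                    total += (
                        block[x][y]
                        * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                        * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                    )
            coefficients[u][v] = scale(u) * scale(v) * total
    return coefficients


# A perfectly flat block has only a DC coefficient.
flat = [[50] * N for _ in range(N)]
flat_dct = dct_2d(flat)
print(round(flat_dct[0][0]))  # 400, i.e. 8 times the average value

# A vertical edge only varies horizontally, so only row u = 0 is nonzero.
edge = [[0] * 4 + [100] * 4 for _ in range(N)]
edge_dct = dct_2d(edge)
print([round(c) for c in edge_dct[0]])
```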

Quantization

Now that we separated these values into different detail levels, we need to “discard” some of these details. This is where quantization comes into play.

Quantization reduces the number of bits required to store a number. In other words, it reduces the precision of the number. The idea behind it is to divide our number by a quantum and then round to the nearest integer.

To get back an approximation of the original number, we simply multiply by the quantum. The rounding error cannot be undone, which is what makes this step lossy.

To better understand what this does, look at what happens to the numbers from 0 to 63 when quantized and brought back using different values of quanta (plural of quantum).
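A minimal sketch of that experiment, assuming Python's built-in round (which rounds exact halves to the nearest even integer) and illustrative function names:

```python
def quantize(value, quantum):
    """Divide by the quantum and round to the nearest integer."""
    return round(value / quantum)


def dequantize(level, quantum):
    """Approximate reconstruction: multiply the stored level by the quantum."""
    return level * quantum


for quantum in (2, 8, 16):
    restored = [dequantize(quantize(v, quantum), quantum) for v in range(64)]
    worst = max(abs(r - v) for r, v in zip(restored, range(64)))
    print(f"quantum={quantum:2d}: {len(set(restored))} distinct values, worst error {worst}")
```

The larger the quantum, the fewer distinct values survive and the larger the worst-case error, which never exceeds half the quantum.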

In JPEG, every DCT coefficient has its own quantum, which means that we need an 8×8 matrix to represent all the quanta. The quantization matrix is shared by all blocks of a channel rather than being unique to each block (luminance and chrominance typically use separate matrices).

A quantization matrix (or table) would look something like this:

[[16 11 10 16  24  40  51  61]
 [12 12 14 19  26  58  60  55]
 [14 13 16 24  40  57  69  56]
 [14 17 22 29  51  87  80  62]
 [18 22 37 56  68 109 103  77]
 [24 35 55 64  81 104 113  92]
 [49 64 78 87 103 121 120 101]
 [72 92 95 98 112 100 103  99]]

When quantizing a DCT matrix, we divide every value in the DCT matrix by its corresponding quantum in the quantization matrix and round the result.

The idea here is to use big quanta for the high frequencies (high detail, the bottom-right part of the matrix), so that those coefficients lose the most precision.

If we quantize this DCT matrix using quantization matrix above:

[[ 191    1 -225  -31  128  104  -45 -170]
 [ -50   27   46  -54    0   36  -19   -5]
 [-152  -14  186   55 -118 -113   51  160]
 [  88  -47  -82   96    0  -64   34    9]
 [ -64   34   59  -69    0   46  -24   -7]
 [ -18    9   16  -19    0   13   -7   -2]
 [ 132  -31 -141   52   49   -3    5  -59]
 [ -75   40   69  -81    0   54  -29   -8]]

We get the following results:

[[ 12   0 -22  -2   5   3  -1  -3]
 [ -4   2   3  -3   0   1   0   0]
 [-11  -1  12   2  -3  -2   1   3]
 [  6  -3  -4   3   0  -1   0   0]
 [ -4   2   2  -1   0   0   0   0]
 [ -1   0   0   0   0   0   0   0]
 [  3   0  -2   1   0   0   0  -1]
 [ -1   0   1  -1   0   1   0   0]]

Multiplying by the quantization matrix, we get a matrix close to the original.

[[ 192    0 -220  -32  120  120  -51 -183]
 [ -48   24   42  -57    0   58    0    0]
 [-154  -13  192   48 -120 -114   69  168]
 [  84  -51  -88   87    0  -87    0    0]
 [ -72   44   74  -56    0    0    0    0]
 [ -24    0    0    0    0    0    0    0]
 [ 147    0 -156   87    0    0    0 -101]
 [ -72    0   95  -98    0  100    0    0]]

The difference matrix (approximated minus original) shows that there is a small difference in the low frequencies and a larger difference in the high frequencies.

[[  1  -1   5  -1  -8  16  -6 -13]
 [  2  -3  -4  -3   0  22  19   5]
 [ -2   1   6  -7  -2  -1  18   8]
 [ -4  -4  -6  -9   0 -23 -34  -9]
 [ -8  10  15  13   0 -46  24   7]
 [ -6  -9 -16  19   0 -13   7   2]
 [ 15  31 -15  35 -49   3  -5 -42]
 [  3 -40  26 -17   0  46  29   8]]

JPEG compression does this for both the luminance and chrominance channels.

By changing the quantization matrix, we can create different levels of compression.
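The worked example can be reproduced with a short sketch. The helper names are made up for this example, and the "coarse" table (every quantum doubled) is just one simple, hypothetical way to ask for stronger compression.

```python
DCT_BLOCK = [
    [191, 1, -225, -31, 128, 104, -45, -170],
    [-50, 27, 46, -54, 0, 36, -19, -5],
    [-152, -14, 186, 55, -118, -113, 51, 160],
    [88, -47, -82, 96, 0, -64, 34, 9],
    [-64, 34, 59, -69, 0, 46, -24, -7],
    [-18, 9, 16, -19, 0, 13, -7, -2],
    [132, -31, -141, 52, 49, -3, 5, -59],
    [-75, 40, 69, -81, 0, 54, -29, -8],
]
QUANT_TABLE = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def quantize_block(block, table):
    """Divide each coefficient by its quantum and round to the nearest integer."""
    return [[round(v / q) for v, q in zip(brow, trow)] for brow, trow in zip(block, table)]


def dequantize_block(levels, table):
    """Multiply each stored level by its quantum to approximate the original."""
    return [[lv * q for lv, q in zip(lrow, trow)] for lrow, trow in zip(levels, table)]


def count_zeros(block):
    return sum(value == 0 for row in block for value in row)


quantized = quantize_block(DCT_BLOCK, QUANT_TABLE)
restored = dequantize_block(quantized, QUANT_TABLE)
coarse_table = [[2 * q for q in row] for row in QUANT_TABLE]

print(quantized[0])  # [12, 0, -22, -2, 5, 3, -1, -3]
print(restored[0])   # [192, 0, -220, -32, 120, 120, -51, -183]
print(count_zeros(quantized), count_zeros(quantize_block(DCT_BLOCK, coarse_table)))
```

With the table above, 26 of the 64 coefficients become zero. Doubling every quantum can only keep or increase that number, at the cost of a larger reconstruction error.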

Zigzag Scan

After quantization, there is a good chance that most of the high-frequency coefficients (bottom right of the DCT matrix) are zero. If we order the coefficients from the lowest frequencies to the highest, we will very likely get long runs of consecutive zeros. Instead of storing, for example, “0 0 0 0 0 0 0 0”, we can store 8×“0”.

We do this using a zigzag scan.
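The scan walks the anti-diagonals of the block, alternating direction on each one. A compact sketch (the function names are illustrative):

```python
def zigzag_order(n=8):
    """Return the (row, column) positions of an n x n block in zigzag order."""
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        # Primary key: which anti-diagonal (r + c) the cell is on.
        # Secondary key: walk odd diagonals top to bottom, even ones bottom to top.
        key=lambda rc: (rc[0] + rc[1], rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )


def zigzag_scan(block):
    """Flatten a square block into a list following the zigzag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


# A 4x4 block holding its own row-major indices makes the path visible.
block4 = [[4 * r + c for c in range(4)] for r in range(4)]
print(zigzag_scan(block4))
# [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]
```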

Run Length Encoding (RLE) on AC Components

Run length encoding groups consecutive values together to save space. Instead of storing a value repeated n times, we store (n, value).

JPEG compression applies this to the AC coefficients of every block.
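Below is a sketch of the generic (n, value) scheme described above. The actual JPEG bitstream uses a variation of it, pairing each nonzero coefficient with the number of zeros that precede it and ending a block with an end-of-block marker, so treat this as an illustration of the idea rather than of the file format.

```python
def run_length_encode(values):
    """Collapse consecutive repeats into (count, value) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][1] == value:
            runs[-1][0] += 1           # extend the current run
        else:
            runs.append([1, value])    # start a new run
    return [tuple(run) for run in runs]


def run_length_decode(runs):
    """Expand (count, value) pairs back into the original sequence."""
    return [value for count, value in runs for _ in range(count)]


sequence = [5, -3, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0]
encoded = run_length_encode(sequence)
print(encoded)  # [(1, 5), (1, -3), (3, 0), (1, 2), (6, 0)]
```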

Differential Pulse Code Modulation (DPCM) on DC Components

The DC coefficient is usually large (relative to the AC coefficients), but it usually varies only slightly from one block to the next. Instead of storing the DC coefficient of every block individually, we store the difference between the current block's DC coefficient and that of the block before it. Since the variation isn’t that big, we store smaller numbers and thus use less space.
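A minimal sketch of this differencing, assuming the predictor starts at 0 so that the first block is stored as is (the DC values are made up):

```python
from itertools import accumulate


def dpcm_encode(dc_values):
    """Store each DC coefficient as its difference from the previous one."""
    previous = 0
    differences = []
    for dc in dc_values:
        differences.append(dc - previous)
        previous = dc
    return differences


def dpcm_decode(differences):
    """Rebuild the DC coefficients with a running sum of the differences."""
    return list(accumulate(differences))


dc_values = [12, 13, 11, 11, 14]
differences = dpcm_encode(dc_values)
print(differences)  # [12, 1, -2, 0, 3]
```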

Entropy Coding

This one is a bit more complicated and deserves a whole article explaining it. The basis of entropy coding is to use fewer bits to store “symbols” that are common in our data and use more bits for less common “symbols”.

If you want to learn more about the math behind it, check this great webpage.
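To give a feel for the idea, here is a small sketch that builds a Huffman code, the entropy coder most commonly used in baseline JPEG, and reports how many bits each symbol gets. It is a generic textbook construction with made-up input data, not the exact table format JPEG stores in its files.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code built from `symbols`."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {symbol: 1 for symbol in counts}
    # Heap entries: (total count, unique tie-breaker, {symbol: depth so far}).
    heap = [(count, index, {symbol: 0}) for index, (symbol, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every symbol inside moves one level deeper.
        count_a, _, depths_a = heapq.heappop(heap)
        count_b, _, depths_b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**depths_a, **depths_b}.items()}
        heapq.heappush(heap, (count_a + count_b, next_id, merged))
        next_id += 1
    return heap[0][2]


data = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
lengths = huffman_code_lengths(data)
total_bits = sum(lengths[symbol] for symbol in data)
print(lengths)     # {0: 1, 1: 2, 2: 3, 3: 3}
print(total_bits)  # 28 bits, versus 32 with a fixed 2 bits per symbol
```

The most common symbol (0) gets a 1-bit code while the rare ones get 3 bits.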

Recap

Image file formats are an important part of image processing. In this article, I only looked at how JPEG works. Other formats can also carry lossy-compressed data, for example TIFF and MNG, both of which can hold JPEG-compressed images. The downside of lossy compression is that information is lost. When the exactness of information is important, formats like PNG or GIF are better candidates since they are lossless (GIF only within its 256-color palette).