Understanding Run Length Encoding

In my last post, I went over the basics of Burrows-Wheeler Transform, a compression technique used to rearrange data into a more repetitive form. Today, I'm going to write about run-length encoding, which is different in that it actually compresses data, rather than just rearranging it. Like BWT, this is also used in bzip2's compression stack among many other pieces of compression software.

There are a few different implementations of RLE, the most common is simply to list runs of characters using integers to denote how many follow. For example:

AAABBBBBBCCDEEEEE

could simply be:

3A6B2C1D5E

reducing the initial data by 7 bytes in this case. A variant of this used within bzip2 will only denote runs of 4 characters or more. To be precise, between 4 and 255. The worse case for this type of RLE is that the file contains only runs of bytes in multiples of 4, meaning the file size would increase.

Here's an example of that kind of RLE:

AAAAAAAAAABCCCCDDDDD

would become:

AAAA6BCCCC0DDDD1

reducing the initial data from 20 to only 16 bytes. Getting back to the original form is fairly straight forward in this case, as we now just have to look for runs of characters in multiples of 4. When we see one, we know the following byte is the remainder of how many bytes of the same value follow.

You might ask, why wait until there are 4 of the same bytes? Imagine if we had data that looked like this:

ABCDEFGHIJ

Performing RLE for each byte in this case (denoting 1 for each byte) would increase the file size from 10 bytes to 20. In the case that we wait for 4 of the same bytes to appear, the file size may still increase, but only every time 4 (and only 4) of the same bytes run in sequence.

So that's a quick explanation on run-length encoding. As you can see it's deceptively simple and also very effective in a lot of cases.