Understanding Burrows-Wheeler Transform

This is the first piece on my new website and I'd like to write about some compression algorithms that I've been reading about recently, starting off with Burrows-Wheeler Transform. This is part one and is just a simple overview of how things work.

Burrows-Wheeler Transform has been around since 1994 and was invented by Michael Burrows and David Wheeler. Although it is used in many pieces of compression software, bzip2 is one of the more well known ones to use BWT as part of it's compression stack. There are other types of encoding that occur in bzip2 such as run-length encoding, move to front transform and delta encoding among others but I'll write about those in future posts.

The first thing to note, is that BWT does not actually compress any data. If anything, by itself, it can actually increase the size of a file. So it is used in preparation before compression takes place. You can think of it as rearranging the bytes into a form where more repetition takes place in such a way that allows us to get back to the original data. To simplify this idea, take the following string:

AAAAABBBCDDDDDEEE

As you can see there is a lot of repetitive data in here. To make this smaller, we could represent it like so:

A5B3C1D5E3

reducing our initial data from 17 bytes, to only 10. Note, although the letter C does only appear once here, for consistency we still write the count next to it. This may not make a lot of sense, as potentially the string would be much larger if our data was more entropic. This is a problem with run-length encoding for example, as it can potentially make a file much bigger if it is structured in a certain way. But I'll get on to that in another post.

So where does BWT come in? Take the following string:

GOOGOL

As you can see we have here a couple of Gs and 3 Os. But can we make this string contain more repetitive data, in such a way that we can get back to it's original form?

First, we list all rotations of the string like so:

GOOGOL
OOGOLG
OGOLGO
GOLGOO
OLGOOG
LGOOGO

Next, we sort this list of strings into alphabetical order:

GOLGOO
GOOGOL
LGOOGO
OGOLGO
OLGOOG
OOGOLG

We then take a note of where the original string exists within this sorted list. In this case, that would be the second word (2). This is what I'll now refer to as our pointer. And finally we take the last letter of each string in this list and form our new word:

OLOOGG

Now this isn't perfect, as there are 3 Os that could potentially be in line. But it's an improvement over the original word. And given that we take into account our pointer (2), we can now get back to the original string. And that's the encoding finished.

Now to decode this back to it's original form. Let's first list each letter of the encoded string in a list:

O
L
O
O
G
G

We then create new forms of this list by completing two processes over and over until we have a string with the original number of bytes. The first process is to sort this list alphabetically:

O G
L G
O L
O O
G O
G O

The next process which we repeat here is to prepend the first letter (on the left) to our sorted list:

O G OG
L G LG
O L OL
O O OO
G O GO
G O GO

We then sort this new list alphabetically, then prepend the first letter and so on until we end up with something like this:

A = function() sort list alphabetically
B = function() prepend starting letter

A B A  B  A   B   A    B    A     B     A
O G OG GO OGO GOL OGOL GOLG OGOLG GOLGO OGOLGO GOLGOO
L G LG GO LGO GOO LGOO GOOG LGOOG GOOGO LGOOGO GOOGOL <- (2)
O L OL LG OLG LGO OLGO LGOO OLGOO LGOOG OLGOOG LGOOGO
O O OO OG OOG OGO OOGO OGOL OOGOL OGOLG OOGOLG OGOLGO
G O GO OL GOL OLG GOLG OLGO GOLGO OLGOO GOLGOO OLGOLG
G O GO OO GOO OOG GOOG OOGO GOOGO OOGOL GOOGOL OOGOLG

So as you can see, here we've repeated the steps of sorting each new list alphabetically and then prepending the far left letter to each new created word. And if you remember our pointer (2), you'll see that in the final list, our original word is listed second.

And that's it. We've encoded and decoded a word back to it's original form using BWT. Rather than storing our pointer separately like we've done here, you could instead use a delimiter that is lexicographically smaller than characters in our available alphabet. In this case that's just A-Z. So something that you may see people do is to add a $ or some other symbol that does not exist within the original alphabet on to the end of the original string. This way, when we decode, you will know what the pointer was, based on there the $ is within the list.

It's 6 and two threes, and will give you the same result. But I'll leave this for now as a small introduction to how BWT works.