Adventures With Base-64

This’ll be a quick one, I promise.

Recently a friend was looking at generating random-but-unique URL identifiers for a web project, and I immediately thought about this awesome video by Tom Scott about how YouTube encodes its IDs for URLs. (If you haven’t seen the video, or any of Tom Scott’s videos, I highly recommend them.) Rewatching the video, I was like ohhhh, base-64 encoding, I know how that works…. wait a minute. This is not the base-64 encoding I am used to. Are there two types of base-64 encoding? Am I missing something?

Base-64 encoding, if you’re not familiar with it, involves:

  • Taking a string
  • Breaking it up into individual characters
  • Converting those characters into their Unicode decimal values
  • Converting those numbers into UTF-8 binary
  • Making one long list of all the binary numbers
  • Splitting the list into six-bit chunk (because 2⁶ is 64, hence ‘base-64’!)
  • And finally using a mapping table to translate each binary chunk into a character. The mapping table typically uses the characters A to Z, a to z, 0 to 9, and + and /, to represent the digits 0 (000000) to 63 (111111).

(For all examples here, I’ll be using the base 64 table as listed on Wikipedia)

# Base-64 encoding, explained

I’ll step through an example of encoding, using the word cat. (I like cats.)

cat
c        a        t                # after splitting into characters
01100011 01100001 01110100         # after getting the UTF-8 binary representation of each character
011000   110110   000101   110100  # after splitting the long binary into six-bit chunks - extra spaces are filled by 0s
Y        2        F        0       # after mapping back to characters

So the string cat in base-64 encoding is the string Y2F0. Cool. You can verify this in your online tool or local interpreter of choice:

iex(1)> Base.encode64("cat")
"Y2F0"

Does it work for non-ASCII characters? Let’s try the string (the Japanese hiragana character for ‘a’).

あ
227      129      130              # in UTF-8 decimal numbers - see Wikipedia link above
11100011 10000001 10000010         # in binary
111000   111000   000110   000010  # in six-bit chunks
4        4        G        C       # after mapping back to characters
iex(1)> Base.encode64("あ")
"44GC"

The logic is sound!

Strings encoded in base-64 are always going to be longer than normal strings, because a character in UTF-8 is always at least 8 bits (sometimes many more, like in the case of ) and so will always cover more than one six-bit chunk.

But how does that relate to YouTube IDs? Isn’t that packing lots of information, one ID out of <73,786,976,294,838,206,464> possible unique IDs, into a short string? 73,786,976,294,838,206,464 is much longer than the 11 characters that YouTube can encode it in. So what gives?

# Base 64.... explained?

Well, it turns out that what he’s talking about in the video is to do with numbers, not strings, which is a bit different. He’s simply converting a number from being in base 10 (our decimal system) to being in base 64 - nothing to do with “encoding” at all, any more than binary (base 2), octal (base 8), or hexadecimal (base 16) are “encodings”. (They’re not, they’re just different ways of writing the exact same number.)

If we were to convert a number, say 53, to octal (base 8), we would repeatedly divide it by the base (8) and use the remainders to get the digits of the result, from right to left.

div(53, 8) = 6, remainder 5
div(6, 8) = 0, remainder 6

So 53 in octal is 65 - math checks out. It’s the same underlying algorithm as the base-64 encoding with strings - just shortcircuited a little. With strings we converted to a list of integers before combining them into one long binary and splitting it into chunks using the base number (the divide/remainder process) - with a number we already have an integer, albeit possibly a larger one than the 255 limit with strings, so it feels like a different process, one that makes more sense with dividing and remainders.

We can do exactly the same thing with different bases like 64, and use the same mapping table we used for the final step of base-64 encoding to get a single representation of all the ‘digits’.

The number 53 in base 64 using such a mapping table would be just one ‘digit’ - 1. (Not to be confused with 1 in decimal, which would be B in base 64.) A larger number, say, 4734712468963, would be:

div(4734712468963, 64) = 73979882327, remainder 35
div(73979882327, 64) = 1155935661, remainder 23
div(1155935661, 64) = 18061494, remainder 45
div(18061494, 64) = 282210, remainder 54
div(282210, 64) = 4409, remainder 34
div(4409, 64) = 68, remainder 57
div(68, 64) = 1, remainder 4
div(1, 64) = 0, remainder 1

Or in the binary-splitting representation -

4734712468963
1000100111001100010110110101101010111100011              # converted to binary
000001 000100 111001 100010 110110 101101 010111 100011  # split into chunks, with leading spaces filled with 0s
1      4      57     34     54     45     23     35

So 4734712468963 can be encoded in base 64 as the characters for the numbers 1, 4, 57, 34, 54, 45, 23 and 35. You might be able to verify this in your local interpreter of choice:

iex(1)> Integer.digits(4734712468963, 64)
[1, 4, 57, 34, 54, 45, 23, 35]

And after converting them using the mapping table, you’ll get something like BD5i2tXj - which is a much shorter representation of the same thing.

The fact that YouTube just happened to use 64 different characters is a bit of a red herring here - typically number bases don’t go that high, and if you naively try to convert an integer to a string before using normal base 64 encoding methods you’ll get a much different result. Even built-in methods like Elixir’s Integer.to_string/2 don’t work for base 64 when encoding pure numbers - they only go as high as base 36, using A to Z and 0 to 9 for the digits (because numbers really shouldn’t be case sensitive…..)

Today I learned! (And maybe you did too!)

(And it didn’t turn out to be so quick after all….)

Special thanks to Luke Rollans and Paul Fioravanti for proof-reading and sanity checking this post, and @microtherion for further sanity checking the algorithm.