Text re-encoding for optimising storage capacity in the browser

by Andrew Betts

Web apps vs native apps. Sadly it’s not a fair fight, particularly when it comes to storage. A native app can use all the storage it wants, while a web app hits quotas and limits at every turn. Here’s a table of the restrictions by storage API and browser, which we determined by a combination of reading specs, consulting browser vendors and testing experimentally:

                         WebSQL (w) / IndexedDB (i)   Appcache           localStorage      Cookies
Safari/iOS 5             5MB / 50MB (w)               10MB               5MB               4KB total,
                                                                                           up to 50 per domain
Android Browser / ICS    ?                            Unlimited          5MB
BB Browser / PlayBookOS  ?                            ?                  ?
Safari 5.2               5MB / Unlimited (w)          Unlimited          5MB
Chrome 20                Unlimited (wi)               5MB                5MB
IE 10                    500MB (i)                    50MB               10MB
Opera 11                 5MB / Unlimited (w)          50MB / Unlimited   5MB / Unlimited
Firefox 11               50MB / Unlimited (i)         Unlimited          10MB

You might look at these and think that actually they’re pretty generous – 50MB seems to be the lowest common denominator. But the FT publishes hundreds of stories a day, along with more than a hundred high resolution images, not to mention podcasts and videos. If a user wants anything like a reasonable chunk of the FT for offline reading, 50MB is nothing.

So we need ways of trying to do more with less, and fortunately, there is an obvious place to start.

Internally, JavaScript encodes text as UTF-16, a form of Unicode that allocates two bytes to every character in common use. Even characters from basic ASCII get two bytes. This is actually a pretty good choice – it allows JavaScript to be fully Unicode compliant and also makes string operations fast, because counting characters is just a matter of counting two-byte units.
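You can see both halves of that trade-off from the console – length counts two-byte units, and charCodeAt exposes the 16-bit value of each one:

```javascript
// Every JavaScript string character is one 16-bit (two-byte) code unit
var s = "Hello";
console.log(s.length);                // 5 code units, i.e. 10 bytes internally
console.log("H".charCodeAt(0));       // 72 (0x0048, stored as the two bytes 00 48)
console.log("\u2019".charCodeAt(0));  // 8217, a right single quotation mark
```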

However, English speakers have always been spoiled by the ability to treat memory capacity in bytes as roughly equal to the number of characters that can be stored, because English is typically encoded using ASCII (exactly one byte per character) or UTF-8 (typically one byte per character, with occasional multi-byte characters). When you start encoding English text using UTF-16 and then store it somewhere with limited space, you quickly find you can only store half the number of characters you thought you’d get.

Here’s an illustration of how HELLO WORLD is stored in UTF-16:

00 48 00 45 00 4C 00 4C 00 4F 00 20 00 57 00 4F 00 52 00 4C 00 44
   H     E     L     L     O           W     O     R     L     D

Disaster. Every other byte is a null byte and 50MB of storage just became effectively 25MB. So, if you, like us, are worried about storing more stuff offline, and you want your 25MB back, here are a few options:

1. Create a UTF-8 bytestream, and pretend it’s UTF-16

The obvious answer is to remove those null bytes by changing to UTF-8 encoding. But JavaScript doesn’t have UTF-8 encoding as an option, so whatever we do we have to pretend it’s UTF-16. That’s not particularly hard, as long as we can take the characters apart again afterwards and reassemble the correct ones from the byte values. Essentially we’re redrawing the boundaries between characters – two one-byte UTF-8 characters will turn into one UTF-16 character. A three byte UTF-8 character will be spread across two UTF-16 ones. This sounds complicated, but conceptually it’s actually quite simple. Here’s an illustration:

Source string:  R        o        y        ’                 s
.charCodeAt():  82       111      121      8217              115
As binary:      01010010 01101111 01111001 00100000 00011001 01110011
As UTF-8:       01010010 01101111 01111001 11100010 10000000 10011001 01110011
Use in pairs:   01010010 01101111 01111001 11100010 10000000 10011001 01110011 00000000

The key bit here is converting the char codes, whose byte values make up a UTF-16 string, into UTF-8 byte values. Adding in the UTF-8 control bits changes the number that is being stored, because we have to spread the actual ‘value bits’ out to accommodate the control bits. The resulting number is much higher: 8217 for the apostrophe becomes 14844057.
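As a rough sketch of the approach (the function names are ours, and we use the old unescape/encodeURIComponent trick to obtain the UTF-8 byte values rather than hand-rolled bit manipulation):

```javascript
// Sketch of option 1: pack UTF-8 byte values two-per-character into a UTF-16 string.
// unescape(encodeURIComponent(s)) yields a string whose character codes are the
// UTF-8 byte values of s (each in the range 0-255).
function utf8Pack(s) {
	var bytes = unescape(encodeURIComponent(s)), out = '', i;
	if (bytes.length % 2 !== 0) bytes += ' ';  // pad to an even number of bytes
	for (i = 0; i < bytes.length; i += 2) {
		out += String.fromCharCode((bytes.charCodeAt(i) * 256) + bytes.charCodeAt(i + 1));
	}
	return out;
}

function utf8Unpack(packed) {
	var bytes = '', i, n;
	for (i = 0; i < packed.length; i++) {
		// Split each 16-bit character back into its two UTF-8 bytes
		n = packed.charCodeAt(i);
		bytes += String.fromCharCode(Math.floor(n / 256), n % 256);
	}
	return decodeURIComponent(escape(bytes));
}
```

Note that an odd-length byte sequence gains a trailing space from the padding when unpacked.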

Once you have the new UTF-16 string, you can write it to storage. If you look at it in Dev Tools, it will look like a bizarre mixture of Eastern languages, but that’s only because the dev tools viewer is displaying it using UTF-16 encoding. If it were interpreted as UTF-8, you would see your source text.

UTF-8 data being interpreted as UTF-16 in WebSQL

The problem with this approach is that making the conversion from the UTF-16 byte form to the UTF-8 one requires lots of code, and the overall algorithm is slow – we managed only 100KB per second at best in Chrome 17 on a MacBook Pro.

2. As you were, but with ASCII

So if working out the UTF-8 encoding for a character code is expensive, one option for simplifying the problem is to simplify the source text. Let’s assume that we’ve converted our source text to plain ASCII by representing multi-byte characters using HTML entities (eg &raquo; instead of »). Now we can just glue the existing character codes together rather than doing any conversion, because we know the source text contains only one-byte characters:

Source string:  S        i        m        p        l        e
.charCodeAt():  83       105      109      112      108      101
As binary:      01010011 01101001 01101101 01110000 01101100 01100101
Use in pairs:   01010011 01101001 01101101 01110000 01101100 01100101

By using an ASCII source text we eliminated a lot of bit shifting and UTF-8 encoding. Much faster. And shorter. So short, in fact, that the entire algorithm is just this:

function compress(s) {
	var i, l, out='';
	if (s.length % 2 !== 0) s += ' ';
	for (i = 0, l = s.length; i < l; i+=2) {
		out += String.fromCharCode((s.charCodeAt(i)*256) + s.charCodeAt(i+1));
	}

	// Add a snowman prefix to mark the resulting string as encoded (more on this later)
	return String.fromCharCode(9731)+out;
}

It’s still not the best we can do though. Many of our images are stored offline by creating base-64 encoded strings on the server, because JavaScript can’t store binary files offline (except using AppCache, which is not flexible enough for content images). The process of encoding binary data as text using base-64 increases the volume of data by about 30%, because whilst an 8-bit byte can offer 256 possible values, we are only using 64 values, and 64 values would only need 6 bits to store efficiently. Since you can’t store data in units of 6/8ths of a byte, the data grows larger.
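The overhead is easy to verify with btoa, the standard browser base64 encoder: every three input bytes become four base64 characters, so the encoded form is a third larger (plus padding when the input isn’t a multiple of three bytes):

```javascript
// Six bytes of input become eight characters of base64: 4/3 ≈ 33% growth
var encoded = btoa("foobar");
console.log(encoded);         // "Zm9vYmFy"
console.log(encoded.length);  // 8 characters for 6 input bytes
```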

3. Bit shifting base64 data into UTF-16

So, to option 3, which can only be applied to text that is base64 encoded. We start by mapping each character used in base64 encoding onto the range of numbers from 0 to 63 (the character codes of the base64 alphabet are not contiguous and do not begin at zero). Now we step through the source string, taking each character and converting it to the correct index in our range, which can be represented in binary using 6 bits (two to the power of six is 64). Since 6 does not divide into 16 we can’t fit a whole number of base64 characters into one UTF-16 character, so we have to convert in larger groups – every eight base64 characters fit precisely into three UTF-16 characters.

Here’s how that works:

Source string:  A      B      C      D      o      p      q      9
b64 index:      0      1      2      3      40     41     42     61
As binary:      000000 000001 000010 000011 101000 101001 101010 111101
Groups of 16:   00000000 00010000   10000011 10100010   10011010 10111101

This is not much more complicated than option 2, but delivers the best storage efficiency yet – 62.5% less storage per character than UTF-16, since each 16-bit character now carries sixteen bits of base64 data rather than six. Here’s the code:

function compress(s) {
	var b64chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
	var i, l, v, out = '', map = {};
	for (i = 0; i < 64; i++) map[b64chars.charAt(i)] = i;
	var bits = 16, chr = 0, rem = 0;
	for (i = 0, l = s.length; i < l; i++) {
		v = map[s.charAt(i)];
		if (bits > 6) {
			// The whole six-bit value fits into the current character
			bits -= 6;
			chr += v << bits;
		} else {
			// Split the value: the top bits finish this character, the rest start the next
			rem = 6 - bits;
			chr += v >> rem;
			out += String.fromCharCode(chr);
			chr = (v % (1 << rem)) << (16 - rem);
			bits = 16 - rem;
		}
	}
	// Flush a final partial character if the input wasn't a multiple of eight characters
	if (bits < 16) out += String.fromCharCode(chr);
	return out;
}
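The matching decoder isn’t shown above; here’s a minimal sketch (decompressB64 is our name for it, and it assumes the original base64 string was a multiple of eight characters long, so there are no leftover padding bits to discard):

```javascript
// Reverse of option 3: split each 16-bit character back into 6-bit groups
// and map each group back to its base64 character.
function decompressB64(s) {
	var b64chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
	var i, out = '', bitbuf = 0, bitcount = 0;
	for (i = 0; i < s.length; i++) {
		// Append this character's sixteen bits to the buffer
		bitbuf = (bitbuf << 16) | s.charCodeAt(i);
		bitcount += 16;
		// Emit one base64 character per complete group of six bits
		while (bitcount >= 6) {
			bitcount -= 6;
			out += b64chars.charAt((bitbuf >> bitcount) & 63);
		}
		bitbuf &= (1 << bitcount) - 1;  // keep only the unconsumed low bits
	}
	return out;
}
```

Running it on the three characters from the worked example above (0x0010, 0x83A2, 0x9ABD) recovers ABCDopq9.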

Detecting encoded text and unpacking

Rather than add special cases around our code, and also because we need to be backwards compatible with previously stored data that is not encoded, we decided to use a prefix to denote the kind of encoding that had been applied to stored data, so we know whether to decode it or not. We figured that, despite the creativity and renowned intellect of FT journalists, the snowman character is unlikely to get much editorial use, so we now use this prefix to signal that we need to unpack encoded data.

Using a Unicode snowman to mark encoded data

To unpack then, just detect that the string starts with a snowman, and reverse the process. Here’s the code to unpack an ASCII string packed using the first compress function above:

function decompress(s) {
	var i, l, n, m, out = '';

	// If not prefixed with a snowman, just return the (already uncompressed) string
	if (s.charCodeAt(0) !== 9731) return s;

	for (i = 1, l = s.length; i < l; i++) {
		n = s.charCodeAt(i);
		m = Math.floor(n/256);
		out += String.fromCharCode(m, n%256);
	}
	return out;
}
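Restating the ASCII compress function alongside this decompress gives a runnable round trip (even-length input shown; odd-length strings come back with the padding space that compress appends):

```javascript
// The option-2 compress/decompress pair from this article, exercised end to end
function compress(s) {
	var i, l, out = '';
	if (s.length % 2 !== 0) s += ' ';
	for (i = 0, l = s.length; i < l; i += 2) {
		out += String.fromCharCode((s.charCodeAt(i) * 256) + s.charCodeAt(i + 1));
	}
	return String.fromCharCode(9731) + out;  // snowman prefix marks encoded data
}

function decompress(s) {
	var i, l, n, out = '';
	if (s.charCodeAt(0) !== 9731) return s;  // no snowman: already plain text
	for (i = 1, l = s.length; i < l; i++) {
		n = s.charCodeAt(i);
		out += String.fromCharCode(Math.floor(n / 256), n % 256);
	}
	return out;
}

var packed = compress("Simple");
console.log(packed.length);       // 4: one snowman plus three packed characters
console.log(decompress(packed));  // "Simple"
```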