Tokenizing JavaScript - A look at what’s left after minification
Minifiers
JavaScript minifiers are popular these days. Closure, YUI Compressor, Microsoft Ajax Minifier, to name a few. Using one is essential for any site that uses more than a little script and cares about performance.
Each tool of course has advantages and disadvantages. But they all do a pretty good job. The results vary only slightly in the grand scheme of things. Not enough to make so much of a difference that I’d say you should always use one over the other – use whatever fits in with your environment best.
Tag Clouds
Anyway, it got me thinking. After crunching a script through one of these bad boys, what’s left?
The first thing I did was take jQuery 1.4.2 (the minified version) and push it into the tag cloud creator, Wordle. BAM! Beautiful, isn’t it?
It’s not that surprising that two of the longest keywords in JavaScript also happen to use up the most space: return, and function. What does it matter? The word function appears in jQuery 404 times, adding up to 3,232 bytes. That’s about 4.5% of the size of the library! return appears 385 times, adding up to 2,310 bytes, or 3.2%. So there you go – return and function make up a total of almost 8% of the size of jQuery!
Really makes me wish JavaScript had C# style llamas – err, lambdas.
Tokenizing
There are some problems here though. No tag cloud generator I could find was intended to be run on code. So it ignores things like operators, brackets, etc. And you know things like semicolons are frequent. Nor do they provide any kind of data feed of the document’s tokens. So, I created my own tool to generate the data.
Basically, I just have a list of possible tokens, and I run a regular expression on the code to determine how many times each occurs. Then, multiply by its length to get the total size of that token in the script.
Results
It’s amazing what the results show. The top 15 tokens represent 35% of the entire script, mostly single-character tokens:
It makes sense that function is the top token in size since 8 characters long, even though it only occurs 404 times. But look at “.”. Yes, dot. It’s only one character long, but it represents 2,565 bytes. “(” and “)” together make up over 5k. return, despite being 6 characters long, is in 5th place.
What does it mean?
Well, actually, the fact that these syntax related tokens are so high on the list is partially a testament to the effectiveness of minifiers. A minifier can only remove so much of the syntax, and you can’t shorten them. So if a minifier is doing it’s job, they should tend bubble up to the top of the list.
Thankfully, Wordle supports an advanced mode that lets you enter the tokens and their weight manually. Armed with the output of this tool, here is the entire result set in tag cloud form. The relative sizes of the tags aren’t really correct though, simply because a “.” is smaller in whatever font than any letters. Also, I don’t really know why the first use of Wordle produced a cloud that shows return bigger than function – I guess just a bug in how it counts. All the more reason to use the advanced mode.
One thing it does is show ways that minifiers could do an even better job, or ways that we can code that reduce these tokens. For example, in theory a minifier could convert functions that use ‘return’ to assign to a parent-scope variable instead. That’s a fairly complex thing to do, so probably not worth it (performance seems equivalent, though). You can also try and structure your code so a function only has one ‘return’ instead of multiple.
This tool can help you find tokens in your code that use a lot of space besides these, too. For example, I applied this tool to the MicrosoftAjax.js script from .NET 3.5, and found to my horror that ‘JavaScriptSerializer’ was near the top of the list. And that is why in the AjaxControlToolkit you will find this script has been greatly reduced in size. Despite having many new features, it’s 10k smaller - in minified form - than it was in .NET 3.5., partially thanks to this tool to help me identify the areas that needed improvement.
Notice ‘getElementsByTagName’ appears in jQuery a noticeable amount. It occurs 17 times, or 340 bytes. Also not so obvious is how often characters ‘a’, ‘b’, etc occur. These are local variables that the minifier has converted to. ‘a’ is high on the list, since it is the first one used, but there are many in the top 100, all the way up to the letter ‘o’, totaling 5,228 bytes. So, a minifier could do well to understand how local variables are used and reuse existing ones when they are no longer needed.
The code is fairly simple. Again, this was something I wrote one evening off the cuff, so it’s not perfect (although, thanks to Brad and Damian for the nice LINQ way of converting the char array to a string array).
This is the code in C# for .NET 3.5.
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
public static class Tokenizer {
public static IEnumerable<KeyValuePair<string, int>> Tokenize(string content) {
string[] tokens =
{ "===", "!==", "==", "<=", ">=", "!=", "-=", "+=", "*=", "/=", "|=", "%=", "^=", ">>=", ">>>=", "<<=",
"++", "--", "+", "-", "*", "\\", "/", "&&", "||", "&", "|", "%", "^", "~", "<<", ">>>", ">>",
"[", "]", "(", ")", ";", ".", "!", "?", ":", ",", "'", "\"", "{", "}" };
var escapedTokens = from token in tokens
select ("\\" + string.Join("\\",
(from c in token.ToCharArray() select c.ToString()).ToArray()));
string pattern = "[a-zA-Z_0-9\\$]+|" + string.Join("|", escapedTokens.ToArray());
var r = new Regex(pattern, RegexOptions.Compiled | RegexOptions.ExplicitCapture);
return from m in r.Matches(content).Cast<Match>()
group m by m.Value into g
orderby g.Count() descending
select new KeyValuePair<string, int>(g.Key, g.Count());
}
}
You can also download the raw CSV file with the results for jQuery here.
Update: Thanks to David Fowler for linq-ifying the code and making it half as long!
Happy coding!