Cache sharing between sites
There's been some debate recently about how web sites could share the browser cache in the future. The problem is that popular JavaScript frameworks currently get downloaded again and again, once for each site that uses them, which is a great waste of resources. Of course, scripts can already be re-used across sites today by hosting those frameworks in a central location, but that is expensive for the framework developers, most of whom work on open source projects (it basically amounts to asking them to pay the hosting costs of everyone who uses their framework).
To summarize the debate: Doug Crockford has been proposing a possible solution. He also wrote another piece on JavaScript that is unrelated to this debate.
Brendan responded with his account of what really happened back in the Netscape days, and mentioned in passing that he didn't like Doug's proposal and preferred another approach.
I have to admit I wasn't captivated by the whole debate about the qualities (or lack thereof) of JavaScript, and I regret that the discussion of such an important feature got drowned in it. So let me summarize the interesting part...
Doug wants every element that has a "src" or "href" attribute to also accept an optional "hash" attribute, computed from the contents of the referenced file with a well-defined cryptographic hash algorithm. That way, when the browser encounters a tag whose hash value matches an entry it already has in its cache, it just gets the resource from the cache without touching the remote file.
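As a rough illustration of the lookup Doug describes, here is a minimal sketch of a cache keyed by content hash. SHA-256, the `HashCache` class, and the `fetch` callback are assumptions made for the example; the proposal itself only asks for "a well-defined cryptographic hash algorithm". The sketch also assumes that a resource whose real hash differs from the declared one is returned but never stored under the declared hash.

```python
import hashlib


class HashCache:
    """Toy browser cache keyed by the SHA-256 digest of a resource's bytes."""

    def __init__(self):
        self._entries = {}

    def load(self, src, declared_hash, fetch):
        """Return the resource for a tag with the given src and hash attributes."""
        cached = self._entries.get(declared_hash)
        if cached is not None:
            # Cache hit: the remote file is never contacted.
            return cached
        content = fetch(src)
        actual = hashlib.sha256(content).hexdigest()
        if actual == declared_hash:
            # Only store entries whose content really matches the declared hash.
            self._entries[actual] = content
        return content
```

With this shape, two sites that reference the same bytes under different URLs share a single cache entry, and the second site's copy is never downloaded.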
Brendan doesn't like this, first because crypto hashes are not perfectly secure, in that it is possible (but highly unlikely) to build a different (malicious) file that has the same hash, and second because a crypto hash in otherwise clean HTML would look weird and out of place.
He proposes an alternative approach where the tag has a readable "shared" attribute that would typically be a URL. The mechanism is pretty much the same as with the hash, except that the value is readable.
I don't know if it's Brendan or me who is missing something here, but his proposal looks a lot less secure than Doug's. Here's how an attacker would compromise that system:
- EvlH4ckr666 sends spam with links to his new cute penguin image site.
- As everyone loves cute penguin images, a large number of people go to http://cutepenguinpictures.com (not a real site as I'm writing this), some of them with an empty cache.
- Our cute penguin site contains (in addition to cute penguin images) a script tag with src="evil.js" and shared="http://sharedscripthosting.com/pasteYourFavoriteFramework.js" (also not a real site as I write this).
- A while later, some of those users will visit another web site that references a legitimate copy of pasteYourFavoriteFramework.js, but as it has the same shared value that evil.js maliciously used, the browser will use what it believes is a legitimate script, but that is in fact evil.js.
- Chaos ensues.
Really, am I missing something here?
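The scenario can be reproduced with a few lines of code. This is only a sketch of the reading used in the steps above, where the browser fetches from src and files the result under the shared value without checking anything. The `NaiveSharedCache` class, the `WEB` dictionary standing in for the network, and the `legit.example` host are all hypothetical.

```python
class NaiveSharedCache:
    """Cache keyed by the page-supplied shared value, with no verification."""

    def __init__(self):
        self._entries = {}

    def load(self, src, shared, fetch):
        if shared in self._entries:
            return self._entries[shared]
        content = fetch(src)             # downloaded from src...
        self._entries[shared] = content  # ...but stored under shared
        return content


# Stand-in for the network: URL -> script body (all hypothetical).
WEB = {
    "http://cutepenguinpictures.com/evil.js": b"steal_cookies()",
    "http://legit.example/pasteYourFavoriteFramework.js": b"framework()",
}
SHARED = "http://sharedscripthosting.com/pasteYourFavoriteFramework.js"

cache = NaiveSharedCache()
# First visit: the penguin site claims its evil.js is the shared framework.
first = cache.load("http://cutepenguinpictures.com/evil.js", SHARED, WEB.__getitem__)
# Later visit: a legitimate site uses the same shared value and gets evil.js.
second = cache.load("http://legit.example/pasteYourFavoriteFramework.js", SHARED, WEB.__getitem__)
```

After the second call, `second` holds the attacker's script even though the legitimate site pointed at a clean copy.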
Another variation on those ideas, a little chattier but one that would keep the HTML clean and could probably be more secure, would be to keep the shared attribute and add a second attribute that points to a hashing web service. Here's how that could work:
- When the browser sees a tag with a shared attribute and already has a cache entry for that shared value, it generates a public key and sends it to the validation service URL, challenging the service to return a hash of the script computed with that key.
- The browser receives the response to its challenge in the form of a hash. It performs the same hashing with the same public key on the cached version and compares the result with what the validation service returned. If they are the same, it uses the cache entry; otherwise it hits the src or href.
Of course, this is less simple than the other approaches, but I think it's more secure than both and still avoids sending redundant copies of the same potentially huge scripts. Instead, there is a short negotiation that should be fairly small in terms of network payload.
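The two steps above can be sketched as follows. Note the substitution: instead of a public key, this example uses a random one-time key with HMAC-SHA256, which is one plausible way to build a keyed hash challenge and is an assumption of the sketch, not something the proposal specifies. The function names and the in-process `validation_service` (standing in for a remote web service) are hypothetical.

```python
import hashlib
import hmac
import secrets


def keyed_hash(key, content):
    """Hash the content under a caller-chosen key (HMAC-SHA256)."""
    return hmac.new(key, content, hashlib.sha256).hexdigest()


def validation_service(canonical, key):
    """Stand-in for the remote service: hashes its canonical copy with the caller's key."""
    return keyed_hash(key, canonical)


def cache_entry_is_valid(cached, ask_service):
    """Challenge the service with a fresh key and compare against the cached copy."""
    key = secrets.token_bytes(32)  # fresh per check, so answers cannot be replayed
    expected = ask_service(key)
    return hmac.compare_digest(keyed_hash(key, cached), expected)


canonical = b"framework()"
ask = lambda key: validation_service(canonical, key)
good = cache_entry_is_valid(b"framework()", ask)
poisoned = cache_entry_is_valid(b"steal_cookies()", ask)
```

Only the key and one digest cross the network per check, which is the small payload the approach is aiming for; a poisoned entry fails the comparison and the browser would fall back to the src or href.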
What are your thoughts on this? Worth the trouble?
Doug's post: http://blog.360.yahoo.com/blog-TBPekxc1dLNy5DOloPfzVvFIVOWMB0li?p=789
Brendan's answer to Doug: http://weblogs.mozillazine.org/roadmap/archives/2008/04/popularity.html
UPDATE: I had a mail exchange with Brendan and it seems that what he meant was that the shared attribute URL is the one that gets hit when present. That sure removes any reasonable possibility of poisoning the cache, but I don't see what value it brings: it just seems to replace src and to have exactly the same pros and cons. In particular, it still puts the burden of shared hosting on the script author, whereas Doug's proposal (and mine) distributes this burden across all user sites.
Also removed a word that he found abusive.
UPDATE 2: so apparently the only thing @shared brings compared with the regular @src is that src can be used as a fallback if @shared is unavailable. The shared URL is still queried every time the cache doesn't contain it, which means that it still requires some massive hosting capabilities. There is no distribution of the burden. Brendan even suggested that, for performance reasons, both URLs get queried whenever the cache is empty! But of course, anyone who gave it some thought would have inferred all this from the following ;) :
"If the browser has already downloaded the shared URL, and it still is valid according to HTTP caching rules, then it can use the cached (and pre-compiled!) script instead of downloading the src URL. This avoids hash poisoning concerns. It requires only that the content author ensure that the src attribute name a file identical to the canonical ("popular") version of the library named by the shared attribute. [...] only the @shared value would be shared among script tags. The @src would be loaded only if there was no cache entry for @shared."
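For comparison, here is a minimal sketch of the clarified @shared behavior as I understand it from the exchange: use the cache entry for the shared URL if there is one, otherwise fetch the shared URL itself, and only fall back to src when the shared URL is unavailable. The `resolve_script` function, the plain dictionary used as a cache, and the use of `OSError` to signal an unavailable host are assumptions of the example. HTTP cache validity checks are left out, and the sketch assumes a src fallback is not stored under the shared key.

```python
def resolve_script(src, shared, cache, fetch):
    """Return the script body under the clarified @shared semantics."""
    if shared in cache:
        # Already downloaded from the shared URL: no network traffic at all.
        return cache[shared]
    try:
        content = fetch(shared)  # the shared URL itself is hit on a cache miss
    except OSError:
        # Shared host unavailable: fall back to src, without caching under shared.
        return fetch(src)
    cache[shared] = content
    return content
```

Because only content that actually came from the shared URL is ever stored under it, a page cannot poison the entry, but every cache miss still lands on the shared host.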