RSS is out of order? The whole system is out of order!
Summary
RSS doesn't scale. As blogs and RSS aggregators get more popular, they overwhelm web servers. The kind folks who run weblogs.asp.net and blogs.msdn.com were getting slammed for bandwidth and made a bad decision - they chopped all posts in the RSS feeds to 500 characters, and cut the aggregated web page feed to 500 characters, unformatted.
After a lot of complaints, they've enabled HTTP compression and gone back to full text in the RSS feeds and 1250 characters on the main page feed. It's not what it was - the RSS only lists the last 25 posts, and the site only shows 25 posts trimmed to 1250 characters with no links to the individual blogs - but it's a lot better. I think compression may just be a stopgap, though, so I'd like to suggest some next steps. I'll start with a recap of some ideas others have talked about, throw in a gripe about treating community sites like they're "check out my band" web pages, and finish with a modest proposal for a new RSS schema that scales.
Background: RSS Doesn't Scale
Weblogs are becoming an important source of timely information. RSS aggregators have made it possible to monitor hundreds of feeds - weblog and otherwise - without checking each one manually, since the aggregator polls all your feeds for updates at set intervals. This is convenient for end users, and it gives authors a way to broadcast their information in near real-time.
The problem is that RSS feeds are an extremely inefficient (i.e. expensive) way of sending updated information. Let's say I had a list of 500 children in a class and you wanted to be informed whenever a new child joined the class. Here's how RSS would do it: you'd call me every hour and I'd read the entire class roster to you. You'd check the names against your list and add the new ones. Now remember that RSS carries full articles, not just names, and it gets a lot worse. The only way to find out that there's nothing new is to pull down the whole feed - headlines, article text, and all.
There are two possible cases here, and RSS doesn't scale for either of them. If the RSS content changes rarely, I'm downloading the same document over and over, just to verify that it hasn't changed. If the RSS changes frequently, you've probably got an active community, which means you've got lots of readers, many of whom may have their aggregators cranked up to higher refresh rates. Either way, you're wasting bandwidth.
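To put rough numbers on it (my own back-of-the-envelope assumptions, not measurements): my feed runs about 40K, as I'll mention below. An aggregator polling it hourly pulls down nearly a megabyte a day per subscriber, whether or not anything changed. Multiply that by a few thousand subscribers and a single blog is serving gigabytes of mostly duplicate data every day.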
What's Been Suggested
HTTP Compression
The solution that was taken in this case was HTTP compression. Scott G. has been recommending this for months now; it's good to see it finally implemented. It's a reasonably simple solution, especially in IIS 6.
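There's nothing exotic on the client side, either. Here's a minimal Python sketch of an aggregator requesting a compressed feed - the feed URL is a placeholder, not a real endpoint:

import gzip
import urllib.request

FEED_URL = "http://www.tempuri.org/blogs/rss.aspx"  # hypothetical feed URL

request = urllib.request.Request(FEED_URL)
request.add_header("Accept-Encoding", "gzip")  # advertise that we accept gzip

with urllib.request.urlopen(request) as response:
    body = response.read()
    # decompress only if the server actually compressed the response
    if response.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)

print(len(body), "bytes of feed after decompression")

Text like RSS compresses extremely well, so the extra header costs almost nothing and the payload shrinks dramatically.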
Conditional Gets
The next step is most useful for infrequently updated RSS feeds: conditional GETs. The idea is that you tell the web server what you've got, and it only sends content if something has changed. Otherwise, it returns a tiny message saying you've already got the latest version. HTTP has had this support for years - If-Modified-Since goes back to HTTP 1.0, and HTTP 1.1 added ETags. It's the technology that lets your browser cache images, and it makes a lot of sense for RSS.
The plumbing of the system involves the HTTP/1.1 ETag, If-None-Match, and If-Modified-Since headers. The ETag (Entity Tag) is just a resource identifier, chosen by the server. It could be a checksum or hash, but doesn't need to be. This isn't hard to implement - the client just needs to save the ETags and send them as If-None-Match headers on the next request for the same resource. The server can check if the client has the latest version of the file and just send an HTTP 304 if they're current (there's a rough sketch of the client side after the links below). This has been suggested many times:
http://nick.typepad.com/blog/2004/09/rss_bandwidth_c.html
http://fishbowl.pastiche.org/2002/10/21/http_conditional_get_for_rss_hackers
http://www.pocketsoap.com/weblog/stories/2002/05/0015.html
More on the HTTP headers:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/act/htm/actml_ref_href.asp
http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.3.4
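And here's that client-side sketch in Python. It shows just the shape of the thing - a real aggregator would persist the cache to disk, and the feed URL is again a placeholder:

import urllib.error
import urllib.request

FEED_URL = "http://www.tempuri.org/blogs/rss.aspx"  # hypothetical feed URL
cache = {}  # url -> (etag, last_modified, body)

def fetch(url):
    request = urllib.request.Request(url)
    if url in cache:
        etag, last_modified, _ = cache[url]
        if etag:
            request.add_header("If-None-Match", etag)
        if last_modified:
            request.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(request) as response:
            body = response.read()
            cache[url] = (response.headers.get("ETag"),
                          response.headers.get("Last-Modified"), body)
            return body
    except urllib.error.HTTPError as error:
        if error.code == 304:  # Not Modified - the cached copy is current
            return cache[url][2]
        raise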
The HTTP approach can be taken further with a little imagination. Justin Rudd wants customized feeds per user (based on ETag and Last-Modified, or unique request keys), for example.
What's Been Done (on weblogs.asp.net)
Well, HTTP Compression did eventually get implemented, but the first fix was to chop the RSS feed to 500 characters and to cut the HTML main feed down to something that was (and still is, in my opinion) barely usable.
What's worse is that this was done with no warning and no request for suggestions. That kind of support for a growing community site is bad manners and bad business. It's bad manners because it's disrespectful of the effort the authors put into their content, and it's bad business because it drives away both authors and readers. Starting a community site represents a commitment to respect and support the community, and that's not what we saw last week.
What makes this worse is that it disregarded the input of those who could likely have helped. The authors on weblogs.asp.net and blogs.msdn.com represent many thousands of years - at least - of varied software development experience. Include the active readership and you're in the hundreds of thousands to millions of years of development experience. That's a mind-boggling resource. A general post or a message at the top of the HTML feed - "We're hitting the wall on bandwidth and are planning to trim the feeds in two weeks" - would probably have elicited the comments that brought about the HTTP compression implementation before, rather than after, the weeping and gnashing of teeth. If not, at least we'd have known it was coming...
This unilateral action is reminiscent of Dave Winer's stunt with weblogs.com. Dave's done a lot for blogging, but shutting off thousands of blogs without warning was a really bad idea. Chopping the feeds was not nearly as bad, but it's the same sort of thinking. The value in weblogs.asp.net is not in the .Text (oops, I mean Community Server:: Blogs) engine, it's in the content. I'm not pretending to be half as smart as Scott W. or anyone else at Telligent, but the weblogs.asp.net community (both authors and readers) definitely is.
What I'm Suggesting
Fixing RSS
Nothing big, just restructuring RSS a bit. Yeah, it's a big deal, but I think it's got to happen eventually. HTTP tricks are great, but they won't work on busy feeds that are regularly updated.
The idea is to normalize RSS a bit by separating the items into separate resources that can be requested individually. That turns the top-level RSS into a light document which references posts as resources (similar to image references in HTML). The master RSS document could include the ETag and If-Modified-Since info for each item (preferable), or that could be returned on request for each individual item. Either way, this would limit the unnecessary resending of largely unchanged RSS files. Here's a simplified example.
RSS today is one big blob of XML with everything in it:
<rss>
<channel><title>example blog</title>
<item>lots and lots of stuff</item>
<item>lots and lots of stuff</item>
<item>lots and lots of stuff</item>
<item>lots and lots of stuff</item>
</channel>
</rss>
I'm proposing we normalize this, so the main RSS feed contains a list of external references to items which can be separately requested:
<rss>
<channel><title>example blog</title>
<item id="1" />
<item id="2" />
<item id="3" />
<item id="4" />
</channel>
</rss>
<item id="1">lots and lots of stuff</item>
<item id="2">lots and lots of stuff</item>
<item id="3">lots and lots of stuff</item>
<item id="4">lots and lots of stuff</item>
The benefit is that an aggregator can just pull down the light RSS summary to see if there's anything new or updated, then pull down just what it needs. My current RSS is about 40K, and this would chop it down to just over 1K.
This is even technically possible in RSS 2.0 through the RSS extensibility model. Here's a rough example (I've used the namespace RSSI for "RSS Index"):
<rss>
<channel><title>example blog</title>
<item>
<title>Simple RSS feed not supported</title>
<description>This feed requires RSSI support. Please use an RSSI-compliant aggregator such as...</description>
</item>
<rssi:item checksum="8b7538a34e0156c" ref="http://www.tempuri.org/blogs/rss.aspx?item=8924" />
<rssi:item checksum="a34e0156c8b7538" ref="http://www.tempuri.org/blogs/rss.aspx?item=8925" />
</rss>
Eventually - maybe RSS 3.0 - we add checksum and ref as optional attributes of the <item> tag.
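To make the consumption side concrete, here's a rough Python sketch of how an aggregator might poll an RSSI-style index. The rssi namespace URI, the index URL, and the store() function are all hypothetical:

import urllib.request
import xml.etree.ElementTree as ET

RSSI_NS = "http://www.tempuri.org/rssi"              # hypothetical namespace URI
INDEX_URL = "http://www.tempuri.org/blogs/rss.aspx"  # hypothetical index feed
seen = {}  # item ref -> checksum from the last poll

def store(item_xml):
    pass  # placeholder: hand the full item to the aggregator's database

def poll():
    with urllib.request.urlopen(INDEX_URL) as response:
        root = ET.fromstring(response.read())
    for item in root.iter("{%s}item" % RSSI_NS):
        ref, checksum = item.get("ref"), item.get("checksum")
        if seen.get(ref) == checksum:
            continue  # checksum unchanged - skip the download entirely
        with urllib.request.urlopen(ref) as response:
            store(response.read())
        seen[ref] = checksum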
This would also simplify some other things, like synchronizing RSS information between computers. Dare has proposed something similar with SIAM (Synchronization of Information Aggregators using Markup), but he's only looking at how to synchronize the RSS at the aggregator level, once it's come down. If RSS supported this natively, SIAM would be even simpler to implement.
Another Idea I Considered, Then Rejected
I was thinking about other HTTP tricks that could reduce the amount of RSS going across the wire, and I briefly considered partial downloads. The idea is that the server would return a complex ETag giving individual checksums and byte ranges for each post. This is possible since the ETag format is very loosely defined. The aggregator could then request specific byte ranges within the RSS using the HTTP Range header. This would work just fine for static, file-based RSS, but it would be inefficient for dynamic RSS, since it would cause multiple individual requests to the same big RSS page.
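For what it's worth, the mechanics would look something like this sketch (the URL and byte offsets are made up):

import urllib.request

FEED_URL = "http://www.tempuri.org/blogs/rss.aspx"  # hypothetical feed URL

def fetch_range(url, start, end):
    # request only the bytes covering one post within the big RSS file
    request = urllib.request.Request(url)
    request.add_header("Range", "bytes=%d-%d" % (start, end))
    with urllib.request.urlopen(request) as response:
        # a 206 (Partial Content) status means the server honored the range
        return response.read()

# e.g. pull just the post the ETag said lives at bytes 2048-4095
post = fetch_range(FEED_URL, 2048, 4095)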
Another technology I considered and rejected was using XInclude and XPointer to assemble the RSS (there's an MSDN article on the topic). Here's a syntax sample:
<xi:include href="http://www.tempuri.org/rss.aspx" xpointer="xpointer(//feed[@author='WTDoor'])"/>
It's interesting, but doesn't help when all the data's in one big XML file.
And what's the deal with OPML?
While I'm at it, someone needs to Pimp My OPML. OPML is a great way to pull down a list of feeds, but once you've got them, you can't keep them in sync. It'd be great if aggregators could subscribe to OPML feeds, and if someone would write an OPML diff tool, and...
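The diff tool, at least, is an afternoon project. Here's a rough Python sketch that compares the feed URLs in two OPML files (the file names are made up):

import xml.etree.ElementTree as ET

def feed_urls(opml_path):
    # OPML lists feeds as <outline xmlUrl="..."/> elements under <body>
    tree = ET.parse(opml_path)
    return {o.get("xmlUrl") for o in tree.iter("outline") if o.get("xmlUrl")}

old = feed_urls("subscriptions-old.opml")  # made-up file names
new = feed_urls("subscriptions-new.opml")
print("Added:  ", sorted(new - old))
print("Removed:", sorted(old - new))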
And also... Happy Birthday to me. and my mom. and my daughter. I was born on my mom's birthday, and we cheated a bit by scheduling a c-section for our daughter, Esther. She's one today, I'm 34, and my mom's not telling.