RSS is out of order? The whole system is out of order!
Summary
RSS doesn't scale. As blogs and RSS aggregators get more popular, they overwhelm web servers. The kind folks who run weblogs.asp.net and blogs.msdn.com were getting slammed for bandwidth and made a bad decision - they chopped all posts in the RSS feeds to 500 characters, and cut the aggregated web page feed to 500 characters, unformatted.
After a lot of complaints, they've enabled HTTP compression and gone back to full text in the RSS feeds and 1250 characters on the main page feed. It's not what it was - the RSS only lists the last 25 posts, and the site only shows 25 posts trimmed to 1250 characters with no links to the individual blogs - but it's a lot better. I think compression may just be a stopgap, though, so I'd like to suggest some next steps. I'll start with a recap of some ideas others have talked about, throw in a gripe about treating community sites like they're "check out my band" web pages, and finish with a modest proposal for a new RSS schema that scales.
Background: RSS Doesn't Scale
Weblogs are becoming an important source of timely information. RSS aggregators have made it possible to monitor hundreds of feeds - weblog and otherwise - without checking each one manually, since the aggregator polls all your feeds for updates at set intervals. This is convenient for end users, and it gives authors a way to broadcast their information in near real-time.
The problem is that RSS feeds are an extremely inefficient (i.e. expensive) way of sending updated information. Let's say I had a list of 500 children in a class and you wanted to be informed whenever a new child joined the class. Here's how RSS would do it: you'd call me every hour and I'd read the entire class roster to you. You'd check the names against your list and add the new ones. Now remember that RSS carries full articles, not just names, and it gets a lot worse. The only way to find out that there's nothing new is to pull down the whole feed - headlines, article text, and all.
There are two possible cases here, and RSS doesn't scale for either of them. If the RSS content changes rarely, I'm downloading the same document over and over, just to verify that it hasn't changed. If the RSS changes frequently, you've probably got an active community, which means you've got lots of readers, many of whom may have their aggregators cranked up to higher refresh rates. Either way, you're wasting bandwidth.
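To put rough numbers on it (my own back-of-the-envelope assumptions, not measurements): my feed runs about 40K, as I'll mention below. An aggregator polling it hourly pulls down nearly a megabyte a day per subscriber, whether or not anything changed. Multiply that by a few thousand subscribers and a single blog is serving gigabytes of mostly duplicate data every day.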
What's Been Suggested
HTTP Compression
The solution that was taken in this case was HTTP compression. Scott G. has been recommending this for months now; it's good to see it finally implemented. It's a reasonably simple solution, especially in IIS 6.
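There's nothing exotic on the client side, either. Here's a minimal Python sketch of an aggregator requesting a compressed feed - the feed URL is a placeholder, not a real endpoint:

import gzip
import urllib.request

FEED_URL = "http://www.tempuri.org/blogs/rss.aspx"  # hypothetical feed URL

request = urllib.request.Request(FEED_URL)
request.add_header("Accept-Encoding", "gzip")  # advertise that we accept gzip

with urllib.request.urlopen(request) as response:
    body = response.read()
    # decompress only if the server actually compressed the response
    if response.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)

print(len(body), "bytes of feed after decompression")

Text like RSS compresses extremely well, so the extra header costs almost nothing and the payload shrinks dramatically.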
Conditional Gets
The next step is most useful for infrequently updated RSS feeds: conditional GETs. The idea is that you tell the web server what you've got, and it only sends content if something has changed. Otherwise, it returns a tiny message saying you've already got the latest version. HTTP has had this support for years - If-Modified-Since goes back to HTTP 1.0, and HTTP 1.1 added ETags. It's the technology that lets your browser cache images, and it makes a lot of sense for RSS.
The plumbing of the system involves the HTTP/1.1 ETag, If-None-Match, and If-Modified-Since headers. The ETag (Entity Tag) is just a resource identifier, chosen by the server. It could be a checksum or hash, but doesn't need to be. This isn't hard to implement - the client just needs to save the ETags and send them as If-None-Match headers on the next request for the same resource. The server can check if the client has the latest version of the file and just send an HTTP 304 if they're current (there's a rough sketch of the client side after the links below). This has been suggested many times:
http://nick.typepad.com/blog/2004/09/rss_bandwidth_c.html
http://fishbowl.pastiche.org/2002/10/21/http_conditional_get_for_rss_hackers
http://www.pocketsoap.com/weblog/stories/2002/05/0015.html
More on the HTTP headers:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/act/htm/actml_ref_href.asp
http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.3.4
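And here's that client-side sketch in Python. It shows just the shape of the thing - a real aggregator would persist the cache to disk, and the feed URL is again a placeholder:

import urllib.error
import urllib.request

FEED_URL = "http://www.tempuri.org/blogs/rss.aspx"  # hypothetical feed URL
cache = {}  # url -> (etag, last_modified, body)

def fetch(url):
    request = urllib.request.Request(url)
    if url in cache:
        etag, last_modified, _ = cache[url]
        if etag:
            request.add_header("If-None-Match", etag)
        if last_modified:
            request.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(request) as response:
            body = response.read()
            cache[url] = (response.headers.get("ETag"),
                          response.headers.get("Last-Modified"), body)
            return body
    except urllib.error.HTTPError as error:
        if error.code == 304:  # Not Modified - the cached copy is current
            return cache[url][2]
        raise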
The HTTP approach can be taken further with a little imagination. Justin Rudd wants customized feeds per user (based on ETag and Last-Modified, or unique request keys), for example.
What's Been Done (on weblogs.asp.net)
Well, HTTP Compression did eventually get implemented, but the first fix was to chop the RSS feed to 500 characters and to cut the HTML main feed down to something that was (and still is, in my opinion) barely usable.
What's worse is that this was done with no warning and no request for suggestions. That kind of support for a growing community site is bad manners and bad business. It's bad manners because it's disrespectful of the effort the authors put into their content, and it's bad business because it drives away both authors and readers. Starting a community site represents a commitment to respect and support the community, and that's not what we saw last week.
What makes this worse is that it disregarded the input of those who could likely have helped. The authors on weblogs.asp.net and blogs.msdn.com represent many thousands of years - at least - of varied software development experience. Include the active readership and you're in the hundreds of thousands to millions of years of development experience. That's a mind-boggling resource. A general post or a message at the top of the HTML feed - "We're hitting the wall on bandwidth and are planning to trim the feeds in two weeks" - would probably have elicited the comments that brought about the HTTP compression implementation before, rather than after, the weeping and gnashing of teeth. If not, at least we'd have known it was coming...
This unilateral action is reminiscent of Dave Winer's stunt with weblogs.com. Dave's done a lot for blogging, but shutting off thousands of blogs without warning was a really bad idea. Chopping the feeds was not nearly as bad, but it's the same sort of thinking. The value in weblogs.asp.net is not in the .Text (oops, I mean Community Server:: Blogs) engine, it's in the content. I'm not pretending to be half as smart as Scott W. or anyone else at Telligent, but the weblogs.asp.net community (both authors and readers) definitely is.
What I'm Suggesting
Fixing RSS
Nothing big, just restructuring RSS a bit. Yeah, it's a big deal, but I think it's got to happen eventually. HTTP tricks are great, but they won't work on busy feeds that are regularly updated.
The idea is to normalize RSS a bit by separating the items into separate resources that can be requested individually. That turns the top-level RSS into a light document which references posts as resources (similar to image references in HTML). The master RSS document could include the ETag and If-Modified-Since info for each item (preferable), or that could be returned on request for each individual item. Either way, this would limit the unnecessary resending of largely unchanged RSS files. Here's a simplified example.
RSS today is one big blob of XML with everything in it:
<rss>
<channel><title>example blog</title>
<item>lots and lots of stuff</item>
<item>lots and lots of stuff</item>
<item>lots and lots of stuff</item>
<item>lots and lots of stuff</item>
</channel>
</rss>
I'm proposing we normalize this, so the main RSS feed contains a list of external references to items which can be separately requested:
<rss>
<channel><title>example blog</title>
<item id="1" />
<item id="2" />
<item id="3" />
<item id="4" />
</channel>
</rss>
<item id="1">lots and lots of stuff</item>
<item id="2">lots and lots of stuff</item>
<item id="3">lots and lots of stuff</item>
<item id="4">lots and lots of stuff</item>
The benefit is that an aggregator can just pull down the light RSS summary to see if there's anything new or updated, then pull down just what it needs. My current RSS is about 40K, and this would chop it down to just over 1K.
This is even technically possible in RSS 2.0 through the RSS extensibility model. Here's a rough example (I've used the namespace RSSI for "RSS Index"):
<rss>
<channel><title>example blog</title>
<item>
<title>Simple RSS feed not supported</title>
<description>This feed requires RSSI support. Please use an RSSI-compliant aggregator such as...</description>
</item>
<rssi:item checksum="8b7538a34e0156c" ref="http://www.tempuri.org/blogs/rss.aspx?item=8924" />
<rssi:item checksum="a34e0156c8b7538" ref="http://www.tempuri.org/blogs/rss.aspx?item=8925" />
</rss>
Eventually - maybe RSS 3.0 - we add checksum and ref as optional attributes of the <item> tag.
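To make the consumption side concrete, here's a rough Python sketch of how an aggregator might poll an RSSI-style index. The rssi namespace URI, the index URL, and the store() function are all hypothetical:

import urllib.request
import xml.etree.ElementTree as ET

RSSI_NS = "http://www.tempuri.org/rssi"              # hypothetical namespace URI
INDEX_URL = "http://www.tempuri.org/blogs/rss.aspx"  # hypothetical index feed
seen = {}  # item ref -> checksum from the last poll

def store(item_xml):
    pass  # placeholder: hand the full item to the aggregator's database

def poll():
    with urllib.request.urlopen(INDEX_URL) as response:
        root = ET.fromstring(response.read())
    for item in root.iter("{%s}item" % RSSI_NS):
        ref, checksum = item.get("ref"), item.get("checksum")
        if seen.get(ref) == checksum:
            continue  # checksum unchanged - skip the download entirely
        with urllib.request.urlopen(ref) as response:
            store(response.read())
        seen[ref] = checksum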
This would also simplify some other things, like synchronizing RSS information between computers. Dare has proposed something similar with SIAM (Synchronization of Information Aggregators using Markup), but he's only looking at how to synchronize the RSS at the aggregator level, once it's come down. If RSS supported this natively, SIAM would be even simpler to implement.
Another Idea I Considered, Then Rejected
I was thinking about other HTTP tricks that could reduce the amount of RSS going across the wire, and I briefly considered partial downloads. The idea is that the server would return a complex ETag giving individual checksums and byte ranges for each post. This is possible since the ETag format is very loosely defined. The aggregator could then request specific byte ranges within the RSS using the HTTP Range header. This would work just fine for static, file-based RSS, but it would be inefficient for dynamic RSS, since it would cause multiple individual requests to the same big RSS page.
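For what it's worth, the mechanics would look something like this sketch (the URL and byte offsets are made up):

import urllib.request

FEED_URL = "http://www.tempuri.org/blogs/rss.aspx"  # hypothetical feed URL

def fetch_range(url, start, end):
    # request only the bytes covering one post within the big RSS file
    request = urllib.request.Request(url)
    request.add_header("Range", "bytes=%d-%d" % (start, end))
    with urllib.request.urlopen(request) as response:
        # a 206 (Partial Content) status means the server honored the range
        return response.read()

# e.g. pull just the post the ETag said lives at bytes 2048-4095
post = fetch_range(FEED_URL, 2048, 4095)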
Another technology I considered and rejected was using XInclude and XPointer to assemble the RSS (there's an MSDN article on the topic). Here's a syntax sample:
<xi:include href="http://www.tempuri.org/rss.aspx" xpointer="xpointer(//feed[@author='WTDoor'])"/>
It's interesting, but doesn't help when all the data's in one big XML file.
And what's the deal with OPML?
While I'm at it, someone needs to Pimp My OPML. OPML is a great way to pull down a list of feeds, but once you've got them, you can't keep them in sync. It'd be great if aggregators could subscribe to OPML feeds, and if someone would write an OPML diff tool, and...
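The diff tool, at least, is an afternoon project. Here's a rough Python sketch that compares the feed URLs in two OPML files (the file names are made up):

import xml.etree.ElementTree as ET

def feed_urls(opml_path):
    # OPML lists feeds as <outline xmlUrl="..."/> elements under <body>
    tree = ET.parse(opml_path)
    return {o.get("xmlUrl") for o in tree.iter("outline") if o.get("xmlUrl")}

old = feed_urls("subscriptions-old.opml")  # made-up file names
new = feed_urls("subscriptions-new.opml")
print("Added:  ", sorted(new - old))
print("Removed:", sorted(old - new))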
And also... Happy Birthday to me. and my mom. and my daughter. I was born on my mom's birthday, and we cheated a bit by scheduling a c-section for our daughter, Esther. She's one today, I'm 34, and my mom's not telling.