Google has posted a very interesting analysis of the code and authoring techniques used in over one billion documents here. Much of the data they collected seems pretty obvious (e.g., the abundance of the "a" and "img" elements), but it's still an interesting read.