Archives

Archives / 2012 / October
  • Html Agility Pack for Reading “Real World” HTML

    In an ideal world, all data you need from the web would be available via well-designed services. In the real world you sometimes have to scrape the data off a web page. Ugly, dirty – but if you really want that data, you have no choice.

    Just don’t write (yet another) HTML parser.

    I stumbled across the Html Agility Pack (HAP) a long time ago, but just now had the need for a robust way to read HTML.

    A quote from the website:

    This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

    Using the HAP was a simple matter of getting the Nuget package, taking a look at the example and dusting off some of my XPath knowledge from years ago.

    The documentation on the Codeplex site is non-existing, but if you’ve queried a DOM or used XPath or XSLT before you shouldn’t have problems finding your way around using Intellisense (ReSharper tip: Press Ctrl+Shift+F1 on class members for reading the full doc comments).