doesntsuck.com: crap-tolerant html parsers in python

September 01, 2004

crap-tolerant html parsers in python

http://www.crummy.com/software/BeautifulSoup/
You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.

Neither does this parser.
Beautiful Soup
Introduction

There are lots of parsers for HTML and XML documents. They are great if you care about the document as a whole. They've got all sorts of uses: you can transform the document, index it, display its structure, or process its content.

But what if you only care about part of a document? What if you just want to extract some information and leave? You have two choices. You can use regular expressions. ("Now you have two problems.") Or you can parse the document and use information about the structure to find the parts you're interested in. The trouble is, it gets tiresome traversing the parse tree looking for the parts you want. After a while, even regular expressions start to look pretty good.

Wouldn't it be nice if there were a parser that could do the tree traversal stuff for you? You could tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose URLs match 'foo.com'", or "Find the table heading that's got bold text, then give me that text."

Beautiful Soup can do all this--and less. It won't choke if you give it ill-formed markup: it'll just give you access to a correspondingly ill-formed data structure. It doesn't care if you give it fake HTML tags or if the namespaces are wrong. It accepts that you're doing this to get some data into a more usable format. It appreciates that if the data were well-formed to begin with, you probably wouldn't be doing what you're doing.
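
To make those four requests concrete, here is a rough sketch of how they might look. It assumes the current bs4 package on Python 3 (installed with pip install beautifulsoup4), not the 2004 release linked above, whose API differed. The sample markup, the URLs in it, and the variable names are made up for illustration.

```python
import re

from bs4 import BeautifulSoup  # assumption: modern bs4, not the 2004 release

# Deliberately sloppy markup: unclosed <p>, unquoted attribute,
# missing </tr>, and a tag that does not exist in HTML.
html = """
<p>Links: <a href="http://foo.com/a" class="externalLink">Foo</a>
<a href=/local>Local</a>
<a href="http://bar.org/" class="externalLink">Bar</a>
<table><tr><th><b>Price</b></th><th>Plain</th></table>
<madeuptag>whatever
"""

# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")

# "Find all the links"
all_links = [a.get("href") for a in soup.find_all("a")]

# "Find all the links of class externalLink"
external = [a["href"] for a in soup.find_all("a", class_="externalLink")]

# "Find all the links whose URLs match 'foo.com'"
foo = [a["href"] for a in soup.find_all("a", href=re.compile(r"foo\.com"))]

# "Find the table heading that's got bold text, then give me that text"
bold_heading = next(th.b.get_text() for th in soup.find_all("th") if th.b)

print(all_links)
print(external)
print(foo)
print(bold_heading)
```

None of the broken markup needs cleaning up first; the queries just run against whatever tree the parser managed to build.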

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/286269
A simple HTML 'parser' that will 'read' through an HTML file and call functions on data, tags, and so on.
Useful if you need a straightforward parser that just extracts information from the file *or* modifies tags.

Shouldn't choke on bad HTML.
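
The same read-through-and-call-functions idea can be sketched with the html.parser module from the Python 3 standard library. This is not the recipe's code, just an illustration of the callback style; the LinkExtractor class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values and visible text as the parser walks the file."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; attrs is a list of (name, value) pairs.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Called for text between tags; skip whitespace-only runs.
        if data.strip():
            self.text_chunks.append(data.strip())


# Bad HTML on purpose: unclosed <a> and <b>, unquoted attribute, stray </i>.
bad_html = (
    '<p>Unclosed paragraph <a href="http://example.com/one">one'
    "<b>bold <a href=two.html>two</p></i>"
)

parser = LinkExtractor()
parser.feed(bad_html)
parser.close()

print(parser.links)
print(parser.text_chunks)
```

The parser never builds a tree, so there is nothing for mismatched tags to break; it just reports each tag and text run as it goes by.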

Posted by yargevad at September 1, 2004 05:57 PM