finding a fault-tolerant HTML parser for iPhone SDK

A new SelfSolved problem is ready for perusal:

A couple of my iPhone projects require a decent HTML/XHTML parser. On OS X, Cocoa ships with NSXMLDocument, which includes dirty HTML parsing functionality from libtidy. Unfortunately, NSXMLDocument is not part of the actual iPhone 2.2 SDK (though it is part of the 2.2 Simulator — so it’ll compile just fine at dev time but break when deploying — a big gotcha if you never tested against a real iPhone).

NSXMLParser is a part of the iPhone SDK…This is not a reasonable alternative.

Check out my writeup at SelfSolved #42: HTML or XHTML Parser for iPhone SDK 2.x

Finally, all out of all the potential alternatives I found (all referenced at the SelfSolved writeup — including one that requires a license fee to use), this one seems to be the most promising and requires the least amount of pain (read: interaction with the libxml C API — god knows I’ve done enough of that while building prototypes at Yahoo! Research Berkeley)

IE 6 renders a blank page on XHTML-style script end tag

10-01-2011: And the world slides backwards. I believe all major browsers, including the latest Firefox and Safari, now have this behavior. If you see a blank XHTML-served-as-HTML page in Safari or Firefox, check the script tags and make sure they are not self-closing: always use <script> ... </script>

On IE 6, a well-formed and validated web page may be rendered as a blank page if you close <script> tags in XHTML style. As in, <script type="text/javascript" ... src="foo.js" />, rather than the HTML style <script type="text/javascript" ... src="foo.js"></script>

So one of my web pages renders great in Safari and Firefox, but in IE 6, it is a completely blank page, devoid of content. Puzzled, I ran it through the W3C validator – no problem at all. Selected a View Source in IE, and noted that the entire HTML output looked OK.

Eventually I narrowed down the problem to a <script> tag in the markup. Namely, a <script type="text/javascript" ... src="foo.js" /> kind of tag. IE rendered the page when I removed the tag, and goes blank when I put it back. Curiously enough, I hadn’t actually invoked any functions from that .js file, so it was definitely not any code I was executing. Replacing the .js file with a dummy .js file also triggered the blank page. Changing or omitting the other attributes did not help.

The problem is fairly obvious now. When I close the tag in HTML style, with an actual </script> tag, IE proceeds to render just fine.

The obvious conclusion is that IE is buggy, but that may not necessarily true (well, in this one instance anyway). Despite most pages’ “compliance” with XHTML, DOCTYPE’ed and all, most web servers still serve these “XHTML” files as mimetype text/html instead of the recommended application/xhtml+xml. This is pragmatic, since IE 6 doesn’t even bother to render application/xhtml+xml, and user agents are required to stop rendering upon encountering non-valid markup (imagine the chaos that would cause).

However, it seems this might introduce a cause for the gotcha. Interpreted in actual text/html mode, one might imagine that to a HTML parser, <script .... /> doesn’t really appear to close the <script> tag at all – in fact, it might merely look like a rather malformed start script tag and no end tag. If I were a dumbly compliant parser+renderer, I might just start walking down the response string looking for that mythical end to this start tag. And end up rendering nothing. Of course, if I were a slightly smarter parser, I would look for a DOCType, but then I’d contradict the server’s mimetype, and down that road lies even more madness.

Nevertheless, the solution, when staring at a blank page in IE when the markup seems fine, is to check your script tags, if any.

I’m no expert at this soup of SGML/HTML/XHTML/XML standards thing, so the above is just my random opinion plus some observations. Still, it seems that MS should patch this particular problem, since it’s fairly non-obvious (many people, I’d surmise, would use the this kind of shorthand close tag in an XHTML file, especially since it validates fine) and upsets the status quo compromise of incremental Web standards compliance through browser compliance modes, content negotiation, and occasionally bad mimetype service. But of course, that’s never going to happen.

Update: I’ve been made aware in the comments section that the same issue occurs in IE 7. Just great.