finding a fault-tolerant HTML parser for iPhone SDK

A new SelfSolved problem is ready for perusal:

A couple of my iPhone projects require a decent HTML/XHTML parser. On OS X, Cocoa ships with NSXMLDocument, which includes dirty HTML parsing functionality from libtidy. Unfortunately, NSXMLDocument is not part of the actual iPhone 2.2 SDK (though it is part of the 2.2 Simulator — so it’ll compile just fine at dev time but break when deploying — a big gotcha if you never tested against a real iPhone).

NSXMLParser is a part of the iPhone SDK…This is not a reasonable alternative.

Check out my writeup at SelfSolved #42: HTML or XHTML Parser for iPhone SDK 2.x

Finally, out of all the potential alternatives I found (all referenced at the SelfSolved writeup, including one that requires a license fee to use), this one seems to be the most promising and requires the least amount of pain (read: interaction with the libxml C API; god knows I’ve done enough of that while building prototypes at Yahoo! Research Berkeley).

IE 6 renders a blank page on XHTML-style script end tag

10-01-2011: And the world slides backwards. I believe all major browsers, including the latest Firefox and Safari, now have this behavior. If you see a blank XHTML-served-as-HTML page in Safari or Firefox, check the script tags and make sure they are not self-closing: always use <script> ... </script>

On IE 6, a well-formed and validated web page may be rendered as a blank page if you close <script> tags in XHTML style. As in, <script type="text/javascript" ... src="foo.js" />, rather than the HTML style <script type="text/javascript" ... src="foo.js"></script>

So one of my web pages renders great in Safari and Firefox, but in IE 6, it is a completely blank page, devoid of content. Puzzled, I ran it through the W3C validator: no problem at all. I selected View Source in IE and noted that the entire HTML output looked OK.

Eventually I narrowed down the problem to a <script> tag in the markup. Namely, a <script type="text/javascript" ... src="foo.js" /> kind of tag. IE rendered the page when I removed the tag, and went blank when I put it back. Curiously enough, I hadn’t actually invoked any functions from that .js file, so it was definitely not any code I was executing. Replacing the .js file with a dummy .js file also triggered the blank page. Changing or omitting the other attributes did not help.

The problem is fairly obvious now. When I close the tag in HTML style, with an actual </script> tag, IE proceeds to render just fine.

The obvious conclusion is that IE is buggy, but that may not necessarily be true (well, in this one instance anyway). Despite most pages’ “compliance” with XHTML, DOCTYPE’ed and all, most web servers still serve these “XHTML” files as mimetype text/html instead of the recommended application/xhtml+xml. This is pragmatic, since IE 6 doesn’t even bother to render application/xhtml+xml, and user agents that do handle it as XML are required to stop rendering upon encountering markup that is not well-formed (imagine the chaos that would cause).
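To make that strictness concrete, here is a minimal sketch that uses Python's xml.etree.ElementTree as a stand-in for an XML-mode user agent. That stand-in is an assumption for illustration only (no browser uses this parser), and the sample markup and the name render_as_xml are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample pages, not taken from any real site.
WELL_FORMED = (
    '<html><head><script type="text/javascript" src="foo.js" /></head>'
    '<body><p>Hello</p></body></html>'
)
MALFORMED = '<html><body><p>Hello<br></p></body></html>'  # <br> is never closed


def render_as_xml(markup):
    """Return the page text as an XML parser sees it, or None on a fatal error."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        # An XML processor must not carry on past a well-formedness error.
        return None
    return " ".join("".join(root.itertext()).split())


print(render_as_xml(WELL_FORMED))  # Hello
print(render_as_xml(MALFORMED))    # None
```

In XML mode the self-closing script tag is a perfectly ordinary empty element, and the body text survives. One unclosed tag elsewhere, though, and nothing is rendered at all.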

However, it seems this might introduce a cause for the gotcha. Interpreted in actual text/html mode, one might imagine that to an HTML parser, <script .... /> doesn’t really appear to close the <script> tag at all; in fact, it might merely look like a rather malformed start script tag with no end tag. If I were a dumbly compliant parser+renderer, I might just start walking down the response string looking for that mythical end to this start tag. And end up rendering nothing. Of course, if I were a slightly smarter parser, I would look for a DOCTYPE, but then I’d contradict the server’s mimetype, and down that road lies even more madness.
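The guess above can be sketched as a toy model. This is emphatically not IE's parser or any real HTML parser, just a few regular expressions that follow the rule I'm speculating about: once a script start tag is seen (trailing slash or not), everything up to </script> is treated as script source. The function name and sample pages are made up for illustration.

```python
import re

# Toy model only: real HTML tokenizers handle many cases these patterns ignore
# (for instance, a ">" inside an attribute value).
SCRIPT_OPEN = re.compile(r"<script\b[^>]*>", re.IGNORECASE)
SCRIPT_CLOSE = re.compile(r"</script\s*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]*>")


def visible_text_as_html(markup):
    """Return the text a naive text/html renderer would end up showing."""
    kept = []
    pos = 0
    while True:
        start = SCRIPT_OPEN.search(markup, pos)
        if start is None:
            kept.append(markup[pos:])
            break
        kept.append(markup[pos:start.start()])
        # The trailing "/" is ignored, so the element is still open and
        # everything up to </script> is swallowed as script source.
        end = SCRIPT_CLOSE.search(markup, start.end())
        if end is None:
            break  # no end tag anywhere: the rest of the page is swallowed
        pos = end.end()
    text = ANY_TAG.sub("", "".join(kept))
    return " ".join(text.split())


xhtml_style = ('<html><head><script type="text/javascript" src="foo.js" />'
               '</head><body><p>Hello</p></body></html>')
html_style = ('<html><head><script type="text/javascript" src="foo.js"></script>'
              '</head><body><p>Hello</p></body></html>')

print(repr(visible_text_as_html(xhtml_style)))  # ''
print(repr(visible_text_as_html(html_style)))   # 'Hello'
```

Under this model the XHTML-style page comes out empty, which at least matches the blank page I was staring at.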

Nevertheless, the solution, when staring at a blank page in IE when the markup seems fine, is to check your script tags, if any.
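If there are many templates to check, the hunt can be automated. A rough sketch, assuming a regular expression is good enough for the job (it will miss exotic cases such as unquoted attribute values ending in a slash); the function name is my own invention.

```python
import re

# Matches a <script ...> start tag whose last character before ">" is "/".
SELF_CLOSING_SCRIPT = re.compile(r"<script\b[^>]*/\s*>", re.IGNORECASE)


def find_self_closing_scripts(markup):
    """Return (line_number, tag_text) for every XHTML-style <script ... /> tag."""
    hits = []
    for match in SELF_CLOSING_SCRIPT.finditer(markup):
        line_number = markup.count("\n", 0, match.start()) + 1
        hits.append((line_number, match.group(0)))
    return hits


page = (
    "<html>\n"
    "<head>\n"
    '<script type="text/javascript" src="foo.js" />\n'
    '<script type="text/javascript" src="lib/bar.js"></script>\n'
    "</head>\n"
    "</html>\n"
)

for line_number, tag in find_self_closing_scripts(page):
    print(line_number, tag)
# 3 <script type="text/javascript" src="foo.js" />
```

Only the first tag is flagged; the second has a proper </script> and the slash inside its src path does not count.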

I’m no expert at this whole soup of SGML/HTML/XHTML/XML standards, so the above is just my random opinion plus some observations. Still, it seems that MS should patch this particular problem, since it’s fairly non-obvious (many people, I’d surmise, would use this kind of shorthand close tag in an XHTML file, especially since it validates fine) and upsets the status quo compromise of incremental Web standards compliance through browser compliance modes, content negotiation, and occasionally bad mimetype service. But of course, that’s never going to happen.

Update: I’ve been made aware in the comments section that the same issue occurs in IE 7. Just great.

download servers and the Web Developer extension


One nifty thing that the Web Developer extension for Firefox can do is live HTML editing, on the currently loaded page. The feature is activated via the toolbar, under the Miscellaneous button, via the item “Edit HTML”. It pops up a text box containing the current page’s HTML. Edit to your heart’s content, and hit the Apply button (the blue-with-green-arrow button beside the search box; not exactly the most obvious icon for “Apply”, but that’s a UI critique for another time). The currently loaded web page will reflect your changes.

Obviously it will stick around only until you load some other page, since you are not actually editing the web page on the remote server itself. So how is this useful?

So MegaShares is one of those sketchy file hosting and download sites, akin to Rapidshare, MegaUpload, etc. I had a problem here where some files were served from storage machine #21, which was apparently overloaded or just not configured right: it would start the download fine, but the download would gradually stall before completion. Wacky. There appeared to be some redundancy, however, and I wondered if I could grab the file from another server by changing the machine number in the URL.

Unfortunately, as most of these places do, they prohibit direct access to a file without going through their UI, so I couldn’t just take the download URL, change the machine number, and pop it in the browser. I assumed they were checking referrers, so I spoofed the REFERER field. No luck.
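For the record, spoofing the referrer just means sending a Referer header of your own choosing along with the request. A minimal sketch with Python's urllib.request follows. Both URLs are made-up placeholders rather than MegaShares' real URL scheme, the function name is hypothetical, and nothing is actually sent over the network here.

```python
from urllib.request import Request

# Placeholder URLs for illustration only; they do not reflect any real site.
DOWNLOAD_URL = "http://storage3.example.com/files/archive.zip"
LANDING_PAGE = "http://www.example.com/download-page"


def build_spoofed_request(download_url, referer_url):
    """Build (but do not send) a request that claims to come from referer_url."""
    request = Request(download_url)
    request.add_header("Referer", referer_url)
    return request


request = build_spoofed_request(DOWNLOAD_URL, LANDING_PAGE)
print(request.full_url)
print(request.get_header("Referer"))
```

A server that only looks at this one header would be fooled. The fact that it didn't work suggests the site checks something more than that.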

You can see where I’m going with this. Enter the Web Developer extension. I used the Edit HTML feature to change the URL on the page directly, and clicked through the changed link. Success! Their script accepts this action, and the download starts from machine #3. Whatever referrer check or scripting magic that they use to enforce their no-direct-access policy is still intact, since the rest of the page has not changed.

Obviously this is a specific example: if there were no storage redundancy at MegaShares, this trick would have been useless. Nevertheless, it demonstrates the power of live-editing a loaded page, in your browser. Extensions like Greasemonkey are the pinnacle of this kind of editing, but for a once-off adjustment, one doesn’t really need the power of a full scripting environment like that.

Not quite a real Read/Write Web, but an interesting trick to keep in mind.