Hi all,

tl;dr? Scroll to the bottom of this post.

Those who clicked through to this post are probably familiar with how most HTML scrapers (Google searches, IMDb lookups, etc.) seem to work: download the file, fseek or $bfind to the point(s) of interest, read a bit, repeat until done.
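
For those who aren't, that pattern boils down to something like the following sketch (page.html and the title tags are just placeholder targets):

alias scrape {
  ; read the whole page into a binary variable
  bunset &page
  bread page.html 0 $file(page.html).size &page
  ; seek to the point of interest, then read a bit
  var %head = $bfind(&page, 1, <title>)
  if (!%head) return
  var %tail = $bfind(&page, %head, </title>)
  if (!%tail) return
  echo -a title: $bvar(&page, $calc(%head + 7), $calc(%tail - %head - 7)).text
}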

I was hoping to get something a bit more 'proper' so that most of the seeking around can be dropped.

I have written a small HTML parser that pretty much follows the rules laid out for HTML5 in terms of void elements, elements that don't need a closing tag, and all the other usual suspects you might find in the wild, and it distills the output to an INI file so that the results can easily be referenced. It's not DOM-compliant, but it does what I need it to do.
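
To give an idea of what I mean, each parsed node ends up as its own section, something along these lines (the exact layout here is made up for illustration):

[node42]
tag=div
parent=node7
attr.class=content
text=first bit of the node's text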

However, despite attempts to make it go faster (a few milliseconds here, a few milliseconds there), a 70 KB page still takes about 10 seconds to fully parse. (For the curious: no, a hash table isn't any faster than writeini, as long as I don't forget to flushini when done, and writeini saves the instructions otherwise spent concatenating things for hash table storage.)
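
In other words, this pattern turned out to be fine (a sketch; html.ini and the item names are placeholders):

; mIRC buffers INI writes in memory, so per-node writeini calls are cheap;
; what matters is forcing the write to disk only once, at the end
writeini html.ini %node tag %tag
writeini html.ini %node parent %parent
flushini html.ini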

So I went looking for some alternatives, but Google is throwing up a bunch of almost completely useless results.

The one result that was very useful was domxml.dll, written several years back. It works, but unfortunately it expects well-formed XML, which HTML tends not to be - which means the DOM, tree and XPath methods are useless, while the SAX method falls apart on some style directives / script code. That much I can mostly work around (inserting <![CDATA[ sections where appropriate in a fast pre-parser). However, when I ran it against a file from 'in the wild', the server I was connected to told me that 'ELSE' was not a valid command. There was no 'else' in my code, but there was one in the file being parsed (part of a piece of JavaScript), and a quick test showed that the DLL was actually ending up evaluating that line. Ouch - major security issue there.
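
For reference, the pre-parser I have in mind is roughly the following (a sketch only: it assumes the page is in &in, writes to &out, and only handles <script> blocks - <style> would work the same way):

alias wrapcdata {
  ; build the CDATA markers with $chr() so the [ ] can't be taken
  ; for evaluation brackets
  var %cdo = <! $+ $chr(91) $+ CDATA $+ $chr(91)
  var %cdc = $chr(93) $+ $chr(93) $+ >
  bunset &out
  var %p = 1, %o = 1, %open = $bfind(&in, 1, <script)
  while (%open) {
    ; end of the opening tag, start of the closing tag
    var %gt = $calc($bfind(&in, %open, >) + 1)
    var %close = $bfind(&in, %gt, </script)
    if (!%close) break
    ; copy everything up to and including the opening tag
    bcopy &out %o &in %p $calc(%gt - %p)
    inc %o $calc(%gt - %p)
    ; open the CDATA section, copy the script body, close it
    bset -t &out %o %cdo
    inc %o $len(%cdo)
    if (%close > %gt) { bcopy &out %o &in %gt $calc(%close - %gt) | inc %o $calc(%close - %gt) }
    bset -t &out %o %cdc
    inc %o $len(%cdc)
    var %p = %close
    var %open = $bfind(&in, %close, <script)
  }
  ; copy whatever is left after the last script block
  bcopy &out %o &in %p $calc($bvar(&in, 0) - %p + 1)
}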

Thus I'm back where I left off, which is either doing the seek-a-thon (fast, but annoying and inflexible to write) or accepting that mIRC is unresponsive for several seconds while a file is parsed.

One thing I haven't tried is whether $bfind is faster on smaller variables. E.g. if I chop off the leading part (which I have already parsed), would $bfind go faster, and if so, fast enough to offset now having to bcopy? But given that $bfind is pretty fast to begin with, I doubt that's where I'm going to find major speed increases :)
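
If anyone feels like testing that, this is roughly the benchmark I have in mind (page.html and the </div needle are placeholders):

alias bfindbench {
  bunset &full &tail
  bread page.html 0 $file(page.html).size &full
  ; pretend the first half has already been parsed: &tail is the same
  ; data with the leading half chopped off via bcopy
  var %half = $int($calc($bvar(&full, 0) / 2))
  bcopy -c &tail 1 &full %half $calc($bvar(&full, 0) - %half + 1)
  var %t = $ticks, %i = 1
  while (%i <= 1000) { noop $bfind(&full, %half, </div) | inc %i }
  echo -a full buffer: $calc($ticks - %t) ms
  var %t = $ticks, %i = 1
  while (%i <= 1000) { noop $bfind(&tail, 1, </div) | inc %i }
  echo -a chopped buffer: $calc($ticks - %t) ms
}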

tl;dr version:
1. Are there any existing, fast, full HTML parsers?
2. If not, is there in fact a faster method than incrementally seeking/$bfind'ing an element's opening tag head, tag tail, closing tag head, tag tail, and so forth and so on? I was thinking a regular expression might be faster, but those only take regular strings, and an element may very well span more than the variable size limit (e.g. a 'content' DIV which wraps user comments) - a rough windowed-regex sketch follows below.
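
The only way I can see regex working at all is a sliding window over the binvar, something like this (the pattern and sizes are made up, and an element longer than the window would still break it):

alias chunkregex {
  ; slide a ~4000 byte text window over &page, overlapping by more
  ; than the longest match we expect so nothing is lost on a boundary
  var %pos = 1, %win = 4000, %overlap = 100, %len = $bvar(&page, 0)
  while (%pos <= %len) {
    var %chunk = $bvar(&page, %pos, %win).text
    if ($regex(%chunk, /<div class="content">/)) echo -a hit near byte %pos
    var %pos = $calc(%pos + %win - %overlap)
  }
}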

Thanks for any hints!