Hi all,
tl;dr? Scroll to the bottom of the post.
Those who clicked through to this post are probably familiar with how most HTML scrapers (Google searches, IMDb lookups, etc.) seem to work: download the file, fseek or $bfind to the point(s) of interest, read a bit, and repeat until done.
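To make that concrete, here's a stripped-down example of the kind of thing I mean (made-up marker and file names, not my actual code):

  alias scrape.title {
    ; read the saved page into a binary variable
    bread page.html 0 $file(page.html).size &page
    ; seek to a (made-up) marker and pull out whatever sits before the closing tag
    var %marker = <span class="title">
    var %start = $bfind(&page, 1, %marker)
    if (!%start) { echo -ag no match | return }
    var %from = $calc(%start + $len(%marker))
    var %end = $bfind(&page, %from, </span>)
    if (!%end) { echo -ag no closing tag | return }
    echo -ag title: $bvar(&page, %from, $calc(%end - %from)).text
  }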
I was hoping to get something a bit more 'proper' so that much of that seeking around can be dropped.
I have written a small HTML parser that pretty much follows the HTML5 rules for void elements, elements that don't need a closing tag, and all the usual suspects you might find in the wild, and it distills the output to an INI file so that the results can easily be referenced. It's not DOM-compliant, but it does what I need it to do.
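Purely as an illustration of what I mean by 'referenced' (these section/item names are made up, not my actual output format): if the parser flattened, say, a <div class="content"> element into a section like

  [body.div.3]
  tag=div
  class=content

then any script could pull values straight back out with $readini:

  //echo -ag class: $readini(parsed.ini, n, body.div.3, class)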
However, despite attempts to make it go faster (a few milliseconds shaved here, a few there), a 70k page still takes about 10 seconds to fully parse. (For the curious: no, a hash table isn't any faster than writeini, as long as I don't forget to flushini when done, and writeini saves the instructions otherwise spent concatenating things for hash table storage.)
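To be concrete about those two storage options (dummy loops with made-up names, not my actual parser code), the point being to hold off flushini until everything has been written:

  alias demo.writeini {
    var %i = 1
    while (%i <= 1000) {
      ; buffered writes; nothing needs to hit the disk yet
      writeini parsed.ini elements item $+ %i value $+ %i
      inc %i
    }
    ; flush the INI data out in one go at the end
    flushini parsed.ini
  }
  alias demo.hash {
    if (!$hget(parsed)) hmake parsed 100
    var %i = 1
    while (%i <= 1000) {
      hadd parsed item $+ %i value $+ %i
      inc %i
    }
  }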
So I went looking for alternatives, but Google is throwing up a bunch of almost completely useless results.
The one result that was very useful was domxml.dll, written several years back. It works, but unfortunately it expects well-formed XML, which HTML tends not to be - which means the DOM, tree and XPath methods are useless, while the SAX method falls apart on some style directives / script code. This is something I can mostly work around (inserting
tl;dr version:
1. Are there any existing, fast, full HTML parsers?
2. If not, is there in fact a faster method than incrementally seeking/$bfind'ing an element's opening tag head, tag tail, closing tag head, tag tail, and so on (a rough sketch of what I mean follows below)? I was thinking a regular expression might be faster, but those only take regular strings - and an element may very well span more than the variable size limit (e.g. a 'content' DIV that wraps user comments).
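For clarity, point 2 refers to walking the page with loops of roughly this shape (simplified, made-up element name, nesting not handled):

  alias walk.divs {
    ; read the saved page into a binary variable
    bread page.html 0 $file(page.html).size &page
    var %pos = 1
    while ($bfind(&page, %pos, <div)) {
      var %head = $v1
      ; the > that ends the opening tag
      var %headend = $bfind(&page, %head, >)
      if (!%headend) { break }
      ; the closing tag (naive pairing - a nested div would break this)
      var %close = $bfind(&page, %headend, </div)
      if (!%close) { break }
      var %closeend = $bfind(&page, %close, >)
      if (!%closeend) { break }
      echo -ag div at %head - contents span bytes $calc(%headend + 1) to $calc(%close - 1)
      var %pos = $calc(%closeend + 1)
    }
  }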
Thanks for any hints!