Hi all,

tl;dr? Scroll to the bottom of this post.

Those who clicked through to this post are probably familiar with how most HTML scrapers (Google searches, IMDb lookups, etc.) seem to work: download the file, fseek or $bfind to the point(s) of interest, read a bit, repeat until done.
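
For those who aren't, that pattern boils down to something like the following sketch (page.html and the title tags are just placeholder targets):

alias scrape {
  ; read the whole page into a binary variable
  bunset &page
  bread page.html 0 $file(page.html).size &page
  ; seek to the point of interest, then read a bit
  var %head = $bfind(&page, 1, <title>)
  if (!%head) return
  var %tail = $bfind(&page, %head, </title>)
  if (!%tail) return
  echo -a title: $bvar(&page, $calc(%head + 7), $calc(%tail - %head - 7)).text
}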

I was hoping to get something a bit more 'proper' so that most of the seeking around can be dropped.

I have written a small HTML parser that pretty much follows the rules laid out for HTML5 in terms of void elements, elements that don't need a closing tag, and all the other usual suspects you might find in the wild, and it distills the output to an INI file so that the results can easily be referenced. It's not DOM-compliant, but it does what I need it to do.
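
To give an idea of what I mean, each parsed node ends up as its own section, something along these lines (the exact layout here is made up for illustration):

[node42]
tag=div
parent=node7
attr.class=content
text=first bit of the node's text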

However, despite attempts to make it go faster (a few milliseconds here, a few milliseconds there), a 70 KB page still takes about 10 seconds to fully parse. (For the curious: no, a hash table isn't any faster than writeini, as long as I don't forget to flushini when done, and writeini saves the instructions otherwise spent concatenating things for hash table storage.)
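
In other words, this pattern turned out to be fine (a sketch; html.ini and the item names are placeholders):

; mIRC buffers INI writes in memory, so per-node writeini calls are cheap;
; what matters is forcing the write to disk only once, at the end
writeini html.ini %node tag %tag
writeini html.ini %node parent %parent
flushini html.ini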

So I went looking for some alternatives, but Google is throwing up a bunch of almost completely useless results.

The one result that was very useful was domxml.dll, written several years back. It works, but unfortunately it expects well-formed XML, which HTML tends not to be - which means the DOM, tree and XPath methods are useless, while the SAX method falls apart on some style directives / script code. That much I can mostly work around (inserting <![CDATA[ sections where appropriate in a fast pre-parser). However, when I ran it against a file from 'in the wild', the server I was connected to told me that 'ELSE' was not a valid command. There was no 'else' in my code, but there was one in the file being parsed (part of a piece of JavaScript), and a quick test showed that the DLL was actually ending up evaluating that line. Ouch - major security issue there.
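
For reference, the pre-parser I have in mind is roughly the following (a sketch only: it assumes the page is in &in, writes to &out, and only handles <script> blocks - <style> would work the same way):

alias wrapcdata {
  ; build the CDATA markers with $chr() so the [ ] can't be taken
  ; for evaluation brackets
  var %cdo = <! $+ $chr(91) $+ CDATA $+ $chr(91)
  var %cdc = $chr(93) $+ $chr(93) $+ >
  bunset &out
  var %p = 1, %o = 1, %open = $bfind(&in, 1, <script)
  while (%open) {
    ; end of the opening tag, start of the closing tag
    var %gt = $calc($bfind(&in, %open, >) + 1)
    var %close = $bfind(&in, %gt, </script)
    if (!%close) break
    ; copy everything up to and including the opening tag
    bcopy &out %o &in %p $calc(%gt - %p)
    inc %o $calc(%gt - %p)
    ; open the CDATA section, copy the script body, close it
    bset -t &out %o %cdo
    inc %o $len(%cdo)
    if (%close > %gt) { bcopy &out %o &in %gt $calc(%close - %gt) | inc %o $calc(%close - %gt) }
    bset -t &out %o %cdc
    inc %o $len(%cdc)
    var %p = %close
    var %open = $bfind(&in, %close, <script)
  }
  ; copy whatever is left after the last script block
  bcopy &out %o &in %p $calc($bvar(&in, 0) - %p + 1)
}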

Thus I'm back where I left off, which is either doing the seek-a-thon (fast, but annoying and inflexible to write) or accepting that mIRC is unresponsive for several seconds while a file is parsed.

One thing I haven't tried is whether $bfind is faster on smaller variables. E.g. if I chop off the leading part (which I have already parsed), would $bfind go faster, and if so, fast enough to offset now having to bcopy? But given that $bfind is pretty fast to begin with, I doubt that's where I'm going to find major speed increases :)
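
If anyone feels like testing that, this is roughly the benchmark I have in mind (page.html and the </div needle are placeholders):

alias bfindbench {
  bunset &full &tail
  bread page.html 0 $file(page.html).size &full
  ; pretend the first half has already been parsed: &tail is the same
  ; data with the leading half chopped off via bcopy
  var %half = $int($calc($bvar(&full, 0) / 2))
  bcopy -c &tail 1 &full %half $calc($bvar(&full, 0) - %half + 1)
  var %t = $ticks, %i = 1
  while (%i <= 1000) { noop $bfind(&full, %half, </div) | inc %i }
  echo -a full buffer: $calc($ticks - %t) ms
  var %t = $ticks, %i = 1
  while (%i <= 1000) { noop $bfind(&tail, 1, </div) | inc %i }
  echo -a chopped buffer: $calc($ticks - %t) ms
}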

tl;dr version:
1. Are there any existing, fast, full HTML parsers?
2. If not, is there in fact a faster method than incrementally seeking/$bfind'ing an element's opening tag head, tag tail, closing tag head, tag tail, and so forth and so on? I was thinking a regular expression might be faster, but those only take regular strings, and an element may very well span more than the variable size limit (e.g. a 'content' DIV which wraps user comments) - a rough windowed-regex sketch follows below.
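
The only way I can see regex working at all is a sliding window over the binvar, something like this (the pattern and sizes are made up, and an element longer than the window would still break it):

alias chunkregex {
  ; slide a ~4000 byte text window over &page, overlapping by more
  ; than the longest match we expect so nothing is lost on a boundary
  var %pos = 1, %win = 4000, %overlap = 100, %len = $bvar(&page, 0)
  while (%pos <= %len) {
    var %chunk = $bvar(&page, %pos, %win).text
    if ($regex(%chunk, /<div class="content">/)) echo -a hit near byte %pos
    var %pos = $calc(%pos + %win - %overlap)
  }
}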

Thanks for any hints!