Re: Parsing HTML the proper way - existing solutions? #236313 17/02/12 12:03 PM
jaytea (Fjord artisan) Joined: Feb 2006 Posts: 523

hi there, nice post!

Originally Posted By: Steeeve
Those who clicked onward to this post will probably be familiar with how most HTML-scrapers (google searches, imdb lookups, etc.) seem to work.. download the file, fseek or $bfind to the point(s) of interest, read a bit, repeat until done. I was hoping to get something a bit more 'proper' so that much of the seeking around is dropped.

the problem, as i'm sure you've realized, is that if you elect to use mIRC's sockets to fetch specific data from a web server, 99% of the time it will be most efficient to handle that data as it comes in (line-wise, or byte-wise) instead of collecting it and parsing it formally. these solutions tend to be written on a case-by-case basis and are horribly inflexible, as you noted. we resort to these basic methods because, to a layman, they most easily express and identify the portion of the webpage that we want to fetch.

a more formal HTML parser that could be used in our scripts would be great if more of us knew how to use one! for those that don't, i suspect that such a parser, used alongside a feature akin to Chrome's "Inspect Element" (scripter views the webpage inside a custom @window and clicks / highlights the areas they want to fetch), would revolutionize the way scripters create these socket scripts. but that's something else altogether ;P

Originally Posted By: Steeeve
( For the curious - no, a hash table isn't any faster than writeini, as long as I don't forget to flushini when done, and it saves instructions for concatenating things for hash table storage. )

indeed, they are on the same order of magnitude with respect to execution time. but it's worth noting that calls to /hadd & $hget() are actually about twice as quick as similar calls to /writeini & $readini(), owing to the extra argument and possibly extra work involved in referencing the internal hash table associated with the INI section vs. the scripter's hash table.
still, i don't expect that these calls constitute a significant portion of the overall load your script places on the scripting engine, so just use whatever you feel is most suitable.

Originally Posted By: Steeeve
However, when I ran it against a file from 'in the wild', the server I was connected to told me that 'ELSE' was not a valid command. There was no 'else' in my code, but there was in the file being parsed (part of a piece of javascript) and a quick test showed that the DLL was actually ending up evaluating that line. Ouch - major security issue there.

i'll say! are you sure the code being executed is an unavoidable side effect of calling the DLL? i can't see how that could be at all desirable.

Originally Posted By: Steeeve
One thing I haven't tried is whether $bfind is faster on smaller variables. E.g. if I chop off the leading part (which I have already parsed), would $bfind go faster.. and if so, fast enough to offset now having to bcopy.. but given that bfind is pretty fast to begin with, I doubt that's where I'm going to find major speed increases

your subsequent $bfind()s would presumably begin at later positions rather than searching from the start again (N = 1). keeping track of these advancing positions and having mIRC seek to them on the next call to $bfind() will be substantially faster than chopping up your binvar, copying etc. $bfind() only takes as long as it needs to find the data from the position you specify, easy peasy ;P

a little advice though: search using byte values rather than text. mIRC does not handle text searches very well - they can be up to ~10 times slower.

Originally Posted By: Steeeve
2. If not, is there in fact a faster method than incrementally seeking/$bfind'ing an element's opening tag head, tag tail, closing tag head, tag tail, and so forth and so on? I was thinking a regular expression may be faster but those only take regular strings.. and an element may very well span more than the variable size limit (e.g. a 'content' DIV which wraps user comments).

sure, you can use $regex() to operate on the page in 4kb (4,150 characters, more precisely) chunks. you can even manipulate the expression in such a way that $regml() returns results in a conveniently predictable way. here's a simple example:

Code:
alias htmltags {
  ; alternatives: tail of a tag cut off by the previous chunk plus the text after it,
  ; then each complete tag (name, attributes, following text),
  ; then an unclosed tag at the very end of the input
  noop $regex($1-, /^([^<]*>|)([^<>]*)|<(/?[a-z]+) ?([^>]*)>([^<>]*)|(<[^>]*|)$\K/g)
  if ($regml(1)) echo -a End of tag from previous line: $v1
  if ($regml(2) != $null) echo -a Data outside: $v1
  ; complete tags start at $regml(3) and take three captures each
  var %i = 3, %n = $regml(0)
  while (%i < %n) {
    echo -a Tag $calc(%i / 3) : $regml(%i)
    inc %i
    echo -a Data inside: $regml(%i)
    inc %i
    echo -a Data outside: $regml(%i)
    inc %i
  }
  if ($regml(%n)) echo -a Unclosed tag at end: $v1
}

/htmltags a>b<c d="e">f</g>h<i

Quote:
End of tag from previous line: a>
Data outside: b
Tag 1 : c
Data inside: d="e"
Data outside: f
Tag 2 : /g
Data inside:
Data outside: h
Unclosed tag at end: <i

so each full tag always occupies 3 values of $regml(), partial tags occupy the first and last matches, and $regml(2) is the first substring outside of any tag. this could certainly be faster for you than fiddling with $bfind(). you could take the partial tag at the end and slap it on to the start of the next chunk to simplify things.

good luck and keep us posted!

"The only excuse for making a useless script is that one admires it intensely" - Oscar Wilde
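
a rough sketch of that last suggestion (carrying the unclosed tag over to the start of the next chunk), written in Python rather than mIRC script only so that it can be run outside the client. the names scan_chunks and TOKEN and the shape of the event tuples are made up for illustration, and the pattern is a simplified cousin of the one in the post, not a port of it: it assumes reasonably well-formed markup and makes no attempt to handle comments or script blocks.

```python
import re

# One complete tag (name, attribute text) or a run of text outside any tag.
TOKEN = re.compile(r"<(/?[a-zA-Z][a-zA-Z0-9]*)\s*([^>]*)>|([^<]+)")


def _tokens(data):
    """Yield ("tag", name, attrs) or ("text", text) for each match in data."""
    for m in TOKEN.finditer(data):
        if m.group(1):
            yield ("tag", m.group(1), m.group(2))
        else:
            yield ("text", m.group(3))


def scan_chunks(chunks):
    """Scan an iterable of string chunks, carrying unclosed tags forward."""
    carry = ""
    for chunk in chunks:
        data = carry + chunk
        cut = data.rfind("<")
        if cut != -1 and ">" not in data[cut:]:
            # The last "<" has no closing ">" yet: hold it back for the next chunk.
            data, carry = data[:cut], data[cut:]
        else:
            carry = ""
        yield from _tokens(data)
    if carry:
        # Input ended in the middle of a tag.
        yield ("partial", carry)


if __name__ == "__main__":
    # The same small page, split at arbitrary points as a socket might deliver it.
    for event in scan_chunks(['<p cla', 'ss="x">h', 'i</p><b', '>yo</b>']):
        print(event)
```

the tags come out the same however the input is split. text that straddles a chunk boundary is reported as two separate text events in this sketch, so a caller that cares would need to join adjacent text events itself.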
