Re: Parsing HTML the proper way - existing solutions? jaytea #236362 20/02/12 10:35 PM
OP Steeeve, Pikka bird, Joined: Feb 2012, Posts: 18

Originally Posted By: jaytea
a more formal HTML parser that could be used in our scripts would be great if more of us knew how to use them! for those that don't, i suspect used alongside a feature akin to Chrome's "Inspect Element" (scripter views the webpage inside a custom @window and clicks / highlights the areas they want to fetch) would revolutionize the way scripters create these socket scripts. but that's something else altogether ;P

haha - yeah, way outside the scope of what I'm going for. The XPath stuff in domxml.dll was interesting, but it breaks all over the place (XML vs HTML aside, it threw an out of memory error - oops).

Originally Posted By: jaytea
indeed, they are on the same order of magnitude with respect to execution time. but it's worth noting that calls to /hadd & $hget() are actually about twice as quick as similar calls to /writeini & $readini(), owing to the extra argument and possibly extra work involved in referencing the internal hash table associated with the INI section vs. the scripter's hash table. still, i don't expect that these calls constitute a significant portion of the overall load your script places on the scripting engine, so just use whatever you feel is most suitable

As I mentioned in the other post, the concatenation to keep things identifiable adds a smidgen of processing time.

Just to stay on that topic.. I did cut down processing time a little bit. Odd things that I didn't know before, such as...

Code:
; slow
while (something) {
  var %x = thing
}

; fast
var %x
while (something) %x = thing

Who knew curly braces and 'var' would slow things down so much*? Well, some people knew, as I found those tips in a scripting notes write-up.. it prompted me to check out some other things** and try to speed-test all the things.. and then use only the fastest variants.

* by 'much' I mean a second over a loop of, say, 50000 counts. Yeah, almost not worth the bother.
** including regular expressions. Specifically, regular expressions to get rid of non-space whitespace, and leading, double, and trailing spaces. Turns out a $replace() followed by a 'tokenize 32 ... | %var = $1-' was significantly faster. Still, I'll have to try a fuller regular expression route (such as you wrote up), because shaving away milliseconds resulted in only a very marginal speedup.

In fact, it seems that by far the most time is spent in $bfind(). I use $bfind() because I can /bread a whole file into a binary variable and then use $bfind() to find a string (or character array) starting from a given position (i.e. the last position I dealt with).

But in a simple test on a 70k file, finding a unique 'something' in the middle, starting at position 0 each time (the reason for this will become clear shortly), 100 $bfind() calls took a whopping 12.117s.

Now here's the thing that bugs me about that... If I instead just fopen the file and use fseek with a wildcard search to find that same unique something in the middle, 100 runs took a blazingly fast 0.11s!

It doesn't bug me because it's so much faster - I'd switch over immediately! It bugs me because of this... You can't make fseek start seeking from a specific position - and subsequent fseeks also don't start from the position the last one left off at. [incorrect, see follow-up post]

This is a dealbreaker because obviously I'm not looking for unique somethings, I'm looking for generic somethings, and with this I would only ever find the first.

I'm tempted to see if I could fseek, then grab from that position onward into a binvar, write that out to a new file, and then fseek in that file again. Curious whether the file system I/O would kill the performance there.

Sure makes me wish fseek with -w (and -r) would allow a starting position. Or, alternatively, that $bfind() would be as fast as fseek -w.

Originally Posted By: jaytea
i'll say!
are you sure the code being executed is an unavoidable side effect of calling the DLL? i can't see how that could be at all desirable

I didn't spot anything in the DLL's documentation about evaluation/execution of the read data and/or how to disable that (unlike e.g. $read's -n switch). I also couldn't spot any particular reason why it would evaluate/execute code at that point. I just know that when I replaced the javascript 'else' with 'echo -s holy crap', my status window was happily written to with 'holy crap'. Not good.

Originally Posted By: jaytea
your subsequent $bfind()s would presumably begin at later positions rather than searching from the start again (N = 1). keeping track of these advancing positions and mIRC seeking to them on the next call to $bfind() will be substantially faster than chopping up your binvar, copying etc.

Yes. Basically I keep track of a position in the binvar, %pos, which itself is set mostly from $bfind(): I find the < character (via byte value) or a string (as a string), then find the > character (unless the element is a comment, cdata, script or style element, in which case I look for the matching closing tag, as everything in between should be treated as character data), grab everything in between, etc. etc.

Originally Posted By: jaytea
$bfind() is as fast as it takes to find the data from the position you specify, easy peasy ;P a little advice though: search using byte values rather than text. mIRC does not handle text searches very well - they can be up to ~10 times slower.

yeah, unfortunately not much choice in some cases. E.g. if I encounter a <script* element opening tag, I should look for its </script closing tag. I could look for </ instead, but then I'd still have to grab the next bunch of characters and compare against that - looking for </script directly is faster. The reason I don't use byte values is that I can't be sure of the capitalization used.
script, SCRIPT, Script, ScRIpT are all interpreted as valid by browsers, and thus authors ran with that. (If XHTML had been made mandatory, forcing authors to write code properly, my life and that of browser makers would have been so much easier. Instead, HTML5 does away with a lot of the strictness. Boo.)

Originally Posted By: jaytea
sure, you can use $regex() to operate on the page in 4kb (4,150 characters, more precisely) chunks. you can even manipulate the expression in such a way that $regml() returns results in a conveniently predictable way. here's a simple example:

Yeah, I'm familiar with a few regexes to grab HTML bits and pieces. I'll have to give that a try, still. I didn't know 4k (I'll stick to 4,096 char chunks, I think) was the limit.. I thought it would be much, much smaller. Regex is still pretty slow compared to direct methods, but in this case it could possibly replace whole chunks of code. I'm sure I'll still run into some issues, but it'll be worth a shot.

Originally Posted By: jaytea
this could certainly be faster for you than fiddling with $bfind(). you could take the partial tag at the end and slap it on to the start of the next chunk to simplify things.

Yep, that's what I'm hoping for. Well, short of something that makes $bfind() go at the speed of fseek / adds a start position to fseek.

looks at Khaled nicely

Last edited by Steeeve; 20/02/12 10:52 PM.
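
The carry-over idea from the last quote (take the partial tag at the end of one chunk and prepend it to the next) can be sketched roughly as below. This is Python rather than mIRC script, purely as a language-neutral illustration of the technique, not code from the thread. The names `iter_tags` and `TAG` are hypothetical, and the tag pattern is deliberately naive: it ignores comments, CDATA, script/style content, and any `>` inside attribute values.

```python
import re

# Naive tag pattern (an assumption for illustration): '<', anything that is
# not '<' or '>', then '>'. Real HTML needs far more care than this.
TAG = re.compile(r"<[^<>]*>")

def iter_tags(text, chunk_size=4096):
    """Yield each <...> tag in text, scanning one fixed-size chunk at a time."""
    carry = ""  # unfinished tag left over from the previous chunk
    for start in range(0, len(text), chunk_size):
        # Prepend the leftover partial tag to the new chunk.
        buf = carry + text[start:start + chunk_size]
        last_end = 0
        for match in TAG.finditer(buf):
            yield match.group(0)
            last_end = match.end()
        # Anything from the last unmatched '<' onward may be a tag that is
        # cut off by the chunk boundary, so keep it for the next round.
        cut = buf.rfind("<", last_end)
        carry = buf[cut:] if cut != -1 else ""

if __name__ == "__main__":
    page = "<p>hi</p><a href='x'>link</a>"
    # A tiny chunk size forces tags to straddle chunk boundaries.
    for tag in iter_tags(page, chunk_size=5):
        print(tag)
```

With this approach each chunk is matched only once, and only the trailing fragment is rescanned. One caveat of the sketch: a stray `<` that never gets a closing `>` keeps being carried forward, so a real script would want to cap the carry length.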

Entire Thread
Subject	Posted By	Posted
Parsing HTML the proper way - existing solutions?	Steeeve	16/02/12 11:40 PM
Re: Parsing HTML the proper way - existing solutions?	pball	17/02/12 01:34 AM
Re: Parsing HTML the proper way - existing solutions?	Steeeve	17/02/12 04:24 AM
Re: Parsing HTML the proper way - existing solutions?	argv0	17/02/12 03:27 AM
Re: Parsing HTML the proper way - existing solutions?	Steeeve	17/02/12 04:43 AM
Re: Parsing HTML the proper way - existing solutions?	argv0	17/02/12 05:20 AM
Re: Parsing HTML the proper way - existing solutions?	jaytea	17/02/12 12:03 PM
Re: Parsing HTML the proper way - existing solutions?	Steeeve	18/02/12 09:40 PM
Re: Parsing HTML the proper way - existing solutions?	Steeeve	20/02/12 10:35 PM
Re: Parsing HTML the proper way - existing solutions?	Steeeve	20/02/12 10:51 PM
Re: Parsing HTML the proper way - existing solutions?	jaytea	22/02/12 09:16 AM
Re: Parsing HTML the proper way - existing solutions?	Steeeve	23/02/12 01:40 PM
