Steeeve (OP)
Pikka bird
Joined: Feb 2012
Posts: 18
Hi all,

tl;dr ? Scroll to bottom of post.

Those who clicked onward to this post will probably be familiar with how most HTML-scrapers (google searches, imdb lookups, etc.) seem to work.. download the file, fseek or $bfind to the point(s) of interest, read a bit, repeat until done.

I was hoping to get something a bit more 'proper' so that much of the seeking around is dropped.

I have written a small HTML parser that pretty much follows the rules laid out for HTML5 in terms of void elements, elements that don't need a closing tag - all the usual suspects you might find in the wild - and distills the output to an INI file so that the results can easily be referenced. It's not DOM-compliant, but it does what I need it to do.

However, despite attempts to make it go faster (a few milliseconds here, a few milliseconds there), a 70k page still takes about 10 seconds to fully parse. ( For the curious - no, a hash table isn't any faster than writeini, as long as I don't forget to flushini when done, and it saves instructions for concatenating things for hash table storage. )

So I tried to look for some alternatives, but Google is throwing me a bunch of almost completely useless results.

The one result that was very useful was domxml.dll, written several years back. It works, but unfortunately it expects well-formed XML, which HTML tends not to be - which means the dom, tree and XPath methods are useless, while the SAX method falls apart on some style directives / script code. This is something I can mostly work around (inserting <![CDATA[ where appropriate in a fast pre-parser). However, when I ran it against a file from 'in the wild', the server I was connected to told me that 'ELSE' was not a valid command. There was no 'else' in my code, but there was in the file being parsed (part of a piece of javascript) and a quick test showed that the DLL was actually ending up evaluating that line. Ouch - major security issue there.

Thus I'm back where I left off, which is either doing the seek-a-thon (fast, but annoying and inflexible to write) or going with mIRC being unresponsive for several seconds while a file is parsed.

One thing I haven't tried is whether $bfind is faster on smaller variables. E.g. if I chop off the leading part (which I have already parsed), would $bfind go faster.. and if so, fast enough to offset now having to bcopy.. but given that bfind is pretty fast to begin with, I doubt that's where I'm going to find major speed increases smile
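For reference, the seek-and-read approach I mean boils down to something like this - just a rough sketch that pulls a page title out of a downloaded file, with the filename and tag purely for illustration:

Code:
alias htmltitle {
  ; read the whole downloaded page into a binary variable
  bread $qt($1) 0 $file($1).size &page
  ; find the opening tag, then the closing tag, each search starting
  ; where the previous one left off
  var %a = $bfind(&page, 1, <title>)
  if (!%a) return
  var %b = $bfind(&page, $calc(%a + 7), </title>)
  if (!%b) return
  ; return the text between the two positions
  return $bvar(&page, $calc(%a + 7), $calc(%b - %a - 7)).text
}

Multiply that little dance by every element you care about and you can see why it gets tedious.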

tl;dr version:
1. Are there any existing, fast, full HTML parsers?
2. If not, is there in fact a faster method than incrementally seeking/$bfind'ing an element's opening tag head, tag tail, closing tag head, tag tail, and so forth and so on? I was thinking a regular expression may be faster but those only take regular strings.. and an element may very well span more than the variable size limit (e.g. a 'content' DIV which wraps user comments).

Thanks for any hints!

pball
Fjord artisan
Joined: Nov 2009
Posts: 295
I've made my fair share of website parsing scripts so I'll throw down some tips and tricks I use. Though this might not be exactly what you're looking for.

The main way I parse html is using regex. This method, and how mirc handles getting the website's code, has its limitations though. It works great for websites that have "clean" code where the beginning and end tags are on the same line as the info you want. (Since mirc gets the code line by line)

EX:
Code:
if ($regex(%fml.result,/<p><a href=".*?" class="fmllink">(.*?)<\/a><\/p>/)) {
  ; just strip the html codes from $regml(1) and voila, you have a funny FML story
}


Second method I've used (though I really dislike it) is to find a tag using an isin check. This method is for sites that are "messy" and have the beginning and end tags on separate lines. After the tag is found, a new line is read and a loop is started that stores the info into a variable until the end tag is found. Using a loop can be dangerous and has screwed up for me many times.

EX:
Code:
    if (<tr><td>Temperature</td> isin %w.read) {
      set %w.temp $null
      sockread %w.read
      ; keep appending lines until the closing tag shows up
      ; (the $sockbr check guards against looping when no data is left to read)
      while (</td> !isin %w.read) && ($sockbr) { set %w.temp %w.temp %w.read | sockread %w.read }
    }


And my favorite way to parse html inside mirc is to not use mirc to parse the html. I've been making C# programs that you just /run google.exe search%20terms%20here and mirc leaves open a local socket which the C# program sends the already parsed info back to. This is great for many reasons: C# is faster than mirc, mirc isn't frozen while it waits for the website/C# to respond, and in C# you can get the whole page in a single var and use more complex regex. I use this method for anything that doesn't have simple clean html code, like google or wunderground.
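The mIRC half of that setup is tiny, roughly like this (socket names, port and exe name are just examples, and url-encoding the search terms is left out):

Code:
alias gsearch {
  ; listen on a local port, then launch the helper which connects back to it
  socklisten parser.listen 5500
  run google.exe $1-
}
on *:socklisten:parser.listen: sockaccept parser.in
on *:sockread:parser.in: {
  var %line
  sockread %line
  while ($sockbr) {
    echo -a Parsed: %line
    sockread %line
  }
}
on *:sockclose:parser.in: sockclose parser.listen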

Not really sure if this is the kinda info you're wanting to discuss but it's what I know. I also want to say I've never used or come across a script that writes to a file and searches it.


http://scripting.pball.win
My personal site with some scripts I've released.
argv0
Hoopy frood
Joined: Oct 2003
Posts: 3,918
If you're okay with escaping from the clutches of mIRC script, there are also dlls like ruby4mirc, tcl4mirc, python4mirc, perl4mirc and more. All of these language bindings have XML libs associated with them which you can use instead of mIRC. I would go this route if you really have complex X(HT)ML to parse. Using an external program like a C# exe is okay, but the integration is lacking-- the above dlls can all be embedded *within* your mIRC script, so it's much easier to use and modify. Also, all of the above are dynamically typed languages, and, typically, dynamically typed languages are easier to write quick prototypes with, so you'd probably get off the ground a little faster than with C#/C++/Java (unless of course you're an expert in one of those languages).


- argv[0] on EFnet #mIRC
- "Life is a pointer to an integer without a cast"
Steeeve (OP)
Pikka bird
Joined: Feb 2012
Posts: 18
Hi pball,

Yeah, the methods you mention are the methods I've been using - just look for the bit I'm interested in, write code specifically for that bit.

More specifically:
open file
seek to something interesting
handle that
seek to something interesting that may be across multiple lines
- go here
read line
deal with it
if this isn't the line where I stop caring, go to '- go here' (either an actual go to, or a 'while' loop)
etc. etc.
close file

The up side to that is that it's very fast.
The down side is that you do have to put in bits of code for every specific little thing, dealing with odd scenarios and practically needing debug output to see what the code is doing up to that point.
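Spelled out with a made-up example (borrowing pball's weather tags as the stand-in), that flow looks something like:

Code:
fopen page page.html
; seek to something interesting and handle it
fseek -w page *<td>Temperature</td>*
var %line = $fread(page)
; ... deal with that line ...
; seek to something that spans multiple lines and loop until the stop line
fseek -w page *class="forecast"*
while (!$feof) {
  var %line = $fread(page)
  if (</table> isin %line) break
  ; ... deal with it ...
}
fclose page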

Yeah, I'm trying to keep external dependencies down to a minimum. In another language I have access to .NET, and the DOM for an HTML document is presented to me as an object to query and manipulate at will. Unfortunately I can't interface with that smile
As it is, I would have preferred to stay away from domxml.dll, but it's just one file which wasn't too bad - except for the whole XML vs not-quite-XML thing.. not to mention the security issue.. yikes.

Steeeve (OP)
Pikka bird
Joined: Feb 2012
Posts: 18
argv0, I did see the site for js4mirc that I thought might have been interesting, but the documentation quickly shattered my hopes (plus it seems to only take strings which means I'd run into aforementioned issues again).

I'd prefer to stay within the mIRC language, but I might have to check out python4mirc at least.

I'll try shaving away some more milliseconds in the mean time as well.. and try it on a more modern machine - I might be fussing over a moot issue smile

If I get it to a reasonably releasable state, I'll be sure to post it here - there's bound to be parts that can be sped up in ways I'm not thinking of.

argv0
Hoopy frood
Joined: Oct 2003
Posts: 3,918
It wouldn't be very hard to update js4mirc to use the v8 engine rather than spidermonkey (very old), which would allow you to make use of commonJS implementations and libraries, including jsdom, which would give you DOM-like access to an XML document. There are also http/net libraries to load that HTML into such a document. The embedding code is actually really simple: http://bravenewmethod.wordpress.com/2011/03/30/embedding-v8-javascript-engine-and-go/


- argv[0] on EFnet #mIRC
- "Life is a pointer to an integer without a cast"
jaytea
Fjord artisan
Joined: Feb 2006
Posts: 546
hi there, nice post!

Originally Posted By: Steeeve
Those who clicked onward to this post will probably be familiar with how most HTML-scrapers (google searches, imdb lookups, etc.) seem to work.. download the file, fseek or $bfind to the point(s) of interest, read a bit, repeat until done.

I was hoping to get something a bit more 'proper' so that much of the seeking around is dropped.


the problem, as i'm sure you've realized, is that if you elect to use mIRC's sockets to fetch specific data from a web server, 99% of the time it will be most efficient to handle that data as it comes in (line-wise, or byte-wise) instead of collecting it and parsing it formally.

these solutions tend to be generated on a case-by-case basis and are horribly inflexible, as you noted. we resort to these basic methods because, to a layman, they most easily express and identify the portion of the webpage that we want to fetch.

a more formal HTML parser that could be used in our scripts would be great if more of us knew how to use them! for those that don't, i suspect one used alongside a feature akin to Chrome's "Inspect Element" (scripter views the webpage inside a custom @window and clicks / highlights the areas they want to fetch) would revolutionize the way scripters create these socket scripts. but that's something else altogether ;P

Originally Posted By: Steeeve
( For the curious - no, a hash table isn't any faster than writeini, as long as I don't forget to flushini when done, and it saves instructions for concatenating things for hash table storage. )


indeed, they are on the same order of magnitude with respect to execution time. but it's worth noting that calls to /hadd & $hget() are actually about twice as quick as similar calls to /writeini & $readini(), owing to the extra argument and possibly extra work involved in referencing the internal hash table associated with the INI section vs. the scripter's hash table. still, i don't expect that these calls constitute a significant portion of the overall load your script places on the scripting engine, so just use whatever you feel is most suitable
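for instance, these two calls store (and read back) the same piece of data - the hash table form just skips the INI file and section bookkeeping:

Code:
writeini html.ini tr id something
hadd -m html tr.id something

echo -a $readini(html.ini, tr, id) vs. $hget(html, tr.id)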

Originally Posted By: Steeeve

However, when I ran it against a file from 'in the wild', the server I was connected to told me that 'ELSE' was not a valid command. There was no 'else' in my code, but there was in the file being parsed (part of a piece of javascript) and a quick test showed that the DLL was actually ending up evaluating that line. Ouch - major security issue there.


i'll say! are you sure the code being executed is an unavoidable side effect of calling the DLL? i can't see how that could be at all desirable

Originally Posted By: Steeeve

One thing I haven't tried is whether $bfind is faster on smaller variables. E.g. if I chop off the leading part (which I have already parsed), would $bfind go faster.. and if so, fast enough to offset now having to bcopy.. but given that bfind is pretty fast to begin with, I doubt that's where I'm going to find major speed increases smile


your subsequent $bfind()s would presumably begin at later positions rather than searching from the start again (N = 1). keeping track of these advancing positions and mIRC seeking to them on the next call to $bfind() will be substantially faster than chopping up your binvar, copying etc.

$bfind() is as fast as it takes to find the data from the position you specify, easy peasy ;P a little advice though: search using byte values rather than text. mIRC does not handle text searches very well - they can be up to ~10 times slower.
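to illustrate (just a sketch), this walks every tag in a binvar by byte value - 60 and 62 being < and > - without ever searching from position 1 twice:

Code:
var %pos = 1
while ($bfind(&page, %pos, 60) > 0) {
  var %open = $v1
  var %close = $bfind(&page, %open, 62)
  if (!%close) break
  echo -a tag: $bvar(&page, %open, $calc(%close - %open + 1)).text
  var %pos = $calc(%close + 1)
}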

Originally Posted By: Steeeve

2. If not, is there in fact a faster method than incrementally seeking/$bfind'ing an element's opening tag head, tag tail, closing tag head, tag tail, and so forth and so on? I was thinking a regular expression may be faster but those only take regular strings.. and an element may very well span more than the variable size limit (e.g. a 'content' DIV which wraps user comments).


sure, you can use $regex() to operate on the page in 4kb (4,150 characters, more precisely) chunks. you can even manipulate the expression in such a way that $regml() returns results in a conveniently predictable way. here's a simple example:

Code:
alias htmltags {
  noop $regex($1-, /^([^<]*>|)([^<>]*)|<(/?[a-z]+) ?([^>]*)>([^<>]*)|(<[^>]*|)$\K/g)

  if ($regml(1)) echo -a End of tag from previous line: $v1
  if ($regml(2) != $null) echo -a Data outside: $v1

  var %i = 3, %n = $regml(0)
  while (%i < %n) {
    echo -a Tag $calc(%i / 3) : $regml(%i)
    inc %i
    echo -a Data inside: $regml(%i)
    inc %i
    echo -a Data outside: $regml(%i)
    inc %i
  }

  if ($regml(%n)) echo -a Unclosed tag at end: $v1
}


/htmltags a>b<c d="e">f</g>h<i

Quote:

End of tag from previous line: a>
Data outside: b
Tag 1 : c
Data inside: d="e"
Data outside: f
Tag 2 : /g
Data inside:
Data outside: h
Unclosed tag at end: <i


so each full tag always occupies 3 values of $regml(), partial tags occupy the first and last matches, and $regml(2) is the first substring outside of any tag. this could certainly be faster for you than fiddling with $bfind(). you could take the partial tag at the end and slap it on to the start of the next chunk to simplify things.
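feeding the page to that alias chunk by chunk would then look roughly like this (a sketch - it relies on the partial tag being the last $regml() value, uses 3500 byte chunks to leave headroom for the carry within the line length limit, and ignores a UTF-8 character being split across a boundary):

Code:
alias htmlchunks {
  var %pos = 1, %size = $bvar($1, 0), %carry
  while (%pos <= %size) {
    ; prepend whatever was left unclosed at the end of the previous chunk
    htmltags %carry $+ $bvar($1, %pos, 3500).text
    ; the last capture is the partial tag (or empty if the chunk ended cleanly)
    var %carry = $regml($regml(0))
    var %pos = $calc(%pos + 3500)
  }
}

called as, say, /htmlchunks &page after /bread'ing the file into &page.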

good luck and keep us posted!


"The only excuse for making a useless script is that one admires it intensely" - Oscar Wilde
Steeeve (OP)
Pikka bird
Joined: Feb 2012
Posts: 18
Hi jaytea,

I had been looking at some posts about regular expressions that you had made, and I'll definitely have to read your reply again when I'm more awake and comment on it.

( Apparently the 'watched topics' isn't actually e-mailing me - I'll have to check my settings. )

In the mean time, I did identify that I had some misconceptions about scripting and execution time for some things that seem ingrained in scripting as being 'the way to do it' (but not focused on speed, clearly). So I'm shaving away at what I've got and doing speed tests - I wrote a little alias that runs a few tests a few thousand times, then scales up the number of loops to give me a more statistically significant number of tests (~10s per test)... fun things to find out that way.
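( The harness itself is nothing special - roughly this, with the loop count scaled up until a run takes ~10s: )

Code:
alias speedtest {
  var %loops = $iif($1, $1, 50000), %i = 0, %start = $ticks
  while (%i < %loops) {
    ; ... the snippet being timed goes here ...
    inc %i
  }
  echo -a %loops loops took $calc($ticks - %start) ms
}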

Just to note specifically the hash table vs ini thing.. the part where a hash table seems to fall apart is that its storage is 3 levels deep without trickery, while inis are 4 levels.

Which means that if I have e.g. <tr id='#something'><td>text</td></tr>
If I want to store things in an ini, I would get:
(somefile.ini)
[tr]
id=something
[tr][td]
&content=text

While if I had a hash table, I would have to use trickery...
htmlhashtable - tr.id - something
htmlhashtable - tr-td.&content - text

That trickery of concatenation $+(...) is what ends up affecting performance quite a bit.

Alternatively, I'd be creating one hash table per element. I haven't tested that in terms of performance, but surely that's not the way to go smile

But maybe there's some other trick there. For now I'd like to finish the shaving, and going over your post's notes re: use of regexes. The ini vs hash thing is a lot easier to swap out than the other things smile

Steeeve (OP)
Pikka bird
Joined: Feb 2012
Posts: 18
Originally Posted By: jaytea
a more formal HTML parser that could be used in our scripts would be great if more of us knew how to use them! for those that don't, i suspect used alongside a feature akin to Chrome's "Inspect Element" (scripter views the webpage inside a custom @window and clicks / highlights the areas they want to fetch) would revolutionize the way scripters create these socket scripts. but that's something else altogether ;P

haha - yeah, way outside the scope of what I'm going for. the XPath stuff in domxml.dll was interesting, but it breaks all over the place (XML vs HTML aside, it threw an out of memory error - oops).

Originally Posted By: jaytea
indeed, they are on the same order of magnitude with respect to execution time. but it's worth noting that calls to /hadd & $hget() are actually about twice as quick as similar calls to /writeini & $readini(), owing to the extra argument and possibly extra work involved in referencing the internal hash table associated with the INI section vs. the scripter's hash table. still, i don't expect that these calls constitute a significant portion of the overall load your script places on the scripting engine, so just use whatever you feel is most suitable

As I mentioned in the other post, the concatenation to keep things identifiable adds a smidgen of processing time.

Just to stay on that topic.. I did cut down processing time a little bit. Odd things that I didn't know before, such as...
Code:
; slow
while (something) { var %x = thing }
; fast
var %x
while (something) %x = thing

Who knew curly braces and 'var' would slow things down so much*? Well, some people knew, as I found those tips in a scripting notes write-up.. it prompted me to check out some other things** and try to speed-test all the things.. and then use only the fastest variants.

* by 'much' I mean a second over a loop of, say, 50000 counts. Yeah, almost not worth the bother.
** including regular expressions. Specifically, regular expressions to get rid of non-space whitespaces, and leading, double, and trailing spaces. Turns out a $replace() followed by a 'tokenize 32 ... | %var = $1-' was significantly faster.
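( For the record, the faster variant boils down to something like this: )

Code:
; turn tabs/CR/LF into spaces, then let tokenize collapse and trim the rest
var %text = $replace(%text, $chr(9), $chr(32), $chr(13), $chr(32), $chr(10), $chr(32))
tokenize 32 %text
var %text = $1-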

Still, I'll have to try a more full regular expression route (such as you wrote up), because shaving away milliseconds resulted in only a very marginal speedup.

In fact, it seems that by far the most time is spent in $bfind().
The reason I use $bfind is because I can /bread a whole file into a binary variable, and then use $bfind() to find a string (or character array) starting from a given position (i.e. the last position I dealt with).
But in a simple test off of a 70k file, finding a unique 'something' in the middle starting at position 0 each time (the reason for this will become clear shortly), 100 $bfind()'s took a whopping 12.117s.

Now here's the thing that bugs me about that...
If I instead just fopen the file and use fseek with a wildcard search to find that same unique something in the middle, 100 runs took a blazingly fast 0.11s!
The reason it bugs me is not because it's so much faster - I'd switch over immediately! The reason is this... You can't make fseek start seeking from a specific position - and subsequent fseeks also don't start from the position it last left off. [incorrect, see follow-up post]
This is a dealbreaker because obviously I'm not looking for unique somethings, I'm looking for generic somethings, and with this I would only ever find the first.

I'm tempted to see if I could fseek, then grab from that position onward into a binvar, write that out to a new file, and then use that to fseek in again. Curious if the file system i/o would kill the performance there.

Sure makes me wish fseek with -w (and -r) would allow a starting position.
Or, alternatively, that $bfind would be as fast as fseek -w .
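For anyone wanting to reproduce it, the comparison I ran was roughly this shape (filename and search string are placeholders):

Code:
; load the page into a binvar and time 100 text searches from position 1
bread page.html 0 $file(page.html).size &page
var %i = 0, %t = $ticks
while (%i < 100) { noop $bfind(&page, 1, some unique string) | inc %i }
echo -a bfind took $calc($ticks - %t) ms

; the same search, 100 times, via a wildcard fseek that restarts at position 0
fopen page page.html
var %i = 0, %t = $ticks
while (%i < 100) { fseek page 0 | fseek -w page *some unique string* | inc %i }
echo -a fseek -w took $calc($ticks - %t) ms
fclose page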

Originally Posted By: jaytea
i'll say! are you sure the code being executed is an unavoidable side effect of calling the DLL? i can't see how that could be at all desirable

I didn't spot anything in the DLL's documentation about evaluation/execution of the read data and/or how to disable that (unlike e.g. $read's -n switch)
I also couldn't spot any particular reason why it would evaluate/execute code at that point. I just know that when I replaced the javascript 'else' with 'echo -s holy crap', my status window was happily written to with 'holy crap'. Not good.

Originally Posted By: jaytea
your subsequent $bfind()s would presumably begin at later positions rather than searching from the start again (N = 1). keeping track of these advancing positions and mIRC seeking to them on the next call to $bfind() will be substantially faster than chopping up your binvar, copying etc.

Yes. Basically I keep track of a position in the binvar, %pos, which itself is set mostly from $bfind(), finding the < character (via byte value) or a string (as a string), then finding the > character (unless the element is a comment, cdata, script or style element, in which case I look for the matching closing tag, as everything in between should be treated as character data), grabbing everything in between, etc. etc.

Originally Posted By: jaytea
$bfind() is as fast as it takes to find the data from the position you specify, easy peasy ;P a little advice though: search using byte values rather than text. mIRC does not handle text searches very well - they can be up to ~10 times slower.

yeah, unfortunately not much choice in some cases. E.g. if I encounter a <script* element opening tag, I should look for its </script closing tag. I could look for </ instead, but then I'll still have to grab the next bunch of characters and compare against that - looking for </script directly is faster. The reason I don't use byte values is because I can't be sure of the capitalization used. script, SCRIPT, Script, ScRIpT are all interpreted validly by browsers, and thus authors ran with that.
( If XHTML had been made mandatory, forcing authors to write code properly, my life and that of browser makers would have been so much easier. Instead, HTML5 does away with a lot of the strictness. Boo.)

Originally Posted By: jaytea
sure, you can use $regex() to operate on the page in 4kb (4,150 characters, more precisely) chunks. you can even manipulate the expression in such a way that $regml() returns results in a conveniently predictable way. here's a simple example:

Yeah, I'm familiar with a few regexes to grab HTML bits and pieces. I'll have to give that a try, still. I didn't know 4k (I'll stick to 4,096 char chunks, I think) was the limit.. I thought it would be much, much smaller. Regex is still pretty slow compared to direct methods, but in this case it could possibly replace whole chunks of code.

I'm sure I'll still run into some issues, but it'll be worth a shot.


Originally Posted By: jaytea
this could certainly be faster for you than fiddling with $bfind(). you could take the partial tag at the end and slap it on to the start of the next chunk to simplify things.

Yep, that's what I'm hoping for.

Well, short of something that makes bfind go at the speed of fseek / adds a start position to fseek.
*looks at Khaled nicely*

Last edited by Steeeve; 20/02/12 10:52 PM.
Steeeve (OP)
Pikka bird
Joined: Feb 2012
Posts: 18
oh, wait a minute.. the reason subsequent fseeks aren't going anywhere is because it fseeks from the position of the pointer, inclusive. If I fseek 1 char ahead (actually, I suppose I could just read a char, which saves having to inc %pos by 1), then it does find subsequent results. Might be worth a shot replacing my bfinds with fseeks, then.

Will adjust previous post.

jaytea
Fjord artisan
Joined: Feb 2006
Posts: 546
Originally Posted By: Steeeve
Code:
; slow
while (something) { var %x = thing }
; fast
var %x
while (something) %x = thing

Who knew curly braces and 'var' would slow things down so much*?


curly braces are a tad slower because mIRC has to locate the closing '}' ;P

/var is a bit slower since it's converted into one or more /set commands at a lower level. '%var =' is mapped (onto 'set %var') but this occurs a bit further down the line and the process isn't as involved. the fastest equivalent method of looping is therefore:

Code:
var %x
while (something) !set %x thing


where /!set of course tells mIRC to bypass custom aliases and immediately refer to the inbuilt /set command.

Originally Posted By: Steeeve
Specifically, regular expressions to get rid of non-space whitespaces, and leading, double, and trailing spaces. Turns out a $replace() followed by a 'tokenize 32 ... | %var = $1-' was significantly faster.


use "$gettok($replace( ... ), 1-, 32)" to trim those spaces wink

Originally Posted By: Steeeve
.. it prompted me to check out some other things** and try to speed-test all the things.. and then use only the fastest variants.

* by 'much' I mean a second over a loop of, say, 50000 counts. Yeah, almost not worth the botehr.


exactly. it's interesting to consider micro-optimizations since they usually reveal certain pieces of information about the language, but from a practical standpoint it's rather a waste of time.

Originally Posted By: Steeeve
100 $bfind()'s took a whopping 12.117s.

Now here's the thing that bugs me about that...
If I instead just fopen the file and use fseek with a wildcard search to find that same unique something in the middle, 100 runs took a blazingly fast 0.11s!


i assume you realized that those subsequent /fseeks were starting from the beginning of the line matching your previous /fseek -w call ;P /fseek -w does indeed seem to be faster than $bfind().text, but only by about 3-4 times. still significant, and a rather puzzling result given you would expect the latter to involve less work internally.

Originally Posted By: Steeeve
oh, wait a minute.. the reason subsequent fseeks aren't going anywhere is because it fseek from the position of the pointer inclusive. If I fseek 1 char (actually, i suppose I could read a char, saves having to inc %pos 1, then it does find subsequent results. Might be worth a shot replacing my bfinds with fseeks, then.


haha yes, but be careful: that pointer position, as mentioned up there, is at the beginning of the entire line. unless you're using a wildmatch such as "stuff*", you may still locate that very same line if you only advance the current position by one byte.

the correct way to handle "*stuff*" is to grab the entire line with $fread(), loop through all occurrences of 'stuff' (with $pos(), for example) since there could be multiple, then when you're done use /fseek -n to advance the pointer to the next line in preparation for your next /fseek -w.
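the middle part of that, in other words (handle name made up):

Code:
; after /fseek -w has landed on a matching line
var %line = $fread(html), %i = 1
while ($pos(%line, stuff, %i)) {
  echo -a occurrence %i starts at position $v1
  inc %i
}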


"The only excuse for making a useless script is that one admires it intensely" - Oscar Wilde
Steeeve (OP)
Pikka bird
Joined: Feb 2012
Posts: 18
Originally Posted By: jaytea

curly braces are a tad slower because mIRC has to locate the closing '}' ;P

/var is a bit slower since it's converted into one or more /set commands at a lower level. '%var =' is mapped (onto 'set %var') but this occurs a bit further down the line and the process isn't as involved. the fastest equivalent method of looping is therefore:

Code:
var %x
while (something) !set %x thing


where /!set of course tells mIRC to bypass custom aliases and immediately refer to the inbuilt /set command.

Yeah, I read the reasons in the advanced scripting document. I guess coming from a background in scripting languages that are at least pre-parsed once to deal with exactly these sorts of things, it surprised me that every single loop would have to re-do all that work.
Of course that also opens up interesting tricks like using !goto %variable... not sure yet if that's a good thing, but as long as %variable is one of a pre-determined set, I guess it can be faster.
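( By 'tricks' I mean things in this vein - purely a sketch: )

Code:
alias handle {
  ; jump straight to the handler for this element type
  ; (%e has to be one of the labels below)
  var %e = $1
  goto %e
  :title
  echo -a handling a title element
  goto done
  :script
  echo -a handling a script element
  :done
}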


Originally Posted By: jaytea

use "$gettok($replace( ... ), 1-, 32)" to trim those spaces wink

Nope - well, not always anyway. I'd say it definitely looks better.

On smaller runs (if you can call them small):
Code:
numtests: 59772.86 mult: 5.977286
1-> 4787 (gettok)
2-> 4516 (tokenize)
2-> 4517 (tokenize)
1-> 4867 (gettok)


On longer runs (definitely long):
Code:
numtests: 383043.925 mult: 15.321757
1-> 28621 (gettok)
2-> 30824 (tokenize)
2-> 30324 (tokenize)
1-> 28060 (gettok)

Numbers are in milliseconds. Ran both tests a couple of times, they were both consistent in this behavior. Weird, huh?
( I run test 1 and 2, then 2 and 1, in case either of the tests does something strange that affects the next test. )


Originally Posted By: jaytea
exactly. it's interesting to consider micro-optimizations since they usually reveal certain pieces of information about the language, but from a practical standpoint it's rather a waste of time.

Yeah. The thing is, you can't really tell whether they're micro-optimizations until you run them against an actual use case.
While the difference between 'var %var =' and '%var =' is almost negligible, if I had any reason to need to do them a few hundred thousand times, I'd happily take the savings.
In addition, once you've identified them, there's little reason not to make use of them (short of the ones with caveats).



Originally Posted By: jaytea

i assume you realized that those subsequent /fseeks were starting from the beginning of the line matching your previous /fseek -w call ;P

Yeah, I had actually added an fseek fname 0 to make absolutely sure it would search from the beginning, just as the $bfind did. It was still faster.

Originally Posted By: jaytea
/fseek -w does indeed seem to be faster than $bfind().text but only about 3-4 times so. still significant, and a rather puzzling result given you would expect the latter to involve less work internally.

Exactly. I'll take the 3-4, if I can work that into the script and not lose them again on a bunch of logic juggling between fseek and $fopen().pos to find/get the positions I want, and $bvar to actually get the information between those two positions.
3-4 might not seem like very much, but a delay of 12 seconds parsing a page versus a delay of 3-4 seconds.. world of difference in observed behavior. A webpage that loads in 3-4 seconds is acceptable; with one that takes 10+, people will quickly begin to wonder if something's wrong smile

Definitely curious as to why $bfind would be relatively 'slow'. Then again:

Originally Posted By: jaytea
haha yes, but be careful: that pointer position, as mentioned up there, is at the beginning of the entire line. unless you're using a wildmatch such as "stuff*", you may still locate that very same line if you only advance the current position by one byte.

Oi vey - pointer at the beginning of the line... that's right. eek

Originally Posted By: jaytea
the correct way to handle "*stuff*" is to grab the entire line with $fread(), loop through all occurrences of 'stuff' (with $pos(), for example) since there could be multiple, then when you're done use /fseek -n to advance the pointer to the next line in preparation for your next /fseek -w.

More things to try, then - although if I'm doing a per-line thing anyway, the first thing I would try is the regex route.

