REGEX help - highlight parser, \b and more - 13/09/03 04:41 PM
Warning: confusing, disorganized text below.
I'm trying to write a highlight script for mIRC using regular expressions. It should look for certain strings in a given text, match exactly what I want it to, and cope with punctuation and the like (e.g. find "cold." and "colddd!" but not "coldplay", "cold war" etc). It's not supposed to use token identifiers, since I'm not looking for separated strings. It should work like "if (a isin a a)" rather than "if ($istok(a a,a,32))", for many reasons.
My old highlight script doesn't do all of this. It does some of it, like being aware of similar strings ("coldplay", "cold war" etc). It uses 2 variables. It checks one ***, then it loops through the tokenized values of the other one, looking for every item in the given text. *** Correction: it doesn't loop through the first var. Instead it replaces all the separator characters in it with commas, checks for commas that are already there etc., then dumps the items as $remove() parameters.
The first variable contains the strings that I want to remove from the text before anything else; the second one contains the strings that I want to find. Example:
%highlight.remove coldplaycold playcoldwarcold warcoldfusionstone coldlistening to cold
%highlight.find coldcooldcoooldcolldcollld
Actually it doesn't look for "cold"-related strings only. It looks for a lot of strings, like names, even parts of names, and any other assorted stuff I want highlighted. My variables are 87489374 times bigger than these, filled with hundreds of possibilities, all of them going through a loop that runs every time anyone says something.
Well, it's too slow (benchmark-wise) and IMO it's a nightmare to edit those vars. My IRC experience isn't really affected by the speed loss (either I can't notice much of it or I'm used to it), but I suspect the script is eating more resources than it should: it's the only heavily used script here, and mIRC gets slower and slower as time passes. If I deactivate the script, resources stay fine for much longer. I could be wrong about this, but anyway.
Plus, this method cries out for regular expressions. It's too limited: I hate having to add one more item to the loop just because of a single char (space, hyphen, underscore, accented letter etc) instead of something smarter, like a char telling the script that any punctuation character can go here, or that a char can't be repeated, or can't be followed by "hi" etc. If I did that without the regex functions, the script would get much slower.
But I barely know how to use regular expressions efficiently, so I'm having a lot of problems.
One of them is how to make the expression as easy to customize as the old variables. I'd like some ideas about this. I tried to cover "cold" and "andré" (or "andre"), along with the exceptions to be removed ("coldplay", "andrew" etc), and the expression turned into a monster much uglier than the variables, and it still has some flaws due to my limited knowledge.
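One possible way to keep things editable is to leave the two word lists as plain lists and generate the expressions from them. The following is only a sketch, written in Python rather than mIRC script so that it can be run and checked; the names `build_patterns` and `is_highlight` are hypothetical, and it assumes that "repetitions" means any letter of a word may repeat and that exceptions are simply stripped from the text first, as the old script does.

```python
import re

def build_patterns(find_words, exceptions):
    """Turn two plain word lists into (remove_re, find_re). Both lists must be non-empty."""
    # Exceptions are matched literally and stripped before searching,
    # mirroring the $remove() step of the old script.
    remove_re = re.compile("|".join(re.escape(e) for e in exceptions), re.IGNORECASE)
    # Let every letter repeat: "cold" becomes c+o+l+d+, covering "coold", "colddd".
    stretched = ["".join(re.escape(ch) + "+" for ch in word) for word in find_words]
    # Lookarounds instead of \b: the hit must not touch another word character.
    find_re = re.compile(r"(?<!\w)(?:" + "|".join(stretched) + r")(?!\w)", re.IGNORECASE)
    return remove_re, find_re

def is_highlight(text, remove_re, find_re):
    """True if the text still contains a wanted word once exceptions are stripped."""
    return find_re.search(remove_re.sub(" ", text)) is not None

# Hypothetical lists, taken from the examples in this post.
remove_re, find_re = build_patterns(
    ["cold", "andre", "andré"],
    ["coldplay", "cold war", "stone cold", "andrew"],
)
```

With these lists, `is_highlight("colddd!", remove_re, find_re)` is true while "I love coldplay" and "the cold war era" are rejected. Adding a word or an exception means editing a list, not the expression itself.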
Another problem: trying to cover "andré", I hit a stupid difficulty. I started with something like "/\bandr[eéÉ]+\b/gi" to cover punctuation and repetitions, but then I realized that \b (the word boundary assertion) doesn't work with accented chars, since \b is a "position in the subject string where the current character and the previous character do not both match \w or \W" (www.pcre.org/man.txt):
alias blah { echo -ag $regex($1-,/\bandr[eéÉ]+\b/gi) }
/blah hi andre! works
/blah hi andreee! works
/blah hi andreé! works
/blah hi andrée! works
/blah hi andré! doesn't work
/blah hi andrééé! doesn't work
That is, "é" is treated as a non-word character, so there is no boundary between it and the "!". Actually, I need every accented letter and "çÇ" to count as part of a "word" too, by some simple means, maybe something like \w. I didn't find anything, and can't use something like "[à-äè-ïò-öú-ýÿÀ-ÄÈ-ÏÒ-ÖÙ-ÝŸçÇ\w]".
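The results listed above can be reproduced outside mIRC, which also makes it easy to try a workaround. This is a minimal sketch in Python (chosen only because it is easy to run; the syntax involved is shared with PCRE). It assumes the `re.ASCII` flag is a fair stand-in for a regex build whose \w knows only unaccented letters. `WORD` and `count_matches` are hypothetical names. The idea is to replace each \b with a lookbehind or lookahead on a hand-written "word character" class.

```python
import re

# What should count as part of a word: \w plus the Latin-1 accented letters
# (À-Ö, Ø-ö, ø-ÿ; the gaps skip the × and ÷ signs).
WORD = r"\wÀ-ÖØ-öø-ÿ"

# re.ASCII makes \w and \b ASCII-only, mimicking the behaviour in this post.
FLAGS = re.IGNORECASE | re.ASCII

# The original pattern, relying on \b.
plain = re.compile(r"\bandr[eéÉ]+\b", FLAGS)

# Same pattern with each \b replaced by a negative lookaround:
# "not preceded by a word character" and "not followed by a word character".
custom = re.compile(rf"(?<![{WORD}])andr[eéÉ]+(?![{WORD}])", FLAGS)

def count_matches(pattern, text):
    """Number of non-overlapping matches, like $regex with the g modifier."""
    return len(pattern.findall(text))
```

Here `count_matches(plain, "hi andré!")` gives 0, as in the list above, while `count_matches(custom, "hi andré!")` gives 1, and "andrew" or "alexandre" are still rejected. The same lookaround pair should carry over to a $regex pattern, though that is untested here.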
One more problem (the last; I'm getting a little confused trying to read all this) is that I don't really get how the capturing of subpatterns works. I can't control exactly what gets captured into \2 etc., because I don't always know how to use "(?:blah)" (to be sure it won't capture) when the subpattern already has a function, like a lookahead/lookbehind assertion. I only ever know what goes into \1; I can't tell what goes into \2, \3 and so on.
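On the numbering itself: in PCRE-style syntax a group's number comes from counting plain opening parentheses "(" from left to right. The "(?:...)" form and the assertions "(?=...)", "(?!...)", "(?<=...)", "(?<!...)" get no number of their own, but a plain "(...)" nested inside an assertion is still counted. A minimal sketch in Python, where the rule is the same (the pattern and the sample text are made up for illustration):

```python
import re

pattern = re.compile(
    r"(?<!\w)"         # lookbehind assertion: gets no number
    r"(andr(?:e|é)+)"  # group 1; the inner (?:...) gets no number
    r"(?=(!*))"        # the lookahead gets no number, the (!*) inside it is group 2
    r"(\W?)"           # group 3
)

m = pattern.search("hi andré!!")

# Number of capturing groups, and what each one caught.
group_count = pattern.groups
captured = m.groups()
```

Here `group_count` is 3 and `captured` is `("andré", "!!", "!")`: group 2 sees both exclamation marks without consuming them, so group 3 can still match the first one.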
That's it. Regular expressions drive me crazy. I'd really like a good (full) mIRC-related tutorial. For now I'm reading man.txt, but it lacks practical examples, since it's not really mIRC-related. Can anyone give some of those examples too?