REGEX help - highlight parser, \b and more


Warning: confusing, disorganized text below.

I'm trying to make some sort of highlight script on mIRC using regular expressions. It should look for some strings in the given text, making sure it finds exactly what I want and being aware of punctuation and stuff like that (eg. find "cold.", "colddd!" but not "coldplay", "cold war" etc). It's not supposed to use token identifiers, since I'm not looking for separated strings. It should work like "if (a isin a a)" instead of "if ($istok(a a,a,32))", for many reasons.

My old highlight script doesn't do all this stuff. It does some of it, like being aware of similar strings ("coldplay", "cold war" etc). It uses 2 variables. It checks one ***, then it loops through the tokenized values of the other one, looking for every item in the given text. *** I forgot it doesn't loop through the first var, instead it replaces all 's by commas, checks for other existing commas etc. then dumps them as $remove() parameters.
The first variable contains the strings that I want to remove from the text before anything, the second one contains the strings that I want to find. Example:

%highlight.remove coldplaycold playcoldwarcold warcoldfusionstone coldlistening to cold
%highlight.find coldcooldcoooldcolldcollld

Actually it doesn't look for "cold"-related strings only. It looks for a lot of strings, like names, even parts of names, and any other varied stuff I want to be highlighted. My variables are 87489374 times bigger than these, filled with hundreds of possibilities, all of them going through a loop that is called every time anyone says something.

Well, it's too slow (benchmark-speaking) and IMO it's a nightmare to edit those vars. Although my IRC experience isn't affected by its speed loss (because either I can't notice much of it or I'm used to it), I suspect it's absorbing more resources than normal, because it's the only script that is massively used here and mIRC just makes everything slower and slower as time passes.. then if I deactivate the script, resources go fine for a much longer time. I could be wrong about this, but anyway.

Plus, this method begs for regular expressions. It's too limited, I hate having to specify one more item for a loop only because of a single char (space, hyphen, underscore, accented letter etc) instead of anything smarter, like a char telling the script that it could be any punctuation character, or that it couldn't be repeated, or couldn't be followed by "hi" etc. If I do that without the regex functions, I'll be making the script much slower.
But I barely know how to use reg. exps. efficiently, so I'm having a lot of problems.

One of them is how to make the expression as easily customizable as the old variables. I'd like some ideas about this. I tried to cover "cold" and "andré" (or "andre"), along with the exceptions to be removed ("coldplay", "andrew" etc) and the expression turned out to be a monster much uglier than the variables, and it still has some flaws, due to my limited knowledge.

Another problem is.. trying to cover "andré", I ran into a stupid difficulty. I was starting it with something like "/\bandr[eéÉ]+\b/gi" to cover punctuation and repetitions, but then I realized that \b (word boundary assertion) doesn't work with accented chars, since \b is a "position in the subject string where the current character and the previous character do not both match \w or \W" (www.pcre.org/man.txt):

alias blah { echo -ag $regex($1,/\bandr[eéÉ]+\b/gi) }
/blah hi andre! works
/blah hi andreee! works
/blah hi andreé! works
/blah hi andrée! works
/blah hi andré! doesn't work
/blah hi andrééé! doesn't work

That is, "é" is treated as a non-word character, so there is no word boundary between it and the "!" that follows, and the final \b fails. Actually, I need every accented letter and "çÇ" to be parts of a "word" too, or anything simple, maybe like using \w. I didn't find anything, and can't use something like "[à-äè-ïò-öú-ýÿÀ-ÄÈ-ÏÒ-ÖÙ-ÝŸçÇ\w]".
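The behaviour above can be reproduced outside mIRC. This is only a minimal Python sketch, assuming that Python's `re` with the `re.ASCII` flag is a fair stand-in for how mIRC's PCRE treats `\w` (only A-Z, a-z, 0-9 and underscore count as word characters). The helper name `matches` is made up for the illustration.

```python
import re

# Assumption: re.ASCII approximates mIRC's PCRE here, where only
# [A-Za-z0-9_] are word characters, so "é" is a non-word character.
PATTERN = re.compile(r"\bandr[eéÉ]+\b", re.IGNORECASE | re.ASCII)

def matches(text):
    """Return True if the \\b-based pattern finds a match in text."""
    return PATTERN.search(text) is not None

# Same six inputs as the /blah tests above.
for line in ["hi andre!", "hi andreee!", "hi andreé!",
             "hi andrée!", "hi andré!", "hi andrééé!"]:
    print(line, "->", "works" if matches(line) else "doesn't work")
```

Note that "hi andreé!" only "works" because the match stops at "andre": the boundary is found between "e" and "é", not after the "é".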

One more problem (the last, I'm a little confused trying to read all this) is that I don't really get how the capturing of subpatterns works. I can't define exactly what I want to be captured to \2 etc. because sometimes I don't know how to use "(?:blah)" (to make sure it won't capture) when the subpattern already has a function, like lookahead/lookbehind assertions. I only ever know what goes to \1.. I can't tell what goes to \2, \3 and so on.

That's it. Regular expressions drive me crazy. I'd really like a good mIRC-related (full) tutorial.. for now I'm reading man.txt, but it lacks practical examples, since it's not really related. Can anyone give some of these examples too?

A character class including A-Z, 1-9, and accented characters: [A-Z1-9\300-\377]. To invert that, you would use the ^ of course... I'm not sure about including beginnings and endings of lines, though.

I don't know what documentation you're using, but there is the php help: http://us4.php.net/manual/en/pcre.pattern.syntax.php

When you do manage to get a regular expression test working the way you want it, I would recommend generating a script file with an on text event for each pattern you are searching for. This should help the speed.

What I really need is something that works like \b, but supporting accented chars.
Your character class is almost like the one I've posted: "[à-äè-ïò-öú-ýÿÀ-ÄÈ-ÏÒ-ÖÙ-ÝŸçÇ\w]". All of these are in the \300-\377 range, but I won't use the other ones, like 'æ'. So, I'd need to use that big one and of course it still wouldn't work exactly like \b.. *sigh*.

As I said, I read www.pcre.org/man.txt, which is 4.x-related. mIRC uses PCRE 4.3 and the PHP manual says its syntax follows Perl 5.005, but this link seems much better since it's not a damn .txt file so I'll try it anyway, thanks.

Yeah, I didn't think about using ON TEXT events before.. I suppose you're talking about the new feature, using regex matches for the events?.. unfortunately it's not possible, I need to strip control codes before checking.
Plus, this search seems to need regex, because many items could have many conditions to be made.. like "search for 'andré' but not 'andréia', nor 'andré <some surname>'", "search for 'cold', forget 'coldplay'/'coldfusion'/'cold war'/'coldwar'/<put here more strings and many variations of them>; 'cooold' is fine, same exceptions for it ('coooldplay' etc)".
That would need more scripting and actually my point is to avoid it, being able just to fill a list of strings to be searched for, being filtered by a list of exception strings, pre-developed conditions etc. and then let mIRC do the job. My IRC environment really asks for something like this..
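One way to get that "fill two lists and let the regex do the job" setup is to build the expression from the lists instead of writing it by hand. This is a rough Python sketch of the idea, not mIRC code. The names `build_pattern`, `stems` and `banned_suffixes` are made up, and the word-character class is an assumption about what should count as part of a word.

```python
import re

# Assumption: these characters are "part of a word"; \xC0-\xFF is the
# Latin-1 accented range discussed above.
WORD = r"[\w\xC0-\xFF]"

def build_pattern(stems, banned_suffixes):
    """Build one regex from a list of stems and a list of exceptions."""
    # Let every letter of a stem repeat: "cold" -> c+o+l+d+
    alts = "|".join("".join(re.escape(ch) + "+" for ch in stem)
                    for stem in stems)
    # Text that must NOT follow a stem (optionally after one space).
    banned = "|".join(re.escape(s) for s in banned_suffixes)
    return re.compile(
        rf"(?<!{WORD})(?:{alts})(?!\s?(?:{banned}))(?!{WORD})",
        re.IGNORECASE | re.ASCII,
    )

pattern = build_pattern(["cold"], ["play", "war", "fusion"])
for line in ["it is cold.", "so colddd!", "I like coldplay",
             "the cold war", "coooldplay", "don't scold me"]:
    print(line, "->", bool(pattern.search(line)))
```

Exceptions that come before the word ("stone cold", "listening to cold") are not covered here. They would need a lookbehind per exception, which this sketch leaves out.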

I need to strip control codes before checking.

Use the S switch, see versions.txt section 123 for details.

Oops, I didn't know about that, thanks. But then I'd fall into this problem..
By the way,

on $*:TEXT:m/regular expression/switches:#:/echo message: $1-

What's this 'm' supposed to mean?

You can replace \b<something> with (?<=^|\s)<something> and <something>\b with <something>(?=$|\s). These are called assertions, the first is a positive lookbehind assertion, the second is a positive lookahead. Look them up in man.txt. To match, for example, "blah" or "bleh" only as separate words you'd use
(?<=^|\s)bl[ae]h(?=$|\s)
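Here is the same idea as a small Python sketch, in case it helps to experiment outside mIRC. Two things in it are assumptions and not part of the pattern above. As far as I know Python's `re` rejects `(?<=^|\s)` because its lookbehind alternatives differ in width, so the sketch uses the equivalent negative forms `(?<!\S)` and `(?!\S)`. The `WORD_CHAR` class is a made-up name for whatever you decide counts as a word character.

```python
import re

# Assumption: anything in this class is part of a "word"; \xC0-\xFF is
# the Latin-1 accented range mentioned earlier in the thread.
WORD_CHAR = r"[A-Za-z0-9_\xC0-\xFF]"

# Whitespace-delimited: same idea as (?<=^|\s)bl[ae]h(?=$|\s).
SPACED = re.compile(r"(?<!\S)bl[ae]h(?!\S)")

# Punctuation-tolerant stand-in for \b that accepts accented letters.
ANDRE = re.compile(rf"(?<!{WORD_CHAR})andr[eéÉ]+(?!{WORD_CHAR})",
                   re.IGNORECASE | re.ASCII)

print(bool(SPACED.search("well bleh then")))  # surrounded by spaces
print(bool(SPACED.search("well bleh!")))      # "!" is not whitespace
print(bool(ANDRE.search("hi andrééé!")))      # "!" is not a word char
print(bool(ANDRE.search("hi andréia")))       # "i" is a word char
```

The whitespace version refuses "bleh!" because "!" is not whitespace. The class-based version is closer to what \b does, since it only cares whether the neighbouring character is a word character.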

Regarding m/pattern/, m allows you to set the quote character. By default, the quotes (ie the pair of characters that encloses the actual pattern) are /..../. With m, you can use any other char. For example, here:
mi^rc$i
the actual pattern is ^rc$, the i's play the role of /.

I know lookahead and lookbehind assertions, but didn't think of using ^ and $ with them, somehow I thought these were separated special chars, thus this couldn't be possible. Thanks!

Do you know any good regex example sources (not really having to be mIRC-related, but at least understandable from such point of view)? I always find only the basic ones, like in ms.org and PHP stuff..

Thanks for the explanation on 'm' too.

I don't know of many regex tutorials but one of the first I've read and still like is this. I also found some useful info (although mixed with a lot of irrelevant stuff) in Perldoc.com's regex page.

Thanks for the links.

Can you (or anybody else) help me with more?

Let's say I have this:

//var %re | echo -a $regsub(eh.. some random text etc.. blahblah eh?,/(rand|text|eh)/gi,04\1,%re) -> %re
*** 4 -> eh.. some random text etc.. blahbleh?

Ok, it matches what I've specified.. then how can I match *everything but* what I specify, but keeping the whole text there (ie. not just dropping the subpattern)?
In this case, I'd want it to echo:

*** 4 -> eh.. some random text etc.. blahbleh?

I've read the whole PHP-based docs at the link you passed, but couldn't figure this out yet.

Now, if I wanted to assign a different colour for every found match of the above, I'd have to use $regsub() 3 times, is this right?

//var %re | echo -a *** $null( $regsub(eh.. some random text etc.. blahbleh?,/(rand)/gi,04\1,%re) $regsub(%re,/(text)/gi,03\1,%re) $regsub(%re,/(eh)/gi,06\1,%re) ) %re
*** eh.. some random text etc.. blahbleh?

Or is there a better way? Assigning different references to every item to use only one $regsub() wouldn't work for many reasons, as far as I know..

Thanks in advance.

Uh, can't edit it anymore..
The first command line I posted should have been

//var %re | echo -a $regsub(eh.. some random text etc.. blahbleh?,/(rand|text|eh)/gi,04\1,%re) -> %re

"blahbleh", not "blahblah eh".

eh?

Exclusive matching in regex is not only much more complicated but, in many cases, not well defined. If we're talking about single characters, things are pretty clear and a [^class] takes care of this: [^abc] matches any character except a, b or c. But for strings of more than one letter, issues arise. Not only is it hard to exclusively match, you must also make sure that parts of the string that you want to exclude aren't matched. For example, you want to match anything but "blah". Let's say you have an input string "Hello blah". What should regex match here? Just "Hello "? Or "lah", "la", "ah", "l", "h" etc as well? All these strings, except the first "Hello ", are inside "blah" (so the pattern that matches them doesn't have to match "blah" too). So, what you need is not only to exclusively match parts of the string but also to 'consume' the parts of the string that do match. This would mean that at least some part of the pattern must match "blah" and eat it, so that the other parts of the pattern, which exclude "blah", work correctly. According to the above, your previous example can be written like this:

Code:

//var %re | echo -a $regsub(eh.. some random text etc.. blahbleh?,/(.*?)((?:rand|text|eh|$)+)/gi,04\1\2,%re) -> %re

It's a bit of a trick but it works. The only, minor, drawback is that you can end up with unnecessary 04 strings in %re, but this can easily be taken care of with $remove(%re,04).
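Another way to look at "consume the matches, colour the rest" is to split the text on the words and wrap only the pieces in between. The sketch below is Python and uses `re.split`, so it is a different technique from the $regsub() trick above and not something mIRC offers directly. The function name `highlight_except` and its parameters are made up, and `\x03` followed by `04` is assumed to be the colour code you want.

```python
import re

def highlight_except(text, words, start="\x0304", end="\x03"):
    """Wrap every stretch of text that is NOT one of `words` in start/end."""
    pattern = "(" + "|".join(re.escape(w) for w in words) + ")"
    # With a capturing group, re.split keeps the separators:
    # even indexes are the non-matching text, odd indexes are the matches.
    pieces = re.split(pattern, text, flags=re.IGNORECASE)
    out = []
    for i, piece in enumerate(pieces):
        if i % 2 == 0 and piece:
            out.append(start + piece + end)  # colour the "everything else"
        else:
            out.append(piece)                # leave the matched word alone
    return "".join(out)

# Visible markers instead of control codes, to make the result readable.
print(highlight_except("eh.. some random text etc.. blahbleh?",
                       ["rand", "text", "eh"], start="<", end=">"))
# prints: eh<.. some >rand<om >text< etc.. blahbl>eh<?>
```

Empty pieces are skipped, so this version does not leave stray colour codes behind.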

The only feature of PCRE that's indirectly related to this are assertions. You can't exactly exclude a string with assertions but you can have a peek at the string lying ahead or behind the current matching position. In some cases, this can be used for exclusive matching, like in this example, which catches every word in a sentence excluding the word "blah":

Code:

//echo -a $regex(hello blah world bleh foo,/(?<=^|\s)(?!blah)(\w+)/gi) - $regml(0) - $regml(1) : $regml(2) : $regml(3) : $regml(4)
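For anyone who wants to try the same assertion trick outside mIRC, here is a small Python sketch. The helper name `words_except` is made up, and `(?<!\S)` stands in for `(?<=^|\s)` because, as far as I know, Python's lookbehind needs a fixed width.

```python
import re

def words_except(text, banned):
    """Return the whitespace-delimited words that don't start with banned."""
    # (?<!\S)  : we are at the start of a word (no non-space before us)
    # (?!...)  : the banned string does not start here
    # (\w+)    : capture the word itself
    pattern = rf"(?<!\S)(?!{re.escape(banned)})(\w+)"
    return re.findall(pattern, text, flags=re.IGNORECASE)

print(words_except("hello blah world bleh foo", "blah"))
# prints: ['hello', 'world', 'bleh', 'foo']
```

As written, the lookahead also rejects longer words that merely begin with the banned string (for example "blahblah"). Changing it to something like (?!blah(?!\S)) would restrict the exclusion to the exact word.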

That's about all I can think of to address this problem, if anyone has any other idea(s) I'd be very interested to hear about it.

Thanks for the explanation.