Joined: Jan 2004
Posts: 1,358
Hoopy frood
OP Offline
I've found two problems with regular expressions in the beta; they may be facets of a single bug.

1. Capture groups are generally broken

Code:
//echo -ag $regex(abc,/(a)|(b)|(c)/g) :: $regml(1) $regml(2) $regml(3)

7.43      :  3 :: a b c
7.43.1418 :  3 :: a b


2. Fix 7 regarding $regml and ()* breaks backwards compatibility. This fix fills optional groups with $null instead of skipping the capture. Expressions which had been crafted to accomplish this behavior already are now broken in the beta.

For example, in 7.43 (the released, non-beta version; the versioning is curious) I crafted this regex so that $regml(1) contains either the scheme or $null, and $regml(2) always contains the domain.

Run against the two strings "http://domain.com" and "domain.com", the results are shown below. In 7.43 the domain is in $regml(2) in both cases; in 7.43.1418 the domain is pushed to $regml(3).

Code:
(?:(https?)://|())(domain.com)

Code:
//noop $regex(http://domain.com,(?:(https?)://|())(domain.com)) | echo -ag n = $regml(0) :: $!regml(1) = $regml(1) :: $!regml(2) = $regml(2)

7.43      : n = 2 :: $regml(1) = http :: $regml(2) = domain.com
7.43.1418 : n = 3 :: $regml(1) = http :: $regml(2) =

//noop $regex(domain.com,(?:(https?)://|())(domain.com)) | echo -ag n = $regml(0) :: $!regml(1) = $regml(1) :: $!regml(2) = $regml(2)

7.43      : n = 2 :: $regml(1) = :: $regml(2) = domain.com
7.43.1418 : n = 3 :: $regml(1) = :: $regml(2) =
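
For comparison, here is a minimal sketch of the two numbering schemes using Python's `re` module. It is not mIRC's PCRE build, so treat it only as an approximation. The helper names `fill_all` and `skip_unset` are made up for illustration, and the idea that 7.43 skips groups which took no part in the match is an assumption inferred from the output above, not from mIRC's source.

```python
import re

# Same workaround pattern as in the post: three capture groups in total.
PATTERN = re.compile(r"(?:(https?)://|())(domain.com)")

def fill_all(text):
    """Beta-like model: every group gets a slot; unset groups become ''."""
    m = PATTERN.search(text)
    return ["" if g is None else g for g in m.groups()]

def skip_unset(text):
    """7.43-like model (assumed): groups that took no part in the match are dropped."""
    m = PATTERN.search(text)
    return [g for g in m.groups() if g is not None]

for text in ("http://domain.com", "domain.com"):
    print(text, "| fill:", fill_all(text), "| skip:", skip_unset(text))
```

In the skip model the domain is always the second item, which is what the workaround relied on; in the fill model it is always the third.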


Now, I do prefer this new behavior where the original workaround is not necessary and optional groups are filled with $null:
Code:
(?:(https?)://)?(domain.com)

Code:
//noop $regex(domain.com,(?:(https?)://)?(domain.com)) | echo -ag n = $regml(0) :: $!regml(1) = $regml(1) :: $!regml(2) = $regml(2)
7.43.1418 : n = 2 :: $regml(1) = :: $regml(2) = domain.com

//noop $regex(http://domain.com,(?:(https?)://)?(domain.com)) | echo -ag n = $regml(0) :: $!regml(1) = $regml(1) :: $!regml(2) = $regml(2)
7.43.1418 : n = 2 :: $regml(1) = http :: $regml(2) = domain.com
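
A rough Python sketch of why the plain optional group only works with the new behavior. Again this uses Python's `re`, not mIRC's engine, and the function name `captures` and its `fill` flag are hypothetical. With `fill=False` (the assumed 7.43 model) the domain shifts to the first slot when the scheme is missing; with `fill=True` it stays in the second slot.

```python
import re

# Optional scheme group followed by the domain group: two capture groups.
PATTERN = re.compile(r"(?:(https?)://)?(domain.com)")

def captures(text, fill):
    """Return the capture list, either filling unset groups with '' or skipping them."""
    m = PATTERN.search(text)
    if fill:
        return ["" if g is None else g for g in m.groups()]
    return [g for g in m.groups() if g is not None]

for text in ("http://domain.com", "domain.com"):
    print(text, "| fill:", captures(text, True), "| skip:", captures(text, False))
```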


It seems to me that both backwards compatibility and the new behavior can be preserved. The problem is that (?:(https?)://|())(domain.com) is now interpreted as having 3 capture groups instead of 2; if http:// is present, I would not expect the empty () after the OR to be captured. If my thinking is wrong here and the two ideas are incompatible, maybe the new behavior can be kept behind a switch or property.

Joined: Dec 2002
Posts: 5,420
Hoopy frood
Offline
Quote:
Capture groups are generally broken

Okay, I tried your example and the PCRE API pcre16_get_substring() returned the following:

Code:
text: abc
expr: /(a)|(b)|(c)/g

PCRE returns:
Match ( 1/ 1) : ( 0, 1): 'a'
Match ( 1/ 2) : (-1,-1): ''
Match ( 2/ 2) : ( 1, 2): 'b'
Match ( 1/ 3) : (-1,-1): ''
Match ( 2/ 3) : (-1,-1): ''
Match ( 3/ 3) : ( 2, 3): 'c'

$regex() returns: 3

$regml() returns: 6 and items:
1 : 1 : a
2 : 0 :
3 : 2 : b
4 : 0 :
5 : 0 :
6 : 3 : c

Note: PCRE seems to be skipping subsequent empty matches. The above result is actually "a-- -b- --c". I have not found a way of making it include the full nine matches, as shown in the regex website tests below.
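
The same six-item shape can be reproduced as a sketch in Python's `re` module, which also reports (-1, -1) as the span of a group that was not set. This is not the PCRE API; the function name `pcre_like_items` is hypothetical, and using `m.lastindex` to stop at the last group that was set is only an imitation of the truncation seen in the listing above.

```python
import re

def pcre_like_items(pattern, text):
    """List (group, last_set_group, span, text) for each match, stopping at the last set group."""
    items = []
    for m in re.finditer(pattern, text):
        last = m.lastindex or 0  # highest-numbered group that matched, 0 if none
        for n in range(1, last + 1):
            # Unset groups have span (-1, -1) and group() returns None.
            items.append((n, last, m.span(n), m.group(n) or ""))
    return items

for item in pcre_like_items(r"(a)|(b)|(c)", "abc"):
    print(item)
```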

For the same regex pattern, Perl returns:

Code:
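my $text = "abc";
# In list context, /g returns every capture group of every match.
my @result = $text =~ /(a)|(b)|(c)/g;
for (my $i=1; $i <= @result; $i++) {
   # Groups that did not take part in a match are undef; print them as empty.
   print "$i:" . ($result[$i-1] || "") . "\n";
}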

1:a
2:
3:
4:
5:b
6:
7:
8:
9:c

The same test at rubular.com returns a similar result, as does a test at regexplanet.com, although regex101.com strips out empty matches, like pre-beta versions of mIRC.
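
As an illustration of the two conventions, here is a small Python sketch; it shows how Python's `re` behaves and is not a statement about how any of the sites above are implemented. `re.findall` gives one slot per group per match, which yields nine items like the Perl output, while dropping unset groups yields three items like pre-beta mIRC.

```python
import re

PATTERN = r"(a)|(b)|(c)"
TEXT = "abc"

# re.findall returns one tuple per match, with '' for groups that were not set.
with_empties = [g for groups in re.findall(PATTERN, TEXT) for g in groups]

# Dropping groups that were not set leaves only the real captures.
without_empties = [
    g for m in re.finditer(PATTERN, TEXT) for g in m.groups() if g is not None
]

print(len(with_empties), with_empties)
print(len(without_empties), without_empties)
```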

The reason I made this change is that pre-beta mIRCs were not capturing the initial empty group in the following situation:

Code:
test {
  var %text = :nick!ident@host.com PRIVMSG #testing :one two three
  var %re = ^(@\S+ )*\x3A(([^\s!@]+)![^\s!@]+@[^\s]+) PRIVMSG (#\S+) (\x3A.+)$
  noop $regex(test, %text, %re)
  var %n = $regml(test, 0)
  echo n: %n
  var %m = 1
  while (%m <= %n) {
    echo 1 %m : $regml(test, %m).pos : $regml(test, %m)
    inc %m
  }
}

result in pre-beta:
n: 4
1 : 2 : nick!ident@host.com
2 : 2 : nick
3 : 30 : #testing
4 : 39 : :one two three

result in beta:
n: 5
1 : 0 :
2 : 2 : nick!ident@host.com
3 : 2 : nick
4 : 30 : #testing
5 : 39 : :one two three

The result in the beta, which returns an empty match against (@\S+ )*, is preferred. Testing the above with regex101.com, its match list starts at 2 and ends at 5, so it also includes the empty match.
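
The two result lists can be modelled with a short Python sketch. It assumes Python's `re` treats this pattern the same way PCRE does, and the function name `items` and its `fill_unset` flag are hypothetical. Positions are converted to 1-based to match $regml().pos, and an unset group is reported at position 0 as in the beta output.

```python
import re

TEXT = ":nick!ident@host.com PRIVMSG #testing :one two three"
PATTERN = re.compile(
    r"^(@\S+ )*\x3A(([^\s!@]+)![^\s!@]+@[^\s]+) PRIVMSG (#\S+) (\x3A.+)$"
)

def items(fill_unset):
    """Return (1-based position, text) per group; unset groups are filled or skipped."""
    m = PATTERN.search(TEXT)
    out = []
    for n in range(1, PATTERN.groups + 1):
        if m.group(n) is None:
            if fill_unset:
                out.append((0, ""))  # beta-like: empty item at position 0
            continue
        out.append((m.start(n) + 1, m.group(n)))
    return out

print("skip:", items(False))
print("fill:", items(True))
```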

I see what you mean about workarounds in pre-beta versions of mIRC being broken with the new behaviour. The only solution I can think of is to only enable the new behaviour with a new regex modifier/switch - this would allow it to be used in contexts such as $regex(), event definitions, settings in options dialogs, and so on. There is a fairly comprehensive list of regex modifiers at rexegg.com. We would have to decide on a letter that is not currently in use.

Joined: Dec 2002
Posts: 5,420
Hoopy frood
Offline
Right, thanks for bringing this up, Loki12583. It looks like my implementation in the beta is not correct. I will be reverting beta.txt item 7 for the next version, so $regml() will go back to working how it did in previous versions.

However, I have added a $regml().group property that returns the () group number that matched the item.

I have also added $regmlex(name,N,M) where N is the () group number, and M is the Mth matching item for that group. If M is not specified, it defaults to 1. This allows you to directly reference a matching group item if you need to, which is the original issue I was trying to solve. This will be in the next version.
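
To make the described semantics concrete, here is a rough Python model of the .group property and of $regmlex. It is only a sketch of the behavior as described in this post, not mIRC's implementation: the functions take a pattern and text directly instead of a regex name, the names `regml_items` and `regmlex` are reused purely for illustration, and returning None for an out-of-range M is an assumption.

```python
import re

def regml_items(pattern, text):
    """Flattened (group_number, text) list for every group that was set, across all matches."""
    return [
        (n, match.group(n))
        for match in re.finditer(pattern, text)
        for n in range(1, match.re.groups + 1)
        if match.group(n) is not None
    ]

def regmlex(pattern, text, n, m=1):
    """Mth item captured by group n (M defaults to 1), or None if there is none."""
    hits = [value for group, value in regml_items(pattern, text) if group == n]
    return hits[m - 1] if 1 <= m <= len(hits) else None

print(regml_items(r"(a)|(b)|(c)", "abc"))
print(regmlex(r"(a)|(b)|(c)", "abc", 3))
print(regmlex(r"(\w)(\d)", "a1 b2", 2, 2))
```

The first element of each pair plays the role of $regml().group, so a script can find a particular group's capture without depending on its position in the flattened list.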

