$regml().pos $regmlex().pos - mIRC Discussion Forums


This might relate to one of these threads
https://forums.mirc.com/ubbthreads.php/topics/265044/pos-property-regml-regmlex-unicode
https://forums.mirc.com/ubbthreads.php/topics/220537/7-xx-bug-in-regml-pos

When not using //u, $regml().pos and $regmlex().pos report a position that is 1 or 2 too high if the match is the 2nd or 3rd byte of a UTF-8 encoded character. This can make a later byte in the string report the same position as, or an earlier position than, an earlier byte. The older thread seems to indicate that .pos originally returned the byte position within the UTF-8 string, but was adjusted to return the $pos() position. If it's undesirable for all encoding bytes of the same character to return the same position, a .utfpos property might be useful.

//var -s %a $chr(233) $+ foo $+ $chr(10004) $+ bar | echo -ag $regex(foo1,%a,/(f)/) pos: $regml(foo1,1).pos vs $regex(foo2,%a,/(\xa9)/) pos: $regml(foo2,1).pos and $regex(foo3,%a,/(\x94)/) pos: $regml(foo3,1).pos vs $regex(foo4,%a,/(b)/) pos: $regml(foo4,1).pos | var -s %b $regsubex(foo,%a,/(.)/g,$base($asc(\t),10,16,2) $+ $chr(32)) | var %i 1 , %string | while (%i <= $regml(foo,0)) { var %string %string match# %i $gettok(%b,%i,32) $regml(foo,%i) is at pos: $regml(foo,%i).pos | inc %i } | echo -a %string

1 pos: 2 vs 1 pos: 2 and 1 pos: 7 vs 1 pos: 6

match# 1 C3 is at pos: 1 match# 2 A9 is at pos: 2 match# 3 66 is at pos: 2 match# 4 6F is at pos: 3 match# 5 6F is at pos: 4 match# 6 E2 is at pos: 5 match# 7 9C is at pos: 6 match# 8 94 is at pos: 7 match# 9 62 is at pos: 6 match# 10 61 is at pos: 7 match# 11 72 is at pos: 8

Thanks for your bug report. If you are passing a Unicode string, you should be using //u. There is no valid result for .pos without //u in this case, as the string is processed as UTF-8 internally. The returned .pos must reference the original string, not a non-existent string. That said, I would prefer it if each of the UTF-8 bytes forming a Unicode character returned a .pos pointing to the start of that character. That would make more sense. This change will be in the next beta.
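To illustrate the "start of that character" rule outside mIRC, here is a minimal Python sketch (not mIRC script, and not mIRC's actual implementation). The helper name `char_pos_per_byte` is made up for this example, and it assumes characters are counted as code points. It lists, for every UTF-8 byte of the test string from the report, the 1-based position of the character that byte belongs to.

```python
def char_pos_per_byte(text: str) -> list[tuple[str, int]]:
    """Return (hex byte, 1-based position of the owning character) per UTF-8 byte."""
    result = []
    for char_pos, ch in enumerate(text, start=1):
        # Every byte of a multi-byte character maps to the same character position.
        for byte in ch.encode("utf-8"):
            result.append((f"{byte:02X}", char_pos))
    return result


# Same test string as in the report: $chr(233) $+ foo $+ $chr(10004) $+ bar
sample = "\u00e9foo\u2714bar"

for hex_byte, pos in char_pos_per_byte(sample):
    print(hex_byte, "is at pos:", pos)
```

Under that rule, C3 and A9 both map to position 1, and E2, 9C and 94 all map to position 5, so positions never decrease as the byte index increases.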

Then again, I'd say that the regex routine should apply //u automatically if it sees a Unicode string. However, this was probably not added as it might have affected older scripts, although I am not seeing what the side-effects might be at this point.

This behavior will be difficult to script around. The $2 parm of $bvar() requires a byte offset, but the .pos property returns a character offset. So a script can use $bvar(&binvar,$regml(foo,1).pos,length) to extract the matched string only if the binvar contains nothing but codepoints 1-127. When dealing with binvars, it would be helpful to have a $regml().bytepos property so the script can know the match's actual offset in the &binvar.
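The mismatch between the two kinds of offset can be shown with a short Python sketch (an illustration only, not mIRC script). The function `char_pos_to_byte_pos` is a hypothetical helper, and the sketch assumes 1-based positions with characters counted as code points.

```python
def char_pos_to_byte_pos(text: str, char_pos: int) -> int:
    """Convert a 1-based character position to a 1-based UTF-8 byte position."""
    if not 1 <= char_pos <= len(text):
        raise ValueError("char_pos out of range")
    # Bytes occupied by everything before the character, plus one.
    return len(text[: char_pos - 1].encode("utf-8")) + 1


sample = "\u00e9foo\u2714bar"      # same test string as in the report
data = sample.encode("utf-8")      # stands in for the contents of a &binvar

char_pos = sample.index("b") + 1   # character offset of "b"
byte_pos = char_pos_to_byte_pos(sample, char_pos)

# Using the character offset as a byte offset picks up the wrong bytes.
wrong = data[char_pos - 1 : char_pos - 1 + 3]
right = data[byte_pos - 1 : byte_pos - 1 + 3]
print(char_pos, byte_pos, wrong, right)
```

Here "b" is the 6th character but the 9th byte, so reading 3 bytes from offset 6 returns the three bytes of the check mark rather than "bar".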

Also, returning offsets smaller than the $2 parm will be confusing when addressing the binvar, so scripts will be forced to copy that section of the binvar into a new binvar and search it from offset 1.

Okay, I have added a $regml().bytepos property to the next beta. This will be filled for all types of regex matches, since the regex search is byte-based, not UTF-16-based, anyway.
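As a rough illustration of how a byte-based search can report both kinds of offset, here is a Python sketch using the standard `re` module on UTF-8 bytes. It is not how mIRC is implemented. The function name `find_with_positions` is hypothetical, and the character position follows the "start of the containing character" rule, with characters counted as code points.

```python
import re


def find_with_positions(text: str, pattern: bytes):
    """Search UTF-8 bytes; return (1-based byte pos, 1-based char pos) or None."""
    data = text.encode("utf-8")
    match = re.search(pattern, data)
    if match is None:
        return None
    byte_pos = match.start() + 1

    # Step back over continuation bytes (bit pattern 10xxxxxx) so the
    # character position refers to the start of the containing character.
    start = match.start()
    while start > 0 and (data[start] & 0xC0) == 0x80:
        start -= 1
    char_pos = len(data[:start].decode("utf-8")) + 1
    return byte_pos, char_pos


sample = "\u00e9foo\u2714bar"  # same test string as in the report
print(find_with_positions(sample, rb"f"))      # first ASCII character
print(find_with_positions(sample, rb"\x94"))   # last byte of the check mark
print(find_with_positions(sample, rb"b"))      # ASCII after a 3-byte character
```

The byte position can be used directly to index the encoded data, while the character position matches what a character-based lookup on the original string would expect.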