
$maxlenl chars vs bytes

Posted By: maroon

$maxlenl chars vs bytes - 22/03/19 04:20 AM

I'm unsure if this is by design or is a bug, but there's at least 1 bug in here. Sometimes $maxlenl means 'characters' and other times 'bytes'.

Not an error: //echo -a $str($chr(10004),8280) $+ x is 24841 bytes

but other string functions are silently ignoring characters beyond the byte length. Both output the same hash:

//echo -a $sha256($str($chr(10004),2764))
//echo -a $sha256($str($chr(10004),2765))

i.e. 2764*3=8292

same happens with: $md5 $sha1/256/384/512 $crc $hmac $hotp $totp
but not with $hash

bset does not return an error, but limits bytes added to a &binvar at 8292:

//bset -t &v 1 $str($chr(10004),6000) | echo -a $bvar(&v,0) $sha256(&v,1)

result: 8292 (hash for 2764 UTF8 characters)

but /write can output a variable containing 8270 10004's.

$len doesn't report a number larger than 8292, and there's no line-too-long error:

//echo -a $len($utfencode($str($chr(10004),3000)))

It also appears that mIRC can receive a byte string from a DLL somewhat longer than double the 8292 bytes without crashing, but it doesn't appear that mIRC can send more than 8292 bytes worth of unicode characters to a dll. This is related to my question here
Posted By: Khaled

Re: $maxlenl chars vs bytes - 22/03/19 07:42 AM

The maximum internal length applies to both byte arrays and wide character arrays throughout mIRC. If you are dealing with strings, it means characters. If you are dealing with binary variables, it means bytes. That said, this may not apply everywhere, and if you are converting back and forth between strings and binary variables, longer lengths may be preserved, but that is not guaranteed.

Regarding $len(): it has no checks on string length - it is returning the length of the string that it is seeing. The issue is with $utfencode() which was originally designed to quietly truncate results at the maximum length. This actually has been discussed before - while it would be possible to change it to report errors, the odds are that this would break scripts. The same applies to many old identifiers that truncate quietly.

In your example, $utfencode() is creating a truncated string of 8292, which is beyond the maximum allowed string length of $maxlenl. The scripting language allows you some leeway but your string is at the very maximum of the leeway. At this point, use of the string may cause a string length error at some point in the scripting language, depending on how the string is used.
Posted By: Wims

Re: $maxlenl chars vs bytes - 24/08/20 10:47 PM

Hello, some feedback on this.

It would be helpful to add a parameter to $utfencode to make it not truncate. This is actually a problem in scripts, because $utfencode is the function you need to use to find the UTF-8 length of a string, to see whether it exceeds a limit or not.
Something like $utfencode(input,[charset],[%var|&binvar]), returning the correct length of the UTF-8 string when %var|&binvar is used, and copying only the maximum allowed length into the output %variable (and all of it if a &binvar, of course)? This would allow us not only to get the real length, but also to use a binvar to access all the bytes regardless of the length.


There is a note in versions.txt saying that binvars are no longer limited in the number of bytes that can be stored; one of the advantages of binvars in mIRC is to overcome the limit on characters for a line/parameter/etc.

Take /bset in maroon's example: $str() is evaluated correctly, because 6000 characters is fine. His bset uses -t, and it can very easily be argued that we are therefore dealing with a string according to your definition, so the limit should be in characters.
Now you could argue that no, this is a binary variable command and the -t switch does not override that.
That would be fine, except it goes against the principle of allowing binvars to hold any number of bytes. If a binvar can hold as much as we want, why is /bset silently limiting the number of bytes added? If it makes perfect sense for a binvar to hold that many bytes, then as long as $str() resolves and the total line length in characters is not beyond $maxlenl+100 (the current real limit), I don't see why maroon's /bset would fail. To me that's simply a bug in /bset. The expected result can be achieved via two /bset calls; I don't see why not with one.


$sha256 and the like being identifiers, I also don't understand why they would chop at the limit; the result returned won't exceed any limit.
It is extremely unpleasant not to get an error, especially with $maxlenl changing over time: you have no idea that only that much input has been used, and it will just cut a UTF-8 character in 'half' (which is the case in maroon's example when $maxlenl+100 = 10340).

$regsubex suffers from the same problem, and it's not very nice either: //noop $regsubex(foo,$str($chr(10004),6000),,) gives a 'line too long' error for $regsubex, despite $str() being fine and the result being $null.

It has to be said that $sha* etc. and $regsubex are not binary functions as far as the scripter is concerned.

I know that the common ground is converting to a single-byte array; I just don't think it's necessary to apply the limit there, as the scripting engine itself should be enough to handle that.
From my experience with mSL (but it's certainly true for custom aliases), any identifier parameter is limited to a maximum length of $maxlenl+Leeway (Leeway being 100 at the moment).
Of course there are exceptions, like $len, but the $sha* family should all be exceptions as well.
And if $len has no check on string length, $len can actually never return more than $maxlenl+Leeway: if you pass more than that as plain text to $len, the scripting engine stops and returns '$len: invalid parameter', but this is certainly not a limit on $len itself, just the engine parsing a plain-text parameter, I assume. As for non-plain-text parameters, you're limited to $maxlenl+Leeway anyway or you'll get a 'line too long' error. The same applies to $sha256.

That already gives us a limit on the number of bytes that can be written internally to the single-byte array from such an identifier call: the above x 4.

$maxlenl being 10240 for now, that's a limit of around 40KB from the mSL engine itself, and that memory is released immediately after the call (it might be a bit different memory-wise for $regsubex, but for most identifiers like that, I believe the memory is released immediately); I don't think mIRC is in any danger.

All in all, with Unicode and, in the future, 64 bits, I don't think the internal byte limit on functions that require conversion is making mIRC much safer; rather, we get stuck on things we feel should be working.
Of course mIRC needs some kind of limit, and it's nice to have it extended etc., but it makes sense to have a limit on the number of characters only in our scripts, since binvars are not limited.
I believe this limit derived from the parameter length in characters is enough.



Based on the above, I would like to see the limit removed for all applicable identifiers like that, because we can always use a binvar directly with the identifiers themselves (again, when applicable; I didn't check them all, but ideally it would always be applicable; it does work with the $sha* family, and via $bfind().regex for the regex identifiers).
Posted By: maroon

Re: $maxlenl chars vs bytes - 25/08/20 02:43 PM

Another aspect of this issue is the max length of bytes to/from a DLL. In this post I had asked that the documentation be updated, since it still mentioned the old limit of '900', and that it be made clear that exceeding the limit also involved a GPF crash bug. But instead of the 900 being changed to indicate the updated limit, the reference to the 900 limit was deleted.

It appears the string sent to the DLL is chopped at $maxlenl+100, the same way other identifiers have their parms chopped at that same length; however, if the string returned from the DLL back to mIRC is longer than RETURN-LIMIT bytes, it crashes the client.

I encountered the crash bug in a DLL where I was making an extension of the $rand(a,z) function, adding a 3rd parm which indicates the number of random characters within the range to output, i.e. $rand(a,z,10) would return 10 random letters. I was limiting the output to $maxlenl characters, and everything was fine as long as 'z' was replaced by a codepoint <= 2047, where each codepoint is UTF8-encoded as 1 or 2 bytes. However, when 'z' and 'a' were both replaced by codepoints above 2047, it was easy to make the DLL crash. It appears that RETURN-LIMIT is close to (($maxlenl + grace-length) x 2 + 100), and if the DLL sends more than RETURN-LIMIT bytes back to mIRC, that triggers the GPF.

When sending codepoints which UTF8-encode into 3 bytes, I was able to safely send the following strings from the DLL, but increasing the string by +1 character GPF-crashed mIRC.

v7.63, $maxlenl 10240 + grace-length 100 = 10340
safe = 6927*3 = (10240+100)*2+101

v7.61, $maxlenl 8192 + grace-length 100 = 8292
safe = 5563*3 = (8192+100)*2+105

v7.51, pre-$maxlenl 4096 + grace-length 54 = 4150
safe = 2800*3 = (4096+54)*2+100

Is there a memory structure which a DLL can look at to see what's the safe byte length to output back to mIRC, or should the documentation just be updated to indicate the max safe bytes whether or not the buffer is bumped up to be 3*(maxlenl+grace)?
Posted By: Khaled

Re: $maxlenl chars vs bytes - 25/08/20 03:20 PM

Quote
Is there a memory structure which a DLL can look at to see what's the safe byte length to output back to mIRC, or should the documentation just be updated to indicate the max safe bytes whether or not the buffer is bumped up to be 3*(maxlenl+grace)?

It should be possible to extend the DLL LOADINFO structure itself to include this in future versions. I have added this to my to-do list.
Posted By: Sat

Re: $maxlenl chars vs bytes - 27/08/20 10:44 AM

Originally Posted by maroon
When sending codepoints which UTF8-encode into 3 bytes, [..]

Although this phrasing kind of suggests you did :) I suspect you did not test this with a DLL that was in Unicode mode (i.e., setting LOADINFO.mUnicode to TRUE and using wchar_t pointers for 'data' and 'parms'), is that right? In Unicode mode, my simple test DLL can return the maximum of 10240 characters without problems. That also makes sense, as no transcoding is necessary in that case. As such, this problem should only occur when mIRC needs to transcode from UTF-8, in which case it does seem to make sense that you can pass about 20480 bytes into the same buffer.

Instead of documenting the exact rules with respect to buffer limits for non-Unicode DLLs, I would suggest that the DLL Support help file page recommend the use of Unicode DLLs whenever dealing with non-ASCII code points. My additional two suggestions for Unicode support on the DLL Support page would be:
- change "char" to "TCHAR" in the 'procname' declaration;
- document more clearly that mUnicode is set to FALSE by default, but may be set to TRUE, which affects the types of the two string parameters in subsequent calls into the DLL.
After all, for efficiency (and despite the defaults, which are in place for backward compatibility only), I think the general recommendation should be that new DLLs be built with Unicode support.

Originally Posted by Khaled
It should be possible to extended the DLL LOADINFO structure itself to include [the buffer size] in future versions.

Exposing the $maxlenl value to DLLs sounds like a good addition. There are many DLLs that can return large amounts of data, and being able to obtain the limits at run time will allow DLLs to be forward-compatible in terms of getting the most out of the buffer. As above, I do think this should focus on the Unicode DLL variant though.
Posted By: maroon

Re: $maxlenl chars vs bytes - 30/08/20 03:46 AM

After discussing this with Saturn, it turns out the issue is related to my DLL being in the default mode, and not having the UNICODE flag set. When I had originally tried to set the unicode flag, text was output as a stream of Chinese symbols, because I didn't realize there was a -DUNICODE compiler flag I was supposed to use too.

Apparently 'unicode' here is effectively the UTF16 encoding, where each character (in the basic plane) always uses 2 bytes, as opposed to UTF8, where the most common values 0-127 use 1 byte in a &binvar while the rest use either 2 or 3 bytes, depending on which range of codepoints they're in.

Since in UTF16 mode the 10340 characters always use 10340*2 bytes, that encoding works within the current buffer's byte length. However, that same number of bytes is being used by mIRC to receive 10340 characters from the DLL in the default unicode=NO setting, where it's possible for the same number of characters to be as long as 10340*3 bytes, causing the GPF.

For now I need to finish debugging my DLL to make sure all the functions still work before thinking of transitioning everything to UNICODE mode for the benefit of the one function that needs 10340*3 bytes in UTF8 mode. In Unicode mode, UTF16 is also the input format, which means the functions trying to hash strings would be returning the hash of the UTF16 bytes instead of the UTF8 bytes, so I'd need to translate the incoming data so those functions would once again return the correct answer.
© 2020 mIRC Discussion Forums