


-U switch to prevent 0-255 UTF encoding text i/o #260662 01/06/17 10:50 PM
Joined: Jan 2004 Posts: 2,127 maroon OP Hoopy frood
It would be nice if the exemption from UTF-8 encoding for the 0-255 range were also available in the other text i/o commands.

Quote:
32.Added -a switch to all binary variable commands that makes them not apply UTF-8 encoding to characters in the range 0-255, as long as the line contains no characters > 255.

The -u switch has already been taken for /hsave, and /filter has it as a subswitch for -t sort, but it's available for /write and /writeini. Perhaps a unified choice could be -U. Or, since the binary commands use -a, the text commands could use -A.

Code:
//var %f chlo $+ $chr(233) $+ .txt | write -c %f %f | hfree test1 | window -l @test2 | clear @test2 | noop $findfile($mircdir,chlo?.txt,0,1, set %x $nopath($1-) ) | hadd -ms test1 1 %x | hsave -sn test1 $replace(%x,.,1.) | aline @test2 %x | var %y $line(@test2,1) | echo -a var x $len(%x) %x var y $len(%y) %y | filter -wfc @test2 $replace(%x,.,2.) * | write -c $replace(%x,.,3.) %x | writeini $replace(%x,.,4.) section item %x | noop $findfile($mircdir,chlo $+ $chr(233) $+ *.txt,0,1,echo -a $file($1-).size $nopath($1-) ) | run cmd /k type %x

This creates the 9-byte chloe.txt filename with the accented small-e, but writes a 10-byte filename to that same file, with the accented-e encoded into 2 bytes. In all 4 methods of writing to disk, it encodes $chr(233) as the 2 bytes $chr(195) + $chr(169), which is why a 9-byte filename plus the 2 bytes of $crlf is written as a 12-byte file instead of an 11-byte one. Even though the file contains a 10-byte line plus 2 bytes for the $crlf, $len($read(chlo $+ $chr(233) $+ .txt,nt,1)) returns 9 because it decodes the text before calculating the length.

Currently, using $findfile to load a directory listing into a @window and then using /filter to write it to disk results in a file listing of encoded text instead of the actual filenames.
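The byte counts described above can be reproduced outside mIRC. The following is a minimal Python sketch, not mIRC's own code; it assumes the filename is "chlo" + $chr(233) + ".txt" and that the written line ends with CRLF. The variable names are made up for illustration.

```python
# Illustrative only: reproduces the byte counts from the post outside mIRC.
name = "chlo\u00e9.txt"  # 9 characters; \u00e9 is code point 233

# UTF-8 stores code point 233 as two bytes (195, 169).
utf8_line = name.encode("utf-8") + b"\r\n"

# Latin-1 stores every code point 0-255 as exactly one byte.
latin1_line = name.encode("latin-1") + b"\r\n"

print(len(name))                        # character count of the filename
print(list("\u00e9".encode("utf-8")))   # byte values used for code point 233
print(len(utf8_line), len(latin1_line)) # file sizes for the two encodings
```

The 12-byte versus 11-byte difference comes entirely from how the one character above 127 is stored.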
Other than trying to load the text lines into binary variables and bwrite'ing the text lines individually, I'm not sure how to write the filenames to disk and not the utfencoding of the filenames.

Code:
//write -c chlo $+ $chr(233) $+ .txt X | write -c test.txt X | noop $findfile($mircdir,chlo $+ $chr(233) $+ .txt,0,1,set %f $nopath($1-)) | echo -a %f | bset -ta &b 1 %f | bset -t &b2 1 %f | echo -a $bvar(&b,0) $bvar(&b,1-) / $bvar(&b2,0) $bvar(&b2,1-) | bwrite %f 0 $bvar(&b,0) &b | bwrite test.txt 0 $bvar(&b2,0) &b2 | echo -a $file(test.txt).size $crc(test.txt,2) $crc(&b2,1) $crc(%f,0) / %f $file(%f).size $crc(%f,2) $crc(&b,1)

From this command, it seems $crc (and $sha512 and $sha1) are being given utf-encoded text instead of the text containing $chr(233), so part of this 'wishlist' would include something like $crc(text,3) that would avoid hashing the utf-encoded form of 0-255 text. Results for mIRC's built-in $sha512 and the equivalent from Saturn's old sha2.dll are the same, so it looks like the text provided for $dll(sha2.dll,sha512,0 text) is also being encoded before mIRC gives it to the .dll, so a $dll().prop could allow DLLs to be given 128-255 characters unencoded.

I don't yet understand exactly how $encode works, but I know mime converts 3 bytes into 4 text characters, and this repeating pattern confirms that $encode is being given the 2-byte encoding of $chr(233), since there would be no such repeating TcOp pattern if each copy of %t were only 2 bytes.

Code:
//set %t M $+ $chr(233) | echo -a $encode($str(%t,5),m)
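The repeating TcOp pattern can be checked with any Base64 implementation. This is a small Python sketch, assuming that $encode with the m switch produces standard Base64 and that the input is five copies of "M" + $chr(233); it is not a model of mIRC internals.

```python
import base64
import zlib

text = "M\u00e9" * 5  # same text as $str(%t,5) with %t = M + $chr(233)

as_utf8 = text.encode("utf-8")      # 3 bytes per copy: 0x4D 0xC3 0xA9
as_latin1 = text.encode("latin-1")  # 2 bytes per copy: 0x4D 0xE9

# Base64 maps each group of 3 input bytes to 4 output characters, so a
# 3-byte repeating unit yields a 4-character repeating unit.
b64_utf8 = base64.b64encode(as_utf8).decode("ascii")
b64_latin1 = base64.b64encode(as_latin1).decode("ascii")
print(b64_utf8)
print(b64_latin1)

# Checksums are computed over bytes, so each encoding has its own value.
print(hex(zlib.crc32(as_utf8)), hex(zlib.crc32(as_latin1)))
```

With the UTF-8 bytes, every copy of the text fills exactly one 3-byte Base64 group, which is why the same four characters repeat. With 2-byte copies the groups drift out of alignment and no such 4-character repeat appears.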

Re: -U switch to prevent 0-255 UTF encoding text i/o maroon #260665 02/06/17 10:02 AM
Joined: Apr 2004 Posts: 871 The Netherlands Sat Hoopy frood
I'm afraid that most of your post is based on a fundamental misconception.

Quote:
This creates the 9-byte chloe.txt filename with the accented small-e, but writes a 10-byte filename to that same file, with the accented-e encoded into 2 bytes.

No, it creates a 9-character filename, where each character may have a value in the range 0-65535 (let's stick to Unicode plane 0 for simplicity here) rather than the 0-255 range of byte values. As such, in order to store such a string of characters (the filename) as a string of bytes (file data), a conversion must take place, and this necessarily may require multiple bytes per character. The conversion that mIRC performs is a conversion to the standardized UTF-8 encoding of the string, where each character value above 127 is indeed converted to a set of multiple bytes. As a result of this encoding, every filename can be stored in such a way that it can be converted back again without losing anything, and that is exactly what happens when you have mIRC read from your file later on.

As such, your examples are misleading in that they use a character in the 128-255 value range, which erroneously suggests that the UTF-8 encoding is adding redundant bytes. Imagine a filename with a character in the 256-65535 range. How would you store such a filename as file data? You don't have to answer that, because mIRC and UTF-8 have already solved that for you. The only price to pay for that universal solution is that for characters in the 128-255 range, the UTF-8 encoding "looks" like it introduces unnecessary extra bytes, which it really doesn't.

So when you say this..

Quote:
I'm not sure how to write the filenames to disk and not the utfencoding of the filenames.

..you're confusing "filename" with "encoding of a filename". The UTF-8 encoding of the filename is in essence the filename.
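The point about characters in the 256-65535 range can be illustrated outside mIRC. This is a minimal Python sketch with a hypothetical filename that contains one character above 255; Latin-1 stands in here for a one-byte-per-character encoding and is an assumption for the example.

```python
# Hypothetical filename: code point 233 plus one code point above 255 (U+4E2D).
name = "chlo\u00e9_\u4e2d.txt"

# UTF-8 can encode every character and decode back without loss.
encoded = name.encode("utf-8")
round_trip_ok = encoded.decode("utf-8") == name

# A one-byte-per-character encoding has no byte value for U+4E2D.
try:
    name.encode("latin-1")
    fits_one_byte_per_char = True
except UnicodeEncodeError:
    fits_one_byte_per_char = False

print(len(name), len(encoded), round_trip_ok, fits_one_byte_per_char)
```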
What you call "the filename" here is a non-standard codepage-based encoding of it that only happens to work for your specific example and not for filenames in general. If you want to use such a custom, non-universal encoding, then yes, you'll have to work with binary variables, and /bset (in particular with -ta) already gives you the tools you need to do that. In your post you're not making any sort of case as to why mIRC should make it easier to make use of such encodings. Generally speaking they're a bad idea, and at most they should be used for interoperability with other applications that do not support UTF-8 yet.

The same applies to the rest of your suggestions: the CRC/SHA-1/etc identifiers all take binary variables, so if you have a reason to hash a specific set of bytes rather than a simple string (which is then implicitly encoded as UTF-8), you can and should use a binary variable as input. It would be nice if DLLs had a way to accept and manipulate binary variables, but that's really an altogether different issue.

Saturn, QuakeNet staff
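The distinction between hashing a string and hashing a specific set of bytes can be sketched in Python. This is an illustration under the assumption that a string is turned into bytes before hashing; the helper name sha1_of is hypothetical, and nothing here reflects how mIRC implements $sha1.

```python
import hashlib

def sha1_of(text: str, encoding: str) -> str:
    """Hash the bytes produced by encoding `text` with `encoding`."""
    return hashlib.sha1(text.encode(encoding)).hexdigest()

name = "chlo\u00e9.txt"  # contains code point 233

# Same string, two byte sequences, hence two digests.
utf8_digest = sha1_of(name, "utf-8")
latin1_digest = sha1_of(name, "latin-1")
print(utf8_digest == latin1_digest)

# For pure ASCII text both encodings give identical bytes and digests.
ascii_same = sha1_of("test.txt", "utf-8") == sha1_of("test.txt", "latin-1")
print(ascii_same)
```

A hash function only ever sees bytes, so choosing the bytes explicitly (as a binary variable does in mIRC) removes any ambiguity about which encoding was hashed.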
