Binary Vars BUG #230803 20/03/11 09:41 PM
Asterix_UO (OP), Ameglian cow. Joined: May 2006. Posts: 27. Brasil.

Well, ever since Khaled added UTF-8 support, this bug has started to show up:

Code:
//bset -t &x 1 á | echo -a $bvar(&x,0)

It should return 1, but it's returning 2.

Thanks!

Suchorski @ FreeNode

Re: Binary Vars BUG Asterix_UO #230806 20/03/11 10:24 PM
starbucks_mafia, Hoopy frood. Joined: Dec 2002. Posts: 2,962. Norwich, UK.

It returns 2 because it takes 2 bytes to represent that character in UTF-8. Binary variables operate on bytes, not characters. You can use $len($bvar(&x,1-).text) if you want to know the number of characters it contains (assuming the binvar contains valid UTF-8, of course).

Last edited by starbucks_mafia; 20/03/11 10:27 PM.

Spelling mistakes, grammatical errors, and stupid comments are intentional.

Re: Binary Vars BUG Asterix_UO #230807 20/03/11 10:24 PM
drum, Pan-dimensional mouse. Joined: Dec 2002. Posts: 344.

It's not a bug. The character "á" is represented with two bytes in UTF-8.

Re: Binary Vars BUG drum #230892 24/03/11 05:06 AM
MeStinkBAD, Fjord artisan. Joined: Apr 2003. Posts: 342.

K, this is just $&*&! up...

$asc(á) = 225
$chr(225) = á

It's one byte...

Beware of MeStinkBAD! He knows more than he actually does!

Re: Binary Vars BUG MeStinkBAD #230893 24/03/11 05:57 AM
argv0, Hoopy frood. Joined: Oct 2003. Posts: 3,918. Montreal, QC, Canada.

It's not one byte. Please learn about Unicode. $chr() now returns Unicode codepoints. á is represented by codepoint U+00E1, or 225 in decimal. As shown by the page, the UTF-8 encoding of this codepoint is 2 bytes long:

UTF-8 (hex): 0xC3 0xA1

That is exactly what you get when you use it:

Code:
//bset -t &x 1 á | echo -a $base($bvar(&x,1),10,16) $base($bvar(&x,2),10,16)

- argv[0] on EFnet #mIRC - "Life is a pointer to an integer without a cast"
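For readers without mIRC at hand, the same check can be sketched in Python rather than mIRC script. This is only an illustration of the character-versus-byte distinction described above; the variable names are made up for the example.

```python
# Characters versus bytes for the character discussed in this thread.
text = "\u00e1"  # "á", code point U+00E1

code_point = ord(text)            # what $asc() reports: 225
utf8_bytes = text.encode("utf-8")  # what /bset -t stores in the binvar

# One character, but two bytes once encoded as UTF-8.
print(len(text))
print(code_point)
print(len(utf8_bytes))
print(" ".join(f"{b:02X}" for b in utf8_bytes))
```

The last line prints the two bytes in hex, which should match the values argv0 lists.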

Re: Binary Vars BUG MeStinkBAD #230895 24/03/11 06:15 AM
drum, Pan-dimensional mouse. Joined: Dec 2002. Posts: 344.

There's a good chart here that explains how each "code point" (in this case, 225 or U+00E1) is translated into raw byte(s). The result is that all characters between $chr(128) and $chr(2047) take up two bytes each.
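The chart drum refers to is not reproduced here, but the standard UTF-8 length rule it describes can be sketched in a few lines of Python (not mIRC script). The function name `utf8_length` is hypothetical, chosen for this example.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for a single code point."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point < 0x80:      # 0 to 127: ASCII, one byte
        return 1
    if code_point < 0x800:     # 128 to 2047: two bytes
        return 2
    if code_point < 0x10000:   # 2048 to 65535: three bytes
        return 3
    return 4                   # everything above: four bytes


# 225 (á) falls in the 128 to 2047 range, so it needs two bytes.
print(utf8_length(225))
```

The boundaries 128 and 2047 are the ones drum quotes for the two-byte range.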

Re: Binary Vars BUG drum #230906 24/03/11 09:12 PM
MeStinkBAD, Fjord artisan. Joined: Apr 2003. Posts: 342.

Code:
//bset -t &x 1 a | echo -a a > $bvar(&x,0) | bset -t &x 1 á | echo -a á > $bvar(&x,0)

They should both be two bytes... or one byte...

I haven't checked lately, but mIRC did not properly read UTF-16 encoded files the last time I checked... big endian or little endian... has this changed?

Re: Binary Vars BUG MeStinkBAD #230907 24/03/11 09:28 PM
starbucks_mafia, Hoopy frood. Joined: Dec 2002. Posts: 2,962. Norwich, UK.

Who said anything about UTF-16? mIRC uses UTF-8. Using UTF-16 would be pointless because every character takes at least two bytes and ASCII-range text is padded with null bytes, which makes it unusable over IRC.

Re: Binary Vars BUG MeStinkBAD #230910 24/03/11 09:54 PM
Wims, Hoopy frood. Joined: Jul 2006. Posts: 4,153. France.

No, they shouldn't.

Quote:
"It returns 2 because it takes 2 bytes to represent that character in UTF-8. Binary variables operate on bytes, not characters."

#mircscripting @ irc.swiftirc.net == the best mIRC help channel

Re: Binary Vars BUG MeStinkBAD #230913 24/03/11 10:17 PM
drum, Pan-dimensional mouse. Joined: Dec 2002. Posts: 344.

Originally Posted By: MeStinkBAD
"Code: //bset -t &x 1 a | echo -a a > $bvar(&x,0) | bset -t &x 1 á | echo -a á > $bvar(&x,0)
They should both be two bytes... or one byte... I haven't check lately but mIRC does not properly read UTF-16 encoded files since the last time I checked... big Endian or Little Endian... has this changed?"

mIRC uses UTF-8 to encode the characters you pass to /bset, which is why you aren't getting two bytes for both. If it used UTF-16, then you'd be right.

Using $read on three files with identical text, stored in UTF-8, UTF-16 LE, and UTF-16 BE, they all read just fine. As for binary variables and /bread, they work fine, but they just read raw bytes. If the data happens to be Unicode text, you have to determine which encoding the text is in on your own.

Edit: By the way, $bvar().text only works correctly with UTF-8 encoded text.

Last edited by drum; 24/03/11 10:26 PM.

Re: Binary Vars BUG Wims #230914 24/03/11 10:18 PM
drum, Pan-dimensional mouse. Joined: Dec 2002. Posts: 344.

Originally Posted By: Wims
"No they shouldn't"

Uh, binary variables can store any binary data, not just text. It doesn't even make sense to treat all binary data as characters.

Re: Binary Vars BUG drum #230916 24/03/11 10:31 PM
Wims, Hoopy frood. Joined: Jul 2006. Posts: 4,153. France.

Why exactly are you replying to me? My post agrees with what you and others said previously and emphasizes the fact that binary variables work on bytes, not characters, which is why:

Quote:
"They should both be two bytes... or one byte"

is wrong.

Re: Binary Vars BUG Wims #230917 24/03/11 10:43 PM
drum, Pan-dimensional mouse. Joined: Dec 2002. Posts: 344.

Originally Posted By: Wims
"Why exactly are you replying to me ?"

Haha, sorry, I misunderstood your previous post. It looked like you were stating "No they shouldn't" in response to "Binary variables operate on bytes, not characters." (But I realize now that's not what you meant.)

Re: Binary Vars BUG drum #230922 25/03/11 05:21 AM
argv0, Hoopy frood. Joined: Oct 2003. Posts: 3,918. Montreal, QC, Canada.

I was confused by this too.

Re: Binary Vars BUG drum #230924 25/03/11 10:09 AM
MeStinkBAD, Fjord artisan. Joined: Apr 2003. Posts: 342.

I'm unable to read "layout.ini" in the \Windows\Prefetch\ directory (it's currently the only UTF-16 text file I've got access to atm) using the standard $read function. I'll do some more testing later... but it doesn't treat UTF-16 text files as it should...

*** Does not read UTF-16 files with no BOM...

Last edited by MeStinkBAD; 25/03/11 10:21 AM.

Re: Binary Vars BUG MeStinkBAD #230925 25/03/11 11:12 AM
drum, Pan-dimensional mouse. Joined: Dec 2002. Posts: 344.

And how would mIRC even know the file is UTF-16 encoded if there is no BOM?

Re: Binary Vars BUG drum #230933 25/03/11 08:24 PM
argv0, Hoopy frood. Joined: Oct 2003. Posts: 3,918. Montreal, QC, Canada.

There are plenty of UTF-16 files with no BOM, and the spec explains how to deal with such files. From Wikipedia:

Originally Posted By: wikipedia
"If the BOM is missing, the standard says that big-endian encoding should be assumed."

Therefore mIRC should be able to handle these files; how to detect them is an interesting question, but not really our concern. In other words, whether mIRC performs auto-detection or we have to tell it manually isn't really the issue. There needs to be a way to access these files. I would consider it a bug if mIRC is unable to read non-BOM UTF-16 files given that it is supposedly "unicode compatible"; at worst, it should be a feature suggestion. Perhaps a new switch on $read and its counterparts to specify file encodings (which mIRC badly needs!) would help.

That said, this topic is totally tangential to binary vars, and we should open a separate discussion for the issue.

Re: Binary Vars BUG argv0 #230939 25/03/11 11:10 PM
drum, Pan-dimensional mouse. Joined: Dec 2002. Posts: 344.

Originally Posted By: argv0
"There are plenty of UTF-16 files with no BOM, and the spec explains how to deal with such files: From wikipedia: If the BOM is missing, the standard says that big-endian encoding should be assumed."

It's not that clear cut unfortunately, as explained by the next sentence from that Wikipedia article:

Originally Posted By: Wikipedia
"If the BOM is missing, the standard says that big-endian encoding should be assumed. (In practice, due to Windows using little-endian order by default, many applications also assume little-endian encoding by default.)"

For this reason, it's common for UTF-16 files without a BOM to be LE, not BE, if they originated from a Windows program.

Originally Posted By: argv0
"Therefore mIRC should be able to handle these files; how to detect them is an interesting question, but not really our concern. In other words, whether mIRC performs auto-detection or we have to tell it manually isn't really the issue. There needs to be a way to access these files. I would consider it a bug if mIRC is unable to read non-BOM UTF-16 files given that it is supposedly "unicode compatible"-- at worst, it should be a feature suggestion. Perhaps a new switch on $read and its counterparts to specify file encodings (which mIRC badly needs!) would help."

I generally don't like the concept of auto-detection. I'm sure you've seen the Notepad trick where you save some text, close and reopen the file, and the text becomes gibberish. UTF-8 is much more commonly used in text files anyway, so it should only be an issue in a small number of cases. However, I think it would be a good idea to add a switch to force mIRC to use a particular encoding when reading a text file.
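To make the BE-versus-LE ambiguity concrete, here is a small Python sketch (not mIRC script) that decodes the same BOM-less bytes both ways. The sample text "Hi" is an arbitrary choice for the example.

```python
import codecs

# "Hi" as UTF-16 little-endian with no BOM: 48 00 69 00
raw = "Hi".encode("utf-16-le")

as_le = raw.decode("utf-16-le")  # the intended text
as_be = raw.decode("utf-16-be")  # same bytes read as big-endian

print(ascii(as_le))
print(ascii(as_be))  # two unrelated CJK characters, not an error

# With a BOM in front, the generic "utf-16" codec picks the byte order itself.
with_bom = (codecs.BOM_UTF16_LE + raw).decode("utf-16")
print(ascii(with_bom))
```

Both decodes succeed, which is the problem: without a BOM nothing in the bytes themselves says which reading is the right one.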

Re: Binary Vars BUG drum #230943 26/03/11 05:04 AM
argv0, Hoopy frood. Joined: Oct 2003. Posts: 3,918. Montreal, QC, Canada.

Originally Posted By: drum
"It's not that clear cut unfortunately"

No, the specification is very clear cut. The behaviour of certain programs is incorrect, but that shouldn't really matter. The point is that mIRC should [at least] support the specification.

The other point is, again, that this discussion should be in another thread. BOMs and UTF-16 support have nothing to do with this issue.

Re: Binary Vars BUG argv0 #230972 26/03/11 06:45 PM
Khaled, Hoopy frood. Joined: Dec 2002. Posts: 5,428. London, UK.

It has been a while since I researched this, but if I recall correctly, the only way to determine the encoding format of a file without a BOM is to scan the contents of the file and to analyze it.

In the case of a text file that uses UTF-16, BE or LE, and is only storing characters from the Basic Latin and Latin-1 Supplement blocks of the Unicode table, you can check for alternating zero bytes. You will need to load the file, check for the BOM and, if it does not exist, continue reading the file as single-byte characters. If you come across a zero byte, the file is probably UTF-16 and you will need to start reading the file again from the beginning as UTF-16 (whether it is BE or LE depends on where the zero byte came in the sequence).

If there is an error in the file and the zero byte is not meant to be there (or if it was ANSI text saved with zero byte separators on purpose for some reason), loading it as UTF-16 results in garbage.

If there are no zero bytes, you will not be able to determine whether it is ANSI or UTF-16. For example, if you come across a UTF-16 file with no BOM that contains thousands of characters in the range 0x0101 to 0x017F, you will not be able to tell whether it is ANSI or Unicode. However, if it contains any character that has a zero byte (in the range 0x0000 to 0x00FF, or 0x0100 etc.), you can assume it is UTF-16.

While the above method will work for some text files (and mostly only for Latin text), I felt it was a little too unreliable and limited in scope, which is why I decided not to add support for it.
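The heuristic Khaled describes can be sketched roughly as follows, in Python rather than mIRC script. This is one reading of the description above, not mIRC's actual code; the function name `guess_encoding` and the return labels are assumptions made for the example, and it deliberately ignores other encodings such as UTF-32.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Guess UTF-16 byte order from a BOM or, failing that, the first zero byte."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"

    # No BOM: scan as single bytes and look for a zero byte.
    zero_at = data.find(b"\x00")
    if zero_at == -1:
        # No zero bytes at all: could be ANSI, UTF-8 or UTF-16, cannot tell.
        return "unknown"

    # For Latin text the zero is the high byte of a 16-bit unit.
    # High byte first (even offset) means BE, high byte second (odd) means LE.
    return "utf-16-be" if zero_at % 2 == 0 else "utf-16-le"


print(guess_encoding("Hi".encode("utf-16-le")))
print(guess_encoding("Hi".encode("utf-16-be")))
# Characters in 0x0101 to 0x017F contain no zero byte, so the guess fails.
print(guess_encoding("\u0101\u0102".encode("utf-16-le")))
```

The last call reproduces the failure case from the post: UTF-16 text with no zero bytes is indistinguishable from single-byte text under this check.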

Re: Binary Vars BUG Khaled #230978 26/03/11 11:59 PM
argv0, Hoopy frood. Joined: Oct 2003. Posts: 3,918. Montreal, QC, Canada.

Originally Posted By: Khaled
"It has been a while since I researched this but if I recall correctly, the only way to determine the encoding format of a file without a BOM is to scan the contents of the file and to analyze it."

Well, there is no reliable way to detect the proper encoding of a UTF-16 encoded file without a BOM, and by extension, there is no way to detect that a file is UTF-16 without a BOM (similarly, even the BOM itself isn't always a valid way to detect an encoding format). The auto-detection method you propose would be extremely slow since it would potentially scan the entire file, and would need to do this for every file. This would mean that for most files (the non-UTF-16 ones) you would always be scanning ~1000 chars prior to reading, every time! Slow!

That's why I proposed an extra switch in all $read/$fread commands to force a specific encoding. It's basically impossible (impractical) to auto-detect, so the scripter should have to tell mIRC in these cases. Telling the runtime what encoding you want to read a file as is fairly common in every language with robust encoding support. It's fine to assume UTF-8 as the default, and fair to allow basic auto-detection, but for certain encodings, we need a way to enforce this manually.

Re: Binary Vars BUG argv0 #230979 27/03/11 12:05 AM
argv0, Hoopy frood. Joined: Oct 2003. Posts: 3,918. Montreal, QC, Canada.

Note that having an encoding switch leaves room for adding back support for other encodings in the future, at least for reading such files from the FS. I'm not saying we should open up a debate about supporting codepages again, but it allows for this possibility if, in the future, it becomes easier to support other encodings. Basically, the encoding parameter could be undefined for any value other than utf8, ascii, utf16(bom?), utf16le and utf16be for now. In the future, you could add things like sjis, etc.

One way to add this would be to attach the parameter to the end of $read (and $fread too):

//echo -a $read(file.txt, wn, foo, 1, utf16le)

Since the current last param is [N] (numeric), you can autodetect when it is present, so we could also have:

//echo -a $read(file.txt, wn, foo, utf16le)

And of course everything should get transcoded to UTF-8 internally (or whatever mIRC's internal rep is).
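For comparison, this is roughly what "telling the runtime the encoding" looks like in a language that already supports it. The sketch is Python, not mIRC script; `read_line` is a hypothetical helper written for this example and says nothing about how such a switch would behave in mIRC.

```python
import os
import tempfile


def read_line(path, line_number, encoding="utf-8"):
    """Return line `line_number` (1-based) decoded with `encoding`, or None."""
    with open(path, "r", encoding=encoding) as handle:
        for current, line in enumerate(handle, start=1):
            if current == line_number:
                return line.rstrip("\n")
    return None


# Demo: a BOM-less UTF-16 LE file, the case discussed in this thread.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write("first\nsecond \u00e1\n".encode("utf-16-le"))

second = read_line(demo_path, 2, encoding="utf-16-le")
missing = read_line(demo_path, 99, encoding="utf-16-le")
os.remove(demo_path)

print(ascii(second))
print(missing)
```

UTF-8 stays the default, and the caller overrides it only for the files that need it, which is the behaviour argued for above.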

Re: Binary Vars BUG argv0 #231029 29/03/11 02:26 PM
MeStinkBAD, Fjord artisan. Joined: Apr 2003. Posts: 342.

/fopen should take the encoding switch. $read can detect this automatically, or at least make a good guess. $read always reads the entire file up to a specific line, and it will read the entire file if no line is specified.

There is no way to select the OUTPUT encoding with /write or /fwrite, though /fwrite should use the encoding that was specified with /fopen:

/fopen -e <encoding> <handle> <file path>

Re: Binary Vars BUG MeStinkBAD #231043 29/03/11 09:29 PM
starbucks_mafia, Hoopy frood. Joined: Dec 2002. Posts: 2,962. Norwich, UK.

This really needs to be posted as a feature suggestion, since it has nothing at all to do with the original bug report.
