


6.17 UTF-8 garbled - both clients setup same way #143558 26/02/06 01:56 AM
Joined: Feb 2006 Posts: 9 R rwbh OP Nutrimatic drinks dispenser
The Problem: Korean characters encoded in UTF-8 appear on the receiving client as garbled text mixed with capitalised raw commands. (But this may not be limited to Korean.)

Both clients:
* are mIRC 6.17
* have Multibyte display, Multibyte editbox, SJIS/JIS conversion, and process ANSI codes enabled
* use a font containing Hangul characters (I tested with Arial Unicode MS, Code2000, and Bitstream Cyberbit. Hangul is the Korean script.)
* have their font's script set to "Hangul" (This is done to be absolutely sure both clients are expecting Korean. When both are using UTF-8, the font's script should not matter, but oddly enough, it does matter in this experiment. That's another bug.)
* have UTF-8 set to "Display and Encode"

I tested this first with clones, then with copies of mIRC in separate directories, and lastly with several other users in various channels. All produced the same results. I also tested this on networks that used CHARSET=ascii (like EFnet, DALnet) and CHARSET=utf-8 (like UniLang). The network's character set should not (and did not) make a difference in the results.

On one client, type this Korean (Hangul) character: 걹
The HTML code for it is &#44153;
In Windows Character Map, it is labeled as: U+AC79 (0x819D): Hangul Syllable Kiyeok Eo Rieulkiyeok

*Note: To view it correctly in IE, go to menubar>View>Encoding and select "Unicode (UTF-8)". If that doesn't work, uncheck "Auto-Select", choose "Unicode (UTF-8)" again, or refresh the page. It should appear above as a single Korean character. If it appears correctly, you should be able to copy it from this page and try it yourself.

Results: On the sending client, it will appear correctly and immediately. On the receiving client, it will appear as garbled text with a capitalised raw command, even though both clients are set to Hangul and set to "display and encode" UTF-8. It will take a while to appear on the receiving client. If you wait, it will appear with a raw PONG command or a list of nicks. If you want to make it appear on the receiving client immediately, separately send any ASCII string in a second message. Then, it will appear with a raw PRIVMSG command followed by that ASCII string.

These results are not supposed to happen. The receiving client should immediately see the single Korean character, not garbled text after a long delay.

The same results can be achieved using other Korean (Hangul) characters such as: 굥
HTML code: &#44389;
Windows Character Map: U+AD65 (0x828B): Hangul Syllable Kiyeok Yo Ieung

And, again, this may not be limited to Korean.
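For anyone who wants to reproduce this without copying the characters from the page, here is a small Python sketch that derives the code point label, the HTML numeric reference, and the UTF-8 bytes for the two test characters. The helper name `describe` is made up for this illustration; it is not part of mIRC or of any tool mentioned in the thread.

```python
def describe(ch: str) -> tuple:
    """Return (code point label, HTML numeric reference, UTF-8 bytes) for one character."""
    cp = ord(ch)
    return (f"U+{cp:04X}", f"&#{cp};", ch.encode("utf-8"))

# The two Hangul syllables used in the report, written as escapes
# so the file's own encoding cannot get in the way.
first = describe("\uAC79")   # the first test character
second = describe("\uAD65")  # the second test character

print(first)
print(second)
```

The UTF-8 bytes are what a client set to "Display and Encode" is expected to put on the wire for these characters.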

Re: 6.17 UTF-8 garbled - both clients setup same way #143559 27/02/06 11:39 AM
Joined: Dec 2002 Posts: 5,424 London, UK Khaled Hoopy frood
If you disable SJIS/JIS conversion and ANSI processing, does the problem go away?

Re: 6.17 UTF-8 garbled - both clients setup same way #143560 28/02/06 10:23 AM
Joined: Feb 2006 Posts: 9 R rwbh OP Nutrimatic drinks dispenser
It works correctly if I disable SJIS/JIS conversion. I tried all combinations of those two options. The ANSI processing did not affect the results, so it does not seem to be involved.

Still, shouldn't SJIS/JIS only affect Japanese? And even then, isn't UTF-8 encoding designed to eliminate the need for SJIS and JIS?

Re: 6.17 UTF-8 garbled - both clients setup same way #143561 28/02/06 11:27 AM
Joined: Dec 2002 Posts: 5,424 London, UK Khaled Hoopy frood
It looks like the issue may be related to the order in which mIRC is applying the SJIS/JIS and UTF-8 encoding/decoding when both options are enabled. I should have this fixed in the next bugfix release. Thanks for checking it out.
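To illustrate why the order could matter, here is a minimal Python sketch. It assumes (this is a guess, not something confirmed in the thread) that the SJIS/JIS pass scans raw bytes after the text has already been UTF-8 encoded. The function names are hypothetical; the byte ranges are the usual Shift-JIS double-byte lead and trail ranges.

```python
def looks_like_sjis_pair(lead: int, trail: int) -> bool:
    """True if (lead, trail) falls inside the Shift-JIS double-byte ranges."""
    lead_ok = 0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xEF
    trail_ok = 0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFC
    return lead_ok and trail_ok

def sjis_like_pairs(data: bytes) -> list:
    """Scan left to right, as a byte-level SJIS pass might, collecting matching pairs."""
    pairs = []
    i = 0
    while i < len(data):
        if i + 1 < len(data) and looks_like_sjis_pair(data[i], data[i + 1]):
            pairs.append((data[i], data[i + 1]))
            i += 2  # a matched pair consumes two bytes
        else:
            i += 1
    return pairs

utf8_bytes = "\uAC79".encode("utf-8")  # the Hangul test character from the first post
print([tuple(hex(b) for b in p) for p in sjis_like_pairs(utf8_bytes)])
```

Under that assumption, the first two UTF-8 bytes of the test character are indistinguishable from an SJIS double-byte character, so a converter that runs after UTF-8 encoding would rewrite them.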

Re: 6.17 UTF-8 garbled - both clients setup same way #143562 01/03/06 10:12 AM
Joined: Feb 2006 Posts: 9 R rwbh OP Nutrimatic drinks dispenser
To clarify, SJIS/JIS had to be disabled on the sending client. Neither of those options on the receiving client had any effect on the results. So, it seems the problem is with encoding the outgoing text.

P.S. Disabling SJIS/JIS on the sending client also fixed the font script bug:

Quote: ... * have their font's script set to "Hangul" (This is done to be absolutely sure both clients are expecting Korean. When both are using UTF-8, the font's script should not matter, but oddly enough, it does matter in this experiment. That's another bug.) ...

6.17 UTF-8 encoding and SJIS/JIS encoding conflict #143563 02/03/06 12:34 AM
Joined: Feb 2006 Posts: 9 R rwbh OP Nutrimatic drinks dispenser
The single SJIS/JIS checkbox actually does two things. It encodes outgoing SJIS into JIS, and it decodes incoming JIS into SJIS (according to the "mirc.hlp" file). Think of the encoding and decoding as separate functions.

So far, we know:
* The receiving client can have SJIS/JIS decoding and UTF-8 decoding both enabled. The decoding options do not conflict.
* The sending client should encode in either JIS, SJIS, or UTF-8. Only one encoding format can be used at a time.

I propose a change in the options: the "SJIS/JIS conversion" checkbox and the UTF-8 encoding drop-down would be removed and replaced with:
1. an Encoding drop-down box where you must select either JIS or UTF-8.
2. a checkbox to Decode incoming JIS into SJIS
(This just reorganises the existing options.)

or

1. an Encoding drop-down box where you must select either JIS, SJIS, or UTF-8.
2. checkboxes for each decoding format: one to Decode JIS, one to Decode SJIS, and one to Decode UTF-8
(...because I'm not sure if SJIS is used by default when you disable "SJIS/JIS conversion". If it is used by default, then it would be nice to have an option that declares or at least lets you know the default format.)

See also: Computers and Japanese: A very short description of how Japanese text is stored in computers (use ctrl+f if you don't see that section)
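The second variant of the proposal can be sketched as a small settings model in Python. This is only an illustration of the idea (exactly one outgoing encoder, independent decode switches); the names `OutgoingEncoding` and `EncodingOptions` are invented here and do not correspond to anything in mIRC.

```python
from dataclasses import dataclass
from enum import Enum

class OutgoingEncoding(Enum):
    """The drop-down: exactly one encoder can be selected at a time."""
    JIS = "jis"
    SJIS = "sjis"
    UTF8 = "utf-8"

@dataclass
class EncodingOptions:
    """Hypothetical options: one outgoing encoding, independent decode checkboxes."""
    outgoing: OutgoingEncoding
    decode_jis: bool = True
    decode_sjis: bool = True
    decode_utf8: bool = True

# A sender that encodes UTF-8 only, while still decoding everything it receives.
opts = EncodingOptions(outgoing=OutgoingEncoding.UTF8)
print(opts)
```

Because `outgoing` holds a single enum value, the conflicting state described in this thread (two encoders active on the same outgoing text) cannot be represented at all.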

Re: 6.17 UTF-8 encoding and SJIS/JIS encoding conflict #143564 02/03/06 06:59 AM
Joined: Feb 2006 Posts: 9 R rwbh OP Nutrimatic drinks dispenser
Or, if all of the decoding functions can be active without conflicts, you may wish to hide their checkboxes since they would all cooperate anyway. (But testing must happen before that.)

I see one possible flaw in the ideas I've presented so far. We know the problem is a conflict with "SJIS/JIS conversion" and UTF-8 encoding on the sending client, but I've assumed the conflict is between the encoding functions only. Since "SJIS/JIS conversion" appears as only one checkbox, I cannot rule out the possibility that the problem may instead be between the sending client's SJIS/JIS decoding and UTF-8 encoding. This seems very unlikely because SJIS/JIS decoding should, by nature, have absolutely no effect on outgoing text, but it is possible.

Side note: I've been typing in Japanese on 6.16 for many months. Japanese appears correctly as long as both clients have "SJIS/JIS conversion" set the same way. In other words, I haven't seen any bugs with SJIS/JIS in 6.16. When I discovered this bug in 6.17, it just happened that Korean was being used. The same bug exists with certain Japanese characters, and the same adjustment (disabling SJIS/JIS on the sending client) fixes it.

Re: 6.17 UTF-8 encoding and SJIS/JIS encoding conflict #143565 02/03/06 06:23 PM
Joined: Mar 2006 Posts: 4 R roxfan Self-satisified door
I've noticed the problem before in 6.16 with Russian text. With the release of 6.17 it started happening with UTF-8 too. Apparently mIRC applies SJIS->JIS conversion to any valid outgoing SJIS sequence, be it in a Japanese channel or not. The proper fix would be to allow a per-channel setting, or at least to apply the conversion only for windows with a Japanese font.

Re: 6.17 UTF-8 encoding and SJIS/JIS encoding conflict #143566 03/03/06 03:12 AM
Joined: Feb 2006 Posts: 9 R rwbh OP Nutrimatic drinks dispenser
Quote: I've noticed the problem before in 6.16 with Russian text.... Apparently mIRC applies SJIS->JIS conversion to any valid outgoing SJIS sequence, be it in a Japanese channel or not. The proper fix would be to allow per-channel setting, or at least apply the conversion only for windows with Japanese font.

After reading your post, I used two 6.16 clients to test every Russian (Cyrillic) character with Arial Unicode MS, SJIS/JIS enabled, Multibyte enabled, and ANSI codes enabled. Each character was sent individually. I then tested with SJIS/JIS enabled on only the sending client, then only on the receiving client, and lastly disabled on both. There were no errors. Your bug does not seem to be related to SJIS/JIS.

Please provide more information. If we cannot recreate the bug, then we cannot prove it exists. If we cannot isolate the problem, then a "proper fix" cannot be suggested. Once you can recreate it, please make a thread for it.

Just a reminder: Channels can contain multiple languages at once. They, and other non-custom windows, are not limited to receiving one character set at a time. Each remote host/user chooses which encoding and language to speak in. You are responsible for decoding it.
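Since each sender picks its own encoding, a receiver has to guess per message. One common approach (a general technique, not a description of what mIRC does) is to try strict UTF-8 first and fall back to a legacy code page when the bytes are not valid UTF-8. The function name `decode_incoming` and the default fallback are assumptions made for this sketch.

```python
def decode_incoming(raw: bytes, fallback: str = "cp1252") -> str:
    """Try strict UTF-8 first; fall back to a legacy code page if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" keeps the message readable even if the guess is wrong
        return raw.decode(fallback, errors="replace")

utf8_line = "\u0434\u0430".encode("utf-8")      # Cyrillic text sent as UTF-8
legacy_line = "\u0434\u0430".encode("cp1251")   # the same text sent as cp-1251

print(decode_incoming(utf8_line))
print(decode_incoming(legacy_line, fallback="cp1251"))
```

This works per message, so a channel with mixed UTF-8 and legacy speakers can still be displayed, provided the fallback code page matches what the legacy speakers use.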

Re: 6.17 UTF-8 encoding and SJIS/JIS encoding conflict #143567 03/03/06 09:11 AM
Joined: Mar 2006 Posts: 4 R roxfan Self-satisified door
1. In the first client, enable JIS/SJIS conversion and set the font to Cyrillic script.
2. In the second client, disable JIS/SJIS conversion.
3. In the first client, type "да" (U+0434, U+0430, or 0xE4 0xE0 in cp-1251).
4. As 0xE4 0xE0 is a valid SJIS code, mIRC applies JIS/SJIS conversion and sends <ESC>$Bhb(B<ESC> which is seen by the second client.

Here's a little screenshot: http://img103.imageshack.us/img103/3487/sjisbug3hg.png

Both clients were using Microsoft Sans Serif, Cyrillic script.
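The "hb" in step 4 can be checked by hand with the usual Shift-JIS to JIS X 0208 byte arithmetic. The Python sketch below applies that arithmetic to the two cp-1251 bytes; the function name is made up for this example, and the framing uses the standard ISO-2022-JP escapes (ESC $ B to enter JIS X 0208, ESC ( B to return to ASCII), which is an assumption about what the client emits.

```python
ESC = b"\x1b"

def sjis_pair_to_jis(lead: int, trail: int) -> bytes:
    """Convert one Shift-JIS double-byte pair to its two JIS X 0208 bytes."""
    lead -= 0x71 if lead <= 0x9F else 0xB1
    lead = lead * 2 + 1
    if trail > 0x7F:
        trail -= 1          # Shift-JIS skips 0x7F in the trail byte range
    if trail >= 0x9E:
        trail -= 0x7D       # second half of the range maps to the next JIS row
        lead += 1
    else:
        trail -= 0x1F
    return bytes([lead, trail])

raw = "\u0434\u0430".encode("cp1251")      # the two Cyrillic letters from step 3
jis = sjis_pair_to_jis(raw[0], raw[1])     # treat them as one SJIS character
wire = ESC + b"$B" + jis + ESC + b"(B"     # assumed ISO-2022-JP framing

print(raw.hex(" "), "->", jis, "->", wire)
```

So the two Cyrillic bytes, read as a single SJIS character, come out as the ASCII letters "h" and "b" wrapped in escape sequences, which matches what the second client displays in the report.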

Re: 6.17 UTF-8 encoding and SJIS/JIS encoding conflict #143568 08/03/06 09:26 PM
Joined: Mar 2006 Posts: 4 R roxfan Self-satisified door
bump
