#235664 03/01/12 10:07 PM
Joined: Jan 2012
Posts: 19
K
krypto Offline OP
Pikka bird
Hi,
I noticed that the UTF-8 encoding looks a little iffy on mIRC. Everyone using mIRC has the same issue as far as I know and can see, so it shouldn't be on my end. Also, some people are saying that they can't see what I'm typing, so I started looking into it.

mIRC encodes non-ASCII characters as double-encoded UTF-8 when it should use plain UTF-8. It has no trouble decoding either regular UTF-8 or the double-encoded form.

Example
mIRC encodes the character ä (U+00E4, LATIN SMALL LETTER A WITH DIAERESIS) as C3 83 C2 A4
In plain UTF-8, it should be C3 A4
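The two byte sequences above can be reproduced with a short Python sketch. It assumes the second pass treats every already-encoded byte as a character of the same value (Latin-1 style) and encodes it again; that is a guess at the mechanism, not confirmed mIRC behaviour.

```python
# Sketch: reproduce the two byte sequences quoted above.
# Assumption: the second pass maps each byte to the code point with the
# same value (Latin-1 style) before encoding to UTF-8 again.

def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

char = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS

plain = char.encode("utf-8")                      # single, correct encoding
double = plain.decode("latin-1").encode("utf-8")  # encoded a second time

print(hex_bytes(plain))   # C3 A4
print(hex_bytes(double))  # C3 83 C2 A4
```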

krypto #235665 03/01/12 10:46 PM
Joined: Oct 2003
Posts: 3,918
A
Hoopy frood
I can't reproduce this. Are you sure you're not using a script that happens to $utfencode or $utfdecode incoming / outgoing text?


- argv[0] on EFnet #mIRC
- "Life is a pointer to an integer without a cast"
argv0 #235670 04/01/12 05:02 PM
Joined: Jan 2012
Posts: 19
K
krypto Offline OP
Pikka bird
Ok, different bug. I wiresharked the output and it's plain UTF-8. Problem seems to be that mIRC does the double encoding on debug output for some strange reason.

krypto #235671 04/01/12 05:31 PM
Joined: Dec 2002
Posts: 344
D
Pan-dimensional mouse
Originally Posted By: krypto
Ok, different bug. I wiresharked the output and it's plain UTF-8. Problem seems to be that mIRC does the double encoding on debug output for some strange reason.


Yeah, I hadn't really thought about this being an issue before because I always output /debug to a window, but what's happening is /debug is designed to show you the text after it has been UTF-8 encoded. But since mIRC's file routines use UTF-8 now, it ends up encoding it a second time if it's writing to a file. This seems like a bad idea though. /debug should probably handle output to windows and to files differently in this case.

krypto #235672 04/01/12 09:41 PM
Joined: Oct 2003
Posts: 3,918
A
Hoopy frood
This is intentional. /debug shows the raw bytes, but as drum pointed out, mIRC no longer has routines to display ANSI, so any byte above 127 is encoded as UTF-8 (to be pedantic, everything is encoded as UTF-8, but it only matters for displaying the non-ASCII range properly).

Note that the debug @window isn't meant to be copy-pasted around as verbatim data (it's merely debugging output for... debugging), but if you really need to, you can $utfdecode the text you copy, or use /debug -i.
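For a log that has already been saved, the extra layer can also be removed outside mIRC. The following Python sketch shows the idea; it does not claim to mirror how $utfdecode is implemented, and the function name `undo_double_encoding` is made up for illustration.

```python
# Sketch: strip one extra layer of UTF-8 encoding from logged bytes.
# Assumption: the extra layer was produced by treating each byte as a
# Latin-1 character and encoding it to UTF-8 again.

def undo_double_encoding(data: bytes) -> bytes:
    """Return data with one UTF-8 layer removed, or unchanged if that fails."""
    try:
        # Decode the outer layer, then map each character back to one byte.
        return data.decode("utf-8").encode("latin-1")
    except (UnicodeDecodeError, UnicodeEncodeError):
        # Not valid UTF-8, or it contains characters above U+00FF:
        # this was not double encoded, so leave it alone.
        return data

logged = bytes.fromhex("C3 83 C2 A4")  # what ended up in the log
wire = undo_double_encoding(logged)    # what was sent over the wire
print(wire.hex(" ").upper())           # C3 A4
```

One caveat: text that was encoded only once and uses nothing above U+00FF would also be "undone" by this function, so it should only be applied to data known to carry the extra layer.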


- argv[0] on EFnet #mIRC
- "Life is a pointer to an integer without a cast"
drum #235673 04/01/12 09:43 PM
Joined: Oct 2003
Posts: 3,918
A
Hoopy frood
There's no other way for mIRC to display text in 7.x besides Unicode. If the byte value is above 127, the only way mIRC can show it in a window is by encoding it as UTF-8, as it does with all data. So it doesn't seem like a bad idea to me.


- argv[0] on EFnet #mIRC
- "Life is a pointer to an integer without a cast"
argv0 #235675 05/01/12 12:05 AM
Joined: Dec 2002
Posts: 344
D
Pan-dimensional mouse
Originally Posted By: argv0
There's no other way for mIRC to display text in 7.x besides Unicode. If the byte value is above 127, the only way mIRC can show it in a window is by encoding it as UTF-8, as it does with all data. So it doesn't seem like a bad idea to me.


I think you misunderstood me. I don't have a problem with how mIRC handles debug output to a window. I have a problem with how mIRC handles debug output to a file. If you are outputting to a file, it makes no sense to double-encode the text.

It seems far more useful to me if mIRC would output byte-for-byte exactly what is transmitted into the file. There shouldn't even be any UTF encoding or decoding involved if nothing is being displayed to a window.

Last edited by drum; 05/01/12 12:06 AM.
drum #235677 05/01/12 05:30 AM
Joined: Oct 2003
Posts: 3,918
A
Hoopy frood
Originally Posted By: drum
I have a problem with how mIRC handles debug output to a file.


Ah, I didn't realize this affected files as well. That said, I'm not entirely sure I agree; I think that one could go either way. On one hand, a user might expect a file dump to be verbatim raw data. On the other hand, as I mentioned regarding windows, the debugging output is mostly meant to visualize the data, not necessarily to provide it for copy-paste purposes. Users might get confused either way. Especially if they /loadbuf'd a debug.log into a window and it was raw byte data, because that would not display accurately inside mIRC. A discrepancy between "/debug file.log | loadbuf @debug file.log" and "/debug @debug" might be confusing. Similarly, users might get confused by reading a file.txt and seeing different data from a @debug window, and, depending on their editor, it might be parsed as UTF-8 anyway, which would defeat the purpose of visualizing the data.

I see the /debug feature as more of a pre-scrubbed hex editor style view of the data from the server. It would be sort of like seeing the following view inside of a hex editor program:

Code:
| A1 BC DD FF | . . . . |
...


It's a debug view, but it's not raw data. I also think it makes sense for mIRC to be consistent about the way it outputs debugging info rather than having special rules for different scenarios. Perhaps, however, a switch to display the byte data should be added so that users can decide on this one.


- argv[0] on EFnet #mIRC
- "Life is a pointer to an integer without a cast"
drum #235680 05/01/12 11:20 AM
Joined: Jan 2012
Posts: 19
K
krypto Offline OP
Pikka bird
Originally Posted By: drum
It seems far more useful to me if mIRC would output byte-for-byte exactly what is transmitted into the file. There shouldn't even be any UTF encoding or decoding involved if nothing is being displayed to a window.
This. The only way to do that right now is to use something like Wireshark.

krypto #235681 05/01/12 12:13 PM
Joined: Oct 2003
Posts: 3,918
A
Hoopy frood
Or /debug -i, as I had previously mentioned.


- argv[0] on EFnet #mIRC
- "Life is a pointer to an integer without a cast"
argv0 #235694 05/01/12 07:24 PM
Joined: Dec 2002
Posts: 344
D
Pan-dimensional mouse
Originally Posted By: argv0
Ah, I didn't realize this affected files as well. That said, I'm not entirely sure I agree; I think that one could go either way. On one hand, a user might expect a file dump to be verbatim raw data. On the other hand, as I mentioned regarding windows, the debugging output is mostly meant to visualize the data, not necessarily to provide it for copy-paste purposes.


I think users expect it to do what it says it does in the help file: "Outputs raw server messages, both incoming and outgoing, to a debug.log file, or a custom @window." (emphasis mine)

Quote:
Users might get confused either way. Especially if they /loadbuf'd a debug.log into a window and it was raw byte data, because that would not display accurately inside mIRC. A discrepancy between "/debug file.log | loadbuf @debug file.log" and "/debug @debug" might be confusing.


This is an argument for a /loadbuf (and /savebuf) switch that reads/saves the file as ASCII instead of as UTF-8. Of course, mIRC can't actually display ASCII anymore, but the "workaround" of UTF8-encoding the text first, then outputting that to a window, produces the same result visually -- which is fine by me seeing as this is already what it does with /debug.

Quote:
Similarly, users might get confused by reading a file.txt and seeing different data from a @debug window, and, depending on their editor, it might be parsed as UTF-8 anyway, which would defeat the purpose of visualizing the data.


You seem to be stuck on assuming that the purpose of the command is to "visualize" the data, but that's just one purpose. There are other valid uses for /debug and it doesn't make sense to limit its usefulness. mIRC shouldn't be making any assumptions about what you will do with the debug.log, or even that you will open it in a text editor. Even if you do open it in a text editor, many do NOT have the same limitation as mIRC and can open the file as ASCII or UTF-8 by simply selecting a different option. Even Windows Notepad can do this!

drum #235698 05/01/12 09:44 PM
Joined: Oct 2003
Posts: 3,918
A
Hoopy frood
Originally Posted By: drum
I think users expect it to do what it says it does in the help file: "Outputs raw server messages, both incoming and outgoing, to a debug.log file, or a custom @window." (emphasis mine)


You emphasized the wrong thing. "raw server messages" does not imply how the data will be displayed (in fact, messages rather than data implies that it's not meant as a packet sniffer). Outputs is the keyword here. And there are a lot of different ways to output data. One of them is to display each byte separately in a hex editor fashion as I described above. You can say that a hex editor "outputs raw data in an exe file", but you won't expect to see a copy pasteable version of the bytes in the file.

Originally Posted By: drum
but the "workaround" of UTF8-encoding the text first, then outputting that to a window, produces the same result visually -- which is fine by me seeing as this is already what it does with /debug.


Except this is not true. Your assumption is that the raw data being dumped to a file is UTF-8, but this is not always the case. Many times, you will get ANSI encodings in the file, which will garble when /loadbuf'd back to mIRC, so it won't produce a correct result visually. Part of the reason /debug does not encode the data before displaying it in @debug is to allow users to visualize the raw results of different encodings:

Originally Posted By: versions.txt
44. /debug windows now show raw text without interpreting multibyte or utf text.


The reason for this change was specifically because users were unable to see proper data in the @debug window for non-UTF-8 channels, since the data in @debug will be garbled by the transcoding. This is important in 7.x, since mIRC has no other way to see non-UTF-8 text. The goal is to not have the same garbled output in a channel, and allow users to see what was meant to be sent. The keyword, again, is "see", not use.
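The garbling described above can be illustrated with a small Python sketch. It uses Windows-1252 as a stand-in for "ANSI" and a made-up sample line; neither is taken from mIRC itself.

```python
# Sketch: why non-UTF-8 ("ANSI") traffic garbles under UTF-8 decoding.
# Assumptions: Windows-1252 stands in for "ANSI"; the sample line is made up.

raw = b"\xe4 ok"  # "ä ok" as a Windows-1252 client would send it

# Read as UTF-8, 0xE4 starts a multi-byte sequence that never completes,
# so it is turned into U+FFFD (the replacement character).
as_utf8 = raw.decode("utf-8", errors="replace")

# Mapping each byte to the code point of the same value loses nothing:
# the view has exactly one character per byte received.
byte_view = raw.decode("latin-1")

print(ascii(as_utf8))    # '\ufffd ok'
print(ascii(byte_view))  # '\xe4 ok'
```

The byte-per-character view shows the intended ä here only because Windows-1252 and Latin-1 agree on 0xE4. For UTF-8 traffic the same view shows the familiar two-character form, for example Ã¤ for the bytes C3 A4.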

Originally Posted By: drum
You seem to be stuck on assuming that the purpose of the command is to "visualize" the data, but that's just one purpose. There are other valid uses for /debug and it doesn't make sense to limit its usefulness.


No, I'm assuming that when piping debugging output to a window or file, your goal is to visualize the data. Again, you can use /debug -i if you have a custom purpose, so the usefulness is NOT limited in all cases, you just have to use /debug -i instead. I don't see a problem with doing that, especially given that the "other valid uses" I can think of are edge cases.

If you can name a specific use case for /debug file that requires the explicit raw data to be shown, I'm all ears-- but I can't think of anything "useful" besides staring at data. Note that even a tool like Wireshark doesn't display "raw data", it's shown in the same uncopyable "hex editor" form I mentioned above.


- argv[0] on EFnet #mIRC
- "Life is a pointer to an integer without a cast"
argv0 #235699 05/01/12 11:15 PM
Joined: Dec 2002
Posts: 344
D
Pan-dimensional mouse
Originally Posted By: argv0
You emphasized the wrong thing. "raw server messages" does not imply how the data will be displayed (in fact, messages rather than data implies that it's not meant as a packet sniffer). Outputs is the keyword here. And there are a lot of different ways to output data. One of them is to display each byte separately in a hex editor fashion as I described above. You can say that a hex editor "outputs raw data in an exe file", but you won't expect to see a copy pasteable version of the bytes in the file.


How mIRC chooses to display debug output is up to it, and I don't really have any issue with how it outputs to windows right now. However, it makes no sense to UTF8-encode the data when saving it to a file. If mIRC is doing the displaying, then it can make whatever decisions it wants to regarding how it is displayed, but if it is leaving the "displaying" up to an external application, then it shouldn't be making those decisions -- leave it up to the external application to decide whether to open the file as ANSI, UTF-8, etc.

To put it another way, can you give a good logical reason why mIRC should output debug.log differently in 6.35 than in 7.0? The client is sending and receiving identical byte-for-byte data in either version. Why should mIRC be saving that data to a debug.log file differently in each version?

Ironically enough, you keep talking about how mIRC's debug command works similar to a hex editor -- and yet, because of how mIRC stores debug.log files, they can't be opened directly in an actual hex editor (without first decoding it yourself).

Quote:
Except this is not true. Your assumption is that the raw data being dumped to a file is UTF-8, but this is not always the case. Many times, you will get ANSI encodings in the file, which will garble when /loadbuf'd back to mIRC, so it won't produce a correct result visually. Part of the reason /debug does not encode the data before displaying it in @debug is to allow users to visualize the raw results of different encodings:


You completely misunderstood me. Consider for a moment what happens when mIRC receives a message and then outputs it to a @debug window. It receives the data as raw bytes. It then takes those bytes and UTF *ENCODES* them (I did not say decode) before inserting it into the window buffer. Since the window automatically takes the content of the buffer and decodes it using UTF-8, the result is that you see "undecoded" ANSI text. Since the window would automatically decode anything in its buffer, mIRC actually encodes it before inserting it into the buffer -- cancelling each other out. That is what I'm referring to as the "workaround".

What I was saying is that mIRC should have a switch for /loadbuf that will ENCODE the text to UTF-8 before inserting it into the buffer, so that after the window display routine automatically DECODES the text, the printed result is essentially the same as displaying the text as ANSI. mIRC doesn't need to actually explain it that verbosely, and could just say "read file as ANSI instead of UTF-8" -- but I'm just describing what would actually be happening behind the scenes.

Quote:
If you can name a specific use case for /debug file that requires the explicit raw data to be shown, I'm all ears-- but I can't think of anything "useful" besides staring at data.


Like I said before, opening the file in a hex editor, which gives more information than outputting to a debug window (e.g., seeing if lines terminate with both CR and LF, seeing if nonprintable characters are present, etc.)
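As a rough illustration of that kind of inspection, here is a Python sketch that renders bytes in a hex-editor style and counts line endings. The helper names and the sample data are made up; with a real log you would read the file in binary mode instead.

```python
# Sketch: hex-editor style view plus a line-ending count for raw log bytes.
# Assumption: `sample` is invented data standing in for a byte-exact log.

def hex_view(data: bytes, width: int = 16) -> list[str]:
    """Return rows of 'offset | hex bytes | printable text |'."""
    pad = width * 3 - 1  # width of a full row of hex pairs with spaces
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append(f"{offset:04X} | {hex_part:<{pad}} | {text_part} |")
    return rows

def line_endings(data: bytes) -> dict[str, int]:
    """Count CRLF pairs and any CR or LF bytes that are not part of a pair."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "bare LF": data.count(b"\n") - crlf,
        "bare CR": data.count(b"\r") - crlf,
    }

sample = b"PING :irc.example.net\r\nPRIVMSG #chan :\x02bold\x02\n"

for row in hex_view(sample):
    print(row)
print(line_endings(sample))  # {'CRLF': 1, 'bare LF': 1, 'bare CR': 0}
```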

drum #235700 06/01/12 01:04 AM
Joined: Oct 2003
Posts: 3,918
A
Hoopy frood
Originally Posted By: drum
To put it another way, can you give a good logical reason why mIRC should output debug.log differently in 6.35 than in 7.0? The client is sending and receiving identical byte-for-byte data in either version. Why should mIRC be saving that data to a debug.log file differently in each version?


Yes. The logical reason was discussed in my first response. It provides different output in a file for the same reason it provides it in the window; because the output is meant to visualize the data, not to accurately log the exact bytes sent over the wire. Again, this may not be your interpretation, but it's perfectly logical and reasonable to other users. I find it perfectly reasonable. And again, allowing for an extra switch in /debug to produce byte for byte raw output would be an option to make both scenarios work.

Originally Posted By: drum
Ironically enough, you keep talking about how mIRC's debug command works similar to a hex editor -- and yet, because of how mIRC stores debug.log files, they can't be opened directly in an actual hex editor (without first decoding it yourself).


This is neither ironic nor odd. If you copy pasted the output of a hex editor into another hex editor you would have the same problem. The point of mIRC's debug.log acting the way it does is to avoid the extra dependency of forcing users to use a hex editor to view what was sent over the IRC server. Instead it can be easily visualized (the point of a debug log) for users to view.

Originally Posted By: drum
Like I said before, opening the file in a hex editor, which gives more information than outputting to a debug window (e.g., seeing if lines terminate with both CR and LF, seeing if nonprintable characters are present, etc.)


Double-encoding UTF-8 doesn't stop you from doing the above. You could still see CRLFs and nonprintable chars in a hex editor; you'll just see oddities for UTF-8 encoded text. But since you're only using the hex editor to *view* the data, there isn't much harm here.


- argv[0] on EFnet #mIRC
- "Life is a pointer to an integer without a cast"
argv0 #235701 06/01/12 02:28 AM
Joined: Dec 2002
Posts: 344
D
Pan-dimensional mouse
Originally Posted By: argv0
because the output is meant to visualize the data, not to accurately log the exact bytes sent over the wire.


Ultimately, this is the thing we completely disagree on.

Quote:
The point of mIRC's debug.log acting the way it does is to avoid the extra dependency of forcing users to use a hex editor to view what was sent over the IRC server.


You wouldn't need to use a hex editor. All you need is Notepad, which can open any text file as ANSI via the drop-down box in the File->Open dialog. In any case, I don't believe Khaled made it behave this way consciously -- I think it's actually just a side effect of switching his file writing routines to encode with UTF-8 by default.

Quote:
You could still see CRLFs and non printable chars in a hex editor, [...]


That's a fair point, but there are still other complications introduced such as characters having variable byte widths.
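To make the byte-width point concrete, this Python sketch compares how many bytes a character occupies on the wire and after a second encoding pass. It assumes the second pass is a byte-to-character mapping followed by UTF-8 encoding, which is a guess at the mechanism.

```python
# Sketch: how a second UTF-8 pass changes the width of each character.
# Assumption: the second pass maps each byte to the code point with the
# same value and encodes that to UTF-8 again.

def double_encode(text: str) -> bytes:
    """Encode to UTF-8, then encode the resulting bytes a second time."""
    return text.encode("utf-8").decode("latin-1").encode("utf-8")

widths = {}
for ch in ("a", "\u00e4", "\u20ac"):  # a, a-diaeresis, euro sign
    widths[ch] = (len(ch.encode("utf-8")), len(double_encode(ch)))
    print(f"U+{ord(ch):04X}: {widths[ch][0]} -> {widths[ch][1]} bytes")

# U+0061: 1 -> 1 bytes
# U+00E4: 2 -> 4 bytes
# U+20AC: 3 -> 6 bytes
```

ASCII bytes stay one byte wide while every byte above 127 becomes two, so offsets and lengths in such a log no longer match what was transmitted.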

Last edited by drum; 06/01/12 02:31 AM.
