Pivo (OP)
Babel fish
Joined: Jun 2008
Posts: 58
Hi there,

As one of the few people who UTF-8-encode their messages, I'm often told that some of my characters show up broken.
Also, many people use unusual smileys and other UTF-8 character combinations simply to make their messages look better, and those show up as weird codes for them.
They all have UTF-8 display enabled; in most cases the problem is their script theme.
Many scripts use ANSI characters above 127 ("high ASCII") in their themes, some even in the timestamp, which stops mIRC from decoding the UTF-8 sequences in the line because it assumes that the whole line has not been encoded.
To solve that, I had to edit some people's scripts so that they check whether the text is UTF-8 encoded and, if so, encode the ANSI characters too.
This is pretty exhausting, and I know that other IRC clients just decode everything they can find, regardless of any ANSI characters in the rest of the line.
I understand the reasons for the way mIRC reacts in such cases, but I would like to suggest an option to choose whether mIRC decodes every UTF-8-encoded character it finds or leaves lines containing high ANSI characters undecoded.
The old behaviour could/should remain the default; I would just appreciate an easy way to fix people's display (by telling them what they have to enable :p).

By the way, I've never seen UTF-8 sequences appear randomly, without the text having been encoded beforehand...
So in most cases, if there is UTF-8 in the text, it is also meant to be UTF-8 and should be decoded (in my opinion).

Greetings,
Alex

argv0
Hoopy frood
Joined: Oct 2003
Posts: 3,918
This has been reported before. The problem is that "if there is UTF8 in the text" isn't something you can reliably check. What you define as a UTF8 bitstream can easily be a bunch of ANSI chars to me. No "option" will work, because you end up with the exact same case in reverse: a bunch of Unicode chars in a script would mess with the ANSI chars in a line. I'm also not quite sure whether your suggestion is an option to selectively ignore a specific set of ANSI chars; if so, that solution only seems to help your specific scenario, not mIRC as a whole.

There was a compromise suggestion to have mIRC treat each (space-delimited) word as having its own encoding. That way, for instance, a Unicode timestamp could sit next to ANSI text without changing the encoding. Unfortunately this is likely to make the display a lot slower, and we're still waiting to see what Khaled does about it.



- argv[0] on EFnet #mIRC
- "Life is a pointer to an integer without a cast"

Pivo (OP)
Babel fish
Joined: Jun 2008
Posts: 58
Originally Posted By: argv0
What you define as a UTF8 bitstream can easily be a bunch of ANSI chars to me.

I guess that's the reason why mIRC behaves like that.
And as I already mentioned, I've never seen such character combinations appear randomly in a text without being meant as UTF-8.
It's not only "my special scenario"; many people using high-ASCII themes have the same problem, and that's not limited to my community.
Of course you could treat every word separately, but in my opinion, that does not replace the old behavior.
(Correct me if I'm wrong, but I think mIRC behaves like this:)
If UTF-8 display is enabled, mIRC checks a line for ANSI characters to find out whether the sending client encoded that line.
If it finds any high ANSI characters in the text that do not belong to a properly UTF-8-encoded character, it assumes that the whole line is not UTF-8 encoded.
For text events, this can only happen if the client does not encode at all.
Checking each word separately could then lead to the same mistakes as decoding the whole line regardless of high ANSI characters, only less often.
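
A minimal Python sketch of the whole-line behaviour guessed at above. It is an illustration only, not mIRC's actual code: the function name `decode_line_whole` and the choice of cp1252 as the local ANSI codepage are assumptions.

```python
def decode_line_whole(raw: bytes, ansi: str = "cp1252") -> str:
    """Decode the line as UTF-8 only if *every* byte fits; otherwise treat it all as ANSI."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # One stray high ANSI byte is enough to drop UTF-8 for the whole line.
        return raw.decode(ansi, errors="replace")


message = "Grüße".encode("utf-8")    # text from a client that encodes as UTF-8
theme = "«12:00»".encode("cp1252")   # themed timestamp using high ANSI characters
line = theme + b" " + message

alone = decode_line_whole(message)   # "Grüße"
themed = decode_line_whole(line)     # "«12:00» GrÃ¼ÃŸe"
```

The message decodes fine on its own, but as soon as the ANSI timestamp is prepended, the whole line falls back to ANSI and the umlauts turn into two-character codes.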

I would really appreciate a way to simply decode and display every UTF-8 sequence that appears in the line, without any limitations or checks beforehand.
Be honest, did you ever type â‚¬ ?
There is a reason why utf8 codes are not made out of common characters...
And I also think that this additional option would not be hard to implement, would not hurt anyone (you don't have to make it the default) and would easily solve these widespread UTF-8 display problems.
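
The requested option could look roughly like the following Python sketch: every valid UTF-8 sequence is decoded, and any byte that does not fit is shown as an ANSI character instead of invalidating the line. The names `ansi_fallback` and `decode_line_greedy` are made up for this illustration, and cp1252 is assumed as the ANSI codepage; nothing here is taken from mIRC itself.

```python
import codecs


def _ansi_fallback(err):
    """Error handler: decode bytes that are not valid UTF-8 as cp1252 instead."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position where decoding resumes.
    return bad.decode("cp1252", errors="replace"), err.end


codecs.register_error("ansi_fallback", _ansi_fallback)


def decode_line_greedy(raw: bytes) -> str:
    """Decode every valid UTF-8 sequence; leave all other bytes as ANSI characters."""
    return raw.decode("utf-8", errors="ansi_fallback")


line = "«12:00» ".encode("cp1252") + "Grüße €".encode("utf-8")
shown = decode_line_greedy(line)     # "«12:00» Grüße €"
```

The trade-off is that genuine ANSI text whose bytes happen to form a valid UTF-8 sequence would be decoded as well.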

argv0
Hoopy frood
Joined: Oct 2003
Posts: 3,918
Quote:
Be honest, did you ever type â‚¬ ?


May I answer this question with a question? Did you ever type ЙЮ,ЁОТ?

Because that text includes the exact same character combination as what you just posted above, but using the Cyrillic ANSI codepage instead of Western. I don't know Russian or any Cyrillic languages, but I'll bet that combination is far more likely to them, and that's just one of many other codepages.

The problem here is that your solution only works for people who talk to you in UTF-8 or in English rather than in ANSI codepages, the latter still being common among Russian/Japanese/Chinese IRCers. Again, what's "uncommon" to you is not so uncommon to others. UTF-8 doesn't specifically choose "uncommon" character combinations; it just uses the ANSI code space to encode the code points. A lot of people actually *use* the ANSI code space, and not just for fancy themed output, but to actually communicate.
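
The codepage point can be checked with a few lines of Python. This is only an illustration using Python's built-in cp1251 (Cyrillic) and cp1252 (Western) codecs; the Ukrainian word in the last part is an example added here and does not come from the thread.

```python
# The UTF-8 bytes of the euro sign, read through two different ANSI codepages.
euro = "€".encode("utf-8")
euro_hex = euro.hex(" ")                  # "e2 82 ac"
as_western = euro.decode("cp1252")        # "â‚¬"
as_cyrillic = euro.decode("cp1251")       # "в‚¬"

# The Cyrillic text from the post, stored as cp1251 bytes and read as Western.
cyr = "ЙЮ,ЁОТ".encode("cp1251")
cyr_as_western = cyr.decode("cp1252")     # "ÉÞ,¨ÎÒ"

# Assumed example: the Ukrainian word for "no", typed in cp1251, happens to
# form a valid two-byte UTF-8 sequence, so a UTF-8 decoder accepts it.
word = "Ні".encode("cp1251")
word_hex = word.hex(" ")                  # "cd b3"
misread = word.decode("utf-8")            # one unrelated character, U+0373
```

The same bytes are ordinary text in one codepage and a UTF-8 sequence in another, and the bytes alone do not say which reading was intended.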

The better way to handle this really is to treat words individually; that way, if one word breaks the encoding, it won't affect the rest of the line, and the result will most accurately represent the expected line. This will give you what you want as well. Neither solution will work 100% of the time, but rather than excluding a whole demographic of non-English speakers from this new feature, decoding on a word-by-word basis would fail less often while still working for everybody. Frankly, I can't even imagine many edge cases where the "word by word" encoding would fail.. only if the line was one big word, or if the themed output did not space out the input text (which is generally uncommon in itself).
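
A rough Python sketch of the word-by-word idea, assuming words are split on single spaces and cp1252 is the ANSI fallback. The function names are hypothetical and this is not how mIRC is known to implement anything.

```python
def decode_word(word: bytes, ansi: str = "cp1252") -> str:
    """Decode one word as UTF-8 if it is valid UTF-8, otherwise as ANSI."""
    try:
        return word.decode("utf-8")
    except UnicodeDecodeError:
        return word.decode(ansi, errors="replace")


def decode_line_by_word(raw: bytes, ansi: str = "cp1252") -> str:
    """Pick the encoding separately for each space-delimited word."""
    return " ".join(decode_word(w, ansi) for w in raw.split(b" "))


# An ANSI timestamp next to UTF-8 text: each word keeps its own encoding.
spaced = "«12:00»".encode("cp1252") + b" " + "Grüße".encode("utf-8")
ok = decode_line_by_word(spaced)          # "«12:00» Grüße"

# The edge case from the post: the theme is not spaced out from the text.
glued = "«".encode("cp1252") + "Grüße".encode("utf-8") + "»".encode("cp1252")
broken = decode_line_by_word(glued)       # "«GrÃ¼ÃŸe»"
```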


Pivo (OP)
Babel fish
Joined: Jun 2008
Posts: 58
Quote:
Because that text includes the exact same character combination as what you just posted above

If I convert those characters to the Western codepage, I get ÉÞ,¨ÎÒ. I can't see myself having typed that, but maybe I just misunderstand you...
(The â‚¬ above is an encoded €)

Anyway, as I already mentioned, I know about the risk of wrong output; that's why I only suggested making it an option and not the default, since I can't imagine it being that much work to skip the check for ANSI chars.

Quote:
Frankly, I can't even imagine many edge cases where the "word by word" encoding would fail.. only if the line was one big word, or if the themed output did not space out the input text (which is generally uncommon in itself).

That's pretty easy to get.
If you're that convinced that it's easy to produce UTF-8 sequences accidentally by typing, then those could be decoded and displayed within a word, even if the rest of the line only contains high ANSI characters.
The problem I see with word-oriented display is the same one you're trying to explain to me right now ;)

I have already thought about all this, and I still think:
People should be able to decide and choose the option which provides the best solution for them instead of living with the result of few people's discussion...
...especially if the fix would be that easy to realize (in my opinion - already mentioned above...)

(Woah.. I should not start such topics in the night... some of my sentences up there are kinda confused)

Edit: The best option would actually be a system that treats the script's own output differently from text coming from the server.
Example:
Code:
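; Replace mIRC's default channel-message display with a themed line
on ^&*:text:*:#: {
  ; stop mIRC from printing its own default line
  haltdef
  ; print the themed line instead
  echo -i5tlbfmc normal $chan $+(—,$nick,—) $1-
}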

mIRC should notice that the timestamp and those — in the message are output directly by the script.
$chan, $nick and $1- come from the server and should be treated the way mIRC behaves at the moment (each one separately).
I know this could also cause a big performance hit, but it can all be made optional.
Also, that's probably hard to implement, so I did not mention it before...
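
As a sketch of that idea in Python (purely illustrative: the `render` function, the origin flag and the cp1252 assumption are all invented here), each piece of the output line carries a flag saying whether it came from the server, and only server text goes through UTF-8 detection:

```python
from typing import List, Tuple


def render(segments: List[Tuple[bytes, bool]], ansi: str = "cp1252") -> str:
    """Join segments, choosing the encoding per segment based on its origin.

    Each segment is (raw_bytes, from_server). Script literals are always
    read as ANSI; server text is read as UTF-8 when it is valid UTF-8.
    """
    out = []
    for raw, from_server in segments:
        if from_server:
            try:
                out.append(raw.decode("utf-8"))
                continue
            except UnicodeDecodeError:
                pass  # not valid UTF-8, fall through to ANSI
        out.append(raw.decode(ansi, errors="replace"))
    return "".join(out)


dash = b"\x97"  # the dash used by the theme above, as a cp1252 byte
segments = [
    (dash, False),                       # script literal
    (b"Pivo", True),                     # $nick, from the server
    (dash + b" ", False),                # script literal
    ("Grüße".encode("utf-8"), True),     # $1-, from the server
]
line = render(segments)                  # dash, "Pivo", dash, space, "Grüße"
```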

Last edited by Pivo; 20/07/08 06:58 PM.

Self-satisified door
Joined: Jul 2007
Posts: 4
Originally Posted By: argv0
Because that text includes the exact same character combination as what you just posted above, but using the Cyrillic ANSI codepage instead of Western. I don't know Russian or any Cyrillic languages, but I'll bet that combination is far more likely to them, and that's just one of many other codepages.



I really get your point; nevertheless, I get more and more questions about why someone can't read German umlauts (mutated vowels).
The problem is that ä, ö and ü are encoded as UTF-8 when you use the "display and decode" option of the fonts dialog.
If you have a theme with ANSI chars > 127, you'll receive a corrupted message.

What is the problem with implementing an option like the one Pivo suggested? It would help many German and English users who will never chat with people using a Cyrillic or similar codepage. I would really recommend it. It would help so many mIRC users out there (not only, but at least, all the German ones) to write and read what they want.
You could advise me to help them the way Pivo described (checking $isutf and applying $utfencode to the timestamp/theme), but most of them are overwhelmed by something like that, and simply enabling an option would be the easy way.
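
For readers unfamiliar with that workaround, here is a rough Python analogue of its logic. It is not mIRC script and does not reproduce how $isutf or $utfencode work internally; `is_utf8` and `build_line` are hypothetical helpers, and cp1252 is assumed as the theme's codepage.

```python
def is_utf8(raw: bytes) -> bool:
    """True if raw contains non-ASCII bytes and is valid UTF-8 as a whole."""
    if raw.isascii():
        return False
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def build_line(theme_prefix: bytes, text: bytes, ansi: str = "cp1252") -> bytes:
    """Re-encode the ANSI theme as UTF-8 when the message itself is UTF-8."""
    if is_utf8(text):
        theme_prefix = theme_prefix.decode(ansi, errors="replace").encode("utf-8")
    return theme_prefix + b" " + text


theme = "«12:00»".encode("cp1252")

# UTF-8 message: the theme is converted too, so the whole line is valid UTF-8.
utf8_line = build_line(theme, "Grüße".encode("utf-8")).decode("utf-8")

# ANSI message: nothing is converted and the line stays in the ANSI codepage.
ansi_line = build_line(theme, "Grüße".encode("cp1252")).decode("cp1252")
```

Both lines come out as "«12:00» Grüße", which is the effect the script edit is after.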

Oh, by the way, sorry for my presumably bumpy English; I just tried to phrase what I think in a way everybody can understand ;)
thanks


I took a test in Existentialism. I left all the answers blank and got 100.
