


UTF8 display regardless of the rest of the line #202332 20/07/08 12:12 AM
Pivo (OP), Babel fish. Joined: Jun 2008. Posts: 58

Hi there,

As one of the few people using UTF-8 encoding for my messages, I'm often told that some of my characters are broken. Many others also use unusual smilies and other UTF-8 character combinations simply to make their messages look better, and those show up as weird codes for the recipients.

They all have UTF-8 display enabled; in most cases the problem is their script themes. Many scripts use high ANSI characters (codes above 127) in their themes, some even in their timestamp, which makes mIRC skip decoding the UTF-8 codes in the line because it assumes that the whole line has not been encoded.

To solve that, I had to edit some people's scripts so that they check whether the text is UTF-8 encoded and, if so, encode the ANSI characters too. This is pretty exhausting, and I know that other IRC clients simply decode everything they can find, regardless of any ANSI characters in the rest of the line.

I understand the reasons for the way mIRC reacts in such cases, but I would like to suggest an option to choose whether to decode every UTF-8 encoded character that appears, or to leave lines containing high ANSI characters untouched. The old behaviour could (and should) remain the default; I would just appreciate a way to easily fix people's display by telling them what they have to enable :p

By the way, I've never seen UTF-8 codes appear randomly without having been encoded on purpose. So in most cases, if there is UTF-8 in the text, it is meant to be UTF-8 and should be decoded (in my opinion).

Greetings,
Alex
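The workaround described in this post (check whether the message is UTF-8 and, if so, encode the theme characters as well) could look roughly like the following. It is a Python sketch, not mIRC script: the names `is_utf8` and `themed_line` and the choice of cp1252 as the ANSI codepage are assumptions made for illustration only.

```python
ANSI = "cp1252"  # assumption: Western ANSI codepage (Windows-1252)


def is_utf8(data: bytes) -> bool:
    """True if data has at least one high byte and is entirely well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return any(b > 127 for b in data)


def themed_line(decoration: str, message: bytes) -> bytes:
    """Prefix the message with a theme decoration in the message's own encoding."""
    encoding = "utf-8" if is_utf8(message) else ANSI
    return decoration.encode(encoding) + b" " + message


# A UTF-8 message gets a UTF-8 decoration, so the whole line stays decodable.
line = themed_line("»", "grün".encode("utf-8"))
print(line.decode("utf-8"))
```

The point of the design is that the decoration never introduces a byte that contradicts the encoding of the message it surrounds.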

Re: UTF8 display regardless of the rest of the line Pivo #202336 20/07/08 05:33 AM
argv0, Hoopy frood. Joined: Oct 2003. Posts: 3,918. Montreal, QC, Canada

This has been reported before. The problem is that "if there is UTF8 in the text" isn't something you can reliably check. What you define as a UTF-8 byte stream can easily be a bunch of ANSI chars to me. No "option" will work, because you end up with the exact same case in reverse: a bunch of Unicode chars in a script would mess with the ANSI chars in a line.

I'm also not quite sure whether your suggestion is to have an option to selectively ignore a specific set of ANSI chars. If so, that's a solution that only seems to help your specific scenario, not something that helps mIRC as a whole.

There was a compromise suggestion to have mIRC treat each space-delimited word as having its own encoding. That way, for instance, a Unicode timestamp could sit next to ANSI text without changing the encoding. Unfortunately this is likely to make the display a lot slower, and we're still waiting to see what Khaled does about it.

- argv[0] on EFnet #mIRC - "Life is a pointer to an integer without a cast"
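To illustrate why the check is unreliable, here is a small Python sketch (the codepage choices cp1252 and cp1251 are assumptions standing in for "Western" and "Cyrillic" ANSI). The same three bytes decode without error under all three interpretations, so the bytes alone do not say which one the sender meant.

```python
raw = "€".encode("utf-8")  # the three bytes E2 82 AC

as_utf8 = raw.decode("utf-8")        # interpreted as UTF-8
as_western = raw.decode("cp1252")    # interpreted as Western ANSI (Windows-1252)
as_cyrillic = raw.decode("cp1251")   # interpreted as Cyrillic ANSI (Windows-1251)

for label, text in [("utf-8", as_utf8), ("cp1252", as_western), ("cp1251", as_cyrillic)]:
    # Each decode succeeds; only the number and identity of characters differ.
    print(label, len(text), repr(text))
```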

Re: UTF8 display regardless of the rest of the line argv0 #202340 20/07/08 12:25 PM
Pivo (OP), Babel fish. Joined: Jun 2008. Posts: 58

Originally Posted By: argv0
"What you define as a UTF8 bitstream can easily be a bunch of ANSI chars to me."

I guess that's the reason why mIRC behaves like that. And as I already mentioned, I've never seen such character combinations appear randomly in a text without being meant as UTF-8. It's not only "my special scenario": many people using high ANSI character themes have the same problem, and that's not limited to my community.

Of course you could treat every word separately, but in my opinion that does not replace the old behaviour. Correct me if I'm wrong, but I think mIRC behaves like this: if UTF-8 display is activated, mIRC checks a line for ANSI chars to find out whether the client the text came from encoded that line. If it finds any high ANSI chars that do not belong to a properly UTF-8 encoded character, it assumes that the whole line is not UTF-8 encoded. In the case of text events, this is only possible if the client does not encode at all.

Checking each word separately could result in the same mistakes as decoding the whole line regardless of high ANSI chars, only less often. I would really appreciate a possibility to simply decode and display all UTF-8 codes that appear in the line, without limitations or prior checks.

Be honest, did you ever type â‚¬ ? There is a reason why UTF-8 codes are not made out of common characters...

I also think that this additional option would not be hard to implement, would not hurt anyone (you don't have to make it the default) and would easily solve these widespread UTF-8 display problems.
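The two behaviours contrasted in this post can be sketched in Python as follows. This is an illustration under stated assumptions, not mIRC's actual implementation: `decode_whole_line` models the all-or-nothing rule as Pivo understands it, `decode_everything_possible` models the requested option, and cp1252 is an assumed fallback codepage.

```python
import codecs

ANSI = "cp1252"  # assumption: Western ANSI codepage used as the fallback


def _ansi_fallback(err):
    """Error handler: render bytes that are not valid UTF-8 via the ANSI codepage."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return bad.decode(ANSI, errors="replace"), err.end


codecs.register_error("ansi_fallback", _ansi_fallback)


def decode_whole_line(raw: bytes) -> str:
    """All-or-nothing: one stray high byte makes the whole line ANSI."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(ANSI, errors="replace")


def decode_everything_possible(raw: bytes) -> str:
    """Decode every well-formed UTF-8 sequence; show remaining high bytes as ANSI."""
    return raw.decode("utf-8", errors="ansi_fallback")


# An ANSI theme character (0xBB) followed by a UTF-8 encoded message.
raw = b"\xbb Pivo: gr\xc3\xbcn"
print(decode_whole_line(raw))
print(decode_everything_possible(raw))
```

The second function is also where the risk lies: ANSI text whose bytes happen to form a valid UTF-8 sequence would be decoded as UTF-8 even if the sender never intended that.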

Re: UTF8 display regardless of the rest of the line Pivo #202364 20/07/08 05:48 PM
argv0, Hoopy frood. Joined: Oct 2003. Posts: 3,918. Montreal, QC, Canada

Quote:
"Be honest, did you ever type â‚¬ ?"

May I answer this question with a question? Did you ever type ЙЮ,ЁОТ? Because that text includes the exact same character combination as what you just posted above, but using the Cyrillic ANSI codepage instead of Western. I don't know Russian or any other language written in Cyrillic, but I'll bet that combination is far more likely for those speakers, and that's just one of many other codepages.

The problem here is that your solution only works for people who speak to you using UTF-8 or in English rather than in ANSI codepages, the latter still being common among Russian, Japanese and Chinese IRC users. Again, what's "uncommon" to you is not so uncommon to others. UTF-8 doesn't specifically choose "uncommon" character combinations; it just uses the ANSI code space to encode the code points. A lot of people actually use the ANSI code space, and not just for fancy themed output but to actually communicate.

The better way to handle this really is to treat words individually. That way, if one word breaks the encoding, it won't affect the rest of the line, and the result will most accurately represent the expected line. This will give you what you want as well. Neither solution will work 100% of the time, but rather than excluding a whole demographic of non-English speakers from this new feature, decoding on a word-by-word basis would fail less often while still working for everybody.

Frankly, I can't even imagine many edge cases where the "word by word" decoding would fail: only if the line was one big word, or if the themed output did not space out the input text (which is generally uncommon in itself).

- argv[0] on EFnet #mIRC - "Life is a pointer to an integer without a cast"
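A rough Python sketch of the word-by-word idea, including the edge case mentioned at the end of the post. The function names and the cp1252 fallback are assumptions for illustration; how mIRC would actually implement this is not stated in the thread.

```python
ANSI = "cp1252"  # assumption: Western ANSI codepage used as the fallback


def decode_word(word: bytes) -> str:
    """Treat a single word as UTF-8 if it is well-formed, otherwise as ANSI."""
    try:
        return word.decode("utf-8")
    except UnicodeDecodeError:
        return word.decode(ANSI, errors="replace")


def decode_by_word(raw: bytes) -> str:
    """Split on spaces and pick the encoding for each word independently."""
    return " ".join(decode_word(word) for word in raw.split(b" "))


# Theme character separated from the text by a space: each word decodes on its own.
print(decode_by_word(b"\xbb Pivo: gr\xc3\xbcn"))

# Theme character glued to the text: the ANSI byte drags the whole word to ANSI.
print(decode_by_word(b"\xbbgr\xc3\xbcn"))
```

The second call shows the failure mode argv0 describes: when the themed output does not space out the input text, the word is handled exactly like a whole line under the all-or-nothing rule.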

Re: UTF8 display regardless of the rest of the line argv0 #202366 20/07/08 06:25 PM
Pivo (OP), Babel fish. Joined: Jun 2008. Posts: 58

Quote:
"Because that text includes the exact same character combination as what you just posted above"

If I convert those characters into Western, I get ÉÞ,¨ÎÒ. I can't see myself having typed that, but maybe I just misunderstand you... (The â‚¬ above is an encoded €.)

Anyway, I already mentioned that I know about the risk of a wrong output. That's why I only suggested making it an option and not the default, since I can't imagine it being that much work to skip the check for ANSI chars.

Quote:
"Frankly, I can't even imagine many edge cases where the "word by word" encoding would fail.. only if the line was one big word, or if the themed output did not space out the input text (which is generally uncommon in itself)."

That's pretty easy to come up with. If you're that convinced that it's easy to produce UTF-8 codes accidentally by typing, those could be decoded and displayed within a word even if the rest of the line only contains high ANSI characters. The problem I see with word-oriented display is the same one you're trying to explain to me right now.

I have already thought about all this, and I still think people should be able to choose the option which provides the best solution for them instead of living with the result of a few people's discussion, especially if the fix would be that easy to implement (in my opinion, as already mentioned above).

(Woah.. I should not start such topics in the night... some of my sentences up there are kinda confused)

Edit: The best option would actually be a system that treats the script's own output differently from data coming from the server. Example:

Code:
    on ^&*:text:*:#: {
      haltdef
      echo -i5tlbfmc normal $chan $+(—,$nick,—) $1-
    }

mIRC should notice that the timestamp and those — in the message are direct output from the script. $chan, $nick and $1- come from the server and should each be treated separately, the way mIRC behaves at the moment.

I know this could also cause a large performance hit, but that can all be made optional. Also, it's probably hard to implement, so I did not mention it before...

Last edited by Pivo; 20/07/08 06:58 PM.

Re: UTF8 display regardless of the rest of the lin argv0 #202368 20/07/08 07:40 PM
tonyp, Self-satisfied door. Joined: Jul 2007. Posts: 4. Germany

Originally Posted By: argv0
"Because that text includes the exact same character combination as what you just posted above, but using the Cryllic ANSI codepage instead of Western. I don't know Russian or any Cryllic languages, but I'll bet that combination is far more likely to them, and that's just one of many other codepages."

I really get your point. Nevertheless, I get more and more questions about why someone can't read German umlauts (mutated vowels). The problem is that ä, ö and ü are encoded as UTF-8 when you use the "display and decode" option of the fonts dialog. If you have a theme with ANSI chars > 127, you'll receive a corrupted message.

What is the problem with implementing an option like the one Pivo suggested? It would help many German and English users who will never chat with people using Cyrillic or similar codepages.

I really would recommend it. It would help so many mIRC users out there (not only the German ones, but at least all of them) to write and read what they want to. You could advise me to help them the way Pivo described (checking $isutf and applying $utfencode to the timestamp/theme), but most of them are overwhelmed by something like that, and simply enabling an option would be the easy way.

Oh, by the way, sorry for my presumably bumpy English; I just tried to phrase what I think in a way everybody can understand.

Thanks

I took a test in Existentialism. I left all the answers blank and got 100.
