Register Log In

Forums General Discussion Unicode Support

Print Thread

Unicode Support #224231 07/08/10 12:44 AM
Joined: Feb 2006 Posts: 523 J jaytea OP Fjord artisan
OP jaytea Fjord artisan J Joined: Feb 2006 Posts: 523	Preface This article aims to discuss Unicode as it pertains to mIRC. Background information is included for the sake of completeness; for a more thorough analysis of the Unicode standard the reader should supplement with other more complete sources of information. Unicode implementation in mIRC has been a stepwise process that initially began back in early 2006 with the release of mIRC 6.17. While this progression has been imperceptible to some, this is likely to change with the release of the most recent mIRC version wherein Unicode is ubiquitous and compatibility with other more limited encoding systems marginalised. The ramifications of this movement, though indubitably in the right direction, may be dire for some. It is vital that everyone affected should understand the changes taking place, what they can do to minimise any ill effects and facilitate a smooth transition. The integrity of the examples and snippets provided is not guaranteed when used on versions of mIRC older than 7.1 unless explicitly mentioned. [N] indicates a footnote which provides added information for those interested. Footnotes are located towards the end of the article. What is Unicode? Unicode is a standard for encoding characters that attempts to unify many of the older character encodings encompassing a wide variety of languages and their alphabets. With more than a hundred thousand graphical characters to select from, Unicode provides the ideal standard for IRC, a medium shared simultaneously by users from all over the world. No longer will users viewing text under one particular code page have to change their application or even operating system configuration to be able to sensibly view text in another. Unicode standardises character representation in the same way a single universally spoken language would global communication. For a detailed description of the languages supported by the Unicode standard, refer to: http://www.unicode.org/charts/ The standard introduces an abstract character set which houses the building blocks of character formation. An abstract character is a fundamental unit used either singly or in co-operation with other abstract characters to represent the visual characters eventually displayed on our monitors. The capital letter C, known to our machinery as LATIN CAPITAL LETTER C, is an example of such an abstraction. This name is part of a universal standard and is understood to represent the same abstract character regardless of the platform on which it is interpreted. The standard further defines a code space comprising 1,114,112 unique code points. A code point is a non-negative integer which serves as a potential candidate for abstract character assignment[1]. This is not an unfamiliar concept: consider the ASCII encoding scheme which uses code points 0 to 127. The code point 64 is assigned the '@' character and the capital letter 'A' is assigned to code point 65. Resolving a character given its code point is typically performed in mIRC using the $chr() identifier and the reverse operation using $asc(). Unlike ASCII, wherein the full set of code points have each been assigned corresponding characters, not all of Unicode's code space is currently assigned. In fact, as of Unicode 5.2.0, only 22% of its code space is populated by characters (or so-called non-characters) of some description. Needless to say, the remaining code points will come into play if and when new characters enter the standard. Readers acquainted with the constituents of the recent mIRC version might question $chr()'s tendency therein to handle only the first 65536 code points (referred to henceforth as the Basic Multilingual Plane or simply BMP), with its results wrapping around at $chr(65536) as they used to do previously at $chr(256). This is related to how mIRC represents code points internally[2] and extends to all areas of mIRC scripting that recognise text as a series of characters; it does not mean that mIRC itself only supports code points within the BMP. This is consistent behaviour and the reader should keep the following important result in mind: text is handled, in their scripts, using only those Unicode characters in the Basic Multilingual Plane. In order to differentiate these elements from the wider and more abstract notion of code points, the term code unit will be used. It is appropriate at this moment to delve into a subject of relative esotericism, the concept of surrogates and surrogate pairs. Surrogates are special members of the code space, occupying 2048 code points from U+D800[3] to U+DFFF (55296 to 57343). While this space is reserved for these surrogates, none of its code points is assigned a member of the abstract character set. This would appear contrary to what one might observe through simple experimentation: Quote: //echo -eag $len($chr(55296)) mIRC appears to treat individual surrogates as tangible characters but this is merely a matter of convenience and does not exemplify the true nature of the standard. Therefore it is not correct to refer to them singly as characters, the term surrogate character is a misnomer; surrogate code point or simply surrogate should be preferred. Two types of surrogates exist: high surrogates which occupy 1024 code points (U+D800 to U+DBFF), and low surrogates which occupy the other 1024 code points (U+DC00 to U+DFFF). The utility of surrogates is observed when they are combined in pairs, a high surrogate followed directly by a low surrogate, to form code points beyond the BMP (of which there exist 1,048,576 examples). Here is a line of code that demonstrates this concept: Quote: //var -s %a = $chr(55384), %b = $chr(56320) \| echo -a %a -- %b -- %a $+ %b -- %b $+ %a Quote: %a and %b are set to contain a high and low surrogate respectively. Note that the concatenation %b $+ %a yields nothing useful whereas the correct %a $+ %b form produces the character with code point 155648. The reader should be reminded that in order to manipulate this character (and its contemporaries outside of the BMP) they are compelled to do so in terms of the pair of code units that compose it. A custom alias for resolving code points from surrogate pairs and its inverse operation will be provided later in the article. Incidentally, the term code point may now refer to a pair of code units when describing mIRC's internal representation of this new range of characters. Hitherto the discussion has been limited to that of Unicode characters as they exist locally ie. within the confines of the client. Passing text from one identifier to another, for example, is done with relative ease; representation of Unicode characters as a sequence of code units remains consistent. However, once mIRC is required to consort with external systems such as an IRC server or the hard disk, it can no longer keep text in this format. As most of these systems are octet based, mIRC is obliged to encode characters into a series of bytes using a character encoding scheme. The most widely used of these, and the one adopted by mIRC, is UTF-8. What is UTF-8? UTF-8 (8-bit Unicode Transformation Format) is a method of encoding Unicode code points[4] so that they can be transferred from one location to another as a string of 8-bit integers or bytes. That is to say a code point, which may take on any value in the entire code space, must be transformed into one or more integers, each of which lie in the range 0 to 255. It is clear why the result of this encoding may contain more than one such byte; there does not exist a sufficient number of integers between 0 and 255 to singly and uniquely represent the Unicode code space. Each code point is therefore mapped onto a series of bytes from which it can later be decoded to produce, without any ambiguity, the original code point. The following table outlines the encoding procedure for the full range of code points: Original at: http://en.wikipedia.org/wiki/UTF-8 Visualizing the mapping might be difficult for some; the reader may use the following alias to examine UTF-8 encoding with calculations performed in the more familiar decimal base: Code: ; $utfencodei(N) ; Returns the string of UTF-8 bytes formed by encoding Unicode code point N. ; It works with integers in a similar manner to how $utfencode() works with characters. ; For example: $utfencodei(2034) = 223 178 alias utfencodei { if ($1 isnum 0-127) return $1 if ($1 isnum 128-2047) return $int($calc(192 + $1 / 64)) $calc(128 + $1 % 64) if ($1 isnum 2048-65535) return $int($calc(224 + $1 / 4096)) $int($calc(128 + ($1 / 64) % 64)) $& $calc(128 + $1 % 64) if ($1 isnum 65536-1114111) return $int($calc(240 + $1 / 262144)) $int($calc(128 + ($1 / 4096) % 64)) $& $int($calc(128 + ($1 / 64) % 64)) $calc(128 + $1 % 64) } Note that characters in the 0 to 127 code point range remain unchanged by their encoding, one advantage of which is that compatibility with ASCII is maintained. UTF-8 encoding of multiple code points is performed on one at a time, the results are simply concatenated to produce the encoded byte sequence. This observation leads to the following corollary: $utfencode(%a $+ %b) = $utfencode(%a) $+ $utfencode(%b) for two Unicode characters %a and %b. There is also something to be said about the range of possible UTF-8 encodings: certain byte sequences cannot be generated. For example, the single byte 128 does not constitute meaningful UTF-8 because there exists no Unicode code point which, when passed through the above encoding procedure, yields 128. Such sequences are said to contain invalid UTF-8 and it is up to the application in question to handle them as it deems appropriate. mIRC will attempt to apply a corrective procedure in certain situations by interpreting all invalid bytes as valid Unicode code points in the 128 to 255 code point range. Unfortunately, a person receiving a large amount of invalid UTF-8 on mIRC 7.1 is likely not going to benefit from this correction as the sender's client might be configured to use a particular code page without encoding outgoing text in UTF-8. What are Code Pages? The term code page is generally used to describe any character encoding procedure that enumerates an abstract and unordered character set. This creates a relationship between characters and code points similar to but less complex than the one discussed earlier in the context of the Unicode standard. In the mIRC community, code page takes on a subtly different definition and refers to the more primitive character encodings that do not involve a sophisticated encoding process[5]. These code pages typically define a code space of 256 code points and, as a result, are able to encode each member of the space to byte form trivially and efficiently. Code pages are platform dependent in that they are established by vendors for their respective platforms. For example, Windows defines a different code page than MS-DOS for the Cyrillic alphabet. Meaningful communication between two users on IRC is only possible if their clients are using code pages that resolve transmitted byte sequences to the same characters as one another. This is one of the reasons why the author of mIRC has endeavoured to move away from these client specific code pages and towards a unified standard. Fonts and font linking This article would not be complete without mentioning an integral aspect of character representation that is not strictly a part of the Unicode standard. Fonts are the key ingredients that allow users to visualise what has until now been merely conceptual. Unicode outlines a protocol for assembling glyphs, individual units for graphical display, but it relies upon external fonts to provide these glyphs. Although the overall procedure for graphical representation of an arbitrary sequence of encoded characters is extremely involved, the role of the font is relatively straight forward. For each encoded character in the text[6], a font is summoned from which to grab the associated glyphs. Unfortunately, not all fonts are likely to contain the glyphs asked of them. If your default font does not support the required character, mIRC will iterate through a list of common Unicode fonts (fonts that contain glyphs for a significant portion of Unicode's code space) on your system in order to find a successful match. This leads to a potential discrepancy between the appearances of different parts of the displayed line for users utilizing a relatively limited default font. A comparison of different Unicode fonts is provided at: http://en.wikipedia.org/wiki/Unicode_typeface and http://www.fileformat.info/info/unicode/font/index.htm - the latter website allows a user to locate fonts that support a particular encoded character given its code point value. Part of the complexity of the Unicode standard comes from its ability to construct compound visual characters, called grapheme clusters, by piecing together multiple encoded characters. An example of a grapheme cluster is the Hangul syllable gag which can be formed by concatenating encoded characters U+1100 (HANGUL CHOSEONG KIYEOK), U+1161 (HANGUL JUNGSEONG A) and U+11A8 (HANGUL JONGSEONG KIYEOK): Quote: //var -s %chr1 = $chr($base(1100, 16, 10)), %chr2 = $chr($base(1161, 16, 10)), %chr3 = $chr($base(11A8, 16, 10)) \| echo -eag %chr1 $+ %chr2 $+ %chr3 The first encoded character is the base character which contains the instructions necessary to form the cluster. The presence of grapheme clusters and their base characters implies that Unicode is capable of delineating substantially more visual characters than the ranges of assigned code points might suggest. Older mIRC Versions In older versions of mIRC, users were able to select a code page for character display on a per-window basis. With the introduction of mIRC 6.17, it became possible to simultaneously decode valid UTF-8 and refer to the local code page for invalid UTF-8 (though not both in the same line of text; a single invalid UTF-8 sequence would invalidate the entire line). All characters were stored as byte sequences internally, forcing mIRC scripters to handle Unicode as UTF-8 encoded text. This was problematic in various situations involving string splitting at arbitrary positions, counting characters and any script that operated under the assumption that individual characters within a line of text would each correspond to exactly one encoded character. This issue still plagues scripters in the case of having to deal with characters comprising a surrogate pair as two separate code units. However, it is rare that one encounters a stray surrogate pair in text and the character boundaries are generally simple to determine. Incoming text was decoded as UTF-8 if it were deemed valid, otherwise the code page associated with the relevant window was called upon for character resolution. Furthermore, these code pages had to be referred to for mapping of code page specific bytes to their synonymous Unicode code points when interacting with Windows libraries and controls that were strictly Unicode compliant. This approach resulted in the manifestation of various issues which are inherent to a non-committal variety of Unicode support. One need only skim the older search results for 'Unicode' on mIRC.com's bug report forum to witness the myriad problems present in the old version. Another fundamental weakness, which is more related to the popularity of limited code pages than a client's support of them, was the inability to concurrently refer to multiple code pages in a single mIRC window. A common example is the Channels List window on a network housing users of different nationalities, each communicating in their own language and depending on their own associated code pages. Portions of this window would be unintelligible to the viewer owing to the variety of the code pages used to encode channel topics. Of course, this and similar issues are not automatically resolved by an individual upgrading to the latest mIRC release. The problem of code page dependent text on IRC is pervasive and the effort in bringing the issue to finality would need to be the same. Discontinuation of code pages in the most recent version of mIRC constitutes an indispensable step towards widespread conformity to a superior standard. Be that as it may, it is important to remain pragmatic; this movement cannot be expected to occur immediately. Code page dependency is likely to persist for the forseeable future. The Latest Version (7.1) The new version of mIRC has dropped code page support in favour of the Unicode standard. This change was necessary on many levels and, by the author's own admission, has been the most complex and time-consuming update to the client thus far. It is, admittedly, not difficult to take for granted all the areas of mIRC that invariably come into play under the requirement of full Unicode adherence. Textual display across every type of window and dialog; scripting related functions that operate on text; external libraries heavily intertwined with the underlying code - these are just vague examples of important considerations that surely must have been made. Many routines added in a top-down fashion to the previous versions, used to target odd bugs and exceptional circumstances, would have undoubtedly required extensive modification if not a complete rewrite. It brings to mind the image of a tree house balanced atop a withering tree upon which new loads are constantly added through painstaking effort. This cannot continue indefinitely; without a solid foundation the entire house will inevitably come crashing down (an exaggeration of sorts but the point is valid and worth considering). Unicode Related Functions This section is intended to be less verbose than those earlier and will be used to demonstrate the concepts introduced up to now. First, as promised, a pair of aliases that act as extensions to $chr() and $asc() for code points beyond the BMP: Code: ; $chrU(N) ; Returns the pair of surrogates that represent the given code point for N between 65536 and 1114111. ; Otherwise it returns $chr(N). ; The hexadecimal form $chrU(U+XXXX) is supported. alias chrU { var %chr = $iif(U+* iswm $1, $base($mid($1, 3), 16, 10), $1) if (%chr isnum 65536-1114111) return $chr($calc(55232 + %chr / 1024)) $+ $chr($calc(56320 + %chr % 1024)) return $chr(%chr) } ; $ascU(S1S2) ; Returns the code point of the character depicted by high surrogate S1 combined with low surrogate S2. ; If S1 and S2 are not respectively a high and low surrogate, it reverts to mIRC's own $asc(). alias ascU { var %asc = $asc($1) if (%asc isnum 55296-56319) && ($asc($mid($1, 2)) isnum 56320-57343) return $& $calc(65536 + (%asc - 55296) * 1024 + $v1 % 1024) return %asc } A mention of mIRC's native functions for encoding Unicode and decoding UTF-8 should be made at this time. In order to encode text handled internally, mIRC uses an encoding routine that has been made available to scripters through application of the $utfencode() identifier: Code: $utfencode(text [, C ]) Returns the result of encoding text in UTF-8[7], optionally referencing code page C for specific code page to Unicode resolution. There is of course a complementary identifier used for decoding purposes, named $utfdecode(): Code: $utfdecode(text [, C ]) Returns the result of decoding text from UTF-8, optionally referencing code page C for specific Unicode to code page resolution. The reader may be wondering what this optional C parameter signifies and how it is used. This value, an 8-bit integer, is called a GDI charset number and unambiguously represents a particular code page on the Windows platform. The values of C and corresponding code page names are provided: Quote: 000 - ANSI_CHARSET 001 - DEFAULT_CHARSET 002 - SYMBOL_CHARSET 077 - MAC_CHARSET 128 - SHIFTJIS_CHARSET 129 - HANGEUL_CHARSET 130 - JOHAB_CHARSET 134 - GB2312_CHARSET 136 - CHINESEBIG5_CHARSET 161 - GREEK_CHARSET 162 - TURKISH_CHARSET 163 - VIETNAMESE_CHARSET 177 - HEBREW_CHARSET 178 - ARABIC_CHARSET 186 - BALTIC_CHARSET 204 - RUSSIAN_CHARSET 222 - THAI_CHARSET 238 - EASTEUROPE_CHARSET 255 - OEM_CHARSET Note: GDI charsets 1 and 255 are system dependent and are therefore expected to return different results across different machines. Values not on the table are treated as a reference to DEFAULT_CHARSET, equivalent to using C = 1. It is now possible to elaborate upon the behaviour of the aforementioned identifiers in terms of this C parameter. In the case of encoding text, it is helpful to start by considering an example: Quote: //echo -eag $utfencode($chr(195), 161) The main steps that occur implicitly are: Code point 195 in GREEK_CHARSET is looked up and found to have been assigned the abstract character named GREEK CAPITAL LETTER GAMMA. Unicode's code page is referenced and GREEK CAPITAL LETTER GAMMA is located at code point 915 (U+0393 or $chr(915) in mIRC). Code point 915 is now encoded using the standard procedure for UTF-8 encoding, as though $utfencode($chr(915)) was originally used. It is meaningless to supply a value for C when attempting to encode Unicode code points beyond the first 256. The reason for this is simple: these code pages are byte based and cannot make sense of these extended values. Decoding UTF-8 with code page references is more intricate and, at present, slightly flawed insofar as mIRC will attempt to decode invalid UTF-8 (this issue has been reported and addressed by mIRC's author). The procedure above is inverted but involves a character fitting process for associating Unicode characters with members of the target code page when an exact match cannot be determined. Recall from the previous example that the code point 915 is assigned an abstract character that is found in the GREEK_CHARSET at position 195. This resolution can be performed systematically as demonstrated by the next example: Quote: //echo -eag $asc($utfdecode($utfencode($chr(915)), 161)) The main steps that occur implicitly this time are: Unicode code point 915 is encoded via $utfencode() to form its UTF-8 analogue. The encoded sequence is implicitly decoded back to code point 915 but this is not returned immediately. First the abstract character GREEK CAPITAL LETTER GAMMA is resolved, then code page GREEK_CHARSET is traversed until the abstract character is found. GREEK CAPITAL LETTER GAMMA in GREEK_CHARSET is detected at position 195. If the user attempts to decode UTF-8 that ultimately represents abstract characters non-existent in code page C, a fitting routine is called upon that attempts to relate the abstract Unicode character to a different but similar one in the target code page. This can be seen to occur in the following example: Quote: //echo -eag $chr(65335) -- $utfdecode($utfencode($chr(65335)), 178) Here, code unit 65535 (U+FF37) is assigned FULLWIDTH LATIN CAPITAL LETTER W which does not exist in ARABIC_CHARSET (C = 178). Consequently, it is fitted onto a member of ARABIC_CHARSET that is said to resemble the original character: $chr(87) or the familiar capital letter W. For a detailed explanation of the fitting procedure, the reader should refer to http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt There is an assortment of methods for discovering Unicode characters. On earlier versions of mIRC where Unicode was manipulated strictly with UTF-8, the code point associated with a certain UTF-8 sequence appeared to be irrecoverable. For this purpose one may use the following alias which locates a given code point either using its UTF-8 encoding or its equivalent form as it resides in the active window's code page under the given index: Code: ; /findUnicode <chr> ; For mIRC 6.35 and below ; Usage: Type /findUnicode followed by the character you wish to look up. ; If the character is not valid UTF-8 a code page reference is made. alias findUnicode { bset -t &a 1 $iif($isutf($1), $1, $utfencode($1, $window($active).fontcs)) var %char = $utfdecodei($bvar(&a, 1-)) linesep -a echo -aqg Character: $1 (U+ $+ $base(%char, 10, 16, 4) $+ ) echo -aqg Use $!chr( $+ %char $+ ) in mIRC 7.1 to retrieve this character. linesep -a } alias -l utfdecodei { tokenize 32 $1 if ($1 isnum 0-127) return $1 if ($1 isnum 192-223) return $calc(($1 - 192) * 64 + $2 - 128) if ($1 isnum 224-239) return $calc(($1 - 224) * 4096 + ($2 - 128) * 64 + $3 - 128) if ($1 isnum 240-244) return $calc(($1 - 240) * 262144 + ($2 - 128) * 4096 + ($3 - 128) * 64 + $4 - 128) } An invaluable resource for information on individual Unicode characters is found at the same website provided earlier in the Fonts section and is unabashedly re-iterated here: http://www.fileformat.info/info/unicode/char/search.htm Given that mIRC is naturally predisposed to encoding outgoing text to an IRC server as valid UTF-8, the scripter needs a method to override this default mechanism in the rare instances in which they are pressed to send invalid UTF-8 byte sequences. It was for this purpose that a -n switch was implemented in the /raw command. /raw -n will, for code units less than 256, skip the UTF-8 encoding of these code units and send them instead as single bytes of equal numerical value: Quote: /raw -n WHOIS jâytea The â accented letter in the above example is $chr(226). The IRC server, in this case, has no notion of code pages or character encoding; it simply stores and relays sequences of bytes between IRC clients. Without the -n switch, mIRC encodes and sends â as the two bytes 195 and 162 rather than the single byte 228 that is required for the server to recognise the correct nickname. Other situations that exhibit the utility of this feature are given in the next section. Common Issues If the reader has reached this point in the article with a vague comprehension of the overall Unicode procedure, they should be sufficiently prepared to understand the following material. Not all of the problems mentioned are soluble; it is this author's hope that the explanations provided will at the very least enlighten the user as to why code page discontinuation, a decision that must have been made with great compunction, was ultimately well-advised. The most pressing issue for users since the new mIRC version's release is, predictably, its lack of code page support. This modification has manifested itself most notably by serving to display code page dependent text (read: improperly encoded text) as unintelligible nonsense. Why this is: Code page dependent text is received by mIRC as a series of bytes that, much more often than not, constitutes an invalid UTF-8 byte stream. As such, mIRC will simply map each byte onto its numerically equivalent code unit between 1 and 255 with no regard for the code page used by the sender. The result is a string of text that might be familiar to those users who previously had their clients set to operate under the Latin-1 character map (since Latin-1 shares its character set with the first 256 Unicode code points), though still presumably indecipherable. How this can be resolved: The simplest workaround in this case is to ensure one's client always receives valid UTF-8 encoded text. One must persuade his peers to use clients that are capable of encoding outgoing text in UTF-8. In previous versions of mIRC this is made available to the sender in the font selection dialog (View > Font) by choosing 'Display and encode' under the UTF-8 option. Alternatively the sender may upgrade to the latest version of mIRC. The benefits of this approach are numerous and have been detailed earlier in the article. In reality, this upgrade is infeasible for the individual user unless it represents part of a joint effort. It is the difference between yielding to one's best interests or the best interests of the community at large; the most diplomatic decision is to lead by example and put pressure on other users to do what is unequivocally, in this day and age, essential. Additionally, to view the line of text as it would appear when related to a specific code page, a user may take advantage of the functions exemplified in the previous section: Quote: //echo -ea $utfdecode($utfencode(<text>, C)) C is the GDI charset value of the code page to refer to; its details are fully documented in the previous section. The next most commonly seen issue is similar to but distinct from the previous one and involves sending rather than receiving text. Certain users migrating from the previous version have complained about its inability to join channels with names containing accented characters and other code page specific byte sequences. This is only partially true; mIRC will join the channel whose name is concordant with the Unicode standard, but this is unlikely to be the desired channel as joined by earlier versions of mIRC. In the same vein, messages sent to users whose clients are code page reliant and cannot interpret UTF-8 will appear to be unreadable. [img]http://imgur.com/QtrHU.png[/img] Why this is: As touched upon previously, mIRC now being a fully Unicode adherent application will automatically encode all outgoing text as UTF-8 byte streams. This results in foreign or accented characters being sent as two or more bytes whereas, for joining certain channels, the actual channel name possesses no such encoded byte sequence. Similarly, clients receiving this UTF-8 with code pages that directly resolve bytes to abstract characters without first interpreting and decoding UTF-8 will not see text as it was sent. How this can be resolved: Once again the reader must be reminded that the true problem lies not wholly in their client and its lack of code page support. The users that one communicates with on IRC, who propagate code page dependency by themselves relying solely on them, can certainly be said to have a hand in the matter. If these users were able to effectively view and encode text (see the font options dialog above) as per the Unicode standard, this problem would resolve itself organically. Conformity on a much wider scale might even eventually render code pages obsolete. For the user looking for a scripted alternative, the feature of mIRC that is of value in this set of circumstances is the -n switch of /raw. With it, a user is able to send to the server code units as bytes that are not produced by a complicated encoding scheme. Consider the following example: Quote: /raw -n JOIN #québec The problematic character here, LATIN SMALL LETTER E WITH ACUTE, is $chr(233) in Unicode. As this represents a code point within the range that can be converted to byte sequences with ease, /raw -n will send $chr(233) as the single byte 233 without encoding it as the double byte sequence 195 and 169. There is an important caveat that must accompany all explanations of this technique. While, from a user's perspective, this may appear to solve the problem of being unable to join channels with malformed names, there is still a flagrant issue with mIRC's ability to send subsequent messages to this particular channel. When the IRC server sends a client acknowledgment of having successfully joined the above channel, mIRC applies the corrective procedure detailed earlier to correct the invalid UTF-8 present in the channel name. mIRC proceeds to create a channel window named accordingly and all messages typed within are thus sent to the wrong channel. This can be further worked around by using a script or otherwise to send each channel message through /raw -n. Often a user will try to send non-encoded text to the server using /raw -n without realizing that they are trying to send code units with greater values than 255. These code units will still be encoded via UTF-8 since they cannot directly be resolved to equivalent byte sequences without a secondary encoding scheme. If the user intends to send this code page specific text then they must use $utfdecode() to relate individual characters within the line of text to their corresponding forms in the target code page: Quote: //raw -n PRIVMSG igor : $+ $utfdecode($utfencode(ПФЧЭЯбЖ), 204) [img]http://imgur.com/KmIql.png[/img] In this hypothetical scenario, user igor is the receiver whose client is not able to understand UTF-8 and instead uses RUSSIAN_CHARSET to realise characters sent over IRC. The sender can cater to igor's code page dependency by first referencing the earlier table of GDI charsets to find a value of C associated with RUSSIAN_CHARSET, ie. C = 204. Next, the sender uses a $utfdecode() and $utfencode() combination, the explanation for which was provided in the previous section, to convert the given Unicode code units into members of the RUSSIAN_CHARSET wherever possible. The reader is reminded of the fitting procedure that occurs when an encoded character cannot be found in the required code page. Another problem that has revealed itself several times already is observed when attempting to transfer certain scripts from older versions of mIRC to the present one. Recall that in earlier mIRC versions, Unicode was depicted internally as UTF-8. Any script that endeavoured to display Unicode characters generally did so in terms of the UTF-8 encoded bytes that compose them: Code: alias forAll return $chr(226) $+ $chr(136) $+ $chr(128) In mIRC 6.35, this custom alias can be used as $forAll to return the FOR ALL character occupying code point 8704 (U+2200). Porting this over to mIRC 7.1 obliges the user to make a slight modification: Code: alias forAll return $utfdecode($chr(226) $+ $chr(136) $+ $chr(128)) Since the combined characters form a UTF-8 sequence, $utfdecode() is able to correctly resolve the appropriate code unit. $utfdecode() must be added in this fashion to any and all scripts that previously returned UTF-8 encoded bytes and relied upon mIRC to decode these implicitly for display purposes. This implicit decoding no longer occurs; scripters must now be explicit when attempting to decode UTF-8 that exists internally as Unicode code units. Final Word This author can only, at this moment, repeat the message that has been constant throughout this article and iterated in a multitude of different ways which is: to fully experience the preferable standard of Unicode and all the performance gains that accompany it, one is compelled to upgrade to the latest version of mIRC and encourage their peers to do likewise. Admittedly and rather regrettably, the change has been abrupt and has upset many of mIRC's loyal users. Hopefully these users have found some solace in the information and explanations provided by this article. Footnotes [1] Informally, the reader may consider that one single abstract character is only ever assigned to a code point. In truth, the relationship between code points and associated characters is more complex; one code point could be assigned several abstract characters and a single abstract character could be assigned to multiple code points. This relationship carries with it a set of properties that dictate the behaviour of this combination, the exact details of which are beyond the scope of this article. The coupling of a code point with one or more abstract characters is called an encoded character. [2] mIRC uses a 16-bit encoding form known as UTF-16 to harbor Unicode internally. Each code point in the BMP is depicted as a single 16-bit code unit. For the much less common code points residing outside the BMP, two of these code units are used. An advantage of this approach is the ability to manipulate members of the frequently used BMP directly whilst conserving the memory otherwise occupied by wider encoding forms such as UTF-32 (which encodes all code points as a single 32-bit code unit). [3] The sequence U+XXXX is standard notation used to signify a Unicode code point, XXXX is the code point value in hexadecimal. [4] Technically, UTF-8 operates on a series of code units rather than whole code points (which are a mere abstraction at this point). However, it is easier to consider the mapping as a function of code point and is also valid given the 1-to-1 relationship that exists between code points, code units and valid UTF-8 sequences. [5] With the introduction of the Unicode standard, a revision of the terminology surrounding character encoding systems was conducted. The term character map was introduced to be used where the term code page would otherwise be broad and imprecise; it refers to a character encoding system that is able to map an abstract character set onto a sequence of bytes directly without having to encode data in phases. Most if not all of the byte driven character encoding systems (including multi-byte examples such as CHINESEBIG5_CHARSET) can be considered character maps. [6] Recall that one code point may conceivably represent a collection of abstract characters. If a font does not include a glyph that encapsulates this collection, a combination of glyphs will be required to fully realise the whole encoded character. [7] When discussing UTF-8 within the context of mIRC's identifiers, the reader should keep in mind the application's tendency to store information as code units rather than the requisite bytes for proper UTF-8 portrayal. Thus the UTF-8 referred to in this context is not entirely legitimate; rather, it is a pseudo UTF-8 scheme that exhibits bytes as their analogous code units, each byte mapping onto exactly one code unit. There is the implication that using this pseudo UTF-8 in any area of mIRC where code units are normally encoded to true UTF-8 byte streams (such as sending text over IRC) will result in text effectively being double encoded. While this behaviour might seem counter-intuitive at first glance, it is not unreasonable and should not be problematic for the perspicacious scripter. References and Acknowledgements The following sources have been of inestimable value during the creation of this article: http://en.wikipedia.org/wiki/Unicode http://www.fileformat.info/info/unicode/ http://www.unicode.org/versions/Unicode5.2.0/ The author would like to extend his gratitude towards Collective, Saturn and qwerty for their wisdom and tolerance of his incessant ramblings. Thanks also to Chessnut for pointing out a couple of ways in which the clarity of the article could be improved, and Wims for mentioning a typo in a snippet. This article is open to possible future amendments and readers are encouraged to contact its author with their ideas. The author can be contacted on IRC (#mSL on irc.SwiftIRC.net) or through this forum's private messaging system. Written by jaytea for the mIRC scripting community. "The only excuse for making a useless script is that one admires it intensely" - Oscar Wilde

Entire Thread
Subject	Posted By	Posted
Unicode Support	jaytea	07/08/10 12:44 AM
Re: Unicode Support	Anonymous	07/08/10 09:11 AM
Re: Unicode Support	drum	08/08/10 09:34 PM
Re: Unicode Support	jaytea	10/08/10 07:59 PM
Re: Unicode Support	jaytea	11/08/10 09:03 PM
Re: Unicode Support	RusselB	11/08/10 09:46 PM
Re: Unicode Support	Thrull	16/08/10 06:54 AM
Re: Unicode Support	uberRegenbogen	17/08/10 09:49 AM
Re: Unicode Support	jaytea	17/08/10 04:56 PM
Re: Unicode Support	Anonymous	19/08/10 12:07 PM
Re: Unicode Support	uberRegenbogen	23/08/10 02:50 PM
Re: Unicode Support	jaytea	23/08/10 05:58 PM
Re: Unicode Support	uberRegenbogen	24/08/10 03:15 PM
Re: Unicode Support	jaytea	29/09/10 09:06 AM

Link Copied to Clipboard