Unicode $upper $lower $isupper $islower #267797 30/09/20 05:45 AM
Joined: Jan 2004
Posts: 1,438
maroon Offline OP
Hoopy frood
https://forums.mirc.com/ubbthreads.php/topics/225583/upper-and-lowercase
https://forums.mirc.com/ubbthreads.php/topics/214238/lower-upper

I've read these threads and I still don't understand the rules for what makes a character be considered uppercase vs lowercase when the character is in the higher Unicode ranges. There are 5 color-coded groups in this alias where the combination of results for these 4 identifiers is either a bug, or is governed by additional rules.

Based on what is used for the normal 33-126 range, I'd assumed there would be a few simple rules.

* If a character could be returned by $upper, then $isupper(char) would be $true

* If a character could be returned by $lower, then $islower(char) would be $true

The above 2 rules mean that many characters, like '1', '2' and '3', would return $true for both $isupper and $islower

* If $upper(char) and $lower(char) were different from each other, then $isupper($upper(char)) must be $true and $isupper($lower(char)) must be $false. Also $islower($lower(char)) must be $true and $islower($upper(char)) must be $false.

* If $upper(char) and $lower(char) returned the same codepoint, then it shouldn't be possible for exactly 1 of $isupper(char) $islower(char) to be $true and the other be $false.
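The four rules above can be written down as a small checker. This is a rough Python sketch, not mIRC code: `CaseModel`, `rule_violations` and `ascii_model` are hypothetical names made up for illustration, and `ascii_model` only imitates the behavior described above for the 33-126 range (non-letters count as both upper and lower).

```python
from typing import Callable, List, NamedTuple


class CaseModel(NamedTuple):
    """Hypothetical bundle of the four operations being compared."""
    isupper: Callable[[str], bool]
    islower: Callable[[str], bool]
    upper: Callable[[str], str]
    lower: Callable[[str], str]


def rule_violations(ch: str, m: CaseModel) -> List[int]:
    """Return the numbers (1-4) of the rules that ch breaks under model m."""
    up, lo = m.upper(ch), m.lower(ch)
    broken = []
    # Rule 1: anything upper() can return should be accepted by isupper().
    if not m.isupper(up):
        broken.append(1)
    # Rule 2: anything lower() can return should be accepted by islower().
    if not m.islower(lo):
        broken.append(2)
    # Rule 3: distinct upper/lower forms must not cross-classify
    # (the "must be true" halves are already covered by rules 1 and 2).
    if up != lo and (m.isupper(lo) or m.islower(up)):
        broken.append(3)
    # Rule 4: identical upper/lower forms must not be exactly one of the two.
    if up == lo and m.isupper(ch) != m.islower(ch):
        broken.append(4)
    return broken


# Assumption: an ASCII-only model where non-letters are both upper and lower.
ascii_model = CaseModel(
    isupper=lambda c: not ("a" <= c <= "z"),
    islower=lambda c: not ("A" <= c <= "Z"),
    upper=str.upper,
    lower=str.lower,
)

# Python's own methods treat uncased characters as neither upper nor lower.
python_model = CaseModel(str.isupper, str.islower, str.upper, str.lower)

print(rule_violations("A", ascii_model))
print(rule_violations("1", python_model))
```

Under Python's built-in methods a digit already breaks rules 1 and 2, because `str.isupper()` and `str.islower()` both return False for characters that have no case. That is a different convention from the one assumed in the rules, not necessarily a bug in either.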

However, when looking at the whole Unicode range, there are many more exceptions to the above 'rules' than there are characters that comply with them.

This alias examines codepoints 1-65535 (skipping the surrogate range 55296-57343) and returns a bitflag group of 1's or 0's based on whether they return $true = 1 or $false = 0 for:

* $isupper(char)
* $islower(char)
* $isupper($upper(char))
* $islower($lower(char))
* $asc($upper(char)) == $asc($lower(char))
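For comparison outside mIRC, the same five bits can be computed with Python's string methods. This is only a sketch: `case_flags` and `tally_bmp` are made-up names, and Python takes its answers from its bundled Unicode Character Database and uses full case mappings (so `upper()` may return more than one character), which means its group counts should not be expected to match mIRC's.

```python
from collections import Counter


def case_flags(ch: str) -> str:
    """Build the 5-bit group string in the same order as the list above."""
    up, lo = ch.upper(), ch.lower()
    bits = (
        ch.isupper(),   # isupper(char)
        ch.islower(),   # islower(char)
        up.isupper(),   # isupper(upper(char))
        lo.islower(),   # islower(lower(char))
        up == lo,       # upper(char) == lower(char)
    )
    return "".join("1" if b else "0" for b in bits)


def tally_bmp() -> Counter:
    """Count how many BMP codepoints fall into each 5-bit group."""
    counts = Counter()
    for cp in range(1, 0x10000):
        if 0xD800 <= cp <= 0xDFFF:  # skip surrogates, as the alias does
            continue
        counts[case_flags(chr(cp))] += 1
    return counts


if __name__ == "__main__":
    for group, n in sorted(tally_bmp().items()):
        print(group, n)
```

Note that a digit such as '1' lands in 00001 here rather than 11111, because Python reports characters without case as neither upper nor lower.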

I've color-coded a portion of the Unicode range to be displayed based on not complying with the above 'rules'. So, either there are some bugs in how a few Unicode characters are handled, or there are additional rules that I'm not aware of, or there are a LOT of exceptions.

* pink (01011)
These are 282 codepoints where $islower(char) is $true and $isupper(char) is $false, yet they don't have $upper(char) being different from them.

* tan (10101)
These are 33 codepoints where $isupper(char) is $true and $islower(char) is $false, yet they don't have $lower(char) being different from them.

For the above 2 groups, it seems logical that a character could have only 1 of the 2 $isupper(char) or $islower(char) being $true without having $upper(char) and $lower(char) being different from each other. However, when looking at individual cases I have trouble finding a solution, and I don't know enough about the other languages to know if these results are legit.

For $chr(223), webpages say that this is lowercase, but I can't find a reference to an uppercase equivalent, so maybe it is possible for some characters to be used only in lowercase text without having an uppercase complement. Though, if $chr(223) is lowercase-only, I'm not sure what a solution would be besides the current behavior of $upper($chr(223)) displaying $chr(223) unchanged even though that means $isupper($upper($chr(223))) is $false.

For $chr(304), webpages say this is uppercase, but for linking a lowercase equivalent, they point at the 7bit $chr(105) 'i'. But whether it's a good idea to have a 7-bit codepoint be returned as the $lower of a codepoint above 128, I dunno.
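As a cross-check on these two characters, Python's bundled Unicode data can be queried directly. This is a sketch (`describe` is a made-up helper) and it shows Unicode's default mappings, not what the Windows APIs or mIRC return.

```python
import unicodedata


def describe(cp: int) -> dict:
    """Summarise what Python's bundled Unicode data says about one codepoint."""
    ch = chr(cp)
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),  # Ll = lowercase, Lu = uppercase
        "upper": [f"U+{ord(c):04X}" for c in ch.upper()],
        "lower": [f"U+{ord(c):04X}" for c in ch.lower()],
    }


for cp in (223, 304):
    print(cp, describe(cp))
```

In this data, $chr(223) is a lowercase letter whose default uppercase is the two letters 'SS', and $chr(304) is an uppercase letter whose default lowercase is 'i' followed by U+0307 (combining dot above). Unicode also defines U+1E9E LATIN CAPITAL LETTER SHARP S, which lowercases to $chr(223), but the default uppercase of $chr(223) does not point back to it. Both mappings change the string length, which a one-codepoint-in, one-codepoint-out identifier cannot express.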

* red (00000)
These are 32 codepoints where $isupper(char) and $islower(char) both report $false, yet they have a $lower(char) and $upper(char) which are different from each other.

* maroon (11110)
These are 79 codepoints where $isupper(char) and $islower(char) both report $true, yet they have a $lower(char) and $upper(char) which are different from each other.

These 2 red and maroon groups seem to be more of a problem, because when $upper(char) and $lower(char) are different from each other, it seems logical that it shouldn't be possible for $isupper(char) and $islower(char) to both be $true, or to both be $false.
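One thing Unicode has that these rules don't allow for is a third case: titlecase letters (general category Lt), which are neither uppercase nor lowercase yet have distinct uppercase and lowercase forms. Whether any of the red group consists of such letters has not been checked against mIRC; the sketch below only shows that the combination is legitimate in Unicode's own data. `titlecase_letters_bmp` is a made-up name.

```python
import unicodedata


def titlecase_letters_bmp() -> list:
    """List BMP codepoints whose general category is Lt (titlecase letter)."""
    return [cp for cp in range(0x10000)
            if unicodedata.category(chr(cp)) == "Lt"]


ch = "\u01c5"  # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
print(unicodedata.category(ch))        # general category of the sample letter
print(ch.isupper(), ch.islower())      # neither upper nor lower
print(hex(ord(ch.upper())), hex(ord(ch.lower())))  # distinct upper and lower
print(len(titlecase_letters_bmp()), "titlecase letters found in the BMP")
```

Comparing the codepoints printed by the alias for the red group against this list would show whether titlecase letters account for any of them.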

* black (00001)
This group has 45501 codepoints, too numerous to display. They all report $false for both $isupper(char) and $islower(char), yet each of them can appear in the output of both $upper() and $lower(). An additional 15725 characters did follow the above 'rules': they report $true for both $isupper(char) and $islower(char), since they don't have an uppercase or lowercase form different from themselves.

Code
alias upperlower_test {
  var %i 1 | if (!$hget(test)) hmake -s test 1 | hdel -sw test z????? | hdel -sw test uplow.*
  while (%i isnum 1-65535) {
    var %char $chr(%i) , %up $upper(%char) , %lo $lower(%char)
    var %asc.char $asc(%char), %asc.up $asc(%up) , %asc.lo $asc(%lo)
    if ($isupper(%up) == $false) hinc -m test uplow.$isupper.says.output.of.$upper.is.$false
    if ($islower(%lo) == $false) hinc -m test uplow.$islower.says.output.of.$lower.is.$false
    if ($isupper(%char))                            var %b1 1 | else var %b1 0
    if ($islower(%char))                            var %b2 1 | else var %b2 0
    if ($isupper($upper(%char)))                    var %b3 1 | else var %b3 0
    if ($islower($lower(%char)))                    var %b4 1 | else var %b4 0
    if ($asc($upper(%char)) == $asc($lower(%char))) var %b5 1 | else var %b5 0
    var %a z $+ $+(%b1,%b2,%b3,%b4,%b5) | hinc test %a
    if (%a !isin z00001 z10110 z01110 z11111 z01011 z10101 z11110 z00000) {
      echo 12 -a debug: %a %i %char : isupper $isupper(%char) islower $islower(%char) * upperchar $isupper(%up) %asc.up * lowerchar $islower(%lo) %asc.lo
      ;z00001 = isupper(char) islower(char) isupper(upper(char)) islower(lower(char)) all false * upper(char)==lower(char)
      ;z11111 = normal non-alpha
      ;z11110 = is both upper and lower yet upper(char) != lower(char)
      ;z10110 = normal uppercase
      ;z01110 = normal lowercase
      ;z00000 = upper(char) != lower(char) yet isupper(char) islower(char) isupper(upper(char)) islower(lower(char)) all false
      ;z01011 = isupper(char)=false islower(char)=true yet upper(char) and lower(char) both the SAME
      ;z10101 = isupper(char)=true islower(char)=false yet upper(char) and lower(char) both the SAME
    }
    if (%a == z01011) echo 13 -ag %i %char $+(U+,$base(%i,10,16,4)) how can isupper(char) be $isupper(%char) and islower(char) be $islower(%char) though asc(upper(char) $asc($upper(%char)) === asc(lower(char)) $asc($lower(%char))
    if (%a == z10101) echo  7 -ag %i %char $+(U+,$base(%i,10,16,4)) how can isupper(char) be $isupper(%char) and islower(char) be $islower(%char) though asc(upper(char) $asc($upper(%char)) === asc(lower(char)) $asc($lower(%char))
    if (%a == z11110) echo  5 -ag %i %char $+(U+,$base(%i,10,16,4)) how can isupper(char) be $isupper(%char) and islower(char) both be $islower(%char) though upper(char) $upper(%char) $asc($upper(%char)) !== lower(char) $lower(%char) $asc($lower(%char))
    if (%a == z00000) echo  4 -ag %i %char $+(U+,$base(%i,10,16,4)) how can isupper(char) be $isupper(%char) and islower(char) both be $islower(%char) though upper(char) $upper(%char) $asc($upper(%char)) !== lower(char) $lower(%char) $asc($lower(%char))
    inc %i | if (%i = 55296) var %i 57344
  }
  var %a | noop $hfind(test,z11*,0,w,inc %a $hget(test,$1)) | echo -a upper(char) .true and lower(char) .true: %a
  var %a | noop $hfind(test,z00*,0,w,inc %a $hget(test,$1)) | echo -a upper(char) false and lower(char) false:  %a
  var %a | noop $hfind(test,z10*,0,w,inc %a $hget(test,$1)) | echo -a upper .true and lower false: %a
  var %a | noop $hfind(test,z01*,0,w,inc %a $hget(test,$1)) | echo -a upper false and lower .true: %a
  echo -ag *isupper() says false to char output by *$upper(): $hget(test,uplow.$isupper.says.output.of.$upper.is.$false)
  echo -ag *islower() says false to char output by *$lower(): $hget(test,uplow.$islower.says.output.of.$lower.is.$false)
  echo -a ====
  echo    -ag isupper|islower|isupper(upper(char))|islower(lower(char))|upper(char)==lower(char)
  echo    -ag 10110 normal uppercase: $hget(test,z10110)
  echo    -ag 01110 normal lowercase: $hget(test,z01110)
  echo    -ag 11111 normal non-alpha, isupper() and islower() both $true and upper(char)===lower(char): $hget(test,z11111)
  echo    -ag 00001 isupper() & islower() both say $false when upper(char) === lower(char): $hget(test,z00001)
  echo  7 -ag 10101 how can isupper(char) be true and islower(char) be false yet upper(char) is same as lower(char): $hget(test,z10101)
  echo 13 -ag 01011 how can islower(char) be true and isupper(char) be false yet upper(char) is same as lower(char): $hget(test,z01011)
  echo  4 -ag 00000 how can isupper(char) islower(char) isupper($upper(char)) islower($lower(char)) all be false when upper(char) !== lower(char): $hget(test,z00000)
  echo  5 -ag 11110 how can isupper(char) islower(char) isupper($upper(char)) islower($lower(char)) all be $true when upper(char) !== lower(char): $hget(test,z11110)
}

Re: Unicode $upper $lower $isupper $islower [Re: maroon] #267798 30/09/20 06:45 AM
Joined: Dec 2002
Posts: 4,841
Khaled Offline
Hoopy frood
Thanks for your bug report. This has been discussed before. Everything related to how strings are processed, displayed, etc. will need to be changed to a completely different set of APIs. It is a huge job that will require a lot of code rewriting, testing, and so on. This is on my to-do list.

That said, as you are looking into this, it would be helpful if you could research Windows string handling/classifying APIs and how they behave, eg. relating to upper/lower case, when it comes to characters in different Unicode ranges. I haven't looked into it recently but know from past research that these APIs often behave in ways that are not expected.

Re: Unicode $upper $lower $isupper $islower [Re: Khaled] #267799 30/09/20 09:04 AM
Joined: Aug 2003
Posts: 284
Protopia Offline
Fjord artisan
A couple of years ago I did look into mIRC support for Unicode, and the differences between mIRC's use of UCS-2 (which is a subset of full unicode UTF-16) and full Unicode, and I started to write a utility script which imported a unicode definition file and used that to provide a full range of Unicode identifiers. But I had to stop work on it due to other real-life priorities, and haven't had a chance to go back to it - but it does give me an insight into the difficulties in this area.

To fix this properly would require mIRC to switch its strings from UCS-2 to UTF-16 - and this might have some general backward compatibility issues.

So I imagine that this is a really non-trivial change.

That said, I would be happy to share either my design thoughts for the identifiers I was planning to provide or even the embryonic code I created if Khaled or anyone else is interested.

Re: Unicode $upper $lower $isupper $islower [Re: maroon] #267822 06/10/20 08:02 PM
Joined: Dec 2002
Posts: 4,841
Khaled Offline
Hoopy frood
I have made a few changes in the next beta that co-ordinate CRT vs API calls relating to the lower/upper case identifiers you used in your examples.

These resolve some of the differences you point out, however the Windows APIs are still classifying many characters in the way you describe above. You will need to look into this further to determine why this is the case. The best I can do is to use the APIs provided.

Regarding characters like the German Eszett, note that Unicode can be asymmetric. There is no guarantee that converting a letter from lower to upper to lower case will result in the same letter.
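The asymmetry is easy to see with a lower-to-upper-to-lower round trip. The sketch below uses Python's default Unicode mappings, not the Windows APIs, and `upper_then_lower` is a made-up name.

```python
def upper_then_lower(ch: str) -> str:
    """Convert to uppercase and back to lowercase using default mappings."""
    return ch.upper().lower()


samples = {
    "a": "plain ASCII letter",
    "\u00df": "German Eszett",
    "\u017f": "Latin small letter long s",
    "\u0131": "Latin small letter dotless i",
}

for ch, label in samples.items():
    back = upper_then_lower(ch)
    same = "round-trips" if back == ch else "does NOT round-trip"
    print(f"U+{ord(ch):04X} {label}: {same}")
```

Of these four, only the ASCII letter comes back unchanged. The Eszett returns as 'ss', the long s returns as an ordinary 's', and the dotless i returns as a dotted 'i'.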

To make matters more complicated, correct mapping of some unicode characters/ranges depends on locale as well as transformation options, eg. see LCMapStringEx(), so there is a lot more to it than just lower/upper case.
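The classic example of locale dependence is Turkish, where 'I' lowercases to dotless 'ı' and dotted 'İ' lowercases to 'i'. The sketch below is a hand-written illustration of that one rule, not a wrapper around LCMapStringEx() and not a complete Turkic implementation; `lower_turkic` is a made-up name.

```python
def lower_turkic(text: str) -> str:
    """Lowercase text using the Turkish/Azerbaijani rule for the letter I.

    Illustration only: handles the dotted/dotless I pair by hand and
    leaves everything else to the default, locale-independent mapping.
    """
    text = text.replace("I", "\u0131")   # I -> dotless i
    text = text.replace("\u0130", "i")   # dotted capital I -> i
    return text.lower()


word = "ISTANBUL"
print(word.lower())        # default mapping
print(lower_turkic(word))  # Turkic mapping of the same input
```

The same input therefore has two defensible lowercase forms, and which one is "correct" cannot be decided from the codepoint alone.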

In addition, although mIRC uses UTF-16, and Windows itself uses UTF-16, which means API calls generally handle surrogate pairs/planes, that does not mean these are handled in all contexts. While mIRC was changed to use UTF-16, there are many places where surrogate pairs can be split while parsing text, which is where work still needs to be done.
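To illustrate what splitting a surrogate pair means, here is a small Python sketch that works on UTF-16 code units. It does not reflect mIRC's internal parsing code; `utf16_units` and `safe_cut` are made-up helpers.

```python
import struct


def utf16_units(text: str) -> list:
    """Return the UTF-16 code units (16-bit integers) that encode text."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def safe_cut(units: list, i: int) -> int:
    """Return a cut index that does not separate a surrogate pair.

    If index i falls between a high surrogate (0xD800-0xDBFF) and the
    low surrogate (0xDC00-0xDFFF) that follows it, move the cut one
    unit to the left so the pair stays together.
    """
    if 0 < i < len(units):
        high = 0xD800 <= units[i - 1] <= 0xDBFF
        low = 0xDC00 <= units[i] <= 0xDFFF
        if high and low:
            return i - 1
    return i


units = utf16_units("a\U0001F600b")   # 'a', U+1F600, 'b'
print([hex(u) for u in units])
print(safe_cut(units, 2))             # index 2 would split the pair
```

A codepoint above 65535 takes two code units, both in the 55296-57343 range. Any code that cuts a string at an arbitrary unit index can leave half a pair behind unless it checks for this.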

Re: Unicode $upper $lower $isupper $islower [Re: Khaled] #267823 06/10/20 08:19 PM
Joined: Aug 2003
Posts: 284
Protopia Offline
Fjord artisan
I would say that you have two choices:

1. Use the Windows APIs with all the idiosyncrasies that they have as described above; or
2. Use someone else's Unicode libraries - perhaps those from the Unicode organisation itself (see http://site.icu-project.org ) which can be considered definitive.

Personally I would tend to go for option 2. because:

a. Unicode is not static and these libraries will be maintained in line with Unicode development;
b. These libraries are O/S independent - I know that mIRC is Windows only, but since it talks with other IRC clients, using a platform independent library, particularly the definitive platform independent library, is likely to be more compatible in the end.
c. These libraries are open source and less likely to have idiosyncrasies than Windows.

I am not sure what libraries Python uses to deliver its Unicode support, but I don't hear a lot of stories about Python having idiosyncrasies - indeed, Python's implementation is perhaps a model of how to do it right.