Page 2 of 2
Khaled #230978 26/03/11 11:59 PM
Joined: Oct 2003
Posts: 3,918
Hoopy frood
Originally Posted By: Khaled
It has been a while since I researched this but if I recall correctly, the only way to determine the encoding format of a file without a BOM is to scan the contents of the file and to analyze it.


Well, there is no reliable way to detect the proper encoding of a UTF-16 encoded file without a BOM, and by extension there is no way to detect that a file is UTF-16 at all without one. (Similarly, even the BOM itself isn't always a valid way to detect an encoding format.)
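
To make the ambiguity concrete, here is a small Python sketch (Python only because it is easy to run; none of this is mIRC code). It shows the same BOM-less bytes decoding cleanly under both byte orders, plus one known case where a BOM prefix is itself ambiguous. Whether that is the BOM problem meant above is an assumption.

```python
import codecs

# "hi" as UTF-16LE with no BOM: bytes 68 00 69 00.
raw = "hi".encode("utf-16-le")

as_le = raw.decode("utf-16-le")  # the intended text
as_be = raw.decode("utf-16-be")  # also succeeds, yielding U+6800 U+6900

# Both decodes are valid, so byte-level validity cannot pick the byte order.
both_valid = as_le == "hi" and as_be == "\u6800\u6900"

# One known BOM ambiguity: the UTF-32LE BOM (FF FE 00 00) begins with
# the UTF-16LE BOM (FF FE).
bom_clash = codecs.BOM_UTF32_LE.startswith(codecs.BOM_UTF16_LE)
```

A detector would have to fall back on statistics about the content, which is exactly the scanning cost being argued about here.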

The auto-detection method you propose would be extremely slow, since it could potentially scan the entire file and would need to do this for every file. For most files (the non-UTF-16 ones) you would always be scanning ~1000 chars before reading anything, every single time.

That's why I proposed an extra switch in all $read/$fread commands to force a specific encoding. Auto-detection is basically impossible (or at least impractical), so the scripter should have to tell mIRC in these cases. Telling the runtime which encoding to read a file as is common practice in languages with robust encoding support. It's fine to assume UTF-8 as the default, and fair to allow basic auto-detection, but for certain encodings we need a way to enforce this manually.
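
As a rough illustration of that policy (explicit encoding wins, then a cheap BOM check, then a UTF-8 default), here is a minimal Python sketch. The function name `decode_text` and the BOM table are hypothetical and only mirror the encodings discussed in this thread; this is not how mIRC works.

```python
import codecs
from typing import Optional

# Hypothetical BOM table covering only the encodings mentioned in the thread.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode_text(data: bytes, encoding: Optional[str] = None) -> str:
    """Decode file bytes, preferring a caller-supplied encoding."""
    if encoding is not None:
        # The caller said what the file is: no scanning, no guessing.
        return data.decode(encoding)
    for bom, name in _BOMS:
        if data.startswith(bom):
            # Basic auto-detection: look at the first 2-3 bytes only,
            # then decode the rest without the BOM.
            return data[len(bom):].decode(name)
    # No hint and no BOM: assume the default.
    return data.decode("utf-8")
```

The point of the ordering is that a BOM-less UTF-16 file is only ever read correctly when the caller passes the encoding, and no file pays for a content scan.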


- argv[0] on EFnet #mIRC
- "Life is a pointer to an integer without a cast"
argv0 #230979 27/03/11 12:05 AM
Joined: Oct 2003
Posts: 3,918
Hoopy frood
Note that having an encoding switch leaves room for adding back support for other encodings in the future, at least for reading such files from the FS. I'm not saying we should open up a debate about supporting codepages again, but it allows for this possibility if, in the future, it becomes easier to support other encodings.

Basically, the encoding parameter could be undefined for any value other than utf8, ascii, utf16(bom?), utf16le and utf16be for now. In the future, you could add things like sjis, etc.

One way to add this would be to attach the parameter to the end of $read (and $fread too):

//echo -a $read(file.txt, wn, *foo*, 1, utf16le)

Since the current last param is [N] (numeric), you can autodetect when it is present, so we could also have:

//echo -a $read(file.txt, wn, *foo*, utf16le)

And of course everything should get transcoded to UTF-8 internally (or whatever mIRC's internal rep is).
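
Here is a small Python sketch of that argument handling, purely as an illustration: a trailing numeric argument is taken as [N], anything else must be one of the encoding names listed above, and the file bytes are then transcoded to UTF-8. The helper names are hypothetical, and raising an error for unknown names is just one way to treat the "undefined" case.

```python
# Encoding names from the list above, mapped to Python codec names.
# "utf16" stands in for the "utf16(bom?)" variant, where a BOM picks the byte order.
ENCODINGS = {
    "utf8": "utf-8",
    "ascii": "ascii",
    "utf16": "utf-16",
    "utf16le": "utf-16-le",
    "utf16be": "utf-16-be",
}

def split_trailing_args(args):
    """Return (n, encoding) from the optional trailing arguments [N] [encoding]."""
    n, encoding = None, None
    for arg in args:
        if arg.isdigit():
            # Numeric, so it can only be the existing [N] parameter.
            n = int(arg)
        elif arg in ENCODINGS:
            encoding = arg
        else:
            # Assumption: reject names outside the supported set.
            raise ValueError(f"unsupported encoding: {arg}")
    return n, encoding

def to_utf8(data: bytes, encoding: str) -> bytes:
    """Transcode file bytes from the named encoding to UTF-8."""
    return data.decode(ENCODINGS[encoding]).encode("utf-8")
```

Both call shapes above then parse the same way: `split_trailing_args(["1", "utf16le"])` and `split_trailing_args(["utf16le"])` differ only in whether N is set.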


- argv[0] on EFnet #mIRC
- "Life is a pointer to an integer without a cast"
argv0 #231029 29/03/11 02:26 PM
Joined: Apr 2003
Posts: 342
Fjord artisan
/fopen should carry the encoding switch. $read can detect the encoding automatically, or at least make a good guess. $read always reads the file up to the requested line, and it reads the entire file if no line is specified.

There is currently no way to select the OUTPUT encoding with /write or /fwrite. With /fopen, though, /fwrite should use the encoding that was specified when the file was opened.

/fopen -e <encoding> <handle> <file path>
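
A toy Python model of that idea, assuming each handle simply remembers the encoding it was opened with and uses it for both reading and writing. The class and method names are made up for the sketch and only loosely echo the mIRC commands.

```python
import os
import tempfile

class Handles:
    """Toy handle table: each name remembers the encoding given at open time."""

    def __init__(self):
        self._open = {}

    def fopen(self, name, path, encoding="utf-8"):
        # Binary mode, so all encoding work is explicit and per handle.
        f = open(path, "a+b")
        self._open[name] = (f, encoding)

    def fwrite(self, name, text):
        # Output uses the encoding chosen when the handle was opened.
        f, encoding = self._open[name]
        f.write(text.encode(encoding))

    def fread_all(self, name):
        f, encoding = self._open[name]
        f.seek(0)
        return f.read().decode(encoding)

    def fclose(self, name):
        f, _ = self._open.pop(name)
        f.close()

# Usage: write and read back through a UTF-16LE handle.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
h = Handles()
h.fopen("demo", path, "utf-16-le")
h.fwrite("demo", "foo\n")
text = h.fread_all("demo")
h.fclose("demo")
with open(path, "rb") as f:
    on_disk = f.read()
```

Tying the encoding to the handle means the write side never needs its own switch, which covers the missing output-encoding case for /fwrite but not for /write.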


Beware of MeStinkBAD! He knows more than he actually does!
Joined: Dec 2002
Posts: 2,962
Hoopy frood
This really needs to be posted as a feature suggestion since this is nothing at all to do with the original bug report.


Spelling mistakes, grammatical errors, and stupid comments are intentional.