|
My remark was made mostly tongue-in-cheek (hence the smiley after it). Of course the length of the rendered text is of interest in many/most applications. It's just that, luckily, I don't have to worry about it because people who write the nitty-gritty of UI have taken care of it. For instance, in Windows, I can just call GetTextExtentPoint32[^] function to have the text measured.
However this has noting to do with UTF-8 vs UTF-16. I remain of the opinion that UTF-16 has no particular advantage when compared with UTF-8. (If there are other readers of this conversation, please don't start a flame war now - this is just a personal opinion). I see UTF-16 as a stepping stone when computing world needed to move away from ASCII, but in this day and age, it has served its purpose and we can move away to something better.
Mircea
|
|
|
|
|
Mircea Neacsu wrote: However this has noting to do with UTF-8 vs UTF-16 That is certainly true.
Mircea Neacsu wrote: I remain of the opinion that UTF-16 has no particular advantage when compared with UTF-8. I am leaning towards agreeing with you.
Mostly, I am observing - and has been observing for 40+ years - that people strive for non-Einstein solutions, "Make it as simple as possible, but no simpler". People want to do it simpler! For years, I heard lots of people say that 32 bits is overkill, Unicode will never grow beyond the first plane, BMP - there isn't anywhere close to 65,536 different characters! And for a number of years, they were right: Unicode did manage with the basic plane only.
That is when people started using 16 bit characters, although I am not sure that the name UTF-16 was know that early. With BMP only, most simple(r than possible) developers thought it quite simple; a string of 16-bit characters was just like a string of 8-bit characters, only with more characters. (Look at the History section of Wikipedia: Unicode[^] - even the initial developers of Unicode argued the same!)
If it had ended up that way, it would have been significantly simpler: You can count the number of characters as easily in 16 bit as in 8 bit character code. You can index character 23 by string8[23] or string16[23]. In other words: I can fully understand why Windows NT (1993) and Java (1995) went for 16 bit characters. (At the time of Windows NT release, UTF-8 had been proposed, but was not yet accepted as a standard - anyway, you don't change the system character encoding from 16 bits fixed to n*8 bits a few weeks before the release of a new OS!)
As we all know now, the solution was simpler than possible. Several of my coworkers were highly surprised when BMP overflowed, but didn't worry: We are never going to encounter those characters in the entire lifetime of our software! I think that they for at least ten more years continued to access character 23 by string16[23]. I can understand them. Until we got emojis in other planes, they were essentially right.
But it was a too simple solution. When you were forced to handle multiple planes, and maybe you at the same time discovered combining and non-spacing codes, then the simplicity disappeared. You have all the same issues with UTF-8; it is not any worse with UTF-16, and in Western text, the special cases occur rarely. Most of the time, UTF-16 is more straightforward, but you have to be prepared for the exceptions. With UTF-8, you can never relax; you handle variable length characters all the time! (At least if you regularly write non-English text, which is the common case in most European countries.)
If UTF-8 didn't exist, I would be happy with sending UTF-16 memory strings straight to file. Having UTF-8 as an alternative in-memory format creates trouble; I want one single unambiguous string format. Now that Windows, Java and C# both use UTF-16, I am not going to start using UTF-8 in-memory.
But I also want to have one singe unambiguous file format. UTF-8 is established, UTF-16 is not. So UTF-8 wins. I am stressing: Don't waste your time trying to process UTF-xxx yourself; use library functions. So when I read text from or write text to file, I let library functions process the strings. Each format has its use.
After all, I guess I really disagree with you: If we start with a tabula rasa, but we are to select The One And Only Encoding, UTF-16 and UTF-8 are equally good. But that isn't the situation in memory: Windows and numerous other essential tools/subsystems have based themselves on UTF-16. Given that, using UTF-8 in my application strings would introduce a lot of complexities. So accepting the realities of life, my programs will continue to use UTF-16 strings.
Until, of course, I start working with an OS having UTF-8 as its system string encoding and languages/tools that use UTF-8 as their in-memory string encoding.
Religious freedom is the freedom to say that two plus two make five.
|
|
|
|
|
trønderen wrote: After all, I guess I really disagree with you World would be too boring if we wouldn't have different opinions
trønderen wrote: I want one single unambiguous string format. You aren't going to get it, or at least not in this lifetime . If you go to Linux or Mac worlds, everything is UTF-8. In Windows world it's UTF-16 with a sprinkle of UTF-8.
trønderen wrote: But I also want to have one singe unambiguous file format. UTF-8 is established, UTF-16 is not. So UTF-8 wins. If I understand you correctly, you suggest having UTF-8 files converted to UTF-16 on entry, processed as UTF-16 inside the application and converted back to UTF-8 on output. That would complicate things very much if you target different OS-es. It would also be inefficient if your app doesn't require the UTF-16 parts of the OS (ReadFile and WriteFile functions in Windows work with any encoding).
My strategy is almost a mirror image of that: Everything is UTF-8 until it needs to call certain OS functions when a thin wrapper converts all inputs to UTF-16 and all results back to UTF-8.
Mircea
|
|
|
|
|
Mircea Neacsu wrote: If I understand you correctly, you suggest having UTF-8 files converted to UTF-16 on entry, processed as UTF-16 inside the application and converted back to UTF-8 on output. A simple UTF-8/16 conversion filter (included by default) in the StreamReader/Writer, or whatever your IO classes are named certainly does not complicate things for the developer.
For interaction with other systems, whether they use UTF-8, UTF-32 or UTF-16 as a working format, UTF-8 is the lingua franca, the Esperanto of textual information. It is The File Character Encoding. No application needs to be concerned about it.
That would complicate things very much if you target different OS-es. If your alternative is to reject any OS that does not use UTF-8 as its system character encoding, and all programming languages, libraries and development environments that does not use UTF-8, you are most certainly right. In that situation, I would certainly go for UTF-8 in-memory as well.
For a great number of developers, in-memory UTF-8 isn't a viable option. UTF-16-oriented OSes, languages and tools are a fact of life. I say; When in Rome, roam with the Romans (or however they saying goes). Even though a lot of developers of drivers and interrupt handlers and low-level network protocols with only rudimentary textual output work in *nix-like environments, the great majority of those communicating textually, with users and others, roam Rome in UTF-16 environments.
If you are talking about making applications that can be ported between UTF-16 and UTF-8 oriented OSes without a single source code change in the string handling, and your code assuming UTF-8 strings even under an APIs expecting UTF-16, then you are overly optimistic. You will have to do a lot of adaptations to UTF-16-oriented library and system functions. Or wrap every single one of them in a two-way-conversion wrapper.
The best way to handle it is to leave all string handling to library functions, you application knowing nothing about the encoding under the hood. (Your floating point application would probably run fine on a machine with an FP format different from IEEE-754!). Treat a string as a string, regardless of encoding. Make sure that when you handle characters individually, you use a char32 to hold them.
If you go to Linux or Mac worlds, everything is UTF-8. Most certainly not. Well, I never worked with Mac, but in Linux there are loads of software that can't handle anything but 8 bit character sets. You can even come across those that cannot handle 8 bit, but only 7 bit characters. A few years ago, I was editing a configuration file on a Unix system; that network module crashed immediately because I had added a comment which contained a non-ASCII 8859 character (in the name of one maintainer).
A number of RFC-822 based internet protocols still cannot handle ISO 8859 (it would be against RFC-822). There still is a real need for QP encoding, backslash or ampersand encoding, etc.
You may rightfully say that *nix/Mac apps written to handle UTF-8 does handle UTF-8. Big surprise. You may claim that languages/tools specifying UTF-16 string representation doesn't use that when running under *nix/Mac - all strings are converted to UTF-8 when ported to these OSes, both in source code and library APIs. I doubt very much that that is the case.
Religious freedom is the freedom to say that two plus two make five.
|
|
|
|
|
Mircea Neacsu wrote: My strategy is almost a mirror image of that: Everything is UTF-8 until it needs to call certain OS functions when a thin wrapper converts all inputs to UTF-16 and all results back to UTF-8. Mine is exactly the same; which I thought I had explained earlier. I guess not clearly enough.
|
|
|
|
|
Sorry Richard, indeed I didn't notice that. It's been a rather busy weekend
Mircea
|
|
|
|
|
trønderen wrote: If you just send it the entire string, leaving to the display unit to discard what won't fit, for one: You may upset the display device.
To be fair however your example is suggesting that developer has no idea what the business space even is.
So for example if I expect the uses to use a PC with a larger display, and they use a phone, then it is not going to work very well. But if I expect them to use a phone also then I should account for that and test/develop for that also.
trønderen wrote: This software may put restrictions on the lengths of both prompt strings and data values.
That isn't true. Such data is always limited. Nothing in computing is unlimited. Doesn't matter how it is used. If a developer is not considering that from initial creation then that will cause problems. Display problems is just one case. It is similar to a very naive developer that designs a database and makes every text field into a blob just on the chance that the extra space might be needed.
|
|
|
|
|
I read your article on the use of UTF-8, but I have a point of (very slight) disagreement. The majority of Windows API calls have both UNICODE and ASCII versions, and the C and C++ runtimes are very much UTF-8 biased. So there is no need to make your projects UNICODE based, as there are only a few instances when you are forced to use Unicode strings. I used to write my applications so that I could easily build them either way, but gave that up and now manage the conversions for the odd functions that force me to use Unicode text.
|
|
|
|
|
Richard MacCutchan wrote: The majority of Windows API calls have both UNICODE and ASCII versions, Indeed, but the question is: what happens when you call an ANSI function, let's say MessageBoxA with a text that contains characters outside the 0-128 range? The result depends on the active code page setting. Until recently (2019) the programmer had no control on what ACP user has selected. He could only query the this setting using the GetACP function.
In newer versions of Windows, you can declare the ACP page you want to use using the app manifest. If you declare UTF-8 code page, you can use UTF-8 with the ANSI functions.
For more details see Use UTF-8 code pages in Windows apps - Windows apps | Microsoft Learn[^]
Mircea
|
|
|
|
|
Mircea Neacsu wrote: what happens when ... Assuming the developers understand the customers' requirements* this will have been catered for.
*which is probably on the toss of a coin
|
|
|
|
|
Mircea Neacsu wrote: Indeed, but the question is: what happens when you call an ANSI function, let's say MessageBoxA with a text that contains characters outside the 0-128 range? ANSI always was 8-bit, 0-255, ISO 8859-x, 'x' varying with the region. Almost all the Western world had their needs covered by 8859-1. (Lapps were one exception; they used 8859-8, if my memory is correct.)
Religious freedom is the freedom to say that two plus two make five.
|
|
|
|
|
trønderen wrote: Almost all the Western world had their needs covered by 8859-1 Funny you say that: I cannot even properly write my last name in 8859-1. It is written "Neacşu". I have to go to Windows-1252 for that. The Wikipedia page for ISO/IEC 8859-1[^] lists also other languages that are not fully covered.
Anyway, at this point, I think we should agree to disagree as I'll continue to keep everything in UTF-8 inside my programs.
Mircea
|
|
|
|
|
trønderen wrote: But a new situation has arisen: Emojis have procreated to a number far exceeding the number of WinDings. They do not all fit in BMP
I did not even consider that as a possibility.
So make sure I add something that says "no emojis!".
|
|
|
|
|
|
I understand. Thank you guys
|
|
|
|
|
You are welcome.
"In testa che avete, Signor di Ceprano?"
-- Rigoletto
|
|
|
|
|
Did *nix, or C under any other OS, ever run on a machine with a non-binary word size, like 12, 18 or 36 bits? Such as DEC-10 / DEC-20 or Univac 1100 series. I never heard of any, but I'd expect that it was done. Univac could either operate with 6 bit FIELDATA bytes or 9 bit bytes. DEC put five 7 bit bytes into each word, with one bit to spare.
If C was run on such machines, how was char handled? Enforcing 8 bits, making the strings incompatible with other programming languages? Or did C surrender to 6 or 7 bit bytes? If they stuck to 8 bits, did each word on a 36 bit machine have 4 bits to spare? (2 bits on an 18 bit machine) Or did they fit 4.5 char a word - 9 char per two words? (On Univac, you could address 6 or 9 bit bytes semi-directly, using string instructions, without the need for shifting and masking.)
By old definition, both architectures had a char size of one byte. The old understanding of 'byte' was the space required to store a single character; it could vary from 5 bits (Baudot code) to at least 9 (Univac and others) bits. Lots of international standards (developed outside of internet environments!) use the term 'octet' for 8 bits to avoid confusion with non-binary word/byte sizes.
Religious freedom is the freedom to say that two plus two make five.
|
|
|
|
|
|
As best as I can remember, C only requires that type char be large enough to hold any character in the current (system) encoding. That is, if the CPU has 7, 8, or 9 bit "bytes", then a C char would also be 7, 8 or 9 bits.
"A little song, a little dance, a little seltzer down your pants"
Chuckles the clown
|
|
|
|
|
|
Voluntarily removed
modified 29-Sep-24 23:07pm.
|
|
|
|
|
|
As I'm sure we all know, there are basically three ways to handle currency values in code.
1) Can store the value as cents in an integer. So, $100.00 would be 10,000. Pro: You don't have to worry about floating point precision issues. Con: Harder to mentally glance at without converting it in your head. Con: Depending on size of integer may significantly reduce available numbers to be stored.
2) Can use a float with a large enough precision. Pro: Easy to read. Con: Rounding issues, precision issues, etc.
3) Can use a struct with a dollars and cents. Pro: Same benefit as integer with no number loss. Pro: Easy to read mentally. Con: Have to convert back and forth or go out of the way for calculations.
Historically, I've always gone with either 1 or 2 with just enough precision to get by. However, I'm working on a financial app and figured... why not live a little.
So, I'm thinking about using 128-bit ints and shift its "offset" by 6, so I can store up to 6 decimal places in the int. For a signed value, this would effectively max me out at ~170 nonillion (170,141,183,460,469,231,731,687,303,715,884.105727). Now, last I checked there's not that much money in the world. But, this will be the only app running on a dedicated machine and using 1GB of RAM is completely ok. So, it's got me thinking...
Here's the question... any of y'all ever use 128-bit ints and did you find them to be incredibly slow compared to 64-bit or is the speed acceptable?
Jeremy Falcon
modified 2-Sep-24 14:37pm.
|
|
|
|
|
If you're not concerned about speed, then the decimal::decimal[32/64/128] numerical types might be of interest. You'd need to do some research on them, though. It's not clear how you go about printing them, for example. Latest Fedora rawhide still chokes on
#include <iostream>
#include <decimal/decimal>
int main()
{
std::decimal::decimal32 x = 1;
std::cout << x << '\n';
} where the compiler produces a shed load of errors at std::cout << x , so the usefulness is doubtful. An alternative might be an arbitrary precision library like gmp
A quick test of a loop adding 1,000,000 random numbers showed very little difference between unsigned long and __uint128_t For unsigned long the loop took 0.0022 seconds, and for __int128_t it took 0.0026 seconds. Slower, but not enough to not consider them as a viable data type. But as with the decimal::decimal types, you would probably have to convert to long long for anything other than basic math.
"A little song, a little dance, a little seltzer down your pants"
Chuckles the clown
|
|
|
|
|
Well, just so you know I'm not using C++ for this. But the ideas are transferable, for instance a decimal type is just a fixed-point number. Which, in theory sounds great, but as you mentioned it's slow given the fact there's no FPU-type hardware support for them.
I was more interested in peeps using 128-bit integers in practice rather than simply looping. I mean ya know, I can write a loop.
While I realize 128-bit ints still have to be broken apart for just about every CPU to work with, I was curious to know if peeps have noticed any huge performance bottlenecks with doing heavy maths with them in a real project.
Not against learning a lib like GMP if I have to, but I think for my purposes I'll stick with ints, in a base 10 fake fixed-point fashion, as they are fast enough. It's only during conversions in and out of my fake fixed-point I'll need to worry about the hit if so.
So the question was just how much slower is 128-bit compared to 64-bit... preferably in practice.
Jeremy Falcon
|
|
|
|
|