I have a university assignment I need some help with. Please don't give me the solution; hints or small portions of code would be appreciated.

So, my university project is all about Unicode. To be exact, I have to write code that takes character input in UTF-16 format, converts it to UTF-8, and writes it to the appropriate output (the terminal or a file.txt), while also doing the following:
1) Don't use arrays
2) Use putchar
3) Use getchar

Note: I am in my second year there, but it would be best if I did not use pointers and scanf.

I'd rather not post code unless necessary, in case my professor is watching the forums.
Here's my start:
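C
#include <stdio.h>

int main(void)
{
	int byte1, byte2;		/* getchar() returns int so EOF can be detected */
	unsigned long char1, char2;	/* at least 32 bits: code points go up to 0x10FFFF */

	/* Read two bytes per UTF-16 code unit (big-endian assumed); test for EOF before combining. */
	while ((byte1 = getchar()) != EOF && (byte2 = getchar()) != EOF)
	{
		char1 = ((unsigned long)byte1 << 8) | (unsigned long)byte2;

		if (char1 >= 0xD800 && char1 <= 0xDBFF) {
			/* High surrogate: the next code unit should be a low surrogate. */
			byte1 = getchar();
			byte2 = getchar();
			if (byte1 == EOF || byte2 == EOF)
				break;	/* input ended in the middle of a pair */
			char2 = ((unsigned long)byte1 << 8) | (unsigned long)byte2;

			if (char2 >= 0xDC00 && char2 <= 0xDFFF)
			{
				char1 -= 0xD800;
				char2 -= 0xDC00;
				char1 <<= 10;
				char1 += char2;
				char1 += 0x010000;
				//write code that converts to utf 8
			}
			/* else: unpaired high surrogate, invalid input */
		}
		else if (char1 <= 0xD7FF || char1 >= 0xE000) {
			//write code that converts to utf 8
		}
		/* else: lone low surrogate (0xDC00..0xDFFF), invalid input */
	}
	return 0;
}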

Is my code up to this point correct? Is the shifting right? If not, please explain how I could make it work.
Updated 4-Dec-15 9:49am
Comments
[no name] 4-Dec-15 14:56pm    
Only a note:
"that takes character input in utf-16 format,"

If you are using .NET, it is very hard to get a char that is anything other than Unicode encoded as UTF-16.
Kobayashi Porcelain 4-Dec-15 14:58pm    
I am talking about using a file.txt as input through unix commands. Terminal style. But ok sure :)
[no name] 4-Dec-15 14:59pm    
Ok. Thanks for this, now it makes much more sense ;)

1 solution

First, check how your objects named char… are declared: int is only guaranteed to be at least 16 bits wide. You need to do all the calculations on 32-bit unsigned integers; anything narrower is not big enough to represent a code point beyond the BMP.

I did not check the UTF-16 part in detail, but at least one piece is missing: there should be two different branches, one for UTF-16LE and another for UTF-16BE. The two differ only in the order of the two bytes inside each 16-bit code unit; the order of the code units themselves is the same, so the high (lead) surrogate always comes before the low (trail) one. In each case, you first assemble a 16-bit code unit, check whether it starts a surrogate pair, and if so calculate your internal representation of the code point from the pair, in the form of an unsigned 32-bit integer. Any other code unit is a code point by itself, and its unsigned integer interpretation is arithmetically equal to the code point value. Please see:
https://en.wikipedia.org/wiki/Endianness,
https://en.wikipedia.org/wiki/UTF-16.
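
As a small hint for the two branches (a sketch only, not the whole assignment): the helper names make_unit and combine_surrogates and the big_endian flag below are made up for illustration. The first helper shows that only the byte order inside a code unit changes between UTF-16BE and UTF-16LE; the second applies the surrogate-pair arithmetic.

```c
/* Hypothetical helper: build one 16-bit code unit from two bytes.
   big_endian != 0 means the first byte is the high-order byte (UTF-16BE);
   big_endian == 0 means the first byte is the low-order byte (UTF-16LE). */
unsigned long make_unit(int first, int second, int big_endian)
{
	unsigned long a = (unsigned long)first & 0xFF;
	unsigned long b = (unsigned long)second & 0xFF;

	return big_endian ? ((a << 8) | b) : ((b << 8) | a);
}

/* Hypothetical helper: combine a high surrogate (0xD800..0xDBFF) and a
   low surrogate (0xDC00..0xDFFF) into a code point in 0x10000..0x10FFFF.
   The caller is assumed to have checked both ranges already. */
unsigned long combine_surrogates(unsigned long high, unsigned long low)
{
	return 0x10000UL + ((high - 0xD800UL) << 10) + (low - 0xDC00UL);
}
```

For example, the bytes D8 3D DE 00 in UTF-16BE (or 3D D8 00 DE in UTF-16LE) give the code units 0xD83D and 0xDE00, which combine to U+1F600.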

The goal of the first stage is to interpret the UTF-16 encoding character by character, with each character represented as a 32-bit unsigned value that is arithmetically equal to its code point. Here, you need to realize that Unicode code points are a mathematical abstraction: they are just integer values, independent of their bitwise representation or of any other computer representation.

Now, UTF-8 is also a variable-width encoding. It uses a pretty cunning algorithm with very low redundancy. It is fully described, for example, here: https://en.wikipedia.org/wiki/UTF-8.

Just follow the algorithm description. I don't think it's anything too complicated.
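
To make the algorithm a little more concrete, here is a minimal sketch that respects the "putchar only, no arrays" constraints. The function name put_utf8 and its return convention (number of bytes written, 0 for a value that cannot be encoded) are assumptions made for this example, not part of the assignment.

```c
#include <stdio.h>

/* Hypothetical helper: write the code point cp as UTF-8 with putchar.
   Returns the number of bytes written, or 0 if cp is a surrogate or
   lies beyond U+10FFFF (nothing is written in that case). */
int put_utf8(unsigned long cp)
{
	if (cp <= 0x7F) {			/* 0xxxxxxx */
		putchar((int)cp);
		return 1;
	}
	if (cp <= 0x7FF) {			/* 110xxxxx 10xxxxxx */
		putchar((int)(0xC0 | (cp >> 6)));
		putchar((int)(0x80 | (cp & 0x3F)));
		return 2;
	}
	if (cp <= 0xFFFF) {			/* 1110xxxx 10xxxxxx 10xxxxxx */
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0;		/* surrogates are not encodable */
		putchar((int)(0xE0 | (cp >> 12)));
		putchar((int)(0x80 | ((cp >> 6) & 0x3F)));
		putchar((int)(0x80 | (cp & 0x3F)));
		return 3;
	}
	if (cp <= 0x10FFFF) {			/* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
		putchar((int)(0xF0 | (cp >> 18)));
		putchar((int)(0x80 | ((cp >> 12) & 0x3F)));
		putchar((int)(0x80 | ((cp >> 6) & 0x3F)));
		putchar((int)(0x80 | (cp & 0x3F)));
		return 4;
	}
	return 0;				/* beyond the Unicode range */
}
```

With this sketch, put_utf8(0x20AC) writes the three bytes E2 82 AC, and put_utf8(0x1F600) writes the four bytes F0 9F 98 80.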

There is another feature of UTF-16 and UTF-8 streams: the BOM (byte order mark). This marker is optional, so you need to decide what to do with text that lacks it. You can refuse to process input without the marker, or you can provide another function where the expected encoding is specified. That is up to your design. Please see: http://unicode.org/faq/utf_bom.html.
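
A rough sketch of the BOM check for UTF-16 input follows. The name detect_bom and the BOM_* constants are invented for this example; the byte values are the standard ones (FE FF for big-endian, FF FE for little-endian).

```c
/* Hypothetical result codes for this sketch. */
#define BOM_NONE 0	/* no UTF-16 BOM: the two bytes are ordinary data */
#define BOM_BE   1	/* FE FF: UTF-16, big-endian */
#define BOM_LE   2	/* FF FE: UTF-16, little-endian */

/* Hypothetical helper: classify the first two bytes of the input.
   first and second are the values returned by the first two getchar() calls. */
int detect_bom(int first, int second)
{
	if (first == 0xFE && second == 0xFF)
		return BOM_BE;
	if (first == 0xFF && second == 0xFE)
		return BOM_LE;
	return BOM_NONE;
}
```

One thing to watch for in this design: when the result is BOM_NONE, the two bytes already read are the first code unit of the text, so they must be converted rather than thrown away.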

And finally, one delicate point: both encodings allow invalid sequences. In your particular problem, UTF-8 is never a source, so all the problems you may have are with UTF-16. If, for example, you meet the second member of a surrogate pair before the first one, this is invalid data. If you have only one member of a surrogate pair surrounded by non-surrogate words, this is invalid data too. So, you have to decide what to do in such cases; this is simply a deliberate decision that belongs to your design.
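
Just to illustrate one possible policy (an assumption of this sketch, not a requirement): replace every unpaired surrogate with U+FFFD, the Unicode replacement character. The names pair_is_valid and decode_or_replace and the have_next flag are hypothetical.

```c
#define REPLACEMENT_CHAR 0xFFFDUL	/* U+FFFD REPLACEMENT CHARACTER */

int is_high_surrogate(unsigned long u) { return u >= 0xD800 && u <= 0xDBFF; }
int is_low_surrogate(unsigned long u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* True when unit and the code unit after it form a proper pair.
   have_next is 0 when the input ended right after unit. */
int pair_is_valid(unsigned long unit, int have_next, unsigned long next)
{
	return is_high_surrogate(unit) && have_next && is_low_surrogate(next);
}

/* Hypothetical helper: return the code point for unit, looking at the
   following code unit only when needed. Lone surrogates of either kind
   become U+FFFD. The caller consumes next only if pair_is_valid() is true. */
unsigned long decode_or_replace(unsigned long unit, int have_next, unsigned long next)
{
	if (pair_is_valid(unit, have_next, next))
		return 0x10000UL + ((unit - 0xD800UL) << 10) + (next - 0xDC00UL);
	if (is_high_surrogate(unit) || is_low_surrogate(unit))
		return REPLACEMENT_CHAR;	/* invalid: unpaired surrogate */
	return unit;				/* ordinary BMP code point */
}
```

Stopping with an error message, or silently skipping the bad code unit, would be equally legitimate designs; the point is only that the choice is made on purpose.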

I hope I did all you wanted: no code, but now you have all the ins and outs. Is it clear?

—SA
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


