The Weird and The Wonderful

   

The Weird and The Wonderful forum is a place to post Coding Horrors, Worst Practices, and the occasional flash of brilliance.

We all come across code that simply boggles the mind. Lazy kludges, embarrassing mistakes, horrid workarounds and developers just not quite getting it. And then some days we come across - or write - the truly sublime.

Post your best, your worst, and your most interesting. But please - no programming questions. This forum is purely for amusement and discussions on code snippets. All actual programming questions will be removed.

 
WTF-8
PIEBALDconsult 17-Jun-24 18:40
Reviewing a CSV file today (which already has inconsistent quotes and such), I noticed it has a few characters (e.g. non-breaking space) encoded as UTF-8 -- fine, no big deal. But they still look odd after decoding... ah, they're encoded as UTF-8 twice... double-UTF-8. D'oh!

So now I have to write a recursive UTF-8 decoder... Sigh. Why doesn't .NET simply do that to begin with? <<== That's a rhetorical question.
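
For anyone who hasn't met double-UTF-8 before, here is a minimal sketch of how it can arise. It assumes the middle step misread the UTF-8 bytes as ISO-8859-1; the post doesn't say which code page was actually involved. `DoubleUtf8Demo` and its method names are made up for illustration.

```csharp
using System;
using System.Text;

static class DoubleUtf8Demo
{
    // Assumption: the bytes were misread as ISO-8859-1 between the two encodings.
    static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Encode as UTF-8, misread those bytes as ISO-8859-1, then encode as UTF-8 again.
    public static byte[] EncodeTwice(string text)
    {
        byte[] once = Encoding.UTF8.GetBytes(text);
        string misread = Latin1.GetString(once);
        return Encoding.UTF8.GetBytes(misread);
    }

    // Peel off exactly one extra layer of UTF-8.
    public static string UndoOneLayer(byte[] bytes)
    {
        string onceDecoded = Encoding.UTF8.GetString(bytes);
        byte[] inner = Latin1.GetBytes(onceDecoded);
        return Encoding.UTF8.GetString(inner);
    }
}
```

Under that assumption, `EncodeTwice("\u00A0")` yields the bytes C3 82 C2 A0 instead of C2 A0. `UndoOneLayer` only round-trips text whose once-decoded characters all fit in a byte; anything above U+00FF would come out as '?' from the ISO-8859-1 step, which is one reason a byte-level decoder is the more robust route.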

It'll be breakfast at Milliways again.


Edit: 4/20 -- I have a working recursive UTF-8 decoding algorithm, in a custom decoder for a custom encoding (derived from the built-in UTF-8 encoding, so encoding should be as per normal).

What was unexpected was that the GetString method of the encoding didn't call the custom Decoder.
I just had a look at the reference source for GetString and I see:

C#
// Returns a string containing the decoded representation of a range of
// bytes in a byte array.
//
// Internally we override this for performance
//
[Pure]
public virtual String GetString(byte[] bytes, int index, int count)
{
    return new String(GetChars(bytes, index, count));
}

Does that mean that it doesn't actually use my decoder?
Shouldn't it call GetDecoder() and use that decoder?
(I'm not experienced at reading the reference source.)
I'll get back to it on Monday.
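
One possible workaround, sketched under assumptions: if GetString really doesn't go through GetDecoder(), the derived encoding could override GetString and route it through the decoder itself. `RoutedUtf8Encoding` and `DecodersCreated` are hypothetical names; the sketch returns the stock UTF-8 decoder where the real class would return the custom one.

```csharp
using System;
using System.Text;

// Hypothetical derived encoding; only the routing is shown.
class RoutedUtf8Encoding : UTF8Encoding
{
    // Lets a caller observe that GetString really went through GetDecoder().
    public int DecodersCreated;

    public override Decoder GetDecoder()
    {
        DecodersCreated++;
        return base.GetDecoder(); // the real class would return the custom decoder here
    }

    public override string GetString(byte[] bytes, int index, int count)
    {
        Decoder decoder = GetDecoder();
        // flush: true, because this call sees the complete input.
        char[] chars = new char[decoder.GetCharCount(bytes, index, count, true)];
        int written = decoder.GetChars(bytes, index, count, chars, 0, true);
        return new string(chars, 0, written);
    }
}
```

Only the three-argument GetString overload is overridden here; whether the other overloads and GetChars end up on the same path depends on the runtime version, so those would need checking separately.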


Edit: 4/21 -- Reading some more about UTF-8 on the 'pedia, I see:

The Unicode Standard requires decoders to "... treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."

and

The standard also recommends replacing each error with the replacement character "�" (U+FFFD).


Which I choose not to do...

These recommendations are not often followed.


But it makes me think that the few U+FFFD characters I see in the file may have begun as unencoded characters which were errantly read with a UTF-8 decoder. Which means that the file I have is in even worse condition than I thought.
Anyway, my current decoder is quite permissive in what it accepts -- preferring not to throw exceptions, but rather pass any errant bytes along to the caller. I will likely alter that next week.
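
To illustrate both points with the built-in classes: a small sketch of how an unencoded byte turns into U+FFFD under a lenient UTF-8 decode, and how the `throwOnInvalidBytes` constructor argument of `UTF8Encoding` turns the same input into an exception. The class name and the sample bytes are made up for illustration.

```csharp
using System;
using System.Text;

static class StrictVsLenientDemo
{
    // 0xA0 is a non-breaking space in ISO-8859-1, but a lone continuation byte in UTF-8.
    static readonly byte[] Sample = { 0x41, 0xA0, 0x42 };

    // Lenient: the ill-formed byte is replaced with U+FFFD.
    public static string DecodeLenient()
    {
        var lenient = new UTF8Encoding(false, false);
        return lenient.GetString(Sample);
    }

    // Strict: the same input raises DecoderFallbackException.
    public static bool StrictThrows()
    {
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(Sample);
            return false;
        }
        catch (DecoderFallbackException)
        {
            return true;
        }
    }
}
```

`DecodeLenient()` returns "A\uFFFDB", which is the pattern the stray U+FFFD characters in the file would fit if the guess above is right.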


Edit: 4/22 -- A rough logic diagram of my algorithm.
         --------------------------------------------------------------------
         | My custom Decoder                                                |
         |                                                                  |
bytes ---------> Is UTF-8 encoded multi-byte?---NO----------------------------- chars -->
         |   ^                              |                               |
         |   |                              |    --------------------       |
         |   |                              |    |                  |       |
         |   |                              YES----> UTF-8 decoder ----V    |
         |   |                                   |__________________|  |    |
         |   |                                                         |    |
         |   ^---------------------------------------------------------<    |
         |                                                                  |
         |__________________________________________________________________|

The thing to remember is that the UTF-8 Decoder will only ever be presented with byte sequences which are (or appear to be) valid UTF-8 encoded multi-byte characters. Anything else is passed along unchanged, including single-byte UTF-8 encoded characters.
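
A rough sketch of the loop in the diagram, not the actual decoder from this post. It assumes a decoded value is fed back as a single unit, that only units below 0x100 can take part in a further multi-byte sequence, and that leftover bytes are passed through as the character with the same value. `RecursiveUtf8Sketch`, `TryDecode` and `Decode` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class RecursiveUtf8Sketch
{
    // Tries to read one well-formed multi-byte UTF-8 sequence starting at v[i].
    // Returns the sequence length, or 0 if there is none at this position.
    static int TryDecode(List<int> v, int i, out int codePoint)
    {
        codePoint = 0;
        int lead = v[i];
        int len, min;
        if (lead >= 0xC2 && lead <= 0xDF) { len = 2; min = 0x80; codePoint = lead & 0x1F; }
        else if (lead >= 0xE0 && lead <= 0xEF) { len = 3; min = 0x800; codePoint = lead & 0x0F; }
        else if (lead >= 0xF0 && lead <= 0xF4) { len = 4; min = 0x10000; codePoint = lead & 0x07; }
        else return 0;

        if (i + len > v.Count) return 0;
        for (int k = 1; k < len; k++)
        {
            int b = v[i + k];
            if (b < 0x80 || b > 0xBF) return 0; // not a continuation byte
            codePoint = (codePoint << 6) | (b & 0x3F);
        }

        // Reject overlong forms, surrogates and values beyond U+10FFFF.
        if (codePoint < min || codePoint > 0x10FFFF) return 0;
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) return 0;
        return len;
    }

    public static string Decode(byte[] bytes)
    {
        var units = new List<int>(bytes.Length);
        foreach (byte b in bytes) units.Add(b);

        // Keep feeding decoded output back in until no multi-byte sequence is left.
        bool changed = true;
        while (changed)
        {
            changed = false;
            var next = new List<int>(units.Count);
            for (int i = 0; i < units.Count; )
            {
                int len = TryDecode(units, i, out int cp);
                if (len > 0) { next.Add(cp); i += len; changed = true; }
                else { next.Add(units[i]); i++; }
            }
            units = next;
        }

        // Anything left is passed along unchanged as a character.
        var sb = new StringBuilder(units.Count);
        foreach (int u in units) sb.Append(char.ConvertFromUtf32(u));
        return sb.ToString();
    }
}
```

With this sketch, `Decode(new byte[] { 0xC3, 0x82, 0xC2, 0xA0 })` collapses the double-encoded non-breaking space to a single U+00A0. The trade-off is the "appear to be" caveat: text that legitimately contains a run such as U+00C3 U+00A9 would also be collapsed, so this is only suitable for input already known to be damaged.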

I may need to implement a UTF-8 Encoder which won't double-encode UTF-8 characters.

modified 22-Jun-24 13:46pm.

