|
Reviewing a CSV file today (which already has inconsistent quotes and such), I noticed it has a few characters (e.g. non-breaking space) encoded as UTF-8 -- fine, no big deal. But they still look odd after decoding... ah, they're encoded as UTF-8 twice... double-UTF-8.
So now I have to write a recursive UTF-8 decoder... Why doesn't .net simply do that to begin with? <<== That's a rhetorical question.
It'll be breakfast at Milliways again.
Edit: 4/20 -- I have a working recursive UTF-8 decoding algorithm, in a custom decoder for a custom encoding (derived from the built-in UTF-8 encoding, so encoding should be as per normal).
What was unexpected was that the GetString method of the encoding didn't call the custom Decoder.
I just had a look at the reference source for GetString and I see:
[Pure]
public virtual String GetString(byte[] bytes, int index, int count)
{
    return new String(GetChars(bytes, index, count));
}
Does that mean that it doesn't actually use my decoder?
Shouldn't it call GetDecoder() and use that decoder?
(I'm not experienced at reading the reference source.)
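If the runtime really does follow the reference source quoted above, then GetString is built from the encoding's own GetChars and never touches GetDecoder(). Here's a sketch of the wiring that would make a custom decoder reach GetString callers (names hypothetical; the inner decoder is a stand-in that just delegates to plain UTF-8 so the sketch compiles on its own):

using System.Text;

// Hypothetical sketch: since GetString = new String(GetChars(...)) in the
// reference source, overriding GetChars (and GetCharCount) is what routes
// GetString through a custom Decoder. GetString is overridden too, belt
// and braces, because newer runtimes have fast paths of their own.
class RecursiveUtf8Encoding : UTF8Encoding
{
    public override Decoder GetDecoder() => new PassThroughDecoder();

    public override int GetCharCount(byte[] bytes, int index, int count)
        => GetDecoder().GetCharCount(bytes, index, count);

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount,
                                 char[] chars, int charIndex)
        => GetDecoder().GetChars(bytes, byteIndex, byteCount, chars, charIndex);

    public override string GetString(byte[] bytes, int index, int count)
        => new string(GetChars(bytes, index, count));

    // Stand-in for the custom recursive decoder: it just delegates to plain
    // UTF-8 so this sketch is self-contained.
    private sealed class PassThroughDecoder : Decoder
    {
        private readonly Decoder inner = new UTF8Encoding().GetDecoder();

        public override int GetCharCount(byte[] bytes, int index, int count)
            => inner.GetCharCount(bytes, index, count);

        public override int GetChars(byte[] bytes, int byteIndex, int byteCount,
                                     char[] chars, int charIndex)
            => inner.GetChars(bytes, byteIndex, byteCount, chars, charIndex);
    }
}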
I'll get back to it on Monday.
Edit: 4/21 -- Reading some more about UTF-8 on the 'pedia, I see:
The Unicode Standard requires decoders to
"... treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."
and
The standard also recommends replacing each error with the replacement character "�" (U+FFFD).
Which I choose not to do...
These recommendations are not often followed.
But it makes me think that the few U+FFFD characters I see in the file may have begun as unencoded characters which were errantly read with a UTF-8 decoder. Which means that the file I have is in even worse condition than I thought.
Anyway, my current decoder is quite permissive in what it accepts -- preferring not to throw exceptions, but rather pass any errant bytes along to the caller. I will likely alter that next week.
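In .NET terms, those two standard behaviors map to the built-in decoder fallbacks; a quick illustration (the bytes are made up for the example):

using System;
using System.Text;

// The two behaviors from the recommendation above: replace ill-formed
// sequences with U+FFFD, or treat them as a hard error.
byte[] illFormed = { 0x41, 0xC3, 0x28 }; // 0xC3 opens a sequence 0x28 can't continue

var replacing = Encoding.GetEncoding("utf-8",
    EncoderFallback.ExceptionFallback, DecoderFallback.ReplacementFallback);
Console.WriteLine(replacing.GetString(illFormed)); // prints: A�(

var throwing = Encoding.GetEncoding("utf-8",
    EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
// throwing.GetString(illFormed) raises DecoderFallbackException instead

My permissive approach is a third option that neither built-in fallback gives you, which is why it needs a custom decoder.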
Edit: 4/22 -- A rough logic diagram of my algorithm.
+------------------------------------------------------------------+
|                         My custom Decoder                        |
|                                                                  |
bytes ----> Is UTF-8 encoded multi-byte? ---NO----------------------> chars
|              ^                |                                  |
|              |               YES                                 |
|              |                |                                  |
|              |                +----> [ UTF-8 decoder ] ----+     |
|              |                                             |     |
|              +---------------------------------------------+     |
|                                                                  |
+------------------------------------------------------------------+
The thing to remember is that the UTF-8 Decoder will only ever be presented with byte sequences which are (or appear to be) valid UTF-8 encoded multi-byte characters. Anything else is passed along unchanged; this includes single-byte UTF-8 encoded characters.
I may need to implement a UTF-8 Encoder which won't double-encode UTF-8 characters.
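For the curious, a minimal sketch of the recursion in the diagram (not my production decoder; it assumes .NET 5+ for Encoding.Latin1, and note that text which legitimately looks like double-encoded UTF-8 will be over-decoded):

using System.Text;

// Minimal sketch: keep peeling UTF-8 layers while the decoded text still
// fits in single bytes and those bytes still form valid UTF-8.
static string DecodeUtf8Layers(byte[] bytes)
{
    var strict = new UTF8Encoding(false, throwOnInvalidBytes: true);

    string text;
    try { text = strict.GetString(bytes); }
    catch (DecoderFallbackException) { return Encoding.Latin1.GetString(bytes); }

    while (true)
    {
        // Any char above U+00FF can't be a disguised byte: no inner layer left.
        bool latin1Only = true;
        foreach (char c in text)
            if (c > '\u00FF') { latin1Only = false; break; }
        if (!latin1Only) return text;

        byte[] inner = Encoding.Latin1.GetBytes(text);
        string decoded;
        try { decoded = strict.GetString(inner); }
        catch (DecoderFallbackException) { return text; } // not another layer
        if (decoded == text) return text;                  // pure ASCII: stable
        text = decoded;
    }
}

For example, a double-encoded non-breaking space (C3 82 C2 A0) peels back to the single character U+00A0.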
|
|
|
|
|
I got a phishing email the other day.
Embedded HTML to imitate a Micro$oft login form (posting credentials to evil.org of course).
Inline base-64 encoded, easy peasy.
%-encoded inside that. One and a half times....
The outer decode works, but still got %'s in there.
Decode again and *barf*, it's broken (but only in some places).
Wasn't game to see what a browser would make of it.
Given browsers' general tolerance of coding errors, I suspect it might just have worked.
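Here's a hedged sketch (C#; the names and the loop cap are made up) of unwrapping that sort of nested payload: one base64 decode, then percent-decoding repeated while %XX escapes remain.

using System;
using System.Text;
using System.Text.RegularExpressions;

// Sketch of peeling the layers described above: base64 once, then
// percent-decode while valid %XX escapes remain (capped so malformed
// input can't loop forever).
static string Unwrap(string base64Payload)
{
    string text = Encoding.UTF8.GetString(Convert.FromBase64String(base64Payload));

    for (int i = 0; i < 5 && Regex.IsMatch(text, "%[0-9A-Fa-f]{2}"); i++)
        text = Uri.UnescapeDataString(text);

    return text;
}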
Software rusts. Simon Stephenson, ca 1994. So does this signature. me, 2012
|
|
|
|
|
.NET doesn't do that to begin with because it wouldn't make any sense.
The problem is your CSV has bad encoding.
What you wrote is a workaround for a poorly encoded file.
That's not .NET's business, and frankly, if it did that, it would be a Bad Thing(TM)
Check out my IoT graphics library here:
https://honeythecodewitch.com/gfx
And my IoT UI/User Experience library here:
https://honeythecodewitch.com/uix
|
|
|
|
|
Oh, I know. Yet why not have an option? Just as when getting a directory listing, you can specify TopDirectoryOnly or AllDirectories. Not that they would have expected this to be needed twenty years ago.
What I have so far will deal with double-encoding, but I can imagine the rabbit hole going deeper. All the way to the turtles I'm sure.
Anyway, it's a good exercise.
|
|
|
|
|
Because in the decades that UTF-8 has been available, you're the second person to need the feature.
Check out my IoT graphics library here:
https://honeythecodewitch.com/gfx
And my IoT UI/User Experience library here:
https://honeythecodewitch.com/uix
|
|
|
|
|
Probably the first. The other guy on the team hadn't noticed the issue.
|
|
|
|
|
Adding, there's another issue.
What if your intent was to embed control characters into UTF-8?
.NET cannot do this for you without breaking the UTF-8 spec.
Check out my IoT graphics library here:
https://honeythecodewitch.com/gfx
And my IoT UI/User Experience library here:
https://honeythecodewitch.com/uix
|
|
|
|
|
For instance?
honey the codewitch wrote: .NET cannot do this for you
Bet it can.
|
|
|
|
|
Yeah, Microsoft could break UTF-8 to make you happy and make everyone else mad.
And make .NET broken.
I'll get back to you when someone besides you thinks this is a good idea.
Check out my IoT graphics library here:
https://honeythecodewitch.com/gfx
And my IoT UI/User Experience library here:
https://honeythecodewitch.com/uix
|
|
|
|
|
But seriously, what are you saying it can't do?
|
|
|
|
|
I'm saying they can't recursively decode UTF-8 without breaking the spec.
Edit: I feel like I'm peeing in your Wheaties, but that's not my intent. I'm just saying it's not .NET's place to satisfy your requirement. You could write a NuGet package for it, but it's completely non-standard behavior and would break the spec + potentially break other code.
Check out my IoT graphics library here:
https://honeythecodewitch.com/gfx
And my IoT UI/User Experience library here:
https://honeythecodewitch.com/uix
|
|
|
|
|
honey the codewitch wrote: they can't recursively decode UTF-8 without breaking the spec.
I don't see how you arrive at that conclusion.
honey the codewitch wrote: not .NET's place to satisfy your requirement
I agree.
honey the codewitch wrote: would break the spec
In what way exactly? Particularly if the caller has control over whether or not it does.
But you mentioned something about writing control characters in UTF-8 -- which include carriage-return, line-feed, form-feed, etc. -- so I don't understand why you say it would break UTF-8.
Whatever situation you are trying to communicate, I am sure .net can do it already, and it doesn't "break UTF-8".
|
|
|
|
|
My career seems to be entirely described by the following:
"Use a simple technology to do a thing that 80% of Devs take for granted and who also tell you it works, but discovering that the thing doesn't actually work the way they say it works."
Here's the latest. I will try to keep it short; I'm going to write an article on it, but I can't wait to share this one because it is so weird (from multiple angles).
1) JavaScript FormData Posted To Backend Is All Strings
Maybe a lot of you know this, but all of the values in the following (real) example end up being strings when they get to the backend.
var formData = new FormData();
formData.append("uuid", currentUuid);
var entryId = 9;
var jentry = {
    Id: entryId,
    Title: titleText,
    Note: noteText,
    Created: createdDate,
    Updated: null
};
for (var key in jentry) {
    formData.append(key, jentry[key]);
}
Odd, But Acceptable
Ok, so the point of that is that all of the values in the jentry object are converted to string values. That means Updated = "null" (string) and the Id = "9".
This is odd, to me, but acceptable.
If you refute this, please provide citations. I've searched far and wide and asked Copilot -- everything says they are strings when they hit the backend.
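As a quick server-side confirmation (hypothetical action, assuming ASP.NET Core), you can dump the raw form before any binding happens -- every value, including Id and Updated, arrives as a string:

[HttpPost]
public ActionResult Inspect()
{
    // multipart/form-data fields are text on the wire, so each value here
    // is a string, e.g. Id = "9", Updated = "null"
    foreach (var field in Request.Form)
        Console.WriteLine($"{field.Key} = \"{field.Value}\"");
    return Ok();
}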
Now It Gets Real Weird
My WebAPI method which gets called by the JS Fetch with the previously shown FormData looks like:
public ActionResult Save([FromForm] String uuid, [FromForm] JournalEntry jentry)
AutoBinding In .NET Core Works With JS Fetch
1. The JS Fetch works.
2. The C# JournalEntry is created from the data posted via the FormData variables.
3. I can examine the data in the C# JournalEntry object and (almost) everything looks fine.
Autobinding Ignores The Id Value!
However, at some point I needed the Id value which was being passed in.
But I noticed that the Id value in the C# object was always ZERO (Id = 0).
FormData Value Is Non-Zero, But Autobound Object Is 0
But, keep in mind that the FormData object shown above is passing in a non-zero value for Id (9 in the example).
What!?!
Why Is This Happening?
The only thing I could guess is that since the Id value (from the FormData) was being passed as a String that the autobinding just ignores the value and sets it to 0.
Here's How My C# Object (Used for Autobinding) Is Defined
public class JournalEntry {
    public Int64 Id { get; set; }
    public String? Title { get; set; }
    public String Note { get; set; }
    public String Created { get; set; }
    public String? Updated { get; set; }
}
Yes, the point is that the Id is an Int64 (not a string).
This Part Should Alarm You!!! This Should Be The Bug
So, somehow the autobinder is able to bind the JournalEntry to the FormData object but it just ignores the Id value and sets it to 0!!!!
Why?? !! Please someone explain it to me.
A "Workaround" Test
I altered my JournalEntry class so the Id is a String
public class JournalEntry {
    public String Id { get; set; }
and posted the data again.
Autobinder Works! What?!
After that change, the autobinder sets the Id value properly (non-zero value that is read from FormData).
Not What I Expected
This is not what I expected. The posted form data is _supposed_ to get autobound to the C# object, right?
Just Binds the Parts It Wants To Bind?
It's weird because it doesn't fail to bind; it just skips the parts of the object that it wants to skip.
I could accept it if it failed to bind because it was the wrong type.
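One hedged way to see what the binder actually did (a sketch, assuming ASP.NET Core): even when the action still runs, ModelState records any conversion failures the binder swallowed, so logging it should show why Id stayed at its default.

[HttpPost]
public ActionResult Save([FromForm] String uuid, [FromForm] JournalEntry jentry)
{
    // If the binder couldn't convert a field (e.g. Id), the reason usually
    // lands here rather than in an exception.
    if (!ModelState.IsValid)
        foreach (var kvp in ModelState)
            foreach (var error in kvp.Value.Errors)
                Console.WriteLine($"{kvp.Key}: {error.ErrorMessage}");

    // jentry.Id may still be 0 here even though FormData sent "9".
    return Ok();
}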
Article Is Coming Later Today
This was long but I will be writing it up in an article with associated examples and an example project you'll be able to download and try for yourself.
|
|
|
|
|
Automatism is often also random.
Either you declare what absolutely has to be bound or you have to live with chance.
|
|
|
|
|
0x01AA wrote: Automatism is often also random.
That's quite true.
We often interact with behavior that is basically _undefined_.
Undefined behavior could result in anything.
So, I guess software development is just a form of "trial and error". Just see what you get.
This is why it is illegal in many areas to call it software engineering.
|
|
|
|
|
Quote:
formData.append(uuid,{"uuid":currentUuid});
I'm assuming that's a typo in your question. If not, you're sending a value with the name set to whatever's in your uuid variable, and the value set to the literal string "[object Object]".
"These people looked deep within my soul and assigned me a number based on the order in which I joined."
- Homer
|
|
|
|
|
Thanks for pointing that out.
That was a typo / bad test from one of the numerous examples I was trying to get it to work.
I updated my original post to fix it to the correct value for uuid I was sending through:
formData.append("uuid",currentUuid);
|
|
|
|
|
Gonna guess this is the "(almost)". Heh.
|
|
|
|
|
Over a decade ago, I wrote a SQL Server backup/rotation manager. Backups are zipped and password protected using IonicZip, then copied to a local repository on another disk, and optionally pushed out to an FTP resource. This is handy for when I find myself working on the laptop away from the home/office. This system has been used on multiple servers without issues.
A couple of weeks ago, I started using it on a newish Azure VM for one of our latest projects to manage 2 customer databases. It appeared to be working fine...backups/zips/copies all getting created with no errors...or so I thought. Yesterday, I was away from the office and decided to grab the previous day's backups and restore them on my laptop. Using 7-Zip, the zips extracted, but with a CRC error detected. Of course, the backups were useless. Native Windows zip refused to extract anything, failing with a generic 'unrecognized error' message. Every backup from that system was corrupted!
Backups from the other 2 systems are/were fine...one of the other systems is also an Azure VM with practically identical setups and has been working fine for years.
I'll skip the troubleshooting details and get to the fix. IonicZip has a property that I had never heard of before, and which until now had not been important: ParallelDeflateThreshold, which needed to be set to -1. Doing so fixed the problem. If I understand correctly, IonicZip has a known problem with files that compress to < 4MB. In my case, mine were slightly over 2MB compressed. All of my other backups are much larger, which is perhaps why I've never seen this problem before.
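In code, the fix is a one-liner (a sketch with made-up paths and password; ParallelDeflateThreshold is the real DotNetZip/IonicZip property):

using Ionic.Zip;

// -1 disables the parallel deflater entirely, avoiding the corruption
// described above for small compressed outputs.
using (var zip = new ZipFile())
{
    zip.ParallelDeflateThreshold = -1;
    zip.Password = "backup-password";   // placeholder
    zip.AddFile(@"C:\backups\db.bak");  // placeholder path
    zip.Save(@"C:\backups\db.zip");     // placeholder path
}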
At any rate, I wanted to post this here in case someone else here is using this component and not aware of this issue. The bottom line is, test your backups! Have a great weekend!
"Go forth into the source" - Neal Morse
"Hope is contagious"
|
|
|
|
|
So, just how long were you "puckered"?
I’ve given up trying to be calm. However, I am open to feeling slightly less agitated.
I’m begging you for the benefit of everyone, don’t be STUPID.
|
|
|
|
|
MarkTJohnson wrote: how long were you "puckered"?
I would have been puckered if I had truly needed those backups. As it was, I went through a few emotions:
0: Surprise! Your false sense of security has just been shattered...You are not as clever as you thought you were, and your backups are shite!
1: Doubt...Hmmm what about the other 20 daily backups? Are they all shite?
2: Relief...Whew! The other backups are fine. Just these two from this server are crap.
3: Annoyance. I just want to get on with work. Now I have to log on to Azure, allow myself to RDP into that box, get raw unzipped backups, and start troubleshooting the problem.
4: Sleuth Mode...the problem seems to be with the zip lib...maybe a bug...maybe fixed? Go get the latest version, only to find that it's deprecated and the last release is 6 years old. Whatever, I'll try it.
5: Disappointment. Nope that didn't work, time to open the project and debug with one of the dbs having the issues.
6: Excitement. Yay! I was able to replicate the issue...now on to understanding.
7: Discovery: A well-phrased search put me on the right track...a known issue with an easy fix.
8: Humility: I'm sure I would have discovered this eventually, but I put a lot of faith in an automated process without actually verifying the outputs, which was the only way to detect the problem. Lesson learned!
"Go forth into the source" - Neal Morse
"Hope is contagious"
|
|
|
|
|
... it occurred to me: What if CP itself, or any subsystem employed, has chosen to use the top 8 bits e.g. to flag privileges or other user properties, leaving only 24 bits for the member number?
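For concreteness, a hypothetical sketch of that packing scheme and its failure mode:

// Hypothetical: top 8 bits as privilege flags, low 24 bits as member number.
static class PackedId
{
    const int MemberMask = 0x00FFFFFF;

    public static int Pack(byte flags, int memberId)
        => (flags << 24) | (memberId & MemberMask);

    public static (byte Flags, int MemberId) Unpack(int packed)
        => ((byte)((uint)packed >> 24), packed & MemberMask);
}

// The failure mode: the mask silently truncates member numbers above
// 16,777,215, so Pack(0, 16777216) round-trips as member 0.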
Religious freedom is the freedom to say that two plus two make five.
|
|
|
|
|
If anyone did it, they did it unsigned!
|
|
|
|
|
This would be very poor database design. If this is actually the case, Chris should demand a refund from the database designer.
In any case, the change in schema should be quite simple, and, assuming that the user ID and privileges are accessed by two separate methods, the change in the interface code should also be small.
For that matter, encoding the various privileges in an integer is also problematic. What happens when the number of privileges grows (as it inevitably will), and exceeds 32?
Freedom is the freedom to say that two plus two make four. If that is granted, all else follows.
-- 6079 Smith W.
|
|
|
|
|
I agree. Giving a single field/value multiple meanings and uses has always led to trouble down the road for me.
There are no solutions, only trade-offs. - Thomas Sowell
A day can really slip by when you're deliberately avoiding what you're supposed to do. - Calvin (Bill Watterson, Calvin & Hobbes)
|
|
|
|
|