I'm trying to deal with a problem with "invalid high surrogate characters" in a string.

One of the examples I found online shows how to create such a string in C#

string s = "a\ud800b";


The person who created the example then claims that
s.Normalize();
will fail.

And I'm stumped. No matter what I try in VB.NET, it seems like the resulting string is always a good, "normalized" string.

How can I recreate that problematic string in VB.NET?

What I have tried:

Dim sFoo As String

Dim bytes() As Byte

bytes = {97, 216, 0, 98}
sFoo = System.Text.Encoding.Unicode.GetString(bytes)


and I've tried swapping the bytes for \ud800

Dim sFoo As String

Dim bytes() As Byte

bytes = {97, 0, 216, 98}
sFoo = System.Text.Encoding.Unicode.GetString(bytes)


I've tried various other encodings (UTF7, UTF8, UTF32, BigEndianUnicode). Every time I check sFoo.IsNormalized it happily returns "True", and the .Normalize() call always succeeds.
Updated 23-Jul-20 8:10am

VB
Dim s As String = "a" + ChrW(&HD800) + "b"
Dim t As String = s.Normalize()

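The reason it fails is that U+D800 on its own is not a valid Unicode character: it is a high "surrogate", one half of a UTF-16 surrogate pair, and without a matching low surrogate (U+DC00 to U+DFFF) after it the string can't be "normalized". See FAQ - UTF-8, UTF-16, UTF-32 & BOM.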
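
If you would rather test for this condition than wait for Normalize() to throw, something along these lines should work. It is only a sketch: HasLoneSurrogate is a made-up helper name, not a framework method, and it relies only on Char.IsHighSurrogate and Char.IsLowSurrogate.

```vb
Imports System

Module SurrogateCheck
    ' Hypothetical helper (not part of .NET): returns True if s contains
    ' a surrogate code unit that is not part of a valid high/low pair.
    Function HasLoneSurrogate(s As String) As Boolean
        Dim i As Integer = 0
        While i < s.Length
            If Char.IsHighSurrogate(s(i)) Then
                ' A high surrogate must be followed directly by a low surrogate.
                If i + 1 < s.Length AndAlso Char.IsLowSurrogate(s(i + 1)) Then
                    i += 2
                    Continue While
                End If
                Return True
            ElseIf Char.IsLowSurrogate(s(i)) Then
                ' A low surrogate with no high surrogate in front of it.
                Return True
            End If
            i += 1
        End While
        Return False
    End Function

    Sub Main()
        Dim bad As String = "a" & ChrW(&HD800) & "b"
        ' U+1F600 is outside the BMP, so it is stored as a valid surrogate pair.
        Dim good As String = "a" & Char.ConvertFromUtf32(&H1F600) & "b"
        Console.WriteLine(HasLoneSurrogate(bad))   ' True
        Console.WriteLine(HasLoneSurrogate(good))  ' False
    End Sub
End Module
```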
 
Comments
Pino Carafa 23-Jul-20 10:13am    
gawd that simple eh? Many many thanks.
OriginalGriff 23-Jul-20 10:30am    
You're welcome!
OriginalGriff's solution is the perfect solution to this question. I'm just posting another "answer" here as it allows me to format the code I'm about to post.

My original intent was to DEAL with such strings, basically finding a way to remove such "garbage" from the string without affecting the string too much.

This was causing an issue writing to XML. The following would fail:

Dim sFoo As String = "a" + ChrW(&HD800) + "b"

Dim oXML As System.Xml.XmlDocument
Dim oRoot As System.Xml.XmlElement

oXML = New System.Xml.XmlDocument
oRoot = oXML.CreateElement("foo")
oXML.AppendChild(oRoot)
oRoot.InnerText = sFoo
oXML.Save(System.IO.Path.Combine(System.IO.Path.GetTempPath, "foo.xml"))


To fix it - or at least prevent the error -

Dim sFoo As String = "a" + ChrW(&HD800) + "b"

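Try
    ' IsNormalized and Normalize both throw ArgumentException
    ' when the string contains an unpaired surrogate.
    If Not sFoo.IsNormalized Then
        sFoo = sFoo.Normalize
    End If
Catch ex As ArgumentException
    ' Round-trip through UTF-16 bytes: the encoding's default replacement
    ' fallback swaps each unpaired surrogate for a substitute character.
    Dim bytes As Byte()
    bytes = System.Text.Encoding.Unicode.GetBytes(sFoo)
    sFoo = System.Text.Encoding.Unicode.GetString(bytes)
End Try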

Dim oXML As System.Xml.XmlDocument
Dim oRoot As System.Xml.XmlElement

oXML = New System.Xml.XmlDocument
oRoot = oXML.CreateElement("foo")
oXML.AppendChild(oRoot)
oRoot.InnerText = sFoo
oXML.Save(System.IO.Path.Combine(System.IO.Path.GetTempPath, "foo.xml"))


To the reader: Now this replaces the "garbage" with other characters that a human reader might perceive as "random" or nonsensical. For my intents and purposes that is perfectly fine, but if you need to do better than that you'll have to address that problem yourself :-)
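
For anyone who does need more control over what the "garbage" turns into, here is a rough sketch of one way to do it. ReplaceLoneSurrogates is a hypothetical helper name of my own, not a framework method. It walks the string, copies valid surrogate pairs untouched, and substitutes whatever replacement string you pass in for each unpaired surrogate. As far as I know the Encoding.Unicode round trip above substitutes U+FFFD, but treat that as an assumption and check it on your own runtime.

```vb
Imports System
Imports System.Text

Module SurrogateCleanup
    ' Hypothetical helper (not part of .NET): replaces every unpaired
    ' surrogate in s with the given replacement string.
    Function ReplaceLoneSurrogates(s As String, replacement As String) As String
        Dim sb As New StringBuilder(s.Length)
        Dim i As Integer = 0
        While i < s.Length
            Dim c As Char = s(i)
            If Char.IsHighSurrogate(c) AndAlso i + 1 < s.Length AndAlso Char.IsLowSurrogate(s(i + 1)) Then
                ' Valid pair: copy both code units unchanged.
                sb.Append(c).Append(s(i + 1))
                i += 2
            ElseIf Char.IsSurrogate(c) Then
                ' Unpaired high or low surrogate: substitute it.
                sb.Append(replacement)
                i += 1
            Else
                sb.Append(c)
                i += 1
            End If
        End While
        Return sb.ToString()
    End Function

    Sub Main()
        Dim sFoo As String = "a" & ChrW(&HD800) & "b"
        Console.WriteLine(ReplaceLoneSurrogates(sFoo, "?"))  ' a?b
        Console.WriteLine(ReplaceLoneSurrogates(sFoo, ""))   ' ab
    End Sub
End Module
```

Passing an empty string simply drops the bad code unit, which may be the least surprising result if the text ends up in an XML file that people will read.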
 
Quote:
How do I translate this very simple C# syntax to VB?

Not exactly a solution, but in the news:
https://ai.facebook.com/blog/deep-learning-to-translate-between-programming-languages
Automatic translation is still a subject of research.
So beyond the most basic translations, automated translation is not really a solution.
The only efficient solution is to learn both languages and translate manually.
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


