Click here to Skip to main content
15,887,746 members
Articles / Database Development / SQL Server

ASCII/Multibyte to Unicode Conversion

Rate me:
Please Sign up or sign in to vote.
3.10/5 (8 votes)
16 Feb 2007CPOL3 min read 166.5K   26   23
An article about converting ASCII/Multibyte to Unicode using C# and SQL/CLR integration.

Sample image

Introduction

A common problem when converting legacy database applications is that customers have a lot of data and they don't want to loose it. Data strings are usually encoded in ASCII /Multibyte, but our new system uses NVARCHAR fields, so the problem is that we have to convert old data into Unicode. This is exactly what we'll do here.

I have to highlight that the approach I will present is 'oversized' for most cases: in a common scenario, you will deal with a single codepage; in that case, SQL Server built-in tools are enough. But there are situations that require a more advanced approach, such as when you have a legacy database that stores strings with different codepages.

As a complete example to show you capabilities of SQL-CLR integration, I've also decided to use a Win32 API to perform the conversion so that you can also see how to use P/Invoke from SQL. This can be useful if you have an old DLL and want to use it... but *beware* that it can be really dangerous... if you don't have full knowledge of what you're doing, you can keep down the entire SQL Server process !!!

Just a note: I do not provide a test project since the code I will show here is really simple, you can use copy & paste, and it's faster!

Solution

The solution I want to show is simple and powerful at the same time. It is made of two logical parts: first, we'll build the conversion routine as a standard C# function. Second, we'll integrate it into SQL Server as a function, so you can use with T-SQL.

Clearly, depending on your application scenario, this may not be the best approach, so once you have the conversion routine, you may choose to follow my approach or you may prefer to use it in an external application that performs a batch conversion, or you may think of something else.

Part #1: Create the conversion routine

I've used the following well known Win32 API:

C++
int MultiByteToWideChar(
  UINT CodePage, 
  DWORD dwFlags,         
  LPCSTR lpMultiByteStr, 
  int cbMultiByte,       
  LPWSTR lpWideCharStr,  
  int cchWideChar        
);

Here, there's the C# signature that can be used for P/Invoking:

C#
[DllImport("kernel32.dll")]
private static extern int MultiByteToWideChar(
  uint CodePage,
  uint dwFlags,
  [MarshalAs(UnmanagedType.LPArray)] Byte[] lpMultiByteStr,
  int cbMultiByte,
  [Out, MarshalAs(UnmanagedType.LPArray)] Byte[] lpWideCharStr,
  int cchWideChar);

Now it's quite easy to write a class that performs a conversion:

C#
namespace ConvUtils {

  public static class Unicode {

    public static SqlString ConvToUnicode(SqlInt32 codepage , SqlString multibyteString) {
      byte[] b = (byte[])iConvToMultibyteArray(multibyteString);
      return (SqlString)ToUnicode((uint)(int)codepage, b);
    }

    private static string ToUnicode(uint codepage, Byte[] lpMultiByteStr) { 
      Byte[] lpWideCharStr = new Byte[2*lpMultiByteStr.Length];
      MultiByteToWideChar(codepage, 0, lpMultiByteStr, lpMultiByteStr.Length,
         lpWideCharStr, 2*lpMultiByteStr.Length);
      return System.Text.Encoding.Unicode.GetString(lpWideCharStr);
    }


    private static SqlBinary iConvToMultibyteArray(SqlString multibyteString) {
      byte[] result = multibyteString.GetUnicodeBytes();
      return (SqlBinary)result;
    }
  }
}

The example is quite easy, and does not require any other explanation. The SQL types has been used because I will integrate it into SQL Server, but if you don't need it, you may replace them with strings and byte arrays.

I've also decided to let the codepage be a parameter since the codepage on your PC or on SQL Server can be different from the one needed for the conversion.

Part #2: SQL Server code

OK, now we have a C# code that converts ASCII/Multibyte into Unicode. The next step is to integrate it into SQL Server so that any database user can have access to this conversion routine:

First, create a DLL that can be hosted by SQL Server: all we have to do is to add the [SQLFunction] attribute to ConvToUnicode:

C#
[SQLFunction] 
public static SqlString ConvToUnicode( ...

Finally, build the DLL and integrate it into SQL Server with a script like the following:

SQL
use TESTDB 
go 
exec sp_configure "clr enabled", '1' 
go 
reconfigure 
go 
-- for test purpouses only, not recommended in production environments
ALTER DATABASE TESTDB SET TRUSTWORTHY ON 
go 

begin try 
  CREATE ASSEMBLY [asmUni] FROM 'c:\project_output_dir\uniconv.dll'
     with permission_set=UNSAFE 
end try 
begin catch 
  alter assembly [asmUni] FROM 'c:\project_output_dir\uniconv.dll' WITH UNCHECKED DATA 
end catch 
go

if exists(
    select name from sys.objects where name = 
    'csConvToUnicode') drop function [dbo].[csConvToUnicode] 
go 
CREATE FUNCTION [dbo].[csConvToUnicode] ( 
@codepage int, 
@multibytestr nvarchar(max) 
) returns nvarchar(max) 
AS EXTERNAL name [asmUni].[ConvUtils.Unicode].[ConvToUnicode] 
go

That's all!

Now you can use this function like any other, for example, in a classic Select statement, to create a View or to create a Trigger that automatically keeps your data updated.

Here is a final example of how we can use this function in a T-SQL statement (950 is the codepage for traditional Chinese):

SQL
select 
  description, 
  dbo.csConvToUnicode(950, description) as converted 
from testtable

description             converted
----------------------- -------------------
¨àµ£ºô¸ô¦w¥þ            兒童網路安全
°ê»Ú¸ê°T °T°T°T°T°T        國際資訊 訊訊訊訊訊
°ê»Ú¸ê°T °T°T°T°T°T        國際資訊 訊訊訊訊訊
a                    a
Áô¨pÃ…v¬Fµ¦            隱私權æ"¿ç­–
test c                test c

Conclusion

I have shown a really simple but powerful way to leverage CLR integration provided by SQL Server to convert old ASCII/Multibyte data into Unicode. I hope that this is a good starting point for your personal solution.

History

  • February 16, 2007 - Added more comments on introduction.
  • February 8, 2007 - First version.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior)
Italy Italy
10 years of hard coding... and it's going on !

Comments and Discussions

 
QuestionNot work in my case Pin
Ruohong23-Nov-10 2:11
Ruohong23-Nov-10 2:11 
AnswerRe: Not work in my case Pin
carlop()23-Nov-10 10:08
carlop()23-Nov-10 10:08 
QuestionNot work with arabic Pin
Ali Habib11-Jan-09 23:27
Ali Habib11-Jan-09 23:27 
AnswerRe: Not work with arabic Pin
carlop()12-Jan-09 9:52
carlop()12-Jan-09 9:52 
GeneralRe: Not work with arabic Pin
Ali Habib12-Jan-09 9:56
Ali Habib12-Jan-09 9:56 
GeneralRe: Not work with arabic Pin
carlop()12-Jan-09 10:19
carlop()12-Jan-09 10:19 
GeneralIt's just what I want, would u please share your dll here? many thks! Pin
frx21-Dec-08 17:09
frx21-Dec-08 17:09 
GeneralRe: It's just what I want, would u please share your dll here? many thks! Pin
carlop()21-Dec-08 21:01
carlop()21-Dec-08 21:01 
GeneralRe: It's just what I want, would u please share your dll here? many thks! Pin
frx22-Dec-08 22:48
frx22-Dec-08 22:48 
GeneralPerhaps an alternative using System.Text.Encoding Pin
Cyrus Chan26-Mar-07 22:29
Cyrus Chan26-Mar-07 22:29 
QuestionWhy reinvent the wheel? Pin
Mihai Nita8-Feb-07 8:26
Mihai Nita8-Feb-07 8:26 
AnswerRe: Why reinvent the wheel? Pin
carlop()8-Feb-07 9:37
carlop()8-Feb-07 9:37 
GeneralRe: Why reinvent the wheel? [modified] Pin
Mihai Nita8-Feb-07 12:39
Mihai Nita8-Feb-07 12:39 
GeneralRe: Why reinvent the wheel? Pin
carlop()8-Feb-07 13:06
carlop()8-Feb-07 13:06 
GeneralRe: Why reinvent the wheel? Pin
Mihai Nita8-Feb-07 16:26
Mihai Nita8-Feb-07 16:26 
GeneralRe: Why reinvent the wheel? Pin
carlop()8-Feb-07 21:08
carlop()8-Feb-07 21:08 
Here there's a quote from SqlServer Books online:

BooksOnline
If the client application is not Unicode-enabled and retrieves the data into non-Unicode buffers, a client will only be able to retrieve or modify data that can be represented by the client machine's code page. This means that ASCII characters can always be retrieved, because the representation of ASCII characters is the same in all code pages, while any non-ASCII data depends on code-page-to-code-page conversion.

For example, suppose you have an application that is currently running only in the United States (U.S.), but is deployed to Japan. Because the SQL Server database is Unicode-aware, both the English and Japanese text can be stored in the same tables, even though the application has not yet been modified to deal with text as Unicode. As long as the application complies with one of the two previous options, Japanese users can use the non-Unicode application to input and retrieve Japanese data, and U.S. users can input and retrieve English data. All data from both sets of users is stored intact in the same column of the database and represented as Unicode. In this situation, a Unicode-enabled reporting application that generates reports that span the complete data set can be deployed. However, English users cannot view the Japanese rows, because the application cannot display any characters that do not exist in their code page (1252).

This situation might be acceptable if the two groups of users do not have to view each other's records. If an application user must be able to view or modify records with text that cannot be represented by a single code page, there is no alternative but to modify the application so that it can use Unicode.


So there are situation in which a non-standard following approach is needed. Probably in most cases you're right, the approach I've proposed is oversized, but there also unlucky people as me out there Frown | :(

---( carlop )---
web @ www.carlop.com
blog @ carlop-dev.blogspot.com

GeneralRe: Why reinvent the wheel? Pin
Mihai Nita8-Feb-07 23:24
Mihai Nita8-Feb-07 23:24 
GeneralRe: Why reinvent the wheel? Pin
carlop()8-Feb-07 23:42
carlop()8-Feb-07 23:42 
GeneralRe: Why reinvent the wheel? Pin
pemb31-Jul-07 17:10
pemb31-Jul-07 17:10 
AnswerRe: It's still a good example of CLR integration Pin
Jcmorin8-Feb-07 11:26
Jcmorin8-Feb-07 11:26 
GeneralRe: It's still a good example of CLR integration Pin
carlop()8-Feb-07 11:33
carlop()8-Feb-07 11:33 
GeneralRe: It's still a good example of CLR integration Pin
Mihai Nita8-Feb-07 12:42
Mihai Nita8-Feb-07 12:42 
GeneralRe: It's still a good example of CLR integration Pin
carlop()8-Feb-07 21:11
carlop()8-Feb-07 21:11 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.