Accessing legacy data in a .NET environment

Cd-MaN

Feb 26, 2005

An article on how to read and write fixed-size record data from the .NET environment

Download source files - 9.4 Kb

Sample Image - cpp_data_caster.png

Introduction

Access to legacy data is required very often. Unless you are developing a new system from scratch or working through a data abstraction layer (a database system, for example), you cannot avoid importing, and possibly exporting, legacy data.

Many of these data files consist of fixed-size binary records. These were very popular because almost all programming languages supported them. In languages with pointers, such as C, the support was very easy to implement: you read the data into a buffer and then told the compiler how to interpret it (by casting a pointer to the record type, for example).
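
As a rough illustration of that older style, here is a minimal sketch in plain C++ (not the managed C++ of the attached class). The `LegacyRecord` type and the `read_record` function are hypothetical names invented for this example, and the sketch assumes that the file uses the same byte order and floating point format as the machine reading it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical 6-byte record: a 4-byte float followed by a 2-byte unsigned integer.
// pack(1) removes padding so the in-memory layout matches the bytes in the file.
#pragma pack(push, 1)
struct LegacyRecord {
    float         float_value;
    std::uint16_t int_value;
};
#pragma pack(pop)

static_assert(sizeof(LegacyRecord) == 6, "record must match the on-disk size");

// Interpret the first sizeof(LegacyRecord) bytes of a buffer as one record.
// memcpy is used instead of a raw pointer cast to avoid alignment problems.
LegacyRecord read_record(const unsigned char* buffer, std::size_t length) {
    assert(length >= sizeof(LegacyRecord));
    LegacyRecord rec;
    std::memcpy(&rec, buffer, sizeof rec);
    return rec;
}
```

Reading a whole file then comes down to calling `read_record` once for every 6-byte slice of the data.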

With the arrival of modern languages (Perl, Java and the .NET family), pointers disappeared and the preferred storage methods changed. If the data is used only internally, the serialization mechanism built into the language is preferred because it is easy to use. For data files that are, or will be, used by other systems, the storage method can be XML or some kind of database management system (there are a few small, fast database engines that can be used without complicated installation or heavy memory usage, Sleepycat’s Berkeley DB for example).

In this article I would like to show how to read and write fixed-size records from streams in the .NET environment. More specifically, I will present some methods for converting the most commonly used types to and from byte arrays. The included source code is in managed C++; however, it can be referenced and used from any .NET project.

Some words about the solution

If .NET allowed direct access to memory, this would be an easy problem and I would not be writing an article about it :). However, it doesn’t, and this is a good thing: thanks to this approach we have far fewer memory leaks, a garbage collector, increased safety, and a hard time solving this problem. One of my first approaches was to import the RtlMoveMemory function from kernel32.dll and try fooling around with it. The results were not satisfying, however, and I soon found a much more elegant solution that can be implemented entirely inside the .NET Framework (which should make it compatible with every implementation, probably even Mono, though I didn’t check).

The basic idea is the same: we declare two memory regions, one as a byte buffer and the other as the desired type, and we move the data between them. Let’s say that our record consists of two elements: a 4-byte float and a 2-byte unsigned integer. We begin by declaring two classes:

//the record itself: fields are marshaled in declaration order
[StructLayout(System::Runtime::InteropServices::LayoutKind::Sequential)]
private __gc class record_holder {
public:
  [MarshalAs(System::Runtime::InteropServices::UnmanagedType::R4)]
  float float_value;          //4 byte floating point value
  [MarshalAs(System::Runtime::InteropServices::UnmanagedType::U2)]
  unsigned short int_value;   //2 byte unsigned integer
};
//the same memory region viewed as raw bytes
[StructLayout(System::Runtime::InteropServices::LayoutKind::Sequential)]
private __gc class record_array_holder {
public:
  [MarshalAs(System::Runtime::InteropServices::UnmanagedType::ByValArray, 
                                                 SizeConst=8)]Byte buffer[];
};
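
A note on the `SizeConst=8` above: the two fields only add up to 6 bytes, so the 8 most likely reflects alignment padding added to the sequential layout. This is an interpretation, not something the text states. The sketch below shows the same effect in plain C++ on common compilers; `DefaultLayout` and `PackedLayout` are hypothetical names used only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Default layout: the compiler pads the struct to a multiple of the
// alignment of its widest member (4 bytes for float on common platforms).
struct DefaultLayout {
    float         float_value;
    std::uint16_t int_value;
};

// Packed layout: no padding, so the size is exactly 4 + 2 bytes.
#pragma pack(push, 1)
struct PackedLayout {
    float         float_value;
    std::uint16_t int_value;
};
#pragma pack(pop)

constexpr std::size_t default_size = sizeof(DefaultLayout);  // 8 on common compilers
constexpr std::size_t packed_size  = sizeof(PackedLayout);   // 6
```

If the legacy file stores such records back to back with no padding, the record classes would need a packed layout as well. In .NET the `Pack` field of the `StructLayout` attribute exists for this purpose.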

Now we can convert between them using the following code:

using namespace System::Runtime::InteropServices;
...
//create a new instance of the record class
record_holder *rec = new record_holder; 
rec->float_value = 1.0f; //assign values to the member elements
rec->int_value = 2; 
//create an unmanaged memory buffer and copy the object there
//(SizeOf and StructureToPtr take the instance, not the type name)
IntPtr memBuff = Marshal::AllocHGlobal(Marshal::SizeOf(rec));
Marshal::StructureToPtr(rec, memBuff, false);
//now create a byte array and copy the bytes from the allocated memory area 
//one by one
Byte buffer[] = new Byte[Marshal::SizeOf(rec)];
for(int i = 0; i < buffer->Length; i++) 
    buffer[i] = Marshal::ReadByte(memBuff, i);
//destroy the memory buffer: we need to do this explicitly since it is 
//outside the garbage collector's reach
Marshal::FreeHGlobal(memBuff);

And we are done. The buffer variable now contains the byte sequence corresponding to the record and is compatible with the usual binary file access methods. The backwards conversion is also straightforward:

using namespace System::Runtime::InteropServices;
...
record_array_holder *holder = new record_array_holder;
record_holder *valueHolder = new record_holder;
holder->buffer = new Byte[8];
//fill up the buffer with some values, possibly read from a file stream
holder->buffer[0] = ...
holder->buffer[2] = ...
...
//allocate a block of unmanaged memory large enough for one record
IntPtr memBuff = Marshal::AllocHGlobal(Marshal::SizeOf(valueHolder));
//write the byte array holder to the memory area
Marshal::StructureToPtr(holder, memBuff, false);
//and now read the value holder back from it. We can do this since they are 
//the same size
Marshal::PtrToStructure(memBuff, valueHolder);
//free the memory
Marshal::FreeHGlobal(memBuff);

Now we can access the data as the fields of the valueHolder object. This is the basic approach that I used to write the general conversion class attached to the article. It is written in managed C++ and can be used in any .NET project by adding it as a reference (I personally use it in a VB .NET project). Most of the methods are self-explanatory; however, I will go through the specifics of each data type below.

One thing I would like to mention is that the included class supports both little endian and big endian byte order. To switch between them, set the ReverseByteOrder field of the class instance accordingly and the code will perform the needed changes. The methods are thread safe, so if you decide to share a single instance of the object for performance reasons, you can do that. However, I would like to warn you that the mbf_ functions are not optimal when ReverseByteOrder is set to true (basically, they reverse the bytes three times). This was done to make the code easier to understand, but if you would like to use them in a high performance system, I suggest that you modify them.

Also, the code doesn’t really handle exceptions, so you should be prepared to catch them wherever you call the functions.

Integers and byte order

There is nothing interesting here. You can convert the most common integer formats (2, 4 or 8 bytes, signed or unsigned) to and from byte arrays using the method presented above, with the ReverseByteOrder field selecting the byte order.
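
To make the byte order handling concrete, here is a minimal sketch in plain C++ of what such a switch has to do. It is not the code of the attached class; `reverse_field` and `read_unsigned` are hypothetical helpers written for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reverse the bytes of one field in place, which converts it between
// little endian and big endian representation.
void reverse_field(unsigned char* field, std::size_t size) {
    std::reverse(field, field + size);
}

// Assemble an unsigned integer of `size` bytes (1 to 8) from a buffer,
// independently of the byte order of the machine running the code.
std::uint64_t read_unsigned(const unsigned char* field, std::size_t size,
                            bool big_endian) {
    assert(size >= 1 && size <= 8);
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < size; ++i) {
        // Always consume the most significant byte first.
        const std::size_t index = big_endian ? i : size - 1 - i;
        value = (value << 8) | field[index];
    }
    return value;
}
```

For example, the two bytes 0x34 0x12 read as little endian give 0x1234, and so do the bytes 0x12 0x34 read as big endian.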

Floating point numbers and the Microsoft Binary Format

Here we have the two common floating point formats (4 and 8 bytes). What is interesting is the support for the Microsoft Binary Format (MBF). All modern, and most of the not so modern, programming languages use the IEEE standard for storing floating point numbers. However, in the early days (when mathematical coprocessors were not common in PCs), Microsoft developed its own format for storing them. MBF numbers are the same size as their IEEE counterparts, but the bits are distributed differently between the fields (sign, mantissa, exponent). Their advantage was that software emulation of floating point arithmetic could run slightly faster with MBF. When coprocessors became mainstream, support for the format was slowly abandoned. Still, some of you might be in the same situation as I am, so I included some functions for working with it. These are the functions beginning with the "mbf_" prefix. The conversion functions were not written by me but taken from different sources on the Internet; I just adapted them. Also, I've read that until VC++ 6.0 these routines shipped with the compiler's runtime library.
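
To show what such a conversion involves, here is a minimal sketch in plain C++ for the 4-byte case. It is not the code of the attached mbf_ functions. It assumes the commonly described MBF single layout: the exponent in the last byte with a bias that is 2 higher than the IEEE one, the sign in the top bit of the third byte, and a 23-bit mantissa with an implied leading 1. The name `mbf_single_to_ieee` is hypothetical, and values too small for a normalized IEEE single are simply flushed to zero.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Convert a 4-byte MBF single, given as it is stored in the file
// (least significant mantissa byte first), to an IEEE 754 float.
// Assumes the host's float is an IEEE 754 single.
float mbf_single_to_ieee(const unsigned char mbf[4]) {
    assert(mbf != nullptr);
    const std::uint32_t exponent = mbf[3];
    // An exponent of 0 means zero in MBF; exponents 1 and 2 would need an
    // IEEE denormal, which this sketch flushes to zero for simplicity.
    if (exponent <= 2) return 0.0f;

    const std::uint32_t sign = (mbf[2] & 0x80u) ? 1u : 0u;
    const std::uint32_t mantissa =
        (static_cast<std::uint32_t>(mbf[2] & 0x7Fu) << 16) |
        (static_cast<std::uint32_t>(mbf[1]) << 8) |
         static_cast<std::uint32_t>(mbf[0]);

    // Same mantissa bits, exponent shifted by the difference in bias.
    const std::uint32_t bits = (sign << 31) | ((exponent - 2u) << 23) | mantissa;

    float result;
    static_assert(sizeof result == sizeof bits, "float must be 32 bits wide");
    std::memcpy(&result, &bits, sizeof result);
    return result;
}
```

Under these assumptions the MBF bytes 00 00 00 81 convert to 1.0 and the bytes 00 00 20 84 convert to 10.0.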

Strings (ASCII and Unicode)

Nothing interesting here either. These conversions are really simple and you don’t even need the method described in the introduction to achieve the result, but they are included so that you have a complete toolbox. If the ReverseByteOrder flag is set, the bytes of the Unicode version are swapped in pairs. The MSDN documentation states that the ASCIIEncoding class (used in the ascii methods) only handles the range 00-7F (the lower half of the code page) correctly, so be aware of that. I’m still looking into possible solutions. I would like to support the full character range and to make it possible to select the code page used for the conversion.
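
The 00-7F limitation can be handled explicitly when decoding a fixed-width text field. The following is a minimal sketch in plain C++, not the code of the attached class. `decode_ascii_field` is a hypothetical helper, and both the '?' substitution and the stripping of trailing spaces and NUL bytes are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode a fixed-width text field as 7-bit ASCII.
// Bytes above 0x7F are replaced with '?' and reported through
// had_non_ascii (which may be null if the caller does not care).
std::string decode_ascii_field(const unsigned char* field, std::size_t width,
                               bool* had_non_ascii) {
    assert(field != nullptr);
    std::string text;
    bool non_ascii = false;
    for (std::size_t i = 0; i < width; ++i) {
        const unsigned char b = field[i];
        if (b > 0x7F) {
            non_ascii = true;
            text.push_back('?');
        } else {
            text.push_back(static_cast<char>(b));
        }
    }
    // Fixed-width fields are usually padded; drop trailing spaces and NULs.
    while (!text.empty() && (text.back() == ' ' || text.back() == '\0'))
        text.pop_back();
    if (had_non_ascii != nullptr) *had_non_ascii = non_ascii;
    return text;
}
```

A caller can use the flag to decide whether the field should be decoded again with a specific code page instead.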

Conclusion

Hopefully, after reading the article and downloading the attached class, you will be able to read and write your legacy files. If you are writing a new system from scratch, please consider using an easy-to-decode format for information storage, so that the next generation of programmers won’t have to take narrow, hidden paths to import it. XML seems the best option for the moment.

Successful debugging to everyone,
Cd-MaN