The problem is that the utf16_codecvt methods never get called and, therefore, the result is wrong. I have searched the net, but all I can find are examples of what is supposed to work. Unfortunately, none of them has worked. I have also seen other posters on the net with the same problem, but no one gave them an answer.

I have tested to make sure that the locale has the facet (utf16_codecvt), and it does. So I see no reason why its virtual methods are never called. Instead, the stream keeps calling the codecvt<wchar_t, char, mbstate_t> methods.

Any ideas?

C++
class utf16_codecvt : public std::codecvt<char16_t, char16_t, std::mbstate_t>
{
    ...//
};

void MyTestFunc()
{
    ... //
    std::wifstream myFile;
    std::locale myLoc = std::locale(myFile.getloc(), new utf16_codecvt);
    myFile.imbue(myLoc);
    myFile.open(pFileName, std::ios::in | std::ios::binary);
    ... //
    myFile.read(bom_buffer, 1);
    ... //
}

The following link gives an example of the types of things I am trying to do:
Unicode Files, P.J. Plauger, April 01, 1999: http://www.ddj.com/cpp/184403638?pgno=1
Updated 26-Nov-09 10:56am

1 solution

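From what I can tell, the C++ stream system presumes that files are sequences of bytes, not characters, even when you use wide streams. The 'wide' part of a wide stream (AFAICT) describes how the stream object interacts with your C++ code, not the layout of the underlying file. A file stream therefore asks its locale for a codecvt facet whose external type is char (codecvt<CharT, char, mbstate_t>, where CharT is the stream's character type), so a facet declared as codecvt<char16_t, char16_t, mbstate_t> is never looked up. Thus, your codecvt facet has to take in chars.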

By changing the declaration of your codecvt facet to the one shown below, I was able to get breakpoints set in the replacement facet to be hit.

C++
class utf16_codecvt : public std::codecvt<char16_t, char, std::mbstate_t>
{
   typedef std::codecvt<char16_t, char, std::mbstate_t> Base;
   typedef char16_t ElemT;
   typedef char ByteT;
   virtual result __CLR_OR_THIS_CALL do_in(std::mbstate_t& s,
      const ByteT *_First1, const ByteT *_Last1, const ByteT *& _Mid1,
      ElemT*_First2, ElemT* _Last2, ElemT *& _Mid2) const
   {	// convert bytes [_First1, _Last1) to [_First2, _Last2)
      return Base::do_in(s, _First1, _Last1, _Mid1, _First2, _Last2, _Mid2);
   }

   virtual result __CLR_OR_THIS_CALL do_out(std::mbstate_t& s,
      const ElemT*_First1, const ElemT*_Last1, const ElemT*& _Mid1,
      ByteT*_First2, ByteT*_Last2, ByteT*& _Mid2) const
   {	// convert [_First1, _Last1) to bytes [_First2, _Last2)
      return Base::do_out(s, _First1, _Last1, _Mid1, _First2, _Last2, _Mid2);
   }

   virtual result __CLR_OR_THIS_CALL do_unshift(std::mbstate_t& s,
      ByteT*_First2, ByteT*_Last2, ByteT*&_Mid2) const
   {	// generate bytes to return to default shift state
      return Base::do_unshift(s, _First2, _Last2, _Mid2);
   }

   virtual int __CLR_OR_THIS_CALL do_length(const std::mbstate_t& s, const ByteT*_First1,
      const ByteT*_Last1, size_t _Count) const
   {	// return min(_Count, converted length of bytes [_First1, _Last1))
      return Base::do_length(s, _First1, _Last1, _Count);
   }
};


So, your replacement facet will have to know that it needs to read two bytes for every character (and write two bytes per character going the other way, obviously). The best reference for that sort of information is probably Standard C++ IOStreams and Locales by Angelika Langer and Klaus Kreft, but even then, locales and facets are heavy going in C++ :(
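
To make the "two bytes per character" part concrete, here is a minimal sketch of what do_in could look like. A few assumptions of mine, not from the original code: the file is little-endian UTF-16 with no BOM handling, surrogate pairs are passed through as raw 16-bit units, and the internal type is wchar_t, since that is the facet a std::wifstream looks up in standard C++. The names utf16le_codecvt and decode_utf16le are made up for illustration.

```cpp
#include <cstddef>
#include <cwchar>   // std::mbstate_t
#include <locale>
#include <string>

// Hypothetical facet: converts UTF-16LE bytes into wchar_t units.
// Only the input direction is sketched; do_out/do_unshift/do_length
// are left at the base-class behaviour.
class utf16le_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t>
{
protected:
    result do_in(std::mbstate_t&,
                 const char* from, const char* from_end, const char*& from_next,
                 wchar_t* to, wchar_t* to_end, wchar_t*& to_next) const override
    {
        // Consume two bytes per output character, low byte first.
        while (from_end - from >= 2 && to != to_end) {
            const unsigned lo = static_cast<unsigned char>(from[0]);
            const unsigned hi = static_cast<unsigned char>(from[1]);
            *to++ = static_cast<wchar_t>(lo | (hi << 8));
            from += 2;
        }
        from_next = from;
        to_next = to;
        // Leftover input (odd trailing byte or full output buffer): not done yet.
        return from == from_end ? ok : partial;
    }

    bool do_always_noconv() const noexcept override { return false; }
    int do_encoding() const noexcept override { return 2; }   // fixed width: 2 bytes
    int do_max_length() const noexcept override { return 2; }
};

// Hypothetical helper that drives the facet directly, without a file.
std::wstring decode_utf16le(const std::string& bytes)
{
    // The locale takes ownership of the facet and deletes it.
    const std::locale loc(std::locale::classic(), new utf16le_codecvt);
    // Same lookup a wchar_t file buffer performs on its locale.
    const auto& cvt =
        std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t>>(loc);

    std::wstring out(bytes.size() / 2, L'\0');
    if (out.empty())
        return out;

    std::mbstate_t state = std::mbstate_t();
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;
    cvt.in(state, bytes.data(), bytes.data() + bytes.size(), from_next,
           &out[0], &out[0] + out.size(), to_next);
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

The helper calls in() directly, so you can check the conversion without a file. To use the facet on a stream, it should be enough to imbue it the same way as in the question, std::locale(myFile.getloc(), new utf16le_codecvt), before calling open. I have not tried it against every standard library, so treat it as a starting point.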

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)