View Issue Details

ID: 0022125
Project: Open CASCADE
Category: OCCT:Foundation Classes
View Status: public
Last Update: 2017-10-01 19:33
Reporter: ssv
Assigned To: bugmaster
Priority: normal
Severity: trivial
Status: closed
Resolution: fixed
OS: All
Target Version: 6.8.0
Fixed in Version: 6.8.0
Summary: 0022125: TCollection_ExtendedString: conversion from UTF-8 to unicode
Description: There is a problem in the following constructor of the TCollection_ExtendedString class:

TCollection_ExtendedString(const Standard_CString astring,
                           const Standard_Boolean isMultiByte);

This constructor is used to restore a unicode string from its UTF-8
representation when isMultiByte = Standard_True.

Internally it invokes the ConvertToUnicode3B and ConvertToUnicode2B functions,
which are intended to construct a single Standard_ExtCharacter from the 3 or 2
chars passed, respectively. The ConvertToUnicodeXB functions use the following
data structure:

union {
  struct {
    unsigned char h;
    unsigned char l;
  } hl;
  Standard_ExtCharacter chr;
} EL;

For example, take the character with code 12510 (U+30DE, the Japanese katakana
character マ). It has the following UTF-8 representation (3 bytes):

1110_0011 10_000011 10_011110

which must be restored to the 16-bit value 0011000011011110. However,
ConvertToUnicode3B returns the following instead:

11011110 00110000 (EL.hl.l and EL.hl.h appear in the wrong order).
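
The expected result for the example above can be checked with a short sketch that assembles the value with shifts only, so it does not depend on the byte order of the host. The helper name decodeUtf8ThreeBytes is hypothetical and is not part of OCCT; it assumes the caller passes a well-formed 3-byte sequence.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of OCCT): decodes one 3-byte UTF-8 sequence
// of the form 1110xxxx 10xxxxxx 10xxxxxx into a 16-bit code unit.
// Only shifts and masks are used, so the result is the same on
// little-endian and big-endian hosts.
static std::uint16_t decodeUtf8ThreeBytes (const unsigned char* p)
{
  // Assumption of this sketch: the input is well formed.
  assert ((p[0] & 0xF0u) == 0xE0u);
  assert ((p[1] & 0xC0u) == 0x80u);
  assert ((p[2] & 0xC0u) == 0x80u);

  return static_cast<std::uint16_t> (((p[0] & 0x0Fu) << 12)   // 4 payload bits
                                   | ((p[1] & 0x3Fu) << 6)    // 6 payload bits
                                   |  (p[2] & 0x3Fu));        // 6 payload bits
}
```

For the bytes E3 83 9E (1110_0011 10_000011 10_011110) this yields 0x30DE, i.e. 12510.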

The issue was reproduced on Win32.

It was encountered while implementing a unified IGES-reading routine that
accepts a UTF-8 string as a filename. A draft workaround for such a routine is
attached (Win32 only); it uses the MultiByteToWideChar Windows API function
instead.
Tags: No tags attached.
Test case number: bugs fclasses bug22125

Attached Files

  • iges_read_utf8_win.zip (3,258 bytes)
  • Part1_badname.zip (171,673 bytes)
  • test_iges_jp.tcl (100 bytes)

Relationships

related to 0022484 closed bugmaster Open CASCADE UNICODE characters support.
related to 0029081 closed abv Community Foundation Classes, OSD_OpenStream - handle UNICODE file paths specifically in case of Mingw-w64
child of 0014673 closed bugmaster Open CASCADE Provide true support for Unicode symbols

Activities

2010-11-30 12:33

 

iges_read_utf8_win.zip (3,258 bytes)

pdn

2014-10-03 14:07

reporter   ~0032629

Please provide a test file

ssv

2014-10-14 16:23

developer  

Part1_badname.zip (171,673 bytes)

ssv

2014-10-14 16:24

developer   ~0033071

Please find attached an IGES file with a Japanese name.

pdn

2014-10-14 19:15

reporter  

test_iges_jp.tcl (100 bytes)

pdn

2014-10-14 19:16

reporter   ~0033080

Test script added.
This problem will be resolved after 0025367 integration

kgv

2014-10-14 20:06

developer   ~0033086

> This problem will be resolved after 0025367 integration
The problem in the description is unrelated to the 0025367 patch.

kgv

2014-10-16 10:36

developer   ~0033182

Dear bugmaster,

please switch the bug to "verified".
The issue has been solved within the patch for 0022484:
 inline Standard_ExtCharacter ConvertToUnicode3B (unsigned char *p)
 {
   // *p, *(p+1), *(p+2) =>0 , 1, 2
+  // little endian
   union {
     struct {
-      unsigned char  h;
       unsigned char  l;
+      unsigned char  h;
     } hl;
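
To see why the member order matters, here is a minimal sketch, under the assumption that the two bytes are read back as one 16-bit value in memory order, which is what the union effectively does. All function names below are hypothetical and are not OCCT code; memcpy is used in place of the union because reading an inactive union member is not defined by the C++ standard.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical illustration (not OCCT code): builds a 16-bit value from two
// bytes laid out in memory order, first byte at the lower address.
static std::uint16_t fromMemoryOrder (unsigned char theFirst, unsigned char theSecond)
{
  const unsigned char aBytes[2] = { theFirst, theSecond };
  std::uint16_t aValue = 0;
  std::memcpy (&aValue, aBytes, sizeof(aValue));
  return aValue;
}

// Returns true when the host stores the least significant byte first.
static bool isLittleEndianHost()
{
  return fromMemoryOrder (0x01, 0x00) == 0x0001;
}

// Endianness-independent alternative: combine the bytes arithmetically.
static std::uint16_t fromHighLow (unsigned char theHigh, unsigned char theLow)
{
  return static_cast<std::uint16_t> ((theHigh << 8) | theLow);
}
```

On a little-endian host, storing the high byte 0x30 first and the low byte 0xDE second reads back as 0xDE30, which matches the wrong value from the description; storing the low byte first gives 0x30DE. fromHighLow (0x30, 0xDE) gives 0x30DE on any host.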


Available UTF-8/UTF-16 conversion APIs convert the filename "Part1_badname_マヹヱ.igs" from test case in the same way:
Utf16 SOURCE:
57 00 3A 00 5C 00 50 00|61 00 72 00 74 00 31 00
5F 00 62 00 61 00 64 00|6E 00 61 00 6D 00 65 00
5F 00 DE 30 F9 30 F1 30|2E 00 69 00 67 00 73 00

Utf16 TCol from Utf8:
57 00 3A 00 5C 00 50 00|61 00 72 00 74 00 31 00
5F 00 62 00 61 00 64 00|6E 00 61 00 6D 00 65 00
5F 00 DE 30 F9 30 F1 30|2E 00 69 00 67 00 73 00

Utf16 NCol from Utf8:
57 00 3A 00 5C 00 50 00|61 00 72 00 74 00 31 00
5F 00 62 00 61 00 64 00|6E 00 61 00 6D 00 65 00
5F 00 DE 30 F9 30 F1 30|2E 00 69 00 67 00 73 00

Utf8  WApi from Utf16:
57 3A 5C 50 61 72 74 31|5F 62 61 64 6E 61 6D 65
5F E3 83 9E E3 83 B9 E3|83 B1 2E 69 67 73
Utf8  NCol from Utf16:
57 3A 5C 50 61 72 74 31|5F 62 61 64 6E 61 6D 65
5F E3 83 9E E3 83 B9 E3|83 B1 2E 69 67 73
Utf8  TCol from Utf16:
57 3A 5C 50 61 72 74 31|5F 62 61 64 6E 61 6D 65
5F E3 83 9E E3 83 B9 E3|83 B1 2E 69 67 73
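
The Japanese part of the dumps above can be cross-checked by hand with a small encoder sketch. The helper encodeBmpToUtf8 is hypothetical (not part of OCCT or the Windows API) and only handles code units from the Basic Multilingual Plane; surrogate pairs are out of scope here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of OCCT): encodes one UTF-16 code unit from
// the Basic Multilingual Plane as UTF-8 and returns the number of bytes
// written (1 to 3). Surrogate pairs are deliberately not handled.
static std::size_t encodeBmpToUtf8 (std::uint16_t theChar, unsigned char theOut[3])
{
  if (theChar < 0x80u)
  {
    theOut[0] = static_cast<unsigned char> (theChar);                  // 0xxxxxxx
    return 1;
  }
  if (theChar < 0x800u)
  {
    theOut[0] = static_cast<unsigned char> (0xC0u | (theChar >> 6));   // 110xxxxx
    theOut[1] = static_cast<unsigned char> (0x80u | (theChar & 0x3Fu)); // 10xxxxxx
    return 2;
  }
  theOut[0] = static_cast<unsigned char> (0xE0u | (theChar >> 12));          // 1110xxxx
  theOut[1] = static_cast<unsigned char> (0x80u | ((theChar >> 6) & 0x3Fu)); // 10xxxxxx
  theOut[2] = static_cast<unsigned char> (0x80u | (theChar & 0x3Fu));        // 10xxxxxx
  return 3;
}
```

With this sketch the UTF-16 code units 30DE, 30F9 and 30F1 (stored as DE 30, F9 30, F1 30 in the little-endian dump) encode to E3 83 9E, E3 83 B9 and E3 83 B1, the same bytes as in the UTF-8 dumps.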


static TCollection_AsciiString formatHex (const Standard_Byte* theData,
                                          const Standard_Size  theSize)
{
  TCollection_AsciiString anOut;
  char aByte[4];
  for (size_t aByteId = 0; aByteId < theSize; ++aByteId)
  {
    unsigned char aChar = theData[aByteId];
    char anEsc = ' ';
    if (     (aByteId + 1) % 16 == 0 && aByteId != 0)
    {
      anEsc = '\n';
    }
    else if ((aByteId + 1) % 8  == 0)
    {
      anEsc = '|';
    }
    _snprintf (aByte, 4, "%02X%c", (unsigned int )aChar, anEsc);
    anOut += aByte;
  }
  return anOut;
}

static Standard_Integer testunicode (Draw_Interpretor& /*theDI*/, Standard_Integer , const char** )
{
  wchar_t aFilePath [MAX_PATH]; aFilePath [0] = L'\0';
  wchar_t aFileTitle[MAX_PATH]; aFileTitle[0] = L'\0';
  OPENFILENAMEW anOpenStruct; memset (&anOpenStruct, 0, sizeof(OPENFILENAMEW));
  anOpenStruct.lStructSize     = sizeof(OPENFILENAMEW);
  anOpenStruct.nFilterIndex    = 1;
  anOpenStruct.lpstrFile       = aFilePath;
  anOpenStruct.nMaxFile        = MAX_PATH; // buffer size in characters, not bytes
  anOpenStruct.lpstrFileTitle  = aFileTitle;
  anOpenStruct.nMaxFileTitle   = MAX_PATH; // buffer size in characters, not bytes
  anOpenStruct.lpstrTitle      = L"No Title";
  anOpenStruct.Flags = OFN_PATHMUSTEXIST | OFN_FILEMUSTEXIST;
  if (!GetOpenFileNameW (&anOpenStruct)
   || *anOpenStruct.lpstrFile == L'\0')
  {
    return 0;
  }

  char aBuffU8[4096];
  WideCharToMultiByte (CP_UTF8, 0, anOpenStruct.lpstrFile, -1, aBuffU8, 4096, NULL, NULL);
  NCollection_String anUtf8NCol (anOpenStruct.lpstrFile, -1);

  char aBuffU8UsingExt[4096];
  char* aPtr = aBuffU8UsingExt;
  TCollection_ExtendedString anExtWide ((Standard_ExtString )anOpenStruct.lpstrFile);
  anExtWide.ToUTF8CString (aPtr);

  TCollection_AsciiString aHexUtf16Src = formatHex ((const Standard_Byte* )anOpenStruct.lpstrFile, wcslen (anOpenStruct.lpstrFile) * 2);
  TCollection_AsciiString aHexUtf8WApi = formatHex ((const Standard_Byte* )aBuffU8, strlen(aBuffU8));
  TCollection_AsciiString aHexUtf8NCol = formatHex ((const Standard_Byte* )anUtf8NCol.ToCString(), anUtf8NCol.Size());
  TCollection_ExtendedString anExtWideFromUtf8 (aBuffU8, Standard_True);
  TCollection_AsciiString aHexUtf16ExtFromU8 = formatHex ((const Standard_Byte* )anExtWideFromUtf8.ToExtString(), anExtWideFromUtf8.Length() * 2);

  TCollection_AsciiString aHexUtf8TColEx = formatHex ((const Standard_Byte* )aBuffU8UsingExt, strlen(aBuffU8UsingExt));

  NCollection_UtfWideString anUtf16NColFromUtf8 (aBuffU8, -1);
  TCollection_AsciiString aHexUtf16NColFromU8 = formatHex ((const Standard_Byte* )anUtf16NColFromUtf8.ToCString(), anUtf16NColFromUtf8.Size());

  std::cerr << "Utf16 SOURCE:\n"         << aHexUtf16Src << "\n"
            << "Utf16 TCol from Utf8:\n" << aHexUtf16ExtFromU8  << "\n"
            << "Utf16 NCol from Utf8:\n" << aHexUtf16NColFromU8 << "\n"
            << "Utf8  WApi from Utf16:\n" << aHexUtf8WApi  << "\n"
            << "Utf8  NCol from Utf16:\n" << aHexUtf8NCol  << "\n"
            << "Utf8  TCol from Utf16:\n" << aHexUtf8TColEx  << "\n";
  return 0;
}

bugmaster

2014-10-17 14:12

administrator   ~0033257

Mikhail,

Please create a test case.

git

2014-10-31 16:47

administrator   ~0033966

Branch CR22125 has been created by apn.

SHA-1: 28d7ddb64363611911034b716439922bc0b362cf


Detailed log of new commits:

Author: apn
Date: Fri Oct 31 16:46:53 2014 +0300

    0022125: TCollection_ExtendedString: conversion from UTF-8 to unicode
    
    Added test case bugs/fclasses/bug22125

apn

2014-10-31 16:47

administrator   ~0033967

The problem is not reproduced on the current state of master on Windows and Debian60-64 in Release and Debug modes.
Branch CR22125 was created. It contains test case:
bugs fclasses bug22125 - OK

git

2014-11-12 08:55

administrator   ~0034243

Branch CR22125 has been deleted by kgv.

SHA-1: 28d7ddb64363611911034b716439922bc0b362cf

Related Changesets

occt: master 5e5ce65b

2014-10-31 13:46:53

apn


Committer: bugmaster
0022125: TCollection_ExtendedString: conversion from UTF-8 to unicode

Added test case bugs/fclasses/bug22125
Affected Issues
0022125
add - tests/bugs/fclasses/bug22125

Issue History

Date Modified Username Field Change
2010-11-30 14:39 abv CC => pdn, nkv
2011-08-02 11:23 bugmaster Category OCCT:FDC => OCCT:Foundation Classes
2011-12-05 10:45 abv Relationship added child of 0014673
2011-12-20 15:02 pdn Fixed in Version EMPTY =>
2011-12-20 15:02 pdn Target Version => 6.5.3
2011-12-20 15:02 pdn Description Updated
2012-02-02 10:15 abv Target Version 6.5.3 => 6.5.4
2012-10-21 11:16 abv Target Version 6.5.4 => 6.6.0
2013-02-28 17:06 abv Target Version 6.6.0 => 6.7.0
2013-11-06 15:10 kgv Relationship added related to 0022484
2013-11-06 15:11 kgv Target Version 6.7.0 => 6.7.1
2014-04-04 18:32 abv Target Version 6.7.1 => 6.8.0
2014-09-11 10:24 abv Target Version 6.8.0 => 7.1.0
2014-10-03 14:07 pdn Note Added: 0032629
2014-10-03 14:07 pdn Assigned To bugmaster => ssv
2014-10-03 14:07 pdn Status new => feedback
2014-10-14 16:23 ssv File Added: Part1_badname.zip
2014-10-14 16:24 ssv Note Added: 0033071
2014-10-14 16:24 ssv Assigned To ssv => pdn
2014-10-14 16:29 pdn Status feedback => assigned
2014-10-14 19:15 pdn File Added: test_iges_jp.tcl
2014-10-14 19:16 pdn Note Added: 0033080
2014-10-14 19:17 pdn Assigned To pdn => kgv
2014-10-14 19:17 pdn Status assigned => resolved
2014-10-14 20:05 kgv Assigned To kgv => pdn
2014-10-14 20:05 kgv Status resolved => assigned
2014-10-14 20:06 kgv Note Added: 0033086
2014-10-16 10:36 kgv Note Added: 0033182
2014-10-16 10:36 kgv Assigned To pdn => bugmaster
2014-10-16 10:36 kgv Status assigned => feedback
2014-10-16 10:36 kgv Resolution open => fixed
2014-10-16 10:36 kgv Target Version 7.1.0 => 6.8.0
2014-10-17 14:12 bugmaster Assigned To bugmaster => mkv
2014-10-17 14:12 bugmaster Note Added: 0033257
2014-10-20 12:03 bugmaster Assigned To mkv => apn
2014-10-31 16:47 git Note Added: 0033966
2014-10-31 16:47 apn Note Added: 0033967
2014-10-31 16:47 apn Test case number => bugs fclasses bug22125
2014-10-31 16:47 apn Assigned To apn => bugmaster
2014-10-31 16:47 apn Status feedback => tested
2014-11-06 15:18 bugmaster Changeset attached => occt master 5e5ce65b
2014-11-06 15:18 bugmaster Status tested => verified
2014-11-11 12:42 aiv Fixed in Version => 6.8.0
2014-11-11 13:03 aiv Status verified => closed
2014-11-12 08:55 git Note Added: 0034243
2017-10-01 19:33 abv Relationship added related to 0029081