Double-Byte Character Support

In most East Asian languages, language-specific characters in code page strings (that is, Natural format A) are represented by 2 bytes (so-called double-byte characters), whereas ASCII characters are represented by 1 byte. A code page string therefore consists of characters of different lengths: some occupy 1 byte and others 2 bytes.
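
The mixed character lengths can be made visible outside of Natural. The following is a minimal Python sketch, not Natural code: it assumes that Python's built-in shift_jis codec is an acceptable stand-in for a double-byte code page, and the helper name byte_lengths is hypothetical.

```python
# Illustration only: Python's "shift_jis" codec stands in for a
# double-byte code page. This is not Natural code.

def byte_lengths(s: str, codepage: str = "shift_jis") -> list[int]:
    """Return the encoded length in bytes of each character of s."""
    return [len(ch.encode(codepage)) for ch in s]

# FULLWIDTH LATIN SMALL LETTER B, 'a', 'a', FULLWIDTH LATIN SMALL LETTER B
text = "\uff42aa\uff42"
encoded = text.encode("shift_jis")

print(byte_lengths(text))        # [2, 1, 1, 2]
print(encoded.hex().upper())     # 828261618282
print(len(text), len(encoded))   # 4 6
```

Four characters occupy six bytes, so byte positions and character positions in such a string do not coincide.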

Natural provides basic support for double-byte characters. On Windows, this support is activated when both the Natural default code page and the Windows system code page are double-byte code pages. If no specific code page is defined in Natural, it is sufficient that the Windows system code page is a double-byte code page. On Linux, double-byte character support is activated when the Natural default code page is a double-byte code page.

When double-byte character support is enabled, Natural ensures that every string manipulation treats a double-byte character as a unit. This is essential for preserving the meaning of a string.

If a single leading or trailing byte of a double-byte character is left over after the manipulation of a variable of format A (for example, after extracting a substring with the SUBSTRING option), this byte is replaced with a blank character.
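
The blank replacement described above can be modelled as follows. This is a Python sketch of the described behaviour, not Natural's implementation. It assumes Shift_JIS lead-byte ranges and 1-based byte positions, and the helper names is_lead_byte, byte_roles and dbcs_substring are hypothetical.

```python
def is_lead_byte(b: int) -> bool:
    """Shift_JIS lead-byte ranges: 0x81-0x9F and 0xE0-0xFC."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC

def byte_roles(data: bytes) -> list[str]:
    """Classify each byte as 'S' (single), 'L' (lead) or 'T' (trail)."""
    roles, i = [], 0
    while i < len(data):
        if is_lead_byte(data[i]) and i + 1 < len(data):
            roles += ["L", "T"]
            i += 2
        else:
            roles.append("S")
            i += 1
    return roles

def dbcs_substring(data: bytes, start: int, length: int) -> bytes:
    """Cut `length` bytes from 1-based byte position `start`.

    An orphaned half of a double-byte character at either end of the
    result is replaced with a blank (0x20).
    """
    roles = byte_roles(data)
    lo = start - 1
    out = bytearray(data[lo:lo + length])
    if out and roles[lo] == "T":                  # trail byte without its lead
        out[0] = 0x20
    if out and roles[lo + len(out) - 1] == "L":   # lead byte without its trail
        out[-1] = 0x20
    return bytes(out)

value = bytes.fromhex("828261618282")
print(dbcs_substring(value, 2, 3).hex())   # 206161
print(dbcs_substring(value, 3, 3).hex())   # 616120
```

In the first call the cut starts on a trailing byte, in the second it ends on a leading byte. In both cases the orphaned byte becomes a blank instead of merging with a neighbouring byte.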

For the example below, the code page Shift_JIS is selected. Variable #A contains a string which consists of four characters. The first and last characters are the double-byte character "FULLWIDTH LATIN SMALL LETTER B", which is represented in code page Shift_JIS by the byte sequence H'8282'. The second and third characters are the single-byte character "LATIN SMALL LETTER A", which is represented by the single byte H'61'. Thus, the hexadecimal representation of the full string is H'828261618282'.

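DEFINE DATA LOCAL
 1   #A   (A10)
END-DEFINE

/* First and last character: full-width b (H'8282'); in between: a a (H'61')
#A := 'ｂaaｂ'

/* Show the value and its first 6 bytes in hexadecimal
WRITE #A #A (EM=H(6))
/* Search for full-width capital B (H'8261') and replace it with a (H'61')
EXAMINE #A FOR PATTERN 'Ｂ' REPLACE 'a'
WRITE #A #A (EM=H(6))

END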

Without double-byte character support, the output of the above program is as follows:

Page         1                             07-02-07    17:22:09

ｂaaｂ    828261618282
Ｂaｂ     826161828220

This result occurs because the character "ｂ" (H'8282' in code page Shift_JIS) is not treated as one unit. The trailing byte of the first "ｂ" (H'82') and the following character "a" (H'61') are wrongly interpreted as the double-byte character "Ｂ" (H'8261' in code page Shift_JIS), which is then replaced with "a" (H'61').

With double-byte character support, the output of the program is as expected:

Page         1                             07-02-07    17:22:09

ｂaaｂ    828261618282
ｂaaｂ    828261618282
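
The difference between the two outputs above can be reproduced outside of Natural. The following Python sketch is a model, not Natural's EXAMINE implementation: it compares a byte-wise search with a search that only tests for the pattern at character boundaries. It assumes Shift_JIS lead-byte ranges, and the names pad, is_lead_byte and dbcs_replace are hypothetical.

```python
FIELD_LEN = 10  # mirrors the definition #A (A10)

def pad(data: bytes) -> bytes:
    """Blank-pad or truncate to the field length, like a format A field."""
    return data[:FIELD_LEN].ljust(FIELD_LEN, b" ")

def is_lead_byte(b: int) -> bool:
    """Shift_JIS lead-byte ranges: 0x81-0x9F and 0xE0-0xFC."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC

def dbcs_replace(data: bytes, pattern: bytes, replacement: bytes) -> bytes:
    """Replace pattern only where a match starts on a character boundary."""
    out, i = bytearray(), 0
    while i < len(data):
        if data.startswith(pattern, i):
            out += replacement
            i += len(pattern)
        else:
            # Step over a whole character so a trail byte is never a start.
            step = 2 if is_lead_byte(data[i]) and i + 1 < len(data) else 1
            out += data[i:i + step]
            i += step
    return bytes(out)

value = "\uff42aa\uff42".encode("shift_jis")   # H'828261618282'
pattern = "\uff22".encode("shift_jis")         # full-width capital B, H'8261'

naive = pad(value.replace(pattern, b"a"))      # byte-wise search
aware = pad(dbcs_replace(value, pattern, b"a"))

print(naive[:6].hex().upper())   # 826161828220
print(aware[:6].hex().upper())   # 828261618282
```

The byte-wise search finds H'8261' straddling the first two characters and corrupts the string. The boundary-aware search finds no match and leaves the value unchanged.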

Note:
On Windows, the Natural output window is Unicode-enabled, which means that all fields are now in Unicode format. As a result, the behavior of the output window has changed slightly for A format fields that contain double-byte characters. In an A format input field, the user can now enter as many characters as the Unicode string length of the field allows. When the user leaves the field and the default code page is a double-byte code page, all characters that do not fit into the target A format field are removed. For example, an A10 field can hold 5 double-byte characters. In the output window, this field is represented by a Unicode field of length 10 with display length 5, so the user can enter 10 double-byte characters in the input field. When the user moves the cursor to another field on the page or leaves the page by pressing ENTER, the content of the field is converted to code page format, and only the first 5 double-byte characters remain.
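
The A10 example in this note can be checked with a short calculation. The following Python sketch only models the arithmetic of the conversion and is not how the output window is implemented. It assumes Shift_JIS as the default code page, and the helper name fit_to_a_field is hypothetical.

```python
def fit_to_a_field(text: str, field_bytes: int,
                   codepage: str = "shift_jis") -> str:
    """Keep the leading characters whose encoding fits in field_bytes."""
    kept, used = [], 0
    for ch in text:
        size = len(ch.encode(codepage))
        if used + size > field_bytes:
            break                      # this and all later characters are dropped
        kept.append(ch)
        used += size
    return "".join(kept)

typed = "\uff42" * 10                  # 10 double-byte characters entered
stored = fit_to_a_field(typed, 10)     # target field is A10 (10 bytes)

print(len(typed), len(stored), len(stored.encode("shift_jis")))   # 10 5 10
```

Ten single-byte characters would fit unchanged, because each of them needs only one of the 10 bytes.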