Introduction to Internationalization

This document provides an introduction to the topic of character conversion with EntireX and covers the following topics:

Overview
ICU Conversion
Arabic Shaping
Hebrew Codepage 803
EBCDIC Stateful Codepages
Multibyte or Double-byte Codepages
Rules for Data Length Changes due to Character Conversion
Codepage Requirements for RPC Data Stream Conversions
Broker's Mechanism for Choosing the Character Conversion Approach
User Exits

Overview

Character conversion is a symmetric process. Everything that is valid for the request (client to server) relates also to the reply (server to client), with opposite roles. Therefore the terms sender and receiver are used instead of client and server in this section. Character conversion with EntireX provides the following:

Character conversion is available for senders and receivers, so any participant is able to work with the desired codepage. A participant tells the broker the codepage they use to send and receive messages. This means the broker is able to perform a conversion from/to the desired characters (code points) within the codepages.
Character conversion deals with the user's payload data in broker's send and receive buffers.
For ICU Conversion, the codepage that an EntireX component (sender or receiver) uses is described by a locale string (an alias name of a codepage) sent along with the request to the broker. Depending on the platform your EntireX component is running on, the locale string is either sent automatically by default or must be provided. A large set of codepages is available and supported; see ICU Converter Explorer.
Inside the EntireX Broker an automatic mechanism tries to find the best character conversion approach for your scenario. See Broker's Mechanism for Choosing the Character Conversion Approach.

The following sections discuss all of the character conversion approaches offered by EntireX.

ICU Conversion

This section covers the following topics:

Introduction
Configuration and Usage
ICU Resources
ICU Custom Converters

Introduction

ICU conversion is based on IBM's project International Components for Unicode. It is a mature, widely used set of C/C++ and Java libraries for Unicode support, software internationalization and globalization. ICU comes with a set of ICU converters (codepages) based on codepages from ISO and software vendors such as Microsoft and IBM. It is a standardized approach, and it is possible to extend the set with ICU Custom Converters.

Configuration and Usage

To use ICU conversion, the broker must be configured for the platform it is running on. See Configuring ICU Conversion under z/OS | UNIX | Windows | BS2000 | z/VSE.

By default it is assumed that the payload sent to/received from the broker matches the platform's default code page. EntireX components running under the Windows operating system and Java-based EntireX components send this platform default code page identifier automatically to the broker, so in most cases nothing needs to be configured or considered by a programmer here. Configuration or programmer attention is required in the following cases:

The EntireX component is running under the operating systems z/OS, UNIX, z/VSE or BS2000. No code page identifier is sent automatically on these platforms.
You require a code page other than the platform default.

Configuration for RPC Servers and Listeners

Refer to the respective sections of the documentation for how to enable RPC servers and listeners to send a codepage identifier to the broker or send a different identifier than the default codepage for the platform.

Configuring the RPC Server under C | .NET | XML/SOAP | Java | Micro Focus | z/OS (CICS, Batch, IMS) | z/VSE (Batch, CICS) | BS2000
Configuring the IBM MQ Side (RPC Server for IBM MQ | Listener for IBM MQ)
Configuring the IMS Connect Side
Configuring the CICS ECI Side
Configuring the IBM AS/400 Side

Configuration for RPC Clients

Enabling RPC clients to send a codepage identifier to the broker or send a different identifier than the default codepage for the platform is a task for a programmer. See the following sections of the documentation:

Using Internationalization with the C Wrapper | DCOM Wrapper | .NET Wrapper | Java Wrapper | PL/I Wrapper

Configuration for ACI-based Programming

For ACI-based programming, see:

Using Internationalization with Java ACI
For Broker ActiveX Control, see localeString under Reference > Properties
For EntireX Broker ACI for Assembler | C | COBOL | Natural | PL/I | RPG, see LOCALE-STRING under Broker ACI Fields.

ICU Resources

This section covers the following topics:

ICU Homepage
ICU Converter Explorer
UCM Format
ICU's Conversion Technique

ICU Homepage

The ICU home page (http://site.icu-project.org//) is the main point of entry for information on International Components for Unicode (ICU).

ICU Converter Explorer

The ICU Converter Explorer, available at http://demo.icu-project.org/icu-bin/convexp, shows aliases and further information on ICU converters. If the location has changed since this documentation was published, perform an internet search for the ICU home page and follow the links to the ICU Converter Explorer. An ICU converter is the codepage definition used by ICU; it is defined in UCM format.

The mapping of aliases to ICU converters is also provided as a text source within an EntireX installation. The location depends on the operating system:

UNIX: /<Install_Dir>/EntireX/etc/convrtrs.txt
Windows: <drive>:\SoftwareAG\EntireX\etc\convrtrs.txt

UCM Format

The codepage definition text files for ICU are described in UCM format (extension ".ucm"). You can edit them with any text editor. The most important section is the mapping table between the CHARMAP and END CHARMAP lines. Each line contains a Unicode code point and the related codepage character byte sequence followed by an optional precision indicator. Four kinds of definitions are supported by the precision indicator:

`0`	Normal roundtrip mapping from a Unicode code point and back.
`1`	Fallback mappings are used during conversion from Unicode to the codepage, but not back again. This definition may be present if a character exists in Unicode but not in the codepage. This feature is useful for human-readable output where the missing character is mapped to a similar looking one.
`2`	Substitution mappings resulting in assignment of the alternative substitution sequence (subchar1 in UCM format) when a non-convertible character occurs, instead of assigning the default substitution sequence (subchar in UCM format).
`3`	Reverse fallback mappings are used during conversion from the codepage to Unicode, but not back again. This definition results in assigning the same Unicode code point for different codepage character byte sequences.

This brief explanation does not intend to describe the UCM file format fully. For further explanation of the UCM file format, see the ICU home page under ICU Resources above.
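As a rough illustration of the mapping-table layout described above, the following Python sketch reads the lines between CHARMAP and END CHARMAP and extracts the Unicode code point, the codepage byte sequence and the precision indicator. The sample table, its mappings and the name parse_charmap are invented for this illustration and are not taken from any delivered converter. The sketch assumes that a missing precision indicator means a roundtrip mapping, and it ignores the header fields and other syntax of real UCM files.

```python
import re

# Hypothetical excerpt of a UCM mapping table (not from a delivered converter).
SAMPLE_UCM = r"""
CHARMAP
<U0041> \x41 |0
<U00E9> \x65 |1
<U3042> \x82\xA0 |0
<U001A> \x7F |3
END CHARMAP
"""

# One mapping line: <UXXXX>, one or more \xNN bytes, optional |0..|3 indicator.
LINE = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)(?:\s+\|([0-3]))?\s*$"
)

def parse_charmap(text):
    """Return a list of (code_point, codepage_bytes, precision) tuples."""
    mappings = []
    inside = False
    for line in text.splitlines():
        line = line.strip()
        if line == "CHARMAP":
            inside = True
        elif line == "END CHARMAP":
            inside = False
        elif inside and line:
            match = LINE.match(line)
            if match is None:
                continue  # skip comments and anything this sketch does not handle
            code_point = int(match.group(1), 16)
            raw = bytes(
                int(h, 16) for h in re.findall(r"\\x([0-9A-Fa-f]{2})", match.group(2))
            )
            # Assumption: no indicator is treated as a roundtrip mapping (0).
            precision = int(match.group(3)) if match.group(3) else 0
            mappings.append((code_point, raw, precision))
    return mappings

mappings = parse_charmap(SAMPLE_UCM)
roundtrip_only = [m for m in mappings if m[2] == 0]
```

Filtering on the precision indicator, as in roundtrip_only, is one way to check which characters of a codepage are safe to convert in both directions.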

ICU's Conversion Technique

ICU uses algorithmic conversion, non-algorithmic conversion and combinations of both. With non-algorithmic conversion, a codepage is defined by tables that map codepage characters to Unicode. The format of these tables is called UCM Format.

ICU conversion is a two-step process:

The conversion table designated by the sender is used to convert from characters of the source codepage to Unicode.
The conversion table designated by the receiver is used in the reverse direction to convert from Unicode to characters of the target codepage.

ICU uses line-oriented text files to define non-algorithmic converters. For complex codepages, partially or fully algorithmic converters may be used, which cannot be defined as simple text files.
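The two-step process can be sketched in a few lines of Python. This is only an analogy: it uses Python's built-in codecs (cp1252 and cp500) as stand-ins for ICU converters, and the function name convert is hypothetical. It does not call ICU or any EntireX interface.

```python
def convert(payload: bytes, sender_codepage: str, receiver_codepage: str) -> bytes:
    """Convert a payload via Unicode, mirroring the two steps described above."""
    # Step 1: sender's table converts source codepage bytes to Unicode.
    text = payload.decode(sender_codepage)
    # Step 2: receiver's table converts Unicode to target codepage bytes.
    return text.encode(receiver_codepage)

# A sender using an ASCII-family codepage, a receiver using an EBCDIC codepage.
request = "Año 2024".encode("cp1252")
converted = convert(request, "cp1252", "cp500")
```

Because Unicode is the pivot, any sender codepage can be combined with any receiver codepage without a dedicated table for each pair.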

ICU Custom Converters

If the provided standard ICU converters (codepages) do not match your requirements, the ICU codepages can be extended by user-written ICU custom converters. This is done with the ICU tool makeconv delivered with EntireX. With makeconv, ICU converter files in UCM Format are compiled into a binary format with extension cnv. The binary format cnv depends on the endianness (big/little endian) and charset family (ASCII/EBCDIC) where makeconv is executed. See Building and Installing ICU Custom Converters under z/OS | UNIX | Windows.

Arabic Shaping

Introduction

Arabic shaping is part of ICU Conversion and is available between UTF-8, the Arabic ASCII codepage windows-1256 and the Arabic EBCDIC codepage IBM-420 for all of the communication models EntireX Broker offers:

ACI-based programming in its various language bindings (Java, C, Assembler, Natural, etc.)
RPC-based components and Reliable RPC, such as DCOM Wrapper, Java Wrapper, XML/SOAP Wrapper, Web Services Wrapper, COBOL Wrapper, PL/I Wrapper, .NET Wrapper etc.

Shaping is performed only on the codepages listed above. Arabic text data must be in logical order; visual order is not supported.

During character conversion, data length may increase or decrease for this type of codepage. The Rules for Data Length Changes due to Character Conversion apply.

Configuration

Configuration is the same as with ICU Conversion; see Configuration and Usage in section ICU Conversion.

Hebrew Codepage 803

Introduction

CP803 does not include Latin lowercase characters. For RPC-based components and Reliable RPC, error messages, ping replies etc. are converted to uppercase before conversion to CP803, so Hebrew Codepage 803 can be used with RPC. Latin lowercase characters cannot be used in application data of IDL types A and AV, or in the IDL program and IDL library.

For ACI-based programming there is no special behavior. Latin lowercase characters cannot be used.

Configuration

Configuration is the same as with ICU Conversion; see Configuration and Usage in section ICU Conversion.

EBCDIC Stateful Codepages

Introduction

These codepages are designed for use in Asian countries. They use an escape technique (SO/SI bytes) to switch between single-byte and double-byte characters. An example is CP930 (Japan).

For RPC-based components and Reliable RPC, we recommend RPC programmers observe the following, otherwise unpredictable results may occur:

IDL data types K, KV
- SO and SI escape characters may not be contained
- only double-byte characters allowed
- single-byte characters cannot be transferred
IDL data types A, AV
- single-byte and double-byte characters can be transferred using SO and SI escape characters

For ACI-based programming, single-byte and double-byte characters can be transferred using SO and SI escape characters.
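To make the escape technique more concrete, the following Python sketch splits a stateful EBCDIC byte string into single-byte and double-byte segments at the SO (0x0E) and SI (0x0F) bytes, and checks whether a byte string would satisfy the K/KV rules listed above. The function names and the sample bytes are hypothetical; the sketch only inspects the byte structure and does not convert any characters.

```python
SO = 0x0E  # shift out: double-byte mode begins
SI = 0x0F  # shift in: single-byte mode resumes

def split_segments(data: bytes):
    """Split stateful data into ("single" | "double", bytes) segments."""
    segments = []
    double = False
    start = 0
    i = 0
    while i < len(data):
        byte = data[i]
        if not double and byte == SO:
            if i > start:
                segments.append(("single", data[start:i]))
            double, start, i = True, i + 1, i + 1
        elif double and byte == SI:
            if i > start:
                segments.append(("double", data[start:i]))
            double, start, i = False, i + 1, i + 1
        else:
            i += 2 if double else 1  # double-byte characters occupy two bytes
    if start < len(data):
        segments.append(("double" if double else "single", data[start:]))
    return segments

def is_valid_k_field(data: bytes) -> bool:
    """K/KV data: no SO/SI bytes and only whole double-byte characters."""
    return SO not in data and SI not in data and len(data) % 2 == 0

# Two single-byte characters, two double-byte characters, one single-byte character.
mixed = b"\xC1\xC2" + bytes([SO]) + b"\x44\x81\x44\x82" + bytes([SI]) + b"\xC3"
segments = split_segments(mixed)
```

Data shaped like mixed is what an A or AV field may carry; only the inner double-byte segment, without its SO and SI bytes, would be acceptable in a K or KV field.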

During character conversion, data length may increase or decrease for this type of codepage. The Rules for Data Length Changes due to Character Conversion apply.

Configuration

Configuration is the same as with ICU Conversion; see Configuration and Usage in section ICU Conversion.

Multibyte or Double-byte Codepages

Examples are UTF8, CP950, Big5 (traditional Chinese).

Introduction
Configuration

Introduction

During character conversion, data length may increase or decrease for this type of codepage. The Rules for Data Length Changes due to Character Conversion apply.
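A short Python sketch shows how the byte length of the same text differs between codepages. It uses Python's built-in codecs as stand-ins for the ICU converters; the helper name byte_length and the sample strings are chosen for illustration only.

```python
def byte_length(text: str, codepage: str) -> int:
    """Number of bytes the text occupies in the given codepage."""
    return len(text.encode(codepage))

# Same text, different codepages: pure ASCII text keeps its length,
# accented Latin and Chinese text do not.
lengths = {
    ("EntireX", "cp1252"): byte_length("EntireX", "cp1252"),
    ("EntireX", "utf-8"): byte_length("EntireX", "utf-8"),
    ("Größe", "cp1252"): byte_length("Größe", "cp1252"),
    ("Größe", "utf-8"): byte_length("Größe", "utf-8"),
    ("中文", "big5"): byte_length("中文", "big5"),
    ("中文", "utf-8"): byte_length("中文", "utf-8"),
}
```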

Configuration

Configuration is the same as with ICU Conversion; see Configuration and Usage in section ICU Conversion.

Rules for Data Length Changes due to Character Conversion

Character conversion may force a decrease or increase in data lengths. Such data length changes occur for Multibyte or Double-byte Codepages (for example UTF8), EBCDIC Stateful Codepages, and if Arabic Shaping is in effect. Data length changes do not occur for single-byte codepages.

For RPC-based components and Reliable RPC, behavior depends on the IDL data type (see IDL Data Types in the IDL Editor documentation). RPC programmers should be aware of the following:

IDL data types AV, KV
- IDL field length may increase or decrease. The resulting length is set accordingly.
IDL data types A, K
- IDL field length cannot change.
- If the data length decreases, padding with SPACE occurs.
- If the data length increases, characters are cut off at character boundaries, which may in turn require padding with SPACE.
IDL data types AVn, KVn
- IDL field length has a maximum.
- If the data length decreases, the field length is adjusted accordingly.
- If the data length increases beyond the maximum length, characters are cut off at character boundaries. The resulting length is set accordingly.

For ACI-based programming, the complete payload may change its length in bytes during conversion.
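The rules for fixed-length fields (A, K) and fields with a maximum length (AVn, KVn) can be sketched as follows. This is a simplified Python model, not the broker's implementation: the function names are hypothetical, Python codecs stand in for ICU converters, and the sketch assumes a stateless target codepage in which SPACE occupies one byte.

```python
def fit_fixed(text: str, codepage: str, length: int) -> bytes:
    """Model of IDL types A, K: the field length cannot change."""
    out = bytearray()
    for char in text:
        encoded = char.encode(codepage)
        if len(out) + len(encoded) > length:
            break  # cut off at a character boundary, never inside a character
        out += encoded
    # Pad with SPACE up to the fixed length (assumes a 1-byte SPACE).
    out += " ".encode(codepage) * (length - len(out))
    return bytes(out)

def fit_max(text: str, codepage: str, max_length: int) -> bytes:
    """Model of IDL types AVn, KVn: cut off at the maximum, no padding."""
    out = bytearray()
    for char in text:
        encoded = char.encode(codepage)
        if len(out) + len(encoded) > max_length:
            break
        out += encoded
    return bytes(out)

# "Größe" needs 7 bytes in UTF-8 but only 5 in windows-1252.
grown = fit_fixed("Größe", "utf-8", 5)    # cut after "Grö", then padded
shrunk = fit_fixed("Größe", "cp1252", 7)  # fits, then padded
capped = fit_max("Größe", "utf-8", 5)     # cut after "Grö", length becomes 4
```

The grown case shows why a cut-off can still end in padding: the next character would need two bytes, but only one byte of the field is left.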

Codepage Requirements for RPC Data Stream Conversions

Codepages used to convert RPC data streams must meet several requirements:

Codepages used to convert RPC data streams must have the following code points (characters) defined:

Character	also known as	Rendered	Unicode Code Point
uppercase letters A-Z without special characters		A - Z	0x0041 to 0x005A
lowercase letters a-z without special characters		a - z	0x0061 to 0x007A
digits		0-9	0x0030 to 0x0039
SPACE		" "	0x0020
LEFT PARENTHESIS	OPENING PARENTHESIS	"("	0x0028
RIGHT PARENTHESIS	CLOSING PARENTHESIS	")"	0x0029
PLUS SIGN		"+"	0x002B
HYPHEN	MINUS	"-"	0x002D
SOLIDUS	SLASH	"/"	0x002F
COLON		":"	0x003A
COMMA		","	0x002C
FULL STOP	PERIOD	"."	0x002E
EQUALS SIGN		"="	0x003D

All code points (characters) listed in the table above must have a unique mapping (without any fallbacks and reverse fallbacks) to/from Unicode, that is, they must be roundtrip-compatible.
If the codepage used is a multibyte or double-byte codepage, the code points (characters) listed in the table above must have a length of 1 byte within the codepage. Therefore UTF-16 encoding cannot be used, but UTF-8 encoding is possible.

Codepages that do not obey the rules above cannot be used for RPC-based components, because those code points (characters) are used to encode, for example, the IDL library and IDL program, descriptive metadata, and IDL type fields in numeric, integer and binary form.
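A rough way to test a candidate codepage against these requirements is sketched below in Python. The function name meets_rpc_requirements is hypothetical and Python codecs stand in for ICU converters. The check covers presence, a 1-byte encoding and a successful round trip for each required character; it cannot see fallback or reverse fallback definitions, which would have to be checked in the converter's UCM source.

```python
import string

# The characters listed in the table above.
REQUIRED = string.ascii_uppercase + string.ascii_lowercase + string.digits + " ()+-/:,.="

def meets_rpc_requirements(codepage: str) -> bool:
    """True if every required character is one byte long and round-trips."""
    for char in REQUIRED:
        try:
            raw = char.encode(codepage)
        except UnicodeEncodeError:
            return False  # character is not defined in this codepage
        if len(raw) != 1:
            return False  # must occupy exactly one byte
        if raw.decode(codepage) != char:
            return False  # must map back to the same code point
    return True

results = {name: meets_rpc_requirements(name) for name in ("utf-8", "utf-16-le", "cp037", "cp1252")}
```

In this model UTF-8 passes and UTF-16 fails, in line with the rule stated above.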

Broker's Mechanism for Choosing the Character Conversion Approach

The automatic mechanism for choosing the character conversion approach applies to the following versions:

EntireX Broker 10.1 and above (z/OS, UNIX, Windows)
EntireX Broker 10.3 and above (BS2000)

For example, RPC components indicate to the broker that the data stream is RPC; the broker uses this information to choose the character conversion approach. In this way, incorrect configurations are detected and corrected.

Broker Attribute File Definition	RPC Data Stream Detected ⁽¹⁾	ACI Data Stream Detected ⁽²⁾	ACI or RPC Data Stream ⁽³⁾
`CONVERSION=user exit`	`user exit`	`user exit`	`user exit`
`TRANSLATION=user exit`	`user exit`	`user exit`	`user exit`
`CONVERSION=SAGTCHA`	`CONVERSION=SAGTRPC`	`CONVERSION=SAGTCHA`	`CONVERSION=SAGTCHA`
`CONVERSION=SAGTRPC`	`CONVERSION=SAGTRPC`	`CONVERSION=SAGTCHA`	`CONVERSION=SAGTRPC`
`CONVERSION=NO` or `TRANSLATION=NO`	`CONVERSION=SAGTRPC`	no character conversion	no character conversion
`TRANSLATION=SAGTCHA` ⁽⁴⁾	`CONVERSION=SAGTRPC`	`CONVERSION=SAGTCHA`	`CONVERSION=SAGTCHA`

Key

`BOLD`		Character conversion is determined by the broker automatically; the definition in the broker attribute file is ignored and a warning message (one of 00200781, 00200782, 00200786, 00200787, 00200788, 00200789, 00200790) is written to the broker log file. Adapt your broker attribute file to avoid the message.

Notes

⁽¹⁾ RPC data stream is detected automatically by the broker if the RPC server component has one of the following versions (or above):
- EntireX RPC Server (BS2000) 10.3
- EntireX RPC Server (z/OS CICS, z/OS Batch, z/OS IMS, Micro Focus, C, .NET) 9.10
- EntireX RPC Server (Java, CICS ECI, IMS Connect, XML/SOAP, RPC-ACI, IBM MQ) 9.9
- EntireX Adapter 9.9
- Natural RPC Server (Mainframe 8.2.7, Open Systems 8.4.1)
⁽²⁾ ACI data stream is detected automatically by the broker from EntireX Java ACI 9.12 or later.
⁽³⁾ If ACI communication is used from non-Java environments, or the EntireX RPC server or Natural RPC server is an earlier version than those listed under Note 1, the data stream can be ACI or RPC.
⁽⁴⁾ TRANSLATION=SAGTCHA is ignored. The broker uses CONVERSION.

Order of Precedence

Character conversion is chosen by the broker in the following order of precedence:

You can always write your own User Exits if this matches your requirements. This is the first choice if defined.
If the broker detects an RPC data stream (see Note 1 above), ICU conversion with broker attribute CONVERSION=SAGTRPC is used.
If neither broker attribute CONVERSION nor broker attribute TRANSLATION is defined (the attribute is omitted or set to NO), no character conversion occurs.
If the broker detects an ACI data stream (see Note 2 above), ICU conversion with broker attribute CONVERSION=SAGTCHA is used.
If the broker attribute CONVERSION=SAGTRPC is defined, ICU conversion approach SAGTRPC is used.
In all other cases, ICU conversion approach SAGTCHA is used.
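The order of precedence can be written down as a small decision function. The following Python sketch is a model of the list above, not broker code: the function name choose_conversion, the stream values "rpc", "aci" and "unknown", and the exit name MYEXIT are hypothetical. It treats any value other than NO, SAGTCHA or SAGTRPC for CONVERSION (and other than NO or SAGTCHA for TRANSLATION) as the name of a user exit, which is a modeling assumption.

```python
def choose_conversion(conversion="NO", translation="NO", stream="unknown"):
    """Return "user exit", "SAGTRPC", "SAGTCHA" or None (no conversion)."""
    conversion_is_exit = conversion not in ("NO", "SAGTCHA", "SAGTRPC")
    translation_is_exit = translation not in ("NO", "SAGTCHA")
    # 1. A user exit, if defined, is always the first choice.
    if conversion_is_exit or translation_is_exit:
        return "user exit"
    # 2. A detected RPC data stream forces SAGTRPC.
    if stream == "rpc":
        return "SAGTRPC"
    # 3. Neither attribute defined (omitted or NO): no character conversion.
    if conversion == "NO" and translation == "NO":
        return None
    # 4. A detected ACI data stream forces SAGTCHA.
    if stream == "aci":
        return "SAGTCHA"
    # 5. Undetected stream with CONVERSION=SAGTRPC keeps SAGTRPC.
    if conversion == "SAGTRPC":
        return "SAGTRPC"
    # 6. All other cases.
    return "SAGTCHA"

# Reproduce one row of the table: CONVERSION=SAGTRPC.
row = [choose_conversion(conversion="SAGTRPC", stream=s) for s in ("rpc", "aci", "unknown")]
```

Evaluating the function for each combination of attribute setting and detected data stream yields the same results as the table in this section.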

Translation User Exit

Introduction

With translation user exits, the code points of the codepage used are under your control. You can distinguish between ASCII, IBM EBCDIC and BS2000 EBCDIC environments (where the caller or participant is running). Code points can be adapted to meet your requirements. This requires programming a user-specific translation routine. See Writing Translation User Exits under z/OS | UNIX | Windows | BS2000 | z/VSE. The delivered model for the translation user exit supports single-byte codepages.

For RPC-based components and Reliable RPC, the codepage you implement must meet the Codepage Requirements for RPC Data Stream Conversions.

For ACI-based programming, you can make any structure of the data (mixture of text and binary data) within your payload known to the user exit by means of the ACI field ENVIRONMENT, which can be shared between your application and the translation user exit.

Configuration

Configuration is simple: only the attribute TRANSLATION in the broker attribute file has to be set to the name of your user exit. Nothing needs to be configured or considered for the EntireX component (sender or receiver).

We do not recommend using a translation user exit. If you only need to adapt code points, consider migration to ICU Conversion.

Migration to ICU Conversion

Introduction

If a Translation User Exit is used only to adapt code points, that is, to implement a standard ASCII/EBCDIC codepage, the same functionality can be achieved with ICU conversion. To do so, configure Broker's Locale String Defaults appropriately and set CONVERSION-OPTION-SUBSTITUTE to get the same error behavior as with translation. See OPTION Values for Conversion.

Configuration

Example: For an environment running in Spain using clients with the Windows 1252 codepage and servers on IBM mainframe with codepage 1145, set the following Codepage-specific Attributes:

DEFAULTS=CODEPAGE
            * Broker Locale String defaults 
            DEFAULT_ASCII=windows-1252
            DEFAULT_EBCDIC_IBM=ibm-1145

For ACI-based programming, set the service-specific attribute CONVERSION:

DEFAULTS=SERVICE
            . . . 
            CONVERSION=(SAGTCHA,OPTION=SUBSTITUTE)
            . . .

For RPC-based components and Reliable RPC, set the service-specific attribute CONVERSION:

DEFAULTS=SERVICE
            . . . 
            CONVERSION=(SAGTRPC,OPTION=SUBSTITUTE)
            . . .

For more examples see Configuring Broker's Locale String Defaults.

SAGTRPC User Exit

Introduction

With the SAGTRPC user exit you can implement your own conversion method for RPC-based components and Reliable RPC. It is intended for cases where, for any reason, a codepage is not supported by ICU Conversion and SAGTRPC conversion, and CONVERSION=SAGTRPC is configured in the broker attribute file. The SAGTRPC user exit cannot be used for ACI-based programming.

SAGTRPC user exit allows you to adapt codepages and their characters (code points) to meet your requirements. See Writing SAGTRPC User Exits under z/OS | UNIX | Windows. The codepage you implement must meet the Codepage Requirements for RPC Data Stream Conversions.

Configuration

The broker must be configured for the platform it is running on. See Configuring SAGTRPC User Exits under z/OS | UNIX | Windows.