Introduction to Internationalization

This document provides an introduction to the topic of internationalization with EntireX and describes the various approaches offered. It covers the following topics:

Overview
ICU Conversion
ICU Resources
Translation
Translation User Exit
Translation User Exit Replacement with ICU Conversion
SAGTRPC User Exit
Arabic Shaping

Overview

The translation and conversion of codepages is a symmetric process. Everything that is valid for the request (client to server) also applies to the reply (server to client), with the roles reversed. Therefore the terms sender and receiver are used instead of client and server in this section.

Internationalization with EntireX provides the following:

Codepage conversion is available for senders and receivers, so any participant is able to work with the desired codepage. A participant tells the broker the codepage they use to send and receive messages. This means the broker is able to perform a conversion from/to the desired characters (code points) within the codepages.
Codepage conversion deals with the user's payload data in broker's send and receive buffers. The broker ACI control block is handled differently and does not require special attention concerning internationalization. See Broker ACI Fields.
For the simpler approaches Translation and Translation User Exits, participants give the codepage to the broker implicitly - nothing needs to be configured for EntireX components (senders and receivers).
For the more accurate approaches of ICU Conversion and SAGTRPC User Exit, the codepage that an EntireX component (sender and receiver) uses is described by so-called locale strings (alias name of a codepage) sent along with the request to the broker. The locale string always requires some attention. Depending on the platform your EntireX component is running on, the locale string is sent automatically by default or must be provided.
- As long as you use your platform's default codepage and the locale string is provided automatically, nothing else needs to be considered.
- If the locale string is not provided automatically, providing one can be a programming issue or a configuration issue, depending on the EntireX component used.
- For information on how to provide locale strings, or whether locale strings are sent automatically, see table Preparing EntireX Components for Internationalization.
Codepage conversion is available for all of the communication models the broker offers, for example:
- ACI-based Programming in its various language bindings (Java, C, Assembler, Natural, etc.).
- RPC-based Components and Reliable RPC such as DCOM Wrapper, Java Wrapper, XML/SOAP Wrapper, Web Services Wrapper, COBOL Wrapper, PL/I Wrapper, .NET Wrapper, etc.
- Publish and Subscribe.

The following sections discuss all of the internationalization approaches offered by EntireX.

ICU Conversion

Introduction

ICU conversion is based on IBM's project International Components for Unicode. It is a mature, widely used set of C/C++ and Java libraries for Unicode support, software internationalization and globalization.

ICU comes with a set of ICU converters (codepages) based on codepages from ISO and software vendors such as Microsoft and IBM. It is a standardized approach, and it is possible to extend the set with ICU custom converters.

Using ICU Conversion

You can use ICU conversion in the following situations:

if you need multiple codepages per environment, for example more than one unique ASCII, IBM mainframe or Fujitsu mainframe codepage
if you need single-byte, double-byte or multibyte conversions
if you need standardized codepages and the ICU converters provided meet your requirements
if you need Arabic Shaping

If you require special codepages that are not delivered, you can install user-written ICU Custom Converters.

Requirements for ICU Conversion

For ICU conversion to function correctly, the following requirements must be met:

The broker must be configured for the platform it is running on. See Configuring ICU Conversion under z/OS | UNIX | Windows | BS2000/OSD | z/VSE. The service-specific or topic-specific broker attribute CONVERSION in the attribute file must be set:
- for ACI-based Programming, use SAGTCHA for any type of codepage that has single-byte, double-byte and multibyte encoding schemes. See Conversion Details.
- for RPC-based Components and Reliable RPC, use SAGTRPC for any type of codepage that has single-byte, double-byte and multibyte encoding schemes. See Conversion Details.
  
  We recommend always using SAGTRPC for RPC data streams. Conversion with Multibyte, Double-byte and other Complex Codepages will always be correct, and Conversion with Single-byte Codepages is also efficient because SAGTRPC detects single-byte codepages automatically. See Conversion Details.
EntireX components (sender and receiver) must send a locale string to the broker. Depending on the platform your EntireX component is running on, this is done automatically by default - nothing else needs to be configured as long as you use your platform's default codepage. If the locale string is not provided automatically, it can be set as a programming issue or a configuration issue, depending on the EntireX component used. See Preparing EntireX Components for Internationalization.
The locale string sent by EntireX components (sender and receiver), the encoding of the ACI payload or RPC stream, and the ICU converter (codepage) must always match, otherwise unpredictable results occur. Checking for the availability of the correct ICU converters (codepages) is mandatory.
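
The effect of a mismatch between the locale string and the actual payload encoding can be illustrated with a minimal Python sketch. It uses Python's built-in codecs as a stand-in for ICU converters, which is an assumption made purely for illustration; the names cp273 and latin-1 are Python codec names, not EntireX locale strings.

```python
# Stand-in sketch: Python codecs play the role of ICU converters here.
# The sender really encodes its payload in EBCDIC (IBM codepage 273).
payload = "ABC123".encode("cp273")

# Matching label: the bytes decode back to the original text.
correct = payload.decode("cp273")

# Mismatched label: the same bytes read as ISO 8859-1 give unrelated characters.
mislabeled = payload.decode("latin-1")

print(repr(correct))
print(repr(mislabeled))
```

No error is raised in the mismatched case; the data is silently wrong, which is why the locale string and the payload encoding have to agree.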

ICU's Conversion Technique

ICU uses algorithmic conversion, non-algorithmic conversion and combinations of both. With non-algorithmic conversion, tables are provided that contain a mapping of codepage characters to Unicode as a definition of a codepage. This format is also called UCM Format.

ICU conversion is a two-step process:

The conversion table designated by the sender is used to convert from characters of the source codepage to Unicode.
The conversion table designated by the receiver is used in the reverse direction to convert from Unicode to characters of the target codepage.

ICU uses line-oriented text files to define non-algorithmic converters. For complex codepages, partially or fully algorithmic converters may be used, which cannot be defined as simple text files.
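
The two-step process described above can be sketched in a few lines of Python. This is not the broker's implementation: Python's standard codecs (cp1252 and cp273) are assumed here as substitutes for the ICU converters designated by sender and receiver, and the function name convert is hypothetical.

```python
def convert(payload: bytes, sender_codepage: str, receiver_codepage: str) -> bytes:
    """Convert a payload between two codepages with Unicode as the pivot."""
    # Step 1: the sender's codepage definition maps the bytes to Unicode.
    text = payload.decode(sender_codepage)
    # Step 2: the receiver's codepage definition maps Unicode to target bytes.
    return text.encode(receiver_codepage)


# Request direction: ASCII-family sender, EBCDIC receiver.
request = "Grüße".encode("cp1252")
on_mainframe = convert(request, "cp1252", "cp273")

# Reply direction: the same function with the roles reversed.
reply = convert(on_mainframe, "cp273", "cp1252")
```

Because Unicode is the pivot, supporting a new codepage only requires one additional table rather than one table per codepage pair.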

ICU Resources

Please refer to "License Texts, Copyright Notices and Disclaimers of Third Party Products". This document is part of the product documentation available at http://documentation.softwareag.com.

This section covers the following topics:

ICU Homepage
ICU Converter Explorer
ICU Converter Resources
ICU Custom Converters
UCM Format

ICU Homepage

The ICU home page (http://www.icu-project.org/) is the main point of entry for information on International Components for Unicode (ICU).

ICU Converter Explorer

The ICU Converter Explorer available at http://demo.icu-project.org/icu-bin/convexp shows aliases and more information on ICU converters. An ICU converter is the codepage definition used by ICU. The ICU converter is defined by a so-called UCM format. If the location has changed since this documentation was published, perform an internet search for the ICU home page and follow the links to the ICU Converter Explorer.

The mapping of aliases to ICU converters is also provided as a text source within an EntireX installation. The location depends on the operating system:

UNIX: /opt/softwareag/EntireX/etc/convrtrs.txt
Windows: <drive>:\SoftwareAG\EntireX\etc\convrtrs.txt
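
To look up which ICU converter an alias resolves to, the text source can be scanned with a short script. The following Python sketch assumes the usual layout of the ICU alias table: the converter name comes first on a line, aliases follow, standard tags appear in braces, comments start with #, and indented lines continue the previous entry. The sample text is an illustrative excerpt written for this example, not copied from an installation, and parse_aliases is a hypothetical helper name.

```python
import re

# Illustrative excerpt in the style of convrtrs.txt (assumed layout).
SAMPLE = """\
# converter name first, then aliases; tags in braces name standards
ibm-5348_P100-1997 { UTR22* } ibm-5348 { IBM* }
    windows-1252 { IANA* JAVA* WINDOWS* } cp1252 { JAVA* }
ibm-273_P100-1995 { UTR22* } ibm-273 { IBM* } IBM273 { IANA* } cp273 { JAVA* }
"""


def parse_aliases(text: str) -> dict:
    """Return a mapping of lowercased alias -> converter name."""
    alias_to_converter = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0]              # drop comments
        line = re.sub(r"\{[^}]*\}", " ", line)   # drop standard tags
        names = line.split()
        if not names:
            continue
        if not raw[:1].isspace():                # unindented line starts a new entry
            current = names[0]
        if current is None:
            continue
        for name in names:
            alias_to_converter[name.lower()] = current
    return alias_to_converter


aliases = parse_aliases(SAMPLE)
print(aliases["windows-1252"])
```

To use this against a real installation, read the file from the location listed above and pass its contents to parse_aliases instead of SAMPLE.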

ICU Converter Resources

EntireX includes a standard set of the most commonly used ICU converters (codepages) in binary format packed into shared libraries.

ICU Custom Converters

If the provided standard ICU converters (codepages) do not match your requirements, the ICU codepages can be extended by user-written ICU custom converters. This is done with the ICU tool makeconv delivered with EntireX. With makeconv, ICU converter files in UCM Format are compiled into a binary format with extension cnv. The binary format cnv depends on the endianness (big/little endian) and charset family (ASCII/EBCDIC) where makeconv is executed. See Building and Installing ICU Custom Converters under UNIX | Windows.

UCM Format

The codepage definition text files for ICU are described in UCM format (extension ".ucm"). You can edit them with any text editor. The most important section is the mapping table between the CHARMAP and END CHARMAP lines. Each line contains a Unicode code point and the related codepage character byte sequence followed by an optional precision indicator. Four kinds of definitions are supported by the precision indicator:

0 - normal roundtrip mapping from a Unicode code point and back.
1 - fallback mappings are used during conversion from Unicode to the codepage, but not back again. This definition may be present if a character exists in Unicode but not in the codepage. This feature is useful for human-readable output where the missing character is mapped to a similar looking one.
2 - substitution mappings resulting in assignment of the alternative substitution sequence (subchar1 in UCM format) when a non-convertible character occurs, instead of assigning the default substitution sequence (subchar in UCM format).
3 - reverse fallback mappings are used during conversion from the codepage to Unicode, but not back again. This definition results in assigning the same Unicode code point for different codepage character byte sequences.

This brief explanation does not intend to describe the UCM file format fully. For further explanation of the UCM file format, see the ICU home page under ICU Resources above.
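
As an illustration of the mapping table layout, the following Python sketch reads the lines between CHARMAP and END CHARMAP and extracts the code point, the byte sequence and the precision indicator. The sample content is a made-up fragment, not a delivered converter; the function name parse_charmap is hypothetical, and treating a missing precision indicator as 0 is an assumption of this sketch.

```python
import re

# Made-up UCM fragment for illustration only.
SAMPLE_UCM = r"""
<code_set_name>   "example-custom"
<subchar>         \x3F
CHARMAP
<U0041> \x41 |0
<U2018> \x27 |1
<U3000> \x81\x40 |0
<U001A> \x7F |3
END CHARMAP
"""

# <Uhhhh>, one or more \xhh bytes, optional |n precision indicator.
MAPPING = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s*(?:\|([0-3]))?"
)


def parse_charmap(ucm_text: str) -> list:
    """Return (code_point, byte_sequence, precision) for each mapping line."""
    entries = []
    inside = False
    for line in ucm_text.splitlines():
        line = line.strip()
        if line == "CHARMAP":
            inside = True
        elif line == "END CHARMAP":
            inside = False
        elif inside:
            match = MAPPING.match(line)
            if match:
                code_point = int(match.group(1), 16)
                byte_seq = bytes.fromhex(match.group(2).replace("\\x", ""))
                precision = int(match.group(3) or 0)  # assumed default: roundtrip
                entries.append((code_point, byte_seq, precision))
    return entries


entries = parse_charmap(SAMPLE_UCM)
roundtrip_only = [e for e in entries if e[2] == 0]
print(len(entries), len(roundtrip_only))
```

Filtering on the precision indicator, as in roundtrip_only, is one way to see which characters survive a transfer in both directions.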

Translation

Introduction

Translation is the quick-start approach and requires little configuration: only the service-specific or topic-specific broker attribute TRANSLATION in the broker attribute file has to be set to the value SAGTCHA. Nothing needs to be configured or considered for the EntireX component (sender or receiver). Translation does not need locale strings. If translation is specified and an EntireX component sends a locale string, the locale string will be ignored.

Translation has limitations on the number of environments supported and the number of different codepages for the environment in which your EntireX components (sender or receiver) are running:

all ASCII environments (UNIX, Windows, etc.) must use the same ASCII codepage
all IBM mainframes must use the same EBCDIC codepage
all Fujitsu mainframes must use the same EBCDIC codepage

Translation Codepages

Translation has further limitations on the code points used within the codepages provided. The translation routine SAGTCHA is loosely based on the following platform-dependent codepages:

Environment	Indicator sent from EntireX Component to Broker	Based on Codepage	Description
All ASCII environments (UNIX, Windows etc.)	x'80'	Microsoft Windows codepage 1252	Translation of characters for ASCII environments is loosely based on Windows codepage 1252. Not all of the characters of Windows codepage 1252 are supported by translation. All of the characters supported have the same code point in codepage ISO 8859-1, thus this is also suitable for UNIX.
IBM mainframe	x'22'	IBM codepage 273	Translation of characters for the IBM mainframe platform is loosely based on IBM codepage 273. Not all of the characters of the IBM codepage 273 are supported by translation.
Fujitsu mainframe	x'42'	EDF 03 national version for Germany	Translation of characters for the Fujitsu mainframe platform is loosely based on the EDF 03 codepage for Germany.

Characters (code points) supported by SAGTCHA are the same as in the Translation user exit example. See Writing Translation User Exits under z/OS | UNIX | Windows | BS2000/OSD | z/VSE.

Using Translation

You can use translation in the following situations:

if you have a mixed environment consisting of ASCII, IBM mainframes and/or Fujitsu mainframe platforms but
- all of your ASCII Environments use the same codepage
- all of your IBM mainframes use the same codepage
- all of your Fujitsu mainframes use the same codepage
if single-byte codepages meet your requirements.
if the code points within the delivered codepages meet your requirements. Note that not all code points implemented by SAGTCHA are roundtrip-compatible, even if your environment uses Microsoft Windows codepage 1252, IBM codepage 273 and EDF 03 national version for Germany. Roundtrip incompatibility means that a character transferred from an ASCII platform to IBM EBCDIC or Fujitsu EBCDIC and back again arrives as a different character. Important code points (characters) are roundtrip-compatible, including uppercase letters A-Z, lowercase letters a-z, digits 0-9 and the characters listed in Codepage Requirements for RPC Data Stream Conversions, among others.
for RPC-based Components, Reliable RPC as well as ACI-based Programming and other communication models.

See Configuring Translation under z/OS | UNIX | Windows | BS2000/OSD | z/VSE.
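
Roundtrip compatibility of a single-byte translation can be checked mechanically. The Python sketch below uses two hypothetical 256-entry tables (they are not the SAGTCHA tables) and reports every byte value that does not come back unchanged after being translated forward and back again.

```python
from typing import List


def roundtrip_failures(forward: bytes, backward: bytes) -> List[int]:
    """Return all byte values b for which backward[forward[b]] != b."""
    return [b for b in range(256) if backward[forward[b]] != b]


# Hypothetical tables for illustration: identity everywhere except where noted.
forward = bytearray(range(256))
backward = bytearray(range(256))

# 'A' (x'41') and x'C1' are swapped in both directions: a roundtrip-compatible pair.
forward[0x41], forward[0xC1] = 0xC1, 0x41
backward[0xC1], backward[0x41] = 0x41, 0xC1

# An unsupported character collapses onto x'3F' and cannot be restored.
forward[0x80] = 0x3F

failures = roundtrip_failures(bytes(forward), bytes(backward))
print([hex(b) for b in failures])
```

Running such a check over the characters your applications actually exchange shows whether the delivered code points are sufficient or whether one of the other approaches is needed.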

Translation User Exit

Introduction

With translation user exits, the code points of the codepage used are under your control. You can adapt them to meet your requirements. This requires programming a user-specific translation routine. See Writing Translation User Exits under z/OS | UNIX | Windows | BS2000/OSD | z/VSE. The delivered model for the translation user exit supports single-byte codepages only, but in principle any type of codepage can be implemented.

With translation user exits, you can make any structure of the data (mixture of text and binary data) within your payload known to the user exit by means of the ACI field ENVIRONMENT, which can be shared between your application and the translation user exit. For more information, see Using the ENVIRONMENT Field with the Translation User Exit for client and server | publish and subscribe.
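
The idea of translating only the text portions of a mixed payload can be sketched as follows. This Python example does not use the EntireX user exit interface: the layout descriptor (a list of field lengths with a text/binary flag) is a hypothetical stand-in for whatever structure information your application shares through the ENVIRONMENT field, and the translation table is derived from Python's cp1252 and cp273 codecs as an assumption.

```python
from typing import List, Tuple


def build_table(source: str, target: str) -> bytes:
    """Build a 256-entry single-byte translation table via Unicode."""
    substitute = "?".encode(target)[0]
    table = bytearray(256)
    for value in range(256):
        try:
            table[value] = bytes([value]).decode(source).encode(target)[0]
        except UnicodeError:
            # No counterpart in the other codepage: use the substitute byte.
            table[value] = substitute
    return bytes(table)


def translate_payload(payload: bytes,
                      layout: List[Tuple[int, bool]],
                      table: bytes) -> bytes:
    """Translate text fields only; binary fields are copied unchanged."""
    out = bytearray()
    offset = 0
    for length, is_text in layout:
        field = payload[offset:offset + length]
        out += field.translate(table) if is_text else field
        offset += length
    out += payload[offset:]  # bytes not described by the layout pass through
    return bytes(out)


ascii_to_ebcdic = build_table("cp1252", "cp273")

# Hypothetical payload: 2 text bytes, 2 binary bytes, 2 text bytes.
payload = b"AB" + b"\x00\x01" + b"12"
layout = [(2, True), (2, False), (2, True)]
translated = translate_payload(payload, layout, ascii_to_ebcdic)
print(translated.hex())
```

Without the layout information, the binary bytes in the middle would be translated as if they were characters and arrive corrupted.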

Configuration is simple: only the service-specific or topic-specific broker attribute TRANSLATION in the broker attribute file has to be set to the name of your user exit. Nothing needs to be configured or considered for the EntireX component (sender or receiver). Translation does not need locale strings. If a translation user exit is specified and an EntireX component sends a locale string, the locale string will be ignored.

The limitations on the number of environments and different codepages per environment remain the same as for translation.

Using Translation User Exit

You can use a translation user exit in the following situations:

if you want to invent your own conversion package
if you have to consider any payload data structure for translation
if it is necessary to share data between the application (client/server or publisher/subscriber) and the translation routine by using the broker ACI field ENVIRONMENT
if you have a mixed environment consisting of ASCII, IBM mainframes and/or Fujitsu mainframe platforms, but
- all of your ASCII environments use the same codepage
- all of your IBM mainframes use the same codepage
- all of your Fujitsu mainframes use the same codepage
for ACI-based Programming, if single-byte codepages meet your requirements. Otherwise you will have to invent a model for other types of codepages such as multibyte, double-byte and EBCDIC stateful - this can become very complicated and involve considerable effort.
for RPC-based Components, only if single-byte codepages meet your requirements. The codepage you implement must meet the Codepage Requirements for RPC Data Stream Conversions.

Translation User Exit Replacement with ICU Conversion

If a Translation User Exit is used only to adapt code points, that is, to implement a standard ASCII/EBCDIC codepage, the same functionality can be achieved with ICU conversion. To do this, configure Broker's Locale String Defaults appropriately and set the service-specific or topic-specific broker attribute CONVERSION with OPTION=SUBSTITUTE to get the same error behavior as translation. See OPTION Values for Conversion.

Example

For an environment running in Spain using clients with the Windows 1252 codepage and servers on IBM mainframe with codepage 1145, set the following Codepage-specific Attributes:

DEFAULTS=CODEPAGE
            /* Broker Locale String defaults */
            DEFAULT_ASCII=windows-1252
            DEFAULT_EBCDIC_IBM=ibm-1145

For ACI-based Programming, set the service-specific or topic-specific broker attribute CONVERSION:

DEFAULTS=SERVICE
            . . . 
            CONVERSION=(SAGTCHA,OPTION=SUBSTITUTE)
            . . .

For RPC-based Components and Reliable RPC, set the service-specific or topic-specific broker attribute CONVERSION:

DEFAULTS=SERVICE
            . . . 
            CONVERSION=(SAGTRPC,OPTION=SUBSTITUTE)
            . . .

For more examples see Configuring Broker's Locale String Defaults.

SAGTRPC User Exit

Introduction

With the SAGTRPC user exit you can invent your own conversion package/method for RPC-based Components and Reliable RPC if for any reason a codepage is not supported by ICU Conversion and SAGTRPC conversion. SAGTRPC user exit cannot be used for ACI-based Programming.

SAGTRPC user exit allows you to adapt codepages and their characters (code points) to meet your requirements. This requires some effort in programming a SAGTRPC user exit. See Writing SAGTRPC User Exits under z/OS | UNIX | Windows. The delivered model for the SAGTRPC user exit supports single-byte codepages only, but in principle any type of codepage can be implemented.

Using SAGTRPC User Exit

You can use SAGTRPC user exit in the following situations:

if you want to invent your own conversion package for RPC-based data stream conversion
if you require different types of conversions depending on individual IDL field types
for RPC-based Components and Reliable RPC only; it cannot be used for ACI-based Programming
if the codepages you implement meet the Codepage Requirements for RPC Data Stream Conversions

Requirements for SAGTRPC User Exit

For SAGTRPC user exit to function correctly, the following requirements must be met:

The broker must be configured for the platform it is running on. See Configuring SAGTRPC User Exits under z/OS | UNIX | Windows. The broker attribute CONVERSION in the broker attribute file must be set to the name of your routine.
Locale strings may be provided. It depends on your implementation of the SAGTRPC user exit whether the components (sender and receiver) have to send a locale string to the broker or not. See Preparing EntireX Components for Internationalization.
The handling of the different IDL type fields depends on the implementation of the SAGTRPC user exit, which is the customer's responsibility. See Writing SAGTRPC User Exits under z/OS | UNIX | Windows.

Arabic Shaping

Arabic shaping is part of ICU Conversion and is available between UTF-8, the Arabic ASCII codepage windows-1256 and the Arabic EBCDIC codepage IBM-420 for all of the communication models EntireX Broker offers, for example:

ACI-based Programming in its various language bindings (Java, C, Assembler, Natural, etc.)
RPC-based Components and Reliable RPC, such as DCOM Wrapper, Java Wrapper, XML/SOAP Wrapper, Web Services Wrapper, COBOL Wrapper, PL/I Wrapper, .NET Wrapper etc.
Publish and Subscribe.

Shaping is performed only on the codepages listed above. Arabic text data must be in logical order; visual order is not supported. See also Conversion with Multibyte, Double-byte and other Complex Codepages.