This document provides an introduction to internationalization with EntireX and describes the various approaches offered.
See also What is the Best Internationalization Approach to use?
The translation and conversion of codepages is a symmetric process. Everything that is valid for the request (client to server) relates also to the reply (server to client), with opposite roles. Therefore the terms sender and receiver are used instead of client and server in this section.
Internationalization with EntireX provides the following:
Codepage conversion is available for senders and receivers, so any participant is able to work with the desired codepage. A participant tells the broker the codepage they use to send and receive messages. This means the broker is able to perform a conversion from/to the desired characters (code points) within the codepages.
Codepage conversion deals with the user's payload data in broker's send and receive buffers. The broker ACI control block is handled differently and does not require special attention concerning internationalization. See Broker ACI Fields.
For the simpler approaches Translation and Translation User Exits, participants give the codepage to the broker implicitly - nothing needs to be configured for EntireX components (senders and receivers).
For the more accurate approaches of ICU Conversion and SAGTRPC User Exit, the codepage that an EntireX component (sender and receiver) uses is described by so-called locale strings (alias name of a codepage) sent along with the request to the broker. The locale string always requires some attention. Depending on the platform your EntireX component is running on, the locale string is sent automatically by default or must be provided.
As long as you use your platform's default codepage and the locale string is provided automatically, nothing else needs to be considered.
If the locale string is not provided automatically, providing one can be a programming issue or a configuration issue, depending on the EntireX component used.
For information on how to provide locale strings, or whether locale strings are sent automatically, see table Preparing EntireX Components for Internationalization.
Codepage conversion is available for all of the communication models the broker offers, for example:
ACI-based Programming in its various language bindings (Java, C, Assembler, Natural, etc.).
RPC-based Components and Reliable RPC such as DCOM Wrapper, Java Wrapper, XML/SOAP Wrapper, Web Services Wrapper, COBOL Wrapper, PL/I Wrapper, .NET Wrapper, etc.
The following sections discuss all of the internationalization approaches offered by EntireX.
ICU conversion is based on IBM's project International Components for Unicode. It is a mature, widely used set of C/C++ and Java libraries for Unicode support, software internationalization and globalization.
ICU comes with a set of ICU converters (codepages) based on codepages from ISO and software vendors such as Microsoft and IBM. It is a standardized approach, and it is possible to extend the set with ICU custom converters.
You can use ICU conversion in the following situations:
if you need multiple codepages per environment, for example more than one unique ASCII, IBM mainframe or Fujitsu mainframe codepage
if you need single-byte, double-byte or multibyte conversions
if you need standardized codepages and the ICU converters provided meet your requirements
if you need Arabic Shaping
If you require special codepages that are not delivered, you can install user-written ICU Custom Converters.
For ICU conversion to function correctly, the following requirements must be met:
The broker must be configured for the platform it is running on. See Configuring ICU Conversion under z/OS | UNIX | Windows | BS2000 | z/VSE.
The service-specific attribute CONVERSION in the attribute file must be set:
for ACI-based Programming, use SAGTCHA for any type of codepage that has single-byte, double-byte and multibyte encoding schemes. See Conversion Details.
for RPC-based Components and Reliable RPC, use SAGTRPC for any type of codepage that has single-byte, double-byte and multibyte encoding schemes. See Conversion Details.
We recommend always using SAGTRPC for RPC data streams. Conversion with Multibyte, Double-byte and other Complex Codepages will always be correct, and Conversion with Single-byte Codepages is also efficient because SAGTRPC detects single-byte codepages automatically. See Conversion Details.
EntireX components (sender and receiver) must send a locale string to the broker. Depending on the platform your EntireX component is running on, this is done automatically by default - nothing else needs to be configured as long as you use your platform's default codepage. If the locale string is not provided automatically, it can be set as a programming issue or a configuration issue, depending on the EntireX component used. See Preparing EntireX Components for Internationalization.
The locale string sent by EntireX components (sender and receiver), the encoding of the ACI payload or RPC stream, and the ICU converter (codepage) must always match, otherwise unpredictable results occur. Checking for the availability of the correct ICU converters (codepages) is mandatory.
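Why the declared codepage and the actual encoding of the payload must match can be seen in a minimal Python sketch. It uses Python's built-in codecs `cp1252` and `cp273` as stand-ins for ICU converters, which is an assumption made for illustration only; the function name `decode_as_declared` is hypothetical and not part of EntireX.

```python
def decode_as_declared(payload: bytes, declared_codepage: str) -> str:
    """Interpret the payload with the codepage the sender declared
    (a stand-in for the locale string sent to the broker)."""
    # errors="replace" keeps the sketch from raising on undefined bytes.
    return payload.decode(declared_codepage, errors="replace")


# The sender really encodes its text in Windows codepage 1252.
payload = "Müller".encode("cp1252")

# Declared codepage matches the real encoding: the text is recovered.
matching = decode_as_declared(payload, "cp1252")

# Declared codepage (EBCDIC 273) does not match the real encoding:
# the call still returns a string, but its characters are wrong.
mismatched = decode_as_declared(payload, "cp273")

print(matching == "Müller")
print(mismatched == "Müller")
```

The mismatch is not reported as an error, which is why the available converters should be checked before a component goes into production.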
ICU uses algorithmic conversion, non-algorithmic conversion and combinations of both. With non-algorithmic conversion, tables are provided that contain a mapping of codepage characters to Unicode as a definition of a codepage. This format is also called UCM Format.
ICU conversion is a two-step process:
The conversion table designated by the sender is used to convert from characters of the source codepage to Unicode.
The conversion table designated by the receiver in the reverse direction is used to convert from Unicode to characters of the target codepage.
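The two steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the broker's implementation: Python's built-in codecs `cp1252` and `cp273` stand in for the ICU converters designated by sender and receiver, and the function name `convert` is hypothetical.

```python
def convert(payload: bytes, sender_codepage: str, receiver_codepage: str) -> bytes:
    """Convert a payload from the sender's codepage to the receiver's codepage."""
    # Step 1: the sender's table maps codepage bytes to Unicode.
    text = payload.decode(sender_codepage)
    # Step 2: the receiver's table, used in the reverse direction,
    # maps Unicode to bytes of the target codepage.
    # errors="replace" substitutes characters the target cannot represent.
    return text.encode(receiver_codepage, errors="replace")


# A Windows (ASCII family) sender and an IBM mainframe (EBCDIC family) receiver.
request = "Grüße".encode("cp1252")
converted = convert(request, "cp1252", "cp273")

# The reply travels the same path with the roles swapped.
reply = convert(converted, "cp273", "cp1252")
print(reply.decode("cp1252"))
```

Because Unicode is the pivot, each participant only needs a table for its own codepage, not one table per pair of codepages.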
ICU uses line-oriented text files to define non-algorithmic converters. For complex codepages, partially or fully algorithmic converters may be used, which cannot be defined as simple text files.
Please refer to "License Texts, Copyright Notices and Disclaimers of Third Party Products". This document is part of the product documentation available at http://documentation.softwareag.com.
The ICU home page (http://www.icu-project.org/) is the main point of entry for information on International Components for Unicode (ICU).
The ICU Converter Explorer available at http://demo.icu-project.org/icu-bin/convexp shows aliases and more information on ICU converters. An ICU converter is the codepage definition used by ICU and is written in the so-called UCM format. If the location has changed since this documentation was published, perform an internet search for the ICU home page and follow the links to the ICU Converter Explorer.
The mapping of aliases to ICU converters is also provided as a text source within an EntireX installation. The location depends on the operating system:
UNIX: /<Install_Dir>/EntireX/etc/convrtrs.txt
Windows: <drive>:\SoftwareAG\EntireX\etc\convrtrs.txt
EntireX includes a standard set of the most commonly used ICU converters (codepages) in binary format packed into shared libraries.
If the provided standard ICU converters (codepages) do not match your requirements, the ICU codepages can be extended by user-written ICU custom converters. This is done with the ICU tool makeconv delivered with EntireX. With makeconv, ICU converter files in UCM Format are compiled into a binary format with extension cnv. The binary format cnv depends on the endianness (big/little endian) and charset family (ASCII/EBCDIC) where makeconv is executed.
See Building and Installing ICU Custom Converters under z/OS | UNIX | Windows.
The codepage definition text files for ICU are described in UCM format (extension ".ucm"). You can edit them with any text editor. The most important section is the mapping table between the CHARMAP and END CHARMAP lines. Each line contains a Unicode code point and the related codepage character byte sequence followed by an optional precision indicator. Four kinds of definitions are supported by the precision indicator:
0 - normal roundtrip mapping from a Unicode code point and back.
1 - fallback mappings are used during conversion from Unicode to the codepage, but not back again. This definition may be present if a character exists in Unicode but not in the codepage. This feature is useful for human-readable output where the missing character is mapped to a similar looking one.
2 - substitution mappings resulting in assignment of the alternative substitution sequence (subchar1 in UCM format) when a non-convertible character occurs, instead of assigning the default substitution sequence (subchar in UCM format).
3 - reverse fallback mappings are used during conversion from the codepage to Unicode, but not back again. This definition results in assigning the same Unicode code point for different codepage character byte sequences.
This brief explanation does not intend to describe the UCM file format fully. For further explanation of the UCM file format, see the ICU home page under ICU Resources above.
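How the precision indicators affect the two conversion directions can be illustrated with a small Python parser for the mapping table. This is a simplified sketch, not the logic of makeconv: the sample table and the names `parse_charmap` and `SAMPLE_UCM` are hypothetical, a missing indicator is assumed to mean a roundtrip mapping, and indicator 2 is skipped because it selects a substitution sequence rather than a character mapping.

```python
import re

# One mapping line: <UXXXX> followed by one or more \xNN bytes and an optional |n.
MAPPING = re.compile(r"<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)(?:\s+\|(\d))?")

# Hypothetical excerpt for illustration; not a converter delivered with EntireX.
SAMPLE_UCM = r"""
CHARMAP
<U0027> \x27 |0
<U2018> \x27 |1
<U007C> \x7C |0
<U007C> \xA6 |3
<U00E4> \xE4
END CHARMAP
"""


def parse_charmap(ucm_text):
    """Return (to_unicode, from_unicode) dictionaries built from the CHARMAP section."""
    to_unicode, from_unicode = {}, {}
    in_charmap = False
    for raw in ucm_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line == "CHARMAP":
            in_charmap = True
            continue
        if line == "END CHARMAP":
            break
        match = MAPPING.match(line) if in_charmap else None
        if not match:
            continue
        char = chr(int(match.group(1), 16))
        seq = bytes(int(h, 16) for h in re.findall(r"\\x([0-9A-Fa-f]{2})", match.group(2)))
        precision = int(match.group(3) or 0)  # assumption: no indicator means roundtrip
        if precision in (0, 1):  # roundtrip or fallback: Unicode -> codepage
            from_unicode[char] = seq
        if precision in (0, 3):  # roundtrip or reverse fallback: codepage -> Unicode
            to_unicode[seq] = char
    return to_unicode, from_unicode


to_unicode, from_unicode = parse_charmap(SAMPLE_UCM)
print(from_unicode["\u2018"], to_unicode[b"\xa6"])
```

In this sample, the left single quotation mark U+2018 can be written to the codepage as an apostrophe but is never produced when reading, and the byte `\xA6` is read as a vertical bar but never written.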
Translation is the quick-start approach with little configuration required: only TRANSLATION in the broker attribute file has to be set to the value SAGTCHA. Nothing needs to be configured or considered for the EntireX component (sender or receiver). Translation does not need locale strings. If translation is specified and an EntireX component sends a locale string, the locale string will be ignored.
Translation has limitations on the number of environments supported and the number of different codepages for the environment in which your EntireX components (sender or receiver) are running:
all ASCII environments (UNIX, Windows, etc.) must use the same ASCII codepage
all IBM mainframes must use the same EBCDIC codepage
all Fujitsu mainframes must use the same EBCDIC codepage
Translation has further limitations on the code points used within the codepages provided. The translation routine SAGTCHA is loosely based on the following platform-dependent codepages:
Environment | Indicator sent from EntireX Component to Broker | Based on Codepage | Description |
---|---|---|---|
All ASCII environments (UNIX, Windows etc.) | x'80' | Microsoft Windows codepage 1252 | Translation of characters for ASCII environments is loosely based on Windows codepage 1252. Not all of the characters of Windows codepage 1252 are supported by translation. All of the characters supported have the same code point in codepage ISO 8859-1, thus this is also suitable for UNIX. |
IBM mainframe | x'22' | IBM codepage 273 | Translation of characters for the IBM mainframe platform is loosely based on IBM codepage 273. Not all of the characters of the IBM codepage 273 are supported by translation. |
Fujitsu mainframe | x'42' | EDF 03 national version for Germany | Translation of characters is loosely based on the EDF03 codepage for Germany. |
Characters (code points) supported by SAGTCHA are the same as in the Translation user exit example. See Writing Translation User Exits under z/OS | UNIX | Windows | BS2000 | z/VSE.
You can use translation in the following situations:
if you have a mixed environment consisting of ASCII, IBM mainframes and/or Fujitsu mainframe platforms but
all of your ASCII Environments use the same codepage
all of your IBM mainframes use the same codepage
all of your Fujitsu mainframes use the same codepage
if single-byte codepages meet your requirements.
if the code points within the delivered codepages meet your requirements. Please note that not all code points implemented by SAGTCHA are roundtrip-compatible, even if your environment uses the Microsoft Windows codepage 1252, IBM codepage 273 and EDF 03 national version for Germany. Roundtrip incompatibility means that if you transfer a character from an ASCII platform to IBM EBCDIC or Fujitsu EBCDIC and back again, you get a different character. Important code points (characters) such as uppercase letters A-Z, lowercase letters a-z, digits 0-9, the characters listed in Codepage Requirements for RPC Data Stream Conversions, and others are roundtrip-compatible.
for RPC-based Components, Reliable RPC as well as ACI-based Programming and other communication models.
See Configuring Translation under z/OS | UNIX | Windows | BS2000 | z/VSE.
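Roundtrip incompatibility as described above can be checked mechanically once the two translation tables are known. The following Python sketch shows the idea; the table excerpts are hypothetical values chosen for illustration and are not the actual SAGTCHA tables, and the name `roundtrip_failures` is likewise an assumption.

```python
from typing import Dict, Iterable, List


def roundtrip_failures(forward: Dict[int, int],
                       backward: Dict[int, int],
                       code_points: Iterable[int]) -> List[int]:
    """Return the code points that do not survive forward and backward translation."""
    failures = []
    for code_point in code_points:
        translated = forward.get(code_point)
        restored = backward.get(translated) if translated is not None else None
        if restored != code_point:
            failures.append(code_point)
    return failures


# Hypothetical excerpts, NOT the real SAGTCHA tables: two ASCII code points
# share one EBCDIC code point, so only one of them can be restored.
ASCII_TO_EBCDIC = {0x41: 0xC1, 0x61: 0x81, 0xA6: 0x6A, 0x7C: 0x6A}
EBCDIC_TO_ASCII = {0xC1: 0x41, 0x81: 0x61, 0x6A: 0x7C}

failures = roundtrip_failures(ASCII_TO_EBCDIC, EBCDIC_TO_ASCII, ASCII_TO_EBCDIC)
print([hex(code_point) for code_point in failures])
```

A check of this kind, run against the characters your application actually exchanges, indicates whether translation is sufficient or whether ICU conversion is the better choice.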
With translation user exits, the code points of the codepage used are under your control. You can adapt them to meet your requirements. This requires programming a user-specific translation routine. See Writing Translation User Exits under z/OS | UNIX | Windows | BS2000 | z/VSE. The delivered model for the translation user exit supports single-byte codepages only, but in principle any type of codepage can be implemented.
With translation user exits, you can make any structure of the data (mixture of text and binary data) within your payload known to the user exit by means of the ACI field ENVIRONMENT, which can be shared between your application and the translation user exit. For more information, see Using the ENVIRONMENT Field with the Translation User Exit.
Configuration effort is low: only TRANSLATION in the broker attribute file has to be set to the name of your user exit. Nothing needs to be configured or considered for the EntireX component (sender or receiver). A translation user exit does not need locale strings. If a translation user exit is specified and an EntireX component sends a locale string, the locale string will be ignored.
The limitations on the number of environments and different codepages per environment remain the same as for translation.
You can use a translation user exit in the following situations:
if you want to invent your own conversion package
if you have to consider any payload data structure for translation
if it is necessary to share data between the client/server application and the translation routine by using the broker ACI field ENVIRONMENT
if you have a mixed environment consisting of ASCII, IBM mainframes and/or Fujitsu mainframe platforms, but
all of your ASCII environments use the same codepage
all of your IBM mainframes use the same codepage
all of your Fujitsu mainframes use the same codepage
for ACI-based Programming, if single-byte codepages meet your requirements. Otherwise you will have to invent a model for other types of codepages such as multibyte, double-byte and EBCDIC stateful - this can become very complicated and involve considerable effort.
for RPC-based Components, only if single-byte codepages meet your requirements. The codepage you implement must meet the Codepage Requirements for RPC Data Stream Conversions.
If a Translation User Exit is used to adapt code points only, that is, to implement a standard ASCII/EBCDIC codepage, the same functionality can be achieved with ICU conversion, simply by configuring Broker's Locale String Defaults appropriately and setting CONVERSION-OPTION-SUBSTITUTE for the same error behavior as translation. See OPTION Values for Conversion.
For an environment running in Spain using clients with the Windows 1252 codepage and servers on IBM mainframe with codepage 1145, set the following Codepage-specific Attributes:
    DEFAULTS=CODEPAGE
    * Broker Locale String defaults
    DEFAULT_ASCII=windows-1252
    DEFAULT_EBCDIC_IBM=ibm-1145
For ACI-based Programming, set the service-specific attribute CONVERSION:
    DEFAULTS=SERVICE
    . . .
    CONVERSION=(SAGTCHA,OPTION=SUBSTITUTE)
    . . .
For RPC-based Components and Reliable RPC, set the service-specific attribute CONVERSION:
    DEFAULTS=SERVICE
    . . .
    CONVERSION=(SAGTRPC,OPTION=SUBSTITUTE)
    . . .
For more examples see Configuring Broker's Locale String Defaults.
With the SAGTRPC user exit you can invent your own conversion package/method for RPC-based Components and Reliable RPC if for any reason a codepage is not supported by ICU Conversion and SAGTRPC conversion. SAGTRPC user exit cannot be used for ACI-based Programming.
SAGTRPC user exit allows you to adapt codepages and their characters (code points) to meet your requirements. This requires some effort in programming a SAGTRPC user exit. See Writing SAGTRPC User Exits under z/OS | UNIX | Windows. The delivered model for the SAGTRPC user exit supports single-byte codepages only, but in principle any type of codepage can be implemented.
You can use SAGTRPC user exit in the following situations:
if you want to invent your own conversion package for RPC-based data stream conversion
if you require different types of conversions depending on individual IDL field types
for RPC-based Components and Reliable RPC only; it cannot be used for ACI-based programming
if the codepages you implement meet the Codepage Requirements for RPC Data Stream Conversions
For SAGTRPC user exit to function correctly, the following requirements must be met:
The broker must be configured for the platform it is running on. See Configuring SAGTRPC User Exits under z/OS | UNIX | Windows.
The service-specific attribute CONVERSION in the broker attribute file must be set to the name of your routine.
Locale strings may be provided. It depends on your implementation of the SAGTRPC user exit whether the components (sender and receiver) have to send a locale string to the broker or not. See Preparing EntireX Components for Internationalization.
The handling of the different IDL type fields depends on the implementation of the SAGTRPC user exit, which is the customer's responsibility. See Writing SAGTRPC User Exits under z/OS | UNIX | Windows.
Arabic shaping is part of ICU Conversion and is available between UTF-8, the Arabic ASCII codepage windows-1256 and the Arabic EBCDIC codepage IBM-420 for all of the communication models EntireX Broker offers, for example:
ACI-based Programming in its various language bindings (Java, C, Assembler, Natural, etc.)
RPC-based Components and Reliable RPC, such as DCOM Wrapper, Java Wrapper, XML/SOAP Wrapper, Web Services Wrapper, COBOL Wrapper, PL/I Wrapper, .NET Wrapper etc.
Shaping is performed only on the codepages listed above. Arabic text data must be in logical order; visual order is not supported. See also Conversion with Multibyte, Double-byte and other Complex Codepages.