This document provides an introduction to character conversion with EntireX.
Character conversion is a symmetric process. Everything that is valid for the request (client to server) relates also to the reply (server to client), with opposite roles. Therefore the terms sender and receiver are used instead of client and server in this section. Character conversion with EntireX provides the following:
Character conversion is available for senders and receivers, so any participant is able to work with the desired codepage. A participant tells the broker the codepage they use to send and receive messages. This means the broker is able to perform a conversion from/to the desired characters (code points) within the codepages.
Character conversion deals with the user's payload data in broker's send and receive buffers.
For ICU Conversion, the codepage that an EntireX component (sender and receiver) uses is described by so-called locale strings (alias name of a codepage) sent along with the request to the broker. Depending on the platform your EntireX component is running on, the locale string is sent automatically by default or must be provided. A huge set of codepages is available and supported; see ICU Converter Explorer.
Inside the EntireX Broker an automatic mechanism tries to find the best character conversion approach for your scenario. See Broker's Mechanism for Choosing the Character Conversion Approach.
The following sections discuss all of the character conversion approaches offered by EntireX.
ICU conversion is based on IBM's project International Components for Unicode. It is a mature, widely used set of C/C++ and Java libraries for Unicode support, software internationalization and globalization. ICU comes with a set of ICU converters (codepages) based on codepages from ISO and software vendors such as Microsoft and IBM. It is a standardized approach, and it is possible to extend the set with ICU Custom Converters.
To use ICU conversion, the broker must be configured for the platform it is running on. See Configuring ICU Conversion under z/OS | UNIX | Windows | BS2000.
By default it is assumed that the payload sent to/received from the broker matches the platform's default code page. EntireX components running under the Windows operating system and Java-based EntireX components send this platform default code page identifier automatically to the broker, so in most cases nothing needs to be configured or considered by a programmer here. Configuration or programmer attention is required in the following cases:
The EntireX component is running under the operating systems z/OS, UNIX, z/VSE or BS2000. No code page identifier is sent automatically on these platforms.
You require a code page other than the platform default.
Refer to the respective sections of the documentation for how to enable RPC servers and listeners to send a codepage identifier to the broker or send a different identifier than the default codepage for the platform.
Configuring the RPC Server under C | .NET | XML/SOAP | Java | z/OS (CICS, Batch, IMS) | BS2000
Configuring the IBM MQ Side (RPC Server for IBM MQ | Listener for IBM MQ)
Enabling RPC clients to send a codepage identifier to the broker or send a different identifier than the default codepage for the platform is a task for a programmer. See the following sections of the documentation:
Using Internationalization with the C Wrapper | DCOM Wrapper | .NET Wrapper | Java Wrapper | PL/I Wrapper
For ACI-based programming, see:
For Broker ActiveX Control, see localeString under Reference > Properties.
For EntireX Broker ACI for Assembler | C | COBOL | Natural | PL/I | RPG, see LOCALE-STRING under Broker ACI Fields.
The ICU home page (http://site.icu-project.org//) is the main point of entry for information on International Components for Unicode (ICU).
The ICU Converter Explorer, available at http://demo.icu-project.org/icu-bin/convexp, shows aliases and more information on ICU converters. An ICU converter is the codepage definition used by ICU and is defined in the so-called UCM format. If the location has changed since this documentation was published, perform an internet search for the ICU home page and follow the links to the ICU Converter Explorer.
The mapping of aliases to ICU converters is also provided as a text source within an EntireX installation. The location depends on the operating system:
UNIX: /<Install_Dir>/EntireX/etc/convrtrs.txt
Windows: <drive>:\SoftwareAG\EntireX\etc\convrtrs.txt
The codepage definition text files for ICU are described in UCM format (extension ".ucm"). You can edit them with any text editor. The most important section is the mapping table between the CHARMAP and END CHARMAP lines. Each line contains a Unicode code point and the related codepage character byte sequence followed by an optional precision indicator. Four kinds of definitions are supported by the precision indicator:
Precision Indicator | Meaning |
---|---|
0 | Normal roundtrip mapping from a Unicode code point and back. |
1 | Fallback mappings are used during conversion from Unicode to the codepage, but not back again. This definition may be present if a character exists in Unicode but not in the codepage. This feature is useful for human-readable output where the missing character is mapped to a similar looking one. |
2 | Substitution mappings resulting in assignment of the alternative substitution sequence (subchar1 in UCM format) when a non-convertible character occurs, instead of assigning the default substitution sequence (subchar in UCM format). |
3 | Reverse fallback mappings are used during conversion from the codepage to Unicode, but not back again. This definition results in assigning the same Unicode code point for different codepage character byte sequences. |
This brief explanation does not intend to describe the UCM file format fully. For further explanation of the UCM file format, see the ICU home page under ICU Resources above.
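To make the CHARMAP layout and the precision indicators more concrete, the following Python sketch parses simple mapping lines and derives the two conversion directions from them. It is an illustration only, not part of EntireX or ICU: the sample converter content, the function names `parse_charmap` and `build_tables`, and the assumption that a missing precision indicator means 0 (roundtrip) are all hypothetical, and entries with precision 2 or with more than one code point are ignored.

```python
import re

# One simple CHARMAP entry: <UXXXX> \xHH[\xHH...] [|precision]
_ENTRY = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)(?:\s+\|([0-3]))?\s*$"
)

def parse_charmap(ucm_text):
    """Return (code_point, codepage_bytes, precision) for each simple entry."""
    entries = []
    in_charmap = False
    for raw in ucm_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line == "CHARMAP":
            in_charmap = True
            continue
        if line == "END CHARMAP":
            break
        if not in_charmap or not line:
            continue
        match = _ENTRY.match(line)
        if match is None:
            continue  # anything more complex is skipped in this sketch
        code_point = int(match.group(1), 16)
        hex_pairs = re.findall(r"\\x([0-9A-Fa-f]{2})", match.group(2))
        codepage_bytes = bytes(int(h, 16) for h in hex_pairs)
        # Assumption: an omitted precision indicator is treated as roundtrip (0).
        precision = int(match.group(3)) if match.group(3) else 0
        entries.append((code_point, codepage_bytes, precision))
    return entries

def build_tables(entries):
    """Derive codepage->Unicode and Unicode->codepage tables."""
    to_unicode, from_unicode = {}, {}
    for code_point, codepage_bytes, precision in entries:
        if precision in (0, 3):  # roundtrip and reverse fallback
            to_unicode[codepage_bytes] = code_point
        if precision in (0, 1):  # roundtrip and fallback
            from_unicode[code_point] = codepage_bytes
    return to_unicode, from_unicode

# Hypothetical sample content, not taken from a real converter.
SAMPLE = r"""
<code_set_name> "hypothetical-sample"
CHARMAP
<U0041> \x41 |0
<U00C4> \x41 |1
<U00C5> \xC5 |0
<U00C5> \x8F |3
END CHARMAP
"""

to_unicode, from_unicode = build_tables(parse_charmap(SAMPLE))
```

In this sample, U+00C4 falls back to the byte for "A" when converting from Unicode, while both 0xC5 and 0x8F convert to U+00C5 in the other direction.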
ICU uses algorithmic conversion, non-algorithmic conversion and combinations of both. With non-algorithmic conversion, tables are provided that contain a mapping of codepage characters to Unicode as a definition of a codepage. This format is also called UCM Format.
ICU conversion is a two-step process:
The conversion table designated by the sender is used to convert from characters of the source codepage to Unicode.
The conversion table designated by the receiver is used in the reverse direction to convert from Unicode to characters of the target codepage.
ICU uses line-oriented text files to define non-algorithmic converters. For complex codepages, partially or fully algorithmic converters may be used, which cannot be defined as simple text files.
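The two-step process can be sketched with Python's built-in codecs, which stand in here for ICU converters. This is only an analogy under stated assumptions: the function name `convert` is hypothetical, and the Python codec names `cp1252` (Windows ASCII-based) and `cp500` (an EBCDIC codepage) are illustrative choices, not EntireX locale strings.

```python
def convert(payload: bytes, sender_codepage: str, receiver_codepage: str) -> bytes:
    """Convert a payload from the sender's codepage to the receiver's codepage."""
    # Step 1: the sender's table maps source codepage bytes to Unicode.
    text = payload.decode(sender_codepage)
    # Step 2: the receiver's table maps Unicode to target codepage bytes.
    return text.encode(receiver_codepage)

# Request direction: ASCII-based sender, EBCDIC-based receiver.
request = convert("HELLO".encode("cp1252"), "cp1252", "cp500")
# Reply direction: the same process with opposite roles.
reply = convert(request, "cp500", "cp1252")
```

Because Unicode is the pivot, each participant only needs a table for its own codepage; no direct table between the two codepages is required.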
If the provided standard ICU converters (codepages) do not match your requirements, the ICU codepages can be extended by user-written ICU custom converters. This is done with the ICU tool makeconv delivered with EntireX. With makeconv, ICU converter files in UCM Format are compiled into a binary format with extension cnv. The binary format cnv depends on the endianness (big/little endian) and charset family (ASCII/EBCDIC) of the platform where makeconv is executed.
See Building and Installing ICU Custom Converters under z/OS | UNIX | Windows.
Arabic shaping is part of ICU Conversion and is available between UTF-8, the Arabic ASCII codepage windows-1256 and the Arabic EBCDIC codepage IBM-420 for all of the communication models EntireX Broker offers:
ACI-based programming in its various language bindings (Java, C, Assembler, Natural, etc.)
RPC-based components and Reliable RPC, such as DCOM Wrapper, Java Wrapper, XML/SOAP Wrapper, Web Services Wrapper, COBOL Wrapper, PL/I Wrapper, .NET Wrapper etc.
Shaping is performed only on the codepages listed above. Arabic text data must be in logical order; visual order is not supported.
During character conversion, data length may increase or decrease for this type of codepage. The Rules for Data Length Changes due to Character Conversion apply.
Configuration is the same as with ICU Conversion, see Configuration and Usage in section ICU Conversion.
CP803 does not include Latin lowercase characters. For RPC-based components and Reliable RPC, error messages, ping replies etc. are converted to uppercase before conversion to CP803, so Hebrew Codepage 803 can be used with RPC. Latin lowercase characters cannot be used within application data of IDL types A and AV, or in the IDL program and IDL library.
For ACI-based programming there is no special behavior. Latin lowercase characters cannot be used.
Configuration is the same as with ICU Conversion, see Configuration and Usage in section ICU Conversion.
These codepages are designed for use in Asian countries. They are encoded using an escape technique (SI/SO bytes). An example is CP930 (Japan).
For RPC-based components and Reliable RPC, we recommend RPC programmers observe the following, otherwise unpredictable results may occur:
SO and SI escape characters may not be contained
only double-byte characters allowed
single-byte characters cannot be transferred
single-byte and double-byte characters can be transferred using SO and SI escape characters
For ACI-based programming, single-byte and double-byte characters can be transferred using SO and SI escape characters.
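As a rough illustration of the escape technique, the sketch below splits a stateful byte stream into single-byte and double-byte segments. It assumes the usual EBCDIC control codes SO = 0x0E (switch to double-byte mode) and SI = 0x0F (switch back to single-byte mode); the function name `split_segments` and the sample bytes are hypothetical, and no real character tables are involved.

```python
SO = 0x0E  # shift out: following bytes are double-byte characters
SI = 0x0F  # shift in: following bytes are single-byte characters

def split_segments(data: bytes):
    """Split a stateful byte stream into ("SBCS" | "DBCS", bytes) segments."""
    segments = []
    mode = "SBCS"
    current = bytearray()

    def flush():
        if not current:
            return
        if mode == "DBCS" and len(current) % 2 != 0:
            raise ValueError("double-byte segment has an odd number of bytes")
        segments.append((mode, bytes(current)))
        current.clear()

    for byte in data:
        if byte == SO and mode == "SBCS":
            flush()
            mode = "DBCS"
        elif byte == SI and mode == "DBCS":
            flush()
            mode = "SBCS"
        else:
            current.append(byte)
    flush()
    return segments

# One single-byte character, one double-byte character, one single-byte character.
segments = split_segments(b"\xC1\x0E\x44\x81\x0F\xC2")
```

The SO and SI bytes themselves occupy space in the payload, which is one reason data length can change during conversion for this type of codepage.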
During character conversion, data length may increase or decrease for this type of codepage. The Rules for Data Length Changes due to Character Conversion apply.
Configuration is the same as with ICU Conversion, see Configuration and Usage in section ICU Conversion.
Examples are UTF8, CP950, Big5 (traditional Chinese).
During character conversion, data length may increase or decrease for this type of codepage. The Rules for Data Length Changes due to Character Conversion apply.
Configuration is the same as with ICU Conversion, see Configuration and Usage in section ICU Conversion.
Character conversion may force a decrease or increase in data lengths. Such data length changes occur for Multibyte or Double-byte Codepages (for example UTF8), EBCDIC Stateful Codepages, and if Arabic Shaping is in effect. Data length changes do not occur for single-byte codepages.
For RPC-based components and Reliable RPC, behavior depends on the IDL data type (see IDL Data Types in the IDL Editor documentation). RPC programmers should be aware of the following:
IDL field length may increase or decrease. The resulting length is set accordingly.
IDL field length cannot change.
If the data length decreases, padding with SPACE occurs.
If the data length increases, characters are cut off at character boundaries, which may again force padding with SPACE.
IDL field length has a maximum.
If the data length decreases, the field length is adjusted accordingly.
If the data length increases over the maximum length, characters are cut off at character boundaries. The resulting length is set accordingly.
For ACI-based programming, the complete payload may change its length in bytes during conversion.
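The behavior described for a field whose length cannot change can be sketched as follows. This is a simplified illustration, not the broker's implementation: the function name `fit_fixed_field` is hypothetical, Python codecs stand in for ICU converters, and the character-by-character encoding is only valid for stateless codepages such as UTF-8.

```python
def fit_fixed_field(text: str, codepage: str, field_length: int) -> bytes:
    """Encode text into exactly field_length bytes of the target codepage."""
    out = bytearray()
    for ch in text:
        encoded = ch.encode(codepage)
        if len(out) + len(encoded) > field_length:
            break  # cut off at a character boundary, never inside a character
        out += encoded
    # Pad with SPACE; after a cut-off there may still be room left over.
    space = " ".encode(codepage)
    while len(out) + len(space) <= field_length:
        out += space
    return bytes(out)

# Three characters need 9 bytes in UTF-8; only two fit into 8 bytes,
# so the last character is cut off and two SPACE bytes are padded.
field = fit_fixed_field("\u65e5\u672c\u8a9e", "utf-8", 8)
```

For a field with a maximum length, the same cut-off would apply, but the result would keep its shorter length instead of being padded.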
Codepages used to convert RPC data streams must meet several requirements:
Codepages used to convert RPC data streams must have the following code points (characters) defined:
Character | Also Known As | Rendered | Unicode Code Point |
---|---|---|---|
uppercase letters A-Z without special characters | | A - Z | 0x0041 to 0x005A |
lowercase letters a-z without special characters | | a - z | 0x0061 to 0x007A |
digits | | 0-9 | 0x0030 to 0x0039 |
SPACE | | " " | 0x0020 |
LEFT PARENTHESIS | OPENING PARENTHESIS | "(" | 0x0028 |
RIGHT PARENTHESIS | CLOSING PARENTHESIS | ")" | 0x0029 |
PLUS SIGN | | "+" | 0x002B |
HYPHEN | MINUS | "-" | 0x002D |
SOLIDUS | SLASH | "/" | 0x002F |
COLON | | ":" | 0x003A |
COMMA | | "," | 0x002C |
FULL STOP | PERIOD | "." | 0x002E |
EQUALS SIGN | | "=" | 0x003D |
All code points (characters) listed in the table above must have a unique mapping (without any fallbacks and reverse fallbacks) to/from Unicode, that is, they must be roundtrip-compatible.
If the codepage used is a multibyte or double-byte codepage, the code points (characters) listed in the table above must have a length of 1 byte within the codepage. Therefore UTF-16 encoding cannot be used, but UTF-8 encoding is possible.
Codepages that do not obey the rules above cannot be used for RPC-based components, because those code points (characters) are used to encode, for example, the IDL library and IDL program, descriptive metadata and IDL type fields in numeric, integer and binary form.
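A rough check of these requirements can be sketched with Python codecs standing in for ICU converters. The function name `meets_rpc_requirements` is hypothetical, and the check is weaker than the real rules: Python codecs have no fallback mappings, so the sketch only verifies that each required character converts to exactly one byte and back to the same character.

```python
import string

# Required characters from the table above.
REQUIRED = (
    string.ascii_uppercase
    + string.ascii_lowercase
    + string.digits
    + " ()+-/:,.="
)

def meets_rpc_requirements(codepage: str) -> bool:
    """Return True if every required character is a 1-byte roundtrip."""
    for ch in REQUIRED:
        try:
            encoded = ch.encode(codepage)
            if len(encoded) != 1:
                return False  # must occupy exactly one byte
            if encoded.decode(codepage) != ch:
                return False  # must convert back to the same character
        except UnicodeError:
            return False  # character is not defined in the codepage
    return True

results = {name: meets_rpc_requirements(name)
           for name in ("utf-8", "cp500", "utf-16-le")}
```

UTF-8 passes because the required characters are all encoded in one byte, whereas UTF-16 fails because each of them needs two bytes.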
The automatic mechanism for choosing the character conversion approach applies to the following versions:
EntireX Broker 10.1 and above (z/OS, UNIX, Windows)
EntireX Broker 10.3 and above (BS2000)
For example, RPC components indicate to the broker that the data stream is RPC; the broker uses this information to choose the character conversion approach. In this way, incorrect configurations are detected and corrected.
Broker Attribute File Definition | RPC Data Stream Detected (1) | ACI Data Stream Detected (2) | ACI or RPC Data Stream (3) |
---|---|---|---|
CONVERSION=user exit | user exit | user exit | user exit |
TRANSLATION=user exit | user exit | user exit | user exit |
CONVERSION=SAGTCHA | CONVERSION=SAGTRPC | CONVERSION=SAGTCHA | CONVERSION=SAGTCHA |
CONVERSION=SAGTRPC | CONVERSION=SAGTRPC | CONVERSION=SAGTCHA | CONVERSION=SAGTRPC |
CONVERSION=NO or TRANSLATION=NO | CONVERSION=SAGTRPC | no character conversion | no character conversion |
TRANSLATION=SAGTCHA (4) | CONVERSION=SAGTRPC | CONVERSION=SAGTCHA | CONVERSION=SAGTCHA |

BOLD | Character conversion is determined by the broker automatically; the definition in the broker attribute file is ignored and a warning message (one of 00200781, 00200782, 00200786, 00200787, 00200788, 00200789, 00200790) is written to the broker log file. Adapt your broker attribute file to avoid the message. |
---|---|
Note 1: RPC data stream is detected automatically by the broker if the version of the RPC server component is one of the following (or above):
EntireX RPC Server (BS2000) 10.3
EntireX RPC Server (z/OS CICS, z/OS Batch, z/OS IMS, C, .NET) 9.10
EntireX RPC Server (Java, CICS ECI, IMS Connect, XML/SOAP, RPC-ACI, IBM MQ) 9.9
EntireX Adapter 9.9
Natural RPC Server (Mainframe 8.2.7, Open Systems 8.4.1)
Note 2: ACI data stream is detected automatically by the broker from EntireX Java ACI 9.12 or later.
Note 3: If ACI communication is used from non-Java environments, or the EntireX RPC server or Natural RPC server is from an earlier version than listed under Note 1, the data stream can be ACI or RPC.
Note 4: TRANSLATION=SAGTCHA is ignored. The broker uses CONVERSION.
Character conversion is chosen by the broker in the following order of precedence:
1. You can always write your own User Exits if this matches your requirements. This is the first choice if defined.
2. If the broker detects an RPC data stream (see Note 1 above), ICU conversion with broker attribute CONVERSION=SAGTRPC is used.
3. If neither broker attribute CONVERSION nor broker attribute TRANSLATION is defined (the attribute is omitted or set to NO), no character conversion occurs.
4. If the broker detects an ACI data stream (see Note 2 above), ICU conversion with broker attribute CONVERSION=SAGTCHA is used.
5. If the broker attribute CONVERSION=SAGTRPC is defined, ICU conversion approach SAGTRPC is used.
6. In all other cases, ICU conversion approach SAGTCHA is used.
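The order of precedence can be expressed as a small decision function. This is a reading of the rules above, not broker code: the function name `choose_conversion`, its parameters, and the string values used for the data stream type and the result are hypothetical.

```python
def choose_conversion(stream, conversion=None, translation=None, user_exit=False):
    """Return the conversion approach for one broker service.

    stream:      "RPC", "ACI" or "UNKNOWN" (data stream could be ACI or RPC)
    conversion:  None (omitted), "NO", "SAGTCHA" or "SAGTRPC"
    translation: None (omitted), "NO" or "SAGTCHA"
    user_exit:   True if CONVERSION or TRANSLATION names a user exit
    """
    # 1. A user exit is the first choice if defined.
    if user_exit:
        return "user exit"
    # 2. A detected RPC data stream always uses SAGTRPC.
    if stream == "RPC":
        return "SAGTRPC"
    # 3. Neither attribute defined (omitted or NO): no character conversion.
    if conversion in (None, "NO") and translation in (None, "NO"):
        return "no character conversion"
    # 4. A detected ACI data stream uses SAGTCHA.
    if stream == "ACI":
        return "SAGTCHA"
    # 5. Data stream not detected, CONVERSION=SAGTRPC defined.
    if conversion == "SAGTRPC":
        return "SAGTRPC"
    # 6. All other cases.
    return "SAGTCHA"

# CONVERSION=SAGTCHA is configured, but an RPC data stream is detected.
approach = choose_conversion("RPC", conversion="SAGTCHA")
```

Evaluating the function for each attribute definition and data stream type gives the same results as the table of broker attribute file definitions.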
With translation user exits, the code points of the codepage used are under your control. You can distinguish between ASCII, IBM EBCDIC and BS2000 EBCDIC environments (where the caller or participant is running). Code points can be adapted to meet your requirements. This requires programming a user-specific translation routine. See Writing Translation User Exits under z/OS | UNIX | Windows | BS2000. The delivered model for the translation user exit supports single-byte codepages.
For RPC-based components and Reliable RPC, the codepage you implement must meet the Codepage Requirements for RPC Data Stream Conversions.
For ACI-based programming, you can make any structure of the data (mixture of text and binary data) within your payload known to the user exit by means of the ACI field ENVIRONMENT, which can be shared between your application and the translation user exit.
Configuration is simple: only TRANSLATION in the broker attribute file has to be set to the name of your user exit. Nothing needs to be configured or considered for the EntireX component (sender or receiver).
We do not recommend using a translation user exit. If you only need to adapt codepoints, consider migration to ICU Conversion.
If a Translation User Exit is used to adapt code points only, that is, to implement a standard ASCII/EBCDIC codepage, the same functionality can be achieved with ICU conversion, simply by using well-configured Broker's Locale String Defaults and CONVERSION-OPTION-SUBSTITUTE set for the same error behavior as translation. See OPTION Values for Conversion.
Example: For an environment running in Spain using clients with the Windows 1252 codepage and servers on IBM mainframe with codepage 1145, set the following Codepage-specific Attributes:

    DEFAULTS=CODEPAGE
    * Broker Locale String defaults
    DEFAULT_ASCII=windows-1252
    DEFAULT_EBCDIC_IBM=ibm-1145

For ACI-based programming, set the service-specific attribute CONVERSION:

    DEFAULTS=SERVICE
    . . .
    CONVERSION=(SAGTCHA,OPTION=SUBSTITUTE)
    . . .

For RPC-based components and Reliable RPC, set the service-specific attribute CONVERSION:

    DEFAULTS=SERVICE
    . . .
    CONVERSION=(SAGTRPC,OPTION=SUBSTITUTE)
    . . .

For more examples see Configuring Broker's Locale String Defaults.
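The effect of a substitution option can be illustrated with Python codecs, assuming that SUBSTITUTE means non-convertible characters are replaced by a substitution character instead of causing an error. That reading is an assumption, as are the function name `convert` and the Python codecs `cp1252` and `cp500`, which stand in for the ASCII and EBCDIC codepages of the participants and are not the locale strings configured above.

```python
def convert(payload: bytes, sender_cp: str, receiver_cp: str,
            substitute: bool = False) -> bytes:
    """Convert via Unicode; optionally substitute non-convertible characters."""
    text = payload.decode(sender_cp)
    errors = "replace" if substitute else "strict"
    # With "replace", Python encodes "?" for characters missing in the target.
    return text.encode(receiver_cp, errors=errors)

# The euro sign exists in cp1252 but not in cp500.
euro = "\u20ac".encode("cp1252")

substituted = convert(euro, "cp1252", "cp500", substitute=True)

try:
    convert(euro, "cp1252", "cp500")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True  # without substitution the conversion stops with an error
```

With substitution the message is still delivered, at the cost of losing the original character, which is the kind of trade-off to weigh when replacing a translation user exit.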
With the SAGTRPC user exit you can implement your own conversion package/method for RPC-based components and Reliable RPC if for any reason a codepage is not supported by ICU Conversion and SAGTRPC conversion, and CONVERSION=SAGTRPC is configured in the broker attribute file.
SAGTRPC user exit cannot be used for ACI-based programming.
SAGTRPC user exit allows you to adapt codepages and their characters (code points) to meet your requirements. See Writing SAGTRPC User Exits under z/OS | UNIX | Windows. The codepage you implement must meet the Codepage Requirements for RPC Data Stream Conversions.
The broker must be configured for the platform it is running on. See Configuring SAGTRPC User Exits under z/OS | UNIX | Windows.