Configuration and Administration of the Unicode/Code Page Environment

This document covers the following topics:

Customizing the ICU Data Library
Profile Parameters and Macros
Encoding Information

Notation `vr`:

When used in this document, the notation vr represents the 2-digit ICU version number.

Customizing the ICU Data Library

ICU makes use of a wide variety of data tables to provide many of its services. Examples include converter mapping tables, collation rules, transliteration rules, break iterator rules and dictionaries, and other locale data. The ICU data library for Natural is provided as a package that contains the desired data items. Using a package instead of individual data item files improves performance because only one file access is needed during initialization to load the package. The trade-off is reduced flexibility: the package must be rebuilt whenever data items are to be added.

The ICU data library may be customized in order to add existing or new converter mapping tables or to add other data items such as collation rules, break iterator rules and other locale data.

The customization tool for the ICU data library is available from the Download Components area in Empower (https://empower.softwareag.com/). Use the supplied icudtvr.zip file (vr = version) to customize the data libraries for the ICU version required in your Natural environment: ICS Version 1.4 requires icudt54.zip. The files described in this section are contained in the icudtvr.zip file.

Several steps are required to create a new ICU data library package. Some steps are performed on a PC and other steps are performed on the appropriate mainframe platform.

The following topics are covered below:

Obtaining New Data Items
Compiling Conversion Tables and Locales
Creating a New Package
Transferring the Assembler Source to the Mainframe Platform
Using the New Data Library on the Target Mainframe Platform

Obtaining New Data Items

There are different sources for new data items:

The ICU Data Library Customizer at http://apps.icu-project.org/datacustom/.
ICU converter data archive at http://source.icu-project.org/repos/icu/data/trunk/charset/data/ucm/.
User-defined converter mapping tables.

The ICU Data Library Customizer is a web-based tool provided by IBM. The ICU Data Library Customizer is version-dependent. Therefore, an ICU Data Library Customizer is provided for each supported version. The ICU version can be retrieved using the SYSCP utility (see the function ICU Information).

The Data Library Customizer displays the data items in a tree view. Initially, all data items are selected. It is possible to deselect all items by expanding the advanced options and choosing the Deselect All button. Expanding a tree shows all available items of that type. It is possible to reduce the number of displayed items by using the Filter Items button of the advanced options. For example, to reduce the tree view to show only Japanese items, enter the string "japanese" in the text box and choose the Filter Items button. Individual items can then be selected or deselected. All selected items are later added to the delivered base ICU data library and will be available for Natural.

The second possibility is to obtain or create a .ucm (source) mapping data file for the desired converter. A large archive of converter data is maintained by the ICU team. This archive is version-independent. If the desired conversion table is not available in the archive, it is possible to create a new one. For the documentation of the layout of converter mapping tables (.ucm files), refer to the chapter Conversion Data of the ICU User Guide (http://userguide.icu-project.org/conversion/data). It is recommended to take a similar mapping table from the archive, rename it and adjust it to the new requirements.

Note:
Do not change the character mapping of an existing converter. A changed mapping is in effect a new converter and requires a new converter name to avoid confusion.

Detailed information on how to customize the ICU data library is provided in the readme.txt file which is part of the downloaded zip file.

Compiling Conversion Tables and Locales

Converter source files are compiled into binary converter files (.cnv files) by using the makeconv.exe tool. It is possible to specify more than one converter.

Example:

Command	Description
makeconv ibm-1142_P100-1997.ucm	Compile the Danish character mapping table of code page IBM-1142 into the binary format. With a subsequent step, the output file ibm-1142_P100-1997.cnv can be added to the new data library package.

Converters obtained from ICU are already registered in the alias name table. No additional step is necessary. If the converter source file is a user-defined file, there will be no appropriate entry in the alias name table. In this case, it is necessary to register the new code page in the alias name table. Open the text file convrtrs.txt and append an appropriate entry at the end of that file in the section "User defined code pages". The name of the code page is required and the IANA name is optional. The string { IANA* } declares iana-name as an IANA name. Each user-defined code page requires an entry in the alias name table.

The entry has the following format:

name-of-code-page  iana-name { IANA* }

If the code page "my_cp-100" with the IANA name "MYCP" is to be added, the following line in convrtrs.txt is required:

my_cp-100            MYCP { IANA* }

For more information, refer to the header of convrtrs.txt or to the ICU User Guide.
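The entry format above can be sketched as a small helper that produces one line for convrtrs.txt. The function name alias_entry and its validation are hypothetical and only illustrate the documented layout (code page name required, IANA name optional); the header of convrtrs.txt remains the authoritative description.

```python
from typing import Optional


def alias_entry(code_page: str, iana_name: Optional[str] = None) -> str:
    """Return one alias table line for a user-defined code page.

    Hypothetical helper: it only mirrors the layout
    'name-of-code-page  iana-name { IANA* }' described in the text.
    """
    if not code_page or any(ch.isspace() for ch in code_page):
        raise ValueError("code page name must be non-empty and free of whitespace")
    if iana_name is None:
        # The IANA name is optional, so the code page name alone is a valid entry.
        return code_page
    return f"{code_page}  {iana_name} {{ IANA* }}"


print(alias_entry("my_cp-100", "MYCP"))
```

The generated line would then be appended to the "User defined code pages" section before the alias name table is compiled.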

The modified alias name table has to be compiled with gencnval.exe into a binary file (cnvalias.icu) to be linked to the new data library package.

Creating a New Package

A new package is created with the tool makepkg.bat. It uses the delivered package icudtvr.dat (vr = version) and merges new, user-defined items. A user-defined item can be an additional package that contains new data items, a single data item such as a new converter (.cnv file), or a text file that contains a list of new items.

Examples:

Command	Description
makepkg icudtvrl.dat	Add the data items contained in icudtvrl.dat to the base package icudtvrb.dat.
makepkg ibm-1142_P100-1997.cnv	Add the Danish code page IBM-1142 to the base data library package icudtvrb.dat.
makepkg newitems.txt	Add all data items (converters) that are listed in the text file newitems.txt to the base data library package icudtvrb.dat.

makepkg.bat produces two files, a big-endian EBCDIC-based binary file and an HL assembler source. The assembler source contains the binary image of the first file packed into DC X'...' statements. The name of the binary file is icudtvre.dat and the name of the assembler source is icudtvre_dat.s. The file icudtvre is a copy of icudtvre_dat.s. The files must never be renamed since the package name "icudtvre" is used as a part of internal references of data items. It is used by the ICU runtime to access data items such as converters and to validate the data file. "icudt" identifies an ICU data file, "vr" is the version and "e" identifies the file as big-endian EBCDIC-encoded.
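The naming scheme described above can be illustrated with a short sketch that derives the output file names from the 2-digit version and splits a package name back into its parts. The function names are hypothetical; only the "icudt" prefix, the version digits and the "e" suffix come from the text.

```python
import re


def package_files(vr: str) -> dict:
    """Derive the names of the makepkg.bat outputs for a 2-digit ICU version."""
    if not re.fullmatch(r"\d{2}", vr):
        raise ValueError("vr must be the 2-digit ICU version number, e.g. '54'")
    package = f"icudt{vr}e"  # 'e' marks big-endian, EBCDIC-encoded data
    return {
        "package": package,
        "binary": f"{package}.dat",
        "assembler_source": f"{package}_dat.s",
    }


def parse_package_name(name: str) -> tuple:
    """Split a package name into (version, type letter)."""
    match = re.fullmatch(r"icudt(\d{2})([a-z])", name)
    if match is None:
        raise ValueError(f"not an ICU data package name: {name!r}")
    return match.group(1), match.group(2)


print(package_files("54"))
print(parse_package_name("icudt54e"))
```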

If more than one item is to be added or if the alias name table has been changed, the items have to be declared as a list in the newitems.txt file.

Examples:

Add code pages ibm-939_P120-1999 and ibm-942_P12A-1999
Entries of newitems.txt:

ibm-939_P120-1999.cnv
ibm-942_P12A-1999.cnv
Add user defined code page my_cp-100
Entries of newitems.txt:

cnvalias.icu
my_cp-100.cnv

For more information, refer to the ICU User Guide.
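As a rough illustration of the two newitems.txt examples above, the following sketch renders the file content from a list of converter names. The function render_newitems is hypothetical, and placing cnvalias.icu first simply follows the order shown in the second example; the text does not state that the order matters.

```python
from typing import Iterable


def render_newitems(converters: Iterable[str], alias_table_changed: bool = False) -> str:
    """Render the content of newitems.txt, one data item per line."""
    items = []
    if alias_table_changed:
        # A changed alias name table must be listed along with the converters.
        items.append("cnvalias.icu")
    for name in converters:
        # Accept names with or without the .cnv extension.
        items.append(name if name.endswith(".cnv") else f"{name}.cnv")
    return "\n".join(items) + "\n"


print(render_newitems(["ibm-939_P120-1999", "ibm-942_P12A-1999"]), end="")
print(render_newitems(["my_cp-100"], alias_table_changed=True), end="")
```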

Transferring the Assembler Source to the Mainframe Platform

The result of the previous step is an assembler source module. The assembler module with the new data library package icudtvre has to be transferred to the target platform. The File Transfer Protocol (FTP) is available on each PC and can be used for this task. Ask the system administrator for the required information (such as host name, port number, user name and password) for accessing the target machine via FTP. Since icudtvre is a text file, the transfer mode must be set to ASCII to ensure the correct translation of the file on the target platform. The name of the file on the target platform is arbitrary. However, it is recommended to use the name icudtvre. If it is desired to rename icudtvre, the renamepkg.bat tool has to be used.
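The transfer can also be scripted. The sketch below uses Python's standard ftplib; its storlines method performs the upload in ASCII (text) mode, which is the mode required above. Host name, port, credentials and the remote name are placeholders, and how the remote name maps to a data set on the target platform is an assumption that depends on the FTP server configuration.

```python
from ftplib import FTP


def upload_text(ftp, fp, remote_name: str) -> None:
    """Upload a text file in ASCII mode.

    'ftp' is a connected ftplib.FTP object and 'fp' a file object
    opened in binary mode, as storlines expects.
    """
    ftp.storlines(f"STOR {remote_name}", fp)


def transfer(host: str, user: str, password: str,
             local_path: str, remote_name: str, port: int = 21) -> None:
    """Connect, log in and upload the assembler source (placeholder values)."""
    with FTP() as ftp:
        ftp.connect(host, port)
        ftp.login(user, password)
        with open(local_path, "rb") as fp:
            upload_text(ftp, fp, remote_name)


# Example call (not executed here; all values are placeholders):
# transfer("mainframe.example.com", "user", "secret", "icudt54e", "icudt54e")
```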

Using the New Data Library on the Target Mainframe Platform

The assembler source module must be assembled and linked on the target platform. It can either be linked to the nucleus or it can be loaded dynamically with RCA=name and CFICU=(DATFILE=name).

Profile Parameters and Macros

This section lists the profile parameters and macros which are used in conjunction with Unicode and code page support.

Unless otherwise noted, the profile parameters and macros mentioned in this section are explained in detail in the Parameter Reference.

Parameter or Macro	Description
`CFICU` or `NTCFICU` macro	Enables Unicode support for various Unicode settings. See also `CFICU` Parameter and `CFICU` and `CP`: Session Modes
`CMPO` or `CPAGE` keyword subparameter of `NTCMPO` macro	Generates code page-sensitive Natural programs. See also `CPAGE` Compiler Option.
`CP`	Defines the default code page for Natural. This code page is used for the runtime and development environment unless it is overridden by a code page defined for a single object (for example, for a Natural source). Only platform-suitable code pages can be used. This means, for example, that no ASCII code page can be defined for a mainframe platform. An initialization error message occurs if an unsuitable code page is used. See also `CFICU` and `CP`: Session Modes.
`CPCVERR`	Specifies whether a conversion error that occurs when converting from Unicode to a code page, from a code page to Unicode, or from one code page to another results in a Natural error. This parameter is not honored when Natural sources are converted while being loaded into the source area or cataloged. It is also not honored when a Unicode field is converted into the code page before an I/O on a terminal emulation. In this case, the substitution character defined by ICU is replaced by the place holder character which is defined in `NATCONFG`.
`CPOBJIN`	Specifies the code page in which the batch input file for data is encoded. This file is defined in the data set `CMOBJIN`.
`CPPRINT`	Specifies the code page in which the batch output file shall be encoded. This file is defined in the data set `CMPRINT`.
`CPSYNIN`	Specifies the code page in which the batch input file for commands is encoded. This file is defined in the data set `CMSYNIN`.
`NTCPAGE` macro	In the `NATCONFG` module, this macro defines a code page and all related information, such as place holder character, locale ID and collation tables. See also `NTCPAGE` Macro. `NTCPAGE` and `NATCONFG` are explained in detail in the Operations documentation.
`OPRB` or `NTOPRB` macro	Sets the `ACODE` and/or `WCODE` option to define the user encoding if the used Adabas database is enabled for UES (universal encoding support).
`PRINT` or `CP` keyword subparameter of `NTPRINT` macro	Defines the code page for a report.
`SRETAIN`	Specifies that all existing sources have to be saved in their original encoding format. See also Customizing Your Environment.

CFICU Parameter

The parameter CFICU and its subparameters are explained in detail in the Parameter Reference. Some of the subparameters have an impact on the performance.

If collation services are used to compare Unicode strings, both strings are first checked to determine whether they are normalized. The check itself consumes a lot of CPU time. If you are sure that the strings are already normalized, you can switch off the check (COLNORM=OFF).

In Unicode, it is possible to represent the same character as one code point or as a combination of two or more code points. For example, the German character "ä" can be represented by "U+00E4" or by the combination of the code points "U+0061" and "U+0308". The conversion from Unicode to, for example, IBM01140 treats combined characters as single code points and produces an "a" followed by a substitution character since code point "U+0308" is not represented in the target code page. With CNVNORM=ON, a normalization is performed right before the actual conversion. The normalization consumes additional CPU time and temporary storage. If you are sure that no combining characters are involved in MOVE statements (except MOVE NORMALIZED), you should set CNVNORM to OFF to increase performance. Note that not all possible combinations are represented by a single coded Unicode code point.
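The effect of normalizing before conversion can be reproduced outside Natural with Python's standard library. This is only an illustration under assumptions: cp1140 is Python's codec for IBM01140, NFC is assumed as the normalization form (the text does not name the form used by CNVNORM), and Python substitutes "?" for unmappable characters, whereas ICU uses the substitution character of the code page.

```python
import unicodedata

CODE_PAGE = "cp1140"  # Python's codec name for IBM01140


def to_code_page(text: str, normalize: bool) -> bytes:
    """Convert Unicode text to the code page, optionally normalizing first."""
    if normalize:
        # Compose 'a' + U+0308 into the single code point U+00E4.
        text = unicodedata.normalize("NFC", text)
    # Unmappable code points are replaced instead of raising an error.
    return text.encode(CODE_PAGE, errors="replace")


decomposed = "a\u0308"  # 'a' followed by COMBINING DIAERESIS
without_norm = to_code_page(decomposed, normalize=False)
with_norm = to_code_page(decomposed, normalize=True)
print(len(without_norm), len(with_norm))  # 2 1
```

Without normalization the result is two bytes ("a" plus a substitute); with normalization it is the single byte that represents "ä" in the target code page.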

Conversion from Unicode to code page and vice versa is comparatively slow. The reason is that the ICU implementation is written in C++ and that it covers nearly all Unicode, code page and language aspects in the world. However, some code pages can be mapped to Unicode (and vice versa) via translation tables to accelerate conversion. Accelerator tables are activated with the CPOPT subparameter. If it is set to ON, Natural automatically creates two accelerator tables during session initialization by using ICU conversion functions. The first table (with a size of 512 bytes) is used for conversion from code page to Unicode and the other table (with a size of 65535 bytes) is used for conversion from Unicode to code page. During a Natural session, all conversions are then executed via the accelerator tables instead of ICU calls. Accelerator tables are only provided for the default code page (*CODEPAGE). Temporary code pages (for example, in MOVE ENCODED statements) do not use accelerator tables if the module NATCPTAB is not linked. If it is linked, up to 30 accelerator tables based on the ICU database are used to speed up performance.
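The idea behind accelerator tables can be sketched in a few lines: the converter is called once per code point while the tables are built, and every later conversion is a plain table lookup. The sketch uses Python's cp1140 codec as a stand-in for the ICU converter of IBM01140. It does not reflect Natural's internal layout; in particular, the size of the reverse table here is a property of this sketch only.

```python
from array import array


def build_tables(code_page: str, substitute: int):
    """Build lookup tables for a single-byte code page.

    to_unicode:   256 entries of 16 bits each (512 bytes in total).
    from_unicode: one byte per code point below U+10000, preset to the
                  substitution byte for code points the code page lacks.
    """
    chars = bytes(range(256)).decode(code_page)
    to_unicode = array("H", (ord(ch) for ch in chars))
    from_unicode = bytearray([substitute]) * 0x10000
    for byte_value, code_point in enumerate(to_unicode):
        from_unicode[code_point] = byte_value
    return to_unicode, from_unicode


def decode(data: bytes, to_unicode) -> str:
    """Code page to Unicode by table lookup."""
    return "".join(chr(to_unicode[b]) for b in data)


def encode(text: str, from_unicode, substitute: int) -> bytes:
    """Unicode to code page by table lookup."""
    return bytes(
        from_unicode[ord(ch)] if ord(ch) < 0x10000 else substitute
        for ch in text
    )


to_u, from_u = build_tables("cp1140", 0x3F)
print(len(to_u) * to_u.itemsize)  # 512
```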

CFICU and CP: Session Modes

The parameters CFICU and CP can be used to adjust Natural to specific purposes:

Settings	Description
`CFICU=OFF, CP=OFF`	Compatibility mode. For running existing applications without Unicode and without code page support. Legacy translation tables are used for I/O translation. Compared with former versions, there is no significant increase in resource consumption (CPU time and buffer usage). This mode does not need the ICS module `SAGICU` (or an alternative ICS module on z/VSE and z/OS) to be linked to the Natural nucleus.
`CFICU=ON, CP=OFF`	For new applications that are using Unicode and code page conversion (`MOVE ENCODED`) but not default code page support. Therefore, the system variable `*CODEPAGE` is empty. It is possible to use U format variables, but it is not possible to use, for example, `MOVE A TO U`, since this requires the default code page information. The error NAT3411 will be issued indicating that no default code page is available.
`CFICU=ON, CP=value` ^*	For new applications that are using full Unicode as well as code page support.
`CFICU=OFF, CP=value` ^*	This combination does not make sense, because code page support needs ICU services for conversion. Therefore, `CFICU=ON` is enforced in this case and a session initialization message is issued.

^* where value is any value other than OFF.
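The four combinations in the table can be summarized as a small decision function. This is only a model of the documented behavior for illustration; the function name and the returned descriptions are hypothetical and do not correspond to a Natural API.

```python
def effective_session(cficu: str, cp: str) -> tuple:
    """Return (effective CFICU, effective CP, description) for a setting pair."""
    cficu = cficu.upper()
    if cficu not in ("ON", "OFF"):
        raise ValueError("CFICU must be ON or OFF")
    cp_off = cp.upper() == "OFF"
    if cficu == "OFF" and cp_off:
        return "OFF", "OFF", "compatibility mode"
    if cficu == "ON" and cp_off:
        return "ON", "OFF", "Unicode without default code page"
    if cficu == "ON":
        return "ON", cp, "full Unicode and code page support"
    # CFICU=OFF with a code page: ICU is needed, so CFICU=ON is enforced.
    return "ON", cp, "CFICU=ON enforced with initialization message"


print(effective_session("OFF", "IBM01140"))
```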

CPAGE Compiler Option

The compiler option CPAGE creates objects that can be executed with a code page which is different from the code page used at creation time. This means that all alphanumeric constants of the object, which are coded with the code page used at creation time, have to be converted to the code page which is active at execution time. To make it possible for the Natural object loader to find and convert alphanumeric constants, an additional table is created by the compiler. This increases the size of the generated object, depending on the number of alphanumeric constants used. The conversion at runtime consumes additional CPU time. If the default code page (value of the system variable *CODEPAGE) is the same as the code page at creation time or if the session has no default code page (CP=OFF), no conversion is done. Conversion errors are ignored, regardless of the setting of the parameter CPCVERR. If the compiler option CPAGE is set to OFF, no conversion is performed at runtime and the alphanumeric constants are treated as they are.

The following sample program is cataloged with code page IBM01141 (German) and is executed with default code page IBM01140 (us). The characters "Ä", "Ö" and "Ü" are defined in both code pages, but at different code points.

Example 1 - CPAGE=OFF:

OPTIONS CPAGE=OFF
WRITE *CODEPAGE  'ÄÖÜ'
END

Output with code page IBM01140 (us):

Page      1                                                 
                                                                              
IBM01140                                                         ¢\!

Example 2 - CPAGE=ON:

OPTIONS CPAGE=ON
WRITE *CODEPAGE  'ÄÖÜ'
END

Output with code page IBM01140 (us):

Page      1                                                 

IBM01140                                                         ÄÖÜ
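The two outputs above can be reproduced outside Natural with Python's codecs. The sketch rests on assumptions: Python has no codec named IBM01141, so cp273 (the German EBCDIC code page, which to the best of our knowledge differs from IBM01141 only in the euro sign position) stands in for the catalog-time code page, and cp1140 stands in for IBM01140. The function is a model of the loader behavior, not Natural code.

```python
CATALOG_CODE_PAGE = "cp273"    # stand-in for IBM01141 (assumption, see above)
RUNTIME_CODE_PAGE = "cp1140"   # Python's codec name for IBM01140


def displayed_constant(constant: str, cpage: bool) -> str:
    """Model how an alphanumeric constant appears at execution time."""
    # The constant is stored in the object as bytes of the catalog-time code page.
    stored = constant.encode(CATALOG_CODE_PAGE)
    if cpage:
        # CPAGE=ON: the constant is converted to the runtime code page when
        # the object is loaded; conversion errors are ignored (replaced here).
        stored = stored.decode(CATALOG_CODE_PAGE).encode(
            RUNTIME_CODE_PAGE, errors="replace"
        )
    # The terminal interprets the bytes in the runtime code page.
    return stored.decode(RUNTIME_CODE_PAGE)
```

With cpage=False the constant 'ÄÖÜ' comes out as the three characters shown in Example 1; with cpage=True it is unchanged, as in Example 2.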

NTCPAGE Macro

The most common standard for code page names is the IANA name. Therefore, the system variable *CODEPAGE contains the IANA name of the default code page. On z/VSE and z/OS, a code page is qualified by its Coded Character Set ID (CCSID). On BS2000/OSD, the Coded Character Set Name (CCSN) is most commonly used. Currently, Adabas uses the Entire Conversion Service definition (ADAECS). The macro NTCPAGE can be used to assign these different names to the unambiguous IANA name. NTCPAGE is part of the Natural configuration module (NATCONFG).

It does not matter whether the IANA name, the CCSID/CCSN or the alias name is entered with the CP parameter. The alias name can be a user-defined name which is used to assign a more significant name to the code page. In any case, *CODEPAGE contains the IANA name of the selected code page.

In addition, a place holder character can be defined for a code page. It overrides the default substitution character of that code page, which is normally a non-displayable character (for example, H'3F' in an EBCDIC code page). The place holder character can be used to prevent non-displayable characters from being sent to terminals.

Example:

NTCPAGE IANA=IBM01140,CCSID=1140,ECS=1140,ALIAS='US',PHC=003F

The values IBM01140, 1140 or US can be entered with the CP parameter to activate the code page. *CODEPAGE contains the name IBM01140. The substitution character of the code page will be replaced by "U+003F", which is a question mark (?).
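The name resolution described for this example can be modeled as a lookup over the NTCPAGE entries. The data class and the resolve function below are hypothetical and only mirror the example entry; matching names without regard to case is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodePageEntry:
    """One NTCPAGE entry, reduced to the keywords used in the example."""
    iana: str
    ccsid: str
    ecs: str
    alias: str
    phc: int  # place holder character as a Unicode code point


ENTRIES = [CodePageEntry("IBM01140", "1140", "1140", "US", 0x003F)]


def resolve(cp_value: str, entries=ENTRIES) -> CodePageEntry:
    """Find the entry whose IANA name, CCSID or alias matches the CP value."""
    key = cp_value.upper()
    for entry in entries:
        if key in (entry.iana.upper(), entry.ccsid, entry.alias.upper()):
            return entry
    raise LookupError(f"no code page entry for {cp_value!r}")


for value in ("IBM01140", "1140", "US"):
    print(value, "->", resolve(value).iana)
```

Whichever of the three names is entered, the IANA name is the result, which corresponds to the content of *CODEPAGE.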

The number of available code pages depends on the used ICU data library.

All code pages defined in the currently used data package can be used by Natural. An NTCPAGE entry is only necessary if an alternative alias name or place holder character is desired.

Natural Development Server

The following configuration parameter is available with Natural Development Server (NDV):

Settings	Description
`TERMINAL_EMULATION=WEBIO`	Specifies that the Natural Web I/O Interface client (which supports Unicode) is used for input and output.

Encoding Information

The code page information of the object is part of the object directory displayed with the LIST system command. For details, see Displaying Directory Information in the System Commands documentation.

The encoding of code page data can be specified on different levels.

Level 1 - Default Code Page

The default code page can be defined with the CP parameter.

Level 2 - Code Page for a Single Object

A code page can be defined for Natural sources, batch input (CPOBJIN, CPSYNIN) and output files (CPPRINT).

If a code page is defined at object level, it overrides the default code page.
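The precedence between the two levels can be written down as a one-line rule. The function below is a hypothetical illustration, not a Natural interface; None stands for "no code page defined at this level".

```python
from typing import Optional


def effective_code_page(default_cp: Optional[str],
                        object_cp: Optional[str] = None) -> Optional[str]:
    """Level 2 (object) takes precedence over level 1 (default)."""
    return object_cp if object_cp else default_cp


# Session default from CP, with a different code page for the batch output file.
print(effective_code_page("IBM01140"))              # IBM01140
print(effective_code_page("IBM01140", "IBM01141"))  # IBM01141
```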