When used in this document, the notation vr represents the 2-digit ICU version number.
ICU makes use of a wide variety of data tables to provide many of its services. Examples include converter mapping tables, collation rules, transliteration rules, break iterator rules and dictionaries, and other locale data. The ICU data library for Natural is provided as a package that contains the desired data items. Using a package instead of single data item files improves performance because only one file access is needed during initialization to load the package. However, a package is less flexible because it has to be rebuilt whenever data items are to be added.
The ICU data library may be customized in order to add existing or new converter mapping tables or to add other data items such as collation rules, break iterator rules and other locale data.
The customization tool for the ICU data library is available from the Download Components area in Empower (https://empower.softwareag.com/). Use the supplied icudtvr.zip file (vr = version) to customize the data libraries for the ICU version required in your Natural environment: ICS Version 1.4 requires icudt54.zip. The files described in this section are contained in the icudtvr.zip file.
Several steps are required to create a new ICU data library package. Some steps are performed on a PC and other steps are performed on the appropriate mainframe platform.
There are different sources for new data items:
The ICU Data Library Customizer at http://apps.icu-project.org/datacustom/.
ICU converter data archive at https://github.com/unicode-org/icu.
User-defined converter mapping tables.
The ICU Data Library Customizer is a web-based tool provided by IBM. The ICU Data Library Customizer is version-dependent. Therefore, an ICU Data Library Customizer is provided for each supported version. The ICU version can be retrieved using the SYSCP utility (see the function ICU Information).
The Data Library Customizer displays the data items in a tree view. Expanding a tree shows all available items of that type. Initially, all data items are selected. It is possible to deselect all items by expanding the advanced options and choosing the corresponding button. Individual items can then be selected or deselected. To reduce the number of displayed items, use the Filter Items button of the advanced options. For example, to reduce the tree view to show only items related to Japanese, enter the string "japanese" in the text box and choose the Filter Items button. All selected items are later added to the delivered base ICU data library and will be available for Natural.
The second possibility is to obtain or create a .ucm (source) mapping data file for the desired converter. A large archive of converter data is maintained by the ICU team. This archive is version-independent. If the desired conversion table is not available in the archive, it is possible to create a new one. For the documentation of the layout of converter mapping tables (.ucm files), refer to the chapter Conversion Data of the ICU User Guide (http://userguide.icu-project.org/conversion/data). It is recommended to take a similar mapping table from the archive, rename it and adjust it to the new requirements.
Note:
Do not change the character mapping of an existing converter. Changing a mapping effectively creates a new converter, which requires a new converter name to avoid confusion.
Detailed information on how to customize the ICU data library is provided in the readme.txt file which is part of the downloaded zip file.
Converter source files are compiled into binary converter files (.cnv files) by using the makeconv.exe tool. It is possible to specify more than one converter.
Example:
Command | Description |
---|---|
makeconv ibm-1142_P100-1997.ucm | Compile the Danish character mapping table of code page IBM-1142 into the binary format. With a subsequent step, the output file ibm-1142_P100-1997.cnv can be added to the new data library package. |
Converters obtained from ICU are already registered in the alias name table. No additional step is necessary. If the converter source file is a user-defined file, there will be no appropriate entry in the alias name table. In this case, it is necessary to register the new code page in the alias name table. Open the text file convrtrs.txt and append an appropriate entry at the end of that file in the section "User defined code pages". The name of the code page is required and the IANA name is optional. The string { IANA* } declares iana-name as an IANA name. Each user-defined code page requires an entry in the alias name table.
The entry has the following format:
name-of-code-page iana-name { IANA* }
If the code page "my_cp-100" with the IANA name "MYCP" is to be added, the following line in convrtrs.txt is required:
my_cp-100 MYCP { IANA* }
For more information, refer to the header of convrtrs.txt or to the ICU User Guide.
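When several user-defined code pages have to be registered, the entry lines can be generated rather than typed. The following Python sketch only illustrates the entry format described above. The helper name format_alias_entry is hypothetical and is not part of the customization tool.

```python
def format_alias_entry(code_page, iana_name=None):
    """Build one convrtrs.txt line for a user-defined code page.

    Hypothetical helper for illustration only. The layout follows the
    'name-of-code-page iana-name { IANA* }' format described above.
    """
    if not code_page or any(ch.isspace() for ch in code_page):
        raise ValueError("code page name must be non-empty and contain no whitespace")
    if iana_name is None:
        # The IANA name is optional; only the code page name is required.
        return code_page
    return f"{code_page} {iana_name} {{ IANA* }}"


if __name__ == "__main__":
    print(format_alias_entry("my_cp-100", "MYCP"))
```

The generated line still has to be appended to the section "User defined code pages" of convrtrs.txt by hand or by a separate script.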
The modified alias name table has to be compiled with gencnval.exe into a binary file (cnvalias.icu) to be linked to the new data library package.
A new package is created with the tool makepkg.bat. It uses the delivered package icudtvr.dat (vr = version) and merges new, user-defined items. A user-defined item can be an additional package that contains new data items, a single data item such as a new converter (.cnv file), or a text file that contains a list of new items.
Examples:
Command | Description |
---|---|
makepkg icudtvrl.dat | Add the data items contained in icudtvrl.dat to the base package icudtvrb.dat. |
makepkg ibm-1142_P100-1997.cnv | Add the Danish code page IBM-1142 to the base data library package icudtvrb.dat. |
makepkg newitems.txt | Add all data items (converters) that are listed in the text file newitems.txt to the base data library package icudtvrb.dat. |
makepkg.bat produces two files, a big-endian EBCDIC-based binary file and an HL assembler source. The assembler source contains the binary image of the first file packed into DC X'...' statements. The name of the binary file is icudtvre.dat and the name of the assembler source is icudtvre_dat.s. The file icudtvre is a copy of icudtvre_dat.s. The files must never be renamed since the package name "icudtvre" is used as a part of internal references of data items. It is used by the ICU runtime to access data items such as converters and to validate the data file. "icudt" identifies an ICU data file, "vr" is the version and "e" identifies the file as big-endian EBCDIC-encoded.
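The naming scheme can be checked mechanically before a package is transferred. The following Python sketch splits a package name into its parts. The function name parse_package_name is hypothetical. The meaning of "e" is taken from the description above, while the meanings of "l" and "b" follow the general ICU naming convention and are included as an assumption.

```python
import re

# Platform suffixes of ICU data package names. "e" is described in this
# document; "l" and "b" are assumed from the general ICU convention.
PLATFORM_TYPES = {
    "l": "little-endian, ASCII",
    "b": "big-endian, ASCII",
    "e": "big-endian, EBCDIC",
}


def parse_package_name(name):
    """Split a name such as 'icudt54e' into (version, platform type)."""
    match = re.fullmatch(r"icudt(\d{2})([lbe])", name)
    if match is None:
        raise ValueError(f"not an ICU data package name: {name!r}")
    version, suffix = match.groups()
    return version, PLATFORM_TYPES[suffix]


if __name__ == "__main__":
    print(parse_package_name("icudt54e"))
```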
If more than one item is to be added or if the alias name table has been changed, the items have to be declared as a list in the newitems.txt file.
Examples:
Add code pages ibm-939_P120-1999 and ibm-942_P12A-1999. Entries of newitems.txt:
ibm-939_P120-1999.cnv
ibm-942_P12A-1999.cnv
Add the user-defined code page my_cp-100. Entries of newitems.txt:
cnvalias.icu
my_cp-100.cnv
For more information, refer to the ICU User Guide.
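The list file can also be written by a small script. The following Python sketch builds the content of newitems.txt from a list of converter names. The function name build_newitems is hypothetical, and placing cnvalias.icu first simply mirrors the example above. Whether the order matters to makepkg.bat is not stated in this document.

```python
def build_newitems(converters, alias_table_changed=False):
    """Return the text of newitems.txt for the given converter names.

    Hypothetical helper. 'converters' may be given with or without
    the .cnv extension.
    """
    entries = []
    if alias_table_changed:
        # A changed alias name table is listed as cnvalias.icu.
        entries.append("cnvalias.icu")
    for name in converters:
        entries.append(name if name.endswith(".cnv") else name + ".cnv")
    return "\n".join(entries) + "\n"


if __name__ == "__main__":
    # Writes the list for the user-defined code page from the example above.
    with open("newitems.txt", "w", encoding="ascii") as out:
        out.write(build_newitems(["my_cp-100"], alias_table_changed=True))
```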
The result of the previous step is an assembler source module. The assembler module with the new data library package icudtvre has to be transferred to the target platform. The File Transfer Protocol (FTP) is available on each PC and can be used for this task. Ask the system administrator for the required information (such as host name, port number, user name and password) for accessing the target machine via FTP. Since icudtvre is a text file, the transfer mode must be set to ASCII to ensure the correct translation of the file on the target platform. The name of the file on the target platform is arbitrary. However, it is recommended to use the name icudtvre. If it is desired to rename icudtvre, the renamepkg.bat tool has to be used.
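If the transfer is to be scripted, the ASCII transfer mode can be selected with the Python standard library. The following is a minimal sketch, not a supported procedure: host name, user name, password and file names are placeholders that have to be obtained from the system administrator, and the function names are hypothetical.

```python
from ftplib import FTP


def upload_ascii(ftp, local_path, remote_name):
    """Send a text file in ASCII mode over an already logged-in FTP session."""
    # storlines() switches the session to ASCII mode (TYPE A) before the
    # STOR command, so the target platform can translate the text.
    with open(local_path, "rb") as source:
        return ftp.storlines(f"STOR {remote_name}", source)


def transfer(host, user, password, local_path, remote_name):
    """Log in to the target machine and upload the assembler source."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        return upload_ascii(ftp, local_path, remote_name)
```

A call such as transfer("host.example", "user", "secret", "icudt54e", "icudt54e") would upload the file for ICU version 54. All argument values in this call are placeholders.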
The assembler source module must be assembled and linked on the target platform. It can either be linked to the nucleus or it can be loaded dynamically with RCA=name and CFICU=(DATFILE=name).
This section lists the profile parameters and macros which are used in conjunction with Unicode and code page support.
Unless otherwise noted, the profile parameters and macros mentioned in this section are explained in detail in the Parameter Reference.
Parameter or Macro | Description |
---|---|
CFICU or NTCFICU macro | Enables Unicode support for various Unicode settings. |
CMPO or CPAGE keyword subparameter of NTCMPO macro | Generates code page-sensitive Natural programs. |
CP | Defines the default code page for Natural. This code page is used for the runtime and development environment unless it is overridden by a code page defined for a single object (for example, for a Natural source). Only platform-suitable code pages can be used. This means, for example, that no ASCII code page can be defined for a mainframe platform. An initialization error message is issued if an unsuitable code page is used. |
CPCVERR | Specifies whether a conversion error that occurs when converting from Unicode to code page, from code page to Unicode, or from one code page to another code page results in a Natural error. This parameter is not regarded for the conversion of Natural sources when loading them into the source area or when cataloging them. It is also not regarded when a Unicode field is converted into the code page before an I/O on a terminal emulation. In this case, the substitution character defined by ICU is replaced by the placeholder character which is defined in the NTCPAGE macro. |
CPOBJIN | Specifies the code page in which the batch input file for data is encoded. This file is defined in the data set CMOBJIN. |
CPPRINT | Specifies the code page in which the batch output file shall be encoded. This file is defined in the data set CMPRINT. |
CPSYNIN | Specifies the code page in which the batch input file for commands is encoded. This file is defined in the data set CMSYNIN. |
NTCPAGE macro | In the NATCONFG module, this macro defines a code page and all related information, such as placeholder character, locale ID and collation tables. |
OPRB or NTOPRB macro | Sets the ACODE and/or WCODE option to define the user encoding if the used Adabas database is enabled for UES (universal encoding support). |
PRINT or CP keyword subparameter of NTPRINT macro | Defines the code page for a report. |
SRETAIN | Specifies that all existing sources have to be saved in their original encoding format. See also Customizing Your Environment. |
See also:
Natural in Batch Mode in the Operations documentation.
For valid code pages, see http://www.iana.org/assignments/character-sets.
The parameter CFICU and its subparameters are explained in detail in the Parameter Reference. Some of the subparameters have an impact on the performance.
If collation services are used to compare Unicode strings, both strings are checked to determine whether they are normalized. This check consumes a lot of CPU time. If you are sure that the strings are already normalized, you can switch off the check (COLNORM=OFF).
In Unicode, it is possible to represent the same character as one code point or as a combination of two or more code points. For example, the German character "ä" can be represented by "U+00E4" or by the combination of the code points "U+0061" and "U+0308".
The conversion from Unicode to, for example, IBM01140 converts each code point of a combined character separately and produces an "a" followed by a substitution character since code point "U+0308" is not represented in the target code page. With CNVNORM=ON, a normalization is performed right before the actual conversion. The normalization consumes additional CPU time and temporary storage. If you are sure that no combining characters are involved in MOVE statements (except MOVE NORMALIZED), you should set CNVNORM to OFF to increase performance. Note that all possible combinations are represented by a single coded Unicode code point.
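The effect of normalizing before conversion can be reproduced outside Natural. The following Python sketch is only an analogy under stated assumptions: Python's cp1140 codec stands in for IBM01140, NFC is assumed as the normalization form (this document does not name the form used by CNVNORM), and Python's replacement byte is not the ICU substitution character.

```python
import unicodedata

DECOMPOSED = "a\u0308"  # "a" followed by U+0308 (combining diaeresis)


def to_code_page(text, normalize):
    """Convert text to the cp1140 codec, optionally normalizing first."""
    if normalize:
        # Assumption: NFC composes "a" + U+0308 into the single code point U+00E4.
        text = unicodedata.normalize("NFC", text)
    # errors="replace" substitutes characters that the code page cannot represent.
    return text.encode("cp1140", errors="replace")


if __name__ == "__main__":
    print(len(to_code_page(DECOMPOSED, normalize=False)))  # two bytes: "a" plus a substitute
    print(len(to_code_page(DECOMPOSED, normalize=True)))   # one byte: the composed character
```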
Conversion from Unicode to code page and vice versa is comparatively slow. The reason is that the ICU implementation is written in C++ and that it covers nearly all Unicode, code page and language aspects in the world. However, some code pages can be mapped to Unicode (and vice versa) via translation tables to accelerate conversion. Accelerator tables are activated with the CPOPT subparameter. If it is set to ON, Natural automatically creates two accelerator tables during session initialization by using ICU conversion functions. The first table (with a size of 512 bytes) is used for conversion from code page to Unicode and the other table (with a size of 65535 bytes) is used for conversion from Unicode to code page. During a Natural session, all conversions are then executed via the accelerator tables instead of ICU calls. Accelerator tables are only provided for the default code page (*CODEPAGE). Temporary code pages (for example, in MOVE ENCODED statements) do not use accelerator tables if the module NATCPTAB is not linked. If it is linked, up to 30 accelerator tables based on the ICU database are used to speed up performance.
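The idea behind such accelerator tables can be sketched as two lookup arrays that are filled once by a general converter and then used for every conversion. The following Python sketch is an illustration only and does not show how Natural builds its tables. It assumes a single-byte code page, uses Python's cp1140 codec in place of ICU, and indexes all 65536 code points of the Basic Multilingual Plane, so the size of its second table differs by one byte from the figure quoted above. All names are hypothetical.

```python
import struct

SUBSTITUTE = 0x3F  # assumed substitution byte for unmappable characters


def build_accelerator_tables(codec="cp1140"):
    """Fill both lookup tables once, using the general (slow) converter."""
    to_unicode = bytearray(512)                      # 256 entries x 2 bytes
    from_unicode = bytearray([SUBSTITUTE]) * 65536   # 1 byte per BMP code point
    for byte in range(256):
        char = bytes([byte]).decode(codec, errors="replace")
        struct.pack_into(">H", to_unicode, 2 * byte, ord(char))
        if char != "\ufffd":  # skip byte values the codec does not define
            from_unicode[ord(char)] = byte
    return to_unicode, from_unicode


def cp_to_unicode(data, to_unicode):
    """Convert code page bytes to text with one table lookup per byte."""
    return "".join(chr(struct.unpack_from(">H", to_unicode, 2 * b)[0]) for b in data)


def unicode_to_cp(text, from_unicode):
    """Convert text to code page bytes with one table lookup per character."""
    return bytes(from_unicode[ord(c)] if ord(c) < 0x10000 else SUBSTITUTE for c in text)
```

After the tables are built, each conversion costs one array access per character, which is the reason such tables are faster than a call into a general-purpose converter.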
The parameters CFICU and CP can be used to adjust Natural to specific purposes:
Settings | Description |
---|---|
CFICU=OFF, CP=OFF | Compatibility mode. For running existing applications without Unicode and without code page support. Legacy translation tables are used for I/O translation. Compared with former versions, there is no significant increase in resource consumption (CPU time and buffer usage). This mode does not need the ICS module SAGICU (or an alternative ICS module on z/VSE and z/OS) to be linked to the Natural nucleus. |
CFICU=ON, CP=OFF | For new applications that are using Unicode and code page conversion (MOVE ENCODED) but not default code page support. Therefore, the system variable *CODEPAGE is empty. It is possible to use U format variables, but it is not possible to use, for example, MOVE A TO U, since this requires the default code page information. The error NAT3411 will be issued indicating that no default code page is available. |
CFICU=ON, CP=value * | For new applications that are using full Unicode as well as code page support. |
CFICU=OFF, CP=value * | This combination does not make sense, because code page support needs ICU services for conversion. Therefore, CFICU=ON is enforced in this case and a session initialization message is issued. |
* where value is any value other than OFF.
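The four combinations in the table reduce to one rule: a code page other than OFF forces CFICU=ON. The following Python sketch expresses that rule. The function name and the message text are hypothetical and do not reproduce the actual session initialization message.

```python
def effective_settings(cficu, cp):
    """Return (cficu, cp, message) after applying the rule from the table.

    Hypothetical helper: 'cficu' is "ON" or "OFF", 'cp' is "OFF" or a
    code page name.
    """
    message = None
    if cficu == "OFF" and cp != "OFF":
        # Code page support needs ICU services, so CFICU=ON is enforced.
        cficu = "ON"
        message = "CFICU=ON enforced because a default code page is defined"
    return cficu, cp, message


if __name__ == "__main__":
    print(effective_settings("OFF", "IBM01140"))
```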
The compiler option CPAGE creates objects that can be executed with a code page which is different from the code page used at creation time. This means that all alphanumeric constants of the object which are coded with the code page at creation time have to be converted to the code page which is active at execution time. To make it possible for the Natural object loader to find and convert alphanumeric constants, an additional table is created by the compiler. This increases the size of the generated object, depending on the number of used alphanumeric constants. The conversion at runtime consumes additional CPU time. If the default code page (value of the system variable *CODEPAGE) is the same as the code page at creation time or if the session has no default code page (CP=OFF), no conversion is done. Conversion errors are ignored, independent from the setting of the parameter CPCVERR. If the compiler option CPAGE is set to OFF, no conversion is performed at runtime and the alphanumeric constants are treated as they are.
The following sample program is cataloged with code page IBM01141 (German) and is executed with default code page IBM01140 (US). The characters "Ä", "Ö" and "Ü" are defined in both code pages, but at different code points.
Example 1 - CPAGE=OFF:
OPTIONS CPAGE=OFF
WRITE *CODEPAGE 'ÄÖÜ'
END
Output with code page IBM01140 (US):
Page 1
IBM01140 ¢\!
Example 2 - CPAGE=ON:
OPTIONS CPAGE=ON
WRITE *CODEPAGE 'ÄÖÜ'
END
Output with code page IBM01140 (US):
Page 1
IBM01140 ÄÖÜ
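The two outputs can be reproduced with plain byte conversions. The following Python sketch is an analogy, not Natural code. Python has no codec named after IBM01141, so the cp273 codec is used as a stand-in, on the assumption that both place "Ä", "Ö" and "Ü" at the same byte values. The cp1140 codec stands in for IBM01140.

```python
CREATION_CODEC = "cp273"    # assumed stand-in for IBM01141 (German)
EXECUTION_CODEC = "cp1140"  # stand-in for IBM01140 (US)

# Bytes of the alphanumeric constant as stored at creation time.
CONSTANT = "ÄÖÜ".encode(CREATION_CODEC)


def displayed_text(constant_bytes, cpage_on):
    """Return what the session shows for the stored constant."""
    if cpage_on:
        # With CPAGE=ON, the constant is converted from the creation
        # code page to the code page active at execution time.
        constant_bytes = constant_bytes.decode(CREATION_CODEC).encode(EXECUTION_CODEC)
    # The session interprets the bytes in the execution code page.
    return constant_bytes.decode(EXECUTION_CODEC)


if __name__ == "__main__":
    print(displayed_text(CONSTANT, cpage_on=False))
    print(displayed_text(CONSTANT, cpage_on=True))
```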
The most common standard for code page names is the IANA name. Therefore, the system variable *CODEPAGE contains the IANA name of the default code page. On z/VSE and z/OS, a code page is qualified by its Coded Character Set ID (CCSID). On BS2000, the Coded Character Set Name (CCSN) is most popular. Currently, Adabas uses the Entire Conversion Service definition (ADAECS). The macro NTCPAGE can be used to assign these different names to the unambiguous IANA name. NTCPAGE is part of the Natural configuration module (NATCONFG).
It does not matter whether the IANA name, the CCSID/CCSN or the alias name is entered with the CP parameter. The alias name can be a user-defined name which is used to assign a more meaningful name to the code page. In any case, *CODEPAGE contains the IANA name of the selected code page.
In addition, a placeholder character can be defined for a code page. It overrides the default substitution character of that code page, which is normally a non-displayable character (for example, H'3F' in an EBCDIC code page). The placeholder character can be used to prevent non-displayable characters from being sent to terminals.
Example:
NTCPAGE IANA=IBM01140,CCSID=1140,ECS=1140,ALIAS='US',PHC=003F
The values IBM01140, 1140 or US can be entered with the CP parameter to activate the code page. *CODEPAGE contains the name IBM01140. The substitution character of the code page will be replaced by "U+003F", which is a question mark (?).
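The name resolution described above amounts to a lookup from any of the accepted names to the IANA name. The following Python sketch models it with the values of the NTCPAGE example. The data structure and the function name are hypothetical and say nothing about how NATCONFG stores these entries.

```python
# One dictionary per NTCPAGE entry; values taken from the example above.
NTCPAGE_ENTRIES = [
    {"IANA": "IBM01140", "CCSID": "1140", "ECS": "1140", "ALIAS": "US", "PHC": 0x003F},
]


def resolve_code_page(value, entries=NTCPAGE_ENTRIES):
    """Map an IANA name, CCSID or alias name to the IANA name."""
    for entry in entries:
        if value in (entry["IANA"], entry["CCSID"], entry["ALIAS"]):
            return entry["IANA"]
    raise LookupError(f"no NTCPAGE entry matches {value!r}")


if __name__ == "__main__":
    for name in ("IBM01140", "1140", "US"):
        print(name, "->", resolve_code_page(name))
    print(chr(NTCPAGE_ENTRIES[0]["PHC"]))  # placeholder character
```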
The number of available code pages depends on the used ICU data library.
All code pages defined in the currently used data package can be used by Natural. An NTCPAGE entry is only necessary if an alternative alias name or placeholder character is desired.
The following configuration parameter is available with Natural Development Server (NDV):
Settings | Description |
---|---|
TERMINAL_EMULATION=WEBIO | Specifies that the Natural Web I/O Interface client (which supports Unicode) is used for input and output. |
The code page information of the object is part of the object directory displayed with the LIST system command. For details, see Displaying Directory Information in the System Commands documentation.
The encoding of code page data can be specified on different levels.
The default code page can be defined with the CP parameter.
A code page can be defined for Natural sources, batch input (CPOBJIN, CPSYNIN) and output files (CPPRINT).
If a code page is defined at object level, this overrides the default code page.