ADACMP (Compression Of Data)

This document describes the utility "ADACMP".


Functional Overview

The compression utility ADACMP compresses user raw data into a form which can be used by the mass update utility ADAMUP.

The input data for this utility must be contained in a sequential file. LOB field values can also be provided in separate files.

The logical structure and characteristics of the input data are described by a field definition table (FDT). The FDT statements specify the level number, field name, standard length and format, together with any definition options that are to be assigned to the field (descriptor, unique descriptor, multiple-value field, null value suppression, fixed storage, periodic group). See Administration, FDT Record Structure for more detailed information about the layout of the file in the database and characteristics of the input data.

Each field in the input record without the option SY (system generated) is compressed. Compression consists of removing trailing blanks from alphanumeric fields and leading zeros from numeric fields. Unpacked and packed fields are checked for correct data. Fields defined with the fixed storage option are not compressed. A user exit is provided to allow additional editing of each input record with a user-written routine.
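
The two compression rules named above can be illustrated with a minimal Python sketch. It only shows the idea of dropping trailing blanks and leading zeros on character strings; it does not reproduce the internal compressed record format of ADACMP. The function names are hypothetical, and keeping one digit for an all-zero value is an assumption made for this sketch.

```python
def strip_alpha(value: str) -> str:
    """Drop trailing blanks from an alphanumeric value."""
    return value.rstrip(" ")


def strip_unpacked(value: str) -> str:
    """Drop leading zeros from an unpacked (digit string) value.

    Assumption for this sketch: an all-zero value is reduced to a single "0".
    """
    stripped = value.lstrip("0")
    return stripped if stripped else "0"


print(repr(strip_alpha("SMITH     ")))   # 'SMITH'
print(repr(strip_unpacked("00001234")))  # '1234'
```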

System generated fields are either regenerated or decompressed, depending on the keyword specified for the ADACMP parameter SYFINPUT.

This utility creates three types of output files:

  • Compressed data.

  • Descriptor values.

  • Records with errors.

The sizes of the descriptor values of all descriptors are listed at the end of execution.

If the utility writes records to the error file, it will exit with a non-zero status.

This utility is a single-function utility.

Procedure Flow

[Figure: ADACMP procedure flow (graphics/adacmp.png)]

The sequential files CMPDTA, CMPDVT and CMPERR can have multiple extents. For detailed information about sequential files with multiple extents, see Administration, Using Utilities, Adabas Sequential Files, Multiple Extents. CMPLOB is a directory that contains files which may be stored as LOB values in the database.

Data Set                     Environment Variable/   Storage Medium             Additional Information
                             Logical Name
Associator                   ASSOx                   Disk
Compressed data              CMPDTA                  Disk, Tape (* see note)    output by ADACMP
Descriptor Value Table       CMPDVT                  Disk, Tape (* see note)    output by ADACMP
Rejected data                CMPERR                  Disk, Tape (* see note)    output by ADACMP
Input data FDT               CMPFDT                  Disk, Tape (* see note)    Utilities Manual
User input data              CMPIN                   Disk (* see note)          Utilities Manual
User LOB input data          CMPLOB                  Disk                       Utilities Manual
ADACMP control statements    stdin/SYS$INPUT                                    Utilities Manual
ADACMP messages              stdout/SYS$OUTPUT                                  Messages and Codes

Note:
(*) A named pipe can be used for this sequential file (see Administration, Using Utilities, Adabas Sequential Files, Using Named Pipes for details).

If the SINGLE_FILE option is set, the Descriptor Value Table (DVT) and the compressed user data are written together to the logical name CMPDTA.

Checkpoints

The utility writes no checkpoints.

Control Parameters

The following control parameters are available:

     DBID = number

D    [NO]DST

     FDT

     FIELDS {uncompressed_field_definition | FDT}...[END_OF_FIELDS | . ]

     FILE = number

D    [NO]LOBS

D    [NO]LOWER_CASE_FIELD_NAMES

D    MAX_DECOMPRESSED_SIZE = number [K|M]

D    MUPE_C_L = {1|2|4}

D    [NO]NULL_VALUE

D    NUMREC = number

D    RECORD_STRUCTURE = keyword

     SEPARATOR = character | \character 

D    [NO]SHORT_RECORDS

D    [NO]SINGLE_FILE

     SKIPREC = number

D    SOURCE_ARCHITECTURE = (keyword[,keyword][,keyword])

D    SYFINPUT = keyword

D    TZ {=|:} [timezone]

D    [NO]USEREXIT

D    [NO]USERISN

D    WCHARSET = char_set

DBID

DBID = number

This parameter selects the database that contains the file to be specified by the FILE parameter.

[NO]DST

[NO]DST

The parameter DST is required if a daylight saving time indicator is provided for date/time fields with the option TZ. The daylight saving time indicator must be appended after the date/time value as a 2-byte integer value (format F) that contains the number of seconds to be added to the standard time in order to get the actual time (usually 0 or 3600).

Without the parameter DST, it is not possible to define time values in the hour before the time is switched back to standard time.

The default is NODST.

Notes:

  1. The DST parameter is ignored if the FIELDS parameter is specified. In this case, you must specify a D element for fields with the daylight saving time indicator.
  2. The DST parameter is not compatible with the RECORD_STRUCTURE = NEWLINE_SEPARATOR parameter because the daylight saving indicator in format F contains non-printable characters.

Example:

A DT field has the following definition: 1,DT,8,P,DT=E(DATE_TIME),TZ

The following values must then be specified for this field:

  • The local date/time value corresponding to the edit mask DATE_TIME as an 8-byte packed value

  • The daylight saving time indicator, usually 0 for standard time and 3600 for summer time as a 2-byte fixed point value

Case 1 (DT has a date/time value with daylight saving time): 0x0200910250230000E10
Case 2 (DT has a date/time value with standard time): 0x0200910250230000000
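
As an illustration of the indicator alone, the following Python sketch encodes the number of seconds as a 2-byte fixed point value. The helper name is hypothetical, and the high-order-byte-first layout is an assumption chosen to match the trailing 0E10 in Case 1; the byte order actually required depends on the source architecture.

```python
import struct


def dst_indicator(seconds: int, byte_order: str = ">") -> bytes:
    """Encode the daylight saving time indicator as a 2-byte signed integer.

    byte_order uses struct notation: ">" is high-order byte first (assumed
    here), "<" is low-order byte first.
    """
    return struct.pack(byte_order + "h", seconds)


print(dst_indicator(3600).hex().upper())  # 0E10 (summer time)
print(dst_indicator(0).hex().upper())     # 0000 (standard time)
```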

FDT

FDT

If this parameter is specified as the first parameter, or as the second parameter after [NO]LOWER_CASE_FIELD_NAMES, ADACMP reads the FDT information contained in the sequential file CMPFDT and displays the FDT.

Note:
Alternatively, instead of FDT, you can specify DBID and FILE as the first parameters, or as the second parameters after [NO]LOWER_CASE_FIELD_NAMES (which is allowed before DBID and FILE). In this case, the FDT of the file is used as the base for the compression.

The FDT parameter can be specified several times, but if you have already determined the FDT to be used for the compression by specifying the FDT or DBID and FILE parameters, specifying the FDT parameter again will only display the FDT; the FDT is not overwritten by the CMPFDT file.

FIELDS

FIELDS {uncompressed_field_definition | FDT}...[END_OF_FIELDS | . ]

This parameter is used to specify a subset of fields given in the FDT and their format and length. This means that the input records do not have to contain all of the fields given in the FDT, or that fields can be provided with a different format or length. The syntax and semantics are the same as for the format buffer, with the exception that you can also specify an R-element (for LOB references) if the decompressed record contains the name of a file containing the LOB value instead of the LOB value itself. See Administration, Loading and Unloading Data, Uncompressed Data Format for more detailed information.

While entering the specification list, the FDT function can be used to display the FDT of the file to be decompressed. The specification list can be terminated or interrupted by entering END_OF_FIELDS or '.'. The '.' option is an implicit END_OF_FIELDS and is compatible with the format buffer syntax. FIELDS or END_OF_FIELDS must always be entered on a line by itself, whereas the '.' may be entered on a line by itself or at the end of the format buffer elements.

If the field definitions are terminated with the END_OF_FIELDS parameter, this parameter must be specified in upper case when the LOWER_CASE_FIELD_NAMES parameter is used. In addition, the FDT parameter must also be specified in upper case when the LOWER_CASE_FIELD_NAMES parameter is used.

FILE

FILE = number

This parameter specifies the file from which the FDT information is to be read. This parameter can only be specified after the DBID parameter.

[NO]LOWER_CASE_FIELD_NAMES

[NO]LOWER_CASE_FIELD_NAMES

If LOWER_CASE_FIELD_NAMES is specified, Adabas field names are not converted to upper case. If NOLOWER_CASE_FIELD_NAMES is specified, Adabas field names are converted to upper case. The default is NOLOWER_CASE_FIELD_NAMES.

If lower case field names in the FDT are not to be converted to upper case, the parameter must be specified as the first parameter before the FDT parameter; if lower case field names in the FIELDS parameter are not to be converted to upper case, the parameter must be specified before the FIELDS parameter.

Warning:
If the LOWER_CASE_FIELD_NAMES parameter is specified for the CMPFDT file, no upper case conversion is done for the complete file. Lower case characters for field formats and field options will cause FDT syntax errors. This problem also exists for lower case characters in the FIELDS parameter.

[NO]LOBS

[NO]LOBS

This parameter specifies whether LA and LB field values are to be stored in a LOB file after loading the compressed data into the database:

  • If the parameters DBID and file number have been specified, this parameter is ignored, and the field is handled as described below.

  • If the parameters DBID and file number have not been specified and LOBS is specified, field values for LA and LB fields are prepared for storage in a LOB file, unless the field is defined as a descriptor.

  • If the parameters DBID and file number have not been specified and NOLOBS is specified, field values for LA and LB fields are prepared for storage in the base file. In this case, the length of field values for LA and LB fields must not exceed 16381 bytes and the compressed record must fit into a 32 KB DATA block.

Please note that LA and LB fields which are descriptors or parent fields of a derived descriptor, e.g. a super descriptor, are always handled as described for the NOLOBS parameter.

Default behaviour is as follows:

  • If the parameters DBID and file number have been specified and the file is a base file with corresponding LOB file, LOBS is default.

  • If the parameters DBID and file number have been specified and the file is not a base file with corresponding LOB file, NOLOBS is default.

  • If the parameters DBID and file number have not been specified, LOBS is default.

MAX_DECOMPRESSED_SIZE

MAX_DECOMPRESSED_SIZE = number  [K|M]

This parameter specifies the maximum size of a decompressed record in bytes, kilobytes or megabytes, depending on the specification of "K" or "M" after the number. This parameter is intended to recognize invalid CMPIN files as early as possible.

The default is 65536. This is also the minimum value.

Notes:

  1. This parameter does not include the size of LOB values stored in separate files.
  2. The exact definition of this parameter is the size of the I/O buffer required for the largest decompressed record. Only multiples of 256 bytes are used for the I/O buffers, which means that you must specify a value greater than or equal to the largest decompressed record (including the preceding length field) rounded up to the next multiple of 256.
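
The rounding rule in note 2 can be worked through with a short Python sketch. The function name is hypothetical, and the 2-byte default for the preceding length field is an assumption (it corresponds to a two-byte length prefix); the 65536 floor is the minimum value stated above.

```python
def min_max_decompressed_size(record_len: int, length_field_len: int = 2) -> int:
    """Smallest MAX_DECOMPRESSED_SIZE value that fits a record of record_len bytes.

    The record plus its preceding length field is rounded up to the next
    multiple of 256, and the result is never below the minimum of 65536.
    """
    needed = record_len + length_field_len
    rounded = -(-needed // 256) * 256  # ceiling division, then scale back up
    return max(rounded, 65536)


print(min_max_decompressed_size(70000))  # 70144
print(min_max_decompressed_size(100))    # 65536
```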

MUPE_C_L

MUPE_C_L = {1|2|4}

If the uncompressed data contain multiple-value fields or periodic groups, they are preceded by a binary count field with the length of MUPE_C_L bytes.

The default is 1.
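
A minimal Python sketch of such a binary count field follows. The helper name is hypothetical, and the default byte order is an assumption; the byte order expected in practice follows the source architecture of the input data.

```python
def count_field(occurrences: int, mupe_c_l: int = 1, byteorder: str = "big") -> bytes:
    """Encode an occurrence count as a binary field of MUPE_C_L bytes."""
    if mupe_c_l not in (1, 2, 4):
        raise ValueError("MUPE_C_L must be 1, 2 or 4")
    # to_bytes raises OverflowError if the count does not fit into the field
    return occurrences.to_bytes(mupe_c_l, byteorder)


print(count_field(3).hex())                 # 03
print(count_field(3, 2).hex())              # 0003
print(count_field(300, 2, "little").hex())  # 2c01
```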

[NO]NULL_VALUE

[NO]NULL_VALUE

The parameter NULL_VALUE is required if you are compressing data according to the standard FDT and the status values of the NC option fields are given in the input data. Normally, such input data is generated by ADADCU with the NULL_VALUE option set.

The default is NONULL_VALUE.

Example

The definition in the FDT for the field AA is: 1, AA, 2, A, NC

Case 1 (AA has a non-NULL value): input record (hexadecimal) = 00004142

Case 2 (AA has a NULL value): input record (hexadecimal) = FFFF2020
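
The two input records above can be reproduced with a short Python sketch. It assumes, based only on these two cases, that the status value is a 2-byte integer placed before the field value, with 0 for a non-NULL value and -1 (hexadecimal FFFF) for NULL, and that a NULL alphanumeric value is filled with blanks. The function name is hypothetical.

```python
import struct


def nc_field(value, length: int = 2) -> bytes:
    """Build the status value plus field value for an alphanumeric NC field.

    value is a str for a non-NULL value, or None for NULL.
    """
    if value is None:
        return struct.pack(">h", -1) + b" " * length
    return struct.pack(">h", 0) + value.encode("ascii").ljust(length)


print(nc_field("AB").hex().upper())  # 00004142
print(nc_field(None).hex().upper())  # FFFF2020
```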

NUMREC

NUMREC = number

This parameter specifies the number of input records to be processed. If this parameter is omitted, all input records contained on the input file are processed.

Use of this parameter is recommended for the initial execution of ADACMP if the input data file contains a large number of records. This avoids unnecessary processing of all records in cases where a data definition error or invalid input data results in a large number of rejected records.

This parameter is also useful for creating small files for test purposes.

RECORD_STRUCTURE

RECORD_STRUCTURE = keyword

This parameter specifies the type of record separation used in the input file with the environment variable CMPIN. The following keywords can be used:

Keyword Meaning
ELENGTH_PREFIX The records in the CMPIN file are separated by a 2-byte exclusive length field.
E4LENGTH_PREFIX The records in the decompressed data file are separated by a 4-byte exclusive length field.
ILENGTH_PREFIX The records in the CMPIN file are separated by a 2-byte inclusive length field.
I4LENGTH_PREFIX The records in the decompressed data file are separated by a 4-byte inclusive length field.
NEWLINE_SEPARATOR The records in the CMPIN file are separated by a new-line character. This keyword may only be specified if the field values do not contain characters interpreted as new-line (i.e. if there are only unpacked, alphanumeric and Unicode fields, and the alphanumeric and Unicode fields contain only printable characters). This keyword and the USERISN parameter are mutually exclusive.
RDW The records in the CMPIN file contain data that has been transferred from an IBM host using the FTP site rdw option. ADACMP is able to process such data without having to use cvt_fmt first (in previous versions, the unsupported tool cvt_fmt was used for such format conversions). For example:
% ftp IBM-host
ftp> binary
200 Representation type is Image
ftp> site rdw
200 Site command was accepted
ftp> get decomp
% setenv CMPIN decomp
% adacmp fdt record_structure=rdw source=(ebcdic,high)
RDW_HEADER Like RDW, for data decompressed on a mainframe with HEADER=YES.
HEADER For data decompressed on a mainframe with HEADER=YES, if the decompressed data do not contain any additional information about block or record length.
VARIABLE_BLOCKED The variable blocked format from BS2000 or IBM.

The default is ELENGTH_PREFIX.
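
The difference between the four length-prefix keywords can be sketched in Python as follows. The sketch assumes that "exclusive" means the length counts only the record data and that "inclusive" means it also counts the length field itself. The function name and the default byte order are assumptions, not part of the utility.

```python
def frame(record: bytes, inclusive: bool = False, prefix_len: int = 2,
          byteorder: str = "big") -> bytes:
    """Prefix a record with a 2-byte or 4-byte length field.

    inclusive=False, prefix_len=2  corresponds to ELENGTH_PREFIX
    inclusive=False, prefix_len=4  corresponds to E4LENGTH_PREFIX
    inclusive=True,  prefix_len=2  corresponds to ILENGTH_PREFIX
    inclusive=True,  prefix_len=4  corresponds to I4LENGTH_PREFIX
    """
    if prefix_len not in (2, 4):
        raise ValueError("prefix_len must be 2 or 4")
    length = len(record) + (prefix_len if inclusive else 0)
    return length.to_bytes(prefix_len, byteorder) + record


print(frame(b"ABCD").hex())                  # 000441424344
print(frame(b"ABCD", inclusive=True).hex())  # 000641424344
```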

SEPARATOR

SEPARATOR = character | \character

If you specify this option, ADACMP expects the fields in the raw data record to be separated by the character specified. You can omit the apostrophes around the character specification if the character has no special meaning for the Adabas utilities. The same fields in different records are then permitted to be of different lengths.

If a format buffer is specified using the FIELDS parameter, the order of the specified field names must correspond with the order in which the fields are specified in the FDT. A mismatch results if this is not the case.

If the FDT contains multiple value fields or periodic groups, a format buffer must be specified with the FIELDS parameter. Members of periodic groups must be ordered by 1) periodic group index and 2) field sequence in the FDT (see example 2 below).

Because no binary data is expected in the input file using the SEPARATOR option, the RECORD_STRUCTURE parameter will be set to NEWLINE_SEPARATOR.

Example 1

FDT:      1, AA, 2, U
          1, AB, 8, U
          1, AC, 2, A

CMPIN:    12;12345678;AA
          1234;5;BB

adacmp
fdt
separator=\;

or for UNIX

adacmp fdt  separator=\\\;

or

adacmp fdt  separator='\;'

In this example, 2 records are compressed with the default FDT, the separator character is the semicolon, and the default record structure is NEWLINE_SEPARATOR. Note that the semicolon must be preceded by a backslash, otherwise it would be treated as the start of a comment. If you enter the parameters under UNIX directly from the command line, it is necessary to precede the backslash and the semicolon by additional backslashes or to put them in quotes or double quotes since they are special characters.

Example 2

FDT:      1, XX, PE
          2, AA, 8, A
          2, AB, 8, U
          1, YY, 2, A

Correct:  CMPIN:    aaaa,1,bbbb,2,yy

          Command:  adacmp fdt separator=, fields AA1,AB1,AA2,AB2,YY.
          First, the field values for the periodic group index 1 are
          specified, and then the field values for periodic group index 2.

Invalid:  CMPIN:   aaaa,bbbb,1,2,yy
          Command: adacmp fdt separator=, fields AA1-2,AB1-2,YY.
          The fields specification is invalid because the 2nd value of
          AA is specified before the 1st value of AB; you will get
          the error SEPINV.

In this example, 1 record with fields given in the format buffer is compressed; the separator character is the comma.

Example 3

FDT:      1, AA, 8, A
          1, MA, 1, A, MU

CMPIN:    aaaa%2%A%B
          bbbb%3%C%D%E

adacmp  dbid=9  file=15  separator=%, fields "AA,MAC,1,U,MA1-N"

In this example, 2 records with fields given in the format buffer are compressed; the occurrence count of the multiple-value field MA is different in different records. The separator character is the percent character.
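
To make the record layout of Example 3 concrete, the following Python sketch splits one CMPIN line into the AA value, the occurrence count, and the MA values. It only illustrates how the input lines are laid out; it is not how ADACMP parses them, and the function name is hypothetical.

```python
def parse_example3(line: str, sep: str = "%"):
    """Split a line laid out as AA, count of MA, MA1..MAn (see Example 3)."""
    parts = line.split(sep)
    aa, count = parts[0], int(parts[1])
    ma = parts[2:2 + count]
    if len(ma) != count or len(parts) != 2 + count:
        raise ValueError("occurrence count does not match the number of MA values")
    return aa, ma


print(parse_example3("aaaa%2%A%B"))    # ('aaaa', ['A', 'B'])
print(parse_example3("bbbb%3%C%D%E"))  # ('bbbb', ['C', 'D', 'E'])
```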

[NO]SHORT_RECORDS

[NO]SHORT_RECORDS

If SHORT_RECORDS is specified, it is possible to omit fields at the end of the decompressed record that contain null values.

The default is NOSHORT_RECORDS.

You can only omit complete fields; it is not possible to truncate the last value:

Example

Assuming you have specified the parameters for a file containing alphanumeric fields AA and AB:

FIELDS
AA,20,AB,20
END_OF_FIELDS
SHORT_RECORDS

Then the following record is allowed:

"Field AA          "

The following record is not allowed:

"Field AA"

[NO]SINGLE_FILE

[NO]SINGLE_FILE

If the SINGLE_FILE option is set, ADACMP writes the Descriptor Value Table (DVT) and the compressed user data to a single file (CMPDTA) instead of writing them to separate files.

The default is NOSINGLE_FILE.

SKIPREC

SKIPREC = number

This parameter specifies the number of records to be skipped before compression is started.

SOURCE_ARCHITECTURE

SOURCE_ARCHITECTURE = ( keyword [,keyword [,keyword] ] )

This parameter specifies the format (character set, floating-point format and byte order) of the input data records. The following keywords can be used:

Keyword Group           Valid Keywords
Character set           ASCII, EBCDIC
Floating-point format   IBM_370_FLOATING, IEEE_FLOATING, VAX_FLOATING
Byte order              HIGH_ORDER_BYTE_FIRST, LOW_ORDER_BYTE_FIRST

If no keyword of a keyword group is specified, the default for this keyword group is the keyword that corresponds to the architecture of the machine on which ADACMP is running.

Note:
The FDT is always input in ASCII format.

Example

If the input records that are to be compressed are in IBM format, the user must specify the following:

SOURCE_ARCHITECTURE = (EBCDIC, IBM_370_FLOATING, HIGH_ORDER_BYTE_FIRST)
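
To see what these keywords mean for the raw bytes, the following Python sketch decodes an EBCDIC character value and a high-order-byte-first integer. It is only an inspection aid for input files, with hypothetical function names. The code page cp037 is an assumption; it is one of several EBCDIC code pages, and the one used by a given host may differ.

```python
def decode_ibm_alpha(raw: bytes) -> str:
    """Decode EBCDIC bytes, assuming code page 037."""
    return raw.decode("cp037")


def decode_ibm_int(raw: bytes) -> int:
    """Decode a signed binary integer stored high-order byte first."""
    return int.from_bytes(raw, "big", signed=True)


print(decode_ibm_alpha(b"\xc1\xc2"))  # AB
print(decode_ibm_int(b"\x0e\x10"))    # 3600
```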

SYFINPUT

SYFINPUT = keyword

This parameter specifies the input used for the compression of system generated fields. The following keywords can be used:

Keyword Meaning
SYSTEM The system generated field values are regenerated by the system in ADACMP.
USER The system generated field values are taken from the decompressed file.

The default is SYFINPUT = USER.

TZ

TZ  {=|:} [timezone]

The specified time zone must be a valid time zone name that is contained in the time zone database known as the Olson database (http://www.twinsun.com/tz/tz-link.htm). If a time zone has been specified, this time zone is used for time zone conversions of date/time fields with the option TZ.

The default is UTC, which is used internally to store date/time fields with option TZ; no conversion is required.

If you specify an empty value, no checks are made to ensure that date/time fields are correct.

Note:
The time zone names are file names. Depending on the platform, file names, and therefore the time zone names, may or may not be case sensitive.

Examples:

tz:Europe/Berlin

This is correct on all platforms.

TZ=Europe/Berlin

With this specification, TZ is converted to upper case EUROPE/BERLIN. This is correct on Windows, because file names are not case sensitive on Windows, but it is not correct on UNIX, because UNIX file names are case sensitive.

[NO]USEREXIT

[NO]USEREXIT

This option specifies whether a user exit is to be taken or not. If USEREXIT is specified, the environment variable ADAUEX_6/logical name ADABAS$USEREXIT_6 must point to a loadable user-written routine.

See Administration, User Exits and Hyperexits for more details.

The default is NOUSEREXIT.

[NO]USERISN

[NO]USERISN

If this option is set to USERISN, the ISN for each record in the input file will be assigned by the user.

If USERISN is specified, the user must give the ISN to be assigned to each record as a four-byte binary number immediately preceding each data record.

ISNs may be assigned in any order and must be unique (for the file). The ISN must not exceed the maximum number of records (MAXISN) specified for the file (see the file definition utility ADAFDU for more detailed information).

ADACMP does not check for unique ISNs or for ISNs which exceed MAXISN. These checks are performed by the mass update utility ADAMUP (if an error is detected, the ADAMUP run terminates with an error message).

If this option is set to NOUSERISN, the ISN is assigned by Adabas.

The default is NOUSERISN.
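
A minimal Python sketch of placing a user-supplied ISN in front of a data record follows. The function name and the default byte order are assumptions; the sketch covers only the ISN and the record data, not any record length prefix required by RECORD_STRUCTURE.

```python
def with_user_isn(isn: int, record: bytes, byteorder: str = "big") -> bytes:
    """Put a four-byte binary ISN immediately in front of a data record."""
    if not 0 < isn <= 0xFFFFFFFF:
        raise ValueError("ISN must be positive and fit into four bytes")
    return isn.to_bytes(4, byteorder) + record


print(with_user_isn(1, b"AB").hex())  # 000000014142
```

Because ADACMP does not check for duplicate ISNs, a script that generates the input can verify uniqueness itself, for example by collecting the ISNs in a set before writing the file.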

WCHARSET

WCHARSET = char_set

This parameter specifies the default encoding used in the decompressed file based on the encoding names listed at http://www.iana.org/assignments/character-sets - most of the character sets listed there are supported by ICU, which is used by Adabas for internationalization support.

The default is UTF-8.

Output

The ADACMP utility outputs three files:

  1. Compressed data

  2. Descriptor values

  3. Records with errors

Compressed Data Records

The data records which ADACMP has processed, modified and compressed are output together with the FDT information to a sequential file. This file is used as input for the mass update utility ADAMUP.

If the output file contains no records (no records on the input file or all records rejected), the output may still be used as input for the mass update utility ADAMUP.

Descriptor-Value Table File

This file contains the descriptor value tables (DVT).

Compressed data records and descriptor value tables are written to one file if the SINGLE_FILE option is specified.

Rejected Data Records

Any records rejected by ADACMP are written to the ADACMP error file. The contents of this error file should be displayed using the ADAERR utility. Do not print the error file using the standard operating system print utilities since the records contain unprintable characters.

See the ADAERR utility for further information.

Report

The ADACMP report begins with a display of the field definition entered if CMPFDT is used for input. Any statement which contains a syntax error will be printed with a message immediately following the statement.

Following the display of the data-definition statements, a descriptor summary, the number of input records processed, the number of input records rejected, and the number of input records compressed are printed.

Restart Considerations

ADACMP does not have a restart capability. An interrupted ADACMP run must be restarted from the beginning.

ADACMP does not change the database; therefore, no considerations need to be made concerning database status before restarting ADACMP.