Bidirectional Language Support

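This document covers the following topics:

  • General Information

  • Screen Direction

  • Field Direction

  • Maps and Dialogs

  • Print Methods

  • Terminal Capabilities

  • Arabic Shaping

  • Arabic Tail Fragment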

General Information

Some languages, for example Arabic and Hebrew, are written from right to left (RTL), whereas most languages, for example English and German, are written from left to right (LTR). Text that contains both left-to-right and right-to-left characters is called bidirectional text.

Natural provides basic support for bidirectional languages. On Windows, this support is activated when both the Natural default code page and the Windows system code page are defined as bidirectional code pages. If no specific code page is defined for Natural, it is sufficient that a bidirectional Windows system code page has been defined. On Linux, the support for bidirectional languages is activated when the Natural default code page is a bidirectional code page.

The output of Natural programs can be controlled using the profile parameter PM, the terminal command %V, and the session parameter PM.

On Linux, the profile parameter DO (Display Order) additionally supports applications that were originally written for terminals which support inverse (right-to-left) print mode but not bidirectional data. These applications create the display order of bidirectional data in the application code. With the DO parameter, such applications can also run on I/O devices that support bidirectional data. This is the case, for instance, when an application runs in a browser with the Natural Web I/O Interface.

Screen Direction

The profile parameter PM defines the default screen direction. When PM is set to R (reset), the default screen direction is left-to-right. When PM is set to I (inverse), the default screen direction is right-to-left. If the screen direction is right-to-left, Natural automatically inverts all non-alphanumeric fields and system variables so that they are displayed correctly from right to left. PF key lines (Linux) are not inverted; they are always shown from left to right.

The terminal command %V can be used to change the screen direction. If the screen direction is right-to-left, the layout of the current window is mirrored, which means that the origin of all window components or fields is the upper right corner. The screen direction is changed to right-to-left using %VON and is reverted to left-to-right using %VOFF.

Field Direction

The session parameter PM reverses the direction of a field. The effect of "reversing the direction of a field" depends on the statement in which the PM parameter is used and the platform. If the PM parameter is used in a MOVE statement, the content of the field is simply reversed (that is, the first character will become the last character, and so on); the result does not depend on the characters of the field. Trailing blanks are removed before the field is reversed.

For example, the following program

DEFINE DATA LOCAL
1  TEST1  (A10)
1  TEST2  (A10)
END-DEFINE
TEST1 := 'program'

MOVE TEST1 (PM=I) TO TEST2
INPUT TEST1 (AD=O) TEST2 (AD=O)

END

produces the following output:

TEST1 program    TEST2 margorp

where "margorp" is the reversed version of "program".
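The reversal rule for MOVE can be modeled outside Natural. The following Python sketch only illustrates the behavior described above and is not Natural's implementation; the function name move_pm_inverse and the blank-padding of the result to the target field length are assumptions made for this illustration.

```python
def move_pm_inverse(source: str, target_length: int) -> str:
    """Model MOVE source (PM=I) TO target for an alphanumeric target field."""
    # Trailing blanks are removed before the content is reversed.
    trimmed = source.rstrip(" ")
    # The reversal is purely positional; it does not look at the characters.
    reversed_text = trimmed[::-1]
    # Assumption: the result is left-aligned and blank-padded to the field length.
    return reversed_text.ljust(target_length)


test1 = "program".ljust(10)          # TEST1 (A10): 'program' plus three blanks
test2 = move_pm_inverse(test1, 10)   # TEST2 (A10)
print(repr(test2))                   # 'margorp   '
```

Because the trailing blanks are stripped first, the reversed text starts in the first position of the target field instead of being preceded by blanks.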

When the PM parameter is used in I/O statements such as INPUT or DISPLAY, its effect is more complex. In this case, the field direction depends on the screen direction:

  • If the screen direction is left-to-right and PM=I is applied to a field, the field direction changes to right-to-left.

  • If the screen direction is right-to-left and PM=I is applied to a field, the field direction changes to left-to-right.
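The two rules above amount to flipping the field direction relative to the screen direction. The following Python sketch expresses this; the function name and the 'LTR'/'RTL' strings are hypothetical, and the behavior without PM=I (the field follows the screen direction) is an assumption drawn from the example screens in this section.

```python
def field_direction(screen_direction: str, pm_inverse: bool) -> str:
    """Return 'LTR' or 'RTL' for a field used in an I/O statement."""
    if screen_direction not in ("LTR", "RTL"):
        raise ValueError("screen_direction must be 'LTR' or 'RTL'")
    if not pm_inverse:
        # Assumption: without PM=I the field follows the screen direction.
        return screen_direction
    # PM=I flips the field relative to the current screen direction.
    return "RTL" if screen_direction == "LTR" else "LTR"


for screen in ("LTR", "RTL"):
    print(screen, "with PM=I ->", field_direction(screen, pm_inverse=True))
```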

On Windows and browser terminals (Natural Web I/O Interface), "reversing the field direction" does not mean that the characters of the field are simply reversed. Instead, the more complex bidirectional algorithm is applied (for more information, see the Microsoft Windows documentation). On character-oriented terminals, however, the characters of a field are not reordered; they are simply reversed.

In the following example, the characters assigned to the variable TEST have been entered in the following sequence:

[Figure: Test variable]

The following is an example program for Windows. The characters of the constant are already reordered when they are entered in the program editor.

DEFINE DATA LOCAL
1  TEST  (A20)
END-DEFINE
TEST := 'abc 123 Symbols'

SET CONTROL 'voff'

INPUT TEST (AD=O)  /
      TEST (AD=O PM=I) 

SET CONTROL 'von'

INPUT TEST (AD=O)  /
      TEST (AD=O PM=I) 
END

This program produces the following two screens on Windows:

TEST abc 123 Symbols 
TEST          123 [Symbols, reversed]  abc

and

                                           123 Symbols  abc TEST
                                  abc 123 Symbols           TEST

The following is an example program for Linux. If the characters are entered in the sequence described above, the program is displayed as follows, because the characters are simply displayed in keying sequence.

DEFINE DATA LOCAL
1  TEST  (A20)
END-DEFINE
TEST := 'abc Symbols 123'
 
SET CONTROL 'voff'
 
INPUT TEST (AD=O)  /
      TEST (AD=O PM=I) 
 
SET CONTROL 'von'
 
INPUT TEST (AD=O)  /
      TEST (AD=O PM=I) 
END

On Linux, this program produces the following two screens:

TEST abc Symbols 123 
TEST          321 Symbols  cba

and

                                           321 Symbols  cba TSET
                                  abc Symbols 123           TSET

Maps and Dialogs

On Windows and Linux, the map editor simplifies the handling of maps with bidirectional fields by offering the Reverse Map command. This command changes the display direction of the current map. The position of the fields is not changed; only the view is changed. On Windows, this command applies only to the current map. On Linux, a flag is set so that all subsequent maps are displayed reversed; the next Reverse Map command restores the original state.

On Windows, the output of dialogs can be controlled in a similar way: both the dialog itself and most of the dialog controls offer an RTL attribute. If the RTL attribute of the dialog is checked, the screen direction of the dialog is right-to-left. If the RTL attribute of other controls is checked, the direction of these controls is right-to-left.

The profile parameter PM defines the default setting of the RTL attribute for new dialogs. When PM is set to R (reset), the RTL attribute is unchecked by default. When PM is set to I (inverse), the RTL attribute is checked by default. The default setting of the RTL attribute for newly created controls of a dialog is derived from the RTL attribute setting of the dialog.

If the RTL attribute of a dialog is changed when the dialog already contains controls, a dialog appears asking whether the RTL attributes of the controls should also be changed.

Print Methods

When working with bidirectional languages on Windows, "GUI" is the preferred print method. With the print method "GUI", the printed page shows the same layout as the window displayed on the screen. The order of the field characters is identical.

If the print method "TTY" is used, the printed layout will most probably differ from the layout of the screen window because the field characters are printed in logical sequence. For fields with right-to-left direction, all characters are simply reversed (that is, the first character will become the last character, and so on).

Terminal Capabilities

On Linux, some special terminal capabilities for bidirectional support can be defined with the Natural Termcap utility.

The key which is defined by the RTLF capability can be used to toggle the input direction of a field at runtime.

With the RTLM and LTRM capabilities, it is possible to switch automatically between right-to-left and left-to-right input mode, provided that the terminal emulation supports this functionality. The RTLM escape sequence is inserted in front of right-to-left fields, and the LTRM escape sequence is inserted in front of left-to-right fields.
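The placement of the two escape sequences can be pictured with a short Python sketch. It is an illustration only: the placeholder values for RTLM and LTRM and the function emit_fields are hypothetical, since the real sequences come from the Termcap entry of the terminal in use.

```python
# Hypothetical placeholders; the real escape sequences are terminal-specific
# and are defined with the Natural Termcap utility.
RTLM = "<RTLM>"
LTRM = "<LTRM>"


def emit_fields(fields):
    """Prefix each (text, direction) field with the matching mode switch."""
    out = []
    for text, direction in fields:
        if direction not in ("LTR", "RTL"):
            raise ValueError("direction must be 'LTR' or 'RTL'")
        # Right-to-left fields get RTLM in front, left-to-right fields get LTRM.
        out.append((RTLM if direction == "RTL" else LTRM) + text)
    return "".join(out)


print(emit_fields([("abc", "LTR"), ("xyz", "RTL")]))  # <LTRM>abc<RTLM>xyz
```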

Arabic Shaping

In Arabic text, all characters of a string are normally connected with each other. For this reason, Arabic characters have up to four presentation forms: the isolated, the final, the initial and the medial form. The form that is used depends on the position of the character in the string. For example, the Arabic character "MEEM" has the following forms in Unicode:

U+0645  ARABIC LETTER MEEM
U+FEE1  ARABIC LETTER MEEM ISOLATED FORM
U+FEE2  ARABIC LETTER MEEM FINAL FORM
U+FEE3  ARABIC LETTER MEEM INITIAL FORM
U+FEE4  ARABIC LETTER MEEM MEDIAL FORM

Moreover, some characters are combined into a new form if they appear consecutively in a string. This is called a "ligature". For example, the characters

U+0644  ARABIC LETTER LAM
U+0627  ARABIC LETTER ALEF

have the following combined form:

U+FEFB  ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

Unicode strings should include only the Arabic characters in the Arabic block (U+0600 through U+06FF) or the Arabic Supplement block (U+0750 through U+077F); it is not recommended to use the presentation forms in regular Arabic text. It is up to the user interface to display the correct shapes of the characters.
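The relationship between the presentation forms and the base characters listed above can be checked with Unicode normalization. The following Python sketch uses the standard unicodedata module; it is an illustration only and makes no claim about how Natural converts strings internally. The helper names to_base_characters and code_points are hypothetical.

```python
import unicodedata


def to_base_characters(text: str) -> str:
    """Fold Arabic presentation forms into their base characters."""
    # NFKC applies the Unicode compatibility mappings: a presentation form
    # such as U+FEE2 maps to U+0645, and the ligature U+FEFB maps to the
    # two base letters U+0644 and U+0627.
    return unicodedata.normalize("NFKC", text)


def code_points(text: str) -> list[str]:
    """Format every character of text as a U+XXXX code point."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(to_base_characters("\uFEE2")))  # ['U+0645']
print(code_points(to_base_characters("\uFEFB")))  # ['U+0644', 'U+0627']
```

Note that NFKC also folds other compatibility characters (for example full-width Latin letters), so it is broader than a dedicated Arabic conversion.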

"Shaped" means that every Arabic base character is converted to the appropriate Arabic presentation form. The string may contain each of the four presentation forms of a character. For example, if U+0645 (ARABIC LETTER MEEM) is used as the last character of a string, it is converted to U+FEE2 (ARABIC LETTER MEEM FINAL FORM).

"Unshaped" means that each character is represented only by its basic form. For example, instead of U+FEE2 (ARABIC LETTER MEEM FINAL FORM), U+0645 (ARABIC LETTER MEEM) is used. The conversion to the correct presentation form is performed by the rendering engine of the output device.

Natural strings are internally represented as unshaped alpha or Unicode strings. If strings are displayed in a browser using the Natural Web I/O Interface client or the PROCESS PAGE statement, no transformation is required because the rendering engine of the browser takes care of the correct presentation. Incoming strings from such devices are already unshaped and can be passed directly to Natural.

If a string is displayed on a terminal such as a 3279 or in a terminal emulator such as IBM Personal Communications, it must be converted into the shaped form because the terminal itself does not take care of the correct presentation. Accordingly, incoming strings are in the shaped form and must be transformed into the unshaped form so that Natural can process them correctly.

The most popular code page for Arabic terminals on the mainframe is IBM420. Compared to Unicode, it contains fewer characters, and not every form of a character is included. The conversion of strings into IBM420 replaces unavailable forms of a character with a similar presentation form. For example, the medial form of the Arabic letter MEEM (U+FEE4) is replaced with the initial form (U+FEE3) of the character.

In the Arabic EBCDIC code page IBM420, the Arabic character "MEEM" is represented by the following presentation forms:

H'BA'  ARABIC LETTER MEEM
H'BB'  ARABIC LETTER MEEM INITIAL FORM
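The substitution of unavailable forms can be sketched as a lookup with a fallback table. The Python example below covers only the MEEM values given in this section (H'BA', H'BB' and the replacement of the medial form with the initial form); the table layout and the function to_ibm420 are hypothetical and do not represent a complete IBM420 conversion.

```python
# Byte values for MEEM as listed in this section.
IBM420_MEEM = {
    0x0645: 0xBA,  # ARABIC LETTER MEEM
    0xFEE3: 0xBB,  # ARABIC LETTER MEEM INITIAL FORM
}

# Forms that the code page lacks, mapped to a similar available form.
SUBSTITUTE = {
    0xFEE4: 0xFEE3,  # medial form -> initial form
}


def to_ibm420(code_point: int) -> int:
    """Return the IBM420 byte for a MEEM code point, substituting if needed."""
    code_point = SUBSTITUTE.get(code_point, code_point)
    if code_point not in IBM420_MEEM:
        # Everything outside the documented MEEM values is out of scope here.
        raise ValueError(f"U+{code_point:04X} is not covered by this sketch")
    return IBM420_MEEM[code_point]


print(f"H'{to_ibm420(0xFEE4):02X}'")  # H'BB'
```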

Arabic Tail Fragment

The Arabic characters SEEN (U+0633), SHEEN (U+0634), SAD (U+0635) and DAD (U+0636), known as the Seen Family, are displayed on terminals as two bytes if they appear in the final form. Code page IBM420 contains a so-called "Arabic tail fragment" that completes the final form of a Seen Family character on terminals or terminal emulators. The Arabic tail fragment therefore needs an additional position on the screen. Browsers do not require the Arabic tail fragment.

If a string with the final form of a Seen Family character is entered in a browser (Natural Web I/O Interface client or PROCESS PAGE statement) and subsequently displayed on a terminal, the Arabic tail fragment is appended to the string, which increases the length of the string. If a string with the final form of a Seen Family character is entered via a terminal or terminal emulator and subsequently displayed in a browser, the Arabic tail fragment is removed from the string.
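The effect on the string length can be illustrated with a small Python model. It uses the Unicode character U+FE73 (ARABIC TAIL FRAGMENT) as a stand-in for the IBM420 tail fragment and the Unicode final forms of the Seen Family; the functions add_tail and remove_tail are hypothetical and only sketch the behavior described above, not Natural's conversion.

```python
import unicodedata

TAIL = "\uFE73"  # ARABIC TAIL FRAGMENT (stand-in for the IBM420 tail fragment)

# Final presentation forms of SEEN, SHEEN, SAD and DAD.
SEEN_FAMILY_FINAL = {"\uFEB2", "\uFEB6", "\uFEBA", "\uFEBE"}


def add_tail(shaped: str) -> str:
    """Terminal direction: follow each Seen Family final form with a tail."""
    out = []
    for ch in shaped:
        out.append(ch)
        if ch in SEEN_FAMILY_FINAL:
            out.append(TAIL)  # needs one additional screen position
    return "".join(out)


def remove_tail(text: str) -> str:
    """Browser direction: drop the tail fragments again."""
    return text.replace(TAIL, "")


word = "\uFEB2"  # SEEN in final form
print(unicodedata.name(TAIL))          # ARABIC TAIL FRAGMENT
print(len(word), len(add_tail(word)))  # 1 2
```

A field that is just long enough for the string as entered in a browser may therefore be too short once the tail fragment is added for terminal output.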

Note:
For more information about control of character shaping, see SHAPED - Control of Character Shaping in the Parameter Reference documentation.