Character Encoding

This section describes the character encoding mechanisms for HTTP requests to Tamino and responses from Tamino.

The term "encoding" in this section is used with the meaning defined in the W3C XML specification at http://www.w3.org/TR/REC-xml/. The terms "charset" and "character set" are used with the meaning defined in the HTTP/1.1 specification at http://www.ietf.org/rfc/rfc2616.txt.

Character Encoding of Input Documents

Input documents can be supplied for X-Machine commands such as _process and _define. The encoding of an input document can be specified explicitly in several ways:

in the encoding attribute of the document's XML declaration
in the _encoding parameter passed in the X-Machine command
in the charset parameter of the document's Content-Type header field in the HTTP request

If the encoding is not specified in one of these ways, the document is assumed to be encoded according to the value of the server XML parameter XML document default encoding (for details see the list of server XML properties in the section Database Properties in the documentation for the Tamino Manager).
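The following minimal Python sketch illustrates two of the explicit mechanisms listed above: the encoding attribute of the XML declaration and the charset value of the Content-Type header. The helper names, the sample element and the media type text/xml are assumptions chosen for illustration; they are not part of the X-Machine interface, and the sketch does not send a request.

```python
def xml_with_declaration(body: str, encoding: str) -> bytes:
    """Serialize a document whose XML declaration names its encoding."""
    declaration = f'<?xml version="1.0" encoding="{encoding}"?>\n'
    # The bytes on the wire must really use the declared encoding.
    return (declaration + body).encode(encoding)


def content_type_with_charset(media_type: str, encoding: str) -> str:
    """Build a Content-Type header value that carries a charset parameter."""
    return f"{media_type}; charset={encoding}"


# Hypothetical sample document, encoded as ISO-8859-1.
document = xml_with_declaration("<name>M\u00fcller</name>", "iso-8859-1")
header_value = content_type_with_charset("text/xml", "iso-8859-1")

print(document)
print(header_value)
```

Keeping the declared encoding and the actual byte encoding in one helper avoids the common mistake of declaring one encoding and serializing in another.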

All input documents with the top-level media type "text" are converted from the client's encoding to Unicode. The original encoding of the input is not retained. X-Machine uses the internet standard character set names defined at http://www.iana.org/assignments/character-sets.

Hint for users of Microsoft Windows: Please note that Microsoft code page 1252 is close to, but not identical to, ISO-8859-1 (latin1).
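The difference between the two encodings can be made visible with a short Python sketch. It relies only on Python's own cp1252 and iso-8859-1 codecs, which are assumed here to match the encodings the hint refers to; it says nothing about how Tamino itself treats these bytes.

```python
def differs(byte_value: int) -> bool:
    """Return True if the byte decodes differently in cp1252 and ISO-8859-1."""
    raw = bytes([byte_value])
    try:
        return raw.decode("cp1252") != raw.decode("iso-8859-1")
    except UnicodeDecodeError:
        # Some bytes are undefined in cp1252 but valid in ISO-8859-1.
        return True


euro_byte = b"\x80"
as_cp1252 = euro_byte.decode("cp1252")      # the euro sign, U+20AC
as_latin1 = euro_byte.decode("iso-8859-1")  # a C1 control character, U+0080

# Collect every byte value on which the two encodings disagree.
differing = [value for value in range(256) if differs(value)]
print(f"{len(differing)} differing bytes, from {differing[0]:#x} to {differing[-1]:#x}")
```

A document that contains such bytes and is labelled with the wrong one of the two names is converted to different Unicode characters than intended.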

Example

Database queries that specify a character encoding can be sent to the X-Machine by using the X-Machine command _encoding followed, for example, by the X-Machine command _xql in a single HTTP request:

http://myhost:80/tamino/mydb/mycollection?_encoding=utf-8&_xql=patient/name[surname="Bloggs"]

The value of the _encoding parameter will be applied to the values of all commands that are subsequently executed. See also the section Order of Execution of Commands.
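As a rough sketch, such a request URL could be assembled in Python as shown below. The function name and the surname "Müller" are hypothetical. The sketch assumes that the percent-encoded bytes of the query value must match the encoding named in _encoding, and it places _encoding first so that it precedes the command it applies to.

```python
from urllib.parse import urlencode


def build_query_url(base_url: str, encoding: str, xql: str) -> str:
    """Build a URL with _encoding first, followed by the _xql query."""
    # A list of pairs keeps the parameters in the order given.
    params = [("_encoding", encoding), ("_xql", xql)]
    # Percent-encode the values using the same encoding that is declared.
    return base_url + "?" + urlencode(params, encoding=encoding)


base = "http://myhost:80/tamino/mydb/mycollection"
query = 'patient/name[surname="M\u00fcller"]'

url_utf8 = build_query_url(base, "utf-8", query)
url_latin1 = build_query_url(base, "iso-8859-1", query)

print(url_utf8)
print(url_latin1)
```

The two URLs differ only in the bytes used for "ü": %C3%BC in UTF-8 and %FC in ISO-8859-1.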

Character Encoding of Output Documents

Output documents are converted by the X-Machine to the encoding desired by the client. Character references are used to represent characters that do not exist in the desired encoding. The desired encoding of the output can be specified in the HTTP header "Accept-Charset". If "Accept-Charset" is omitted, X-Machine uses the encoding of the client request.
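The use of character references can be illustrated with Python's built-in xmlcharrefreplace error handler. This is only an analogy for the behavior described above, not the X-Machine implementation; the function name and the sample text are assumptions.

```python
def encode_for_client(text: str, desired_encoding: str) -> bytes:
    """Encode text, replacing unrepresentable characters with character references."""
    return text.encode(desired_encoding, errors="xmlcharrefreplace")


sample = "Gr\u00fc\u00dfe \u20ac"  # "Grüße" followed by the euro sign

# ISO-8859-1 has no euro sign, so it becomes the reference &#8364;
latin1_bytes = encode_for_client(sample, "iso-8859-1")

# UTF-8 can represent every character, so no references are needed.
utf8_bytes = encode_for_client(sample, "utf-8")

print(latin1_bytes)
print(utf8_bytes)
```

A client that requests a narrow encoding through "Accept-Charset" should therefore be prepared to resolve numeric character references in the response.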

Supported Character Encodings

The Tamino server supports all standard character encodings and their well-known aliases, as shown in the following list.

Note:
It is possible that some Tamino product components do not support some of these encodings. Please see the documentation for the individual developer components for a list of their supported encodings.

Encoding Name	Well-known aliases
Adobe-Standard-Encoding	csAdobeStandardEncoding
Big5	950, cp950, csBig5, ibm-1370_VSUB_VPUA, x-big5
CESU-8
cp850	850, csPC850Multilingual, IBM850
cp851	851, csPC851, IBM851
cp856	856, ibm-856
cp857	857, csIBM857
cp858	IBM00858
cp859
cp860	860, csIBM860, IBM860
cp861	861, cp-is, csIBM861, IBM861
cp862	862, cp867, cspc862latinhebrew
cp863	cp863, csIBM863, IBM863
cp864	csIBM864
cp865	865, csIBM865, IBM865
cp866	866, csIBM866
cp868	868, cp-ar, csIBM868, IBM868
cp869	869, cp-gr, csIBM869
cp921	921
cp922	922
EUC-JP	csEUCPkdFmtJapanese, eucjis, Extended_UNIX_Code_Packed_Format_for_Japanese, ibm-33722_VPUA, ibm-eucJP, X-EUC-JP
EUC-KR	csEUCKR, ibm-970_VPUA, ibm-eucKR, X-EUC-KR
gb18030	ibm-1392
GB2312	1383, chinese, cp1383, csGB2312, csISO58GB231280, EUC-CN, gb, gb2312-1980, GB_2312-80, ibm-1383, ibm-1383_VPUA, ibm-eucCN, iso-ir-58, X-EUC-CN
GBK	CP936, ibm-1386_VSUB_VPUA, MS936, zh_cn, windows-936
hp-roman8	csHPRoman8, r8, roman8
HZ-GB-2312	HZ
IBM01140	CCSID01140, CP01140, cpibm1140, ebcdic-us-37+euro
IBM01141	CCSID01141, CP01141, cpibm1141, ebcdic-de-273+euro
IBM01142	CCSID01142, CP01142, cpibm1142, ebcdic-dk-277+euro, ebcdic-no-277+euro
IBM01143	CCSID01143, CP01143, cpibm1143, ebcdic-fi-278+euro, ebcdic-se-278+euro
IBM01144	CCSID01144, CP01144, cpibm1144, ebcdic-it-280+euro
IBM01145	CCSID01145, CP01145, cpibm1145, ebcdic-es-284+euro
IBM01146	CCSID01146, CP01146, cpibm1146, ebcdic-gb-285+euro
IBM01147	CCSID01147, CP01147, cpibm1147, ebcdic-fr-297+euro
IBM01148	CCSID01148, CP01148, cpibm1148, ebcdic-international-500+euro
IBM01149	CCSID01149, CP01149, cpibm1149, ebcdic-is-871+euro
IBM037	cpibm37, ebcdic-cp-us, ebcdic-cp-ca, ebcdic-cp-wt, ebcdic-cp-nl, cp37, cp037, 037
IBM1026	CP1026, csIBM1026, ibm-1026_STD
IBM273	273, CP273, cpibm273, csIBM273, ebcdic-de
IBM277	277, csIBM277, cpibm277, EBCDIC-CP-DK, EBCDIC-CP-NO, ebcdic-dk
IBM278	278, cp278, cpibm278, csIBM278, ebcdic-cp-fi, ebcdic-cp-se, ebcdic-sv
IBM280	280, CP280, cpibm280, csIBM280, ebcdic-cp-it
IBM284	284, CP284, cpibm284, csIBM284, ebcdic-cp-es
IBM285	285, CP285, cpibm285, csIBM285, ebcdic-cp-gb, ebcdic-gb
IBM290	cp290, csIBM290, EBCDIC-JP-kana
IBM297	297, cp297, cpibm297, csIBM297, ebcdic-cp-fr
IBM367
IBM420	420, cp420, csIBM420, ebcdic-cp-ar1
IBM424	424, cp424, csIBM424, ebcdic-cp-he
IBM500	500, CP500, cpibm500, csIBM500, ebcdic-cp-be, ebcdic-cp-ch
IBM852
IBM855
IBM857
IBM862
IBM864
IBM869
IBM870	CP870, csIBM870, ibm-870, ibm-870_STD, ebcdic-cp-roece, ebcdic-cp-yu
IBM871	871, CP871, cpibm871, csIBM871, ebcdic-cp-is, ebcdic-is
IBM918	CP918, csIBM918, ebcdic-cp-ar2, ibm-918_STD, ibm-918_VPUA
ISO-2022-CN-EXT
ISO-2022-CN
ISO-2022-JP-2	csISO2022JP2
ISO-2022-JP	csISO2022JP
ISO-2022-KR	csISO2022KR
ISO-2022	2022, cp2022
iso-8859-15
ISO-8859-1	8859-1, cp819, csISOLatin1, IBM819, ISO_8859-1:1987, iso-ir-100, l1, latin1
iso-8859-2	8859-2, 912, cp912, csISOLatin2, ISO_8859-2:1987, iso-ir-101, l2, latin2
iso-8859-3	8859-3, 913, cp913, csISOLatin3, iso-ir-109, l3, latin3
iso-8859-4	8859-4, 914, cp914, csISOLatin4, ISO_8859-4:1988, iso-ir-110, l4, latin4
iso-8859-5	8859-5, 915, cp915, csISOLatinCyrillic, cyrillic, ISO_8859-5:1988, iso-ir-144
iso-8859-6	1089, 8859-6, arabic, asmo-708, cp1089, csISOLatinArabic, ecma-114, ISO_8859-6:1987, iso-ir-127
iso-8859-7	813, 8859-7, cp813, csISOLatinGreek, ecma-118, elot_928, greek, greek8, ISO_8859-7:1987, iso-ir-126
iso-8859-8	916, cp916, csISOLatinHebrew, Hebrew, 8859-8, ISO_8859-8:1988, iso-ir-138
iso-8859-9	8859-9, 920, cp920, latin5, csISOLatin5, ISO_8859-9:1989, iso-ir-148, l5
JIS_Encoding	ISO-2022-JP-1, JIS
KOI8-R	cp878, cskoi8r, koi8
KSC_5601	949, csKSC56011987, ibm949, ibm949_VSUB_VPUA, iso-ir-149, johab, Korean, ksc5601_1992, KS_C_5601-1987, KS_C_5601-1989, ks_x_1001:1992
mac	csMacintosh
SCSU
Shift_JIS	943, cp943, cp932, csShiftJIS, csWindows31J, MS_Kanji, pck, sjis, windows-31j, x-sjis
TIS-620	874, cp874, cp9066, ms874, windows-874
US-ASCII	ANSI_X3.4-1968, ASCII, ANSI_X3.4-1986, cp367, csASCII, ISO_646.irv:1983, ISO_646.irv:1991, ISO646-US, iso-ir-6, us
UTF-16BE	cp1201, UTF16_BigEndian, x-utf-16be
UTF-16LE	cp1200, UTF16_LittleEndian, x-utf-16le
UTF-32BE	UTF32_BigEndian
UTF-32LE	UTF32_LittleEndian
UTF-7	cp65000
UTF-8	cp1208, cp65001
UTF-16	csUnicode, ISO-10646-UCS-2, ucs-2
UTF-32	csUCS4, ISO-10646-UCS-4, ucs-4
windows-1250	cp1250
windows-1251	cp1251
windows-1252	cp1252
windows-1253	cp1253
windows-1254	cp1254
windows-1255	cp1255
windows-1256	cp1256
windows-1257	cp1257
windows-1258	cp1258