This section describes the character encoding mechanisms for HTTP requests to Tamino and responses from Tamino.
The term "encoding" in this section is used in the sense defined in the W3C XML specification at http://www.w3.org/TR/REC-xml/. The terms "charset" and "character set" are used in the sense defined in the HTTP/1.1 specification at http://www.ietf.org/rfc/rfc2616.txt.
Input documents can be supplied for X-Machine commands such as _process and _define. The encoding of an input document can be specified explicitly in several ways:
- in the encoding attribute of the document's XML declaration
- in the _encoding parameter passed in the X-Machine command
- in the charset value that is defined in conjunction with the document's Content-Type parameter in the HTTP request
If the encoding is not specified in one of these ways, the document is assumed to be encoded according to the value of the server XML parameter "XML document default encoding" (for details, see the list of server XML properties in the section Creating a Database in the documentation for the Tamino Manager).
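As a rough illustration of two of these mechanisms, the following Python sketch builds a document whose XML declaration names its encoding, together with a Content-Type value that carries a matching charset. The element names, the sample text, and the `text/xml` media type are assumptions chosen for the example. How the document is actually submitted with a _process command is not shown.

```python
# Hypothetical sample document; element names and content are illustrative only.
ENCODING = "ISO-8859-1"

document_text = (
    f'<?xml version="1.0" encoding="{ENCODING}"?>\n'
    "<patient><name><surname>Müller</surname></name></patient>\n"
)

# The bytes sent must really be in the encoding that the declaration names.
document_bytes = document_text.encode(ENCODING)

# The same encoding expressed as a charset value alongside the Content-Type
# (the text/xml media type is an assumption for this sketch).
content_type = f"text/xml; charset={ENCODING}"

print(document_bytes)
print(content_type)
```

Keeping the declared encoding and the actual byte encoding in one variable, as here, is one simple way to avoid a mismatch between the two.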
All input documents with top-level media type "text" are converted from the client's encoding to Unicode. The original encoding of the input is not retained. X-Machine uses the Internet-standard character set names defined in the document http://www.iana.org/assignments/character-sets.
Hint for users of Microsoft Windows: Note that Microsoft code page 1252 is close to, but not identical to, ISO-8859-1 (latin1).
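The difference can be seen with a short Python sketch that uses Python's own codecs (not Tamino) to decode the same byte under both encodings. The two encodings disagree in the byte range 0x80 to 0x9F and agree elsewhere.

```python
# Bytes 0x80-0x9F are where the two encodings disagree.
raw = b"\x80"

as_cp1252 = raw.decode("cp1252")      # euro sign in code page 1252
as_latin1 = raw.decode("iso-8859-1")  # C1 control character U+0080

print(repr(as_cp1252), repr(as_latin1))

# Outside that range the two agree; for example, 0xFC is "ü" in both.
same = b"\xfc".decode("cp1252") == b"\xfc".decode("iso-8859-1")
print(same)
```

A document that was written as code page 1252 but labelled ISO-8859-1 would therefore have characters such as the euro sign or typographic quotes misinterpreted.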
Database queries specifying character encoding can be sent to the X-Machine using the X-Machine command _encoding followed, for example, by the X-Machine command _xql in a single HTTP request, for example http://myhost:80/tamino/mydb/mycollection?_encoding=utf-8&_xql=patient/name[surname="Bloggs"]. The value of the _encoding parameter is applied to the values of all commands that are subsequently executed. See also the section Order of Execution of Commands.
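A minimal sketch of assembling such a request URL in Python is shown below. The helper name `build_query_url` is hypothetical, and the sketch assumes that non-ASCII characters in the query should be percent-encoded using the same encoding that _encoding announces. Only the standard library function `urllib.parse.urlencode` is used.

```python
from urllib.parse import urlencode

# Host, database, and collection are taken from the example URL in the text.
BASE_URL = "http://myhost:80/tamino/mydb/mycollection"


def build_query_url(encoding: str, xql: str) -> str:
    """Hypothetical helper: place _encoding ahead of _xql in one request."""
    # A list of pairs keeps _encoding ahead of _xql in the query string.
    params = [("_encoding", encoding), ("_xql", xql)]
    # Non-ASCII characters are percent-encoded using the announced encoding
    # (an assumption of this sketch).
    return BASE_URL + "?" + urlencode(params, encoding=encoding)


url = build_query_url("utf-8", 'patient/name[surname="Müller"]')
print(url)
```

Passing the parameters as an ordered list rather than an unordered mapping keeps _encoding in front of the command whose value it is meant to govern.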
Output documents are converted by the X-Machine to the encoding desired by the client. Character references are used to represent characters that do not exist in the desired encoding. The desired encoding of the output can be specified in the HTTP header "Accept-Charset". If "Accept-Charset" is omitted, X-Machine uses the encoding of the client request.
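The use of character references for unrepresentable characters can be illustrated with Python's `xmlcharrefreplace` error handler. This is an analogy using Python's codecs, not Tamino's implementation; the exact form of the references that X-Machine emits (decimal or hexadecimal) is not specified here. The sample text is an assumption.

```python
text = "Price: 10 €"

# ISO-8859-1 has no euro sign, so it is emitted as a numeric character reference.
encoded = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(encoded)

# A request header asking for that output encoding (header name from the text).
headers = {"Accept-Charset": "iso-8859-1"}
print(headers)
```

An XML parser reading the result recovers the original character from the reference, so no information is lost even though the target encoding cannot represent it directly.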
The Tamino server supports the standard character encodings and their well-known aliases shown in the following list.
Note: It is possible that some Tamino product components do not support some of these encodings. Please see the documentation for the individual developer components for a list of their supported encodings.
Encoding name | Well-known aliases |
---|---|
Adobe-Standard-Encoding | csAdobeStandardEncoding |
Big5 | 950, cp950, csBig5, ibm-1370_VSUB_VPUA, x-big5 |
CESU-8 | |
cp850 | 850, csPC850Multilingual, IBM850 |
cp851 | 851, csPC851, IBM851 |
cp856 | 856, ibm-856 |
cp857 | 857, csIBM857 |
cp858 | IBM00858 |
cp859 | |
cp860 | 860, csIBM860, IBM860 |
cp861 | 861, cp-is, csIBM861, IBM861 |
cp862 | 862, cp867, cspc862latinhebrew |
cp863 | cp863, csIBM863, IBM863 |
cp864 | csIBM864 |
cp865 | 865, csIBM865, IBM865 |
cp866 | 866, csIBM866 |
cp868 | 868, cp-ar, csIBM868, IBM868 |
cp869 | 869, cp-gr, csIBM869 |
cp921 | 921 |
cp922 | 922 |
EUC-JP | csEUCPkdFmtJapanese, eucjis, Extended_UNIX_Code_Packed_Format_for_Japanese, ibm-33722_VPUA, ibm-eucJP, X-EUC-JP |
EUC-KR | csEUCKR, ibm-970_VPUA, ibm-eucKR, X-EUC-KR |
gb18030 | ibm-1392 |
GB2312 | 1383, chinese, cp1383, csGB2312, csISO58GB231280, EUC-CN, gb, gb2312-1980, GB_2312-80, ibm-1383, ibm-1383_VPUA, ibm-eucCN, iso-ir-58, X-EUC-CN |
GBK | CP936, ibm-1386_VSUB_VPUA, MS936, zh_cn, windows-936 |
hp-roman8 | csHPRoman8, r8, roman8 |
HZ-GB-2312 | HZ |
IBM01140 | CCSID01140, CP01140, cpibm1140, ebcdic-us-37+euro |
IBM01141 | CCSID01141, CP01141, cpibm1141, ebcdic-de-273+euro |
IBM01142 | CCSID01142, CP01142, cpibm1142, ebcdic-dk-277+euro, ebcdic-no-277+euro |
IBM01143 | CCSID01143, CP01143, cpibm1143, ebcdic-fi-278+euro, ebcdic-se-278+euro |
IBM01144 | CCSID01144, CP01144, cpibm1144, ebcdic-it-280+euro |
IBM01145 | CCSID01145, CP01145, cpibm1145, ebcdic-es-284+euro |
IBM01146 | CCSID01146, CP01146, cpibm1146, ebcdic-gb-285+euro |
IBM01147 | CCSID01147, CP01147, cpibm1147, ebcdic-fr-297+euro |
IBM01148 | CCSID01148, CP01148, cpibm1148, ebcdic-international-500+euro |
IBM01149 | CCSID01149, CP01149, cpibm1149, ebcdic-is-871+euro |
IBM037 | cpibm37, ebcdic-cp-us, ebcdic-cp-ca, ebcdic-cp-wt, ebcdic-cp-nl, cp37, cp037, 037 |
IBM1026 | CP1026, csIBM1026, Ibm-1026_STD |
IBM273 | 273, CP273, cpibm273, csIBM273, ebcdic-de |
IBM277 | 277, csIBM277, cpibm277, EBCDIC-CP-DK, EBCDIC-CP-NO, ebcdic-dk |
IBM278 | 278, cp278, cpibm278, csIBM278, ebcdic-cp-fi, ebcdic-cp-se, ebcdic-sv |
IBM280 | 280, CP280, cpibm280, csIBM280, ebcdic-cp-it |
IBM284 | 284, CP284, cpibm284, csIBM284, ebcdic-cp-es |
IBM285 | 285, CP285, cpibm285, csIBM285, ebcdic-cp-gb, ebcdic-gb |
IBM290 | cp290, csIBM290, EBCDIC-JP-kana |
IBM297 | 297, cp297, cpibm297, csIBM297, ebcdic-cp-fr |
IBM367 | |
IBM420 | 420, cp420, csIBM420, ebcdic-cp-ar1 |
IBM424 | 424, cp424, csIBM424, ebcdic-cp-he |
IBM500 | 500, CP500, cpibm500, csIBM500, ebcdic-cp-be, ebcdic-cp-ch |
IBM852 | |
IBM855 | |
IBM857 | |
IBM862 | |
IBM864 | |
IBM869 | |
IBM870 | CP870, csIBM870, ibm-870, ibm-870_STD, ebcdic-cp-roece, ebcdic-cp-yu |
IBM871 | 871, CP871, cpibm871, csIBM871, ebcdic-cp-is, ebcdic-is |
IBM918 | CP918, csIBM918, ebcdic-cp-ar2, ibm-918_STD, ibm-918_VPUA |
ISO-2022-CN-EXT | |
ISO-2022-CN | |
ISO-2022-JP-2 | csISO2022JP2 |
ISO-2022-JP | csISO2022JP |
ISO-2022-KR | csISO2022KR |
ISO-2022 | 2022, cp2022 |
iso-8859-15 | |
ISO-8859-1 | 8859-1, cp819, csISOLatin1, IBM819, ISO_8859-1:1987, iso-ir-100, l1, latin1 |
iso-8859-2 | 8859-2, 912, cp912, csISOLatin2, ISO_8859-2:1987, iso-ir-101, l2, latin2 |
iso-8859-3 | 8859-3, 913, cp913, csISOLatin3, iso-ir-109, l3, latin3 |
iso-8859-4 | 8859-4, 914, cp914, csISOLatin4, ISO_8859-4:1988, iso-ir-110, l4, latin4 |
iso-8859-5 | 8859-5, 915, cp915, csISOLatinCyrillic, cyrillic, ISO_8859-5:1988, iso-ir-144 |
iso-8859-6 | 1089, 8859-6, arabic, asmo-708, cp1089, csISOLatinArabic, ecma-114, ISO_8859-6:1987, iso-ir-127 |
iso-8859-7 | 813, 8859-7, cp813, csISOLatinGreek, ecma-118, elot_928, greek, greek8, ISO_8859-7:1987, iso-ir-126 |
iso-8859-8 | 916, cp916, csISOLatinHebrew, Hebrew, 8859-8, ISO_8859-8:1988, iso-ir-138 |
iso-8859-9 | 8859-9, 920, cp920, latin5, csISOLatin5, ISO_8859-9:1989, iso-ir-148, l5 |
JIS_Encoding | ISO-2022-JP-1, JIS |
KOI8-R | cp878, cskoi8r, koi8 |
KSC_5601 | 949, csKSC56011987, ibm949, ibm949_VSUB_VPUA, iso-ir-149, johab, Korean, ksc5601_1992, KS_C_5601-1987, KS_C_5601-1989, ks_x_1001:1992 |
mac | csMacintosh |
SCSU | |
Shift_JIS | 943, cp943, cp932, csShiftJIS, csWindows31J, MS_Kanji, pck, sjis, windows-31j, x-sjis |
TIS-620 | 874, cp874, cp9066, ms874, windows-874 |
US-ASCII | ANSI_X3.4-1968, ASCII, ANSI_X3.4-1986, cp367, csASCII, ISO_646.irv:1983, ISO_646.irv:1991, ISO646-US, iso-ir-6, us |
UTF-16BE | cp1201, UTF16_BigEndian, x-utf-16be |
UTF-16LE | cp1200, UTF16_LittleEndian, x-utf-16le |
UTF-32BE | UTF32_BigEndian |
UTF-32LE | UTF32_LittleEndian |
UTF-7 | cp65000 |
UTF-8 | cp1208, cp65001 |
UTF-16 | csUnicode, ISO-10646-UCS-2, ucs-2 |
UTF-32 | csUCS4, ISO-10646-UCS-4, ucs-4 |
windows-1250 | cp1250 |
windows-1251 | cp1251 |
windows-1252 | cp1252 |
windows-1253 | cp1253 |
windows-1254 | cp1254 |
windows-1255 | cp1255 |
windows-1256 | cp1256 |
windows-1257 | cp1257 |
windows-1258 | cp1258 |