- a technique for including multiple character sets in a single character encoding system, and
- a technique for representing these character sets in both 7 and 8 bit systems using the same encoding.
|Standard||ISO/IEC 2022, JIS X 0202|
|Transforms / Encodes||US-ASCII and, depending on implementation:|
|Succeeded by||ISO 10646 (Unicode)|
Many of the character sets included as ISO/IEC 2022 encodings are 'double byte' encodings where two bytes correspond to a single character. This makes ISO-2022 a variable-width encoding. However, a specific implementation does not have to implement all of the standard; the conformance level and the supported character sets are defined by the implementation.
Many languages or language families not based on the Latin alphabet such as Greek, Cyrillic, Arabic, or Hebrew have historically been represented on computers with different 8-bit extended ASCII encodings. Written East Asian languages, specifically Chinese, Japanese, and Korean, use far more characters than can be represented in an 8-bit computer byte and were first represented on computers with language-specific double byte encodings.
ISO/IEC 2022 was developed as a technique to attack both of these problems: to represent characters in multiple character sets within a single character encoding, and to represent large character sets.
A second requirement of ISO-2022 was that it should be compatible with 7-bit communication channels. Although ISO-2022 is an 8-bit character set, any 8-bit sequence can be re-encoded to use only 7 bits without loss, normally with only a small increase in size.
To represent multiple character sets, the ISO/IEC 2022 character encodings include escape sequences which indicate the character set for characters which follow. The escape sequences are registered with ISO and follow the patterns defined within the standard. These character encodings require data to be processed sequentially in a forward direction, since the correct interpretation of the data depends on previously encountered escape sequences. Note, however, that other standards such as ISO-2022-JP may impose extra conditions, for example that the current character set be reset to US-ASCII before the end of a line.
To represent large character sets, ISO/IEC 2022 builds on ISO/IEC 646's property that one seven-bit character will normally define 94 graphic (printable) characters (in addition to space and 33 control characters). Using two bytes, it is thus possible to represent up to 8836 (94×94) characters; and, using three bytes, up to 830584 (94×94×94) characters. Though the standard defines it, no registered character set uses three bytes (although EUC-TW's unregistered G2 does). For the two-byte character sets, the code point of each character is normally specified in so-called kuten (Japanese: 区点) form (sometimes called qūwèi (Chinese: 区位), especially when dealing with GB2312 and related standards), which specifies a zone (区, Japanese: ku, Chinese: qū), and the point (Japanese: 点 ten) or position (Chinese: 位 wèi) of that character within the zone.
The escape sequences therefore not only declare which character set is being used but also, because the properties of each registered set are known, indicate whether a 94-, 96-, 8836-, or 830584-character (or some other sized) encoding is being dealt with.
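The kuten arithmetic described above can be sketched as follows. This is a minimal illustration, not part of the standard: the function names are hypothetical, and it assumes the usual convention that zone and point are numbered from 1 to 94 and are offset by 0x20 for the GL form (or 0xA0 for the GR form).

```python
def kuten_to_bytes(ku: int, ten: int, gr: bool = False) -> bytes:
    """Convert a 1-based (zone, point) pair into its two code bytes."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must each be in 1..94")
    base = 0xA0 if gr else 0x20  # GR form simply has the high bit set
    return bytes([base + ku, base + ten])


def bytes_to_kuten(code: bytes) -> tuple[int, int]:
    """Inverse mapping; the high bit is ignored, so GL and GR forms both work."""
    if len(code) != 2:
        raise ValueError("expected exactly two bytes")
    ku, ten = ((b & 0x7F) - 0x20 for b in code)
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("bytes are outside the 94x94 grid")
    return ku, ten


# Capacity of two-byte and three-byte 94-sets.
print(94 * 94, 94 ** 3)
# Zone 16, point 1 in GL and GR form.
print(kuten_to_bytes(16, 1).hex(), kuten_to_bytes(16, 1, gr=True).hex())
```

The same zone and point therefore yield a 7-bit-safe byte pair when invoked over GL and a high-bit pair when invoked over GR.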
In practice, the escape sequences declaring the national character sets may be absent if context or convention dictates that a certain national character set is to be used. For example, ISO-8859-1 states that no defining escape sequence is needed and RFC 1922, which defines ISO-2022-CN, allows ISO-2022 SHIFT characters to be used without explicit use of escape sequences.
The ISO-2022 definitions of the ISO-8859-X character sets are specific fixed combinations of the components that form ISO-2022. Specifically, the lower control characters (C0), the US-ASCII character set (in GL) and the upper control characters (C1) are standard, and the high characters (GR) are defined for each of the ISO-8859-X variants; for example, ISO-8859-1 is defined by the combination of ISO-IR-1, ISO-IR-6, ISO-IR-77 and ISO-IR-100 with no shifts or character changes allowed.
Although ISO/IEC 2022 character sets using control sequences are still in common use, particularly ISO-2022-JP, most modern e-mail applications are converting to use the simpler Unicode transforms such as UTF-8. The encodings that don't use control sequences, such as the ISO-8859 sets, are still very common.
Notation and nomenclature
ISO/IEC 2022 coding specifies a two-layer mapping between character codes and displayed characters. Escape sequences allow any of a large registry of graphic character sets to be "designated" into one of four working sets, named G0 through G3, and shorter control sequences specify the working set that is "invoked" to interpret bytes in the stream.
Encoding byte values ("bit combinations") are often given in column-line notation, where two decimal numbers in the range 00–15 (each corresponding to a single hexadecimal digit) are separated by a slash. Hence, for instance, codes 2/0 (0x20) through 2/15 (0x2F) inclusive may be referred to as "column 02". This is the notation used in the ISO/IEC 2022 / ECMA-35 standard itself. They may be described elsewhere using hexadecimal, as is often used in this article, or using the corresponding ASCII characters, although the escape sequences are actually defined in terms of byte values, and the graphic assigned to that byte value may be altered without affecting the control sequence.
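As an illustration of the column-line notation, the following sketch converts between a byte value and its column/line form. The helper names are hypothetical; the output uses the unpadded style seen in this article (for example 2/15), while the parser also accepts zero-padded input such as 02/15.

```python
def to_column_line(byte: int) -> str:
    """Render a byte value as 'column/line' using decimal numbers."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte value out of range")
    return f"{byte >> 4}/{byte & 0x0F}"


def from_column_line(text: str) -> int:
    """Parse 'column/line' (zero padding allowed) back into a byte value."""
    column, line = (int(part) for part in text.split("/"))
    if not (0 <= column <= 15 and 0 <= line <= 15):
        raise ValueError("column and line must each be in 0..15")
    return (column << 4) | line


print(to_column_line(0x2F))           # last position of column 02
print(hex(from_column_line("1/11")))  # the ESC character
```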
Byte values from the 7-bit ASCII graphic range (hexadecimal 0x20–0x7F), being on the left side of a character code table, are referred to as "GL" codes (with "GL" standing for "graphics left") while bytes from the "high ASCII" range (0xA0–0xFF), if available (i.e. in an 8-bit environment), are referred to as the "GR" codes ("graphics right"). The terms "CL" and "CR" are defined for the control ranges, but the CL range always invokes the primary (C0) controls, whereas the CR range always either invokes the secondary (C1) controls or is unused.
Fixed coded characters
The delete character DEL (0x7F), the escape character ESC (0x1B) and the space character SP (0x20) are designated "fixed" coded characters and are always available when G0 is invoked over GL, irrespective of what character sets are designated. They may not be included in graphical character sets, although other sizes or types of whitespace character may be.
General syntax of escape sequences
Sequences using the ESC (escape) character take the form
ESC [I...] F, where the ESC character is followed by zero or more intermediate bytes (I) from the range 0x20–0x2F, and one final byte (F) from the range 0x30–0x7E.
The first I byte, or absence thereof, determines the type of escape sequence; it might, for instance, designate a working set, or denote a single control function. In all types of escape sequences, F bytes in the range 0x30–0x3F are reserved for unregistered private uses defined by prior agreement between parties.
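The general syntax can be sketched as a small parser. This is only an illustration under the byte ranges stated above; `parse_escape` and `is_private_use` are hypothetical names, and no attempt is made to interpret what the sequence means.

```python
ESC = 0x1B


def parse_escape(data: bytes, start: int = 0):
    """Return (intermediate_bytes, final_byte, next_index) for the
    escape sequence beginning at data[start]."""
    if start >= len(data) or data[start] != ESC:
        raise ValueError("no ESC at this position")
    pos = start + 1
    # Zero or more intermediate bytes from 0x20-0x2F.
    while pos < len(data) and 0x20 <= data[pos] <= 0x2F:
        pos += 1
    # Exactly one final byte from 0x30-0x7E.
    if pos >= len(data) or not 0x30 <= data[pos] <= 0x7E:
        raise ValueError("escape sequence has no valid final byte")
    return data[start + 1:pos], data[pos], pos + 1


def is_private_use(final: int) -> bool:
    """Final bytes 0x30-0x3F are reserved for private use."""
    return 0x30 <= final <= 0x3F


print(parse_escape(b"\x1b$(D"))
```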
Graphical character sets
Each of the four working sets G0 through G3 may be a 94-character set or a 94ⁿ-character multi-byte set. Additionally, G1 through G3 may be a 96- or 96ⁿ-character set.
In a 96- or 96ⁿ-character set, the bytes 0x20 through 0x7F when GL-invoked, or 0xA0 through 0xFF when GR-invoked, are allocated to and may be used by the set. In a 94- or 94ⁿ-character set, the bytes 0x20 and 0x7F are not used. When a 96- or 96ⁿ-character set is invoked in the GL region, the space and delete characters (codes 0x20 and 0x7F) are not available until a 94- or 94ⁿ-character set (such as the G0 set) is invoked in GL. 96-character sets cannot be designated to G0.
Registration of a set as a 96-character set does not necessarily mean that the 0x20/A0 and 0x7F/FF bytes are actually assigned by the set; some examples of graphical character sets which are registered as 96-sets but do not use those bytes include the G1 set of I.S. 434, the box drawing set from ISO/IEC 10367, and ISO-IR-164 (a subset of the G1 set of ISO-8859-8 with only the letters, used by CCITT).
Characters are expected to be spacing characters, not combining characters, unless specified otherwise by the graphical set in question. ISO 2022 / ECMA-35 also recognizes the use of the backspace and carriage return control characters as means of combining otherwise spacing characters, as well as the CSI sequence "Graphic Character Combination" (GCC) (CSI 0x20 (SP) 0x5F (_)).
Use of the backspace and carriage return in this manner is permitted by ISO/IEC 646 but prohibited by ISO/IEC 4873 / ECMA-43 and by ISO/IEC 8859, on the basis that it leaves the graphical character repertoire undefined. ISO/IEC 4873 / ECMA-43 does, however, permit the use of the GCC function on the basis that the sequence of characters is kept the same and merely displayed in one space, rather than being over-stamped to form a character with a different meaning.
Control character sets
A C0 control set must contain the ESC (escape) control character at 0x1B (a C0 set containing only ESC is registered as ISO-IR-104), whereas a C1 control set may not contain the escape control whatsoever. Hence, they are entirely separate registrations, with a C0 set being only a C0 set and a C1 set being only a C1 set.
If codes from the C0 set of ISO 6429 / ECMA-48, i.e. the ASCII control codes, appear in the C0 set, they are required to appear at their ISO 6429 / ECMA-48 locations. Inclusion of transmission control characters in the C0 set, besides the ten included by ISO 6429 / ECMA-48 (namely SOH, STX, ETX, EOT, ENQ, ACK, DLE, NAK, SYN and ETB), or inclusion of any of those ten in the C1 set, is also prohibited by the ISO/IEC 2022 / ECMA-35 standard.
A C0 control set is invoked over the CL range 0x00 through 0x1F, whereas a C1 control character may be invoked over the CR range 0x80 through 0x9F (in an 8-bit environment) or by using escape sequences (in a 7-bit or 8-bit environment), but not both. Which style of C1 invocation is used must be specified in the definition of the code version. For example, ISO/IEC 4873 specifies CR bytes for the C1 controls (SS2 and SS3) which it uses. If necessary, which invocation is used may be communicated using announcer sequences.
In the latter case, single control characters from the C1 control character set are invoked using "type Fe" escape sequences, meaning those where the ESC control character is followed by a byte from columns 04 or 05 (that is to say, ESC 0x40 (@) through ESC 0x5F (_)).
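The correspondence between the two C1 representations can be sketched as below. The function names are hypothetical; the mapping assumes, as the ranges above imply, that the CR byte 0x80 + n corresponds to ESC followed by 0x40 + n.

```python
ESC = 0x1B


def c1_to_escape(byte: int) -> bytes:
    """Rewrite an 8-bit C1 control (0x80-0x9F) as its type Fe escape sequence."""
    if not 0x80 <= byte <= 0x9F:
        raise ValueError("not a CR-range control byte")
    return bytes([ESC, byte - 0x40])


def escape_to_c1(seq: bytes) -> int:
    """Rewrite a type Fe escape sequence as the equivalent 8-bit C1 byte."""
    if len(seq) != 2 or seq[0] != ESC or not 0x40 <= seq[1] <= 0x5F:
        raise ValueError("not a type Fe escape sequence")
    return seq[1] + 0x40


print(c1_to_escape(0x8E))         # 7-bit form of the control at 0x8E
print(hex(escape_to_c1(b"\x1bA")))
```

A code version uses one representation or the other, so a converter between 7-bit and 8-bit environments would apply exactly one of these two functions to every C1 control it meets.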
Other control functions
Additional control functions are assigned to "type Fs" escape sequences (in the range ESC 0x60 (`) through ESC 0x7E (~)); these have permanently assigned meanings rather than depending on the C0 or C1 designations. Registration of control functions to type "Fs" sequences must be approved by ISO/IEC JTC 1/SC 2. Other single control functions may be registered to type "3Ft" escape sequences (in the range ESC 0x23 (#) [I...] 0x40 (@) through ESC 0x23 (#) [I...] 0x7E (~)), although no "3Ft" sequences are currently assigned (as of 2019).
|Abbr.||Name||Effect|
|DMI||Disable manual input||Disables some or all of the manual input facilities of the device.|
|INT||Interrupt||Interrupts the current process.|
|EMI||Enable manual input||Enables the manual input facilities of the device.|
|RIS||Reset to initial state||Resets the device to its state after being powered on.|
|CMD||Coding method delimiter||Used when interacting with an outer coding / representation system, see below.|
|LS2||Locking shift two||Shift function, see below.|
|LS3||Locking shift three||Shift function, see below.|
|LS3R||Locking shift three right||Shift function, see below.|
|LS2R||Locking shift two right||Shift function, see below.|
|LS1R||Locking shift one right||Shift function, see below.|
Escape sequences of type "Fp" (ESC 0x30 (0) through ESC 0x3F (?)) or of type "3Fp" (ESC 0x23 (#) [I...] 0x30 (0) through ESC 0x23 (#) [I...] 0x3F (?)) are reserved for single private use control codes, by prior agreement between parties. Several such sequences of both types are used by DEC terminals such as the VT100, and are thus supported by terminal emulators.
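The escape sequence types named so far (Fe, Fs, Fp, 3Ft and 3Fp) can be told apart from the first intermediate byte and the final byte alone. The following sketch is a hypothetical classifier built only from the ranges given in this article; sequences with any other first intermediate byte (designations, announcements and so on) are simply reported as "other".

```python
ESC = 0x1B


def classify(seq: bytes) -> str:
    """Classify a complete escape sequence, including its leading ESC."""
    if len(seq) < 2 or seq[0] != ESC:
        raise ValueError("not an escape sequence")
    intermediates, final = seq[1:-1], seq[-1]
    if any(not 0x20 <= b <= 0x2F for b in intermediates):
        raise ValueError("invalid intermediate byte")
    if not 0x30 <= final <= 0x7E:
        raise ValueError("invalid final byte")
    if not intermediates:
        if final <= 0x3F:
            return "Fp"   # private-use single control
        if final <= 0x5F:
            return "Fe"   # control from the C1 set
        return "Fs"       # permanently assigned control function
    if intermediates[0] == 0x23:  # '#'
        return "3Fp" if final <= 0x3F else "3Ft"
    return "other"


for sample in (b"\x1bN", b"\x1bc", b"\x1b7", b"\x1b#8", b"\x1b(B"):
    print(sample, classify(sample))
```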
|Abbr.||Name||Effect|
|LS0||Locking shift zero||GL encodes G0 from now on|
|LS1||Locking shift one||GL encodes G1 from now on|
|LS2||Locking shift two||GL encodes G2 from now on|
|LS3||Locking shift three||GL encodes G3 from now on|
|SS2||Single shift two||GL or GR (see below) encodes G2 for the immediately following character only|
|SS3||Single shift three||GL or GR (see below) encodes G3 for the immediately following character only|
|LS1R||Locking shift one right||GR encodes G1 from now on|
|LS2R||Locking shift two right||GR encodes G2 from now on|
|LS3R||Locking shift three right||GR encodes G3 from now on|
In 8-bit environments, either GL or GR, but not both, may be used as the single-shift area. This must be specified in the definition of the code version. For instance, ISO/IEC 4873 specifies GL, whereas packed EUC specifies GR. In 7-bit environments, only GL is used as the single-shift area. If necessary, which single-shift area is used may be communicated using announcer sequences.
The names "locking shift zero" (LS0) and "locking shift one" (LS1) refer to the same pair of C0 control characters (0x0F and 0x0E) as the names "shift in" (SI) and "shift out" (SO). However, the standard refers to them as LS0 and LS1 when they are used in 8-bit environments and as SI and SO when they are used in 7-bit environments.
Since SI (LS0) and SO (LS1) are C0 controls and SS2 and SS3 are C1 controls, they must be present in the respective sets if their functionality is used. The C1 controls themselves, as mentioned above, may be represented using escape sequences or 8-bit bytes, but not both.
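The difference between locking and single shifts can be sketched with a small state tracker. This is a simplified illustration, not a conforming decoder: it assumes single-byte working sets, an initial state of G0 in GL and G1 in GR, the 8-bit CR forms 0x8E and 0x8F for SS2 and SS3, and it handles only the SI/SO pair of locking shifts. The function name is hypothetical.

```python
SI, SO = 0x0F, 0x0E    # locking shift zero / locking shift one
SS2, SS3 = 0x8E, 0x8F  # single shifts in their 8-bit CR form


def resolve_working_sets(data: bytes, gl: int = 0, gr: int = 1):
    """Return (working_set_index, 7-bit code) for each graphic byte."""
    single = None  # working set selected for the next character only
    resolved = []
    for byte in data:
        if byte == SI:
            gl = 0
        elif byte == SO:
            gl = 1
        elif byte == SS2:
            single = 2
        elif byte == SS3:
            single = 3
        elif 0x20 <= byte <= 0x7F or 0xA0 <= byte <= 0xFF:
            if single is not None:
                working_set, single = single, None  # one character, then revert
            else:
                working_set = gl if byte < 0x80 else gr
            resolved.append((working_set, byte & 0x7F))
        # any other control byte is ignored in this sketch
    return resolved


print(resolve_working_sets(b"A\x0eB\x0fC"))
print(resolve_working_sets(b"\x8e\xc1\xc1"))
```

In the second example the single shift affects only the first 0xC1; the second one falls back to whatever is invoked in GR.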
Registration of graphical and control code sets
The ISO International register of coded character sets to be used with escape sequences (ISO-IR) lists graphical character sets, control code sets, single control codes and so forth which have been registered in accordance with the ISO 2375 procedures for registering escape sequences, for use with ISO/IEC 2022. Each registration receives a unique escape sequence, and a unique registry entry number to identify it.
Character set designations
Escape sequences to designate character sets take the form ESC I [I...] F. As mentioned above, the intermediate (I) bytes are from the range 0x20–0x2F, and the final (F) byte is from the range 0x30–0x7E. The first I byte (or, for a multi-byte set, the first two) identifies the type of character set and the working set it is to be designated to, whereas the F byte (and any additional I bytes) identify the character set itself, as assigned in the ISO-IR register (or, for the private-use escape sequences, by prior agreement).
Additional I bytes may be added before the F byte to extend the F byte range. This is currently only used with 94-character sets, where codes of the form ESC ( ! F have been assigned. At the other extreme, no multibyte 96-sets have been registered, so the sequences below are strictly theoretical.
As with other escape sequence types, the range 0x30–0x3F is reserved for private-use F bytes (which might be defined by further protocols such as ARIB STD-B24). However, in a graphical set designation sequence, if the second I byte (for a single-byte set) or the third I byte (for a double-byte set) is 0x20, the set denoted is a "dynamically redefinable character set" defined by prior agreement, which is also considered private use.
There are also three special cases for multi-byte codes. The code sequences ESC $ @, ESC $ A, and ESC $ B were all registered when the contemporary version of the standard allowed multi-byte sets only in G0, so must be accepted as synonyms for the sequences ESC $ ( @ through ESC $ ( B to designate to the G0 character set.
There are additional (rarely used) features for switching control character sets, but this is a single-level lookup, in that (as noted above) the C0 set is always invoked over CL, and the C1 set is always invoked over CR or by using escape codes. As noted above, it is required that any C0 character set include the ESC character at position 0x1B, so that further changes are possible. The control set designation sequences (as opposed to the graphical set ones) may also be used from within ISO/IEC 10646 (UCS/Unicode), in contexts where processing ANSI escape codes is appropriate, provided that each byte in the sequence is padded to the code unit size of the encoding.
|Abbr.||Name||Effect||Example|
|ACS||Announce code structure||Specifies code features used, e.g. working sets (see below).||ISO 4873 level 1|
|CZD||C0-designate||F selects a C0 control character set to be used.||ASCII C0 codes|
|C1D||C1-designate||F selects a C1 control character set to be used.||ISO 6429 C1 codes|
|IRR||Identify revised registration||Prefixes designation escape to denote revision.||JIS X 0208:1990 in G0|
|GZD4||G0-designate 94-set||F selects a 94-character set to be used for G0.||ASCII in G0|
|G1D4||G1-designate 94-set||F selects a 94-character set to be used for G1.||JIS X 0201 Kana in G1|
|G2D4||G2-designate 94-set||F selects a 94-character set to be used for G2.||ITU T.61 RHS in G2|
|G3D4||G3-designate 94-set||F selects a 94-character set to be used for G3.||NATS-SEFI-ADD in G3|
|G1D6||G1-designate 96-set||F selects a 96-character set to be used for G1.||ISO 8859-1 RHS in G1|
|G2D6||G2-designate 96-set||F selects a 96-character set to be used for G2.||ISO 8859-2 RHS in G2|
|G3D6||G3-designate 96-set||F selects a 96-character set to be used for G3.||ISO 8859-15 RHS in G3|
|GZDM4||G0-designate multibyte 94-set||F selects a 94ⁿ-character set to be used for G0.||KS X 1001 in G0|
|G1DM4||G1-designate multibyte 94-set||F selects a 94ⁿ-character set to be used for G1.||GB 2312 in G1|
|G2DM4||G2-designate multibyte 94-set||F selects a 94ⁿ-character set to be used for G2.||JIS X 0208 in G2|
|G3DM4||G3-designate multibyte 94-set||F selects a 94ⁿ-character set to be used for G3.||JIS X 0212 in G3|
|G1DM6||G1-designate multibyte 96-set||F selects a 96ⁿ-character set to be used for G1.|
|G2DM6||G2-designate multibyte 96-set||F selects a 96ⁿ-character set to be used for G2.|
|G3DM6||G3-designate multibyte 96-set||F selects a 96ⁿ-character set to be used for G3.|
Note that the registry of F bytes is independent for the different types. The 94-character graphic set designated by ESC ( A through ESC + A is not related in any way to the 96-character set designated by ESC - A through ESC / A. And neither of those is related to the 94ⁿ-character set designated by ESC $ ( A through ESC $ + A, and so on; the final bytes must be interpreted in context. (Indeed, without any intermediate bytes, ESC A is a way of specifying the C1 control code 0x81.)
Also note that C0 and C1 control character sets are independent; the C0 control character set designated by ESC ! A (which happens to be the NATS control set for newspaper text transmission) is not the same as the C1 control character set designated by ESC " A (the CCITT attribute control set for Videotex).
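The structure of graphical set designations described in this section can be sketched as a parser that separates "where the set goes" from "which set it is". The names below are hypothetical, the table covers only the graphical designators (not C0/C1 designation or announcements), and the identifier is returned as raw bytes rather than looked up in the ISO-IR register.

```python
ESC = 0x1B

# First intermediate byte (after an optional '$') -> (working set, set size)
WORKING_SET = {
    0x28: ("G0", 94), 0x29: ("G1", 94), 0x2A: ("G2", 94), 0x2B: ("G3", 94),
    0x2D: ("G1", 96), 0x2E: ("G2", 96), 0x2F: ("G3", 96),
}


def parse_designation(seq: bytes):
    """Return (working_set, set_size, multibyte, identifier_bytes)."""
    if len(seq) < 3 or seq[0] != ESC:
        raise ValueError("not a designation sequence")
    body = seq[1:]
    multibyte = body[0] == 0x24  # '$' marks a multi-byte set
    if multibyte:
        body = body[1:]
        # Legacy synonyms: ESC $ @, ESC $ A and ESC $ B designate to G0.
        if len(body) == 1 and body[0] in (0x40, 0x41, 0x42):
            return ("G0", 94, True, body)
    if len(body) < 2 or body[0] not in WORKING_SET:
        raise ValueError("not a graphical set designation")
    identifier = body[1:]
    if any(not 0x20 <= b <= 0x2F for b in identifier[:-1]) \
            or not 0x30 <= identifier[-1] <= 0x7E:
        raise ValueError("malformed identifier bytes")
    working_set, size = WORKING_SET[body[0]]
    return (working_set, size, multibyte, identifier)


for sample in (b"\x1b(B", b"\x1b$B", b"\x1b$)C", b"\x1b.A"):
    print(sample, parse_designation(sample))
```

Because the same final byte means different things under different designators, a lookup table for actual character sets would need to be keyed on the set size and the multibyte flag as well as on the identifier bytes.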
Interaction with other coding systems
The standard also defines a way to specify coding systems that do not follow its own structure.
A sequence is also defined for returning to ISO/IEC 2022; the registrations which support this sequence as encoded in ISO/IEC 2022 comprise (as of 2019) various Videotex formats, UTF-8, and UTF-1. A second I byte of 0x2F (/) is included in the designation sequences of codes which do not use that byte sequence to return to ISO 2022; they may have their own means to return to ISO 2022 (such as a different or padded sequence) or none at all. All existing registrations of the latter type (as of 2019) are either transparent raw data, Unicode/UCS formats, or subsets thereof.
Of particular interest are the sequences which switch to ISO/IEC 10646 (Unicode) formats which do not follow the ISO/IEC 2022 structure. These include UTF-8 (which does not reserve the range 0x80–0x9F for control characters), its predecessor UTF-1 (which mixes GR and GL bytes in multi-byte codes), and UTF-16 and UTF-32 (which use wider coding units).
Several codes were also registered for subsets (levels 1 and 2) of UTF-8, UTF-16 and UTF-32, as well as for three levels of UCS-2. However, the only codes currently specified by ISO/IEC 10646 are the level-3 codes for UTF-8, UTF-16 and UTF-32 and the unspecified-level code for UTF-8, with the rest being listed as deprecated. ISO/IEC 10646 stipulates that the big-endian formats of UTF-16 and UTF-32 are designated by their escape sequences.
|Unicode Format||Code(s)||Hex||Deprecated codes||Deprecated hex|
|UTF-1||(UTF-1 not in current ISO/IEC 10646.)|
Of the sequences switching to UTF-8, ESC % G is the one supported by, for example, xterm.
Although use of a variant of the standard return sequence from UTF-16 and UTF-32 is permitted, the bytes of the escape sequence must be padded to the size of the code unit of the encoding (for example, 001B 0025 0040 for UTF-16), so the coding of the standard return sequence does not conform exactly to ISO/IEC 2022. For this reason, the designations for UTF-16 and UTF-32 use a without-standard-return syntax.
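The padding rule can be sketched in a few lines. The helper name is hypothetical, and big-endian byte order is assumed, in line with the statement above that the escape sequences designate the big-endian formats.

```python
def pad_escape(seq: bytes, unit_size: int) -> bytes:
    """Widen each byte of an escape sequence to one big-endian code unit."""
    return b"".join(byte.to_bytes(unit_size, "big") for byte in seq)


return_sequence = b"\x1b%@"  # bytes 1B 25 40
print(pad_escape(return_sequence, 2).hex(" ", 2))  # UTF-16 code units
print(pad_escape(return_sequence, 4).hex(" ", 4))  # UTF-32 code units
```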
Code structure announcements
The sequence "announce code structure" (ESC SP (0x20) F) is used to announce a specific code structure, or a specific group of ISO 2022 facilities which are used in a particular code version. Although announcements can be combined, certain contradictory combinations (specifically, using locking shift announcements 16–23 with announcements 1, 3 and 4) are prohibited by the standard, as is using additional announcements on top of ISO/IEC 4873 level announcements 12–14 (which fully specify the permissible structural features). Announcement sequences are as follows:
|Number||Code version feature announced|
|1||G0 in GL, GR absent or unused, no locking shifts.|
|2||G0 and G1 invoked to GL by locking shifts, GR absent or unused.|
|3||G0 in GL, G1 in GR, no locking shifts, requires an 8-bit environment.|
|4||G0 in GL, G1 in GR if 8-bit, no locking shifts unless in a 7-bit environment.|
|5||Shift functions preserved during 7-bit/8-bit conversion.|
|6||C1 controls using escape sequences.|
|7||C1 controls in CR region in 8-bit environments, as escape sequences otherwise.|
|8||94-character graphical sets only.|
|9||94-character and/or 96-character graphical sets.|
|10||Uses a 7-bit code, even if an eighth bit is available for use.|
|11||Requires an 8-bit code.|
|12||Complies to ISO/IEC 4873 (ECMA-43) level 1.|
|13||Complies to ISO/IEC 4873 (ECMA-43) level 2.|
|14||Complies to ISO/IEC 4873 (ECMA-43) level 3.|
|16||SI / LS0 used.|
|18||SO / LS1 used.|
|19||LS1R used in 8-bit environments, SO used in 7-bit environments.|
|21||LS2R used in 8-bit environments, LS2 used in 7-bit environments.|
|23||LS3R used in 8-bit environments, LS3 used in 7-bit environments.|
|28||Single-shifts invoke over GR.|
ISO/IEC 2022 code versions
Japanese e-mail versions
ISO-2022-JP is a widely used encoding for Japanese, in particular in e-mail. It was introduced for use on the JUNET network and later codified in IETF RFC 1468, dated 1993. It has an advantage over other encodings for Japanese in that it does not require 8-bit clean transmission. Microsoft calls it Code page 50220. It starts in ASCII and includes the following escape sequences:
- ESC ( B to switch to ASCII (1 byte per character)
- ESC ( J to switch to JIS X 0201-1976 (ISO/IEC 646:JP) Roman set (1 byte per character)
- ESC $ @ to switch to JIS X 0208-1978 (2 bytes per character)
- ESC $ B to switch to JIS X 0208-1983 (2 bytes per character)
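How these four escape sequences partition a byte stream can be sketched as follows. The `split_segments` helper and the labels in `DESIGNATIONS` are hypothetical; the sketch recognizes only the four sequences listed above and rejects anything else. The last line uses Python's built-in `iso2022_jp` codec to decode the same sample.

```python
DESIGNATIONS = {
    b"\x1b(B": "ASCII",
    b"\x1b(J": "JIS X 0201 Roman",
    b"\x1b$@": "JIS X 0208-1978",
    b"\x1b$B": "JIS X 0208-1983",
}


def split_segments(data: bytes):
    """Split an ISO-2022-JP stream into (character set, raw bytes) runs."""
    current = "ASCII"  # the encoding starts in ASCII
    segments = []
    pos = start = 0
    while pos < len(data):
        if data[pos] == 0x1B:
            seq = data[pos:pos + 3]
            if seq not in DESIGNATIONS:
                raise ValueError(f"unsupported escape sequence {seq!r}")
            if pos > start:
                segments.append((current, data[start:pos]))
            current = DESIGNATIONS[seq]
            pos += 3
            start = pos
        else:
            pos += 1
    if pos > start:
        segments.append((current, data[start:pos]))
    return segments


sample = b"A\x1b$B0!\x1b(BB"  # one double-byte character between two ASCII letters
print(split_segments(sample))
print(sample.decode("iso2022_jp"))
```

Note that every byte in the sample is below 0x80, which is what allows the encoding to pass through channels that are not 8-bit clean.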
The RFC notes that some existing systems did not distinguish ESC ( B from ESC ( J, or did not distinguish ESC $ @ from ESC $ B, but stipulates that the escape sequences should not be changed by systems simply relaying messages such as e-mails. The WHATWG Encoding Standard referenced by HTML5 handles ESC ( B and ESC ( J distinctly, but treats ESC $ @ the same as ESC $ B when decoding, and uses only ESC $ B for JIS X 0208 when encoding. The RFC also notes that some past systems had made erroneous use of the sequence ESC ( H to switch away from JIS X 0208, which is actually registered for ISO-IR-11 (a Swedish variant of ISO 646 and World System Teletext).
The sequence ESC ( I, which switches to the JIS X 0201-1976 Kana set (1 byte per character), is not part of the ISO-2022-JP profile, but is also sometimes used. Microsoft's code page for ISO-2022-JP with JIS X 0201 kana is Code page 50221. Python allows it in a variant which it labels ISO-2022-JP-EXT (which also incorporates JIS X 0212 as described below). The WHATWG/HTML5 variant permits decoding JIS X 0201 katakana in ISO-2022-JP input, but converts the characters to their JIS X 0208 equivalents upon encoding.
ISO-2022-JP-2 is a multilingual extension of ISO-2022-JP, defined in RFC 1554 (dated 1993), which permits the following escape sequences in addition to the ISO-2022-JP ones. The ISO/IEC 8859 parts are 96-character sets which cannot be designated to G0, and are accessed from G2 using the 7-bit escape sequence form of the single-shift code SS2:
- ESC $ A to switch to GB 2312-1980 (2 bytes per character)
- ESC $ ( C to switch to KS X 1001-1992 (2 bytes per character)
- ESC $ ( D to switch to JIS X 0212-1990 (2 bytes per character)
- ESC . A to switch to ISO/IEC 8859-1 high part, Extended Latin 1 set (1 byte per character) [designated to G2]
- ESC . F to switch to ISO/IEC 8859-7 high part, Basic Greek set (1 byte per character) [designated to G2]
The JIS X 0213 standard, first published in 2000, defines an updated version of ISO-2022-JP, without the ISO-2022-JP-2 extensions, named ISO-2022-JP-3. The additions made by JIS X 0213 compared to the base JIS X 0208 standard resulted in a new registration being made for the extended JIS plane 1, while the new plane 2 received its own registration. The further additions to plane 1 in the 2004 edition of the standard resulted in an additional registration being added to a further revision of the profile, dubbed ISO-2022-JP-2004. In addition to the basic ISO-2022-JP designation codes, the following designations are recognized:
- ESC ( I to switch to JIS X 0201-1976 Kana set (1 byte per character)
- ESC $ ( O to switch to JIS X 0213-2000 Plane 1 (2 bytes per character)
- ESC $ ( P to switch to JIS X 0213-2000 Plane 2 (2 bytes per character)
- ESC $ ( Q to switch to JIS X 0213-2004 Plane 1 (2 bytes per character, ISO-2022-JP-2004 only)
Other 7-bit versions
ISO-2022-KR is defined in RFC 1557, dated 1993. It encodes ASCII and the Korean double-byte KS X 1001-1992, previously named KS C 5601-1987. Unlike ISO-2022-JP-2, it makes use of the Shift Out and Shift In characters to switch between them, after including ESC $ ) C once at the start of a line to designate KS X 1001 to G1.
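The Shift Out / Shift In mechanism can be sketched with a toy decoder. This is an illustration only: the function name is hypothetical, it assumes the designation appears at the very start of the data, and it borrows Python's `euc_kr` codec for the KS X 1001 table by setting the high bit on each byte pair, which relies on EUC-KR carrying the same set invoked over GR.

```python
HEADER = b"\x1b$)C"  # designates KS X 1001 to G1
SO, SI = 0x0E, 0x0F


def decode_iso2022_kr_sketch(data: bytes) -> str:
    if not data.startswith(HEADER):
        raise ValueError("missing ESC $ ) C designation")
    out = []
    shifted = False  # False: GL encodes G0 (ASCII); True: GL encodes G1
    pos = len(HEADER)
    while pos < len(data):
        byte = data[pos]
        if byte == SO:
            shifted, pos = True, pos + 1
        elif byte == SI:
            shifted, pos = False, pos + 1
        elif shifted and 0x21 <= byte <= 0x7E:
            pair = data[pos:pos + 2]
            if len(pair) != 2:
                raise ValueError("truncated double-byte character")
            # Same code, GR form: set the high bit and reuse the EUC-KR table.
            out.append(bytes(b | 0x80 for b in pair).decode("euc_kr"))
            pos += 2
        elif byte < 0x80:
            out.append(chr(byte))
            pos += 1
        else:
            raise ValueError("8-bit byte in a 7-bit encoding")
    return "".join(out)


sample = b"\x1b$)CA\x0e0!\x0fB"
print(decode_iso2022_kr_sketch(sample))
print(sample.decode("iso2022_kr"))  # built-in codec, for comparison
```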
ISO-2022-CN and ISO-2022-CN-EXT are defined in RFC 1922, dated 1996. They are 7-bit encodings making use both of the Shift Out and Shift In functions (to shift between G0 and G1), and of the 7-bit escape code forms of the single-shift functions SS2 and SS3 (to access G2 and G3). They support the character sets GB 2312 (for simplified Chinese) and CNS 11643 (for traditional Chinese).
The basic ISO-2022-CN profile uses ASCII as its G0 (shift in) set, and also includes GB 2312 and the first two planes of CNS 11643 (due to these two planes being sufficient to represent all traditional Chinese characters from common Big5, to which the RFC provides a correspondence in an appendix):
- ESC $ ) A to switch to GB 2312-1980 (2 bytes per character) [designated to G1]
- ESC $ ) G to switch to CNS 11643-1992 Plane 1 (2 bytes per character) [designated to G1]
- ESC $ * H to switch to CNS 11643-1992 Plane 2 (2 bytes per character) [designated to G2]
ISO-2022-CN-EXT additionally permits the following sets and planes:
- ESC $ ) E to switch to ISO-IR-165 (2 bytes per character) [designated to G1]
- ESC $ + I to switch to CNS 11643-1992 Plane 3 (2 bytes per character) [designated to G3]
- ESC $ + J to switch to CNS 11643-1992 Plane 4 (2 bytes per character) [designated to G3]
- ESC $ + K to switch to CNS 11643-1992 Plane 5 (2 bytes per character) [designated to G3]
- ESC $ + L to switch to CNS 11643-1992 Plane 6 (2 bytes per character) [designated to G3]
- ESC $ + M to switch to CNS 11643-1992 Plane 7 (2 bytes per character) [designated to G3]
The ISO-2022-CN-EXT profile further lists additional Guobiao standard graphical sets as being permitted, but conditional on their being assigned registered ISO 2022 escape sequences:
- GB 12345 in G1
- GB 7589 or GB 13131 in G2
- GB 7590 or GB 13132 in G3
The character after the ESC (for single-byte character sets) or ESC $ (for multi-byte character sets) specifies the type of character set and the working set that it is designated to. In the above examples, the character ( (0x28) designates a 94-character set to the G0 character set, whereas ), * or + (0x29–0x2B) designates to the G1–G3 character sets.
ISO-2022-KR and ISO-2022-CN are used less frequently than ISO-2022-JP, and are sometimes deliberately not supported due to security concerns. Notably, the WHATWG Encoding Standard used by HTML5 maps ISO-2022-KR, ISO-2022-CN and ISO-2022-CN-EXT (as well as HZ-GB-2312) to the "replacement" decoder, which maps all input to the replacement character (�), in order to prevent certain cross-site scripting and related attacks, which utilize a difference in encoding support between the client and server. Although the same security concern (allowing sequences of ASCII bytes to be interpreted differently) also applies to ISO-2022-JP and UTF-16, they could not be given this treatment due to being much more frequently used in deployed content.
A subset of ISO 2022 applied to 8-bit single-byte encodings is defined by ISO/IEC 4873, also published by Ecma International as ECMA-43. ISO/IEC 8859 defines 8-bit codes for ISO/IEC 4873 (or ECMA-43) level 1. ISO/IEC 4873 defines three levels of encoding:
- Level 1, which includes a C0 set, the ASCII G0 set, an optional C1 set and an optional single-byte (94-character or 96-character) G1 set. G0 is invoked over GL, and G1 is invoked over GR. Use of shift functions is not permitted.
- Level 2, which includes a (94-character or 96-character) single-byte G2 and/or G3 set in addition to a mandatory G1 set. Only the single-shift functions SS2 and SS3 are permitted (i.e. locking shifts are forbidden), and they invoke over the GL region (including 0x20 and 0x7F in the case of a 96-set). SS2 and SS3 must be available in C1 at 0x8E and 0x8F respectively. This minimal required C1 set for ISO 4873 is registered as ISO-IR-105.
- Level 3, which permits the GR locking-shift functions LS1R, LS2R and LS3R in addition to the single shifts, but otherwise has the same restrictions as level 2.
Earlier editions of the standard permitted non-ASCII assignments in the G0 set, provided that the ISO 646 invariant positions were preserved, that the other positions were assigned to spacing (not combining) characters, that 0x23 was assigned to either £ or #, and that 0x24 was assigned to either $ or ¤. For instance, the 8-bit encoding of JIS X 0201 is compliant with earlier editions. This was subsequently changed to fully specify the ISO 646:1991 IRV / ISO-IR No. 6 set (ASCII).
In cases where duplicate characters are available in different sets, the current edition of ISO/IEC 4873 / ECMA-43 only permits using these characters in the lowest numbered working set which they appear in. For instance, if a character appears in both the G1 set and the G3 set, it must be used from the G1 set. However, use from other sets is noted as having been permitted in earlier editions.
ISO/IEC 8859 defines complete encodings at level 1 of ISO/IEC 4873, and does not allow for use of multiple ISO/IEC 8859 parts together. It stipulates that ISO/IEC 10367 should be used instead for levels 2 and 3 of ISO/IEC 4873. ISO/IEC 10367:1991 includes G0 and G1 sets matching those used by the first 9 parts of ISO/IEC 8859 (i.e. those which existed as of 1991, when it was published), and some supplementary sets.
Character set designation escape sequences are used to identify or switch between versions during information interchange only if a further protocol requires it. In that case, the standard requires an "announcer" sequence specifying the level, followed by a complete set of escape sequences specifying the character set designations for C0, C1, G0, G1, G2 and G3, in that order (omitting the G2 and G3 designations for level 1). An F-byte of 0x7E denotes an empty set. Announcer sequences are as follows:
|ISO 4873 Level 1|
|ISO 4873 Level 2|
|ISO 4873 Level 3|
Extended Unix Code
Extended Unix Code (EUC) is an 8-bit variable-width character encoding system used primarily for Japanese, Korean, and simplified Chinese. It is based on ISO 2022, and only character sets which conform to the ISO 2022 structure can have EUC forms. Up to four coded character sets can be represented (in G0, G1, G2 and G3). The G0 set is invoked over GL, the G1 set is invoked over GR, and the G2 and G3 sets are (if present) invoked using the single shifts SS2 and SS3, which are used over GR (not GL). Other shift codes are not used.
Typically, G0 is used for an ISO 646-compliant single-byte coded character set such as ASCII, whereas G1 is used for a 94×94 coded character set represented in two bytes. The EUC-CN form of GB2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes (i.e. SS3 plus two bytes), whereas a single character in EUC-TW can take up to four bytes (i.e. SS2 plus three bytes). Sometimes ISO 646:KR (KS X 1003) or ISO 646:JP (the lower half of JIS X 0201) is used instead of ASCII for G0, meaning that 0x5C (backslash in US-ASCII) is often used to represent a Yen sign in EUC-JP and a Won sign in EUC-KR.
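Because the shift state in EUC never locks, the width of each character can be read off its first byte alone. The sketch below splits an EUC-JP byte string into per-character slices on that basis. The function name `split_euc_jp` is hypothetical, and the sketch assumes the usual EUC-JP layout in which SS2 is followed by one byte (from G2) and SS3 by two bytes (from G3); only the SS3 case is stated in the text above.

```python
from typing import List

SS2, SS3 = 0x8E, 0x8F


def split_euc_jp(data: bytes) -> List[bytes]:
    """Split EUC-JP bytes into one slice per character, based on the lead byte."""
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            width = 1          # G0: single-byte set invoked over GL
        elif lead == SS2:
            width = 2          # SS2 + one byte from G2 (assumed layout)
        elif lead == SS3:
            width = 3          # SS3 + two bytes from G3
        elif 0xA1 <= lead <= 0xFE:
            width = 2          # G1: two-byte set invoked over GR
        else:
            raise ValueError(f"unexpected lead byte 0x{lead:02X} at offset {i}")
        chunk = data[i:i + width]
        if len(chunk) < width:
            raise ValueError(f"truncated character at offset {i}")
        chars.append(chunk)
        i += width
    return chars


sample = b"A\xa4\xa2\x8e\xb1\x8f\xb0\xa1"
print([len(c) for c in split_euc_jp(sample)])  # [1, 2, 2, 3]
```

The sketch does not validate trailing bytes or map anything to Unicode; for actual decoding, Python's built-in `euc_jp` codec is the more appropriate tool.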
Comparison with other encodings
- Because ISO/IEC 2022's entire range of graphical character encodings can be invoked over GL, the available glyphs are not significantly limited by an inability to represent GR and C1, as in a system limited to 7-bit encodings. It accordingly enables the representation of a large set of characters in such a system. This 7-bit compatibility is now of little benefit beyond backwards compatibility with older systems, since the vast majority of modern computers use 8-bit bytes.
- Compared to Unicode, ISO/IEC 2022 sidesteps Han unification by using escape sequences to switch between discrete encodings for different East Asian languages. This avoids the issues associated with unification, such as difficulty supporting multiple CJK languages with their associated character variants in a single document and font.
- Since ISO/IEC 2022 is a stateful encoding, a program cannot jump into the middle of a block of text to search, insert or delete characters. This makes manipulation of the text cumbersome and slow compared to non-stateful encodings. Any jump into the middle of the text may require backing up to the previous escape sequence before the bytes that follow it can be interpreted.
- Due to the stateful nature of ISO/IEC 2022, the same character may be encoded in different character sets, each of which may be designated to any of G0 through G3 and invoked using single shifts or locking shifts to GL or GR. Consequently, characters can be represented in multiple ways, meaning that two visually identical and equivalent strings cannot be reliably compared for equality.
- Some systems, like DICOM and several e-mail clients, use a variant of ISO-2022 in addition to supporting several other encodings. This type of variation makes it difficult to portably transfer text between computer systems.
- UTF-1, the multi-byte Unicode transformation format compatible with ISO/IEC 2022, has various disadvantages in comparison with UTF-8, and switching from or to other charsets, as supported by ISO/IEC 2022, is typically unnecessary in Unicode documents.
- Because of its escape sequences, it is possible to construct attack byte sequences that round-trip from ISO/IEC 2022 to Unicode and back. Use of this encoding is thus treated as suspicious by malware protection suites.
- Concatenation can pose issues. Profiles such as ISO-2022-JP specify that the stream starts in the ASCII state and must end in the ASCII state. This is necessary to ensure that characters in concatenated ISO-2022-JP and/or ASCII streams will be interpreted in the correct set. As a consequence, if a stream that ends in a multi-byte character is concatenated with one that starts with a multi-byte character, a pair of escape sequences is generated, switching to ASCII and immediately away from it. However, as stipulated in Unicode Technical Report #36 ("Unicode Security Considerations"), pairs of ISO 2022 escape sequences with no characters between them should generate a replacement character ("�") to prevent them from being used to mask malicious sequences such as cross-site scripting. Implementing this measure, e.g. in Mozilla Thunderbird, has led to interoperability issues, with unexpected "�" characters being generated where two ISO-2022-JP streams have been concatenated.
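Two of the points above, statefulness and concatenation, can be made concrete with ISO-2022-JP. The sketch below is illustrative only: `active_set_at` is a hypothetical helper that recovers the G0 set in effect at a byte offset by scanning from the start of the stream, and it recognises only the four designations defined in RFC 1468. The second part uses Python's built-in `iso2022_jp` codec to show the back-to-back escape sequences that appear when two independently encoded streams are joined.

```python
# G0 designations used by ISO-2022-JP (RFC 1468).
DESIGNATIONS = {
    b"\x1b(B": "ASCII",
    b"\x1b(J": "JIS X 0201 Roman",
    b"\x1b$@": "JIS X 0208-1978",
    b"\x1b$B": "JIS X 0208-1983",
}


def active_set_at(data: bytes, offset: int) -> str:
    """Return the G0 set in effect for the byte at `offset`.

    The whole prefix has to be scanned, because the meaning of a byte
    depends on the most recent escape sequence before it.
    """
    state = "ASCII"  # ISO-2022-JP streams start in the ASCII state
    i = 0
    while i < offset:
        if data[i] == 0x1B:
            seq = data[i:i + 3]
            if seq not in DESIGNATIONS:
                raise ValueError(f"unsupported escape sequence at offset {i}")
            state = DESIGNATIONS[seq]
            i += 3
        else:
            i += 1
    return state


stream = b"A\x1b$BF|K\\\x1b(BZ"
print(active_set_at(stream, 0))   # ASCII
print(active_set_at(stream, 4))   # JIS X 0208-1983
print(active_set_at(stream, 11))  # ASCII

# Concatenating two separately encoded streams leaves an empty ASCII
# segment (ESC ( B immediately followed by ESC $ B) in the middle.
first = "日".encode("iso2022_jp")
second = "本".encode("iso2022_jp")
joined = first + second
direct = "日本".encode("iso2022_jp")
print(joined)
print(direct)
print(joined == direct)  # False: same text, different byte sequences
```

The bytes `F|` at offset 4 are part of a two-byte character only because of the preceding designation; in the ASCII state the same bytes would be the two characters `F` and `|`.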
- F, adjusted to the range 1–63, indicates which (upwardly compatible) revision of the immediately-following registration is needed, so that old systems know that they are old.
- For F bytes 0x40 (@), 0x41 (A) and 0x42 (
- See also, for instance, Printronix (2012), OKI® Programmer’s Reference Manual (PDF), p. 26 for a more recent system which uses ESC ( H to switch to ASCII from a DBCS.
- ECMA-35 (1994), Brief History
- ECMA-35 (1994), p. 4, definition 4.11
- ECMA-35 (1994), p. 5, definition 4.18
- See, for instance, ISO-IR-14 (1993), defining the G0 designation of the JIS X 0201 Roman set as ESC 2/8 4/10.
- ECMA-35 (1994), p. 5, section 5.1
- See, for instance, RFC 1468 (1993), defining the G0 designation of the JIS X 0201 Roman set as ESC ( J.
- ECMA-35 (1994), pp. 15–16, section 8.1
- ECMA-35 (1994), p. 7, section 6.2
- ECMA-35 (1994), p. 10, section 6.3.2
- ECMA-35 (1994), p. 4, definition 4.17
- ECMA-35 (1994), p. 4, definition 4.14
- ECMA-35 (1994), p. 28, section 13.1
- ECMA-35 (1994), p. 33, section 13.3.3
- ECMA-35 (1994), p. 11, section 6.4.3
- ISO-IR-208 (1999)
- ISO-IR-155 (1990)
- ISO-IR-164 (1992)
- ECMA-35 (1994), p. 10, section 6.3.3
- Google Inc. (2014). "ansi.go, line 134". ANSI escape sequence library for Go.
- ECMA-43 (1991), p. 5, section 7 ("Specification of the characters of the 8-bit code")
- ISO/IEC FDIS 8859-10 (1998), p. 3, section 6 ("Specification of the coded character set")
- ECMA-144 (2000), p. 3, section 6 ("Specification of the coded character set")
- ECMA-43 (1991), p. 19, annex C ("Composite graphic characters")
- ECMA-35 (1994), p. 10, section 6.4.1
- ECMA-35 (1994), p. 11, section 6.4.4
- ECMA-35 (1994), p. 11, section 6.4.2
- ISO-IR-104 (1985)
- ISO-IR-1 (1975)
- ECMA-35 (1994), p. 19, section 8.5.1
- ECMA-35 (1994), p. 19, section 8.5.2
- ECMA-43 (1991), p. 8, section 7.6 ("C1 set")
- ECMA-35 (1994), p. 29, section 13.2.1
- ECMA-35 (1994), p. 12, section 6.5.1
- ECMA-35 (1994), p. 12, section 6.5.2
- ISO-IR, p. 19, section 2.7 ("Single control functions")
- ECMA-35 (1994), p. 12, section 6.5.3
- Moy, Edward; Gildea, Stephen; Dickey, Thomas. "Controls beginning with ESC". XTerm Control Sequences.
- ECMA-35 (1994), p. 14, section 7.3, table 2
- ECMA-35 (1994), p. 17, section 8.3.1
- ECMA-35 (1994), p. 23, section 9.3.1
- ECMA-35 (1994), p. 19, section 8.4
- ECMA-35 (1994), p. 17, section 8.3.2
- ECMA-35 (1994), pp. 23–24, section 9.4
- ECMA-35 (1994), p. 27, section 11.1
- ECMA-35 (1994), p. 17, section 8.3.3
- ECMA-35 (1994), p. 47, annex B
- ISO-IR, p. 2, section 1 ("Introduction")
- ISO-IR, p. 10, section 2.2 ("94-Character graphic character set with second Intermediate byte")
- ECMA-35 (1994), p. 36, section 14.4
- ECMA-35 (1994), pp. 35–36, section 14.3.2
- ISO/IEC 10646 (2017), pp. 19–20, section 12.4 ("Identification of control function set")
- ECMA-35 (1994), pp. 37–41, section 15.2
- ECMA-35 (1994), p. 34, section 14.2.2
- ECMA-35 (1994), p. 34, section 14.2.3
- ECMA-35 (1994), pp. 36–37, section 14.5
- ISO-IR, p. 20, section 2.8.1 ("Coding systems with Standard return")
- ECMA-35 (1994), pp. 41–42, section 15.4
- ISO-IR, p. 21, section 2.8.2 ("Coding systems without Standard return")
- ECMA-35 (1994), p. 41, section 15.3
- ISO/IEC 10646 (2017), p. 19, section 12.2 ("Identification of a UCS encoding scheme")
- ISO/IEC 10646 (2017), pp. 18–19, section 12.1 ("Purpose and context of identification")
- ISO-IR-196 (1996)
- ISO-IR-192 (1996)
- ISO-IR-195 (1996)
- ISO/IEC 10646 (2017), p. 20, section 12.5 ("Identification of the coding system of ISO/IEC 2022")
- RFC 1468 (1993)
- "Code Page Identifiers". Windows Dev Center. Microsoft.
- WHATWG Encoding Standard, section 12.2 ("ISO-2022-JP")
- Chang, Hye-Shik. "Modules/cjkcodecs/_codecs_iso2022.c, line 1122". cPython source tree. Python Software Foundation.
- "codecs — Codec registry and base classes § Standard Encodings". Python 3.7.4 documentation. Python Software Foundation.
- RFC 1554 (1993)
- RFC 2237 (1999)
- RFC 1557 (1993)
- "KS X 1001:1992" (PDF).
- ISO-IR-149 (1988)
- RFC 1922 (1996)
- WHATWG Encoding Standard, section 4.2 ("Names and labels"), anchor "replacement"
- WHATWG Encoding Standard, section 14.1 ("replacement")
- WHATWG Encoding Standard, section 2 ("Security background")
- ISO/IEC FDIS 8859-10 (1998), p. 1, section 1 ("Scope")
- ECMA-144 (2000), p. 1, section 1 ("Scope")
- ECMA-43 (1991), pp. 9–10, section 8 ("Levels")
- ISO-IR-105 (1985)
- ECMA-43 (1985), pp. 7–11, section 7.3 ("The G0 set")
- ECMA-43 (1991), pp. 6–8, section 7.4 ("G0 set")
- ECMA-43 (1991), p. 11, section 10.3 ("Identification of a version")
- ECMA-43 (1991), p. 23, annex E ("Main differences between the second edition (1985) and the present (third) edition of this ECMA Standard")
- ECMA-43 (1991), p. 10, section 9.2 ("Unique coding of characters")
- van Wingen, Johan W (1999). "8. Code Extension, ISO 2022 and 2375, ISO 4873 and 10367". Character sets. Letters, tokens and codes. Terena.
- ECMA-43 (1991), pp. 10–11, section 10 ("Identification of version and level")
- "DICOM ISO 2022 variation".
- Davis, Mark; Suignard, Michel (2014-09-19). "3.6.2 Some Output For All Input". Unicode Technical Report #36: Unicode Security Considerations (revision 15). Unicode Consortium.
- Sivonen, Henri (2018-12-17). "(UNSUBMITTED DRAFT) No U+FFFD Generation for Zero-Length ASCII-State Content between ISO-2022-JP Escape Sequences" (PDF).
Standards and registry indices cited
- ECMA (1994). ECMA-35: Character Code Structure and Extension Techniques (PDF) (ECMA Standard) (6th ed.).
- ECMA (1985). ECMA-43: 8-Bit Coded Character Set Structure and Rules (PDF) (ECMA Standard) (2nd ed.).
- ECMA (1991). ECMA-43: 8-Bit Coded Character Set Structure and Rules (PDF) (ECMA Standard) (3rd ed.).
- ECMA (2000). ECMA-144: 8-Bit Single-Byte Coded Graphic Character sets: Latin Alphabet No. 6 (PDF) (ECMA Standard) (3rd ed.).
- ISO/IEC JTC 1/SC 2 (1998-02-12). ISO/IEC FDIS 8859-10: Information Technology — 8-bit single-byte coded graphic character sets — Part 10: Latin alphabet No. 6 (PDF) (Final Draft International Standard).
- ISO/IEC JTC 1/SC 2 (2017). ISO/IEC 10646: Information technology — Universal Coded Character Set (UCS) (ISO Standard) (5th ed.). ISO.
- ISO-IR: ISO/IEC International Register of Coded Character Sets To Be Used With Escape Sequences (PDF) (Registry Index). ITSCJ/IPSJ.
- van Kesteren, Anne. WHATWG Encoding Standard (WHATWG Living Standard). WHATWG.
Registered code sets cited
- ISO/TC 97/SC 2 (1975-12-01). ISO-IR-1: The set of control characters of the ISO 646 (PDF). ITSCJ/IPSJ.
- Japanese Industrial Standards Committee (1975-12-01). ISO-IR-14: The Japanese Roman graphic set of characters (PDF). ITSCJ/IPSJ.
- ISO/TC97/SC2/WG-7; ECMA (1985-08-01). ISO-IR-104: Minimum C0 set for ISO 4873 (PDF). ITSCJ/IPSJ.
- ISO/TC97/SC2/WG-7; ECMA (1985-08-01). ISO-IR-105: Minimum C1 Set for ISO 4873 (PDF). ITSCJ/IPSJ.
- Korea Bureau of Standards (1988-10-01). ISO-IR-149: Korean Graphic Character Set for Information Interchange (KS C 5601:1987) (PDF). ITSCJ/IPSJ.
- ISO/IEC/JTC1/SC2/WG3 (1990-04-16). ISO-IR-155: Basic Box-Drawings Set (PDF). ITSCJ/IPSJ.
- CCITT (1992-07-13). ISO-IR-164: Hebrew Supplementary Set of Graphic Characters (PDF). ITSCJ/IPSJ.
- ECMA (1996-04-22). ISO-IR-192: UCS Transformation Format (UTF-8), implementation level 3, without standard return (PDF). ITSCJ/IPSJ.
- ECMA (1996-04-22). ISO-IR-195: UCS Transformation Format (UTF-16), implementation level 3, without standard return (PDF). ITSCJ/IPSJ.
- ECMA (1996-04-22). ISO-IR-196: UCS Transformation Format (UTF-8), with standard return (PDF). ITSCJ/IPSJ.
- National Standards Authority of Ireland (1999-12-07). ISO-IR-208: Ogham coded character set for information interchange (PDF). ITSCJ/IPSJ.
Internet Requests For Comment cited
- Murai, J.; Crispin, M.; van der Poel, E. (1993). "RFC 1468: Japanese Character Encoding for Internet Messages". Requests for Comments. IETF. doi:10.17487/rfc1468.
- Ohta, M.; Handa, K. (1993). "RFC 1554: ISO-2022-JP-2: Multilingual Extension of ISO-2022-JP". Requests for Comments. IETF. doi:10.17487/rfc1554.
- Choi, U.; Chon, K.; Park, H. (1993). "RFC 1557: Korean Character Encoding for Internet Messages". Requests for Comments. IETF. doi:10.17487/rfc1557.
- Zhu, HF.; Hu, DY.; Wang, ZG.; Kao, TC.; Chang, WCH.; Crispin, M. (1996). "RFC 1922: Chinese Character Encoding for Internet Messages". Requests for Comments. IETF. doi:10.17487/rfc1922.
- Tamaru, K. (1997). "RFC 2237: Japanese Character Encoding for Internet Messages". Requests for Comments. IETF. doi:10.17487/rfc2237.
- ISO/IEC 2022:1994
- ISO/IEC 2022:1994/Cor 1:1999
- ECMA-35, equivalent to ISO/IEC 2022 and freely downloadable.
- International Register of Coded Character Sets to be Used with Escape Sequences, a full list of assigned character sets and their escape sequences
- History of Character Codes in North America, Europe, and East Asia from 1999, rev. 2004
- Ken Lunde's CJK.INF: a document on encoding Chinese, Japanese, and Korean (CJK) languages, including a discussion of the various variants of ISO/IEC 2022.