
Universal character set

The Universal Character Set (UCS) is a coded character set defined by the international standard ISO/IEC 10646. It maps more than a hundred thousand abstract characters, each identified by an unambiguous name, to numeric code points.

Since 1991, the Unicode Consortium has been working with ISO to develop the Unicode Standard and ISO/IEC 10646 in tandem. The repertoire, character names, and code points of Version 2.0 of the Unicode Standard are identical to those of ISO/IEC 10646-1:1993 with its first seven published amendments. After Unicode 3.0 was published in February 2000, the new and updated characters were brought into the UCS via ISO/IEC 10646-1:2000.

The UCS has over 1.1 million code points, but only the first 65,536 (the Basic Multilingual Plane, or BMP) are in common use. The remainder are reserved for purposes such as representing ancient Egyptian hieroglyphs or rare Chinese characters. Many code points, even in the BMP, are deliberately left unassigned, either to allow for future expansion or to minimize conflicts with other encoding forms.
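The figures in this paragraph can be checked with a short sketch. It assumes the code space is divided into 17 planes of 65,536 code points each, ending at hexadecimal 10FFFF; the names plane_of and in_bmp are hypothetical, chosen only for illustration.

```python
# Sketch: size of the UCS code space and the plane a code point belongs to.
PLANE_SIZE = 0x10000        # 65,536 code points per plane
MAX_CODE_POINT = 0x10FFFF   # highest code point in the UCS


def plane_of(code_point: int) -> int:
    """Return the plane number of a code point (plane 0 is the BMP)."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("code point outside the UCS range")
    return code_point // PLANE_SIZE


def in_bmp(code_point: int) -> bool:
    """Return True if the code point lies in the Basic Multilingual Plane."""
    return plane_of(code_point) == 0


total_code_points = MAX_CODE_POINT + 1           # 1,114,112 ("over 1.1 million")
total_planes = total_code_points // PLANE_SIZE   # 17
```

For example, plane_of(0x13000), a code point in the Egyptian hieroglyph range, returns 1, so that character lies outside the BMP.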

Table of contents
1 Encoding Forms of the Universal Character Set
2 Citing the Universal Character Set
3 Correlation to Unicode
4 External link
5 Related ISO

Encoding Forms of the Universal Character Set

ISO 10646 defines several character encoding forms for the Universal Character Set. The simplest is UCS-2, which uses a single code value between 0 and 65535 for each character and allows that value to be represented as exactly two bytes (one 16-bit word). UCS-2 thereby permits a binary representation of every code point in the BMP, as long as the code point represents a character. UCS-2 by itself cannot represent code points outside the BMP. Those code points can instead be represented by pairs of special code values from what is called the S (Special) Zone of the BMP, each pair consisting of what is called an RC-element from the high-half zone (hexadecimal D800 to DBFF) and an RC-element from the low-half zone (hexadecimal DC00 to DFFF).

In Unicode terminology these code values are called high surrogates and low surrogates respectively, and the encoding form that extends UCS-2 with such pairs is called UTF-16. UTF-16 is therefore not simply another name for UCS-2: the two agree for characters in the BMP, but only UTF-16 can represent the code points beyond it.
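The pairing of a high-half and a low-half element can be sketched as follows. This is a minimal illustration of the commonly documented surrogate arithmetic, not text taken from the standard; the function names are hypothetical.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point outside the BMP into (high, low) 16-bit code values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need a pair")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go to the high-half zone
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go to the low-half zone
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

As a usage example, to_surrogate_pair(0x1D11E) yields (0xD834, 0xDD1E), and from_surrogate_pair(0xD834, 0xDD1E) gives back 0x1D11E.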

Another encoding form is UCS-4, which uses a single code value between 0 and, theoretically, hexadecimal 7FFFFFFF for each character (although the UCS stops at 10FFFF, and ISO/IEC 10646 has stated that all future character assignments will also fall within that range) and allows that value to be represented as exactly four bytes (one 32-bit word). UCS-4 thereby permits a binary representation of every code point in the UCS, including those outside the BMP. As in UCS-2, every encoded character has a fixed length in bytes, which makes text simple to manipulate, but UCS-4 requires twice as much storage as UCS-2. ISO/IEC 10646
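The storage trade-off between the two fixed-width forms can be illustrated with a small sketch. Big-endian byte order is an assumption made here for concreteness, and the function names are hypothetical.

```python
import struct


def encode_ucs4_be(text: str) -> bytes:
    """Encode each character as one 32-bit big-endian word (4 bytes)."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)


def encode_ucs2_be(text: str) -> bytes:
    """Encode each character as one 16-bit big-endian word (2 bytes).

    Raises ValueError for characters outside the BMP, which a single
    16-bit code value cannot represent.
    """
    words = []
    for ch in text:
        code_point = ord(ch)
        if code_point > 0xFFFF:
            raise ValueError("character outside the BMP")
        words.append(struct.pack(">H", code_point))
    return b"".join(words)
```

For the three-character string "abc", encode_ucs2_be returns 6 bytes and encode_ucs4_be returns 12 bytes, which shows the doubling in storage.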

Occasionally, articles about Unicode will mistakenly refer to UCS-2 as "UCS-16". There is no UCS-16; the authors who make this error usually intended to refer to UCS-2 or UTF-16.

Citing the Universal Character Set

ISO 10646 is a general, informal citation for the ISO/IEC 10646 family of standards, and is acceptable in most prose. Although Unicode is a separate standard, the term Unicode is used just as often, informally, when discussing the UCS. However, any normative reference to the UCS as a publication should cite a particular part and version, using the form ISO/IEC 10646-{part}:{year}, for example ISO/IEC 10646-1:1993.
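The normative citation form can be produced and checked mechanically. The sketch below assumes that the part is a decimal number and the year has four digits; the helper names are hypothetical.

```python
import re

# Pattern for the form ISO/IEC 10646-{part}:{year}
# (assumes a numeric part and a four-digit year).
_CITATION = re.compile(r"^ISO/IEC 10646-(?P<part>\d+):(?P<year>\d{4})$")


def format_citation(part: int, year: int) -> str:
    """Build a normative citation string for a given part and year."""
    return f"ISO/IEC 10646-{part}:{year}"


def parse_citation(text: str):
    """Return (part, year) for a normative citation, or None otherwise."""
    match = _CITATION.match(text)
    if match is None:
        return None
    return int(match["part"]), int(match["year"])
```

Here format_citation(1, 1993) gives "ISO/IEC 10646-1:1993", while parse_citation("ISO 10646") returns None because the informal form names neither a part nor a year.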

Correlation to Unicode

External link

Related ISO

Related ISO standards from the List of ISO standards are: ISO 2022, ISO 6429, ISO 14651

See also Unicode, UTF-16, UTF-8,