Internationalizing Domain Names in Applications

Internationalizing Domain Names in Applications (IDNA) is a mechanism defined in 2003 for handling domain names containing non-ASCII characters. The DNS cannot handle such names directly, so web browsers and other user applications must convert them to a suitable ASCII form; IDNA specifies how this conversion is to be done. ICANN has issued guidelines for the use of IDNA, and it is already possible to register .jp domains using this system. Other top-level domain registries intend to start accepting registrations in 2004.

An IDNA-enabled application can convert between the ASCII and non-ASCII representations of a domain name, using the ASCII form where it is needed (such as for DNS lookup) while presenting the more readable non-ASCII form to users. Applications that do not support IDNA cannot handle domain names with non-ASCII characters, but can still access such domains if given the (usually rather cryptic) ASCII equivalent.

Mozilla 1.4 and Netscape 7.1 are among the first applications to support IDNA.

ToASCII and ToUnicode

The conversions between ASCII and non-ASCII forms of a domain name are accomplished by algorithms called ToASCII and ToUnicode. These algorithms are not applied to the domain name as a whole, but rather to individual labels. For example, if the domain name is www.example.com, then the labels are www, example and com, and ToASCII or ToUnicode would be applied to each of these three separately.
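The label-by-label processing can be sketched in Python as follows. This is only an illustration: the function name domain_to_ascii is invented here, the per-label work is delegated to the ToASCII function in Python's standard encodings.idna module, and the sketch assumes that labels are separated only by the ordinary full stop.

```python
from encodings.idna import ToASCII  # standard-library ToASCII (returns bytes)


def domain_to_ascii(domain: str) -> str:
    """Apply ToASCII to each dot-separated label and rejoin the results.

    Illustrative helper: assumes "." is the only label separator.
    """
    labels = domain.split(".")
    return ".".join(ToASCII(label).decode("ascii") for label in labels)


print("www.example.com".split("."))        # the three labels
print(domain_to_ascii("www.example.com"))  # ASCII labels pass through unchanged
```

Because each label is converted independently, a domain name can freely mix ASCII and non-ASCII labels.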

The details of these two algorithms are complex, and are specified in the RFCs linked at the end of this article. The following gives an overview of their behaviour.

ToASCII leaves unchanged any label which is already in ASCII, except that it will fail if the label is unsuitable for the DNS. If given a label containing at least one non-ASCII character, ToASCII will apply the Nameprep algorithm (which converts the label to lowercase and performs other normalization) and will then translate the result to ASCII using Punycode before prepending the 4-character string "xn--". This 4-character string is called the ACE prefix, where ACE means ASCII Compatible Encoding, and is used to distinguish Punycode-encoded labels from ordinary ASCII labels. Note that the ToASCII algorithm can fail in a number of ways; for example, the final string could exceed the 63-character limit on DNS labels. A label on which ToASCII fails cannot be used in an internationalized domain name.
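The overview above can be approximated by the following Python sketch. It is a simplification under stated assumptions rather than a conforming implementation: the names to_ascii_sketch, ACE_PREFIX and MAX_LABEL_LENGTH are invented for illustration, Nameprep and Punycode are taken from Python's standard library (encodings.idna.nameprep and the "punycode" codec), and several checks required by the RFCs are omitted, such as rejecting input that already begins with the ACE prefix.

```python
from encodings.idna import nameprep  # standard-library Nameprep

ACE_PREFIX = "xn--"      # illustrative constant
MAX_LABEL_LENGTH = 63    # DNS limit on the length of one label


def to_ascii_sketch(label: str) -> str:
    """Simplified ToASCII for a single label (not a full implementation)."""
    if label.isascii():
        # ASCII labels are left unchanged.
        result = label
    else:
        # Nameprep lowercases and normalizes; it raises UnicodeError
        # if the label contains prohibited characters.
        prepared = nameprep(label)
        if prepared.isascii():
            # Normalization may already have produced pure ASCII.
            result = prepared
        else:
            encoded = prepared.encode("punycode").decode("ascii")
            result = ACE_PREFIX + encoded
    # The final string must fit in a DNS label.
    if not 0 < len(result) <= MAX_LABEL_LENGTH:
        raise ValueError("label is empty or longer than 63 characters")
    return result


print(to_ascii_sketch("example"))
print(to_ascii_sketch("Bücher"))
```

In this sketch a failure is reported by raising an exception, which corresponds to the statement that a label on which ToASCII fails cannot be used.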

ToUnicode reverses the action of ToASCII, stripping off the ACE prefix and applying the Punycode decode algorithm. It does not reverse the Nameprep processing, since that is merely a normalization and is by nature irreversible. Unlike ToASCII, ToUnicode always succeeds, because it simply returns the original string if decoding would fail. In particular, this means that ToUnicode has no effect on a string that does not begin with the ACE prefix.
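The "always succeeds" behaviour can be illustrated with a short Python sketch. The names to_unicode_sketch and ACE_PREFIX are invented for this example, decoding relies on the standard "punycode" codec, and the sketch leaves out the verification step in which the full algorithm re-encodes the result and compares it with the input.

```python
ACE_PREFIX = "xn--"  # illustrative constant


def to_unicode_sketch(label: str) -> str:
    """Simplified ToUnicode: never raises, returns the input on failure."""
    if not label.lower().startswith(ACE_PREFIX):
        # No ACE prefix: the label is returned untouched.
        return label
    try:
        encoded = label[len(ACE_PREFIX):].encode("ascii")
        return encoded.decode("punycode")
    except UnicodeError:
        # Decoding failed, so fall back to the original string.
        return label


print(to_unicode_sketch("xn--bcher-kva"))
print(to_unicode_sketch("example"))
print(to_unicode_sketch("xn--!!!"))  # not valid Punycode, returned as-is
```

Returning the original string instead of raising an error lets an application pass any label through the function without first checking whether it is encoded.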

Example

As an example of how IDNA works, suppose the domain to be encoded is Bücher.ch. This has two labels, Bücher and ch. The second label is pure ASCII, and so is left unchanged. The first label is processed by Nameprep to give bücher, and then by Punycode to give bcher-kva, and then has xn-- prepended to give xn--bcher-kva. The final domain suitable for use with the DNS is therefore xn--bcher-kva.ch.
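This example can be reproduced with Python's built-in "idna" codec, which implements the RFC 3490 conversions for whole domain names. The snippet is a quick check rather than a reference implementation, and the variable names are arbitrary.

```python
domain = "Bücher.ch"

# Encode the whole domain name; the codec handles each label in turn.
ascii_form = domain.encode("idna").decode("ascii")
print(ascii_form)  # xn--bcher-kva.ch

# Decoding recovers the Nameprep-normalized form, not the original spelling.
round_trip = ascii_form.encode("ascii").decode("idna")
print(round_trip)  # bücher.ch
```

The round trip yields bücher.ch with a lowercase b, showing that the Nameprep step is not reversed on decoding.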

External links

RFC 3490 (IDNA)
RFC 3491 (Nameprep)
RFC 3492 (Punycode)
ICANN Guidelines for the Implementation of Internationalized Domain Names
Internet Mail Consortium IDNA test tool (includes Perl source code)
IANA e-mails explaining the final choice of ACE prefix
GNU Libidn (an implementation of IDNA)