Internationalization Overview

Original Author(s): Markus Gylling

Natural Language

Convey Natural Language: base

The lang and xml:lang attributes are used to convey the natural language of the presentation.

If the whole document uses the same language, the language attributes can be put on the root element:


  <html lang="en" xml:lang="en" >
   <head>...</head>
   <body>...</body>
  </html>

Convey Natural Language: different languages inline

If the document uses different languages inline, the language attributes are set on the element where the language changes:


  <html lang="en" xml:lang="en" >
   <head>...</head>
   <body>
     [... English content ...]
     [... suddenly a Spanish paragraph:]
     <p lang="es" xml:lang="es">Tengo que acabar esto...</p>
     [... English content continues ...]
   </body>
  </html>

In the example above, only the one paragraph is regarded as Spanish. Because lang="en" is set on the root element, English applies to all of its descendants; the lang="es" attribute overrides that for the paragraph it is set on and for anything inside that paragraph.

In other words, an element inherits the language specified for its parent unless a different language is specified on the element itself.
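
The inheritance rule can be sketched in a few lines of Python. This is only an illustration, not part of any DAISY tool: the function name effective_languages and the sample markup are assumptions made for the example. It walks the element tree and passes the parent's language down to each child unless the child declares its own.

```python
import xml.etree.ElementTree as ET

# Namespace-qualified name that ElementTree uses for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(element, inherited=None):
    """Yield (tag, language) pairs, applying parent-to-child inheritance."""
    # A language set on the element itself overrides the inherited one.
    lang = element.get(XML_LANG, element.get("lang", inherited))
    yield element.tag, lang
    for child in element:
        yield from effective_languages(child, lang)


# Hypothetical sample document: English with one Spanish paragraph.
doc = ET.fromstring(
    '<html lang="en" xml:lang="en"><body>'
    '<p>Hello</p>'
    '<p lang="es" xml:lang="es">Hola</p>'
    '<p>Goodbye</p>'
    '</body></html>'
)

languages = list(effective_languages(doc))
for tag, lang in languages:
    print(tag, lang)
```

Only the second p element reports "es"; every other element reports the "en" inherited from the root.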

Language codes are available in Language Code Listing.

Convey Natural Language: dialects

If the presentation uses a regional variety (dialect) of a language, this can be specified by adding a country code as a suffix to the language code.

In this example, the natural language of the presentation is American English:


  <html lang="en-us" xml:lang="en-us" >
   <head>...</head>
   [...]

In this example, the natural language of the presentation is British English:


  <html lang="en-gb" xml:lang="en-gb" >
   <head>...</head>
   [...]

The language and country codes must always be separated by a hyphen.

Note that the codes for language and country are case insensitive.

A commonly used convention is lowercase for the language code and uppercase for the country code, that is:

    
  <html lang="en-GB" xml:lang="en-GB" >
   <head>...</head>
   [...]

Country codes are available in Country Code Listing.
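
Because the codes are case insensitive, tools that compare or store them often normalize the case first. The following Python sketch shows one way to do that; the helper names normalize_language_tag and same_language_tag are hypothetical, and only the simple "language" and "language-country" forms described above are handled.

```python
def normalize_language_tag(tag):
    """Return the tag in the conventional casing, e.g. 'EN-gb' -> 'en-GB'."""
    parts = tag.strip().split("-")
    if len(parts) > 2 or not all(parts):
        # Longer or malformed tags are outside the scope of this sketch.
        raise ValueError(f"unsupported language tag: {tag!r}")
    language = parts[0].lower()
    if len(parts) == 1:
        return language
    country = parts[1].upper()
    return f"{language}-{country}"


def same_language_tag(a, b):
    """Compare two tags without regard to case."""
    return a.strip().lower() == b.strip().lower()


print(normalize_language_tag("EN-gb"))
print(same_language_tag("en-us", "EN-US"))
```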

Charactersets

The term "natural language" normally refers to the name of a spoken language.

The term "script" refers to the transformation of segments of the spoken language into a symbolic representation in logical (and graphic) form. A script normally contains a set of such symbols, referred to as "characters".

The term "characterset" refers to how the characters of a script are represented in an electronic file. In such a file each character has a binary value that is defined by the characterset.
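
As a small illustration of that last point, the Python sketch below encodes the same character with three different charactersets and records the resulting bytes in hexadecimal. The choice of character and of charactersets is arbitrary and made only for this example.

```python
text = "é"  # U+00E9, LATIN SMALL LETTER E WITH ACUTE

# Map each characterset name to the bytes it assigns to the character.
encodings = {
    charset: text.encode(charset).hex()
    for charset in ("iso-8859-1", "windows-1252", "utf-8")
}

for charset, hex_bytes in encodings.items():
    print(charset, hex_bytes)
```

The two single-byte charactersets store the character as the byte e9, while utf-8 stores it as the two bytes c3 a9. A reader that assumes the wrong characterset will therefore show the wrong character or report an error.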

It is very important to find out which characterset is used, and to put this information into the document(s). This ensures that the document can be transferred correctly to users from other countries and/or language regions. This also ensures that machines (parsers, servers, editors, production tools) will be able to interpret the information correctly.

Note that many editors and production tools add this information automatically; however, not all tools do. It is therefore important to know enough to verify that the characterset information is conveyed correctly.
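
One rough check is to try decoding the file's bytes with the characterset that the document claims to use. The Python sketch below does this; the function name decodes_as and the sample text are assumptions made for the example.

```python
def decodes_as(data, charset):
    """Return True if the bytes can be decoded with the named characterset."""
    try:
        data.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # LookupError means Python knows no characterset by that name.
        return False
    return True


# Hypothetical file content, saved by a tool that writes windows-1252.
saved_bytes = "café".encode("windows-1252")

print(decodes_as(saved_bytes, "windows-1252"))
print(decodes_as(saved_bytes, "utf-8"))
print(decodes_as(saved_bytes, "iso-8859-1"))
```

A failed decode shows that the declared characterset is wrong. A successful decode is weaker evidence: a single-byte characterset such as iso-8859-1 accepts any byte sequence, so the result should still be checked by eye.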

Examples of characterset names are "windows-1252", "iso-8859-1", "gb2312", etc. Refer to the Characterset list for a listing.

Conveying Characterset in XML and XHTML

By default, XML documents are assumed to use the Unicode characterset (UTF-8 or UTF-16 encoding).

If you have not explicitly used software that encodes characters into Unicode, or if your document uses characters outside "us-ascii" (that is, beyond a-z, A-Z, 0-9, and basic punctuation), then you need to explicitly define which characterset you are using.

These are the two rules for XHTML 1.0:

  1. for HTML browser display purposes, the <meta http-equiv ... /> element is used.
  2. for XML processors, the encoding attribute of the XML declaration is used.

Both #1 and #2 should be added to the document. (Strictly speaking, #2 is required only if the encoding is something other than UTF-8 or UTF-16, but it is a good idea to add both anyway.)

Characterset declaration example

In the following example, the document is written in Thai and uses the characterset named "TIS-620". The characterset name is put in the XML declaration (first line) and in a meta element that is a child of head.


 <?xml version="1.0" encoding="TIS-620"?>
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
 <html xmlns="http://www.w3.org/1999/xhtml" lang="th" xml:lang="th">
   <head>
     ...
     <meta
       http-equiv="Content-type"
       content="text/html; charset=TIS-620"/>
     ...
   </head>
   <body>
   ...
   </body>
 </html>
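
Since the characterset is stated in two places, the two statements can drift apart. The Python sketch below reads both declarations from a document and checks that they agree. It is an illustration only: the function names are hypothetical, the patterns cover only the syntax shown in the example above, and the document is assumed to use an ASCII-compatible encoding so that the declarations themselves can be read.

```python
import re

XML_ENCODING = re.compile(r'<\?xml[^>]*\bencoding\s*=\s*["\']([^"\']+)["\']')
META_CHARSET = re.compile(r'charset\s*=\s*([A-Za-z0-9._-]+)', re.IGNORECASE)


def declared_charsets(document_bytes):
    """Return (xml_declaration_encoding, meta_charset); None where absent."""
    # Both declarations are plain ASCII, so a lossy decode is enough here.
    text = document_bytes.decode("ascii", errors="replace")
    xml_match = XML_ENCODING.search(text)
    meta_match = META_CHARSET.search(text)
    return (
        xml_match.group(1) if xml_match else None,
        meta_match.group(1) if meta_match else None,
    )


def declarations_agree(document_bytes):
    """True only if both declarations exist and name the same characterset."""
    xml_enc, meta_enc = declared_charsets(document_bytes)
    if xml_enc is None or meta_enc is None:
        return False
    return xml_enc.lower() == meta_enc.lower()


# Hypothetical sample, shortened from the example above.
sample = (
    b'<?xml version="1.0" encoding="TIS-620"?>\n'
    b'<html><head>'
    b'<meta http-equiv="Content-type" content="text/html; charset=TIS-620"/>'
    b'</head><body></body></html>'
)

print(declared_charsets(sample))
print(declarations_agree(sample))
```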

This page was last edited by PVerma on Friday, August 6, 2010 23:18
Text is available under the terms of the DAISY Consortium Intellectual Property Policy, Licensing, and Working Group Process.