Charactersets

The term "natural language" normally refers to a language spoken by people.

The term "script" refers to the transformation of segments of the spoken language into a symbolic representation in logical (and graphic) form. A script normally contains a set of such symbols, referred to as "characters".

The term "characterset" refers to how the characters of a script are represented in an electronic file. In the file, each character is stored as a binary value (one or more bytes) that the characterset defines for it.
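
As a minimal sketch of what "a binary value defined by the characterset" means, the Python snippet below looks up the bytes that three charactersets assign to a few characters. The helper name byte_values and the choice of sample characters are illustrative assumptions, not part of any standard; the characterset names are the ones used as examples in this document.

```python
# Characterset names taken from the examples in this document.
CHARSETS = ["windows-1252", "iso-8859-1", "gb2312"]


def byte_values(char: str) -> dict:
    """Map each characterset name to the hex bytes it assigns to `char`.

    The value is None when the characterset has no value for the character.
    (Hypothetical helper, for illustration only.)
    """
    result = {}
    for name in CHARSETS:
        try:
            result[name] = char.encode(name).hex()
        except UnicodeEncodeError:
            result[name] = None
    return result


# "A", "e with acute accent" (U+00E9) and a Chinese character (U+4E2D).
for ch in ["A", "\u00e9", "\u4e2d"]:
    print(f"U+{ord(ch):04X}", byte_values(ch))
```

The letter "A" gets the same single byte in all three charactersets, while the Chinese character has a two-byte value in gb2312 and no value at all in the two Western European charactersets.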

It is very important to find out which characterset a document uses, and to state this information in the document itself. This ensures that the document can be transferred correctly to users in other countries and/or language regions. It also ensures that machines (parsers, servers, editors, production tools) will be able to interpret the information correctly.
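
To illustrate why this matters, the sketch below interprets the same two bytes under two different charactersets. The function name interpret and the sample bytes are assumptions chosen for the illustration; the point is only that the bytes alone do not say which characters were meant.

```python
def interpret(data: bytes, charset: str) -> str:
    """Turn raw bytes into characters using the named characterset."""
    return data.decode(charset)


def code_points(text: str) -> list:
    """List the Unicode code points of `text`, e.g. ['U+4E2D']."""
    return [f"U+{ord(c):04X}" for c in text]


# Two bytes as they might be stored in a file (illustrative sample).
raw = b"\xd6\xd0"

for charset in ["gb2312", "windows-1252"]:
    text = interpret(raw, charset)
    print(charset, "->", len(text), "character(s)", code_points(text))
```

Read as gb2312, the two bytes form one Chinese character; read as windows-1252, they form two accented Latin letters. Without the characterset information, a receiving tool has no reliable way to choose between the two readings.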

Note that many editors and production tools add this information automatically; however, not all tools do. It is therefore important to know enough to verify that the characterset information is conveyed correctly.
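
One way such a verification could look is sketched below. It assumes the document declares its characterset near the top in a form such as charset=... (as in HTML) or encoding="..." (as in XML); the pattern, the 1024-byte window, and the function names are all hypothetical choices for this sketch, and real formats have more precise rules.

```python
import codecs
import re

# Assumed declaration forms: charset=NAME or encoding="NAME".
_DECLARATION = re.compile(
    rb"""(?:charset|encoding)\s*=\s*["']?([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def declared_charset(raw: bytes):
    """Return the characterset name declared in the first 1024 bytes, or None."""
    match = _DECLARATION.search(raw[:1024])
    return match.group(1).decode("ascii") if match else None


def check_declaration(raw: bytes) -> str:
    """Classify a document as 'missing', 'unknown name', 'mismatch' or 'plausible'."""
    name = declared_charset(raw)
    if name is None:
        return "missing"
    try:
        codecs.lookup(name)  # is the name known to Python at all?
    except LookupError:
        return "unknown name"
    try:
        raw.decode(name)  # do the bytes form valid values in that characterset?
    except UnicodeDecodeError:
        return "mismatch"
    return "plausible"


print(check_declaration(b'<meta charset="gb2312">\xd6\xd0'))
print(check_declaration(b"<p>no declaration here</p>"))
```

The result "plausible" is deliberately weak: some charactersets, iso-8859-1 among them, accept every possible byte, so a successful decode shows only that the declaration is not obviously wrong. A visual check of the decoded text is still needed.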

Examples of characterset names are "windows-1252", "iso-8859-1", and "gb2312". Refer to the Characterset list for a full listing.
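
Characterset names are often written in several spellings. As a rough sketch, Python's own codec registry can be used to check whether a name is recognized and whether two spellings denote the same characterset. Note that this registry is a stand-in chosen for the example and is not the Characterset list referred to above; the function names are illustrative.

```python
import codecs


def is_known_charset(name: str) -> bool:
    """True if Python's codec registry recognizes the characterset name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


def same_charset(a: str, b: str) -> bool:
    """True if both names resolve to the same codec in Python's registry."""
    return codecs.lookup(a).name == codecs.lookup(b).name


for name in ["windows-1252", "iso-8859-1", "gb2312", "no-such-charset"]:
    print(name, is_known_charset(name))

print(same_charset("ISO-8859-1", "latin1"))
print(same_charset("windows-1252", "iso-8859-1"))
```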