1 Representing Characters

After numbers, the next most common data type processed by computers is characters, the elementary symbols that make up our written languages.

Human languages are varied, arbitrary, and complex; written languages even more so. This makes character encoding a much more fascinating subject than it might seem at first glance.

Let’s begin with a bit of vocabulary.

1.1 Characters

A character is a distinct linguistic symbol in some language. For example, normal written English includes the characters LATIN SMALL LETTER G, DIGIT THREE, and COMMA (to use their Unicode names). Some symbols that arguably aren’t characters (e.g., LINE FEED) are treated as characters for the purposes of encoding; see below. And some symbols that really aren’t linguistic characters (like emoji) can also be treated as characters.

1.2 Code points and character encodings

A code point is a number that maps to a particular character. A character encoding is a set of code points together with the characters they map to.
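The mapping between characters and code points is easy to explore interactively. The following is a minimal sketch in Python (chosen here only for illustration), using the built-in functions ord and chr and the standard unicodedata module; the helper name describe is made up for this example.

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name) for a single character."""
    return ord(ch), unicodedata.name(ch)

# The three example characters: g, 3 and the comma.
for ch in "g3,":
    code, name = describe(ch)
    print(f"{ch!r}: code point {code} (hex {code:X}), {name}")

# chr goes the other way, from code point back to character.
assert chr(103) == "g"
```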

1.3 Binary or byte encodings

A binary encoding (also sometimes called a byte encoding) maps bit patterns to code points. Some binary encodings are simple; others are quite complex.
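To make the distinction between a code point and its bit pattern concrete, here is a small Python sketch (an illustration only; the helper bit_patterns is a hypothetical name). It takes one code point, 233 (LATIN SMALL LETTER E WITH ACUTE), and shows the bits that three different binary encodings produce for it.

```python
def bit_patterns(text, encoding):
    """Encode text and return each resulting byte as an 8-bit binary string."""
    return [format(b, "08b") for b in text.encode(encoding)]

# One code point (233, hex E9), three different bit patterns.
ch = "\u00e9"
for encoding in ("latin-1", "utf-8", "utf-16-be"):
    print(f"{encoding:10s} {bit_patterns(ch, encoding)}")
```

The simple encoding (Latin-1) stores the code point directly in one byte; the other two rearrange or pad the bits.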

1.4 Glyphs and typefaces

A glyph is a visual representation of a character. A typeface (or font, but there’s a technical distinction) is a collection of glyphs, normally in a distinct visual style. The glyphs for a given character may look quite different in different typefaces; consider the LATIN SMALL LETTER G in Times New Roman versus Helvetica. For aesthetic reasons, some typefaces will include multiple glyphs for the same character, or combined glyphs representing multiple characters.

2 ASCII

The American Standard Code for Information Interchange (ASCII) was the first internationally-recognized character encoding. In ASCII the byte encoding is direct: each code point is an unsigned seven-bit integer, and that integer is what gets stored. ASCII is illustrated in figure 1.

Figure 1: The ASCII code chart.

In this chart, the 128 code points that make up ASCII are shown as a table, reading from code 0 (NUL) to code 127 (DEL). The row and column indices are in hexadecimal and are read row then column. For example, the 'G' character is number 47 (hexadecimal) which is 71 (decimal).
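Reading the chart amounts to a little hexadecimal arithmetic: the row index is the high hex digit and the column index is the low one. A minimal Python sketch of that lookup (chart_code is a hypothetical helper, not part of any library):

```python
def chart_code(row, column):
    """Combine the chart's row and column indices (0-7 and 0-15) into a code."""
    return row * 16 + column

code = chart_code(0x4, 0x7)
print(code, chr(code))  # 71 G
```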

Some points to note about ASCII:

ASCII is a seven-bit coding standard, so it can only represent 128 characters.
Thirty-three of the ASCII code points are used for “control characters” (the first two rows of the table plus DEL). ASCII was originally conceived for teletype machines and these codes were literally used to control the machines. Few of them are used today; however, codes 0A (LF, line feed) and 0D (CR, carriage return) are still used to indicate line breaks in text documents.
Characters not shown in this table (e.g., accented characters) are not representable in ASCII, so other coding schemes have to be used where they are required.
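These three points can be checked with a few lines of Python (a sketch for illustration; fits_in_ascii is a hypothetical helper name). It counts the control codes, looks up LF and CR, and tests whether a string can be represented in ASCII at all.

```python
# Control characters: codes 0-31 (the first two chart rows) plus 127 (DEL).
control_codes = [c for c in range(128) if c < 32 or c == 127]
print(len(control_codes))        # 33

# Line feed and carriage return are codes 10 and 13 (hex 0A and 0D).
print(ord("\n"), ord("\r"))      # 10 13

def fits_in_ascii(text):
    """Return True if every character in text has an ASCII code point."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

# The second string ends in an accented e, which ASCII cannot represent.
print(fits_in_ascii("cafe"), fits_in_ascii("caf\u00e9"))
```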

3 Zillions of code pages

For most of the world, ASCII is not sufficient since it can’t represent characters necessary for most languages. This led to an explosion of completely incompatible eight-bit “code pages” for encoding different characters. One popular code page, Windows 1252, is illustrated in figure 2. Note that the first 128 characters are the same as ASCII; the second 128 are used to encode European accented characters and various symbols.
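The incompatibility is easy to demonstrate: the same byte means different things under different code pages. The Python sketch below compares Windows 1252 with Windows 1251, a Cyrillic code page picked here only as a second example; decode_byte is a hypothetical helper.

```python
def decode_byte(value, code_page):
    """Interpret a single byte value under the given code page."""
    return bytes([value]).decode(code_page)

# Below 128 both code pages agree with ASCII, and so with each other.
assert decode_byte(0x47, "cp1252") == decode_byte(0x47, "cp1251") == "G"

# Above 127 they diverge: the same byte stands for a different character.
for page in ("cp1252", "cp1251"):
    ch = decode_byte(0xE9, page)
    print(f"{page}: byte E9 -> U+{ord(ch):04X}")
```

Under Windows 1252 the byte E9 is an accented Latin e; under Windows 1251 it is a Cyrillic letter. A file written with one code page and read with the other turns into nonsense.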

Figure 2: The Windows 1252 code page.

For Asian languages, which have thousands of characters, many eight-bit code pages are required, along with clever mechanisms for switching from one code page to another.

4 Unicode

In 1991 the Unicode Consortium was formed to define a single character encoding covering the characters of all human writing systems. The current version of Unicode allows for the encoding of 1,114,112 (2²⁰ + 2¹⁶) different characters. As of Unicode version 6.0, more than 109,000 characters have actually been assigned code points, covering almost all written languages in current use. Despite repeated lobbying attempts, the Unicode Consortium has stubbornly refused to include the Klingon alphabet in its standard.
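The size of the Unicode code space is simple arithmetic, checked here in Python for illustration. The only library name used is sys.maxunicode, which reports the largest code point the Python interpreter supports.

```python
import sys

code_space = 2**20 + 2**16
print(code_space)  # 1114112

# Code points are numbered from 0, so the largest is one less (hex 10FFFF).
assert code_space - 1 == 0x10FFFF == sys.maxunicode
```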

The first 256 Unicode code points map exactly onto the ISO 8859-1 (ISO Latin-1) character encoding, which includes ASCII as its first 128 characters.
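This correspondence can be verified mechanically. The Python sketch below decodes every possible byte value as ISO Latin-1 and confirms that byte n always becomes Unicode code point n, and that the first half agrees with ASCII.

```python
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# Byte n decodes to Unicode code point n, for every n from 0 to 255.
assert [ord(ch) for ch in decoded] == list(range(256))

# The first 128 are exactly ASCII.
assert all_bytes[:128].decode("ascii") == decoded[:128]
```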

Internally, most programming languages and operating systems (including MATLAB) represent Unicode using either 16-bit or 32-bit unsigned integers that map directly to Unicode code points. Of course, a single 16-bit integer can’t represent every Unicode code point, only the first 65,536, but these cover most characters in everyday use. An alternative is to use the UTF-16 binary encoding, which represents each of the remaining code points as a pair of 16-bit values.
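For the curious, here is roughly how UTF-16 handles code points that don't fit in 16 bits: it splits them across two 16-bit units drawn from a reserved range. This Python sketch follows the standard UTF-16 rule; the function name utf16_units is hypothetical, and the sketch assumes its input is a valid code point outside the reserved range D800 to DFFF.

```python
def utf16_units(code_point):
    """Return the list of 16-bit units UTF-16 uses for one code point."""
    if code_point < 0x10000:
        return [code_point]            # fits in a single 16-bit unit
    offset = code_point - 0x10000      # at most 20 bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return [high, low]

# A Latin letter, the euro sign, and a musical symbol beyond 16 bits.
for cp in (0x0067, 0x20AC, 0x1D11E):
    units = " ".join(f"{u:04X}" for u in utf16_units(cp))
    print(f"U+{cp:04X} -> {units}")
```

Only the last of the three samples needs two units.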

4.1 Unicode binary encodings

For storage or transmission over networks, we need a binary encoding of Unicode code points. There are two major encodings used in practice. UTF-16 uses 16 bits for most characters and 32 bits for the rest. UTF-8 uses 8, 16, 24, or 32 bits per character. One advantage of UTF-8 is that its first 128 characters are encoded exactly as in ASCII, so any ASCII text is also valid UTF-8 and takes up only 8 bits per character. The accented characters of ISO Latin-1 (code points 128 to 255) take 16 bits each in UTF-8.
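A quick way to get a feel for the two encodings is to measure how many bits each one spends on a few sample characters. The Python sketch below does this; encoded_bits is a hypothetical helper, and the "utf-16-be" codec name is used so that no byte-order mark is added to the count.

```python
def encoded_bits(ch, encoding):
    """Number of bits the encoding uses for a single character."""
    return 8 * len(ch.encode(encoding))

# An ASCII letter, an accented letter, the euro sign, and a musical symbol.
samples = ["g", "\u00e9", "\u20ac", "\U0001d11e"]
for ch in samples:
    utf8 = encoded_bits(ch, "utf-8")
    utf16 = encoded_bits(ch, "utf-16-be")
    print(f"U+{ord(ch):04X}: UTF-8 {utf8} bits, UTF-16 {utf16} bits")
```

For text that is mostly ASCII, UTF-8 is the more compact of the two; for text dominated by characters such as the euro sign's neighbours, UTF-16 can be smaller.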

4.2 Read more!

If you’re doing any serious programming involving text, you must understand Unicode and character encodings. A good starting point is Joel Spolsky's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

5 Characters in MATLAB

MATLAB strings are actually vectors of integers. Internally they are represented using the sixteen-bit UTF-16 mapping of Unicode. Externally (on-screen and in saved files) they are stored or displayed using the default character encoding of the current platform — although this varies with operating system and MATLAB version!

You can find out the default character encoding for your platform using the MATLAB command feature('DefaultCharacterSet'). On my computer (Mac OS X version 10.7, MATLAB R2011a v7.12) this gives ISO-8859-1.

The MATLAB function char translates from Unicode numeric values to characters. For instance, char(97) returns 'a' and char([97 98 99]) returns 'abc'.

The MATLAB function double translates from characters to MATLAB double-precision numeric values. So double('a') returns 97 and double('abc') returns [97 98 99].

Similarly, you can use sprintf and fprintf to translate from numbers to characters, or vice versa. Calling fprintf('%i', 'a') (treating the character “a” as an integer) results in 97 being displayed to the console window; writing s = sprintf('%s', [97 98 99]); (treating the vector [97 98 99] as a string) sets s to the string 'abc'.

— Greg Phillips

ASCII, Unicode, and MATLAB strings

CSE101 Fall 2011