Traditionally, characters have been 8 bits wide, which limits the number of available glyphs to 256. By using 16 or 32 bits to represent a character, the number of available glyphs increases considerably.
16 and 32 bit characters are referred to as wide characters. The Unicode Consortium and the International Organisation for Standardisation (ISO) determine which bit pattern corresponds to which glyph. The ISO standard is called ISO-10646 or UCS (Universal Character Set). The Unicode standard is simply called Unicode. Both standards have an identical mapping from bit pattern to glyph.
UCS has two versions: UCS-2 and UCS-4. UCS-2 is 16 bits, UCS-4 32.
The first 65536 positions of UCS-4 are identical to UCS-2. The first 256
positions of UCS-2 are identical to ISO-8859-1. The first 128 positions of
ISO-8859-1 are US-ASCII.
Note: Some Unicode aware software is restricted to the first 65536 positions!
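The layering described above can be checked with a few lines of Python. This is only a sketch; the variable names and the sample character (the musical G clef) are my own choice and serve as illustration only.

```python
# Every byte value 0..255, decoded as ISO-8859-1, yields the code point
# with the same number: ISO-8859-1 is the first 256 positions of UCS.
latin1_text = bytes(range(256)).decode("iso-8859-1")
latin1_matches = all(ord(ch) == i for i, ch in enumerate(latin1_text))

# The first 128 byte values decode to the same characters as US-ASCII.
ascii_text = bytes(range(128)).decode("ascii")
ascii_matches = ascii_text == latin1_text[:128]

# A sample character outside the first 65536 positions
# (MUSICAL SYMBOL G CLEF, U+1D11E); 16 bit only software cannot hold it.
g_clef = "\U0001D11E"
beyond_16_bits = ord(g_clef) > 0xFFFF

print(latin1_matches, ascii_matches, beyond_16_bits)
```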
UTF-16 is a way to translate from 32 bits to 16 bit sequences and back. UTF-8 is a way to translate from 32 bits to 8 bit sequences and back (UTF-8 is used by this website).
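To make the difference between the two translations visible, the sketch below prints the byte sequences for a few characters. The helper name `show_encodings` and the sample characters are assumptions for illustration; the encodings themselves come from Python's built-in codecs.

```python
def to_hex(data):
    """Format bytes as space separated hex pairs."""
    return " ".join(f"{b:02x}" for b in data)


def show_encodings(ch):
    """Return code point, UTF-8 bytes and UTF-16 (big endian) bytes of one character."""
    return {
        "code point": f"U+{ord(ch):04X}",
        "utf-8": to_hex(ch.encode("utf-8")),
        "utf-16": to_hex(ch.encode("utf-16-be")),
    }


# ASCII, ISO-8859-1, a 16 bit character, and one beyond the first 65536 positions.
for sample in ["A", "\u00e9", "\u20ac", "\U0001D11E"]:
    print(show_encodings(sample))
```

Characters in the US-ASCII range take one byte in UTF-8, and a character beyond the first 65536 positions needs two 16 bit units (a surrogate pair) in UTF-16.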
Recently I ran into a file called UnicodeData.txt. It contains a lot of information about each character. The fields in the file are separated by semicolons (';'). What's missing in a lot of documentation is a numbered list of the fields:
0 | Code point |
1 | Character name |
2 | General Category |
3 | Canonical Combining Classes |
4 | Bidirectional Category |
5 | Character Decomposition Mapping |
6 | Decimal digit value |
7 | Digit value |
8 | Numeric value |
9 | Mirrored |
10 | Unicode 1.0 Name |
11 | 10646 comment field |
12 | Uppercase Mapping |
13 | Lowercase Mapping |
14 | Titlecase Mapping |
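The numbered list maps directly onto a split on ';'. The sketch below parses one line into a dictionary keyed by the field names from the list. The function name `parse_unicode_data_line` is hypothetical, and the sample line is the entry for U+0041 as I know it from UnicodeData.txt; check it against your own copy of the file.

```python
# Field names in the order of the numbered list (index 0..14).
FIELD_NAMES = [
    "Code point",
    "Character name",
    "General Category",
    "Canonical Combining Classes",
    "Bidirectional Category",
    "Character Decomposition Mapping",
    "Decimal digit value",
    "Digit value",
    "Numeric value",
    "Mirrored",
    "Unicode 1.0 Name",
    "10646 comment field",
    "Uppercase Mapping",
    "Lowercase Mapping",
    "Titlecase Mapping",
]


def parse_unicode_data_line(line):
    """Split one UnicodeData.txt line into a dict keyed by field name."""
    values = line.rstrip("\n").split(";")
    return dict(zip(FIELD_NAMES, values))


sample = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
record = parse_unicode_data_line(sample)

# The code point and the case mappings are hexadecimal strings.
print(int(record["Code point"], 16), record["Character name"],
      record["General Category"], record["Lowercase Mapping"])
```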
Some personal notes here.
More info here.
And the Debian font section.
Some info on running UTF-8 on the console
The best way to tell a browser that a file is UTF-8 is by putting the charset in the HTTP response header:
HTTP/1.1 200 OK
Date: Sat, 11 Dec 2004 14:28:20 GMT
Server: Apache/1.3.26 (Unix) Debian GNU/Linux PHP/4.1.2
Last-Modified: Thu, 12 Aug 2004 10:01:13 GMT
ETag: "2a0c0-1441-411b3fe9"
Accept-Ranges: bytes
Content-Length: 5185
Connection: close
Content-Type: text/html; charset=UTF-8
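The part of the response that matters is the charset parameter of the Content-Type header. As a small sketch of how a client could pick it out, the snippet below uses the header parser from Python's standard library. The function name `charset_of` is hypothetical, and no network access is involved; it only parses a header value.

```python
from email.message import Message


def charset_of(content_type):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset in lower case.
    return msg.get_content_charset()


with_charset = charset_of("text/html; charset=UTF-8")
without_charset = charset_of("text/html")
print(with_charset, without_charset)
```

When the parameter is missing, the browser has to guess the encoding, which is what the server settings in this section are meant to avoid.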
This can be achieved in several ways.
One way is to use meta files. You need a meta file, File_Name.meta, for each UTF-8 document. By default, meta files reside in a '.web' subdirectory. The location of the meta files can be set in the httpd config file. The example below enables meta files and sets the meta file directory to '.':
MetaFiles on
#MetaDir .web
MetaDir .
# MetaSuffix: specifies the file name suffix for the file containing the
# meta information.
MetaSuffix .meta
You also need to enable meta files by loading the CERN meta module, e.g.:
LoadModule cern_meta_module /usr/lib/apache/1.3/mod_cern_meta.so
Each meta file should contain the following line:
Content-Type: text/html; charset=UTF-8
Unless it is a text file, in which case it should say:
Content-Type: text/plain; charset=UTF-8
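Creating a meta file for every document by hand gets tedious, so here is a rough sketch that writes one. The function name `write_meta_file` is hypothetical, and deciding between text/html and text/plain by a '.txt' extension is my own assumption; adapt it to your files. The defaults follow the '.web' directory and '.meta' suffix mentioned above.

```python
import os
import tempfile


def write_meta_file(document_path, meta_dir=".web", meta_suffix=".meta"):
    """Write the meta file for one document and return the meta file's path."""
    directory, file_name = os.path.split(document_path)
    # Assumption: only '.txt' files are served as plain text.
    content_type = "text/plain" if file_name.endswith(".txt") else "text/html"
    target_dir = os.path.normpath(os.path.join(directory, meta_dir))
    os.makedirs(target_dir, exist_ok=True)
    meta_path = os.path.join(target_dir, file_name + meta_suffix)
    with open(meta_path, "w", encoding="ascii") as handle:
        handle.write(f"Content-Type: {content_type}; charset=UTF-8\n")
    return meta_path


# Demonstration in a throwaway directory, with the meta directory set to '.'.
with tempfile.TemporaryDirectory() as doc_root:
    meta_path = write_meta_file(os.path.join(doc_root, "index.html"), meta_dir=".")
    meta_name = os.path.basename(meta_path)
    with open(meta_path, encoding="ascii") as handle:
        meta_line = handle.read()

print(meta_name, meta_line.strip())
```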
Another way is to use a '.htaccess' file. Htaccess files must be enabled in your config, e.g.:
AccessFileName .htaccess
#AllowOverride None
AllowOverride FileInfo
The '.htaccess' file should contain the following line:
AddDefaultCharset UTF-8
You can also set the charset for a whole directory or alias in the httpd config:
<Directory /usr/lib/cgi-bin/utf>
  AllowOverride None
  Options ExecCGI FollowSymLinks
  AddDefaultCharset UTF-8
  Order deny,allow
  Allow from all
</Directory>
To set it server wide, just put an 'AddDefaultCharset UTF-8' line in your global config.
Firefox used to have a default charset option. This is no longer there.
There is a crude workaround though. From bug 910192 - Get rid of intl.charset.default as a localizable pref and deduce the fallback from the locale:
turning on View -> Character Encoding -> Auto-Detect -> Japanese causes it to detect UTF-8 text and render it properly without needing a BOM in the file.
There is a Unicode-enabled version of pine called Alpine.
Debian supplies a package.
Replacing aspell with the script below makes the spell check multilingual:
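#!/bin/bash
# Ask for a language, then spell check the file given as the first argument.
echo -e "Select language:\n"
select name in EN NL DE FR
do
  # $REPLY holds the raw input, so a number or a letter can be used.
  if [ "$REPLY" = 2 ] || [ "$REPLY" = "N" ] || [ "$REPLY" = "n" ]
  then
    aspell -l nl_NL.UTF-8 -H -c "${1}"
    break
  elif [ "$REPLY" = 3 ] || [ "$REPLY" = "D" ] || [ "$REPLY" = "d" ]
  then
    aspell -l de_DE.UTF-8 -H -c "${1}"
    break
  elif [ "$REPLY" = 4 ] || [ "$REPLY" = "F" ] || [ "$REPLY" = "f" ]
  then
    aspell -l fr_FR.UTF-8 -H -c "${1}"
    break
  else
    # Anything else, including EN, uses the default language.
    aspell -H -c "${1}"
    break
  fi
done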
On my box en_GB.UTF-8 is the default. '-H' keeps aspell from nagging about HTML tags. Adapt this to suit your needs.
There is a bug in forwarding mail.
When you select forward as attachment, alpine ignores the message's charset and sets it to US-ASCII instead.
There is a workaround;
You can check the charset by opening 'postponed-messages' in alpine. It should display all non-ASCII characters correctly.
Put the following line in your smb.conf:
unix charset = UTF-8
If your kernel supports cifs, mount using cifs rather than smbfs and add the following mount option:
,iocharset=utf8
If your kernel does not support cifs, use smbfs and add the following mount options:
,iocharset=utf8,codepage=cp850
This of course means that only glyphs in CP-850 are supported.