
Unicode

General

Traditionally, characters have been 8 bits wide, restricting the number of available glyphs to 256. By using 16 or 32 bits to represent a character, the number of available glyphs increases considerably.

16- and 32-bit characters are referred to as wide characters. The Unicode Consortium and the International Organisation for Standardisation (ISO) determine which bit pattern corresponds to which glyph. The ISO standard is called ISO/IEC 10646 or UCS (Universal Character Set). The Unicode standard is simply called Unicode. Both standards have an identical mapping from bit pattern to glyph.

UCS has two versions: UCS-2 and UCS-4. UCS-2 uses 16 bits per character, UCS-4 uses 32.
The first 65536 positions of UCS-4 are identical to UCS-2. The first 256 positions of UCS-2 are identical to ISO-8859-1. The first 128 positions of ISO-8859-1 are US-ASCII.
Note: Some Unicode-aware software is restricted to the first 65536 positions!
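
The overlap between these character sets can be checked with a few lines of Python. This is only a sketch that is not part of the original notes; it relies on Python's built-in codecs, and the helper name in_first_65536 is made up for illustration.

```python
# Every ISO-8859-1 byte value decodes to the code point with the same number.
for value in range(256):
    assert ord(bytes([value]).decode("iso-8859-1")) == value

# The first 128 of those are plain US-ASCII.
for value in range(128):
    assert bytes([value]).decode("ascii") == bytes([value]).decode("iso-8859-1")


def in_first_65536(char: str) -> bool:
    """Return True if the character's code point fits in 16 bits."""
    return ord(char) < 0x10000


# U+00E9 and U+20AC fit in 16 bits, U+1F600 does not.
for char in ("\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(char):04X}", in_first_65536(char))
```

Software that is limited to the first 65536 positions cannot handle the last character in the loop.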

UTF-16 is a way to translate 32-bit values to sequences of one or two 16-bit units and back. UTF-8 is a way to translate 32-bit values to sequences of one to four 8-bit bytes and back (UTF-8 is used by this website).
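
To make the difference between the encodings concrete, here is a minimal Python sketch that prints the byte sequences for a few characters. It is an illustration, not part of the original notes; the function name show_encodings is hypothetical, and Python's "utf-32-be" codec stands in for UCS-4.

```python
def show_encodings(char: str) -> dict:
    """Return the hex byte sequences of one character in three encodings."""
    return {
        "code point": f"U+{ord(char):04X}",
        "UTF-8": char.encode("utf-8").hex(" "),
        "UTF-16": char.encode("utf-16-be").hex(" "),   # big-endian, no BOM
        "UCS-4": char.encode("utf-32-be").hex(" "),    # big-endian, no BOM
    }


# An ASCII letter, a Latin-1 letter, the euro sign and an emoji.
for char in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(show_encodings(char))
```

The ASCII letter takes one byte in UTF-8, while the last character needs four bytes in UTF-8 and two 16-bit units in UTF-16.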

Misc

Recently I ran into a file called UnicodeData.txt. It contains a lot of information. The fields in the file are separated by semicolons (';'). What's missing in a lot of documentation is a numbered list;

0 Code point
1 Character name
2 General Category
3 Canonical Combining Classes
4 Bidirectional Category
5 Character Decomposition Mapping
6 Decimal digit value
7 Digit value
8 Numeric value
9 Mirrored
10 Unicode 1.0 Name
11 10646 comment field
12 Uppercase Mapping
13 Lowercase Mapping
14 Titlecase Mapping
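
The numbered list above maps directly onto a split on ';'. The following Python sketch is an illustration only: the names FIELD_NAMES and parse_line are made up, and the sample line is the entry for U+0041 as I know it from UnicodeData.txt.

```python
# Field names in the order of the numbered list above (index = field number).
FIELD_NAMES = [
    "Code point",
    "Character name",
    "General Category",
    "Canonical Combining Classes",
    "Bidirectional Category",
    "Character Decomposition Mapping",
    "Decimal digit value",
    "Digit value",
    "Numeric value",
    "Mirrored",
    "Unicode 1.0 Name",
    "10646 comment field",
    "Uppercase Mapping",
    "Lowercase Mapping",
    "Titlecase Mapping",
]


def parse_line(line: str) -> dict:
    """Split one UnicodeData.txt line into a field-name -> value mapping."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(fields)}")
    return dict(zip(FIELD_NAMES, fields))


sample = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
record = parse_line(sample)
for number, (name, value) in enumerate(record.items()):
    print(number, name, repr(value))
```

Empty fields come back as empty strings, so field 13 (Lowercase Mapping) of the sample is '0061' and field 12 is ''.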

Some personal notes here.
More info here.

Unicode fonts

And the Debian font section.

UTF-8 on the console

Some info on running UTF-8 on the console

Apache and Unicode

The best way to tell a browser that a file is UTF-8 is by putting the charset in the HTTP response header;

HTTP/1.1 200 OK
Date: Sat, 11 Dec 2004 14:28:20 GMT
Server: Apache/1.3.26 (Unix) Debian GNU/Linux PHP/4.1.2
Last-Modified: Thu, 12 Aug 2004 10:01:13 GMT
ETag: "2a0c0-1441-411b3fe9"
Accept-Ranges: bytes
Content-Length: 5185
Connection: close
Content-Type: text/html; charset=UTF-8

This can be achieved in several ways;

Single file

For this you need a meta file, File_Name.meta, for each UTF-8 document. By default, meta files reside in a '.web' subdirectory. The location of the meta files can be set in the httpd config file. The example below enables meta files and sets the meta file directory to '.';

MetaFiles on

#MetaDir .web
MetaDir .

# MetaSuffix: specifies the file name suffix for the file containing the
# meta information.
MetaSuffix .meta

You also need to enable meta files by loading the CERN meta module. Eg;

LoadModule cern_meta_module /usr/lib/apache/1.3/mod_cern_meta.so

Each meta file should contain the following line;

Content-Type: text/html; charset=UTF-8

Unless it is a text file, in which case it should say;

Content-Type: text/plain; charset=UTF-8
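
Writing one meta file per document by hand gets tedious, so the files can be generated. The Python sketch below is only an illustration under a few assumptions: the function name write_meta_files and the CONTENT_TYPES table are made up, and the meta file is assumed to be named after the full document name plus the suffix (index.html gets index.html.meta), placed in the MetaDir relative to the document's directory.

```python
import tempfile
from pathlib import Path

# Assumption: only these two document types need a meta file.
CONTENT_TYPES = {".html": "text/html", ".txt": "text/plain"}


def write_meta_files(directory, meta_dir=".", suffix=".meta"):
    """Create one meta file per .html/.txt document in 'directory'."""
    directory = Path(directory)
    target = directory / meta_dir            # mirrors the MetaDir setting
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for doc in sorted(directory.iterdir()):  # sorted() snapshots the listing
        content_type = CONTENT_TYPES.get(doc.suffix)
        if not doc.is_file() or content_type is None:
            continue
        meta = target / (doc.name + suffix)  # mirrors the MetaSuffix setting
        meta.write_text(f"Content-Type: {content_type}; charset=UTF-8\n",
                        encoding="ascii")
        written.append(meta)
    return written


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "index.html").write_text("<p>caf\u00e9</p>", encoding="utf-8")
    Path(tmp, "notes.txt").write_text("caf\u00e9", encoding="utf-8")
    for meta in write_meta_files(tmp):
        print(meta.name, "->", meta.read_text().strip())
```

With the default '.web' location, pass meta_dir=".web" instead.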

Directory (tree)

For this you need a '.htaccess' file. Htaccess files should be enabled in your config. Eg;

AccessFileName .htaccess

#AllowOverride None
AllowOverride FileInfo

The '.htaccess' file should contain the following line;

AddDefaultCharset UTF-8

Alias

For this you just set the charset for the whole aliased directory;

<Directory /usr/lib/cgi-bin/utf>
AllowOverride None
Options ExecCGI FollowSymLinks
AddDefaultCharset UTF-8
Order deny,allow
Allow from all
</Directory>

Global

Just put an 'AddDefaultCharset UTF-8' in your global config.
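
Whichever method is used, the result shows up in the Content-Type response header. As a rough way to check a header value, the Python sketch below reads the charset back out of it. The function name charset_from_content_type is hypothetical; the parsing itself is left to the standard library's email.message.Message, which returns the charset in lower case.

```python
from email.message import Message


def charset_from_content_type(header_value: str):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

The first call reports the charset; the second returns None, which is what a server without any of the settings above would typically give.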

Make Firefox behave

Firefox used to have a default charset option. It is no longer there, but there is a crude workaround. From 910192 - Get rid of intl.charset.default as a localizable pref and deduce the fallback from the locale;
turning on View -> Character Encoding -> Auto-Detect -> Japanese causes it to detect UTF-8 text and render it properly without needing a BOM in the file.

Debian

Pine

There is a Unicode-enabled version of pine called Alpine.

Debian supplies a package.

Multilingual spellcheck

Replacing aspell with the script below makes the spell check multilingual;

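#!/bin/bash
# Ask for a language, then spell-check the file given as the first argument.

echo -e "Select language:\n"
select name in EN NL DE FR
do
	# $REPLY holds what was typed: a menu number or a letter.
	if [ "$REPLY" = 2 ] || [ "$REPLY" = "N" ] || [ "$REPLY" = "n" ]
	then
		aspell -l nl_NL.UTF-8 -H -c "$1"
		break
	elif [ "$REPLY" = 3 ] || [ "$REPLY" = "D" ] || [ "$REPLY" = "d" ]
	then
		aspell -l de_DE.UTF-8 -H -c "$1"
		break
	elif [ "$REPLY" = 4 ] || [ "$REPLY" = "F" ] || [ "$REPLY" = "f" ]
	then
		aspell -l fr_FR.UTF-8 -H -c "$1"
		break
	else
		# Anything else (including 1 for EN) uses the default language.
		aspell -H -c "$1"
		break
	fi
done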

On my box en_GB.UTF-8 is the default. '-H' keeps aspell from nagging about HTML tags. Adapt this to suit your needs.

Forward bug workaround

There is a bug in forwarding mail.
When you select forward as attachment, alpine will ignore the message's charset and set it to US-ASCII instead.

There is a workaround;

You can check the charset by opening 'postponed-messages' in alpine. It should display all non-ASCII characters correctly.

Samba

Put the following line in your smb.conf;

unix charset = UTF-8

Mounting smb

If your kernel supports cifs, mount using cifs rather than smbfs and add the following mount option;

,iocharset=utf8

If your kernel does not support cifs, use smbfs and add the following mount options;

,iocharset=utf8,codepage=cp850

This of course means that only glyphs in CP-850 are supported.
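
Whether a given file name survives the CP-850 restriction can be tested before copying it to the share. The Python sketch below is only an illustration; the function name fits_codepage and the example file names are made up, and it relies on Python's built-in cp850 codec.

```python
def fits_codepage(name: str, codepage: str = "cp850") -> bool:
    """Return True if every character in 'name' exists in the code page."""
    try:
        name.encode(codepage)
    except UnicodeEncodeError:
        return False
    return True


# U+00E9 (e with acute) is in CP-850, U+20AC (euro sign) is not.
for filename in ("caf\u00e9.txt", "price_\u20ac.txt"):
    print(filename, fits_codepage(filename))
```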
