UnicodeData.txt

Some personal notes on UnicodeData.txt.

Field 0

Code point.
A hexadecimal number of four to six digits, without any prefix. It can be stored in a wchar_t, provided wchar_t is 32 bits wide (on Windows it is only 16 bits).
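
A minimal sketch of reading field 0 into a fixed 32-bit type, which sidesteps the question of how wide wchar_t is. The function name parse_code_point is made up for illustration. strtoul() is lenient (it also accepts leading whitespace and a "0x" prefix), so this is not a strict validator.

```c
#include <stdint.h>
#include <stdlib.h>

/* Parse field 0 (hex digits) into a 32-bit value.
 * The field may be followed by the ';' field separator.
 * Returns 1 on success, 0 if the field is empty, malformed,
 * or above U+10FFFF. */
int parse_code_point(const char *field, uint32_t *out)
{
    char *end;
    unsigned long v;

    if (field == NULL || *field == '\0')
        return 0;
    v = strtoul(field, &end, 16);
    if (end == field)                    /* no hex digits at all */
        return 0;
    if (*end != '\0' && *end != ';')     /* trailing garbage */
        return 0;
    if (v > 0x10FFFFUL)                  /* outside the Unicode range */
        return 0;
    *out = (uint32_t)v;
    return 1;
}
```

It can be called directly on the start of a line, since parsing stops at the first ';'.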

Field 1

Character name.
Can be used as a remark (comment) in generated header files.
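
One way such a header entry could be generated, as a sketch. The entry layout and the name format_table_entry are assumptions for illustration, not a fixed format.

```c
#include <stdint.h>
#include <stdio.h>

/* Write one table entry such as "0x0041, / * NAME * /" (without the
 * spaces inside the comment markers) into out.
 * Returns what snprintf() returns: the length the full entry needs,
 * which is >= out_size if the entry was truncated. */
int format_table_entry(char *out, size_t out_size,
                       uint32_t cp, const char *name)
{
    return snprintf(out, out_size, "0x%04lX, /* %s */",
                    (unsigned long)cp, name);
}
```

Character names consist of capital letters, digits, spaces and hyphens (plus angle brackets for entries like <control>), so they should not be able to close the comment early.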

Field 2

General Category

L	Letter
Ll	Lower case	a
Lm	Modifier
Lo	Other	ª
Lt	Title	ǲ
Lu	Upper case	A
M	Mark
Mc	Spacing combining
Me	Enclosing
Mn	Nonspacing	Accent
N	Number
Nd	Decimal	8
Nl	Letter	Ⅷ
No	Other	² (superscript)
P	Punctuation
Pc	Connector	_
Pd	Dash	-
Pe	Close	)
Pf	Final quote	»
Pi	Initial quote	«
Po	Other	&
Ps	Open	(
S	Symbol
Sc	Currency	$
Sk	Modifier	^
Sm	Math	+
So	Other	¦
Z	Separator
Zl	Line
Zp	Paragraph
Zs	Space
C	Other
Cc	Control
Cf	Format
Cn	Not assigned
Co	Private use
Cs	Surrogate

These can be used to find the non-ASCII equivalent of 'alnum', or its opposite.
I treat everything except Ll, Lo, Lt, Lu, Mc, Mn, Nd, Nl and Cs as a word separator / delimiter.
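
The rule above as a sketch in C. The function name is_word_category is made up. Only the first two characters are compared, so the function can be pointed at field 2 inside a full line.

```c
#include <string.h>

/* Returns 1 if the two-letter General Category (field 2) belongs to a
 * word under the rule from these notes, 0 if it is treated as a
 * word separator / delimiter. */
int is_word_category(const char *gc)
{
    static const char *const word_cats[] = {
        "Ll", "Lo", "Lt", "Lu", "Mc", "Mn", "Nd", "Nl", "Cs"
    };
    size_t i;

    if (gc == NULL)
        return 0;
    for (i = 0; i < sizeof word_cats / sizeof word_cats[0]; i++)
        if (strncmp(gc, word_cats[i], 2) == 0)
            return 1;
    return 0;
}
```

Note that Lm (modifier letters) ends up on the delimiter side, because it is not in the list.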

Field 5

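Character Decomposition Mapping (type: E,S = enumeration, string; status: N = normative)

A mapping without a tag is canonical. A tag in angle brackets marks a compatibility mapping: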

<font>	A font variant (e.g. a blackletter form).
<noBreak>	A no-break version of a space or hyphen.
<initial>	An initial presentation form (Arabic).
<medial>	A medial presentation form (Arabic).
<final>	A final presentation form (Arabic).
<isolated>	An isolated presentation form (Arabic).
<circle>	An encircled form.
<super>	A superscript form.
<sub>	A subscript form.
<vertical>	A vertical layout presentation form.
<wide>	A wide (or zenkaku) compatibility character.
<narrow>	A narrow (or hankaku) compatibility character.
<small>	A small variant form (CNS compatibility).
<square>	A CJK squared font variant.
<fraction>	A vulgar fraction form.
<compat>	Otherwise unspecified compatibility character.

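The decomposition mapping can be used (among other things) to find ASCII equivalents for non-ASCII glyphs. Of the examples below, À has a canonical mapping (no tag: A followed by a combining grave accent, which is then dropped). The other three are tagged <compat>: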

Glyph	ASCII
À	A
Ĳ	IJ
ǈ	Lj
Ⅷ	VIII
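
A sketch of how field 5 could be taken apart and reduced to ASCII. Both function names are made up. Dropping every non-ASCII code point is a simplification: it handles the four glyphs above, but a real tool would probably decompose recursively and only drop combining marks.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse field 5, e.g. "<compat> 0049 004A" or "0041 0300".
 * Copies the tag without angle brackets into tag (empty string when
 * there is no tag) and stores up to max code points in cps.
 * Returns the number of code points stored, or -1 if the tag is
 * unterminated or does not fit into tag. */
int parse_decomposition(const char *field, char *tag, size_t tag_size,
                        uint32_t *cps, int max)
{
    const char *p = field;
    int n = 0;

    if (tag_size == 0)
        return -1;
    tag[0] = '\0';
    if (*p == '<') {
        const char *close = strchr(p, '>');
        size_t len;

        if (close == NULL)
            return -1;
        len = (size_t)(close - p) - 1;
        if (len >= tag_size)
            return -1;
        memcpy(tag, p + 1, len);
        tag[len] = '\0';
        p = close + 1;
    }
    while (n < max) {
        char *end;
        unsigned long v = strtoul(p, &end, 16);   /* skips the spaces */

        if (end == p)                             /* no more hex digits */
            break;
        cps[n++] = (uint32_t)v;
        p = end;
    }
    return n;
}

/* Keep the ASCII code points of the mapping, drop everything else.
 * Returns the length of the string written to out, or -1 on error. */
int ascii_equivalent(const char *field, char *out, size_t out_size)
{
    char tag[16];
    uint32_t cps[32];
    size_t len = 0;
    int n, i;

    if (out_size == 0)
        return -1;
    out[0] = '\0';
    n = parse_decomposition(field, tag, sizeof tag, cps, 32);
    if (n < 0)
        return -1;
    for (i = 0; i < n; i++)
        if (cps[i] < 0x80 && len + 1 < out_size)
            out[len++] = (char)cps[i];
    out[len] = '\0';
    return (int)len;
}
```

With the mapping of Ⅷ, "<compat> 0056 0049 0049 0049", ascii_equivalent() produces "VIII". With the mapping of À, "0041 0300", it produces "A".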

Field 6

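Decimal digit value (type: E,N = enumeration, numeric; status: N = normative)
E.g. 8 for '8' (DIGIT EIGHT) and for '٨' (ARABIC-INDIC DIGIT EIGHT). This field is only filled in for decimal digits (category Nd). Ⅷ is category Nl, so its fields 6 and 7 are empty and its value 8 appears only in field 8 (numeric value).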
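
A sketch of pulling field 6 out of a line. The fields are separated by semicolons and many of them are empty, so strtok() is no use here, since it skips over empty fields. The names get_field and decimal_digit_value are made up.

```c
#include <stddef.h>
#include <string.h>

/* Copy field number index (0-based) of a semicolon-separated line
 * into buf. Empty fields are preserved.
 * Returns the field length, or -1 if the line has too few fields or
 * the field does not fit into buf. */
int get_field(const char *line, int index, char *buf, size_t buf_size)
{
    const char *start = line;
    size_t len;

    while (index-- > 0) {
        start = strchr(start, ';');
        if (start == NULL)
            return -1;
        start++;                        /* skip the ';' itself */
    }
    len = strcspn(start, ";\r\n");      /* up to next separator or line end */
    if (len >= buf_size)
        return -1;
    memcpy(buf, start, len);
    buf[len] = '\0';
    return (int)len;
}

/* Field 6 as an int, or -1 if the field is empty or not a single
 * digit, meaning the character is not a decimal digit. */
int decimal_digit_value(const char *line)
{
    char buf[8];

    if (get_field(line, 6, buf, sizeof buf) != 1)
        return -1;
    if (buf[0] < '0' || buf[0] > '9')
        return -1;
    return buf[0] - '0';
}
```

For the line "0668;ARABIC-INDIC DIGIT EIGHT;Nd;0;AN;;8;8;8;N;;;;;" decimal_digit_value() returns 8. For "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;" it returns -1.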

Field 13

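Simple lowercase mapping.
This has some quirks. The mapping is also given for characters that are merely equivalent to a cased letter. For instance, 'OHM SIGN' ('Ω', U+2126) decomposes canonically to 'GREEK CAPITAL LETTER OMEGA' ('Ω', U+03A9), and its field 13 points to 'GREEK SMALL LETTER OMEGA' ('ω', U+03C9). A 'towlower()' built from this field will therefore convert 'OHM SIGN' to 'GREEK SMALL LETTER OMEGA'. Not the same thing at all.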
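
One possible way to spot such entries, as a sketch: check whether the lower case target maps back to the original through its upper case mapping (field 12). This round-trip test is my own suggestion, and the names hex_field and case_round_trips are made up. The three sample lines in the usage note are quoted as I know them from the file, so verify them against your copy.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Return field number index (0-based) of a semicolon-separated line
 * as a hex number, or 0 if the field is missing or empty.
 * Only meant for the code point and case mapping fields. */
uint32_t hex_field(const char *line, int index)
{
    while (index-- > 0) {
        line = strchr(line, ';');
        if (line == NULL)
            return 0;
        line++;                          /* skip the ';' itself */
    }
    return (uint32_t)strtoul(line, NULL, 16);
}

/* line is the entry of a character, lower_line the entry of its
 * field 13 target. Returns 1 if upper-casing the target gives the
 * original code point back, 0 otherwise. */
int case_round_trips(const char *line, const char *lower_line)
{
    uint32_t cp    = hex_field(line, 0);
    uint32_t lower = hex_field(line, 13);    /* simple lowercase mapping */

    return lower != 0
        && hex_field(lower_line, 0) == lower     /* right target entry */
        && hex_field(lower_line, 12) == cp;      /* simple uppercase mapping */
}
```

With "2126;OHM SIGN;Lu;0;L;03A9;;;;N;OHM;;;03C9;" and "03C9;GREEK SMALL LETTER OMEGA;Ll;0;L;;;;;N;;;03A9;;03A9" the check fails, because ω maps back to U+03A9 and not to U+2126. With "03A9;GREEK CAPITAL LETTER OMEGA;Lu;0;L;;;;;N;;;;03C9;" as the first line it succeeds.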