Some personal notes on UnicodeData.txt.
Code point.
A hexadecimal number. Can be stored in a wchar_t, provided wchar_t is 32 bits wide; a 16-bit wchar_t cannot hold code points above FFFF.
Character name.
Can be used as a comment in header files.
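A rough sketch of how these first two fields might be put to use: split one UnicodeData.txt record on semicolons and emit a C array entry with the character name as a comment. The helper name `header_line` and the output layout are my own assumptions, not a fixed convention.

```python
def header_line(record):
    """Turn one UnicodeData.txt record into a C array entry with a comment."""
    fields = record.rstrip("\n").split(";")
    code_point = int(fields[0], 16)   # field 0: code point, hexadecimal
    name = fields[1]                  # field 1: character name
    return f"0x{code_point:04X}, /* {name} */"

sample = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
print(header_line(sample))  # 0x0041, /* LATIN CAPITAL LETTER A */
```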
General Category
Code | Category | Example |
---|---|---|
L | Letter | |
Ll | Lower case | a |
Lm | Modifier | |
Lo | Other | ª |
Lt | Title | ǲ (U+01F2) |
Lu | Upper case | A |
M | Mark | |
Mc | Spacing combining | |
Me | Enclosing | |
Mn | Nonspacing | Accent |
N | Number | |
Nd | Decimal | 8 |
Nl | Letter | Ⅷ |
No | Other | ² (superscript) |
P | Punctuation | |
Pc | Connector | _ |
Pd | Dash | - |
Pe | Close | ) |
Pf | Final quote | » |
Pi | Initial quote | « |
Po | Other | & |
Ps | Open | ( |
S | Symbol | |
Sc | Currency | $ |
Sk | Modifier | ^ |
Sm | Math | + |
So | Other | ¦ |
Z | Separator | |
Zl | Line | |
Zp | Paragraph | |
Zs | Space | |
C | Other | |
Cc | Control | |
Cf | Format | |
Cn | Not assigned | |
Co | Private | |
Cs | Surrogate | |
These categories can be used to build the non-ASCII equivalent of 'alnum', or its opposite.
I treat every category except Ll, Lo, Lt, Lu, Mc, Mn, Nd, Nl and Cs as a word
separator / delimiter.
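A minimal sketch of that rule, using Python's `unicodedata` module instead of parsing UnicodeData.txt directly. The names `WORD_CATEGORIES`, `is_separator` and `split_words` are made up for illustration. Results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Categories treated as part of a word, taken from the list above.
WORD_CATEGORIES = {"Ll", "Lo", "Lt", "Lu", "Mc", "Mn", "Nd", "Nl", "Cs"}

def is_separator(ch):
    """True if ch falls outside the word categories."""
    return unicodedata.category(ch) not in WORD_CATEGORIES

def split_words(text):
    """Split text into runs of non-separator characters."""
    words, current = [], []
    for ch in text:
        if is_separator(ch):
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words

# na\u00efve caf\u00e9, ROMAN NUMERAL EIGHT, digit 8
print(split_words("na\u00efve caf\u00e9, \u2167 8!"))
```

Note that Lm, Me and No are not in the list, so a superscript digit, for example, acts as a separator here.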
Character Decomposition Mapping (type E,S: enumeration plus string; status N: normative)
Tag | Meaning |
---|---|
<font> | A font variant (e.g. a blackletter form). |
<noBreak> | A no-break version of a space or hyphen. |
<initial> | An initial presentation form (Arabic). |
<medial> | A medial presentation form (Arabic). |
<final> | A final presentation form (Arabic). |
<isolated> | An isolated presentation form (Arabic). |
<circle> | An encircled form. |
<super> | A superscript form. |
<sub> | A subscript form. |
<vertical> | A vertical layout presentation form. |
<wide> | A wide (or zenkaku) compatibility character. |
<narrow> | A narrow (or hankaku) compatibility character. |
<small> | A small variant form (CNS compatibility). |
<square> | A CJK squared font variant. |
<fraction> | A vulgar fraction form. |
<compat> | Otherwise unspecified compatibility character. |
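The decomposition mapping can be used (among other things) to find ASCII equivalents for non-ASCII glyphs. Only the tagged entries are compatibility mappings: À has an untagged (canonical) decomposition into A plus a combining grave accent, so the accent has to be dropped as well.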
Glyph | ASCII |
---|---|
À | A |
Ĳ (U+0132) | IJ |
ǈ (U+01C8) | Lj |
Ⅷ | VIII |
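The table above can be reproduced with a short sketch. It relies on Python's `unicodedata.normalize` with the NFKD form, which applies both the canonical and the tagged compatibility decompositions, and then drops nonspacing marks (Mn). The function name `ascii_equivalent` is my own. The result is only ASCII when the decomposition happens to end in ASCII characters.

```python
import unicodedata

def ascii_equivalent(text):
    """Decompose text fully (NFKD) and drop nonspacing marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

# U+00C0, U+0132, U+01C8, U+2167: the four glyphs from the table
for glyph in ("\u00c0", "\u0132", "\u01c8", "\u2167"):
    print(glyph, "->", ascii_equivalent(glyph))
```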
Decimal digit value (type E,N: enumeration plus numeric; status N: normative)
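E.g. 8 for the digit 8. Note that Ⅷ leaves the decimal digit and digit fields empty; its value 8 appears only in the numeric value field.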
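A small sketch to check which of the three numeric fields a character actually fills, again through Python's `unicodedata` module. The helper name `numeric_fields` is an assumption of mine; `None` stands for an empty field.

```python
import unicodedata

def numeric_fields(ch):
    """Return (decimal, digit, numeric) for ch, with None for empty fields."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )

print(numeric_fields("8"))       # (8, 8, 8.0)
print(numeric_fields("\u00b2"))  # SUPERSCRIPT TWO: (None, 2, 2.0)
print(numeric_fields("\u2167"))  # ROMAN NUMERAL EIGHT: (None, None, 8.0)
```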
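The lower case mapping has some quirks. It is also given for characters that merely decompose to a cased letter. For instance, UnicodeData.txt decomposes 'OHM SIGN' ('Ω', U+2126) to 'GREEK CAPITAL LETTER OMEGA' (an untagged, canonical mapping rather than a <compat> one) and gives it a lower case mapping of its own. A towlower() built on this data will therefore convert 'OHM SIGN' to 'GREEK SMALL LETTER OMEGA' ('ω'). Not the same thing at all.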
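The same behaviour can be observed from Python, whose `str.lower()` draws on the same Unicode data. This only illustrates the mapping itself; what a given C library's towlower() does depends on the implementation and locale.

```python
import unicodedata

OHM = "\u2126"  # OHM SIGN

print(unicodedata.name(OHM))            # OHM SIGN
print(unicodedata.decomposition(OHM))   # 03A9, with no <compat> tag
lowered = OHM.lower()
print(unicodedata.name(lowered))        # GREEK SMALL LETTER OMEGA
# Going back up does not restore the original character.
print(unicodedata.name(lowered.upper()))  # GREEK CAPITAL LETTER OMEGA
```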