Word numbers are signed 32-bit.
Only word numbers > zero are valid.
Document numbers are 1 to 65530 (unsigned 16-bit).
Binary formats are used by the search form.
This is either a compact or a non-compact format. The compact format is probably more efficient in case of a large words-list, but it needs an additional index file.
Alphabetical list of words.
32-bit (4-byte) word number followed by word.
Word is max 31 bytes followed by a terminating NULL
(before 2021-03-08 this used to be max 23 bytes).
                1        2        3
   ┌────────┬────────┬────────┬────────┐
 0 │            Word number            │
   ├────────┼────────┼────────┼────────┤
 4 │ Word                              │
   ├────────┼────────┼────────┼────────┤
 8 │                                   │
   ├────────┼────────┼────────┼────────┤
12 │                                   │
   ├────────┼────────┼────────┼────────┤
16 │                                   │
   ├────────┼────────┼────────┼────────┤
20 │                                   │
   ├────────┼────────┼────────┼────────┤
24 │                                   │
   ├────────┼────────┼────────┼────────┤
28 │                                   │
   ├────────┼────────┼────────┼────────┤
32 │                              0    │
   ├────────┼────────┼────────┼────────┤
36 │   0        0        0        0    │
   └────────┴────────┴────────┴────────┘
Records are fixed length with zero padding.
There is an additional 4 byte pad, which is zero.
Word numbers start at one, not zero.
Charset is UTF-8.
The sort order is ascending unsigned byte value, not Unicode code
points.
Just over 26 million words are supported (1073741820 / 40). With the '-k'
option this may be higher.
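The 40-byte record above can be sketched with Python's struct module. This is a minimal illustration, not the indexer's own code: the byte order is not stated in this document, so little-endian ("<") is an assumption, and the names pack_record and unpack_record are hypothetical.

```python
import struct

RECORD_SIZE = 40  # 4 (word number) + 32 (word + NULL) + 4 (pad)
# "<" = little-endian, which is an ASSUMPTION; the text does not state it.
RECORD = struct.Struct("<i32s4x")


def pack_record(number, word):
    """Build one fixed-length record; struct zero-pads the 32-byte word field."""
    raw = word.encode("utf-8")
    if number <= 0 or len(raw) > 31:
        raise ValueError("word number must be > 0 and word at most 31 bytes")
    return RECORD.pack(number, raw)


def unpack_record(record):
    """Split a 40-byte record into (word number, word)."""
    number, field = RECORD.unpack(record)
    word = field.split(b"\0", 1)[0].decode("utf-8")
    return number, word


record = pack_record(0x102B, "caffeine")
print(len(record), unpack_record(record))
# Whole records that fit in the limit quoted above.
print(1073741820 // RECORD_SIZE)
```

Because every record has the same size, record n (counting from zero) starts at byte n * 40, which is what makes a binary search over the sorted list possible without an index file.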
Alphabetical list of words.
Generated by 'cgi-index -k'.
UTF-8 NULL terminated character strings.
┌────────┬──────     ──────┬────────┐
│ Word           ...            0   │
└────────┴──────     ──────┴────────┘
Records are variable length. The sort order is ascending unsigned byte value.
Index to words-list.
Array of structs:
32-bit word number, followed by offset to word, followed by word
length.
                1        2        3
   ┌────────┬────────┬────────┬────────┐
 0 │            Word number            │
   ├────────┼────────┼────────┼────────┤
 4 │     Pointer / Offset (bytes)      │
   ├────────┼────────┼────────┼────────┤
 8 │          Length (bytes)           │
   └────────┴────────┴────────┴────────┘
The length does not include the terminating NULL.
Note: This file is only present in case of compact words-list.
This file should not exist when the compact words list file format isn't
used.
When words.idx does exist the software will ASSUME that the words
list file format is compact! And the compact file format is completely
incompatible with the non-compact file format!
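How the compact words-list and words.idx fit together can be sketched as follows. Assumptions, none of which are confirmed by this document: little-endian byte order, signed 32-bit offset and length fields, and index records written in the same order as the words. The helper names build_compact and word_at are hypothetical.

```python
import struct

# Word number, offset (bytes), length (bytes). Byte order and the
# signedness of offset/length are ASSUMPTIONS.
IDX = struct.Struct("<iii")


def build_compact(words):
    """Return (words_list, words_idx) bytes for (number, word) pairs."""
    blob, idx = bytearray(), bytearray()
    # Sort by ascending unsigned byte value of the UTF-8 encoding.
    for number, word in sorted(words, key=lambda p: p[1].encode("utf-8")):
        raw = word.encode("utf-8")
        idx += IDX.pack(number, len(blob), len(raw))  # length excludes the NULL
        blob += raw + b"\0"
    return bytes(blob), bytes(idx)


def word_at(words_list, words_idx, slot):
    """Read the word that index record number `slot` points to."""
    number, offset, length = IDX.unpack_from(words_idx, slot * IDX.size)
    return number, words_list[offset:offset + length].decode("utf-8")


words_list, words_idx = build_compact([(2, "tea"), (1, "caffeine")])
print(word_at(words_list, words_idx, 0))
print(word_at(words_list, words_idx, 1))
```

The index records are fixed length (12 bytes), so the variable-length words can still be reached by record position.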
Lists per word in which documents these words occur.
E.G.: Word number 3 is present in documents 6, 7 and 8.
Lower 16 bits of word number followed by one or more unsigned 16-bit document
numbers. Terminated with a 16-bit NULL.
             1        2        3
┌────────┬────────┬────────┬────────┐
│   Word number   │ Document number │
├────────┼────────┼────────┼────────┤
│ Document number │ Document number │
├────────┼────────┼────────┼────────┤
                 ...
├────────┼────────┼────────┼────────┤
│ Document number │        0        │
└────────┴────────┴────────┴────────┘
Records are variable length. Document numbers start at one, not zero. The word number can be used for debugging and file consistency checks.
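A minimal sketch of reading one such record, using the "word number 3 is present in documents 6, 7 and 8" example. Little-endian byte order is an assumption, and pack_postings / unpack_postings are hypothetical names.

```python
import struct


def pack_postings(word_number, documents):
    """Low 16 bits of the word number, the document numbers, then a 16-bit 0."""
    values = [word_number & 0xFFFF, *documents, 0]
    return struct.pack(f"<{len(values)}H", *values)  # "<" is an ASSUMPTION


def unpack_postings(data, offset=0):
    """Return (low 16 bits of word number, document numbers, next offset)."""
    (low16,) = struct.unpack_from("<H", data, offset)
    offset += 2
    documents = []
    while True:
        (doc,) = struct.unpack_from("<H", data, offset)
        offset += 2
        if doc == 0:  # 16-bit NULL terminator; valid documents start at one
            break
        documents.append(doc)
    return low16, documents, offset


record = pack_postings(3, [6, 7, 8])
print(unpack_postings(record))
```

The returned offset is where the next record starts, so a whole file can be walked by calling unpack_postings repeatedly.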
Index to index-list.
Array of structs:
32-bit word number, followed by offset to record, followed by record
length.
                1        2        3
   ┌────────┬────────┬────────┬────────┐
 0 │            Word number            │
   ├────────┼────────┼────────┼────────┤
 4 │     Pointer / Offset (bytes)      │
   ├────────┼────────┼────────┼────────┤
 8 │          Length (bytes)           │
   └────────┴────────┴────────┴────────┘
The length does not include the terminating NULL.
The word number can be used for debugging and file consistency checks.
Sort order is word number.
Note: Offset and length are bytes, not number of data elements.
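Since the index is sorted by word number, a lookup can be a binary search followed by one read from index-list. The sketch below assumes little-endian byte order, signed 32-bit offset and length fields, and that the record length covers the leading 16-bit word number but not the terminator. None of this is confirmed here, and the function names are hypothetical.

```python
import struct

# Word number, offset (bytes), length (bytes); layout details are ASSUMED.
IDX = struct.Struct("<iii")


def find_record(index_idx, word_number):
    """Binary search the index (sorted by word number)."""
    lo, hi = 0, len(index_idx) // IDX.size
    while lo < hi:
        mid = (lo + hi) // 2
        number, offset, length = IDX.unpack_from(index_idx, mid * IDX.size)
        if number == word_number:
            return offset, length
        if number < word_number:
            lo = mid + 1
        else:
            hi = mid
    return None


def documents_for(index_list, index_idx, word_number):
    """Return the document numbers for a word, with a consistency check."""
    found = find_record(index_idx, word_number)
    if found is None:
        return []
    offset, length = found
    # Offset and length are bytes, so length // 2 is the number of 16-bit values.
    values = struct.unpack_from(f"<{length // 2}H", index_list, offset)
    if values[0] != word_number & 0xFFFF:
        raise ValueError("index and index-list disagree")
    return list(values[1:])


index_list = struct.pack("<5H", 3, 6, 7, 8, 0) + struct.pack("<3H", 5, 1, 0)
index_idx = IDX.pack(3, 0, 8) + IDX.pack(5, 10, 4)
print(documents_for(index_list, index_idx, 3))
```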
Lists per document the first 94 words.
Generated by 'cgi-index -a'.
32-bit document number, followed by one or more word numbers.
Terminated with a 32-bit NULL.
             1        2        3
┌────────┬────────┬────────┬────────┐
│          Document number          │
├────────┼────────┼────────┼────────┤
│            Word number            │
├────────┼────────┼────────┼────────┤
                 ...
├────────┼────────┼────────┼────────┤
│            Word number            │
├────────┼────────┼────────┼────────┤
│                 0                 │
└────────┴────────┴────────┴────────┘
Records are fixed length with zero padding. Word numbers start at one, not zero. The document number can be used for debugging and file consistency checks. Sort order is document number.
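A sketch of one fixed-length record. The record size is not stated in this document; the value used here (document number + 94 word numbers + terminator = 96 x 4 = 384 bytes) is an inference, and little-endian byte order is an assumption. pack_abstract and unpack_abstract are hypothetical names.

```python
import struct

MAX_WORDS = 94
# ASSUMED layout: 1 document number + 94 word slots + 1 terminator,
# all signed 32-bit little-endian, 384 bytes in total.
RECORD = struct.Struct(f"<{MAX_WORDS + 2}i")


def pack_abstract(document, word_numbers):
    """Keep at most the first 94 word numbers and zero-pad the rest."""
    words = list(word_numbers)[:MAX_WORDS]
    padding = [0] * (MAX_WORDS + 1 - len(words))  # terminator plus padding
    return RECORD.pack(document, *words, *padding)


def unpack_abstract(record):
    """Return (document number, word numbers up to the 32-bit NULL)."""
    document, *rest = RECORD.unpack(record)
    return document, rest[:rest.index(0)]


record = pack_abstract(1, [0x3029, 0x31DA, 0x3BAD])
print(len(record), unpack_abstract(record))
```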
Contains URLs and their titles;
<a href="$URL">$TITLE</a>
UTF-8
NULL terminated character strings.
┌────────┬──────     ──────┬────────┐
│ Link           ...            0   │
└────────┴──────     ──────┴────────┘
Records are variable length.
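Splitting such a file back into URLs and titles could look like the sketch below. It assumes every record has exactly the `<a href="$URL">$TITLE</a>` shape shown above; parse_links_list is a hypothetical helper, not part of the software.

```python
import re

# Matches exactly the <a href="$URL">$TITLE</a> shape described above.
LINK = re.compile(r'<a href="([^"]*)">(.*)</a>', re.DOTALL)


def parse_links_list(data):
    """Split NULL-terminated UTF-8 records into (url, title) pairs."""
    links = []
    for raw in data.split(b"\0"):
        if not raw:
            continue  # nothing follows the final terminator
        match = LINK.fullmatch(raw.decode("utf-8"))
        if match is None:
            raise ValueError("unexpected record: %r" % raw)
        links.append((match.group(1), match.group(2)))
    return links


sample = (
    '<a href="/">Rob\'s server</a>\0'
    '<a href="/~g%C3%BCnter/">Günter\'s homepage</a>\0'
).encode("utf-8")
print(parse_links_list(sample))
```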
Index to links-list.
Array of structs:
32-bit document number, followed by offset to record, followed by record
length.
                1        2        3
   ┌────────┬────────┬────────┬────────┐
 0 │          Document number          │
   ├────────┼────────┼────────┼────────┤
 4 │     Pointer / Offset (bytes)      │
   ├────────┼────────┼────────┼────────┤
 8 │          Length (bytes)           │
   └────────┴────────┴────────┴────────┘
The length does not include the terminating NULL. The document number can be used for debugging and file consistency checks. Sort order is document number.
Text formats are used for debugging or to generate binary files.
Generated by 'cgi-index -t'.
Alphabetical list of words.
Hex word number followed by word.
XXXX<SP>Word<LF>
E.G.:
102B caffeine
With the '-l' option;
0000102B caffeine
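Reading and writing these lines is straightforward. A small sketch follows; the function names are hypothetical, and the `long` flag merely mimics the 8-digit output shown for the '-l' option.

```python
def parse_word_line(line):
    """Turn 'XXXX<SP>Word<LF>' into (word number, word)."""
    number, word = line.rstrip("\n").split(" ", 1)
    return int(number, 16), word


def format_word_line(number, word, long=False):
    """Write the hex number with 4 digits, or 8 digits for the long form."""
    width = 8 if long else 4
    return f"{number:0{width}X} {word}\n"


print(parse_word_line("102B caffeine\n"))
print(format_word_line(0x102B, "caffeine", long=True), end="")
```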
Generated by 'cgi-index -t'.
Lists per word in which documents these words occur.
E.G.: Word number 3 is present in documents 6, 7 and 8.
Hex word number followed by one or more hex document numbers separated by
spaces.
XXXX<SP>XXXX<SP>XXXX ... XXXX<LF>
E.G.:
0003 0006 0007 0008
Generated by 'cgi-index -a -t'.
Lists per document the first 94 words.
Hex document number followed by one or more hex word numbers separated by
spaces.
XXXX<SP>XXXX<SP>XXXX ... XXXX<LF>
E.G.:
0001 3029 31DA 3BAD
With the '-l' option;
00000001 00003029 000031DA 00003BAD
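The two number-list text formats (word followed by documents, document followed by words) share one line shape, so a single pair of helpers can cover both. A sketch with hypothetical names:

```python
def parse_number_line(line):
    """Split a line of hex numbers into (first number, remaining numbers)."""
    head, *rest = [int(token, 16) for token in line.split()]
    return head, rest


def format_number_line(head, numbers, long=False):
    """Write space-separated hex numbers, 4 digits wide or 8 for the long form."""
    width = 8 if long else 4
    return " ".join(f"{n:0{width}X}" for n in [head, *numbers]) + "\n"


print(parse_number_line("0003 0006 0007 0008\n"))
print(format_number_line(1, [0x3029, 0x31DA, 0x3BAD], long=True), end="")
```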
Used to generate links-list and links.idx.
Hex document number followed by tab or a single space followed by link.
XXXX<Tab_Or_Space>Link<LF>
0001 <a href="/">Rob's server</a>
0002 <a href="/~g%C3%BCnter/">Günter's homepage</a>
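A line of this file can be split on the first tab or space, since the link itself may contain further spaces. A minimal sketch; parse_num_link is a hypothetical name and this is not the converter's own code.

```python
import re


def parse_num_link(line):
    """Turn 'XXXX<Tab_Or_Space>Link<LF>' into (document number, link)."""
    number, link = re.split(r"[ \t]", line.rstrip("\n"), maxsplit=1)
    return int(number, 16), link


print(parse_num_link('0001 <a href="/">Rob\'s server</a>\n'))
print(parse_num_link('0002\t<a href="/~g%C3%BCnter/">Günter\'s homepage</a>\n'))
```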
'gen-num-index $NAME' converts num-$NAME.list to $NAME-list and $NAME.idx.
Document numbers start at one, not zero.
The charset of the links file is
UTF-8.
This applies to the file system as well!
Note how the u-umlaut / u-diaeresis ('ü') is escaped in the above example
('%C3%BC'); each byte in the UTF-8 multi-byte sequence is replaced by its
percent-hex-value.
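The escaping can be reproduced with Python's standard urllib.parse module. This only illustrates the encoding rule; it is not necessarily how the links were produced.

```python
from urllib.parse import quote, unquote

path = "/~günter/"

# quote() percent-encodes each byte of the UTF-8 sequence;
# "/" and "~" are left untouched by default (Python 3.7 or later).
escaped = quote(path)

# The same rule spelled out by hand for the single character 'ü'.
by_hand = "".join(f"%{byte:02X}" for byte in "ü".encode("utf-8"))

print(escaped)
print(by_hand)
print(unquote(escaped) == path)
```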
Charsets other than UTF-8 will not work!
Furthermore, do not use shell metacharacters (E.G.: space) in file names.
This will not work!
This is actually a US <-> GB conversion;
A US spelling search will also look up GB spelled words.
A GB spelling search will also look up US spelled words.
The current lookup table is based on an
Aspell dump;
gb2us.tsv.gz is a gzipped GB to US spelling TSV
file. It contains more than 2500 GB - US word pairs.
Switching the columns yields a US to GB conversion.
And combining the two does both.
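Switching and combining the columns can be sketched as below. It assumes the TSV holds the GB spelling in the first column and the US spelling in the second (suggested by the file name, not confirmed here), and it works on in-memory text rather than the gzipped file. The function names are hypothetical.

```python
def read_pairs(tsv_text):
    """Parse 'GB<Tab>US' lines into (gb, us) tuples; column order is ASSUMED."""
    pairs = []
    for line in tsv_text.splitlines():
        if line.strip():
            gb, us = line.split("\t")
            pairs.append((gb, us))
    return pairs


def both_directions(gb_us_pairs):
    """Combine GB -> US with the switched US -> GB pairs, sorted by word."""
    us_gb = [(us, gb) for gb, us in gb_us_pairs]
    return sorted(set(gb_us_pairs) | set(us_gb))


sample = "centre\tcenter\ncolour\tcolor\nfibre\tfiber\n"
for word, synonym in both_directions(read_pairs(sample)):
    print(word, synonym)
```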
List of words and their synonyms. One pair per line;
Word<Single Space or Tab>Synonym<LF>
Example:
center centre
centre center
color colour
colour color
fiber fibre
fibre fiber
This file is in alphabetical order.
It's used by 'gensynontab' to generate the files 'synonyms-list' and
'synonyms.idx'.
synonyms-list and synonyms.idx are used by the indexer. When synonyms are
enabled (-o option), it will index synonyms as if they were part of the
text.
Note: Synonyms do not show up in abstracts. Only the words that are actually
in the text do.
List of words and their synonyms.
Lower case NULL-terminated UTF-8 strings without padding.
So records have no fixed length.
┌────────┬──────     ──────┬────────┐
│ Word or synonym ...           0   │
└────────┴──────     ──────┴────────┘
Example:
center
centre
color
colour
fiber
fibre
This file is in alphabetical order.
Index to synonyms-list.
Word - synonym pair lookup table. Each record contains two 32-bit signed
integers. The first points to a word. The second to its synonym.
                1        2        3
   ┌────────┬────────┬────────┬────────┐
 0 │        Word offset (bytes)        │
   ├────────┼────────┼────────┼────────┤
 4 │      Synonym offset (bytes)       │
   └────────┴────────┴────────┴────────┘
Note that there is just one synonym per word.
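How the two synonym files relate can be sketched as follows. This is an illustration under assumptions, not the generator's actual code: little-endian byte order is assumed, the helper names are hypothetical, and the lookup is a plain linear scan for clarity.

```python
import struct

PAIR = struct.Struct("<ii")  # word offset, synonym offset; "<" is an ASSUMPTION


def build_synonyms(pairs):
    """Return (synonyms_list, synonyms_idx) bytes for (word, synonym) pairs."""
    strings = sorted({s for pair in pairs for s in pair},
                     key=lambda s: s.encode("utf-8"))
    offsets, blob = {}, bytearray()
    for s in strings:
        offsets[s] = len(blob)            # each string is stored once
        blob += s.encode("utf-8") + b"\0"
    idx = b"".join(PAIR.pack(offsets[w], offsets[s]) for w, s in pairs)
    return bytes(blob), idx


def read_string(blob, offset):
    """Read the NULL-terminated string that starts at a byte offset."""
    return blob[offset:blob.index(b"\0", offset)].decode("utf-8")


def synonym_of(blob, idx, word):
    """Linear scan over the pair table; returns None if the word is unknown."""
    for word_offset, synonym_offset in PAIR.iter_unpack(idx):
        if read_string(blob, word_offset) == word:
            return read_string(blob, synonym_offset)
    return None


pairs = [("center", "centre"), ("centre", "center"),
         ("color", "colour"), ("colour", "color")]
synonyms_list, synonyms_idx = build_synonyms(pairs)
print(synonym_of(synonyms_list, synonyms_idx, "colour"))
```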