Note: This combination of indexer and search form works on the file system.
The indexer doesn't crawl! Consequently, it needs to run on the system it
indexes.
If you want a crawler instead, look here.
I use Linklint as an internal link checker. It looks for broken links on my
own website.
One of the reports it produces is a file called 'file.txt', which contains
a list of all the files linked on my website. I use a shell script to
generate a list containing all plain-text, HTML and PDF files on my site.
Pdftotext is used to generate text versions of PDF files. There is a file
with the extension '.title' for each PDF file.
The file list has the following format:
/Path/File<Tab>URL<Tab>Title<LF>
For example:
File | URL | Title |
---|---|---|
/var/www/index.html | / | Rob's server |
/var/www/time/T4224.txt | /time/T4224.pdf | Temic U4224B Time Code Receiver |
/home/rob/WWW/index.html | /~rob/ | Rob's home page |
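
The record format above can be split with a few lines of C. This is only a minimal sketch, not the code the indexer actually uses: the names `filelist_entry` and `filelist_parse` are made up for illustration, and it assumes that a title never contains a tab.

```c
#include <stddef.h>
#include <string.h>

/* One record of the file list: /Path/File<Tab>URL<Tab>Title<LF>.
 * (Hypothetical type, for illustration only.) */
struct filelist_entry {
    char *path;   /* file system path of the file to index */
    char *url;    /* URL the search result links to */
    char *title;  /* title shown for the document */
};

/*
 * Split one line in place. The two tabs and the trailing newline in
 * 'line' are overwritten with '\0'; the entry points into 'line'.
 * Returns 0 on success, -1 if the line has fewer than three fields.
 */
int filelist_parse(char *line, struct filelist_entry *e)
{
    char *tab1, *tab2;

    line[strcspn(line, "\n")] = '\0';   /* drop the trailing LF, if any */

    tab1 = strchr(line, '\t');
    if (tab1 == NULL)
        return -1;
    tab2 = strchr(tab1 + 1, '\t');
    if (tab2 == NULL)
        return -1;

    *tab1 = '\0';
    *tab2 = '\0';
    e->path = line;
    e->url = tab1 + 1;
    e->title = tab2 + 1;
    return 0;
}
```

Because the line is split in place, the caller has to keep the line buffer alive for as long as the entry is used.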
Other shell scripts use this file to generate the
sitemap.html and sitemap.xml.
A combination of a shell script and some C programs is used to index my site.
The indexer creates a word list, a word-to-document index and an abstracts file.
These are used by the web-form.
The software assumes that all HTML files have the '.html' extension.
If this is not the case, you need to modify the shell scripts and C-sources,
to include other extensions.
Furthermore, the software assumes that the charset is UTF-8!
There is no need for a database server. The software maintains the files on
its own.
Apart from the text in alt attributes, names and titles, the software ignores
all text inside HTML tags. Text in alt attributes, names and iframe titles
needs to be in quotes (alt="Some text").
The software considers tags to be word delimiters. This does not apply to
'<a href="Url">', '</a>', bold, italic and underline.
So if only part of a word is clickable, it still gets indexed as one word.
Numeric- and SGML entities are converted to UTF-8 before indexing.
Some 2400 HTML, SGML and XML entities are supported.
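For a numeric entity such as `&#228;` the last step of the conversion is encoding the code point as UTF-8. The function below is a generic sketch of that step, not a copy of the indexer's source; the name `utf8_encode` is an assumption, and looking up named entities in a table is left out.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode Unicode code point 'cp' as UTF-8 into 'out', which must have
 * room for at least 4 bytes. Returns the number of bytes written, or 0
 * for surrogates and for values above U+10FFFF.
 */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                       /* 1 byte: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this, `&#228;` (code point 228, 'ä') becomes the two bytes 0xC3 0xA4.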
When indexing 'garät' it indexes 'garat' as well. This way, if search
words are entered without accents, both accented and non-accented forms
are found.
Note: This only works for Latin scripts.
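The accent folding described above could look roughly like the sketch below. It is an illustration under assumptions, not the indexer's real code: the name `fold_accents` and the table are made up, and only the Latin-1 Supplement block (U+00C0 to U+00FF, which UTF-8 encodes as 0xC3 followed by one byte) is covered. Characters without an obvious base letter, such as 'ß' or 'æ', are copied unchanged.

```c
#include <stddef.h>

/* Base letters for U+00C0..U+00FF; '.' means "leave this one alone". */
static const char latin1_base[65] =
    "AAAAAA.CEEEEIIII"   /* U+00C0..U+00CF */
    ".NOOOOO..UUUUY.."   /* U+00D0..U+00DF */
    "aaaaaa.ceeeeiiii"   /* U+00E0..U+00EF */
    ".nooooo..uuuuy.y";  /* U+00F0..U+00FF */

/*
 * Copy the UTF-8 string 'in' to 'out', replacing accented Latin-1
 * letters by their unaccented base letter. 'out' must have room for
 * strlen(in) + 1 bytes; the result is never longer than the input.
 */
void fold_accents(const char *in, char *out)
{
    const unsigned char *p = (const unsigned char *)in;

    while (*p != '\0') {
        if (p[0] == 0xC3 && p[1] >= 0x80 && p[1] <= 0xBF
            && latin1_base[p[1] - 0x80] != '.') {
            *out++ = latin1_base[p[1] - 0x80];  /* accented letter: fold */
            p += 2;
        } else {
            *out++ = (char)*p++;                /* anything else: copy */
        }
    }
    *out = '\0';
}
```

To get the behaviour described above, a caller would index both the original word and the folded copy whenever the two differ.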
When GB and US spelling are different, it will index both versions.
Note: Only the version that is actually in the text shows up in
abstracts (See Search below).
Internal arithmetic is done with 32-bit signed integers, which limits the
size of files and number of unique search words.
Furthermore, document numbers are unsigned 16-bit integers, which limits
the number of indexed documents.
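To put numbers on these limits: a 32-bit signed integer goes up to 2147483647 (just under 2 GiB when it counts bytes) and an unsigned 16-bit integer up to 65535. The helper names below are hypothetical, and whether the software reserves any of these values for its own use is not stated here, so the real limits may be slightly lower.

```c
#include <stdint.h>

/* Returns 1 if a file size, offset or word count fits in the
 * 32-bit signed integers used for internal arithmetic. */
int fits_int32(long long value)
{
    return value >= 0 && value <= INT32_MAX;    /* INT32_MAX == 2147483647 */
}

/* Returns 1 if a document number fits in an unsigned 16-bit integer. */
int fits_docnum(long long docnum)
{
    return docnum >= 0 && docnum <= UINT16_MAX; /* UINT16_MAX == 65535 */
}
```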
Data is processed on a per line basis. Lines may not be longer than 4095
bytes, including newline.
When indexing non-ASCII, only UTF-8 is supported.
The maximum word length is currently 31 bytes. For non-ASCII text this may be
as little as 7 characters!
If you find this too much of a limitation you may increase the value of
'CGS_WRD_SIZ' in 'cgi-search.h' to any higher multiple of eight. Next
recompile the software and then run the index script WITHOUT the
'-r' option (after this you can use '-r' again). Do this before
using the search form.
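The "7 characters" figure follows from UTF-8 using up to 4 bytes per character: 7 x 4 = 28 bytes fit in 31, 8 x 4 = 32 do not. The sketch below counts how many whole characters of a word fit in a given number of bytes. The function name is made up for illustration, and the 31 is simply passed in as an argument rather than taken from 'cgi-search.h'.

```c
#include <stddef.h>

/*
 * Count how many complete UTF-8 characters from the start of 's'
 * fit in 'maxbytes' bytes. Counting stops at the first invalid or
 * truncated sequence.
 */
size_t utf8_chars_that_fit(const char *s, size_t maxbytes)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t used = 0, chars = 0;

    while (p[used] != '\0') {
        unsigned char c = p[used];
        size_t len, i;

        if (c < 0x80)
            len = 1;                    /* ASCII */
        else if ((c & 0xE0) == 0xC0)
            len = 2;
        else if ((c & 0xF0) == 0xE0)
            len = 3;
        else if ((c & 0xF8) == 0xF0)
            len = 4;
        else
            break;                      /* not a valid lead byte */

        for (i = 1; i < len; i++)
            if ((p[used + i] & 0xC0) != 0x80)
                return chars;           /* truncated or invalid sequence */

        if (used + len > maxbytes)
            break;                      /* next character no longer fits */
        used += len;
        chars++;
    }
    return chars;
}
```

A word of 4-byte characters gives 7 for a limit of 31 bytes, a word of 2-byte characters (most accented Latin letters) gives 15, and plain ASCII gives 31.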
If you set 'CGS_WRD_SIZ' to a higher value it is probably a good idea to
use the '-k' option. This will reduce the size of the word list, both in
memory and on disk. If you make it much higher and combine that with '-k',
you may want to edit the source as well: in cgi-index.c, function init(), change
'wordlstsiz = wordlstentr * ((CGS_MIN_WRD_LEN + CGS_WRD_MAX) / 2);' to
'wordlstsiz = wordlstentr * CGS_WRD_SIZ / 4;'.
The functionality of this software is quite limited, but it's also very fast: if I run the indexer from the prompt, the prompt returns right away, having indexed some 20000 words from ca. 200 documents.
The search-form is very simple too. It produces links
to all the pages which contain all of the searched words. All on one page!
When more than one word is entered, it produces abstracts too. Each abstract
contains the first 94 words of the document.
Searched words are highlighted in the abstracts.
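Taking the first 94 words of a document can be sketched as below. This is an assumption-laden illustration, not the code behind the abstracts file: the function name is hypothetical and a "word" is taken to be anything between whitespace, which may differ from how the indexer itself delimits words.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Return the length in bytes of the prefix of 'text' that holds its
 * first 'maxwords' whitespace-separated words (fewer if the text is
 * shorter). Trailing whitespace is not included.
 */
size_t abstract_length(const char *text, size_t maxwords)
{
    const unsigned char *p = (const unsigned char *)text;
    size_t i = 0, words = 0, end = 0;

    while (words < maxwords) {
        while (p[i] != '\0' && isspace(p[i]))
            i++;                        /* skip whitespace before a word */
        if (p[i] == '\0')
            break;                      /* no more words */
        while (p[i] != '\0' && !isspace(p[i]))
            i++;                        /* consume the word itself */
        words++;
        end = i;
    }
    return end;
}
```

Calling `abstract_length(text, 94)` gives the number of bytes to copy for an abstract of at most 94 words.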
Some notes on the contents of some of the header files