This crawler is based on my indexer. This page primarily describes the differences from the indexer, so you may want to read that page first.
This software both crawls and indexes, so you don't need any additional binaries
or scripts (the supplied script can be handy, though).
You just need a text file with a list of websites to crawl:

http://www.example.org/
https://www.example.com/
You need a trailing slash ('/') for each site!
If it's not there, the software will append one.
Furthermore, the software doesn't understand that http://www.example.org/
and https://www.example.org/ are the same thing. Each website needs to be
entirely HTTP or entirely HTTPS. Of course, one site being HTTP and another
being HTTPS is fine.
Each site should be in the list only ONCE!
In addition to the limitations of the indexer, this crawler has the following limitations:
Content management systems have their own problems. Some use thousands of
different URLs to refer to just a few documents.
It may help to use the '-m' option. This enables the use of sitemaps: if
the software finds a sitemap, it will not crawl the website but will index
each file listed in the sitemap instead.
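A rough sketch, for illustration only, of what the '-m' behaviour amounts to;
the helper name and the use of Python's urllib are illustrative assumptions,
not part of this software:

    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    def sitemap_urls(site):
        # Return the <loc> entries of the site's /sitemap.xml, or [] if there is none.
        try:
            with urllib.request.urlopen(site + "sitemap.xml") as resp:
                tree = ET.fromstring(resp.read())
        except Exception:
            return []
        return [loc.text for loc in tree.iter(SITEMAP_NS + "loc")]

    urls = sitemap_urls("https://www.example.com/")
    if urls:
        for url in urls:
            pass    # index each listed document instead of crawling the site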
The software will try to convert legacy charsets to UTF-8. This may reduce
the maximum line length to as little as 1023 bytes and the maximum URL
and title sizes to as little as 507 bytes.
A lot of people get their charset wrong, so in case of ISO-8859-1, Windows-1252
is assumed, and in case of ISO-8859-9, Windows-1254 is assumed.
Initially the charset is derived from the HTTP response header.
If this doesn't specify the charset, the software will try to get it
from an HTML HEAD meta http-equiv or meta charset. The meta needs to appear
before the title; on reading the title, the charset is 'locked'.
If no charset is specified at all, UTF-8 is assumed.
Note: The software will only convert charsets while crawling!
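For illustration, the charset-detection order described above could be sketched
in Python like this; the function and the simplified regular expressions are
assumptions, not the crawler's actual code:

    import re

    ALIASES = {"iso-8859-1": "windows-1252", "iso-8859-9": "windows-1254"}

    def pick_charset(content_type_header, html_head):
        # 1. HTTP response header first.
        m = re.search(r"charset=([\w-]+)", content_type_header or "", re.I)
        if not m:
            # 2. Otherwise a meta before the title; after the title the
            #    charset is considered 'locked'.
            before_title = re.split(r"<title", html_head, maxsplit=1, flags=re.I)[0]
            m = re.search(r"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""",
                          before_title, re.I)
        # 3. Otherwise assume UTF-8; map the commonly misdeclared charsets.
        charset = m.group(1).lower() if m else "utf-8"
        return ALIASES.get(charset, charset)

    raw = b"..."                                  # bytes as fetched
    text = raw.decode(pick_charset("text/html; charset=ISO-8859-1", ""),
                      errors="replace")           # handled as UTF-8 text from here on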
With the '-p' option the software will index PDF files as well. To this end, PDF files are temporarily saved and then converted to text by pdftotext. The output of pdftotext is then indexed. Unless debug (the '-d' option) is used, PDF files are removed after indexing.
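A minimal sketch of that flow, assuming Python and the external pdftotext
command-line tool; the file handling details here are illustrative:

    import os
    import subprocess
    import tempfile

    def pdf_to_text(pdf_bytes, keep=False):
        # Save the PDF temporarily, convert it with pdftotext, return the text.
        with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
            tmp.write(pdf_bytes)
            pdf_path = tmp.name
        txt_path = pdf_path + ".txt"
        try:
            subprocess.run(["pdftotext", pdf_path, txt_path], check=True)
            with open(txt_path, encoding="utf-8", errors="replace") as fh:
                return fh.read()
        finally:
            if not keep:                          # debug ('-d') would keep the files
                for path in (pdf_path, txt_path):
                    if os.path.exists(path):
                        os.remove(path)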
The software generates all the files the
web-form needs. It tries to find the titles of the
indexed web-pages. If this fails, it uses the full URL instead.
The default behaviour is to do an HTTP HEAD before an HTTP GET. If the
document isn't text, HTML, a sitemap, or PDF, it won't do an HTTP GET. This means
that all the websites crawled this way should support HEAD requests.
With the '-b' option it will do a GET, analyse the HTTP response header
and abort the download if the Content-Type isn't Text, HTML, robots.txt,
sitemap-xml, or PDF.
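The two fetch strategies could be sketched in Python like this; the accepted-type
list and the helper names are simplified assumptions, not the software's actual rules:

    import urllib.request

    WANTED = ("text/", "application/pdf", "application/xml", "text/xml")

    def wanted_type(content_type):
        return content_type.startswith(WANTED)

    def fetch(url, b_option=False):
        if not b_option:
            # Default: HEAD first, then GET only documents of a wanted type.
            head = urllib.request.Request(url, method="HEAD")
            with urllib.request.urlopen(head) as resp:
                if not wanted_type(resp.headers.get("Content-Type", "")):
                    return None
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        # '-b': start the GET, check the header, abort before reading the body.
        with urllib.request.urlopen(url) as resp:
            if not wanted_type(resp.headers.get("Content-Type", "")):
                return None       # closing the response aborts the download
            return resp.read()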
Non-fetched and non-indexed documents get a 'fake title' in the file 'links-list' (or 'num-links.list' in case of a text dump):
<a href="Url">XXXX XXXX DDD Type</a>
The 'Type' here is one of the six content-types the software keeps track of:
Text, Html, Robots, SiteMap, Pdf and Other.
Text is text/* except text/html. So that's sources, diffs, scripts, makefiles,
man pages, anything text.
Robots is '/robots.txt'.
SiteMap is '/sitemap.xml' or a sitemap-xml file referred to in '/robots.txt'.
Some sitemaps refer to other sitemaps. These are supported as well.
Note: In case of an HTTP response code other than 200, the web server may state
the content-type as 'html' even though it isn't.
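A sketch in Python of how a document could be mapped to one of these six types;
the exact rules of the software may differ, and the set of sitemap URLs is
assumed to have been collected from '/robots.txt':

    from urllib.parse import urlparse

    def classify(url, content_type, sitemap_urls=()):
        # Map a document to Text, Html, Robots, SiteMap, Pdf or Other.
        path = urlparse(url).path
        if path == "/robots.txt":
            return "Robots"
        if path == "/sitemap.xml" or url in sitemap_urls:
            return "SiteMap"
        if content_type == "text/html":
            return "Html"
        if content_type == "application/pdf":
            return "Pdf"
        if content_type.startswith("text/"):
            return "Text"
        return "Other"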
These 'fake titles' are for debugging purposes and should never appear in
URL lists produced by the search form. If they do, there is something wrong.
Sites which are not in the URL list file shouldn't be in links-list at all.
With '-a' abstracts are produced. Each abstract is 384 bytes and contains a
four-byte document number, 94 four-byte word numbers and a four-byte
terminating null. Unused word-number slots are zero-padded.
Web-documents that aren't indexed (e.g. images) get an abstract
consisting of 384 null bytes. In text dumps these show up as "0000".
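For illustration, packing such an abstract in Python could look like this;
the little-endian byte order is an assumption on my part:

    import struct

    def pack_abstract(doc_number, word_numbers):
        # 4-byte doc number, 94 4-byte word numbers (zero-padded), 4-byte null.
        words = list(word_numbers)[:94]
        words += [0] * (94 - len(words))
        return struct.pack("<96I", doc_number, *words, 0)

    abstract = pack_abstract(7, [12, 45, 300])
    assert len(abstract) == 384                   # 4 + 94*4 + 4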
Alphabetical list of all internal links.
Tab-delimited file of: URL number, Redirect number, HTTP response code,
Content-type, Charset and URL.
When the charset is not specified or the charset is UTF-8 it lists
'Default'. In case of ISO-8859-1 it lists 'Windows-1252' and in case of
ISO-8859-9 'Windows-1254'.
Dead internal links show up with HTTP response code 404.
Redirects to external URLs have redirect number 0000.
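Reading that tab-delimited dump could be sketched in Python as follows; the
field names and the checks are illustrative only:

    import csv

    FIELDS = ("url_no", "redirect_no", "status", "content_type", "charset", "url")

    def read_dump(path):
        with open(path, newline="", encoding="utf-8") as fh:
            for row in csv.reader(fh, delimiter="\t"):
                rec = dict(zip(FIELDS, row))
                if rec["status"] == "404":
                    print("dead internal link:", rec["url"])
                if rec["redirect_no"] == "0000":
                    print("redirects to an external URL:", rec["url"])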
Alphabetical list of all external links. Enabled with the '-e' option.
The file can be fed to an external link checker. This way you can keep track
of dead links on your website. The script 'chk-rem-lnk.sh' can be used for
this purpose. You need both Lynx and Curl for this script.
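For those who prefer not to depend on Lynx and Curl, a rough Python equivalent
of what such a checker does is sketched below; it assumes one URL per line,
which may not match the actual file format:

    import urllib.error
    import urllib.request

    def check_links(path):
        with open(path, encoding="utf-8") as fh:
            for url in (line.strip() for line in fh if line.strip()):
                try:
                    req = urllib.request.Request(url, method="HEAD")
                    with urllib.request.urlopen(req, timeout=10) as resp:
                        status = resp.status
                except urllib.error.HTTPError as err:
                    status = err.code
                except Exception as err:
                    status = str(err)
                print(status, url)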