
SputNlBot options

-a
Generate abstract file.
-c
Crawl instead of index.
This also enables charset conversion.
-b
Broken server: Do a combined HEAD and GET.
Some broken servers answer a HEAD request as if it were a GET; others answer with a 403 or 500 error.
The default behaviour of the software is to do a HEAD and then, if the content-type turns out to be text, HTML, robots.txt, sitemap-xml or PDF, a GET.
With the '-b' option, the software will do a GET right away and abort the download if the content-type isn't text, HTML, robots.txt, sitemap-xml or PDF. These aborted downloads are reported as: 'Operation was aborted by an application callback'.
The file may be downloaded completely anyway, especially a small file over a fast link. The abort works best when downloading large files over slow links.
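The difference between the default behaviour and '-b' can be sketched in a few lines of Python. This is only an illustration of the decision described above, not the bot's own code: the function names and the MIME prefixes standing in for "text, HTML, robots.txt, sitemap-xml or PDF" are assumptions, and no network traffic is involved.

```python
# MIME prefixes assumed to cover text, HTML, robots.txt, sitemap-xml and PDF.
# The real list used by the software is not documented here.
WANTED_PREFIXES = ("text/", "application/pdf", "application/xml")


def is_wanted(content_type: str) -> bool:
    """Return True if the content type is one the indexer would download."""
    main_type = content_type.split(";", 1)[0].strip().lower()
    return main_type.startswith(WANTED_PREFIXES)


def plan_requests(content_type: str, broken_server: bool) -> list[str]:
    """List the HTTP steps taken for one URL under each strategy."""
    if broken_server:
        # '-b': GET right away, abort once the header shows an unwanted type.
        return ["GET"] if is_wanted(content_type) else ["GET", "ABORT"]
    # Default: HEAD first, then GET only for wanted types.
    return ["HEAD", "GET"] if is_wanted(content_type) else ["HEAD"]
```

With '-b' an unwanted document still costs one GET that is cut short, which is why the abort pays off mainly for large files over slow links.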
-d
Enable debug.
-e
Enable external link report.
-f List_Of_Sites
List of files to be indexed. With '-c': list of websites to be crawled.
-k
Use compact word list file format.
This will also reduce the size of the word list in memory. It requires a bit more processing though.
-l
Allow more than 64k words.
-m
Use sitemap.
The software will first try to append 'sitemap.xml' to the URL. If this doesn't work, it will try to fetch robots.txt to see if a sitemap is listed there.
If a sitemap is found, the site isn't crawled; the URLs listed in the sitemap are indexed instead. This also means that there is no external link report for this site.
Note: There is no real robots.txt support. Apart from a sitemap reference, all other contents of robots.txt are ignored.
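The two-step sitemap lookup can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the bot's implementation: both function names are hypothetical, and whether the software inserts a slash before 'sitemap.xml' is a guess.

```python
from urllib.parse import urljoin


def sitemap_candidate(site_url: str) -> str:
    """First attempt: append 'sitemap.xml' to the site URL."""
    # Assumption: a missing trailing slash is added before appending.
    base = site_url if site_url.endswith("/") else site_url + "/"
    return urljoin(base, "sitemap.xml")


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Fallback: collect 'Sitemap:' entries; all other lines are ignored."""
    found = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found
```

Note that 'Disallow' and 'User-agent' lines fall through untouched, which matches the statement that there is no real robots.txt support.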
-o
Use orthography / synonyms.
Index synonyms as if they were part of the text. This translates GB to US and US to GB spelling.
Only active with '-u'.
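What "index synonyms as if they were part of the text" could look like is sketched below in Python. The two-entry spelling table and the function name are hypothetical; the software's actual synonym data and lookup are not described here.

```python
# Hypothetical spelling table for illustration only.
GB_TO_US = {"colour": "color", "centre": "center"}
US_TO_GB = {us: gb for gb, us in GB_TO_US.items()}


def words_to_index(words):
    """Return each word followed by its other-spelling synonym, if any."""
    out = []
    for word in words:
        out.append(word)
        synonym = GB_TO_US.get(word) or US_TO_GB.get(word)
        if synonym:
            out.append(synonym)
    return out
```

A document containing only "colour" would then also be found by a search for "color", and the other way round.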
-p
Index PDF files.
pdftotext needs to be installed for this.
-r
Re-use old wordlist.
Update this list to become the new wordlist.
-s
Print word stats.
Warning: Long list!
-t
Text output.
Can be used for debugging.
-u
Index non-ASCII.
This assumes UTF-8.
Note: Without this option the indexer will treat all non-ASCII as word delimiters.
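The effect of treating non-ASCII characters as word delimiters can be shown with a small Python sketch. It assumes that words consist of letters and digits only; the indexer's real rules for word characters may differ, and the function name is made up for this example.

```python
import re


def split_words(text: str, index_non_ascii: bool) -> list[str]:
    """Split text into words, with or without non-ASCII support."""
    if index_non_ascii:
        # '-u': Unicode letters and digits stay together in one word.
        pattern = r"[^\W_]+"
    else:
        # Default: anything outside ASCII letters and digits is a delimiter.
        pattern = r"[A-Za-z0-9]+"
    return re.findall(pattern, text)
```

Under these assumptions, a word containing an accented letter is cut into fragments on either side of that letter unless non-ASCII indexing is on.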
-v
Print version and exit.
-w Wait_Time
Wait time between documents in seconds, with millisecond resolution.
Default: 1 s.