Download the source: webcrawl.tar.gz
Extract with 'tar xvfz webcrawl.tar.gz'.
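If you want to see what is in the archive first, 'tar tfz' lists the contents without extracting;
~$ tar tfz webcrawl.tar.gz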
You can compile the crawler source into a plain indexer;
~$ cc -O2 -Wall -o cgi-index cgi-crawl.c
Defining CGS_WITH_HTTP with '-D' gives you a crawler instead;
~$ cc -O2 -Wall -DCGS_WITH_HTTP -lcurl -o cgi-crawl cgi-crawl.c
For this to work, you need the libcurl development package (named libcurl-devel, libcurl4-gnutls-dev, or similar, depending on your distribution) and all of its dependencies installed. The ldd output below clearly shows the difference;
~$ ldd cgi-index
        linux-vdso.so.1 (0x00007ffdb9108000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3b8d254000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f3b8d823000)
~$ ldd cgi-crawl
        linux-vdso.so.1 (0x00007ffd733e5000)
        libcurl-gnutls.so.4 => /usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4 (0x00007f9993956000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f99935b7000)
        libnghttp2.so.14 => /usr/lib/x86_64-linux-gnu/libnghttp2.so.14 (0x00007f9993390000)
        libidn2.so.0 => /usr/lib/x86_64-linux-gnu/libidn2.so.0 (0x00007f999316e000)
        librtmp.so.1 => /usr/lib/x86_64-linux-gnu/librtmp.so.1 (0x00007f9992f51000)
        libssh2.so.1 => /usr/lib/x86_64-linux-gnu/libssh2.so.1 (0x00007f9992d24000)
        libpsl.so.5 => /usr/lib/x86_64-linux-gnu/libpsl.so.5 (0x00007f9992b16000)
        libnettle.so.6 => /usr/lib/x86_64-linux-gnu/libnettle.so.6 (0x00007f99928df000)
        libgnutls.so.30 => /usr/lib/x86_64-linux-gnu/libgnutls.so.30 (0x00007f9992546000)
        libgssapi_krb5.so.2 => /usr/lib/x86_64-linux-gnu/libgssapi_krb5.so.2 (0x00007f99922fb000)
        libkrb5.so.3 => /usr/lib/x86_64-linux-gnu/libkrb5.so.3 (0x00007f9992021000)
        libk5crypto.so.3 => /usr/lib/x86_64-linux-gnu/libk5crypto.so.3 (0x00007f9991dee000)
        libcom_err.so.2 => /lib/x86_64-linux-gnu/libcom_err.so.2 (0x00007f9991bea000)
        liblber-2.4.so.2 => /usr/lib/x86_64-linux-gnu/liblber-2.4.so.2 (0x00007f99919db000)
        libldap_r-2.4.so.2 => /usr/lib/x86_64-linux-gnu/libldap_r-2.4.so.2 (0x00007f999178a000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f9991570000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f9991353000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f9993e4b000)
        libunistring.so.0 => /usr/lib/x86_64-linux-gnu/libunistring.so.0 (0x00007f999103c000)
        libhogweed.so.4 => /usr/lib/x86_64-linux-gnu/libhogweed.so.4 (0x00007f9990e07000)
        libgmp.so.10 => /usr/lib/x86_64-linux-gnu/libgmp.so.10 (0x00007f9990b84000)
        libgcrypt.so.20 => /lib/x86_64-linux-gnu/libgcrypt.so.20 (0x00007f9990874000)
        libp11-kit.so.0 => /usr/lib/x86_64-linux-gnu/libp11-kit.so.0 (0x00007f999060f000)
        libidn.so.11 => /lib/x86_64-linux-gnu/libidn.so.11 (0x00007f99903db000)
        libtasn1.so.6 => /usr/lib/x86_64-linux-gnu/libtasn1.so.6 (0x00007f99901c8000)
        libkrb5support.so.0 => /usr/lib/x86_64-linux-gnu/libkrb5support.so.0 (0x00007f998ffbc000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f998fdb8000)
        libkeyutils.so.1 => /lib/x86_64-linux-gnu/libkeyutils.so.1 (0x00007f998fbb4000)
        libresolv.so.2 => /lib/x86_64-linux-gnu/libresolv.so.2 (0x00007f998f99d000)
        libsasl2.so.2 => /usr/lib/x86_64-linux-gnu/libsasl2.so.2 (0x00007f998f782000)
        libgpg-error.so.0 => /lib/x86_64-linux-gnu/libgpg-error.so.0 (0x00007f998f56e000)
        libffi.so.6 => /usr/lib/x86_64-linux-gnu/libffi.so.6 (0x00007f998f365000)
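If the crawler build fails because the curl headers are missing, install the development package first. On a Debian-based system, for example, the package matching the GnuTLS flavour of libcurl shown above would be;
~# apt-get install libcurl4-gnutls-dev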
The crawler has an additional '-c' option, which makes it crawl instead of
just index.
If you want to index PDF files ('-p' option), you also need pdftotext, which is part of poppler-utils.
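On a Debian-based system, that package installs straight from the repositories, and you can test the converter by hand; presumably the indexer calls it in much the same way;
~# apt-get install poppler-utils
~$ pdftotext document.pdf document.txt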
The software comes with a small Makefile. A 'make' will compile all of the binaries;
~$ make
cc -O2 -Wall -o cgi-index cgi-crawl.c
cc -O2 -Wall -DCGS_WITH_HTTP -lcurl -o cgi-crawl cgi-crawl.c
cc -O2 -Wall -o cgi-search cgi-search.c
cc -O2 -Wall -o findwebpath findwebpath.c
cc -O2 -Wall -o fndtitle fndtitle.c
cc -O2 -Wall -o gen-num-index gen-num-index.c
cc -O2 -Wall -o gensynontab gensynontab.c
cc -O2 -Wall -o url2file url2file.c
If the compilation of cgi-search causes problems, see: custom.h.
A 'make install' will run the install script. Do this as root;
~# make install
Binaries and scripts are installed in '/usr/local/bin/'.
Man pages are installed in '/usr/local/share/man/man1/'.
Documentation is installed in '/usr/local/share/doc/websearch/'.
The search form is installed in '/var/www/search/'.
If any of these directories do not exist, the install script will create them
for you.
Files will only be copied if they do not already exist in the target directory
or if the version in the target directory is older.
The install script also creates the data directory '/var/local/lib/websearch/' and its subdirectories.
You need to set the ownership of these directories to the indexer process owner: if the indexer runs as user 'foo' and group 'bar', set the following permissions;
~# cd /var/local/lib/
~# chmod g+w websearch
~# chown :bar websearch
~# cd websearch/
~# chown -R foo:bar *
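You can verify the result with a plain 'ls';
~$ ls -ld /var/local/lib/websearch /var/local/lib/websearch/*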
Put the sites you want to crawl in a list file ('urls.list' in the example script);
http://www.example.org/
https://www.example.com/
And run the crawl script.
Don't run the script as root.
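As a rough sketch, a minimal wrapper could look like the lines below. How the URLs are handed to cgi-crawl here is an assumption; the shipped example script and the man page are the authoritative reference for the real invocation;
#!/bin/sh
# Hypothetical crawl wrapper -- run it as the indexer user ('foo'), never as root.
# ASSUMPTION: 'cgi-crawl -c' takes a start URL as its argument; check the
# example script that comes with the software for the real syntax.
while read -r url; do
    cgi-crawl -c "$url"
done < urls.list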
The installation and use of the indexer are the same as before.