HarvestMan Command line FAQ
HarvestMan has supported command line options for some time. However, the earlier options were not very user-friendly and were confusing to use. They had also gone without updates for a long time, so most of them did not work as expected.
This has changed with HarvestMan 1.4.5 beta 1. From that release onwards, HarvestMan supports a limited set of well-documented, user-friendly command line options. And they actually do what they are supposed to. :-)
At your command prompt, pass the -h or --help option to HarvestMan to get help on the command line options, as shown below.
$ harvestman.py -h
To find out which version of HarvestMan you are running, pass the -v or --version option, as shown below.
$ harvestman.py -v
Version: 1.4.5 beta 1
The simplest way to download a website is to pass just its URL to HarvestMan on the command line, as shown below for the hypothetical URL http://www.foo.com/bar.
$ harvestman.py http://www.foo.com/bar
From 1.4.5 beta 1, the base directory and the project name are no longer
mandatory on the command line.
If you do not pass a base-directory, HarvestMan uses the current directory
as the base-directory to save the files.
If you do not pass a project name, HarvestMan extracts the domain name from
the URL and uses it as the project name.
So in the above example, all your files will get saved under the sub-directory www.foo.com in your current working directory.
Note that the base directory and project name are optional only when the program is run from the command line. If you are using a config file, they are still compulsory.
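The defaults described above can be pictured with a small Python sketch. This is not HarvestMan's own code: the helper names default_project_name and default_save_dir are hypothetical, and the handling of URLs without a scheme is an assumption made for illustration.

```python
import os
from urllib.parse import urlsplit


def default_project_name(url):
    """Return the domain part of a URL, for use as a default project name."""
    # URLs such as "www.python.org/doc" carry no scheme; add one so that
    # urlsplit() puts the host in the network-location field (assumption).
    if "://" not in url:
        url = "http://" + url
    return urlsplit(url).hostname


def default_save_dir(url, basedir=None, project=None):
    """Join base directory and project name, falling back to the defaults."""
    basedir = basedir or os.getcwd()                 # default: current directory
    project = project or default_project_name(url)  # default: domain name
    return os.path.join(basedir, project)


print(default_project_name("http://www.foo.com/bar"))
print(default_save_dir("http://www.foo.com/bar"))
```

With neither value supplied, the sketch resolves to a www.foo.com sub-directory of the current working directory, which matches the behaviour described in this answer.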
To pass in a base directory, use the -b or --basedir option.
To pass in a project name use the -p or --project option.
Here is an actual example.
$ harvestman.py -b /tmp/websites -p pydoc www.python.org/doc
This will mirror the URL http://www.python.org/doc in the sub-directory pydoc under /tmp/websites.
Here is the other way of doing it, using long options.
$ harvestman.py --basedir /tmp/websites --project pydoc www.python.org/doc
Yes, you can use HarvestMan from behind a proxy server or firewall. It takes the following options for this purpose.
- -Y, --proxy - Proxy server name/ip in the form of host:port.
- -U, --proxyuser - Username for the proxy, if any.
- -W, --proxypass - Password for the username, if any.
Here is an example. Assume that you are behind the firewall at http://proxy.mycompany.com, running on port 8080. Also assume that the proxy requires authentication, with the username alice and the password bob. Here is how to supply the proxy settings to HarvestMan.
$ harvestman.py -Y proxy.mycompany.com:8080 -U alice -W bob http://www.foo.com/bar
Here is the alternate way to do it, by using the long option names.
$ harvestman.py --proxy proxy.mycompany.com:8080 --proxyuser alice --proxypass bob http://www.foo.com/bar
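If you launch HarvestMan from a script, the proxy arguments can be assembled programmatically. The following is a minimal Python sketch; the helper proxy_options is hypothetical and only uses the long option names listed in this answer.

```python
def proxy_options(proxy, user=None, password=None):
    """Build the proxy-related arguments for a harvestman.py command line."""
    args = ["--proxy", proxy]            # proxy is given as host:port
    if user:
        args += ["--proxyuser", user]    # only added when a username is set
    if password:
        args += ["--proxypass", password]
    return args


command = ["harvestman.py"]
command += proxy_options("proxy.mycompany.com:8080", "alice", "bob")
command.append("http://www.foo.com/bar")
print(" ".join(command))
```

Keeping the arguments in a list, rather than one long string, means the list can be handed directly to a process launcher without any extra quoting.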
From 1.4.5 beta 1, HarvestMan supports a nocrawl option on the command line. This lets you use HarvestMan much like programs such as wget.
To only download a URL and not try to crawl it, pass the -N or --nocrawl option to HarvestMan along with the URL.
Example: Downloading only the index.html from http://www.python.org.
$ harvestman.py -N http://www.python.org
or
$ harvestman.py --nocrawl http://www.python.org
Note that since the nocrawl option saves just the starting URL, the base directory and project options make no sense with it. In fact, if they are passed, the program ignores them. The URL is always saved to the current working directory.
As of 1.4.5 beta 1, HarvestMan has 26 command line options, including the help option.
The command line options of HarvestMan save you the trouble of creating a config file for your downloads. They cover most of the commonly configured settings, so you do not lose anything when you switch to the command line. You also get started faster, since there is no need to spend time creating or customizing a config file.
The nocrawl option is supported only on the command line, not when using the config file. Also, the base directory and project name may be left out on the command line, but not in the config file. See answers to questions 5 and 8 above.
Yes, you can run HarvestMan with a config file of your choice.
By default, if run with no command line options, HarvestMan looks for a config file named config.xml in the current directory. However, you can use the command line option -C or --configfile to pass in a different config file name or path.
For example, this is how to run HarvestMan using the config file myconfig.xml.
$ harvestman.py -C myconfig.xml
Note that if you use the -C or --configfile option, all the other command line options will be ignored.
If HarvestMan is run with no command line options and cannot find a config file, the program will exit, displaying an error message like the one shown below.
Fatal error: Cannot find config file
Create or copy a config file to this directory or run the program with the -C option to use a different config file
Yes, you can run a previous project again. First locate the project file of your previous project. It will be saved as <project-name>.hbp in your base directory. (That is the project name with a .hbp extension.)
Now run HarvestMan with the -P or --projectfile option, passing in the location of this file. For example:
$ harvestman.py -P myproject.hbp
Note that the -p option is for passing a project name and the -P option is for passing a project file. Do not confuse the two.
The default fetch level is 0. To set a different one, use the -f or --fetchlevel option.
Caching is enabled by default. To change this, use the -c or --cache option.
For example, to disable caching:
$ harvestman.py -c 0 http://www.foo.com/bar
Note that you can also pass no or off instead of 0 to turn off an option that is enabled by default.
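The accepted "off" values can be pictured with a short Python sketch. It is an illustration only, not HarvestMan's actual parsing code: the function name parse_switch is hypothetical, and the "on" counterparts 1, yes and on, as well as the case-insensitive matching, are assumptions that this FAQ does not confirm.

```python
def parse_switch(value):
    """Interpret an on/off style option value as a boolean."""
    text = str(value).strip().lower()
    if text in ("0", "no", "off"):   # the "off" spellings named in this FAQ
        return False
    if text in ("1", "yes", "on"):   # assumed counterparts, not from the FAQ
        return True
    raise ValueError("unrecognised switch value: %r" % (value,))


for candidate in ("0", "no", "off"):
    print(candidate, parse_switch(candidate))
```

Rejecting unknown values with an error, instead of silently treating them as "on", is a design choice of the sketch; it makes a mistyped value visible immediately.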
To set the number of network connections, use the -n or --connections option.
To set the maximum number of files to download, use the -M or --maxfiles option.
To control the number of main threads (trackers), use the -T or --maxthreads option.
To control the number of sub threads (workers), use the -w or --workers option.
Note that these are advanced options, which you will not normally need to tune when using HarvestMan.
To set a time limit for the run, use the -t or --timelimit option.
No, the URL does not have to be the last argument on the command line. However, putting it last is good practice, since it keeps the arguments uncluttered.
For example, the following three commands are equivalent. They differ only in the position of the URL argument.
$ harvestman.py -b ~/websites -p myproj -c no http://www.foo.com -t 1200
$ harvestman.py http://www.foo.com -b ~/websites -p myproj -c no -t 1200
$ harvestman.py -b ~/websites -p myproj -c no -t 1200 http://www.foo.com
All of them will work correctly. However, the preferred style is to put the URL as the last argument.
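Position independence of this kind is what a standard option parser provides. The sketch below uses Python's argparse module to show the idea with the options from the three commands above; it is an assumption-laden illustration, not HarvestMan's actual parser, which may be implemented differently.

```python
import argparse

# A hypothetical parser covering only the options used in the examples.
parser = argparse.ArgumentParser(prog="harvestman.py")
parser.add_argument("-b", "--basedir")
parser.add_argument("-p", "--project")
parser.add_argument("-c", "--cache")
parser.add_argument("-t", "--timelimit")
parser.add_argument("url")  # the single positional argument

commands = [
    "-b ~/websites -p myproj -c no http://www.foo.com -t 1200",
    "http://www.foo.com -b ~/websites -p myproj -c no -t 1200",
    "-b ~/websites -p myproj -c no -t 1200 http://www.foo.com",
]

# Parse each variant and compare the resulting settings.
results = [vars(parser.parse_args(line.split())) for line in commands]
print(results[0] == results[1] == results[2])
```

Because each option consumes its own value, the parser can pick out the remaining bare argument as the URL wherever it appears.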
If your question is not answered here, email me with it. If it is valid and generic enough for the FAQ, I will add it here. My email address is on the main page.