HarvestMan - The HarvestMan Web Crawler


HarvestMan Command line FAQ

1. Does HarvestMan support command-line options ?

HarvestMan has supported command-line options for some time. However, the options were not very user-friendly and were confusing to use. They had also not been updated for a long time, so most of them did not work as expected.

This changed with HarvestMan 1.4.5 beta 1. From that release onwards, HarvestMan supports a limited set of well-documented, user-friendly command-line options. And they actually do what they are supposed to. :-)

2. How do I invoke the help on command-line support ?

On your command prompt, pass the -h or --help option to HarvestMan to get help on the command-line options, as shown below.

$ harvestman.py -h

3. How to see the current version information ?

Pass the -v or --version option to HarvestMan, as shown below.

$ harvestman.py -v
Version: 1.4.5 beta 1

4. Show me an example of crawling a website by using HarvestMan on the command line.

The simplest way of doing this is to pass just the website URL to HarvestMan on the command line, as shown below for the hypothetical URL http://www.foo.com/bar .

$ harvestman.py http://www.foo.com/bar

5. In the earlier question, you are not passing a base directory or project name. Is it not required ?

From 1.4.5 beta 1, the base directory and project name are no longer mandatory on the command line.

If you do not pass a base-directory, HarvestMan uses the current directory as the base-directory to save the files.

If you do not pass a project name, HarvestMan extracts the domain name from the URL and uses it as the project name.

So in the above example, all your files will get saved under the sub-directory www.foo.com in your current working directory.

Note that the base-directory and project name are optional only when the program is run from the command line. If you are using the config file, they are still compulsory.
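
The defaulting rules described in this answer can be illustrated with a short Python sketch. This is not HarvestMan's own code: the function names default_project_name and default_save_dir are hypothetical, and treating a URL without a scheme as http:// is an assumption made for the illustration.

```python
import os
from urllib.parse import urlparse


def default_project_name(url):
    """Return the domain part of a URL, e.g. 'www.foo.com'."""
    # Assumption: a URL given without a scheme is treated as http://
    if "://" not in url:
        url = "http://" + url
    return urlparse(url).netloc


def default_save_dir(url, basedir=None, project=None):
    """Directory where files would be saved under the rules above."""
    # No base directory given: fall back to the current directory.
    basedir = basedir or os.getcwd()
    # No project name given: fall back to the domain name of the URL.
    project = project or default_project_name(url)
    return os.path.join(basedir, project)


print(default_project_name("http://www.foo.com/bar"))
```

With no base directory and no project name, the sketch resolves http://www.foo.com/bar to a www.foo.com sub-directory of the current working directory, matching the behaviour described above.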

6. So how do I pass in a base directory or project name option to the program on the command line ?

To pass in a base directory option use the -b or --basedir option.
To pass in a project name use the -p or --project option.

Here is an actual example.

$ harvestman.py -b /tmp/websites -p pydoc www.python.org/doc

This will mirror the URL http://www.python.org/doc in the sub-directory pydoc under /tmp/websites .

Here is the other way of doing it, using long options.

$ harvestman.py --basedir /tmp/websites --project pydoc www.python.org/doc 
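
If you drive HarvestMan from a script, the two forms above can be generated from the same data. The following is a minimal sketch, assuming only the four option names shown in this answer; build_command is a hypothetical helper and is not part of HarvestMan.

```python
import shlex

# Option names taken from the examples above.
SHORT = {"basedir": "-b", "project": "-p"}
LONG = {"basedir": "--basedir", "project": "--project"}


def build_command(url, basedir=None, project=None, long_form=False):
    """Build an argument list for harvestman.py (hypothetical helper)."""
    names = LONG if long_form else SHORT
    argv = ["harvestman.py"]
    if basedir is not None:
        argv += [names["basedir"], basedir]
    if project is not None:
        argv += [names["project"], project]
    # The URL goes last, following the style used throughout this FAQ.
    argv.append(url)
    return argv


short_cmd = build_command("www.python.org/doc", "/tmp/websites", "pydoc")
long_cmd = build_command("www.python.org/doc", "/tmp/websites", "pydoc", long_form=True)
print(shlex.join(short_cmd))
print(shlex.join(long_cmd))
```

The resulting list can be handed to subprocess.run as-is, which avoids shell quoting problems with unusual directory names.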

7. I work behind a proxy/firewall. Can I pass the proxy settings to HarvestMan in the command line ?

Yes, you can. HarvestMan takes the -Y (--proxy), -U (--proxyuser) and -W (--proxypass) options for a proxy server/firewall.

Here is an example. Assume that you are behind the firewall at http://proxy.mycompany.com running at port 8080. Also assume that the proxy requires authentication. Your username is alice and password is bob . Here is how to supply the proxy settings to HarvestMan .

$ harvestman.py -Y proxy.mycompany.com:8080 -U alice -W bob http://www.foo.com/bar

Here is the alternate way to do it, by using the long option names.

$ harvestman.py --proxy proxy.mycompany.com:8080 --proxyuser alice --proxypass bob http://www.foo.com/bar
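
In the examples above, the proxy value combines host and port as host:port. A wrapper script might want to check that format before calling HarvestMan. The sketch below shows one way to do so; split_proxy is a hypothetical helper, and how HarvestMan itself parses the value is not covered in this FAQ.

```python
def split_proxy(value):
    """Split a 'host:port' string into (host, port)."""
    # rpartition splits on the last colon, so the port is the final field.
    host, sep, port = value.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError("expected host:port, got %r" % value)
    return host, int(port)


host, port = split_proxy("proxy.mycompany.com:8080")
print(host, port)
```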

8. I just want to download a URL, not crawl it. How do I do it ?

From 1.4.5 beta 1, HarvestMan supports a no-crawl option on the command line. This lets you use HarvestMan much like programs such as wget .

To only download a URL and not try to crawl it, pass the -N or --nocrawl option to HarvestMan along with the URL.

Example: Downloading only the index.html from http://www.python.org .

$ harvestman.py -N http://www.python.org
or
$ harvestman.py --nocrawl http://www.python.org

Note that since the nocrawl option saves just the starting URL, the base directory and project options make no sense with it. In fact, if they are passed, the program ignores them. The URL is always saved to the current working directory.

9. How many command line options are there in total ?

From 1.4.5 beta 1, there are 26 command line options in total, including the help option.

10. Why should I use the command line ? Are there any advantages of using the command line ?

The command line options of HarvestMan save you the trouble of creating a config file for your downloads. They cover most of the commonly configured options, so you are not losing anything when you switch to the command line. In addition, the overall time to run the program is reduced, since there is no need to spend time creating or customizing a config file.

The nocrawl option is supported only on the command line, not when using the config file. Also, if you don't want to bother specifying a base directory or project name, that is fine on the command line, but not when using the config file. See the answers to questions 5 and 8 above.

11. Can I use a mix of command line and config file options ?

Yes, you can.

By default, HarvestMan will look for the config file named config.xml in the current directory, if run with no command line options. However, you can use the command line option -C or --configfile to pass in a new config file name or path.

For example, this is how to run HarvestMan using the config file myconfig.xml .

$ harvestman.py -C myconfig.xml

Note that if you use the -C or --configfile option, all the other command line options will be ignored.

12. What if the config file passed with the -C argument does not exist ?

The program will exit, displaying an error message like the one shown below.

Fatal error: Cannot find config file 

Create or copy a config file to this directory
or run the program with the -C option to use
a different config file

13. Can I rerun a previous project on the command line ?

Yes, you can. First locate the project file of your previous project. It will be saved as <project-name>.hbp in your base directory. (That is the project name with a .hbp extension.)

Now, run HarvestMan with the -P or --projectfile option, passing in the location of this file. For example:

$ harvestman.py -P myproject.hbp 

Note that the -p option is for passing a project name and the -P option is for passing a project file. Don't confuse the two.
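
The location rule for the project file (the project name plus a .hbp extension, inside the base directory) can be written down as a small Python sketch. The helper names project_file_path and rerun_command are hypothetical; only the .hbp naming and the -P option come from this answer.

```python
import os


def project_file_path(basedir, project):
    """Return the expected location of a saved project file."""
    # The project file is named <project-name>.hbp in the base directory.
    return os.path.join(basedir, project + ".hbp")


def rerun_command(basedir, project):
    """Argument list that reruns a saved project using the -P option."""
    return ["harvestman.py", "-P", project_file_path(basedir, project)]


print(rerun_command("/tmp/websites", "pydoc"))
```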

14. How to pass a different fetch level ?

The default fetch level is 0. To pass a different one, use the -f or --fetchlevel option.

15. How to control caching ?

Caching is enabled by default. To change it, use the -c or --cache option.

For example, to disable caching,

$ harvestman.py -c 0 http://www.foo.com/bar

Note that you can also pass no or off instead of 0 to turn off an option that is enabled by default.
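
The accepted "off" values can be summarised in a short Python sketch. Only 0, no and off are documented in this answer; the matching "on" values (1, yes, on) are an assumption made for symmetry, and parse_switch is a hypothetical name rather than HarvestMan's own function.

```python
def parse_switch(value):
    """Interpret an on/off option value such as the one passed to -c."""
    text = str(value).strip().lower()
    # Documented in this FAQ: 0, no and off turn the option off.
    if text in ("0", "no", "off"):
        return False
    # Assumption: the symmetric values turn the option on.
    if text in ("1", "yes", "on"):
        return True
    raise ValueError("unrecognised switch value: %r" % value)


print(parse_switch("no"))
```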

16. How to control the number of connections ?

Use the -n or --connections option.

17. How to set a limit on the number of files ?

Use the -M or --maxfiles option.

18. How to control number of threads ?

To control the number of main threads (trackers), use the -T or --maxthreads option.

To control the number of sub threads (workers), use the -w or --workers option.

Note that these are advanced options that you will rarely need to tune in normal use of HarvestMan.

19. How to set a time limit to the program ?

Use the -t or --timelimit option.

20. I see that in all the examples, the URL is passed as the last argument. Is it important ?

No, it is not necessary that the URL is the last argument on the command line. However, it is a good practice to do so, since it avoids clutter in arguments.

For example, the following three commands are equivalent. They differ only in the position of the URL argument.

$ harvestman.py -b ~/websites -p myproj -c no http://www.foo.com -t 1200
$ harvestman.py http://www.foo.com -b ~/websites -p myproj -c no -t 1200 
$ harvestman.py -b ~/websites -p myproj -c no -t 1200 http://www.foo.com 

All of them will work correctly. However, the preferred style is to put the URL as the last argument.
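
Position-independent parsing of this kind is common in command-line tools. The sketch below reproduces it with Python's standard argparse module. It is an illustration only, not HarvestMan's actual parser: the program name harvestman-sketch is made up, and treating the time limit as an integer is an assumption.

```python
import argparse


def make_parser():
    """Build a parser with a few of the options used in this FAQ."""
    parser = argparse.ArgumentParser(prog="harvestman-sketch")
    parser.add_argument("-b", "--basedir")
    parser.add_argument("-p", "--project")
    parser.add_argument("-c", "--cache")
    # Assumption: the time limit is a whole number.
    parser.add_argument("-t", "--timelimit", type=int)
    # The URL is the single positional argument.
    parser.add_argument("url")
    return parser


variants = [
    "-b ~/websites -p myproj -c no http://www.foo.com -t 1200",
    "http://www.foo.com -b ~/websites -p myproj -c no -t 1200",
    "-b ~/websites -p myproj -c no -t 1200 http://www.foo.com",
]
# Parse each variant and compare the resulting settings.
parsed = [vars(make_parser().parse_args(v.split())) for v in variants]
print(parsed[0] == parsed[1] == parsed[2])
```

All three argument orders produce the same settings, because the parser matches options by name and treats the one remaining argument as the URL.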

21. My question is not there in the FAQ. What should I do ?

Email me with your question. If it is valid and generic enough for the FAQ, I will add it here. My email address is on the main page.