Welcome
Welcome to the project page of the HarvestMan web crawler.
Companion Website (new)
HarvestMan has a new companion website, thanks to Tom Smith. The new site has more current information, including a frequently updated wiki.
News (Updated May 08 2008)
Read the latest news about HarvestMan.
Development Code
Browse or download the bleeding edge source code.
About HarvestMan
HarvestMan is a web crawler application written in the Python programming language. It can be used to download files from websites according to a number of user-specified rules. The latest version of HarvestMan supports more than 60 customization options. HarvestMan is a console (command-line) application.
HarvestMan is the only open source, multithreaded web-crawler program written in the Python language. HarvestMan is released under the GNU General Public License.
Current Release
The latest release of HarvestMan is 1.4.6.
More information is available on the releases page.
Architecture
See the architecture of HarvestMan.
HarvestMan Configuration
HarvestMan is typically run by reading options from a configuration file. The configuration file is in XML format and is named config.xml by default. The XML format replaces an older text format, in which configuration options were represented as name/value pairs in a text file. This page describes the older format in detail.
Here is a sample config file of HarvestMan.
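The sample file linked above is the authoritative reference for the real schema. As a rough illustration of how options stored in an XML file can be read from Python, here is a minimal sketch using the standard library's xml.etree.ElementTree. The element and attribute names (config, project, download, depth, threads) and the load_options helper are hypothetical. They are not HarvestMan's actual schema or API.

```python
import xml.etree.ElementTree as ET

# Hypothetical config layout for illustration only.
# This is NOT HarvestMan's actual config.xml schema.
SAMPLE = """\
<config>
  <project name="example" url="http://www.example.com/" />
  <download depth="2" threads="4" />
</config>
"""


def load_options(xml_text):
    """Flatten every element's attributes into a 'tag.attribute' -> value dict."""
    root = ET.fromstring(xml_text)
    options = {}
    # iter() walks the root element and all of its descendants.
    for elem in root.iter():
        for key, value in elem.attrib.items():
            options[f"{elem.tag}.{key}"] = value
    return options


options = load_options(SAMPLE)
for name, value in sorted(options.items()):
    print(name, "=", value)
```

All values come back as strings, so a real loader would still need to convert numeric options such as the assumed depth and threads to integers.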
HarvestMan command-line options
HarvestMan also accepts command-line options. The Command line FAQ describes the most important command-line options for HarvestMan.
Developers
The original developer of HarvestMan is Anand B Pillai. Anand is a software professional based in Bangalore, India.
History
For an interesting article on the history of HarvestMan, read this interview.
Downloads
Check the download page for HarvestMan downloads.
Contacts
