HarvestMan Configuration File
This page describes each of the HarvestMan configuration options in detail.
There seems to be a lot of confusion over the different configuration options of HarvestMan, since it uses different kinds of thread classes and has a slew of download control options whose intent may not be clear from the name of the configuration variable. This page is an attempt to clear up that confusion.
The configuration file has different namespaces inside it. These split the configuration options into different sections. At present, the configuration file has the following namespaces.
- project - This section holds the options related to the current HarvestMan project
- network - This section holds the configuration options related to your network connection
- download - This section holds configuration options that affect your downloads in a generic way
- control - This section is similar to the download section, but holds options that affect your downloads in a much more specific way. This is a kind of 'tweak' section that allows you to exert more fine-grained control over your projects.
- system - This section controls the threading options, regional (locale) options and any other options related to the Python interpreter and your computer.
- indexer - This section holds variables related to how the files are processed after downloading. Right now it holds variables related to localizing links.
- files - This section holds variables that control the files created by HarvestMan, namely the error log, the message log and an optional url log.
- display - This holds a single variable related to creating a browser page for all HarvestMan projects on your computer.
Let us now examine each section (namespace) and its configuration variables in detail.
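Every option name carries its namespace as a prefix, as in 'project.url' or 'control.depth'. The following is a minimal sketch of that naming scheme only; the helper name `group_by_namespace` and the sample values are illustrative assumptions and are not part of HarvestMan.

```python
from collections import defaultdict

def group_by_namespace(options):
    """Group dotted option names such as 'project.url' by their namespace."""
    grouped = defaultdict(dict)
    for name, value in options.items():
        # Split on the first dot only: 'project.url' -> ('project', 'url')
        namespace, _, variable = name.partition(".")
        grouped[namespace][variable] = value
    return dict(grouped)

# Sample values are made up for illustration.
sample = {
    "project.url": "http://www.foo.com/bar/index.html",
    "project.name": "foo",
    "download.fetchlevel": 0,
}
grouped = group_by_namespace(sample)
print(sorted(grouped))
```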
The project namespace holds four variables related to your current HarvestMan project. It holds the starting url for your project, the project name, the directory where HarvestMan downloads your project files, and the degree of verbosity of the program. These variables are:
- project.url : The starting url of your project. This has to be a valid url. It may or may not start with a protocol specification like 'http://'. If the url does not have a protocol specification, HarvestMan tries to figure out the protocol on its own. The default protocol is 'http://'.
- project.basedir : The download directory for all your projects. This directory has to be writeable by the user running HarvestMan. If this directory does not exist, HarvestMan will attempt to create it. All files are downloaded to project directories under this directory (explained below).
- project.name : HarvestMan identifies your project by this name. All project files for a project go to a directory with this name inside the main download directory (see above). The project & cache files created for your project also share this name.
- project.verbosity : This variable determines the amount of console messages printed by the program while it is running. The default value of 2 prints general project information and basic information about each url, while suppressing errors. Each level adds a certain degree of verbosity over the previous level, as explained below.
0 -> No console messages are printed.
1 -> Just a starting message about the project & HarvestMan is printed.
2 -> Above, plus messages for each url and its write status.
3 -> Above, plus download filter messages, parser messages and critical error messages.
4 -> This is the first debug level. It prints more download filter messages and most of the error messages.
5 -> This is the final debug level. It provides maximum verbosity, including developer tracebacks.
For most projects, you should use a value of either 2 or 3.
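The fallback to the default protocol described for project.url can be sketched as follows. This is only an illustration: HarvestMan's own protocol detection is not described on this page, and the function name `with_default_protocol` is hypothetical.

```python
def with_default_protocol(url, default="http://"):
    """Prepend the default protocol when the url does not specify one."""
    if "://" in url:
        return url  # a protocol specification is already present
    return default + url

print(with_default_protocol("www.foo.com/bar/index.html"))
# prints http://www.foo.com/bar/index.html
```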
The network namespace allows you to configure your network options, especially proxies. In the 1.4 version, it also allows you to configure an optional url server.
The variables in this section are described below.
- network.proxyserver : If you are connecting to the Internet via a proxy, specify its name or ip address here.
- network.proxyport : The port on which the proxy server listens (if any).
- network.proxyuser : If the proxy requires authentication, specify the username here.
- network.proxypasswd : The password of the user accessing the authenticated proxy.
- network.urlserver : From version 1.4 onwards, HarvestMan allows the user to run a 'url server', which collects and serves new urls for HarvestMan. If this variable is enabled (set to 1), HarvestMan will run a url server (based on the asyncore module) on port number 3081 (default), and use it for storing & retrieving new urls, instead of the standard Queue. This can sometimes offer higher performance and speed, since the url server uses the 'select' multiplexing call of your OS.
- network.urlhost : The host on which the url server runs. Right now, this is a dummy option, since the url server always runs on localhost in its own thread.
- network.urlport : The port number on which the url server listens for connections. This can be modified. The default is 3081.
Note on proxy settings
To create the proxy settings, you have to use the utility named 'proxyobfuscator.py' in the 'genutils' subdirectory of the HarvestMan installation. HarvestMan obfuscates your proxy entries, namely the server, username and password (if any), during configuration file creation, so that this sensitive information is not visible to others. Hence the program assumes that the entries in the config file are obfuscated; if they are not, the program will fail to read them.
The 'genconfig.py' script automatically obfuscates the proxy entries for you so there is no need to use 'proxyobfuscator.py' if you have created your config file using 'genconfig.py'.
If you have not given any network proxy options, HarvestMan assumes a direct connection to the Internet.
The download section allows you to configure your downloads. It and its sister section control can be used to tweak HarvestMan.
- download.fetchlevel : This option sets the fetch level for HarvestMan, which allows you to control how the program behaves when it encounters links belonging to various servers and directories with respect to the starting url.
The default value is 0, which permits downloads only below the directory (on the server) of your starting url. For example, if your starting url is http://www.foo.com/bar/index.html , only the links belonging to the directory http://www.foo.com/bar on the server and its sub-directories will be downloaded.
A value of 1 limits the downloads to the starting server, but does not limit them to the starting url's directory. In the above case, it will download all links from the starting url in the server http://www.foo.com .
The values of 0 and 1 do not do any crawling of external servers. The values of 2 & 3 control crawling of external domains.
A value of 2 lets the program crawl all urls in the starting server, and also any first level links in external servers. A first level link is any url linked directly from any of the pages in the starting server that the crawler encounters. It does not let the program crawl second and higher level links, i.e. urls linked further away, on external servers.
A value of 3 behaves like a combination of levels 0 and 2 minus level 1. Thus it will let the program crawl any first level external links and all links below the starting url's directory, but will not allow it to crawl other links on the starting server.
A value of 4 gives the program a free ride. It lets the program go wherever it feels like on the Internet. Be careful with this level, since you will not be able to restrict the crawling in a meaningful manner. If used, it should be combined with other download control options like depth limits, server/url filters etc.
For most projects, a fetch level of 0 or 1 would suffice. If you want the program to crawl external servers, choose 2 or 3, while remembering that 2 places fewer restrictions on the starting server than 3.
- download.html : This controls download of html files. If you don't want to download & save html files, set this to 0. This does not mean that the program won't work. It will still download & parse web pages for new links, but the difference is that those links (web pages) won't be saved to the disk.
- download.images : This controls download of image files. If you don't want to download images, set this to 0.
- download.cookies : Controls processing of cookies on websites. If set to 1 (default), this will save the cookies in a file named 'cookies.dat' in your project directory. This option is not of much consequence in most projects.
- download.javascript : Download of javascript files hosted on the server is controlled by this option. If enabled (default), it will save the javascript files (they have a ".js" extension) referenced by web pages to the disk.
- download.javaapplet : Controls the download of java applet class files hosted on the server. If enabled (default), it will download java class files to the disk. The parser parses the '<applet>' tags and the 'codebase' param to figure out the location of the java class files.
- download.checkfiles : If enabled (default), this will ask the program to perform an md5 checksum of the downloaded files. After the file is saved to the disk, it is read again and a checksum of its contents is compared to the checksum of the original downloaded data. If they don't match, the program signals an error.
- download.rename : This option tells the program to try to rename content generated by server-side queries originating from php, jsp and similar cgi calls. Websites can generate different kinds of content by using server-side scripts. By default, HarvestMan assumes that these generate web pages. This option triggers an algorithm which tries to find out the actual type (mime-type) by reading the signature of the file and then renames the file with the appropriate extension. Currently this works well only for BMP, JPEG and GIF image files, hence it is disabled by default.
- download.linkedimages : This option, if set to 1 (default), will force the program to download image files linked from web pages by overriding fetch-level and other external server-related rules. If set to 0, download of images will be controlled by the download rules. It is a good idea to enable it.
- download.linkedstylesheets : This works similarly to the above option, the difference being that it controls the download of the cascading style sheet files of webpages. If set, css files are always downloaded regardless of download rules. It is a good idea to enable this also.
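The fetch level rules described for download.fetchlevel can be summarised in code. The following is a rough sketch under stated assumptions, not HarvestMan's actual implementation: the function names are hypothetical, the caller is assumed to know whether an external url is a first level link, and servers are compared by exact host name (the control.subdomain option is ignored).

```python
import posixpath
from urllib.parse import urlsplit

def in_start_directory(start_url, url):
    """True if url is on the starting server, in or below the starting url's directory."""
    start, target = urlsplit(start_url), urlsplit(url)
    if start.netloc != target.netloc:
        return False
    start_dir = posixpath.dirname(start.path).rstrip("/") + "/"
    return (target.path or "/").startswith(start_dir)

def is_allowed(fetchlevel, start_url, url, first_level_external=False):
    """Decide whether url may be fetched at the given fetch level."""
    same_server = urlsplit(start_url).netloc == urlsplit(url).netloc
    below_start = in_start_directory(start_url, url)
    if fetchlevel == 0:
        return below_start
    if fetchlevel == 1:
        return same_server
    if fetchlevel == 2:
        return same_server or first_level_external
    if fetchlevel == 3:
        return below_start or (not same_server and first_level_external)
    return True  # fetch level 4: no restriction

start = "http://www.foo.com/bar/index.html"
print(is_allowed(0, start, "http://www.foo.com/bar/pics/a.jpg"))
print(is_allowed(0, start, "http://www.foo.com/other/b.html"))
```

Note how level 3 rejects a url on the starting server that lies outside the starting directory, while level 2 accepts it.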
The control namespace allows you to exert finer control over the urls downloaded by the program than the download namespace provides.
The variables in the control namespace can be generally divided into those performing Url control, Server control, Connection control, Depth control, Parser control, File control & Time-based control, apart from supporting control using the Robots Exclusion Protocol. Some of these controls operate on the basis of maximum limits, while others work as 'filters'.
Url Control Variables
- control.extpagelinks : This can be used to configure the parsing & fetching of urls external to the starting server's directory. This is set to 1 (enabled) by default. Since fetch levels offer much finer control and have precedence over this option, it is advised to skip this option and control your fetchlevel instead.
- control.maxextdirs : A limit on the maximum number of external directories (paths outside the directory of the starting url) in your project. By default, there is no limit.
- control.skipqueryforms : An option that tells the program to skip query forms of the format http://www.foo.com/cgi-bin/query?item=val . This is enabled by default.
- control.urlpriority : A priority algorithm by which you can specify priorities for urls based on file extensions. A priority specification takes the form of "extension+/-priority".
For example, a url priority string of "png+5" gives maximum priority to PNG image files. Another one of "jpg-5" gives the least priority to JPEG image files. Urls with higher priority are downloaded first and those with lower priority, later. The priority algorithm works only if the default Queue algorithm is enabled. If the url server option is enabled, priorities won't work.
Priorities can be chained together using commas to create a priority string.
Example: control.urlpriority png+5,html-2,jpg+4,pdf-3
Priorities are in the range -5 to +5. +5 gives maximum priority while -5 gives least priority.
- control.urlfilter : This allows you to create a regular-expression-like url filter for filtering specific urls. The match can be based on file extensions, directory components and wildcards. For more information see the FAQ.
- control.junkfilter : This is a new feature which will be available in the 1.4 final release. It will add advertisement & banner filtering to HarvestMan. Its design is similar to Internet Junkbuster.
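The priority string syntax described for control.urlpriority can be parsed with a few lines of Python. This is a minimal sketch, not HarvestMan's own parser; the function names are hypothetical, and giving a priority of 0 to extensions that are not listed is an assumption.

```python
import re

# One item looks like 'png+5' or 'html-2'.
_ITEM = re.compile(r"^([A-Za-z0-9]+)([+-])(\d+)$")

def parse_priorities(spec):
    """Parse a string such as 'png+5,html-2' into {'png': 5, 'html': -2}."""
    priorities = {}
    for item in spec.split(","):
        match = _ITEM.match(item.strip())
        if match is None:
            raise ValueError("bad priority item: %r" % item)
        extension, sign, digits = match.groups()
        value = int(digits) if sign == "+" else -int(digits)
        if not -5 <= value <= 5:
            raise ValueError("priority out of range: %r" % item)
        priorities[extension.lower()] = value
    return priorities

priorities = parse_priorities("png+5,html-2,jpg+4,pdf-3")

def priority_of(url):
    """Priority of a url by its file extension (0 if unlisted, an assumption)."""
    extension = url.rsplit(".", 1)[-1].lower()
    return priorities.get(extension, 0)

urls = [
    "http://www.foo.com/a.pdf",
    "http://www.foo.com/b.png",
    "http://www.foo.com/c.html",
]
# Higher priority first: b.png (+5), c.html (-2), a.pdf (-3)
ordered = sorted(urls, key=priority_of, reverse=True)
print(ordered)
```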
Server control variables
- control.extserverlinks : This tells the program to crawl external server links. Again, a better option is to set fetch levels, since the fetch level setting overrides this option.
- control.maxextservers : This allows you to set a limit on the number of external servers crawled. By default, there is no limit.
- control.serverfilter : This is an option similar to control.urlfilter, except that it allows you to filter external servers using a regular-expression-like syntax. For more information, see the FAQ.
- control.serverpriority : This is similar to control.urlpriority, except that it sets priorities for servers. This makes sense only when crawling multiple servers (fetchlevels 2, 3 & 4). The priority syntax is the same as that for url priorities.
- control.subdomain : This option allows you to differentiate between servers in the same base domain. By default, it is set to 0 (False), which makes the program treat servers belonging to the same domain (shopping.yahoo.com & mail.yahoo.com for example) as the same server. If set to 1 (True), it will treat the servers as different. This makes a difference when creating server priorities and filters, and in controlling requests to the same server.
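The effect of control.subdomain can be pictured as choosing the key under which a host is grouped. The sketch below is a deliberately naive illustration: the function name `server_key` is hypothetical, and taking the last two labels as the base domain is an assumption that does not hold for domains such as 'co.uk'.

```python
def server_key(hostname, subdomain=0):
    """Return the name under which a host is grouped as a 'server'."""
    hostname = hostname.lower()
    if subdomain:
        return hostname  # every sub-domain counts as a separate server
    # Naive base domain: keep only the last two labels of the host name.
    return ".".join(hostname.split(".")[-2:])

print(server_key("shopping.yahoo.com"))
print(server_key("shopping.yahoo.com", subdomain=1))
```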
Connection control
- control.connections : This setting allows you to control the total number of open network connections (sockets) to servers at a given time. It is set to 5 by default. If you have high network bandwidth, feel free to increase it!
- control.requests : This setting allows much finer control than 'control.connections'. It controls the number of simultaneous requests to the same server. By default it is set to 5. Again, you should set a higher value to allow speedier downloads if you have higher bandwidth!
Depth control
- control.depth : This allows you to control the depth of a url relative to the starting url. The depth is measured in terms of the number of path components separating a url from the starting url. For example, the depth of the url http://www.foo.com/images/holidays/hawaii/me.jpg is 2 if its starting url is http://www.foo.com/images/index.html . By default this is set to 10. If a url's depth is more than this limit, it is skipped.
- control.extdepth : This controls the depth of urls belonging to external servers. It works the same way as 'control.depth', with the difference that the depth is measured from the root of the server. For example, the depth of the url http://www.foo.com/images/holidays/hawaii/me.jpg is 3 if it is an external server url.
By default, this setting is 0, which means that there is no limit to the depth.
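The two depth measures can be reproduced with a short calculation on the directory part of each url. This is a sketch that matches the two examples given for control.depth and control.extdepth; the function names are hypothetical, and `depth` assumes that the url lies in or below the starting url's directory.

```python
import posixpath
from urllib.parse import urlsplit

def directory_parts(url):
    """Path components of the directory that holds the url's file."""
    directory = posixpath.dirname(urlsplit(url).path)
    return [part for part in directory.split("/") if part]

def depth(url, start_url):
    """Directory levels separating url from the starting url's directory."""
    return len(directory_parts(url)) - len(directory_parts(start_url))

def ext_depth(url):
    """Directory levels of an external url, measured from the server root."""
    return len(directory_parts(url))

url = "http://www.foo.com/images/holidays/hawaii/me.jpg"
print(depth(url, "http://www.foo.com/images/index.html"))  # prints 2
print(ext_depth(url))                                      # prints 3
```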
Parser control
- control.tidyhtml : This is the only parser control variable. If set (default), it will try to invoke the Tidy html cleaner library to clean up downloaded html before parsing it and extracting links. The tidy library is supplied with HarvestMan. It is also available as a separate download on the download page.
File control
- control.maxfiles : This option can be used to control the download by setting a limit on the maximum number of files downloaded. By default this is 3000. If the number of downloaded files exceeds this limit, a controller thread automatically brings the program to an end.
- control.pagecache : Set this to 1 to enable caching of downloaded files. If caching is enabled, HarvestMan need not re-download files from a server when the project is re-run, if the cache is up to date. Caching is done by comparing the timestamp of the file at the server with the local timestamp, which is stored in a specially designed cache file. This option is enabled by default.
- control.datacache : This option allows finer control of caching. If set to 1 (the default), the program will save the data of each downloaded file and write it to the cache file of the project.
This allows the program to regenerate the files from the cache if the downloaded files of a project were deleted but the cache file is intact. In such cases, if the cache is up to date, HarvestMan will generate the files from the cache.
The downside of this is that the cache file becomes huge for projects which download a lot of files. Hence if you are downloading a huge number of files (say > 5000) in your project, it is sometimes better to disable data caching.
Time-based control
- control.projtimeout : This option provides a safe exit sentinel for HarvestMan if the threads don't terminate for some reason, thereby hanging a project. The main program thread monitors the download threads. If the program hangs (no write/read from queues) for a time period longer than this limit, the main thread signals the other threads to terminate and forces the program to a halt. By default, this is set to five minutes.
- control.timelimit : This provides a way of running a HarvestMan project for a stipulated time period. The time limit is specified in seconds. A controller thread keeps track of the program's running time. If the time limit expires, this thread signals the other threads to stop and brings the program to a halt. By default, there is no time limit for a project.
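The behaviour of control.timelimit can be pictured as a controller that waits for the time limit and then signals the working thread to stop. The following is a minimal sketch using Python's threading module; `run_with_time_limit` and `crawl` are hypothetical names, and this is not HarvestMan's own controller code.

```python
import threading
import time

def run_with_time_limit(work, timelimit):
    """Run work(stop_event) in a thread and signal it to stop after timelimit seconds."""
    stop_event = threading.Event()
    worker = threading.Thread(target=work, args=(stop_event,))
    started = time.monotonic()
    worker.start()
    worker.join(timeout=timelimit)  # returns early if the work finishes first
    stop_event.set()                # ask the worker to stop
    worker.join()                   # wait until it has actually stopped
    return time.monotonic() - started

def crawl(stop_event):
    # Stand-in for a download loop: poll the stop signal between units of work.
    while not stop_event.is_set():
        stop_event.wait(0.01)

elapsed = run_with_time_limit(crawl, timelimit=0.1)
print("stopped after %.2f seconds" % elapsed)
```

The worker cooperates by checking the stop signal itself, which mirrors the description of a controller thread that signals other threads rather than killing them.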
Robots Exclusion Protocol based control
HarvestMan obeys the Robots Exclusion Protocol (REP), used by certain web sites to block certain sections of the web server or certain files from crawlers like HarvestMan. A single configuration option lets you configure HarvestMan to obey or disobey REP.
- control.robots : If enabled, HarvestMan will parse the 'robots.txt' file on each web server and obey the rules specified in it. These rules are applied to each url to find out if the 'robots.txt' file allows fetching of that url. If fetching is disallowed, HarvestMan outputs a console message (for verbosity >= 3) and skips that url.
If you want to disable 'robots.txt' file parsing, set this variable to 0. Ignoring REP is not very good Internet etiquette. However, HarvestMan gives you the freedom to make that choice.
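To see what a 'robots.txt' check looks like in practice, here is a small sketch that uses the `urllib.robotparser` module from Python's standard library rather than HarvestMan's own code. The rules, the 'HarvestMan' user agent string and the helper name `may_fetch` are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules, as they might appear in a site's robots.txt file.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

def may_fetch(url, robots_enabled=1, agent="HarvestMan"):
    """Apply the robots.txt rules unless the check is switched off."""
    if not robots_enabled:
        return True
    return parser.can_fetch(agent, url)

print(may_fetch("http://www.foo.com/public/index.html"))
print(may_fetch("http://www.foo.com/private/data.html"))
```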
The system namespace holds the options by which you can control threading parameters, set the language & locale settings and perform any other customization related to the Python interpreter and your system for HarvestMan.
The system namespace has the following options.
- system.fastmode : This option enables or disables the fast mode for HarvestMan. Fast mode is the default working mode, which uses multiple tracker & worker threads to complete downloads. If you disable fast mode, the program will queue all downloads in just one thread, i.e. the main program thread. It is better to leave this option enabled to get the maximum performance out of HarvestMan.
- system.maxtrackers : This sets the number of tracker threads for the program. Tracker threads are the workhorses of HarvestMan. The number of threads specified here are prelaunched at program beginning. They run in a continuous loop, querying Queues for work and performing crawling & downloading. The default value is 3. Increasing this value can often lead to faster downloads, especially if you have a high-end CPU and ample network bandwidth, since the work is distributed more. However, this speed increase tapers off after a certain point due to other constraints such as connection & request limits.
If you have adequate network bandwidth and a high-end CPU, or if the CPU is relatively free, you can increase this value. However, make sure that you also increase the limits on connections, i.e. the control.connections and control.requests values, accordingly. A good rule of thumb is to keep the number of connections greater than or equal to half the number of tracker threads. The following is the formula.
control.connections >= 0.5*system.maxtrackers
and
control.requests <= control.connections
Anyway, it is a good idea to limit system.maxtrackers to a value <= 20 on most desktop PCs and workstations. However, on very high performance systems and SMPs (with multiple CPUs), this limit can be as high as 50 or even 100. Make a judicious decision after analyzing your system and currently available resources.
- system.usethreads : This option allows the user to enable or disable the use of 'downloader' threads (also called worker threads) in HarvestMan. Downloader threads are a different class of threads from trackers. Downloader threads perform the action of downloading non-webpage files from servers. They can often reduce the workload of the so-called 'fetcher' threads, which are tracker threads that fetch files from websites.
If system.usethreads is enabled (the default), fetcher threads can delegate the actual job of downloading a non-webpage file (images, zip files, pdf files, word documents etc) to the downloader threads. This can often reduce the work load of fetcher threads, thereby improving their performance and effectively the performance of the program.
The downloader threads are also pre-launched at program beginning and they wait in a loop for work. There is a thread pool object which manages these threads and delegates work to them.
- system.threadpoolsize : Sets the size of the downloader thread-pool discussed above. The default value is 10. Again, you can increase or decrease this value depending upon your system configuration and available resources. However in most cases, the performance gain extracted from increasing this value is minimal.
Note on HarvestMan threads
As said earlier, HarvestMan uses two classes of threads namely 'trackers' and 'downloaders' (or 'workers'). 'Tracker' threads are further divided into 'crawlers' and 'fetchers' based on their role. HarvestMan pre-launches all these threads at program beginning. These threads are stopped only at program end, or in exceptional conditions such as hanging threads, hanging downloads, time/file limit violations or when explicitly killed by the user (by pressing "Ctrl-C" for example).
In addition to these, there is a 'controller' thread (from version 1.4) and of course the main program thread. Thus when an instance of HarvestMan is run, the number of threads it creates is given by the formula below.
Number of threads = system.maxtrackers + system.threadpoolsize + 2 (controller & main thread)
It is important to realize that threads take up CPU time, since each thread is executed in a separate context in the CPU. The CPU performs a context switch when it executes a thread. If you are not using a multiprocessor machine, this can often lead to high CPU load. HarvestMan provides each thread with sufficient idle time so as to keep the CPU usage at acceptable levels. However, multiple threads can sometimes affect the performance of your system, thereby reducing system response. This is not a problem if you use the default settings of HarvestMan threads. But it is important to analyze your system's configuration, current load and available resources before customizing thread-related variables.
- system.locale : This is related to the regional settings of your project. The locale setting can be modified to make sure that webpages in an encoding other than the default will parse properly. The default settings should suffice in most cases. For more information see the page on setting locales for HarvestMan.
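The thread count formula and the rule of thumb for connections given above are easy to check before editing a configuration. The helper names below are hypothetical; the defaults mirror the documented values (3 trackers, 10 downloader threads), and the sketch assumes that downloader threads are enabled.

```python
def total_threads(maxtrackers=3, threadpoolsize=10):
    """Trackers + downloader pool + controller thread + main thread."""
    return maxtrackers + threadpoolsize + 2

def check_rule_of_thumb(maxtrackers, connections, requests):
    """Return a list of rule-of-thumb violations (empty when all is well)."""
    problems = []
    if connections < 0.5 * maxtrackers:
        problems.append("control.connections should be >= 0.5*system.maxtrackers")
    if requests > connections:
        problems.append("control.requests should be <= control.connections")
    return problems

print(total_threads())  # 3 + 10 + 2, prints 15
print(check_rule_of_thumb(maxtrackers=20, connections=5, requests=5))
```

With 20 trackers and the default of 5 connections, the first rule is reported as violated, which suggests raising control.connections to at least 10.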
This section holds variables related to managing the downloaded html and other web-page files. The name indexer is actually a misnomer, since it does not do any indexing of the web pages. At present it contains a couple of variables for localizing the links in downloaded web pages.
- indexer.localise : This option controls the 'localization' of downloaded web pages. Localization refers to the process of converting the urls in downloaded web pages so that they point to the downloaded files instead of pointing to their original locations. Localizing allows the user to browse the downloaded pages locally, without connecting to the Internet again.
This option has a range of values from 0-2, unlike most other options which are binary (1 or 0). These values stand for:
0 - Turn localizing off (links will be preserved)
1 - Localize using absolute file paths
2 - Localize using relative file paths
The default is set to 2, i.e. relative path localization. Relative path localization will use relative path formats when modifying links on the downloaded pages. This allows you to move the project directory to another system or another top-level directory, while still keeping the links intact.
For example, assume there is a link from the downloaded file 'web.html' to the file 'profile.jpg'. Assume that 'profile.jpg' is in the parent directory of the directory containing 'web.html'. In this case the localized link would look like:
<img src="../profile.jpg">
If you are using absolute localization, the program will replace the urls with absolute paths instead of relative paths. In this case, assume that your project base directory is '/tmp/websites' and your project name is 'test'. Also assume that you are harvesting the site 'http://www.foo.com/images/hawaii/index.html' and that the 'web.html' file is located inside the 'http://www.foo.com/images/hawaii' directory. In this case the localized link for 'profile.jpg' would be:
<img src="/tmp/websites/test/www.foo.com/images/profile.jpg">
It is quite clear that if you move the directory 'test' from '/tmp/websites' to any other place, the links would be broken.
It is always better to use the default localization value of 2, which performs relative localization. Also, since localization itself is important, it is advised to never turn this option off.
- indexer.jitlocalise : If you enable this option, the program tries to localize the links of a downloaded web page file immediately after downloading it. The normal behaviour is to localize the links of all downloaded web page files at project completion. This option is turned off by default, since immediate localization does not convert all links properly. It is advised to always keep this option off.
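The three localization modes described for indexer.localise can be illustrated with Python's `posixpath` module. This is a sketch that reproduces the 'profile.jpg' example, not HarvestMan's localization code; the function name `localise` and its arguments are hypothetical.

```python
import posixpath

def localise(original_url, target_path, page_path, mode=2):
    """Return the link to write into the downloaded page at page_path.

    target_path and page_path are absolute local paths of downloaded files.
    """
    if mode == 0:
        return original_url   # localizing off: keep the original url
    if mode == 1:
        return target_path    # absolute file path
    # Relative path from the directory of the page to the target file.
    return posixpath.relpath(target_path, start=posixpath.dirname(page_path))

page = "/tmp/websites/test/www.foo.com/images/hawaii/web.html"
target = "/tmp/websites/test/www.foo.com/images/profile.jpg"
url = "http://www.foo.com/images/profile.jpg"

print(localise(url, target, page, mode=2))  # prints ../profile.jpg
print(localise(url, target, page, mode=1))
```

The relative form stays valid wherever the 'test' directory is moved, which is the reason the default of 2 is recommended.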
This namespace holds the options related to files associated with HarvestMan projects, namely the log file, error log file and url file.
- files.msgfile : The name of the log file for the project. The log file holds all messages output by HarvestMan excluding error messages. This file is named 'harvestman.log' by default and is created in the HarvestMan source directory. In a future version, this will be changed so that the log file is created in each project directory with the name of the project.
- files.errorfile : The name of the error log file for a HarvestMan project. This file holds all error messages generated by HarvestMan and by the Python runtime. This file is named 'error.log' by default and is created in the project directory for each project.
- files.urlslistfile : You can optionally dump a list of all urls encountered (both downloaded & skipped) in a project by setting the value of this option. If this option is set, HarvestMan dumps a list of all urls generated in a project to a file of this name. This option is turned off (no file) by default.
Note about Files
Instead of file names you can also specify a complete path name for each of the file options discussed above. Also note that the error logging feature is disabled at present.
This holds options related to the display interface of HarvestMan. Since HarvestMan is a console program at present, it holds just one variable related to a project browse page.
- display.browsepage : If enabled, this will create an index web page of all completed projects in the base directory of the projects (project.basedir). New project information is automatically appended to this page if it exists. This helps you to browse directly to the start page of your project.
It also tries to open this browse page in your current browser session by using Python's webbrowser controller module. However, this feature does not always work.
There are certain hidden options for HarvestMan that are not exposed in the configuration file. These are mostly internal & debug variables, and are not important for the user. These variables can be found in the file 'config.py' in the HarvestMan source distribution.