*==========================================================*
| -Changes.txt file for HarvestMan- |
| |
| URL: http://harvestman.freezope.org |
*==========================================================*
Version 1.4.6 final
Release Date: Sep 9 2005
Release Focus: Minor bugfix
Changes
=======
1. Fixed bugs in the setup.py and install scripts
so that they work with Python 2.4.
2. Updated py2exe install script. It works correctly
with py2exe version 0.6.1 upwards.
Version 1.4.5 final
Release Date: Aug 19 2005
Release Focus: Bug-fixes
Changes
=======
1. Added a subdomain flag to the command line.
2. For verbosity level of zero, no message is printed.
Earlier this used to print the welcome message.
Bug-fixes
=========
1. Fixed the bug with starting a project by reading
back an existing project file. This was not working
before. Project file written out using Python marshal
module, not pickle.
2. Fixed bugs in localization. The regular expression's
sub method should replace URL only once. Test site:
http://www.oligopolywatch.com .
3. Verbosity command line flag was not working. Fixed
it. Fixed errors with a few other command line options.
4. The stop project method of the program now
calls the "terminate" method on threads so we don't
have hanging threads.
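The localization fix in item 2 comes down to the count argument of the regular expression sub method. A minimal sketch, assuming a made-up page fragment and file names (this is not HarvestMan's localization code):

```python
import re

# Hypothetical page fragment in which the same link text appears twice.
page = '<a href="images/a.gif">images/a.gif</a>'

# Escape the old URL so characters like '.' are matched literally.
pattern = re.compile(re.escape("images/a.gif"))

# count=1 stops after the first match; the default (count=0) replaces all.
once = pattern.sub("local/a.gif", page, count=1)
every = pattern.sub("local/a.gif", page)
```

With count=1 only the href is rewritten and the visible link text is left alone.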
Version 1.4.5 b1 (beta 1)
Release Date: Aug 02 2005
Improvements
============
1. There is only one improvement in this release, the new
command line options. The new release has a complete set
of new command line options written from scratch. It
replaces the previous cluttered and confusing command line.
A notable feature is that you can use HarvestMan like wget
for only downloading URLs with a nocrawl option. The new
command line supports a number of useful options which
the user is most likely to configure. It skips a number
of advanced or obscure options that the user need not
be bothered with, making the command line user friendly.
For more information, consult the Readme.txt of the package
or go to http://harvestman.freezope.org/commandline.html .
Bug-fixes
=========
1. Added extensions .shtm, .php4, .aspx, .cfm, .cfml,
.cms as valid web-page extensions in urlparser.py, so
web-pages ending in these extensions will work with
HarvestMan. (These were present in HarvestMan 1.4 alphas
but somehow got lost!)
2. When printing the url tree, duplicate links were not
checked. This has been fixed by adding a check.
3. A minor bug in setting verbosity in logger object
was fixed.
4. Comments will be printed for starting & stopping
of url server at verbosity level 3. Comments for pinging
url server are raised to debug level 4.
5. Program version number, when printed using the -v option,
will print the release level also. For example right now
this will be printed as 'HarvestMan 1.4.5 beta 1'. Earlier
it used to print only the version number.
6. The __fix method of config.py now looks at the number
of URLs. If no URLs are found (either from config.xml or
through command line), it exits with an error message.
7. Asyncore thread for urlserver is now a daemon thread,
so it will exit if the program is killed.
8. Fixed a minor bug in set_proxy in connector.py where
the function to set proxy was being called three times.
Changed this to once.
9. Fixed a bug in rules.py. Member self._robocache should
be a list.
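The daemon thread change in item 7 can be pictured with the standard threading module. This is only a sketch: serve_forever is a hypothetical stand-in for the url server loop, not the real function name.

```python
import threading
import time


def serve_forever():
    # Stand-in for the url server loop; it never returns on its own.
    while True:
        time.sleep(0.1)


server_thread = threading.Thread(target=serve_forever, name="urlserver")
# A daemon thread does not keep the interpreter alive: when the main
# thread finishes, the process exits even though the loop is still running.
server_thread.daemon = True
server_thread.start()
```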
Version 1.4.5 a2 (alpha 2)
Release Date: 21/07/2005
Bug Fixes
=========
1. Fixed a bug in calculating url paths of directory-like
urls which use the set_directory_url method in module
urlparser.py . This was causing a number of invalid urls
which resulted in HTTP 404 errors. This bug is fixed in
this version.
2. Fixed a bug in urls that use HTTP redirection with
cookies. Sometimes some websites send a new url and
a cookie along with an HTTP redirection error (301,302)
when a url is requested. The HTTP redirection handler
is expected to send a new request with the new url
and the cookie. These kinds of urls now work with
HarvestMan. Fix in connector.py module.
If you are using Python 2.4, this uses the cookielib
module and the new HTTPCookieProcessor handler. However,
even if you are using Python 2.3 or earlier versions,
this will work since a new HTTP redirect handler is
added in the connector module, that takes care of this.
3. Fixed a bug in parsing tags in
module pageparser.py .
4. Fixed a bug that created invalid urls because the
html parser object was not reset before parsing every time.
This is now fixed in module crawler.py .
5. Fixed a bug in connector.py module in extracting
error numbers and error strings from error objects.
6. Fixed a bug in logger.py module to correctly convert
non-string types to string types.
7. Fixed a bug in config.py to take care of timelimit
settings. This was getting ignored before.
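For readers who want to see the idea behind fix 2 (redirects that carry cookies), here is a rough sketch using current Python 3 module names, where cookielib is called http.cookiejar and urllib2 is called urllib.request. It is not the code in connector.py, and the url in the comment is only a placeholder.

```python
import http.cookiejar
import urllib.request

# The jar stores cookies set by any response, including 301/302 replies.
jar = http.cookiejar.CookieJar()

# build_opener() installs HTTPRedirectHandler by default; adding the
# cookie processor makes the follow-up request carry the stored cookie.
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# opener.open("http://www.foo.com/") would now follow the redirect with
# the cookie attached (left commented out so the sketch needs no network).
```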
Other Changes
=============
1. All file encodings are now in latin-1, since
iso-8859-1 was causing some problems.
2. A number of modules now use the high performance
collections.deque data structure if HarvestMan is
run with Python 2.4. If not, these default to lists.
3. Some functions in common.py module are removed.
Some are moved to utils.py module.
4. Error handler function in harvestman.py removed.
5. Module htmlparser is removed since it is no
longer used.
6. Module cookiemgr is removed, since it is no
longer used. Essential cookie handling is available
in connector.py module.
7. The PriorityQueue in urlqueue.py module now
uses a modified collections.deque object if run
with Python 2.4. Otherwise it defaults to a list.
8. Exception handlers rewritten in many modules.
9. Unnecessary and commented out debug statements
are removed.
10. Tool 'cachereader.py' is removed from tools
sub-directory.
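The deque-or-list fallback mentioned in items 2 and 7 could look roughly like this. UrlBuffer is a hypothetical name used only for illustration; the real modules may structure this differently.

```python
try:
    from collections import deque   # available from Python 2.4 onwards
except ImportError:
    deque = None                     # older interpreters: fall back to list


class UrlBuffer(object):
    """Hypothetical FIFO buffer that prefers deque and falls back to a list."""

    def __init__(self):
        self._items = deque() if deque is not None else []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        # deque.popleft() is O(1); list.pop(0) is O(n) but behaves the same.
        if deque is not None:
            return self._items.popleft()
        return self._items.pop(0)

    def __len__(self):
        return len(self._items)
```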
Version: 1.4.5 a1 (alpha 1)
Release Date: 27/05/2005
Features
========
1. Changed config file format from text to xml. The default
config file from this version onwards is named 'config.xml'.
The text config file format also works, but will be slowly phased
out in future releases.
2. New HTML parser based on SGMLParser module.
3. Dependency on HTML tidy is removed.
4. New archive feature for archiving project files
to tar.bz2/tar.gz archives.
5. Changes in project caching:
- Data of web pages is compressed before writing to cache.
- Cache data structure changed to a dictionary, from list.
- Option for writing cache in DBM format.
- Headers of urls are also written to cache.
This can be turned on or off.
6. A junk filter for filtering out banner ads and similar urls.
7. HarvestMan now works with Python 2.4.
8. New scripts in 'tools' directory
- A script to generate project files from cache.
- A script to dump url headers in the form of
a DBM file from the project cache.
- A script to convert between xml & text
config files.
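The caching changes in item 5 (a dictionary keyed by url, with page data compressed before it is stored) can be sketched as follows. The use of zlib and the function names are assumptions for illustration; the changelog does not say which compression the program uses.

```python
import zlib

# Hypothetical in-memory cache: url -> compressed page data.
cache = {}


def cache_page(url, data):
    # Compress the raw bytes before storing them.
    cache[url] = zlib.compress(data)


def read_page(url):
    # Decompress on the way out; raises KeyError for unknown urls.
    return zlib.decompress(cache[url])


html = b"<html><body>" + b"hello " * 200 + b"</body></html>"
cache_page("http://www.foo.com/index.html", html)
```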
Bug-fixes
=========
1. Bug fixes in urlparser module.
2. Bug fixes in datamgr module.
3. Bug fixes in rules module.
Version: 1.4 final (Bug fixes + Minor features)
Release Date: Dec 17 2004
Changes from version 1.3.9-1
============================
Features
========
1. Added an asynchronous url server which listens
to port 3081 (by default). The url server can be
optionally enabled to gather and send urls instead
of using a Queue. This can be faster, since the
url server uses asyncore module of Python with
queues, which is faster than just using queues.
To enable this feature, set the config variable
network.urlserver to 1.
2. Modified caching algorithm to store the data
of the downloaded files in the cache file. Hence
if someone accidentally deletes the downloaded files,
HarvestMan can recreate the files from the cache file,
without actually downloading them, if they are up to date.
3. Queue architecture modified. The data queue has
been replaced with a links queue. Instead of pushing
web page data into a queue, fetchers process them and
push the new urls to a queue. Crawlers get the urls,
walk through them and post the newly created url
objects into the url queue or send them to the url
server. This saves memory on the queues.
4. Added an option for controlling file download
based on maximum file size. The maximum size by default
for a single file is 1 MB.
5. Added an option for dumping a url tree which shows
parent-child dependencies of the urls generated. This
can be either a text file or an html file.
6. Added an advertisement/banner filter to the rules
module. If enabled this can skip urls related to ad
banners or graphics.
7. New controller thread to manage file and time limits
on downloads.
Fixes
=====
1. This release fixes a huge bug in HarvestMan, i.e.
that of hanging threads. The threading architecture
is modified to introduce local buffers. Threads
do an unblocked push on the queue as opposed to
a blocked push in all previous versions. If they
cannot push the data (Queue full) after 5 attempts,
they store the data in a local buffer. In the next
loop of the threads, they try to push the buffer data
before creating any new objects to push (by crawling
pages/parsing html files). This ensures that the
threads don't block continuously on the queue leading
to deadlocks and time outs.
2. Increased the idling time of threads to reduce CPU
load.
3. Fixed a bug with correctly identifying WWW urls.
4. Fixed a bug that incorrectly modifies urls
with spaces between words.
5. Fixed many bugs with get_relative_filename method.
6. Fixed bugs with generating urls. Trailing spaces
and/or newlines need to be removed from path
components.
7. Added a method to correctly identify the type of
a url based on its mimetype.
8. Fixed bugs in robot protocol checking method.
Many optimizations are also added to quickly
process urls. A robot object cache (dictionary)
and url object whitelist has been added to
reduce processing time. Also html files need
to be processed.
9. Fixed bugs in url filter checking method.
10. Fixed bugs in the order of checking rules
in violates_basic_rules method.
11. Fixed bug in creating regular expression for
filtering based on file extension.
12. Many bug fixes in localise_file_links method.
13. Fixed a bug in correctly generating the
regular expression for old url.
14. Fixed a bug in localising file names. All
web page files are correctly localised now.
15. Fixed a bug in updating files from project
cache.
16. Bugfixes in urltracker module.
17. Fixed the bug when program exits sometimes
just after downloading the first url.
18. Fixed bug with parsing
link.
19. Fixed error in managing an empty url.
Correct error message is printed now.
20. Fixed bugs with logging errors.
The error log stream is not enabled
now.
21. Fix to allow special characters in project base
directory (such as ~ for home directory on
Unix systems).
22. Fixed bug in function that opens robots.txt
urls.
23. Removed some useless arguments from some
functions.
24. Fixed bug with url object in connect(...)
function.
25. Fixes to make slow mode work.
26. Modified to use methods of cPickle module instead
of pickle module in utils.py (cPickle is faster).
27. Use our own strptime module since this function
is not available on all Python versions on Windows.
28. Fixes in locale setting on Windows platform.
29. Log file for each project is now generated in the
project directory as '.log'. This is
not a configurable option anymore.
30. The verification of downloaded files by checksumming
is disabled. This is not a configurable option
anymore.
31. The renaming algorithm is disabled since it is not
general purpose.
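The local buffer scheme described in fix 1 can be sketched like this. BufferedPusher and its method names are hypothetical; only the non-blocking push, the 5 attempts and the local buffer come from the description above. The sketch uses the Python 3 queue module.

```python
import queue
import time

MAX_ATTEMPTS = 5   # the figure quoted in fix 1


class BufferedPusher(object):
    """Hypothetical helper that a worker thread would own."""

    def __init__(self, shared_queue, retry_delay=0.0):
        self.shared_queue = shared_queue
        self.retry_delay = retry_delay
        self.local_buffer = []          # data we could not push yet

    def _try_push(self, item):
        for _ in range(MAX_ATTEMPTS):
            try:
                self.shared_queue.put_nowait(item)   # never blocks
                return True
            except queue.Full:
                time.sleep(self.retry_delay)
        return False

    def flush(self):
        # Called at the top of each loop, before producing new data.
        while self.local_buffer:
            if not self._try_push(self.local_buffer[0]):
                return False
            self.local_buffer.pop(0)
        return True

    def push(self, item):
        # Keep ordering: new data waits behind anything already buffered.
        if not self.flush() or not self._try_push(item):
            self.local_buffer.append(item)
```

Because the thread never waits inside put(), a full queue costs it a few short retries at most, and the data is kept for the next loop instead of being lost.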
Other Changes
=============
1. License of program changed to GNU GPL.
2. The genconfig.py script is more interactive now,
displaying the options selected.
3. Language encoding specified on top of all Python
files.
4. A script to check Python dependency namely, 'check_dep.py'
has been added.
5. Installation made easier on Linux and Unix like systems.
A script named 'install' does the job for you.
6. The 'genutils' directory is renamed to 'tools'.
Version: 1.3.9-1 (minor bug fixes)
Release Date: June 24 2004
Changes in version 1.3.9-1 from 1.3.9
=====================================
1. Fixed a bug in cache algorithm. Key 'checksum'
should not be checked if it is old cache.
2. Fixed a bug in connector.py. Check for valid
url object in line 622.
3. Fixed a bug in urlparser. Anchor type urls
should have the url file name as base url, not
original url filename.
4. Fixed a bug in url tracker. Anchor type urls
should not be skipped.
Version: 1.3.9 (features/bug fixes)
Release Date: June 14 2004
Changes in version 1.3.9 from 1.3.4
===================================
New Features
------------
1. Url priorities: Every url is assigned a priority according
to which it is downloaded. Urls with higher priority are downloaded
first. Priorities are determined by 3 factors.
a. The generation of the url
b. Whether the url is a webpage
c. User defined priorities
Urls in a lower generation are given higher priority when compared
to urls in a higher generation. This makes sure that urls which
were created in the beginning of a project get downloaded first.
Webpage urls are given a higher priority when compared to other urls.
Apart from this, the user can define priorities in the config file in the
range of (-5,5) based on file extensions.
2. Website priorities: These are like url priorities but
can be specified by the user in the config file.
Sample usage:
control.serverpriority www.foo.com+3,www.bar.com-3
3. Thread groups for downloads: The download threads are now
pre-launched in a group similar to tracker threads. The download
jobs are submitted to the thread pool, which in turn delegates
them to the threads. The thread pool has been made into a
queue for this.
This reduces thread latency, since we no longer spawn
new threads during the life cycle of the program.
4. Allow urls with spaces: HarvestMan can now download urls which
contain spaces like 'http://www.foo.com/bar/this url.html'.
5. Changed the way to distinguish between directory and file like
urls. Earlier when we parsed the url, a connection was made to
the url, assuming it was directory like. If the reply was HTTP 404
error, then it was assumed correctly to be a file like url.
This has been changed in the new version. We assume all urls are
file like. For example, if there is a url like http://www.foo.com/bar/file,
which can be a directory http://www.foo.com/bar/file/index.html or
file http://www.foo.com/bar/file, we assume it is a file initially and
try to download it. The geturl() method of the file-like object returned
by opening the url, will tell whether it is file like or directory like.
This information is used to modify the local (disk) file name of the url
at that point. This decouples the modules urlparser and connector to
a large extent and makes performance better with such urls.
6. Added functionality to tidy html pages before parsing them by
using 'uTidy', the python port of html tidy. This helps to crawl
sites that made previous versions of HarvestMan exit due to
parsing errors.
7. Intranet downloads need not set a specific flag (download.intranet).
Instead HarvestMan can figure out whether the server is in intranet
by resolving its name and take appropriate action. This allows
intranet/internet downloads to be mixed in the same project.
8. Modified the way url information is cached. The field 'last-modified'
in url's headers is used, if it is available. If it is not there, a
checksum based on the content of the url is used (previous algorithm)
as fallback.
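To make the url priority idea in feature 1 concrete, here is a small sketch built on heapq. The weighting in url_priority is an assumption made up for illustration (the changelog names the three factors but not how they are combined), and the extension table is hypothetical.

```python
import heapq
import itertools

# Hypothetical user-defined priorities by file extension, range -5..5.
EXTENSION_PRIORITY = {".pdf": 3, ".zip": -2}
_counter = itertools.count()   # tie-breaker keeps insertion order stable


def url_priority(generation, is_webpage, extension):
    # Assumed weighting, for illustration only: lower generation wins,
    # web pages get a bonus, then the user's extension priority is added.
    score = -generation
    if is_webpage:
        score += 1
    score += EXTENSION_PRIORITY.get(extension, 0)
    return score


def push_url(heap, url, generation, is_webpage, extension):
    # heapq is a min-heap, so negate the score to pop the highest first.
    score = url_priority(generation, is_webpage, extension)
    heapq.heappush(heap, (-score, next(_counter), url))


def pop_url(heap):
    return heapq.heappop(heap)[2]
```

Urls pushed in any order come back out with the start page first, then deeper web pages, then other files of the same generation.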
Other Changes
=============
1. Regular expressions for filters are pre-compiled.
2. Derived HarvestManStateObject (config class) from 'dict' type.
3. Main thread 'joins' each tracker thread with zero timeout instead
of killing them at the end of project.
4. Optimization fix: Links are stored for localising, only if their
download is successful.
5. Assigned 2:1 ratio for fetchers and crawlers instead of current
1:1 ratio.
6. Renamed all modules.
7. Used 'weakref' wherever possible to reduce extra references to
objects and avoid reference loops. This is mostly used in
'GetObject' method and in urlparser module.
Bug fixes
=========
1. Fixed a bug with http://localhost downloads. Bug ID # B1083256752.28 .
2. Fixed bug in url filter for images.
3. Fixed a bug with timezone printing. Bug ID # B1083253695.02.
4. Close file like object returned by opening urls
after reading data.
5. Fixed a bug in localising links. Directory like urls
need to be skipped.
6. Fixed bug in finding common domain for servers that
have fewer than three 'dots' in their name string. (This is
the same bug as # B1083256752.28 .)
7. Fixed a bug in setting up network for clients behind a proxy/
firewall.
Version: 1.3.3 (bug fixes)
Release Date: Feb 24 2004
Changes in Version 1.3.3 from 1.3.2
===================================
1. Fixed bug with parsing of FTP links. Bug # B1077613467.85.
2. Fixed another bug with external server links.
3. Fixed bug with request control. Request dictionary
key is server name, not ip.
Version: 1.3.2 (minor feature enhancements)
Release Date: Feb 13 2004
Changes in Version 1.3.2 from 1.3.1
===================================
There is one minor feature in this release.
1. This release adds ability to limit downloads by
controlling the number of simultaneous requests from the
same server. This option can be controlled by the config
variable named 'control.requests'.
2. Apart from that I have re-structured the package,
and added a distutils setup.py script which copies the
package to your PYTHON installation folder.
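One way to picture the request limit in item 1 is a counting semaphore per server name. This is a sketch only: MAX_REQUESTS_PER_SERVER stands in for the value of 'control.requests', and the function names are hypothetical.

```python
import threading
from collections import defaultdict

MAX_REQUESTS_PER_SERVER = 2   # stand-in for the control.requests value

# One semaphore per server name, created on first use.
_limits = defaultdict(lambda: threading.BoundedSemaphore(MAX_REQUESTS_PER_SERVER))
_limits_lock = threading.Lock()


def acquire_slot(server, blocking=True):
    # Guard the defaultdict so two threads cannot create two semaphores.
    with _limits_lock:
        sem = _limits[server]
    return sem.acquire(blocking)


def release_slot(server):
    with _limits_lock:
        sem = _limits[server]
    sem.release()
```

A download thread would call acquire_slot before connecting and release_slot when the request completes.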
Version: 1.3.1 (bug fix)
Release Date: Feb 10 2004
Changes in Version 1.3.1 from 1.3
=================================
This version is a bug fix version fixing most
of the critical and annoying HarvestMan bugs.
These bugs can be located in the bugs database
at http://harvestman.freezope.org/Discussons .
1. Fixed bug with query forms. The program no longer
tries to download server side query form links.
Bug #B1073291938.97.
2. Fixed bug with handling frame redirects. Bug #B1076402199.0.
3. Fixed bug with robots.txt url. Bug #B1072436188.35.
4. Fixed bug in finding out external server links.
Bug #B1076402348.52.
5. Fixed bug in external links with respect to subdomains.
Bug #B1076409910.45.
6. Fixed bug with following non-existent links in a
directory listing. Bug #B1073028403.71.
7. Fixed problem in printing harvestman url in welcome
message.
8. Fixed some problems in config file parsing.
9. Fixed problem with printing version string (-v and
--version options).
10. Other miscellaneous fixes and corrections thanks to
Vivian, Sascha and some others.
Version: 1.3 (final)
Release Date: Dec 15 2003
Changes in Version 1.3 (from 1.3 a1)
=========================================
1. This version adds one feature, that of searching
a webpage for keywords. You can create complex
boolean regular expressions and supply them to
HarvestMan. HarvestMan will parse the regular
expressions and download only those web pages that
match the regular expression.
In simpler words, this means a keyword(s) search. :-)
For example, you need to download only those webpages
that contain the term 'Saddam' and 'WMD'. You create
the following regular expression and pass it on to
HarvestMan as the option 'control.wordfilter'.
;; config file for harvestman
control.wordfilter (Saddam & WMD)
You use the boolean '&' and '|' to create the regular
expressions.
I have added this as a recipe in the ASPN Python Cookbook.
For more information on how it works, see
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/252526 .
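For a feel of how such an expression can be evaluated, here is a small independent sketch. It is not the cookbook recipe and not HarvestMan's code: it simply turns each keyword into True or False for a given page and lets Python evaluate the resulting boolean expression. The function name is hypothetical.

```python
import re

_TOKEN = re.compile(r"\s*([()&|]|[^\s()&|]+)")


def matches_wordfilter(expression, text):
    """Evaluate an expression such as '(Saddam & WMD)' against page text."""
    words = set(re.findall(r"\w+", text.lower()))
    parts = []
    for token in _TOKEN.findall(expression):
        if token == "&":
            parts.append("and")
        elif token == "|":
            parts.append("or")
        elif token in "()":
            parts.append(token)
        else:
            # A keyword becomes True or False depending on the page text.
            parts.append(str(token.lower() in words))
    # Only 'True', 'False', 'and', 'or' and parentheses reach eval().
    return eval(" ".join(parts), {"__builtins__": {}}, {})
```

A malformed expression (for example unbalanced parentheses) raises SyntaxError in this sketch.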
Changes in Version 1.3 a1 (from 1.2 final)
==========================================
1. This version features the new threading model which was
started in the last release. This model is now completely
written to prevent thread deadlocking incidents. A description
of the model can be found in the HarvestMan webpage at
http://harvestman.freezope.org.
This model will be developed further and will be the default
for all future releases of HarvestMan.
2. The other major changes are complete re-writes of many modules.
Classes have been renamed wherever suitable and some function
names changed. The HarvestMan module has been trimmed up
considerably.
3. This version has added an extra module HarvestManUtils which has
some utility classes for reading/writing project & cache files and for
creating the browse page. The code for these were earlier in the
HarvestMan, HarvestManDataManager and HarvestManConfig modules.
4. The cache and project file information is compressed before writing
to files.
Changes in Version 1.2 final (from 1.2 rc2)
===========================================
1. Added support for javascript and java applet tag parsing.
HarvestMan can now fetch javascript (.js) files and
java applets (.class) files from webpages.
The code for parsing this sits in the new HTMLParser
customized for HarvestMan.
2. Designated url trackers to two flavors - Fetchers and Getters.
Fetchers are responsible for crawling webpages and fetching links,
and Getters get the non-html files fetched by Fetchers. Images
are still fetched by the Fetchers in their threads.
This should help in the growth of this program and make future
development easier. Also this might help in preventing the thread
locking incidents.
3. Fixed bugs in localizing anchor type links. Rewrote HarvestManPageParser,
HarvestManUrlPathParser and HarvestManDataManager classes to take care
of this. Anchor links in webpages are localized correctly now.
4. Due to javascript/javaapplet parsing code in the new html parser,
many webpages which failed to work before (due to mostly javascript
tags which the parser could not understand) will work correctly now.
5. Other routine bug fixes.
a) Fixed a problem in creating the project browse page.
We need to provide the absolute path of the project start url file.
b) Fixed a problem in getRelativeFilename() in HarvestManUrlPathParser
class.
c) A few more...
Changes in Version 1.2 rc2 (from 1.2 rc1)
=========================================
Release Date: Sep 27 2003
1. Rewrote the algorithm for fetching urls with no filename
extensions. We assume that it is a directory-like url
(of the form dir/index.html) and try to fetch it during
url path resolving time (in urlPathParser class).
If this fails, a 404 error is returned. The url is cached
for later lookup in the datamanager in an invalid urls cache.
We re-resolve the url assuming it now as a file-like url
(of the form /file ) and fetch it.
If it does not fail, the url is again cached for later lookup
in the datamanager in a valid urls cache. The connector object
is also cached in a connector dictionary of the datamanager so
that we don't need to re-create the connection later.
This fixes the long-standing bug with urls with no filename
extensions.
2. Rewrote algorithm for localizing links. Instead of re-parsing
html files and localizing the links, a dictionary of html files
and their links are kept in the datamanager object. This dictionary
is updated during crawling time with the url objects for each html
file. This dictionary is used at the end for localizing.
This improves localization time by as much as 500%.
3. Fixed a bug in calculating project time. (Time for localization
should not be included).
4. Modification in printing error messages. Error messages are printed
only for verbosity levels of 3 and up. OS and IO exceptions are
printed only at verbosity level 4 (debug).
For seeing url error messages (connection errors), you need to set
the verbosity to 3 now.
At the default verbosity level (2), no error messages can be seen.
5. Modified the checking of hanging threads. This check was not done
properly. Now it is done in the loop that checks for exit condition.
Also, reduced default timeout for hanging threads from 600 seconds
(10 minutes) to 120 seconds ( 2 minutes ).
Added socket timeout for sockets. This is same as thread timeout above.
(This works for users using Python 2.3.)
This will fix the problem of hanging threads in a big way.
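The socket timeout mentioned in item 5 corresponds to a standard library call that was introduced in Python 2.3. A minimal sketch, assuming the 120 second value is applied globally (THREAD_TIMEOUT is a hypothetical name):

```python
import socket

THREAD_TIMEOUT = 120.0   # seconds, the new default quoted above

# Applies to every socket created after this call, including the ones
# opened indirectly by urllib; a stalled read raises socket.timeout
# instead of blocking the thread forever.
socket.setdefaulttimeout(THREAD_TIMEOUT)
```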
Changes in Version 1.2 rc1 (from 1.2 alpha)
===========================================
Release Date: Sep 24 2003
1. Removed the earlier global download lock. Earlier the url
connector instances shared a common lock which they had to acquire
before downloading. This meant only a single download was possible
at a given moment.
This has been changed to multiple downloads which can be specified
in the configuration file.
2. We can specify any number of connections in the config file now.
The program makes sure that there are only so many connections
running at a given instant. This takes the place of the previous
global download lock. Since now many simultaneous downloads are possible
(apart from many threads), the program is much faster than before.
3. Added an option for writing pickled cache files. This has been
made the default in this release. XML cache files take a long
time to read, if they are big.
4. Integrated genconfig.py script with harvestManConfig class.
This makes future developments of this script easier. Added an abort
condition to the script which can be invoked by pressing the
key.
5. Fixes for handling error conditions in the url connector class.
Arbitrary error numbers are no longer used, instead we try to
get the error number by parsing the error strings.
6. Redownload of failed links works only for links that failed with
non-fatal errors. This speeds up projects.
7. Modified the regular expression behaviour. Compile the reg expressions
to optimize regular expression search.
8. Moved code around from HarvestMan.py module to reduce its size.
Parsing of config file is now done in the HarvestManConfig module.
9. Removed usage of 'string' module everywhere and replaced with
methods on string objects.
10. Added a timeout option for the project. Sometimes the last thread
in the program does not complete, leaving an otherwise fully downloaded
project hanging.
This option looks at the last data operation into the url queue
and times it. If the time of the last operation (get/put) is more
than a prescribed time, the project times out.
We also wait now for the download sub-threads to complete their work
before exiting. This fixes any premature project exit conditions.
11. Change in writing project files. We now write pickled project files
instead of XML project files. This will be the default from this
release.
12. Bug fixes in urlpathparser module for fixing relative filename computation
errors.
13. Bug fixes in rules module. Rewrote some methods in this module.
14. Fixes in creating the project browse page. The project browse
page entry is now created correctly for every new project.
15. Many other routine bug fixes to speed up downloads and reduce
bugs in threading.
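The project timeout in item 10 (timing the last get/put on the url queue) might be sketched as below. TimedQueue and project_timed_out are hypothetical names, and the sketch uses the Python 3 queue module.

```python
import queue
import time


class TimedQueue(queue.Queue):
    """Hypothetical queue that remembers when it was last used."""

    def __init__(self, maxsize=0):
        queue.Queue.__init__(self, maxsize)
        self.last_operation = time.time()

    def put(self, item, block=True, timeout=None):
        queue.Queue.put(self, item, block, timeout)
        self.last_operation = time.time()

    def get(self, block=True, timeout=None):
        item = queue.Queue.get(self, block, timeout)
        self.last_operation = time.time()
        return item

    def idle_for(self, now=None):
        # Seconds since the last successful get/put.
        if now is None:
            now = time.time()
        return now - self.last_operation


def project_timed_out(url_queue, limit_seconds, now=None):
    return url_queue.idle_for(now) > limit_seconds
```

The main loop would poll project_timed_out and stop the project once the queue has been idle for longer than the configured limit.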
Changes in Version 1.2 alpha (From version 1.1.2)
=================================================
1. This version has introduced limited support for Cookies.
This is experimental code, written from scratch
following RFC 2109. The cookie support is pretty
basic with only domain cookies supported. Netscape
style cookies may not work.
2. Support for webpage caching is available. A cache
file (xml) is created in the project directory for
a project, the first time. The cache file associates
urls to file on the disk. We compare files by using
an md5 checksum on the file content. For any
further runs of the project, only the out-of-date
files are re-fetched.
3. Many bug fixes and better error checking.
4. Bugs in genconfig script fixed.
5. Documentation changes: We provide an RTF version of the
documentation file now. (Request by John J Lee of
Clientcookie fame)
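The md5 comparison behind the caching in item 2 can be illustrated as follows. The cache layout and function names are assumptions for the example (the real cache file is xml), and the sketch uses hashlib, the current home of md5 in the standard library.

```python
import hashlib


def checksum(data):
    # md5 is used here only to detect changes, not for security.
    return hashlib.md5(data).hexdigest()


def is_out_of_date(cache, url, new_data):
    """Return True if the url is missing from the cache or its content changed."""
    entry = cache.get(url)
    return entry is None or entry["checksum"] != checksum(new_data)


# Hypothetical cache layout: url -> {"file": local path, "checksum": md5 hex}.
cache = {
    "http://www.foo.com/index.html": {
        "file": "www.foo.com/index.html",
        "checksum": checksum(b"<html>old</html>"),
    }
}
```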
Changes in Version 1.1.2(From version 1.1.1)
============================================
1. Added a fast html parser based on sgmlop module by F.Lundh.
This can be selected by setting the variable HTMLPARSER in the
config file to 1. The default parser is still the standard
python parser.
2. Added an option to localise links relatively. This is the
default now. That is, we don't replace filenames with their
absolute pathname but only relative pathname, so that users
can browse the downloaded pages on another filesystem also.
3. Added an option for the user to control md5 checksumming of files.
This option is controlled by the variable CHECKFILES in the
config file.
4. Support comments at the end of an option line in the config file.
(Egs: is valid now.
It would have thrown an error before.)
5. We are not localising form links. This makes sure that a cgi
query goes directly to the webserver.
6. An option for JIT (Just In Time) localization of url links.
If this option is selected, then urls in html files are localized
immediately after they are downloaded, instead of at the end.
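Relative localization (item 2) amounts to computing the path of the linked file as seen from the directory of the page that links to it. A small sketch with a hypothetical helper name and made-up paths:

```python
import posixpath


def relative_link(page_path, target_path):
    """Path of target_path as seen from the directory holding page_path."""
    return posixpath.relpath(target_path, posixpath.dirname(page_path))
```

For a page saved as site/docs/index.html that links to site/images/logo.gif, this gives ../images/logo.gif, which keeps working if the whole site directory is copied elsewhere.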
Changes In Architecture (Version 1.1)
=====================================
1. Global Object Register/Lookup
-----------------------------
One of the major changes in this version is the architecture of the HarvestMan program.
It uses a modified Object Oriented approach of looking up objects whenever the services
of an object are needed by other objects. The classes no longer maintain pointers to
other class instances inside them.
All HarvestMan program objects register themselves with a global registry/look-up object
when they are created. (It is up to the programmer to do this.) The registry object is
a Borg singleton ensuring that the state of the objects is maintained. The objects are
stored in the dictionary of the registry object using strings as the key.
When an object needs the services of another, it performs a simple 'query' or 'lookup'
of the registry using the key of that particular object. (This should be known. Right now
we don't support a publish/subscribe mechanism, it will be added later.) The registry
object sits in the HarvestMan globals module, so it is available to objects in all modules
which do an import of this module. An example is given below.
# Create and register the object.
obj1 = HarvestManObject1()
HarvestManGlobals.SetObject('object1', obj1)
# Object2 wants services of obj1
obj1instance = HarvestManGlobals.GetObject('object1')
# Use its services
obj1instance.func1(...)
This makes adding new modules to HarvestMan easy, if you make sure that you register them
in the globals module.
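For readers unfamiliar with the Borg pattern, a self-contained sketch of such a registry follows. Registry, set_object, get_object and Connector are hypothetical names chosen for the example; they mirror the SetObject/GetObject calls above but are not the actual HarvestMan code.

```python
class Registry(object):
    """Borg: every instance shares one __dict__, hence one set of objects."""

    _shared_state = {"objects": {}}

    def __init__(self):
        self.__dict__ = self._shared_state

    def set_object(self, key, obj):
        self.objects[key] = obj

    def get_object(self, key):
        # Raises KeyError if nothing was registered under this key.
        return self.objects[key]


class Connector(object):          # hypothetical service class
    def fetch(self, url):
        return "fetched " + url


Registry().set_object("connector", Connector())

# Elsewhere, a different Registry instance sees the same registration.
connector = Registry().get_object("connector")
```

Unlike a classic singleton, each call to Registry() creates a new instance; only the state is shared.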
2. Threading Model
---------------
HarvestMan versions till 1.0 used a model where url tracker threads were stored in a
queue. A url tracker object consisting of data of a url was pushed into a queue and was
later popped by a monitor object so that downloads could be controlled. This gave rise
to problems of controlling threads and overhead in the form of new thread contexts since
we were not reusing threads.
HarvestMan version 1.1 uses a preemptive threading model and reuses threads. Also, thread
data is only managed in the queue, and not threads themselves. The number of threads
(as per the config file or command line user input) are pre-launched in the beginning of
the program. They run in a loop looking for url data which is managed by a url data queue.
Threads post their url data to this queue. This ensures that we always have a given number
of threads running. It also reduces overheads and latency.
HarvestMan sub-threads in the HarvestManUrlThread module still use a post-emptive
(new thread launched per request) mechanism. This might be changed in future releases.
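The pre-launched thread model described above can be sketched with the standard queue and threading modules. The names, the thread count and the urls are placeholders, and the download step is reduced to a stand-in.

```python
import queue
import threading

NUM_THREADS = 4                 # would come from the config file
url_queue = queue.Queue()
results = []
results_lock = threading.Lock()
STOP = object()                 # sentinel telling a worker to exit


def worker():
    while True:
        url = url_queue.get()
        if url is STOP:
            break
        # Stand-in for downloading and parsing the url.
        with results_lock:
            results.append("done " + url)


# Launch all threads once, up front; they are reused for every url.
threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()

for n in range(10):
    url_queue.put("http://www.foo.com/page%d.html" % n)

for _ in threads:
    url_queue.put(STOP)
for t in threads:
    t.join()
```

Only url data travels through the queue; the threads themselves are created once and live for the whole run.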
3. Code Reorganization
-------------------
The new version features some extra modules which have been created by moving code
from existing modules and re-writing them. The aim was to split crawler code from
data management code, in which we succeeded quite well. There is a new Data Manager
module which takes care of scheduling downloading requests, indexing files, keeping
file statistics and localizing links. A Rules module checks the HarvestMan download
rules (this was earlier done by the previous "WebUrlTrackerMonitor" class).
A Synchronization lock has been added in the Connector module. This might
slow down downloads a bit, but should ensure that threads don't corrupt the data.
Interested users can experiment with the lock, removing it or modifying it, and
see how it works. Please report any improvements in performance you see to the
authors.
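For readers who want to experiment with the lock, the general idea can be sketched as
below. This is a minimal illustration in modern Python 3 and not the actual Connector
code: the class name, the bytes_downloaded counter and the record() method are all
assumptions made for the example, since these notes do not say which data the lock protects.

```python
import threading


class Connector:
    """Hypothetical stand-in for shared state guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.bytes_downloaded = 0

    def record(self, nbytes):
        # Only one thread at a time may update the shared counter.
        with self._lock:
            self.bytes_downloaded += nbytes


def demo(num_threads=8, per_thread=1000):
    """Update one Connector from several threads and return the final total."""
    conn = Connector()

    def work():
        for _ in range(per_thread):
            conn.record(1)

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return conn.bytes_downloaded
```

With the lock in place, demo(4, 500) always returns 2000. Removing the "with" line is
one way to try the lock-free variant suggested above, at the risk of lost updates.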
4. Other Changes
-------------
For other changes, continue reading.
HISTORY
=======
+-----------------------------------------+
|Changes in Version 1.1 (from Version 1.0)|
+-----------------------------------------+
1. A project file is created for every project in the harvestman directory
in the subdirectory 'projects'.
2. Always download css files related to a web-page, even if
they are outside of the domain or directory. Same for images. Config options
for both added in the config file.
3. Added a config file option to rename dynamically generated images.
Works right now for jpeg/gif images.
4. Modified the urlfilter algorithm to check the order of filter
strings in case of a collision in filter results.
5. Added a new option FETCHLEVEL to the program to allow very
basic control of download. For details see Readme.txt/HarvestMan.doc
file.
6. Get background images of webpages.
7. Better error/message logging. Error files are created in each project's
download directory. All messages are logged to a file in the harvestman
installation directory. This by default is named 'harvestman.log'. User
can change this option by editing the config file. This file is created fresh
for every project.
8. Added support for getting files from ftp servers.
9. Write a project file based on HarvestMan.dtd before starting to crawl.
This file is written to the base directory.
10. Stats file is no longer written in the current directory under "projects". Instead
it is written to the project directory of the particular project.
11. Added command line support.
12. Modified proxy setting. Removed port number from proxy string. Port number
needs to be specified as a separate config entry.
13. Modified writing of stats. Stats are written to a file named 'projectname.hst' (where
projectname is the name of the current project) to the project directory. The file
extension 'hst' stands for 'HarvestMan Stats File'.
14. Write a binary project file also.
15. Modified localise links function to take care of localising anchor type links also.
This was an undetected bug in version 1.0.
16. HarvestMan can now load projects from saved project files. This can be done for
both the xml and binary project files. Added encryption for proxy related data.
17. Fixed some bugs in genconfig script. The script now encrypts any proxy related data
(except port number) before writing it to the config file.
18. Added code in WebUrlConnector to request user for authentication information
for a proxy-authenticated firewall. If the project file does not contain this information,
it will be requested from the user, interactively.
19. WebRobotParser module uses the services of WebUrlConnector now, instead of having
its own internet connection code.
20. Added a mechanism to log errors made in the config file, and inform the user about them
at the end. The mechanism uses a list of strings in the Global module (HarvestManGlobals).
21. Updated HarvestMan.dtd to add the new config entries. (CONFIGFILE/PROJECTFILE).
22. Modified FETCHLEVEL handling. Levels 0 and 1 no longer fetch external server links;
they fetch only local links. Level 2 fetches local + first level external links and
level 3 fetches any link.
23. Tried different approaches to running thread queue. Ideally the runTrackers() method
should be called when you start the project and it should run separately from the
push() method. But this led to blocking of the last download thread in many tests since
the CPU seems to run the runTrackers() method in priority to the last download thread.
So I reverted to the existing method of running trackers where the push method
makes a call to runTrackers() (I know that it is not good thread programming, but it works).
24. Modified the webUrlConnector class. It now accepts a urlPathParser object
instead of a url directly. This makes handling of urls easy and we can pass more information
around. Made corresponding changes to Monitor/Tracker/Thread classes.
25. Fixes for slowmode. Rewrote some code.
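As a rough illustration of the FETCHLEVEL rules in item 22 above, the decision can be
sketched like this. It is only one reading of that entry, not the program's code: the
function name should_fetch and the external_depth parameter are hypothetical, "first
level external links" is assumed to mean external links at most one hop away from the
starting server, and any difference between levels 0 and 1 is not modelled here.

```python
def should_fetch(fetchlevel, is_external, external_depth=0):
    """Decide whether a link is fetched under a given FETCHLEVEL.

    external_depth is a hypothetical count of external hops from the
    starting server; it is ignored for local links.
    """
    if not is_external:
        # Local links are fetched at every level in this sketch.
        return True
    if fetchlevel <= 1:
        # Levels 0 and 1: no external server links.
        return False
    if fetchlevel == 2:
        # Level 2: local links plus first level external links only.
        return external_depth <= 1
    # Level 3: any link.
    return True
```

For example, should_fetch(2, True, external_depth=2) is False, while the same link is
fetched at level 3.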
+-----------------------------------------+
|Changes in Version 1.0 (from Version 0.8)|
+-----------------------------------------+
1. Fully multithreaded. Multithreaded mode is the default.
2. Depth fetching for starting server and external servers in config file.
3. Browser page for projects similar to HTTrack.
4. Added re-fetching of failed urls.
5. Support for intranet servers.
6. Verbosity option added in config file.
7. Lots of configurable options added in the config file.
The list of options (apart from the basic ones) is now about 30.
8. Signal handler for keyboard interrupts automatically does clean up jobs.