*==========================================================*
|            -Changes.txt file for HarvestMan-             |
|                                                          |
|           URL: http://harvestman.freezope.org            |
*==========================================================*

Version 1.4.6 final
Release Date: Sep 9 2005
Release Focus: Minor bugfix

Changes
=======

1. Fixed bugs in the setup.py and install scripts so that they work with Python 2.4.
2. Updated py2exe install script. It works correctly with py2exe version 0.6.1 upwards.

Version 1.4.5 final
Release Date: Aug 19 2005
Release Focus: Bug-fixes

Changes
=======

1. Added a subdomain flag to the command line.
2. For verbosity level of zero, no message is printed. Earlier this used to print the welcome message.

Bug-fixes
=========

1. Fixed the bug with starting a project by reading back an existing project file. This was not working before. The project file is written out using the Python marshal module, not pickle.
2. Fixed bugs in localization. The regular expression's sub method should replace a URL only once. Test site: http://www.oligopolywatch.com .
3. Verbosity command line flag was not working. Fixed it. Fixed errors with a few other command line options.
4. The stop project method of the program now calls the "terminate" method on threads so we don't have hanging threads.

Version 1.4.5 b1 (beta 1)
Release Date: Aug 02 2005

Improvements
============

1. There is only one improvement in this release, the new command line options. The new release has a complete set of new command line options written from scratch. It replaces the previous cluttered and confusing command line. A notable feature is that you can use HarvestMan like wget for only downloading URLs with a nocrawl option. The new command line supports a number of useful options which the user is most likely to configure. It skips a number of advanced or obscure options that the user need not be bothered with, making the command line user friendly.
   For more information, consult the Readme.txt of the package or go to http://harvestman.freezope.org/commandline.html .

Bug-fixes
=========

1. Added extensions .shtm, .php4, .aspx, .cfm, .cfml, .cms as valid web-page extensions in urlparser.py. So web-pages ending in these extensions will work with HarvestMan. (These were present in HarvestMan 1.4 alphas but somehow got lost!)
2. When printing the url tree, duplicate links were not checked. This has been fixed by adding a check.
3. A minor bug in setting verbosity in logger object was fixed.
4. Comments will be printed for starting & stopping of url server at verbosity level 3. Comments for pinging url server are raised to debug level 4.
5. Program version number, when printed using the -v option, will print the release level also. For example right now this will be printed as 'HarvestMan 1.4.5 beta 1'. Earlier it used to print only the version number.
6. The __fix method of config.py now looks at the number of URLs. If no URLs are found (either from config.xml or through command line), it exits with an error message.
7. Asyncore thread for urlserver is now a daemon thread, so it will exit if the program is killed.
8. Fixed a minor bug in set_proxy in connector.py where the function to set proxy was being called three times. Changed this to once.
9. Fixed a bug in rules.py. Member self._robocache should be a list.

Version 1.4.5 a2 (alpha 2)
Release Date: 21/07/2005

Bug Fixes
=========

1. Fixed a bug in calculating url paths of directory-like urls which use the set_directory_url method in module urlparser.py . This was causing a number of invalid urls which resulted in HTTP 404 errors. This bug is fixed in this version.
2. Fixed a bug in urls that use HTTP redirection with cookies. Sometimes some websites send a new url and a cookie along with an HTTP redirection error (301, 302) when a url is requested. The HTTP redirection handler is expected to send a new request with the new url and the cookie.
   These kinds of urls now work with HarvestMan. Fix in connector.py module. If you are using Python 2.4, this uses the cookielib module and the new HTTPCookieProcessor handler. However, even if you are using Python 2.3 or earlier versions, this will work since a new HTTP redirect handler is added in the connector module that takes care of this.
3. Fixed a bug in parsing tags in module pageparser.py .
4. Fixed a bug that created invalid urls because the html parser object was not reset before parsing every time. This is now fixed in module crawler.py .
5. Fixed a bug in connector.py module in extracting error numbers and error strings from error objects.
6. Fixed a bug in logger.py module to correctly convert non-string types to string types.
7. Fixed a bug in config.py to take care of timelimit settings. This was getting ignored before.

Other Changes
=============

1. All file encodings are now in latin-1, since iso-8859-1 was causing some problems.
2. A number of modules now use the high performance collections.deque data structure if HarvestMan is run with Python 2.4. If not, these default to lists.
3. Some functions in common.py module are removed. Some are moved to utils.py module.
4. Error handler function in harvestman.py removed.
5. Module htmlparser is removed since it is no longer used.
6. Module cookiemgr is removed, since it is no longer used. Essential cookie handling is available in connector.py module.
7. The PriorityQueue in urlqueue.py module now uses a modified collections.deque object if run with Python 2.4. Otherwise it defaults to a list.
8. Exception handlers rewritten in many modules.
9. Unnecessary and commented out debug statements are removed.
10. Tool 'cachereader.py' is removed from tools sub-directory.

Version: 1.4.5 a1 (alpha 1)
Release Date: 27/05/2005

Features
========

1. Changed config file format from text to xml. The default config file from this version onwards is named 'config.xml'.
   The text config file format also works, but will be slowly phased out in future releases.
2. New HTML parser based on SGMLParser module.
3. Dependency on HTML tidy is removed.
4. New archive feature for archiving project files to tar.bz2/tar.gz archives.
5. Changes in project caching:
   - Data of web pages is compressed before writing to cache.
   - Cache data structure changed to a dictionary, from list.
   - Option for writing cache in DBM format.
   - Headers of urls are also written to cache. This can be turned on or off.
6. A junk filter for filtering out banner ads and similar urls.
7. HarvestMan now works with Python 2.4.
8. New scripts in 'tools' directory:
   - A script to generate project files from cache.
   - A script to dump url headers in the form of a DBM file from the project cache.
   - A script to convert between xml & text config files.

Bug-fixes
=========

1. Bug fixes in urlparser module.
2. Bug fixes in datamgr module.
3. Bug fixes in rules module.

Version: 1.4 final (Bug fixes + Minor features)
Release Date: Dec 17 2004

Changes from version 1.3.9-1
============================

Features
========

1. Added an asynchronous url server which listens to port 3081 (by default). The url server can be optionally enabled to gather and send urls instead of using a Queue. This can be faster, since the url server uses asyncore module of Python with queues, which is faster than just using queues. To enable this feature, set the config variable network.urlserver to 1.
2. Modified caching algorithm to store the data of the downloaded files in the cache file. Hence if someone accidentally deletes the downloaded files, HarvestMan can recreate the files from the cache file, without actually downloading them, if they are up to date.
3. Queue architecture modified. The data queue has been replaced with a links queue. Instead of pushing web page data into a queue, fetchers process them and push the new urls to a queue.
   Crawlers get the urls, walk through them and post the newly created url objects into the url queue or send them to the url server. This saves memory on the queues.
4. Added an option for controlling file download based on maximum file size. The maximum size by default for a single file is 1 MB.
5. Added an option for dumping a url tree which shows parent-child dependencies of the urls generated. This can be either a text file or an html file.
6. Added an advertisement/banner filter to the rules module. If enabled this can skip urls related to ad banners or graphics.
7. New controller thread to manage file and time limits on downloads.

Fixes
=====

1. This release fixes a huge bug in HarvestMan, i.e. that of hanging threads. The threading architecture is modified to introduce local buffers. Threads do an unblocked push on the queue as opposed to a blocked push in all previous versions. If they cannot push the data (Queue full) after 5 attempts, they store the data in a local buffer. In the next loop of the threads, they try to push the buffer data before creating any new objects to push (by crawling pages/parsing html files). This ensures that the threads don't block continuously on the queue, which led to deadlocks and timeouts.
2. Increased the idling time of threads to reduce CPU load.
3. Fixed a bug with correctly identifying WWW urls.
4. Fixed a bug that incorrectly modifies urls with spaces between words.
5. Fixed many bugs with get_relative_filename method.
6. Fixed bugs with generating urls. Trailing spaces and/or newlines need to be removed from path components.
7. Added a method to correctly identify the type of a url based on its mimetype.
8. Fixed bugs in robot protocol checking method. Many optimizations are also added to quickly process urls. A robot object cache (dictionary) and url object whitelist has been added to reduce processing time. Also html files need to be processed.
9. Fixed bugs in url filter checking method.
10. Fixed bugs in the order of checking rules in violates_basic_rules method.
11. Fixed bug in creating regular expression for filtering based on file extension.
12. Many bug fixes in localise_file_links method.
13. Fixed a bug in correctly generating the regular expression for old url.
14. Fixed a bug in localising file names. All web page files are correctly localised now.
15. Fixed a bug in updating files from project cache.
16. Bugfixes in urltracker module.
17. Fixed the bug where the program sometimes exits just after downloading the first url.
18. Fixed bug with parsing link.
19. Fixed error in managing an empty url. Correct error message is printed now.
20. Fixed bugs with logging errors. The error log stream is not enabled now.
21. Fix to allow special characters in project base directory (such as ~ for home directory on Unix systems).
22. Fixed bug in function that opens robots.txt urls.
23. Removed some useless arguments from some functions.
24. Fixed bug with url object in connect(...) function.
25. Fixes to make slow mode work.
26. Modified to use methods of cPickle module instead of pickle module in utils.py (cPickle is faster).
27. Use our own strptime module since this function is not available on all Python versions on Windows.
28. Fixes in locale setting on Windows platform.
29. Log file for each project is now generated in the project directory as '.log'. This is not a configurable option anymore.
30. The verification of downloaded files by checksumming is disabled. This is not a configurable option anymore.
31. The renaming algorithm is disabled since it is not general purpose.

Other Changes
=============

1. License of program changed to GNU GPL.
2. The genconfig.py script is more interactive now, displaying the options selected.
3. Language encoding specified on top of all Python files.
4. A script to check Python dependencies, namely 'check_dep.py', has been added.
5. Installation made easier on Linux and Unix like systems.
   A script named 'install' does the job for you.
6. The 'genutils' directory is renamed to 'tools'.

Version: 1.3.9-1 (minor bug fixes)
Release Date: June 24 2004

Changes in version 1.3.9-1 from 1.3.9
=====================================

1. Fixed a bug in cache algorithm. Key 'checksum' should not be checked if it is old cache.
2. Fixed a bug in connector.py. Check for valid url object in line 622.
3. Fixed a bug in urlparser. Anchor type urls should have the url file name as base url, not original url filename.
4. Fixed a bug in url tracker. Anchor type urls should not be skipped.

Version: 1.3.9 (features/bug fixes)
Release Date: June 14 2004

Changes in version 1.3.9 from 1.3.4
==================================

New Features
------------

1. Url priorities: Every url is assigned a priority according to which it is downloaded. Urls with higher priority are downloaded first. Priorities are determined by 3 factors.

   a. The generation of the url
   b. Whether the url is a webpage
   c. User defined priorities

   Urls in a lower generation are given higher priority when compared to urls in a higher generation. This makes sure that urls which were created in the beginning of a project get downloaded first. Webpage urls are given a higher priority when compared to other urls. Apart from this, the user can define priorities in the config file in the range of (-5,5) based on file extensions.

2. Website priorities: These are like url priorities but can be specified by the user in the config file. Sample usage:

       control.serverpriority www.foo.com+3,www.bar.com-3

3. Thread groups for downloads: The download threads are now pre-launched in a group similar to tracker threads. The download jobs are submitted to the thread pool, which in turn delegates them to the threads. The thread pool has been made into a queue for this. This reduces thread latency, since we no longer spawn new threads during the life cycle of the program.

4. Allow urls with spaces: HarvestMan can now download urls which contain spaces like 'http://www.foo.com/bar/this url.html'.

5. Changed the way to distinguish between directory and file like urls. Earlier when we parsed the url, a connection was made to the url, assuming it was directory like. If the reply was HTTP 404 error, then it was assumed correctly to be a file like url. This has been changed in the new version. We assume all urls are file like. For example, if there is a url like http://www.foo.com/bar/file , which can be a directory http://www.foo.com/bar/file/index.html or file http://www.foo.com/bar/file, we assume it is a file initially and try to download it. The geturl() method of the file-like object returned by opening the url will tell whether it is file like or directory like. This information is used to modify the local (disk) file name of the url at that point. This decouples the modules urlparser and connector to a large extent and makes performance better with such urls.

6. Added functionality to tidy html pages before parsing them by using 'uTidy', the python port of html tidy. This helps to crawl sites that exit due to parsing errors in previous versions of HarvestMan.

7. Intranet downloads need not set a specific flag (download.intranet). Instead HarvestMan can figure out whether the server is in intranet by resolving its name and take appropriate action. This allows intranet/internet downloads to be mixed in the same project.

8. Modified the way url information is cached. The field 'last-modified' in url's headers is used, if it is available. If it is not there, a checksum based on the content of the url is used (previous algorithm) as fallback.

Other Changes
=============

1. Regular expressions for filters are pre-compiled.
2. Derived HarvestManStateObject (config class) from 'dict' type.
3. Main thread 'joins' each tracker thread with zero timeout instead of killing them at the end of project.
4. Optimization fix: Links are stored for localising, only if their download is successful.
5. Assigned 2:1 ratio for fetchers and crawlers instead of current 1:1 ratio.
6. Renamed all modules.
7. Used 'weakref' wherever possible to reduce extra references to objects and avoid reference loops. This is mostly used in 'GetObject' method and in urlparser module.

Bug fixes
========

1. Fixed a bug with http://localhost downloads. Bug ID # B1083256752.28.
2. Fixed bug in url filter for images.
3. Fixed a bug with timezone printing. Bug ID # B1083253695.02.
4. Close file like object returned by opening urls after reading data.
5. Fixed a bug in localising links. Directory like urls need to be skipped.
6. Fixed bug in finding common domain for servers that have fewer than three 'dots' in their name string. (This is the same bug as # B1083256752.28.)
7. Fixed a bug in setting up network for clients behind a proxy/firewall.

Version: 1.3.3 (bug fixes)
Release Date: Feb 24 2004

Changes in Version 1.3.3 from 1.3.2
===================================

1. Fixed bug with parsing of FTP links. Bug # B1077613467.85.
2. Fixed another bug with external server links.
3. Fixed bug with request control. Request dictionary key is server name, not ip.

Version: 1.3.2 (minor feature enhancements)
Release Date: Feb 13 2004

Changes in Version 1.3.2 from 1.3.1
===================================

There is one minor feature in this release.

1. This release adds ability to limit downloads by controlling the number of simultaneous requests from the same server. This option can be controlled by the config variable named 'control.requests'.
2. Apart from that I have re-structured the package, and added a distutils setup.py script which copies the package to your PYTHON installation folder.

Version: 1.3.1 (bug fix)
Release Date: Feb 10 2004

Changes in Version 1.3.1 from 1.3
=================================

This version is a bug fix version fixing most of the critical and annoying HarvestMan bugs.
These bugs can be located in the bugs database at http://harvestman.freezope.org/Discussons .

1. Fixed bug with query forms. The program no longer tries to download server side query form links. Bug #B1073291938.97.
2. Fixed bug with handling frame redirects. Bug #B1076402199.0.
3. Fixed bug with robots.txt url. Bug #B1072436188.35.
4. Fixed bug in finding out external server links. Bug #B1076402348.52.
5. Fixed bug in external links with respect to subdomains. Bug #B1076409910.45.
6. Fixed bug with following non-existent links in a directory listing. Bug #B1073028403.71.
7. Fixed problem in printing harvestman url in welcome message.
8. Fixed some problems in config file parsing.
9. Fixed problem with printing version string (-v and --version options).
10. Other miscellaneous fixes and corrections thanks to Vivian, Sascha and some others.

Version: 1.3 (final)
Release Date: Dec 15 2003

Changes in Version 1.3 (from 1.3 a1)
=========================================

1. This version adds one feature, that of searching a webpage for keywords. You can create complex boolean regular expressions and supply them to HarvestMan. HarvestMan will parse the regular expressions and download only those web pages that match the regular expression. In simpler words, this means a keyword(s) search. :-)

   For example, you need to download only those webpages that contain the term 'Saddam' and 'WMD'. You create the following regular expression and pass it on to HarvestMan as the option 'control.wordfilter'.

       ;; config file for harvestman
       control.wordfilter (Saddam & WMD)

   You use the boolean '&' and '|' to create the regular expressions. I have added this as a recipe in the ASPN Python Cookbook. For more information on how it works, point to the URL, http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/252526.

Changes in Version 1.3 a1 (from 1.2 final)
=========================================

1. This version features the new threading model which was started in the last release.
   This model is now completely written to prevent thread deadlocking incidents. A description of the model can be found in the HarvestMan webpage at http://harvestman.freezope.org. This model will be developed further and will be the default for all future releases of HarvestMan.
2. The other major changes are complete re-writes of many modules. Classes have been renamed wherever suitable and some function names changed. The HarvestMan module has been trimmed down considerably.
3. This version has added an extra module HarvestManUtils which has some utility classes for reading/writing project & cache files and for creating the browse page. The code for these was earlier in the HarvestMan, HarvestManDataManager and HarvestManConfig modules.
4. The cache and project file information is compressed before writing to files.

Changes in Version 1.2 final (from 1.2 rc2)
===========================================

1. Added support for javascript and java applet tag parsing. HarvestMan can now fetch javascript (.js) files and java applet (.class) files from webpages. The code for parsing this sits in the new HTMLParser customized for HarvestMan.
2. Designated url trackers to two flavors - Fetchers and Getters. Fetchers are responsible for crawling webpages and fetching links, and Getters get the non-html files fetched by Fetchers. Images are still fetched by the Fetchers in their threads. This should help in the growth of this program and make future development easier. Also this might help in preventing the thread locking incidents.
3. Fixed bugs in localizing anchor type links. Rewrote HarvestManPageParser, HarvestManUrlPathParser and HarvestManDataManager classes to take care of this. Anchor links in webpages are localized correctly now.
4. Due to javascript/javaapplet parsing code in the new html parser, many webpages which failed to work before (mostly due to javascript tags which the parser could not understand) will work correctly now.
5. Other routine bug fixes.
   a) Fixed a problem in creating the project browse page. We need to provide the absolute path of the project start url file.
   b) Fixed a problem in getRelativeFilename() in HarvestManUrlPathParser class.
   c) A few more...

Changes in Version 1.2 rc2 (from 1.2 rc1)
=========================================

Release Date: Sep 27 2003

1. Rewrote the algorithm for fetching urls with no filename extensions. We assume that it is a directory-like url (of the form dir/index.html) and try to fetch it during url path resolving time (in urlPathParser class). If this fails, a 404 error is returned. The url is cached for later lookup in the datamanager in an invalid urls cache. We re-resolve the url assuming it now as a file-like url (of the form /file ) and fetch it. If it does not fail, the url is again cached for later lookup in the datamanager in a valid urls cache. The connector object is also cached in a connector dictionary of the datamanager so that we don't need to re-create the connection later. This fixes the long-standing bug with urls with no filename extensions.
2. Rewrote algorithm for localizing links. Instead of re-parsing html files and localizing the links, a dictionary of html files and their links is kept in the datamanager object. This dictionary is updated during crawling time with the url objects for each html file. This dictionary is used at the end for localizing. This improves localization time by as much as 500%.
3. Fixed a bug in calculating project time. (Time for localization should not be included.)
4. Modification in printing error messages. Error messages are printed only for verbosity levels of 3 and up. OS and IO exceptions are printed only at verbosity level 4 (debug). For seeing url error messages (connection errors), you need to set the verbosity to 3 now. At the default verbosity level (2), no error messages can be seen.
5. Modified the checking of hanging threads. This check was not done properly.
   Now it is done in the loop that checks for exit condition. Also, reduced default timeout for hanging threads from 600 seconds (10 minutes) to 120 seconds (2 minutes). Added socket timeout for sockets. This is same as thread timeout above. (This works for users using Python 2.3.) This will fix the problem of hanging threads in a big way.

Changes in Version 1.2 rc1 (from 1.2 alpha)
===========================================

Release Date: Sep 24 2003

1. Removed the earlier global download lock. Earlier the url connector instances shared a common lock which they had to acquire before downloading. This led to only a single download being possible at a given moment. This has been changed to multiple downloads which can be specified in the configuration file.
2. We can specify any number of connections in the config file now. The program makes sure that there are only so many connections running at a given instant. This takes the place of the previous global download lock. Since now many simultaneous downloads are possible (apart from many threads), the program is much faster than before.
3. Added an option for writing pickled cache files. This has been made the default in this release. XML cache files take a long time to read, if they are big.
4. Integrated genconfig.py script with harvestManConfig class. This makes future developments of this script easier. Added an abort condition to the script which can be invoked by pressing the key.
5. Fixes for handling error conditions in the url connector class. Arbitrary error numbers are no longer used, instead we try to get the error number by parsing the error strings.
6. Redownload of failed links works only for links that failed with non-fatal errors. This speeds up projects.
7. Modified the regular expression behaviour. Compile the regular expressions to optimize regular expression search.
8. Moved code around from HarvestMan.py module to reduce its size. Parsing of config file is now done in the HarvestManConfig module.
9. Removed usage of 'string' module everywhere and replaced with methods on string objects.
10. Added a timeout option for the project. Sometimes the last thread in the program does not complete, hanging an otherwise well downloaded project. This option looks at the last data operation into the url queue and times it. If the time since the last operation (get/put) is more than a prescribed time, the project times out. We also wait now for the download sub-threads to complete their work before exiting. This fixes any premature project exit conditions.
11. Change in writing project files. We now write pickled project files instead of XML project files. This will be the default from this release.
12. Bug fixes in urlpathparser module for fixing relative filename computation errors.
13. Bug fixes in rules module. Rewrote some methods in this module.
14. Fixes in creating the project browse page. The project browse page entry is now created correctly for every new project.
15. Many other routine bug fixes to speed up downloads and reduce bugs in threading.

Changes in Version 1.2 alpha (From version 1.1.2)
================================================

1. This version has introduced limited support for Cookies. This is experimental code, written from scratch following RFC 2109. The cookie support is pretty basic with only domain cookies supported. Netscape style cookies may not work.
2. Support for webpage caching is available. A cache file (xml) is created in the project directory for a project, the first time. The cache file associates urls to files on the disk. We compare files by using an md5 checksum on the file content. For any further runs of the project, only the out-of-date files are re-fetched.
3. Many bug fixes and better error checking.
4. Bugs in genconfig script fixed.
5. Documentation changes: We provide an RTF version of the documentation file now.
   (Request by John J Lee of Clientcookie fame)

Changes in Version 1.1.2 (From version 1.1.1)
============================================

1. Added a fast html parser based on sgmlop module by F.Lundh. This can be selected by setting the variable HTMLPARSER in the config file to 1. The default parser is still the standard python parser.
2. Added an option to localise links relatively. This is the default now. That is, we don't replace filenames with their absolute pathname but only relative pathname, so that users can browse the downloaded pages on another filesystem also.
3. Added an option for the user to control md5 checksumming of files. This option is controlled by the variable CHECKFILES in the config file.
4. Support comments at the end of an option line in the config file. (Egs: is valid now. It would have thrown an error before.)
5. We are not localising form links. This makes sure that a cgi query goes directly to the webserver.
6. An option for JIT (Just In Time) localization of url links. If this option is selected, then urls in html files are localized immediately after they are downloaded, instead of at the end.

Changes In Architecture (Version 1.1)
====================================

1. Global Object Register/Lookup
-----------------------------

One of the major changes in this version is the architecture of harvestman program. It uses a modified Object Oriented approach of looking up objects whenever the services of an object are needed by other objects. The classes no longer maintain pointers to other class instances inside them. All Harvestman program objects register themselves with a global registry/look-up object when they are created. (It is up to the programmer to do this.) The registry object is a Borg singleton ensuring that the state of the objects is maintained. The objects are stored in the dictionary of the registry object using strings as the key.
When an object needs the services of another, it performs a simple 'query' or 'lookup' of the registry using the key of that particular object. (This should be known. Right now we don't support a publish/subscribe mechanism, it will be added later.) The register object sits in the Harvestman globals module, so it is available to objects in all modules which do an import of this module. An example is given below.

    # Create and register the object.
    obj1 = HarvestManObject1()
    HarvestManGlobals.SetObject('object1', obj1)

    # Object2 wants services of obj1
    obj1instance = HarvestManGlobals.GetObject('object1')
    # Use its services
    obj1instance.func1(...)

This makes adding new modules to HarvestMan easy, if you make sure that you register them in the globals module.

2. Threading Model
---------------

HarvestMan versions till 1.0 were using a model where url tracker threads were stored in a queue. A url tracker object consisting of data of a url was pushed into a queue and was later popped by a monitor object so that downloads could be controlled. This gave rise to problems of controlling threads and overhead in the form of new thread contexts since we were not reusing threads.

HarvestMan version 1.1 uses a preemptive threading model and reuses threads. Also, only thread data is managed in the queue, and not the threads themselves. The number of threads (as per the config file or command line user input) are pre-launched in the beginning of the program. They run in a loop looking for url data which is managed by a url data queue. Threads post their url data to this queue. This ensures that we always have a given number of threads running. It also reduces overheads and latency.

HarvestMan sub-threads in the HarvestManUrlThread module still use a post-emptive (new thread launched per request) mechanism. This might be changed in future releases.

3. Code Reorganization
-------------------

The new version features some extra modules which have been created by moving code from existing modules and re-writing them. The aim was to split crawler code from data management code, in which we succeeded quite well. There is a new Data Manager module which takes care of scheduling downloading requests, indexing files, keeping file statistics and localizing links. A Rules module checks the HarvestMan download rules (this was earlier done by the previous "WebUrlTrackerMonitor" class).

A Synchronization lock has been added in the Connector module. This might slow down downloads a bit, but should ensure that threads don't corrupt the data. Interested users can experiment with the lock, removing it or modifying it, and see how it works. Please report any improvements in performance you see to the authors.

4. Other Changes
-------------

For other changes continue reading.

HISTORY
=======

+-----------------------------------------+
|Changes in Version 1.1 (from Version 1.0)|
+-----------------------------------------+

1. A project file is created for every project in the harvestman directory in the subdirectory 'projects'.
2. Always download css files related to a web-page, even if it is outside of domain or directory. Same for images. Config options for both added in the config file.
3. Added a config file option to rename dynamically generated images. Works right now for jpeg/gif images.
4. Modified the urlfilter algorithm to check the order of filter strings in case of a collision in filter results.
5. Added a new option FETCHLEVEL to the program to allow very basic control of download. For details see Readme.txt/HarvestMan.doc file.
6. Get background images of webpages.
7. Better error/message logging. Error files are created in each project's download directory. All messages are logged to a file in the harvestman installation directory. This by default is named 'harvestman.log'.
   The user can change this name by editing the config file. The log
   file is created fresh for every project.
8. Added support for getting files from ftp servers.
9. Write a project file based on HarvestMan.dtd before starting to
   crawl. This file is written to the base directory.
10. The stats file is no longer written in the current directory under
    "projects". Instead it is written to the project directory of the
    particular project.
11. Added command line support.
12. Modified proxy setting. Removed the port number from the proxy
    string. The port number needs to be specified as a separate config
    entry.
13. Modified writing of stats. Stats are written to a file named
    'projectname.hst' (where projectname is the name of the current
    project) in the project directory. The file extension 'hst' stands
    for 'HarvestMan Stats File'.
14. Write a binary project file also.
15. Modified the localise links function to take care of localising
    anchor type links also. This was an undetected bug in version 1.0.
16. HarvestMan can now load projects from saved project files. This can
    be done for both the xml and binary project files. Added encryption
    for proxy related data.
17. Fixed some bugs in the genconfig script. The script now encrypts any
    proxy related data (except the port number) before writing it to the
    config file.
18. Added code in WebUrlConnector to ask the user for authentication
    information for a proxy-authenticated firewall. If the project file
    does not contain this information, it will be requested from the
    user interactively.
19. The WebRobotParser module now uses the services of WebUrlConnector
    instead of having its own internet connection code.
20. Added a mechanism to log errors made in the config file and inform
    the user about them at the end. The mechanism uses a list of strings
    in the Global module (HarvestManGlobals).
21. Updated HarvestMan.dtd to add the new config entries
    (CONFIGFILE/PROJECTFILE).
22. Modified FETCHLEVEL handling. Levels 0 - 1 do not fetch external
    server links now.
    Levels 0 - 1 fetch only local links. Level 2 fetches local +
    first-level external links and level 3 fetches any link.
23. Tried different approaches to running the thread queue. Ideally the
    runTrackers() method should be called when you start the project and
    it should run separately from the push() method. But this led to
    blocking of the last download thread in many tests, since the CPU
    seems to run the runTrackers() method in priority to the last
    download thread. So I reverted to the existing method of running
    trackers, where the push method makes a call to runTrackers() (I
    know that it is not good thread programming, but it works.)
24. Modification to the webUrlConnector class: this class now accepts a
    urlPathParser object instead of a url directly. This makes handling
    of urls easy and we can pass more information around. Made
    corresponding changes to the Monitor/Tracker/Thread classes.
25. Fixes for slowmode. Rewrote some code.

+-----------------------------------------+
|Changes in Version 1.0 (from Version 0.8)|
+-----------------------------------------+

1. Fully multithreaded. Multithreaded mode is the default.
2. Depth fetching for starting server and external servers in config
   file.
3. Browser page for projects similar to HTTrack.
4. Added re-fetching of failed urls.
5. Support for intranet servers.
6. Verbosity option added in config file.
7. Lots of configurable options added in the config file. The list of
   options (apart from the basic ones) is now about 30.
8. Signal handler for keyboard interrupts automatically does clean-up
   jobs.
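
Note on the threading model
---------------------------

The pre-launched thread scheme described under "Threading Model" above
(a fixed number of threads looping over a queue of url data) can be
sketched roughly as follows. This is only an illustration in
present-day Python and is not HarvestMan code: the names fetch_worker
and run_pool are hypothetical, and the "download" step is a placeholder
that merely records the url. Unlike the real crawler, the workers here
do not post newly found urls back to the queue.

```python
import queue
import threading


def fetch_worker(url_queue, results, lock):
    """Hypothetical worker: loop, taking url data from the shared queue."""
    while True:
        url = url_queue.get()
        if url is None:
            # Sentinel value: no more work for this thread.
            url_queue.task_done()
            break
        # Placeholder for a real download; we only record the url.
        with lock:
            results.append(url)
        url_queue.task_done()


def run_pool(urls, num_threads=4):
    """Pre-launch num_threads workers, feed them urls, wait until done."""
    url_queue = queue.Queue()
    results = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=fetch_worker,
                         args=(url_queue, results, lock),
                         daemon=True)
        for _ in range(num_threads)
    ]
    # Threads are started once, up front, and reused for every url.
    for t in threads:
        t.start()
    for url in urls:
        url_queue.put(url)
    # One sentinel per worker so that each loop terminates.
    for _ in threads:
        url_queue.put(None)
    url_queue.join()
    for t in threads:
        t.join()
    return results
```

Only url data travels through the queue; the threads themselves are
created once and never stored in it, which is the point of the change
made in version 1.1.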