Tested on Ubuntu 10.10.
apt-get install python-mysqldb
To configure the crawler, edit the config.ini file. For example:
host = localhost
user = root
pass = root
db = testDB
start_urls = http://www.google.com,http://codescience.wordpress.com/,http://www.python.org
max_depth = 1
log = 1
The connection section holds the usual connection settings for a
MySQL database.
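As a rough illustration of how these settings could be read, here is a minimal sketch using Python 3's standard configparser module. The section names [connection] and [params] are assumptions taken from the descriptions in this README, and load_connection is a hypothetical helper, not part of the crawler. The sample text is inlined so the sketch runs without a config.ini on disk. Python 2, as shipped with Ubuntu 10.10, names the module ConfigParser and its API differs somewhat.

```python
import configparser

# Inline stand-in for config.ini. The [connection] and [params] headers
# are assumed from the README's wording, not confirmed by the crawler.
SAMPLE = """
[connection]
host = localhost
user = root
pass = root
db = testDB

[params]
start_urls = http://www.google.com,http://www.python.org
max_depth = 1
log = 1
"""


def load_connection(text):
    """Return the connection settings as a plain dict (hypothetical helper)."""
    # interpolation=None keeps any '%' characters in values literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return dict(parser["connection"])


settings = load_connection(SAMPLE)
print(settings["host"], settings["user"], settings["db"])
```

The resulting dict could then be handed to whatever MySQL driver is in use; that step is left out here because it needs a running database.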
The params section contains:
START_URLS: A comma-separated list of URLs at which to start the
crawl. Each URL must be complete: include http:// or https://,
whichever applies.
MAX_DEPTH: The depth to crawl. 0 crawls only the start URLs.
1 crawls the start URLs and every URL linked from them.
2 also crawls every URL linked from those pages, and so on.
Warning: A depth of 3 or greater can take hours, days, months, or even years!
LOG: Whether the application prints the crawled URLs to the console.
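To make the meaning of these parameters concrete, the sketch below splits a start_urls value, rejects URLs without http:// or https://, and walks a link graph breadth-first up to a given depth. It is only an illustration of the semantics described above, not the crawler's actual code: parse_start_urls, crawl, fake_get_links and the example.com-style URLs are all hypothetical, and a small in-memory dictionary stands in for real pages so that nothing is fetched from the network.

```python
from collections import deque
from urllib.parse import urlparse


def parse_start_urls(raw):
    """Split a comma-separated start_urls value and reject incomplete URLs."""
    urls = [u.strip() for u in raw.split(",") if u.strip()]
    for url in urls:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError("incomplete URL (needs http:// or https://): %r" % url)
    return urls


def crawl(start_urls, max_depth, get_links, log=False):
    """Visit pages breadth-first, following links at most max_depth levels."""
    seen = set()
    queue = deque()
    for url in start_urls:
        if url not in seen:
            seen.add(url)
            queue.append((url, 0))
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if log:
            print(url)  # mirrors the LOG option: show each crawled URL
        if depth >= max_depth:
            continue  # depth limit reached: do not follow this page's links
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical link graph standing in for real web pages.
FAKE_LINKS = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://d.example/"],
    "http://c.example/": [],
    "http://d.example/": ["http://e.example/"],
}


def fake_get_links(url):
    """Return the outgoing links recorded for url in the fake graph."""
    return FAKE_LINKS.get(url, [])


starts = parse_start_urls("http://a.example/")
depth0 = crawl(starts, 0, fake_get_links)  # start URLs only
depth1 = crawl(starts, 1, fake_get_links)  # plus pages linked from them
depth2 = crawl(starts, 2, fake_get_links)  # plus one more level
```

Each extra level multiplies the number of pages by roughly the average number of links per page, which is why a depth of 3 or more grows so quickly on the real web.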
~$ python run.py