A Python reimplementation of a PHP web crawler: cleaner code, more efficient, and faster.
For customized crawling and scraping services, check out Crawley Cloud.
Python Web Crawler is a program that searches for links on the web and saves them in a MySQL database.
Here's more information about it: http://codescience.wordpress.com/2011/03/27/python-web-crawler/
Here's the original PHP web crawler this is based on.
Tested on Ubuntu 10.10. Install the MySQL bindings with:
apt-get install python-mysqldb
To configure the crawler, edit the config.ini file, e.g.:
========================================================================
[connection]
host = localhost
user = root
pass = root
db = testDB

[params]
start_urls = http://www.google.com,http://codescience.wordpress.com/,http://www.python.org
max_depth = 1
log = 1
========================================================================
The connection section specifies the standard connection settings for a MySQL database.
The params section contains:
START_URLS: A comma-separated list of URLs to start the crawl from. Each must be a complete URL, so don't forget the http:// or https:// scheme, whichever is applicable.
MAX_DEPTH: The depth to crawl. 0 crawls only the start URLs. 1 crawls the start URLs plus every URL found inside them. 2 also crawls the URLs found at depth 1, and so on. Warning: a depth of 3 or greater can take hours, days, months, or even years!
LOG: Whether the application prints the crawled URLs to the console.
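As a quick illustration, here is how a script might read these settings with Python's standard-library configparser. The key names match the example config above; the inline sample string is just a stand-in for the config.ini file on disk.

```python
# Minimal sketch of parsing the config.ini settings described above
# (Python 3, standard library only). In the real crawler the file
# would be read from disk with config.read("config.ini").
import configparser

SAMPLE = """\
[connection]
host = localhost
user = root
pass = root
db = testDB

[params]
start_urls = http://www.google.com,http://codescience.wordpress.com/,http://www.python.org
max_depth = 1
log = 1
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)

# start_urls is a single comma-separated string; split it into a list
start_urls = [u.strip() for u in config["params"]["start_urls"].split(",")]
max_depth = config.getint("params", "max_depth")
log_enabled = config.getboolean("params", "log")  # "1" parses as True

print(start_urls)
print(max_depth, log_enabled)
```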
To run the crawler:
~$ python run.py
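The MAX_DEPTH semantics above can be sketched as a breadth-first crawl. This is not the project's actual code: the real crawler fetches pages over HTTP and stores URLs in MySQL, while this sketch substitutes an in-memory page map for the network so the depth cutoff is easy to see.

```python
# Sketch of a depth-limited breadth-first crawl (Python 3, stdlib only).
# A toy in-memory "web" stands in for HTTP fetching.
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

def crawl(start_urls, max_depth, fetch):
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # max_depth = 0 visits only the start URLs
        for link in extract_links(fetch(url)):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

# Toy web: a links to b, b links to c.
PAGES = {
    "http://a": '<a href="http://b">b</a>',
    "http://b": '<a href="http://c">c</a>',
    "http://c": "",
}

found = crawl(["http://a"], max_depth=1, fetch=lambda u: PAGES.get(u, ""))
print(sorted(found))  # depth 1 reaches b but not c
```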