Python Web Crawler

A Python reimplementation of the PHP Web Crawler, with cleaner, faster, and more efficient code.

  • Language: Python
  • Released: Mar 28, 2011
    Last Update: Jun 25, 2011

For customized crawling and scraping services, check out Crawley Cloud.

Python Web Crawler is a program that searches for links on the web and saves them in a MySQL database.
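As a rough illustration of the "searches for links" step, here is a minimal sketch written for Python 3 using only the standard library. It is not taken from the project's source: the names `LinkExtractor` and `extract_links` are hypothetical, and the sketch works on an HTML string that has already been downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/about" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs of all links found in an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


page = '<a href="/about">About</a> <a href="http://www.python.org">Python</a>'
links = extract_links(page, "http://example.com/")
```

Here `links` holds the two absolute URLs found in `page`; a crawler would then store them and visit each in turn.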

Features:

  • Multi-process crawling to improve speed
  • MySQL database to store the links
  • Easy to extend
  • Clean and readable Pythonic code
  • URL validation via regular expressions
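To give an idea of what validating a URL with a regular expression can look like, here is a deliberately simplified sketch. The pattern and the function name `is_valid_url` are assumptions made for illustration, not the expression the project actually uses, and the pattern rejects some legitimate URLs (for example, a query string directly after the host).

```python
import re

# Simplified pattern: http or https scheme, a host, an optional port,
# and an optional path. This is an illustrative assumption only.
URL_PATTERN = re.compile(
    r"https?://"         # scheme
    r"[A-Za-z0-9.-]+"    # host name
    r"(?::\d+)?"         # optional port
    r"(?:/\S*)?"         # optional path, query, and fragment
)


def is_valid_url(url):
    """Return True if the whole string looks like a complete http(s) URL."""
    return URL_PATTERN.fullmatch(url) is not None
```

With this pattern, `http://www.python.org` is accepted, while `www.google.com` is rejected because it lacks the scheme.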

Here's more information about it: http://codescience.wordpress.com/2011/03/27/python-web-crawler/

Here's the original PHP web crawler this is based on.

Getting Started

Tested on Ubuntu 10.10.

Pre-requisites:

sudo apt-get install python-mysqldb

Usage:

To configure the crawler, edit the config.ini file. For example:

========================================================================
[connection]
host = localhost
user = root
pass = root
db = testDB

[params]
start_urls = http://www.google.com,http://codescience.wordpress.com/,http://www.python.org
max_depth = 1
log = 1
========================================================================

The connection section holds the usual MySQL connection settings: host, user, password, and database name.
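For readers who want to see how a file like the config.ini above could be read, here is a minimal sketch using Python 3's standard `configparser` module. The function name `parse_config` and the shape of the returned dictionary are assumptions for illustration; the project's own loading code may differ.

```python
import configparser

# Same content as the example config.ini, without the "====" delimiter lines.
SAMPLE_CONFIG = """
[connection]
host = localhost
user = root
pass = root
db = testDB

[params]
start_urls = http://www.google.com,http://codescience.wordpress.com/,http://www.python.org
max_depth = 1
log = 1
"""


def parse_config(text):
    """Parse config.ini text into a plain dictionary (hypothetical helper)."""
    # Interpolation is disabled so a "%" inside a URL is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    params = parser["params"]
    return {
        "connection": dict(parser["connection"]),
        # start_urls is one comma-separated string; split it into a list.
        "start_urls": [u.strip() for u in params["start_urls"].split(",") if u.strip()],
        "max_depth": params.getint("max_depth"),
        "log": params.getboolean("log"),
    }


config = parse_config(SAMPLE_CONFIG)
```

To read the real file instead of a string, `parser.read("config.ini")` can replace the `read_string` call.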

The params section contains:

  • START_URLS: A comma-separated list of URLs at which to start the crawl. Each entry must be a complete URL, including the http:// or https:// scheme, whichever applies.

  • MAX_DEPTH: The depth to crawl. 0 crawls only the start URLs. 1 crawls the start URLs and every URL found in them. 2 also crawls every URL found in those pages, and so on. Warning: a depth of 3 or greater can take hours, days, months, or years!

  • LOG: Whether the application prints the crawled URLs to the console.
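The MAX_DEPTH semantics described above can be sketched as a breadth-first traversal. This is an illustrative Python 3 sketch, not the project's implementation: the function `crawl`, its `get_links` callback, and the `FAKE_WEB` dictionary are hypothetical, and a small in-memory link graph stands in for real HTTP requests.

```python
from collections import deque


def crawl(start_urls, max_depth, get_links):
    """Visit URLs breadth-first, following links up to max_depth levels."""
    start = list(dict.fromkeys(start_urls))  # drop duplicates, keep order
    seen = set(start)
    queue = deque((url, 0) for url in start)
    crawled = []
    while queue:
        url, depth = queue.popleft()
        crawled.append(url)
        if depth >= max_depth:
            continue  # deepest allowed level: do not follow its links
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return crawled


# Hypothetical link graph standing in for the web: page -> links it contains.
FAKE_WEB = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"]}


def fake_links(url):
    return FAKE_WEB.get(url, [])


only_start = crawl(["a"], 0, fake_links)  # ["a"]
one_level = crawl(["a"], 1, fake_links)   # ["a", "b", "c"]
```

Each extra level multiplies the number of pages by roughly the average number of links per page, which is why the warning about depths of 3 or more applies.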


Run:

~$ python run.py