  • Released: Mar 28, 2011
    Last Update: Jun 26, 2011
  • Language: Python

Python Web Crawler
Developed by Juan Manuel García, Released Mar 28, 2011

A Python Reimplementation of PHP Web Crawler. Cleaner code, more efficient and faster.

Tags: crawler, links, python, searcher

Python Web Crawler is a program that searches for links on the web and saves them in a MySQL database.

Features:

  • Multi-process crawling to improve speed
  • MySQL database to store the links
  • Easy to extend
  • Clean, readable, Pythonic code
  • URL validation via regular expressions
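
As a rough illustration of URL validation with a regular expression, here is a minimal sketch. The pattern and the name is_valid_url are assumptions made for this example, not the expression the project actually uses, and the code is written for Python 3. It accepts only complete URLs that start with http:// or https://.

```python
import re

# Hypothetical pattern for illustration; the project's own expression may differ.
URL_PATTERN = re.compile(
    r"^https?://"                          # scheme is required
    r"(?:[A-Za-z0-9-]+\.)*[A-Za-z0-9-]+"   # host name made of dot-separated labels
    r"(?::\d{1,5})?"                       # optional port
    r"(?:/\S*)?$"                          # optional path, query and fragment
)


def is_valid_url(url):
    """Return True if url looks like a complete http(s) URL."""
    return URL_PATTERN.match(url) is not None


if __name__ == "__main__":
    for candidate in ("http://www.python.org", "www.python.org"):
        print(candidate, is_valid_url(candidate))
```

A pattern this simple is deliberately strict about the scheme and lenient about everything after the host, which is usually enough to filter out relative links and mailto: or javascript: targets before they are stored.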

Here's more information about it: http://codescience.wordpress.com/2011/03/27/python-web-crawler/

Here's the original PHP web crawler this is based on.

Getting Started

Tested on Ubuntu 10.10.

Pre-requisites:

apt-get install python-mysqldb

Usage:

To configure the crawler, edit the config.ini file. For example:

========================================================================
[connection]
host = localhost
user = root
pass = root
db = testDB

[params]
start_urls = http://www.google.com,http://codescience.wordpress.com/,http://www.python.org
max_depth = 1
log = 1
========================================================================
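
A file in this format can be read with Python's standard configparser module. The sketch below is an assumption about how such a file might be loaded, not the crawler's own loading code; the function name load_settings and the shape of the returned dictionary are made up for the example, and it targets Python 3.

```python
import configparser

# Same content as the sample config.ini above, inlined so the sketch is self-contained.
SAMPLE_CONFIG = """\
[connection]
host = localhost
user = root
pass = root
db = testDB

[params]
start_urls = http://www.google.com,http://codescience.wordpress.com/,http://www.python.org
max_depth = 1
log = 1
"""


def load_settings(text):
    """Parse the config text into plain Python values (hypothetical helper)."""
    # interpolation=None keeps '%' characters in URLs from being treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    params = parser["params"]
    return {
        "connection": dict(parser["connection"]),
        # Split the comma-separated list and drop stray whitespace.
        "start_urls": [u.strip() for u in params["start_urls"].split(",") if u.strip()],
        "max_depth": params.getint("max_depth"),
        "log": params.getboolean("log"),
    }


if __name__ == "__main__":
    settings = load_settings(SAMPLE_CONFIG)
    print(settings["start_urls"])
    print(settings["max_depth"], settings["log"])
```

To read the real file instead of the inlined sample, parser.read("config.ini") can replace the read_string call.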

The connection section holds the usual connection settings for a MySQL database.

The params section contains:

  • START_URLS: A comma-separated list of URLs at which to start the crawl. Each entry must be a complete URL, including the http:// or https:// scheme.

  • MAX_DEPTH: The depth to crawl. 0 crawls only the start URLs. 1 crawls the start URLs and every URL found in them. 2 also crawls every URL found in those pages, and so on. Warning: a depth of 3 or greater can take hours, days, months, or years!

  • LOG: Indicates whether the application prints the crawled URLs to the console.
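
To make the MAX_DEPTH semantics concrete, here is a small sketch of a depth-limited, breadth-first crawl. It is one reading of the description above, not the project's implementation: the function crawl, its return values, and the fetch_links callback are hypothetical, and an in-memory link table stands in for real HTTP requests and the MySQL store.

```python
from collections import deque


def crawl(start_urls, max_depth, fetch_links):
    """Crawl breadth-first down to max_depth.

    fetch_links(url) is a hypothetical callback returning the links on a page.
    Returns (depth_of, found): the depth at which each page was crawled,
    and the set of every link seen on a crawled page.
    """
    depth_of = {}
    found = set()
    queue = deque((url, 0) for url in start_urls)
    while queue:
        url, depth = queue.popleft()
        if url in depth_of:
            continue  # already crawled at the same or a shallower depth
        depth_of[url] = depth
        links = fetch_links(url)
        found.update(links)
        if depth < max_depth:
            # Only pages above the depth limit have their links followed.
            queue.extend((link, depth + 1) for link in links)
    return depth_of, found


if __name__ == "__main__":
    # Made-up link table used in place of real pages.
    pages = {
        "http://a.example/": ["http://b.example/", "http://c.example/"],
        "http://b.example/": ["http://d.example/"],
        "http://c.example/": ["http://a.example/"],
    }
    crawled, links = crawl(["http://a.example/"], 1, lambda u: pages.get(u, []))
    print(sorted(crawled))
    print(sorted(links))
```

Because every extra level multiplies the number of pages by the average number of links per page, the work grows roughly exponentially with depth, which is why depths of 3 or more can run for a very long time.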


Run:

~$ python run.py
