Python Web Crawler

Python Web Crawler

Released 4 years ago , Last update 4 years ago

A Python Reimplementation of PHP Web Crawler. Cleaner code, more efficient and faster.

For customized crawling and scraping services check out Crawley Cloud

Python Web Crawler is a program that searches for links on the web and save them in a MySql data base.

Features:

  • Multi-processed crawling to improve speed
  • MySql database to save the links
  • Easy to extend
  • Clean and readable Pythonic code
  • Url validator via regular expressions

Here's more information about it: http://codescience.wordpress.com/2011/03/27/python-web-crawler/

Here's the original PHP web crawler this is based on.

5.0
  • 5 1
  • 4 0
  • 3 0
  • 2 0
  • 1 0
1 Reviews Read Reviews

Pricing

$9.99

Personal License

  • Perpetual license

  • 1 site, unlimited servers

  • No distribution (hosted use only)

  • Commercial use

Getting Started

Tested on ubuntu 10.10

Pre-requisites:

apt-get install python-MySQLdb 

Usage:

To configure the crawler do edit the config.ini file. I.E:

========================================================================
[connection]
host = localhost
user = root
pass = root
db = testDB

[params]
start_urls = http://www.google.com,http://codescience.wordpress.com/,http://www.python.org
max_depth = 1
log = 1
========================================================================

The connection section indicates the common connection configuration to a Mysql DB.

The params section contain:

  • START_URLS: A list of urls (must be the complete url!. Don't forget to indicate http:// or https:// whichever is applicable) to start the crawl. The list must be separated by commas.

  • MAXDEPTH: The depth to crawl. 0 only crawls the start urls. 1 crawls the starturls and all the urls inside the given urls. 2 All the urls inside the urls given by previous and so on… Warning: A factor of 3 or greater can take for hours, days, month or years!

  • LOG: Indicates if the application shows the crawled urls in the console.


Run:

~$ python run.py
2 licenses, starting from From » $9.99 View Licenses

Get A Quote

What do you need?
  • Custom development
  • Integration
  • Customization / Reskinning
  • Consultation
When do you need it?
  • Soon
  • Next week
  • Next month
  • Anytime

Thanks for getting in touch!

Your quote details have been received and we'll get back to you soon.


Or enter your name and Email
  • AS Adil Sheikh 2 years ago
    Looks like a great piece of software - do you have a demo that I can view. Thanks !