Personal License


  • Perpetual license (does not expire)
  • 1 site, unlimited servers
  • No distribution (hosted use only)
  • Commercial use allowed
Price: $9.99

Developer License


  • Perpetual license (does not expire)
  • Unlimited projects
  • Can distribute code and binary products
  • Commercial use allowed
Price: $49.99

14 Day money-back guarantee

Full refund within 14 days of purchase date.


  • Released: Feb 16, 2011
    Last Update: Feb 14, 2011
  • Language: PHP

PHP Web Crawler

Developed by Juan Manuel García, Released Feb 16, 2011

A simple, fast crawler that collects URLs from HTML pages.

Tags: contest2011 , crawler , linker , php

For customized crawling and scraping services check out Crawley Cloud

PHP Web Crawler is a program that searches the web for links. It stores the links and some extra data in a database and displays them as HTML output.
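As a rough illustration of the "search for links, show them as HTML" idea, here is a minimal sketch. It is not the crawler's actual code: the function names `extractLinks` and `renderLinks` and the row keys are hypothetical, and it assumes PHP's bundled DOM extension is available.

```php
<?php
// Hypothetical helper: collect every <a href="..."> found in an HTML string.
// Each row holds the page the link was found on, its target, and its anchor text.
function extractLinks(string $html, string $sourceUrl): array
{
    $doc = new DOMDocument();
    // Real-world HTML is often malformed, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href === '') {
            continue; // an anchor without a target is not a link
        }
        $rows[] = [
            'source'      => $sourceUrl,
            'destination' => $href,
            'anchor'      => trim($anchor->textContent),
        ];
    }
    return $rows;
}

// Hypothetical helper: render the collected rows as a simple HTML list.
function renderLinks(array $rows): string
{
    $items = '';
    foreach ($rows as $row) {
        $items .= sprintf(
            "<li><a href=\"%s\">%s</a></li>\n",
            htmlspecialchars($row['destination'], ENT_QUOTES),
            htmlspecialchars($row['anchor'], ENT_QUOTES)
        );
    }
    return "<ul>\n" . $items . "</ul>\n";
}

$html = '<p><a href="http://example.com/a">First</a> <a>no href</a> <a href="/b">Second</a></p>';
$rows = extractLinks($html, 'http://example.com/');
echo renderLinks($rows);
```

A real crawler would also fetch each page and resolve relative targets such as `/b` against the source URL. Both steps are left out here to keep the sketch short.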

Features:

  • The crawler can be run as multiple instances.
  • It can be run by a cron job.
  • Crawl results are saved in a MySQL database. It creates the table "urls" to store them.
  • For each URL it saves the source URL, the destination URL, and the anchor text.
  • It validates URLs with a regular expression.
  • It skips links to static data within the site, including unnecessary media files. I can't guarantee that the crawler avoids every media file, though; that would be more complex to validate.
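The validation and media-skipping features could be sketched roughly as follows. The pattern and the extension list are assumptions made for illustration, not the regular expression the crawler actually ships with, and `isCrawlable` is a hypothetical name.

```php
<?php
// Assumed pattern: an absolute http(s) URL with a host part.
const URL_PATTERN = '~^https?://[^\s/?#]+(?:[/?#]\S*)?$~i';

// Assumed list of static and media file extensions to skip.
const SKIPPED_EXTENSIONS = [
    'jpg', 'jpeg', 'png', 'gif', 'ico', 'css', 'js',
    'pdf', 'zip', 'mp3', 'mp4', 'avi', 'swf',
];

// Hypothetical helper: true if the URL looks valid and is not a static file.
function isCrawlable(string $url): bool
{
    if (preg_match(URL_PATTERN, $url) !== 1) {
        return false;
    }
    // Inspect only the path so a query string cannot hide the extension.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return true; // no path at all, e.g. http://example.com
    }
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return !in_array($extension, SKIPPED_EXTENSIONS, true);
}

var_dump(isCrawlable('http://example.com/index.php?id=1'));
var_dump(isCrawlable('http://example.com/logo.PNG?v=2'));
var_dump(isCrawlable('mailto:me@example.com'));
```

An extension check like this cannot catch media served from URLs without a file extension, which matches the caveat above that not every media file can be avoided.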

Here's a tutorial about PHP Web Crawler

There's also a Python Web Crawler available.

PHP Web Crawler

See the following post to get started with the PHP Web Crawler:

http://codescience.wordpress.com/2011/02/15/php-web-crawler/

Installation / UnPacking

PHP Web Crawler can run in any directory. However, if you want to use the Web UI, place it in a directory that is served by the Apache web server.

Dependencies:

  • Apache2
  • php5
  • php5-mysql 

Warning: Make sure that Apache has permission to write to the config.ini file. If it does not, on a Unix-like system you can run ~$ chmod 777 config.ini, which grants read, write, and execute permissions to every user.
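Before resorting to mode 777, it may be worth checking whether the file is already writable. The sketch below is one possible check, not part of the crawler: `ensureWritable` is a hypothetical helper, and the narrower mode 0664 only helps if the web server user owns config.ini or belongs to its group.

```php
<?php
// Hypothetical helper: report whether a config file is writable by the
// current process, trying a narrower permission set than 777 if it is not.
function ensureWritable(string $path): bool
{
    if (!file_exists($path)) {
        return false;
    }
    if (is_writable($path)) {
        return true;
    }
    // 0664 = owner and group may read and write, others may only read.
    // chmod() normally succeeds only if the current user owns the file.
    $changed = @chmod($path, 0664);
    clearstatcache(true, $path); // make sure is_writable() sees the new mode
    return $changed && is_writable($path);
}

// Assumes config.ini sits next to this script.
$path = __DIR__ . '/config.ini';
echo ensureWritable($path)
    ? "config.ini is writable\n"
    : "config.ini is missing or not writable by this process\n";
```

Run this through the web server rather than from the command line, so that the check applies to the Apache user.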

Related Projects

I recently started a huge project! It's a crawling / scraping framework written in Python.

It's fully open source and was released under the GPL v3 license.

The repository is on GitHub.

There's also a project website.

Check it out for free!

User Reviews

No reviews have been submitted yet.

Questions & Comments


  • Boris Javier Barrera 5 months ago
    Hi JM, do you think it is possible to make it work for document management software? Note that it would search for file names and PDF and image files, and of course the metadata would be stored in a database...
  • Adil Sheikh 1 year ago
    I really like what you have done. I am trying to build a user-based search engine where users can search the web (e.g. Google) with a keyword for a web site and then add this web site to their directory. Do you think this program can help with this?
    • Juan Manuel García Developer 1 year ago
      Sounds like an interesting challenge. Absolutely, you need a web crawler. I also have experience working with several types of crawlers, so you can ask me whenever you like.
      As a side note, if you like Python, I'm working on a crawling framework and it's fully open source. Here's the link: https://github.com/jmg/crawley. Hope it will be useful!
    • Adil Sheikh 1 year ago
      It's good to know about forward-thinking developers like yourself pushing new boundaries in web development. I will be happy to pay you for your services when I get to the crawler stage of development. How can I reach you?