For customized crawling and scraping services, check out Crawley Cloud.
PHP Web Crawler is a program that searches the web for links. It stores each link, along with some extra data, in a database and displays the results as HTML.
- The crawler can be run as multiple instances.
- It can be run by a cron job.
- Crawl results are saved in a MySQL database. The crawler creates a table named "urls" to store them.
- For each link, it saves the source URL, the destination URL, and the anchor text.
- It validates URLs with a regular expression and skips links to static content within the site, including unnecessary media files. I can't guarantee that it skips every media file, though; that would be more complex to validate.
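
The validation and storage steps listed above can be sketched roughly as follows. This is only an illustration written in Python for brevity, not the crawler's actual PHP code. The regular expression, the extension list, the column names (`source_url`, `destination_url`, `anchor_text`), and the helper functions are all assumptions, and an in-memory SQLite database stands in for MySQL so the example runs on its own.

```python
import re
import sqlite3
from urllib.parse import urlparse

# Assumed pattern: an absolute http(s) URL with a host and no whitespace.
URL_PATTERN = re.compile(r"^https?://[^\s/?#]+[^\s]*$", re.IGNORECASE)

# Assumed (and deliberately incomplete) list of static/media extensions to skip.
MEDIA_EXTENSIONS = (
    ".jpg", ".jpeg", ".png", ".gif", ".css", ".js",
    ".pdf", ".zip", ".mp3", ".mp4",
)


def is_crawlable(url):
    """Return True if the URL looks valid and does not point to a media file."""
    if not URL_PATTERN.match(url):
        return False
    # Check only the path, so query strings such as ?v=2 do not hide the extension.
    path = urlparse(url).path.lower()
    return not path.endswith(MEDIA_EXTENSIONS)


def open_store():
    """Create an in-memory SQLite stand-in for the MySQL "urls" table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE urls ("
        " id INTEGER PRIMARY KEY,"
        " source_url TEXT NOT NULL,"
        " destination_url TEXT NOT NULL,"
        " anchor_text TEXT)"
    )
    return conn


def save_link(conn, source_url, destination_url, anchor_text):
    """Store one link if its destination passes validation; report whether it was stored."""
    if not is_crawlable(destination_url):
        return False
    conn.execute(
        "INSERT INTO urls (source_url, destination_url, anchor_text)"
        " VALUES (?, ?, ?)",
        (source_url, destination_url, anchor_text),
    )
    return True


if __name__ == "__main__":
    conn = open_store()
    save_link(conn, "http://example.com/", "http://example.com/about", "About")
    save_link(conn, "http://example.com/", "http://example.com/logo.png", "Logo")
    for row in conn.execute("SELECT source_url, destination_url, anchor_text FROM urls"):
        print(row)
```

An extension check like this catches only media files with a recognizable suffix. Files served from extensionless URLs would need something like a `Content-Type` check, which is the harder validation mentioned above.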
There's also a Python Web Crawler available.