A web crawler is a program that automatically traverses the Web by following links. It retrieves a document, extracts the links that document contains, and then retrieves those documents in turn, recursively fetching every further document that is referenced. Web crawlers are sometimes called web wanderers, web robots or web spiders. These names give the impression that the software itself moves between sites, but this is not the case. A crawler simply visits sites by requesting documents from them, and then automatically requests the documents that the links on those pages point to.
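The fetch, extract and follow loop described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular search engine works: the names `LinkExtractor`, `extract_links`, `crawl` and the `PAGES` dictionary are made up for this example, and the "web" is an in-memory dictionary so the sketch runs without network access. A real crawler would pass in a `fetch` function that performs HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for the links found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    # Resolve relative links and drop #fragments so each page is counted once.
    return [urldefrag(urljoin(base_url, link)).url for link in parser.links]


def crawl(start_url, fetch, max_pages=100):
    """Visit pages breadth-first; fetch(url) returns HTML text or None."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # document could not be retrieved
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A tiny in-memory "web" (hypothetical pages) so the sketch runs offline.
PAGES = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a> <a href="/c.html">C</a>',
    "http://example.com/b.html": "no links here",
    "http://example.com/c.html": '<a href="/a.html#top">back</a>',
}
```

Calling `crawl("http://example.com/", PAGES.get)` visits the four sample pages once each, even though they link back to one another, because the `seen` set keeps the crawler from requesting the same URL twice.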
Unlike directories, where you submit a single URL, a crawler is likely to list several (if not many) of your pages. Crawler-based search engines (such as www.Google.com) automatically visit web pages to compile their listings. This means that by taking care in how you build your pages, you can rank well in crawler-produced results.
Web crawlers start from a list of URLs, such as server lists. Most indexing services also allow you to submit URLs manually, which will then be visited by the web crawlers. Crawlers can select which URLs to visit and index, and which to use as a source of new URLs. To register your site, look for a link to the URL submission form on the crawler’s search page.
You can also check whether a crawler has visited your site by looking at your server logs. If your server (or web hosting company) supports user-agent logging, you can check for retrievals by looking at the user-agent header values. If you notice a client repeatedly requesting the file “/robots.txt”, chances are it is a robot. You will notice lots of entries for “robots.txt” in your log files. This is because robots request that file to see whether you have specified any rules for them using the Standard for Robot Exclusion.
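As a rough illustration of the log check described above, the following Python sketch counts requests for /robots.txt per client and user-agent. It assumes your server writes the common "combined" access log format; other formats need a different pattern. The function name `robots_txt_requesters` and the sample log lines are invented for this example.

```python
import re
from collections import Counter

# Assumes the "combined" access log format; adjust the pattern for other formats.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def robots_txt_requesters(lines):
    """Count /robots.txt requests per (host, user-agent) pair."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path") == "/robots.txt":
            counts[(match.group("host"), match.group("agent"))] += 1
    return counts


# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [10/Oct/2023:13:55:37 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [10/Oct/2023:14:02:11 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [11/Oct/2023:09:12:01 +0000] "GET /robots.txt HTTP/1.1" 304 0 "-" "ExampleBot/1.0"',
]
```

A client that shows up repeatedly in the result of `robots_txt_requesters(SAMPLE_LOG)` is a likely robot. Keep in mind that the user-agent value is supplied by the client, so it is a hint rather than proof.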
You can also block crawlers from going through all of your files using HTML META tags. But again, doing so may sometimes hold you back from ranking higher in the search engines than other web sites.
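The META tag in question is the robots tag, for example `<meta name="robots" content="noindex, nofollow">`, placed in the head of a page. To show how a well-behaved crawler might read it, here is a small Python sketch. The names `RobotsMetaParser` and `robots_directives` and the sample `PAGE` are hypothetical, and real crawlers differ in which directives they honor.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            # The content attribute is a comma-separated list of directives.
            for directive in attributes.get("content", "").split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def robots_directives(html):
    """Return the set of robots directives declared in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Sample page (made up) that asks robots not to index it or follow its links.
PAGE = """<html><head>
<meta name="robots" content="noindex, nofollow">
<title>Private page</title>
</head><body>...</body></html>"""
```

A crawler using this sketch would skip indexing a page when `"noindex" in robots_directives(PAGE)` is true, and would not queue its links when `"nofollow"` is present.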
Good Luck