
How Web Crawlers Work
by Eran Aharonovich

Many applications, mostly search engines, crawl websites every day in order to find up-to-date data.
Most web crawlers save a copy of each visited page so they can index it later. The rest scan pages for one narrow purpose only, such as harvesting email addresses (for spam).

How does it work?

A crawler needs a starting point, which is a web address (a URL).

In order to browse the internet we use the HTTP network protocol, which allows us to talk to web servers and download data from them or upload data to them.
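As a minimal sketch of that download step, here is one way to fetch a page with Python's standard library. The function name `fetch`, the `User-Agent` string and the 10 second timeout are assumptions made for illustration, not part of any particular crawler.

```python
from urllib.request import Request, urlopen


def fetch(url, timeout=10.0):
    """Download the resource at `url` and return its body as text."""
    # Identify ourselves to the server; the agent name is a made-up example.
    request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, or fall back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch("http://www.example.com/")` would return the HTML of that page as a string. A real crawler would also want to handle errors such as timeouts and missing pages.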

The crawler fetches this URL and then looks for hyperlinks (the A tag in HTML).
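Finding the hyperlinks could look roughly like the sketch below, which uses Python's built-in `html.parser` module. The names `LinkExtractor` and `extract_links` are hypothetical, and resolving relative links against the page address is a design choice of this example.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every A tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the "#fragment" part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

For a page at `http://example.com/index.html` that contains `<a href="/about">`, the function would return `http://example.com/about` among its results.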

Then the crawler follows those links and processes each linked page the same way.
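The whole loop can be sketched as a breadth-first walk over the links. In this example `get_links` is a hypothetical callable that returns the links found on a page (in a real crawler it would download and parse the page), and `max_pages` is an assumed safety limit so the crawl always stops.

```python
from collections import deque


def crawl(start_url, get_links, max_pages=100):
    """Visit pages breadth-first from start_url and return them in visit order."""
    seen = {start_url}          # URLs already queued, so no page is visited twice
    queue = deque([start_url])  # URLs waiting to be visited
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A tiny in-memory "web" stands in for real pages in this example.
SITE = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}
order = crawl("a", lambda url: SITE.get(url, []))
```

The `seen` set matters because pages often link back to each other. Without it the crawler would loop forever between page "a" and page "b".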

That is the basic idea. How we proceed from here depends entirely on the purpose of the software itself.

If we only want to grab email addresses, we search the text of each web page (including hyperlinks) for them. This is the easiest type of crawler to develop.
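To show how little such a tool needs, here is a minimal sketch based on a regular expression. The pattern is a deliberately simple assumption that catches common addresses and is not a full validator, and the name `find_emails` is hypothetical.

```python
import re

# A simplified pattern: local part, "@", domain, and a top-level domain.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(page_text):
    """Return the distinct email addresses in page_text, in order of appearance."""
    found = []
    for address in EMAIL_RE.findall(page_text):
        if address not in found:
            found.append(address)
    return found
```

Because the search runs over the raw page text, it also picks up addresses inside hyperlinks such as `mailto:` links.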

Search engines are much more difficult to develop.

When building a search engine we need to take care of a few other things.

1. Size - Some web sites are very large and contain many directories and files. Harvesting all of the data may take a lot of time.

2. Change Frequency - A web site may change very often, even a few times a day. Pages can be added and deleted every day. We need to decide when to revisit each site and each page within a site.

3. How do we process the HTML output? If we build a search engine we want to understand the structure of the text rather than treat it as plain text. We must tell the difference between a caption and a simple sentence. We must look for bold or italic text, font colors, font sizes, paragraphs and tables. This means we must know HTML very well and parse it first. A tool that helps with this task is an HTML-to-XML converter. One can be found on my website. You can find it in the resource box or look for it on the Noviway website: www.Noviway.com.
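To illustrate the third point, the sketch below parses HTML and attaches a weight to each piece of text depending on the tags around it. The weights in `TAG_WEIGHTS` are invented for this example and the names are hypothetical. A real search engine would tune such values and look at many more signals.

```python
from html.parser import HTMLParser

# Assumed importance of text inside each tag; plain text gets weight 1.
TAG_WEIGHTS = {"title": 5, "h1": 4, "h2": 3, "h3": 2, "b": 2, "strong": 2, "i": 2, "em": 2}


class WeightedTextParser(HTMLParser):
    """Collects (text, weight) pairs instead of flat plain text."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # weighted tags we are currently inside
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the most recently opened occurrence of this tag.
            index = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[index]

    def handle_data(self, data):
        text = data.strip()
        if text:
            weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=1)
            self.fragments.append((text, weight))


def weighted_text(html):
    parser = WeightedTextParser()
    parser.feed(html)
    return parser.fragments
```

With this, the word "links" in `<p>They follow <b>links</b> daily.</p>` is recorded with a higher weight than the plain text around it, so an indexer can rank a page higher when the search term appears in a heading or in bold.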

That's it for now. I hope you learned something.

Eran Aharonovich is a software programmer who has written articles on various topics, including Computers and The Internet.